by sgk284 2376 days ago
At my previous startup we put everything we could into the database (work queues and pub/sub included). It was magical.

It minimized complexity (moving parts, boundaries, etc...), everything was transactional, and a single dump of the database gave you a copy of the entire persistent state of your system at that moment (across all services).

We had several services running, but any notion of persistent state was stored in a single database. We didn't have particularly high demands, but processed ~140M jobs per month in addition to our normal query load.

Postgres handled this like a champ, had far lower p50 and p95 latency than SNS/SQS, and rarely went above single-digit percentage CPU usage on a fairly cheap DigitalOcean box.

I've worked at a lot of big tech companies (Google, Microsoft, Twitter, Salesforce) in many varieties of giant distributed systems and the most valuable lesson I've learned over and over again is: Distributed systems are hard, avoid them until you can't.

That's the best advice I can give to any startup. Until you're regularly sustaining thousands of TPS or have grown to dozens of TB of data, it is generally a distraction to even think about anything other than a single relational database.

10 comments

That's great to read, and a perfect example of solving real problems instead of wasting time on exotic infrastructure.

The vast majority of companies and their scale will never go above a single midsize database server. All of the transactional, backup, and querying functionalities you list make it much more productive than using fancy AWS/cloud services.

We did a very similar thing where I used to work. Didn't want to bother with a separate task queue and wanted a consistent SQL interface for everything.

We wrote a very simple task queue (<500 lines of python) using a postgres table and inbuilt pub/sub. Ran millions of jobs, scheduled and unscheduled, without a hitch.
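A table-backed queue along these lines can be sketched roughly as follows. This is an assumption-laden illustration, not the commenter's code: the `jobs` table and the function names are hypothetical, and SQLite from the standard library stands in for Postgres so the sketch is runnable. On Postgres, workers would typically claim rows with `SELECT ... FOR UPDATE SKIP LOCKED` and be woken with `LISTEN`/`NOTIFY` instead of polling.

```python
import sqlite3

def make_queue(conn):
    conn.execute(
        "CREATE TABLE IF NOT EXISTS jobs ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " payload TEXT NOT NULL,"
        " status TEXT NOT NULL DEFAULT 'pending')"
    )

def enqueue(conn, payload):
    with conn:  # commit on success, roll back on exception
        cur = conn.execute("INSERT INTO jobs (payload) VALUES (?)", (payload,))
        return cur.lastrowid

def claim_next(conn):
    """Mark the oldest pending job as running and return (id, payload), or None."""
    with conn:
        row = conn.execute(
            "SELECT id, payload FROM jobs WHERE status = 'pending' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        # The status guard turns the claim into a no-op if another worker got there first.
        cur = conn.execute(
            "UPDATE jobs SET status = 'running' WHERE id = ? AND status = 'pending'",
            (row[0],),
        )
        return row if cur.rowcount == 1 else None

def complete(conn, job_id):
    with conn:
        conn.execute("UPDATE jobs SET status = 'done' WHERE id = ?", (job_id,))
```

A worker loop would call `claim_next`, run the job, then call `complete`. Scheduling and retries are left out to keep the sketch short.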

> Distributed systems are hard, avoid them until you can't.

Thank you. May I borrow this?

Yes, but why architect yourself into a more painful corner when you reach "can't"?
Please do!
This! So many wasted years of engineering making correct applications on top of Cassandra/Kafka/Elasticsearch when a single Postgres server could have handled everything no problem.

Scalability for a tech product should be a concern at some point, but if you don’t have product market fit it doesn’t matter. The complexity of building a distributed system right out of the gate will cripple rapid iteration or lead to a broken product.

> put everything we could into the database (work queues and pub/sub included). It was magical.

I've been building systems on some pretty similar ideas for the last 5+ years, and have a similarly optimistic outlook on it. Just to add another datapoint here, we've been able to scale them up to 20-40x higher than your figure (a couple thousand jobs/sec) on a single (albeit beefy) MySQL instance. It definitely takes a fair bit of care & we certainly have some time invested in it, but it can be done.

I guess my services haven't had (gotten to? :-) ) to share databases with other services for quite some time -- they did when I started my job! -- but even with that concession, the benefits of DB guarantees have been amazing. There's nothing like a transaction for preserving your sanity.

The best tactic I've learned from these systems is to think really hard about the batching & partitioning behavior to keep msgs/sec high enough & transactions/sec low enough to stay afloat.

To add to this, the numerous, impossible to solve issues that happen when your data starts hopping all over the network are proven mathematically:

https://groups.csail.mit.edu/tds/papers/Lynch/podc89.pdf

For some perhaps-off-topic contrarianism, what do you make of Uncle Bob railing against database-oriented architectures? [0] (He takes a good few minutes to get his point across.)

[0] https://youtu.be/o_TH-Y78tt4?t=2566

I can't watch that at work right now. But is it pretty much this? https://blog.cleancoder.com/uncle-bob/2012/05/15/NODB.html

I've heard of him before but never really read anything much by him. Is this guy for real? He worked at one startup, got angry about a database, became a consultant, and screams about everyone being wrong about everything?

Yes he makes similar points there. A few scattered critical thoughts:

* I don't see any problem with stored procedures. They can make good sense for, say, auditing, as well as for performance.

* People describe their software systems as "Using Oracle" because it matters, not because their design is stupid. It tells me that Oracle skills are relevant, for instance.

* NoSQL is generally, in my experience, awful and chaotic, rather than liberating. Lots of good work has gone into making serious grown-up relational databases. Schemas, normal forms, constraints, rigorous work on the ACID properties. Something like MongoDB is just sloppy amateur-hour by comparison.

* In the YouTube video, he says that solid-state drives render relational databases obsolete. This strikes me as absurd. Not even ultra-fast SSD storage technologies [0] will do that. The relational model is effective, and the associated DBMSs still well justified. Joins belong in a query language, not in imperative code. Why would you want to try managing a huge complex dataset manually?

[0] https://en.wikipedia.org/wiki/3D_XPoint

(Edit: formatting)

Yeah, I have changed my opinion of stored procs to:

If you have a proprietary database, stored procedures are awful. AHEM ORACLE.

Yeah, seems like he vastly overstates things. SSDs are great for relational databases. They all use them.
I think the point is that you might be able to build something else than a standard relational db and make it faster.

Something like VoltDB, but there are other solutions. There used to be some blogs about their architecture and how they could avoid a couple of the time-consuming steps a normal RDBMS needs.

https://www.voltdb.com/why-voltdb/

> I think the point is that you might be able to build something else than a standard relational db and make it faster.

I sincerely doubt that.

For it to be useful in real world situations, you'd have to build your own highly flexible, scalable, rock-solid, high-performance data-management solution, ideally one which enables the user to use a declarative means of expressing queries.

This is, of course, a DBMS.

If you build one which imposes structure on the data, you've got either a relational DBMS, or an object DBMS, or some other kind of well-studied database solution... except you've rolled your own from scratch, without input from database experts, so it's going to be a disaster.

Unless you're someone like Microsoft, Google, or Amazon, you pretty much can't build one of those. It costs tens of millions. It takes a huge amount of testing, for obvious reasons.

I really don't see any argument for not using a mature database system for managing a typical company's data.

Of course, if we aren't talking about a full-scale database that has to cope with a huge amount of messy mission-critical data in a changing business environment, then the game changes, and sure, you might have a chance just writing something yourself.

Netflix's video-streaming CDN ('OpenConnect'), for instance, obviously isn't powered by an RDBMS. They put in a huge amount of highly technical work building their own finely tuned technologies to pull data off the disk and get it to the NIC with minimal glue in between. But that's Netflix, and virtually no real companies face that kind of challenge.

Also, VoltDB looks like a streaming DB technology. Isn't that both a DBMS (rather than a means of rolling-your-own), and very niche? I don't see how it's relevant here. If it's really able to serve business needs better, then great, but it's still a complex-DBMS-as-a-product.

> became a consultant

To me, he sounds more like a snake oil salesman.

I think his take on databases is moronic with a dash of common sense:

He rails against a strawman where people put SQL everywhere, into views, into application logic ("mail merge" is one he mentions), and DBAs gatekeep everything.

The common sense part: no, your views shouldn't be composing SQL with string interpolation.

The moronic parts:

- He venerates application logic and regards the data model as a detail, but data models aren't details — they tend to outlive application logic.

- He holds up in-memory data structures as a platonic ideal, but doesn't address the things databases provide for you: schemas, constraints, transactional semantics and error recovery, a clear concurrency story.

Fascinating, was there ever more written about your tech stack?
Not OP, but we do something similar at our startup. PostgREST (http://postgrest.org) definitely helps if you want to go down this path as it exposes everything in the database (including functions etc.) via a RESTful API.

A bit outdated now, but some details here: https://paul.copplest.one/blog/nimbus-tech-2019-04.html#tech...

Of course, at the point where you do need to break out beyond a single DB, life is more painful because you've got dependencies on the database all over the place...
That's incorrect. It's trivial if you have a clearly defined API with which to extract or insert data.

It only starts to leak if everyone accesses the DB through raw SQL, mishmashing modules and creating insane dependencies.

Give each module ownership of a table, require that the table only be accessed through that module, and the backend storage system becomes irrelevant.

If you put everything in the database, and use it for all it's able to do, as the comment I replied to suggested, it is practically inevitable under startup feature pressure conditions that total modularity won't hold. And in fact I think it would be irresponsible to pursue total modularity; it would court failure.

I believe you're being deeply naive about modules owning single tables. You forego relational integrity and almost the whole power of relational algebra if you take that approach. That's heaps of functionality to leave behind when you and one or two other guys need to crank out a feature a day.

You obviously need to write queries which span tables (i.e. joins).

I can see how you misunderstood me there however, it was admittedly poorly worded.

What I was talking about was basic data hygiene, as it's often called. If you need specific data, you define a clear way to get this data and only access it through that API. This can be a class, a method, or anything your language of choice prefers.

If you skip that step and directly access the DB everywhere in your code, you'll create another unmaintainable dumpster fire as soon as your team goes beyond the initial programmers.
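As a rough illustration of that kind of data hygiene, here is a minimal sketch in which one hypothetical class, `OrderQueries`, owns all SQL for a made-up `users`/`orders` schema. Callers never see SQL, yet the class is still free to use joins internally. SQLite is used only so the example runs without a server.

```python
import sqlite3

class OrderQueries:
    """Hypothetical data-access API: the only place that knows the schema."""

    def __init__(self, conn):
        self._conn = conn
        conn.execute("CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY, email TEXT UNIQUE NOT NULL)")
        conn.execute(
            "CREATE TABLE IF NOT EXISTS orders ("
            " id INTEGER PRIMARY KEY,"
            " user_id INTEGER NOT NULL REFERENCES users(id),"
            " total_cents INTEGER NOT NULL)"
        )

    def add_user(self, email):
        with self._conn:
            return self._conn.execute("INSERT INTO users (email) VALUES (?)", (email,)).lastrowid

    def add_order(self, user_id, total_cents):
        with self._conn:
            return self._conn.execute(
                "INSERT INTO orders (user_id, total_cents) VALUES (?, ?)", (user_id, total_cents)
            ).lastrowid

    def totals_by_email(self):
        # A join is fine here: it lives behind the API rather than in calling code.
        return self._conn.execute(
            "SELECT u.email, SUM(o.total_cents) FROM users u"
            " JOIN orders o ON o.user_id = u.id"
            " GROUP BY u.email ORDER BY u.email"
        ).fetchall()
```

If the storage later moves elsewhere, only this class has to change, which is roughly the claim being made above.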

Having scaled up the tech in a company from $0 to $10+MM ARR, from zero customers to over 300 million records a day, I think you are simply incorrect and uninformed about the trade-offs that make sense in a startup.
So you are the only one with this knowledge, sensei?
You hit on the real reason architects push microservices in smaller companies: it limits options to engineers and to the startup by putting up walls in the form of network rules that are resistant to change.

A startup needs all the options on the table because the alternative is they go out of business and there is no startup. This idea that we can enforce “good architecture” (subject to interpretation) with technology by isolating junior engineers with networking rules needs to at least be more transparent about its motivations.

Services permit scaling development teams. In a startup where you're all in the same room, lots of separate services don't really make sense, and you probably don't know where to make the right cuts even if you tried. When you want to grow beyond the one big room, with a lot of teams, potentially in other time zones and countries, then you want to be able to carve off services so teams can own them.
The problem here is that an unmitigated monolith with no domain partitioning in its data model can't be transformed into a service architecture by carving off pieces.

For pieces to be able to be carved off, they already have to be autonomous.

What usually happens is that devs without experience in service architectures presume that service architectures are probably just the things they are used to seeing, i.e. monolithic entities.

It's usually then that we hear things like "extract the product service". The problem is that product is an entity, and entities are the last thing to build services around. That's how we end up with distributed monoliths, and subsequently failed service architecture projects.

In order for an app to be able to transition to services, the app has to be designed this way from the beginning.

And yes, it's definitely possible to know what the model partitions should be up front. They're very natural divisions. But they can't be arrived at by looking through an entity-centric lens. And unfortunately, forms-over-data apps very rarely provide us with an opportunity to learn about the "other" way to do it.

Message DB happens to be implemented using a RDBMS, but the streams in it end up being very clear partition points. Some thought would be required to move data to a different database, but an event-sourced model isn't the same as coupling through a traditional RDBMS schema.

Edit: Fixed a typo

Running a PG instance on DO for a startup is a really bad idea. I don't understand how anyone can choose that over a managed solution.

If you're a startup, just use SNS / Kinesis / Google Pub/Sub, etc.

Or managed Postgres on AWS Aurora, AWS RDS, Google Cloud SQL, Heroku, etc :)

SNS, Kinesis (Kafka), Google Pub/Sub are awesome. Not the same problem/solution fit as an event store, but awesome for the scenarios and architectures they're targeted at.

Is there a difference from a programming perspective between, say Kafka, and the event store linked in this post?

I understand that Kafka can scale horizontally and can handle crazy throughput, but I mean from a programming point of view, the idea of a unified log as a data model applies to both, correct?

There are some similarities, and there are definitely worse technologies you could choose for a message store than Kafka. It's worth calling out the difference between event-sourced and event-based. The former is necessarily the latter, but that doesn't go in the opposite direction.

Event-based just means that communication happens over events. Event-sourced means that the authoritative state of the system is sourced from events. If the events are literally the state, then how those are retrieved begins to matter.
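A small sketch may help with the event-sourced half of that distinction: the current state is never stored directly, it is recomputed by folding over the events. The event types and the account-balance domain below are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from functools import reduce

@dataclass(frozen=True)
class Deposited:
    amount: int

@dataclass(frozen=True)
class Withdrawn:
    amount: int

def apply(balance, event):
    """Return the new balance after applying one event."""
    if isinstance(event, Deposited):
        return balance + event.amount
    if isinstance(event, Withdrawn):
        return balance - event.amount
    raise TypeError("unknown event: %r" % (event,))

def current_balance(events):
    # The events are the authoritative state; the balance is derived from them.
    return reduce(apply, events, 0)
```

Because the state is a function of the stream, reading one entity means reading that entity's events in order, which is why retrieval patterns start to matter.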

Kafka breaks down as a message store in 2 key ways that I mentioned elsewhere in all these threads.

> The first is that one generally has a separate stream for each entity in an event-sourced system. Streams are sort of like topics in Kafka, but it would be quite challenging to, say, make a topic per user in Kafka. The second is Kafka's lack of optimistic concurrency support (see https://issues.apache.org/jira/browse/KAFKA-2260). The decision to not support expected offsets makes perfect sense for what Kafka is, but it does make it unsuitable for event sourcing.

If my only tool were Kafka, then I wouldn't be able to use the messages in the same way that I can with something like Message DB. And that's okay, different tools for different jobs.

Can you elaborate on why it is a really bad idea?
Because running your own DB on a VPS means:

- Dealing with security upgrades for PG and Linux

- Dealing with backups

- Dealing with HA, so setting up slaves / replicas. Then what do you do when something goes wrong? Do you manually SSH in and do some magic?

- Network security / usernames / passwords

- etc.

Clearly these are things a startup doesn't want to do and usually lacks the expertise for.