by danpalmer 1815 days ago
I think it's important to know about the different data storage options and their trade-offs. Managing state is one of the hardest parts of backend development, particularly at scale, so it pays to understand the trade-offs between databases, caches, blob storage, and queues, and when each one is useful.

I'd pay close attention to speed and "correctness". What's the consistency model of a system? Can we lose data and if so how? What's the throughput? Latency?

These help you choose good technology for backend systems, and help answer questions like:

- Can we do this in-band while serving a user request?

- Can we do it 100 times to serve a request?

- If it completes successfully, can we trust it, or do we still need to handle failure?

- Can we trust it immediately or eventually?
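To make the first two questions concrete, here is a back-of-envelope sketch. The latency figures and the 200 ms budget are hypothetical placeholders I made up for illustration, not measurements of any real system, and the function names are my own.

```python
# Back-of-envelope check: how many sequential calls fit in a request's latency budget?
# All numbers below are hypothetical placeholders, not benchmarks.

def calls_within_budget(per_call_us: int, budget_us: int) -> int:
    """Number of sequential calls that fit inside the budget (microseconds)."""
    if per_call_us <= 0:
        raise ValueError("per_call_us must be positive")
    return budget_us // per_call_us


def fits_n_times(per_call_us: int, budget_us: int, n: int) -> bool:
    """True if n sequential calls fit inside the budget."""
    return calls_within_budget(per_call_us, budget_us) >= n


BUDGET_US = 200_000  # assumed 200 ms budget for one user request

ASSUMED_LATENCY_US = {  # assumed per-call latencies, for illustration only
    "networked cache": 500,
    "database query": 5_000,
    "blob store fetch": 50_000,
}

for name, latency_us in ASSUMED_LATENCY_US.items():
    once = fits_n_times(latency_us, BUDGET_US, 1)
    hundred = fits_n_times(latency_us, BUDGET_US, 100)
    print(f"{name}: once in-band={once}, 100 times in-band={hundred}")
```

Plug in your own measured latencies. The point is only that "once per request" and "100 times per request" can give different answers for the same store.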

There are lots of technologies and terms for all of this but I've specifically avoided them because the important bit is the mental model of how these things fit together and the things they allow/prevent.


Absolutely this.

Also, I’ve had to explain this to so many other engineers, both junior and senior to me: most data is inherently relational. This next statement is a bit opinionated: 9 times out of 10 you probably want an RDBMS. I’ve seen so many attempts to shoehorn some Elasticsearch/Mongo/Neo4j/whatever database into a design because the developer wanted to work on CoolDatabaseTech. Then you’re stuck doing joins in CoolDatabase that it wasn’t really designed for, and frustrated by CoolDatabase’s lack of drivers for language X. Later on you’re dealing with stability and scalability issues you would never see with BattleTestedRDBMS.

The amount of work a well-designed Postgres instance can handle is insane. I’ve seen a single vertically scaled Postgres instance compete with 100+ node Spark clusters on computations.

Exactly, but it goes a lot further than RDBMSs. For example, does the application expect that every item on a queue will be processed? If so, you need a durable queue and Redis probably isn't a good idea. Durability will likely reduce the queue's throughput, which might change how the queue needs to be used.
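As a rough illustration of what "durable" buys you, here is a minimal sketch of an at-least-once queue where an item is only deleted after an explicit ack. It uses SQLite from the Python standard library purely as a stand-in so it runs anywhere. The `DurableQueue` class, its table layout, and the job name are all hypothetical, and it assumes a single worker.

```python
import os
import sqlite3
import tempfile


class DurableQueue:
    """Hypothetical at-least-once queue: items survive until explicitly acked."""

    def __init__(self, path: str) -> None:
        self.db = sqlite3.connect(path)  # file-backed, so items survive a restart
        with self.db:
            self.db.execute(
                "CREATE TABLE IF NOT EXISTS jobs ("
                "id INTEGER PRIMARY KEY AUTOINCREMENT, "
                "body TEXT NOT NULL, "
                "claimed INTEGER NOT NULL DEFAULT 0)"
            )

    def put(self, body: str) -> None:
        with self.db:  # committed before returning: the durability/throughput trade
            self.db.execute("INSERT INTO jobs (body) VALUES (?)", (body,))

    def claim(self):
        """Return (id, body) of the oldest unclaimed job, or None. Single worker only."""
        with self.db:
            row = self.db.execute(
                "SELECT id, body FROM jobs WHERE claimed = 0 ORDER BY id LIMIT 1"
            ).fetchone()
            if row is not None:
                self.db.execute("UPDATE jobs SET claimed = 1 WHERE id = ?", (row[0],))
            return row

    def ack(self, job_id: int) -> None:
        with self.db:  # only now is the job really gone
            self.db.execute("DELETE FROM jobs WHERE id = ?", (job_id,))

    def recover(self) -> int:
        """Release jobs claimed by a worker that died before acking."""
        with self.db:
            return self.db.execute(
                "UPDATE jobs SET claimed = 0 WHERE claimed = 1"
            ).rowcount


# Simulate a worker that crashes between claim and ack.
path = os.path.join(tempfile.mkdtemp(), "queue.db")
queue = DurableQueue(path)
queue.put("send-email:42")
first_claim = queue.claim()
queue.db.close()  # "crash": the job was never acked

restarted = DurableQueue(path)
released = restarted.recover()
second_claim = restarted.claim()  # same job is handed out again
restarted.ack(second_claim[0])
remaining = restarted.claim()
```

Note that the job can be delivered twice, so the handler has to be idempotent. Committing on every `put` is also exactly the kind of cost that lowers throughput compared to a purely in-memory queue.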

One I've been bitten by several times is expecting APIs to allow me to read-my-writes, only to find that their underlying data store is eventually consistent. The integration point/API client on our end may end up being twice as complex or more just to handle that.
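A minimal sketch of the extra client-side work this tends to mean, assuming the API hands back some version token on write that you can compare on read. Many real APIs don't, in which case you need another way to recognise your own write. `EventuallyConsistentStore` is a toy stand-in I made up, not any real service.

```python
import time


class EventuallyConsistentStore:
    """Toy stand-in for a remote API whose reads lag behind writes."""

    def __init__(self, stale_reads: int = 2) -> None:
        self._stale_reads = stale_reads  # reads that still miss a new write
        self._visible = {}               # key -> (value, version)
        self._pending = []               # [reads_left, key, value, version]
        self._version = 0

    def write(self, key: str, value: str) -> int:
        self._version += 1
        self._pending.append([self._stale_reads, key, value, self._version])
        return self._version  # assumed version token returned by the API

    def read(self, key: str):
        still_pending = []
        for entry in self._pending:
            if entry[0] == 0:
                self._visible[entry[1]] = (entry[2], entry[3])
            else:
                entry[0] -= 1
                still_pending.append(entry)
        self._pending = still_pending
        return self._visible.get(key)


def read_your_write(store, key, min_version, attempts=5, delay_s=0.0):
    """Poll until the read reflects at least min_version, with exponential backoff."""
    for attempt in range(attempts):
        result = store.read(key)
        if result is not None and result[1] >= min_version:
            return result[0]
        time.sleep(delay_s * (2 ** attempt))
    raise TimeoutError(f"{key!r} did not reach version {min_version}")


store = EventuallyConsistentStore(stale_reads=2)
version = store.write("profile:1", "new name")
naive = store.read("profile:1")                         # stale: write not visible yet
value = read_your_write(store, "profile:1", version)    # polls until it is
```

Even in this toy version the client needs a retry loop, a timeout, and a way to recognise its own write, which is where the extra complexity comes from.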

> 9 times out of 10 you probably want an RDBMS.

And the last 1 can be done (modeled) in an RDBMS when the scale/pressure/volume is low. In other words, wait until you feel the heat.
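For instance, a small graph can be modeled as an edge table and traversed with a recursive query. This sketch uses SQLite from the Python standard library only so that it runs without a server. The `follows` table and the names in it are made up, and `WITH RECURSIVE` is standard SQL that Postgres supports too.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript(
    """
    CREATE TABLE follows (
        src TEXT NOT NULL,
        dst TEXT NOT NULL,
        PRIMARY KEY (src, dst)
    );
    INSERT INTO follows VALUES
        ('alice', 'bob'),
        ('bob', 'carol'),
        ('carol', 'dave'),
        ('carol', 'alice'),  -- a cycle back to alice
        ('erin', 'alice');
    """
)


def reachable(conn: sqlite3.Connection, start: str) -> list:
    """Everyone reachable from `start` by following edges, in sorted order."""
    rows = conn.execute(
        """
        WITH RECURSIVE reach(node) AS (
            SELECT dst FROM follows WHERE src = ?
            UNION  -- UNION (not UNION ALL) drops duplicates, so cycles terminate
            SELECT f.dst FROM follows AS f JOIN reach AS r ON f.src = r.node
        )
        SELECT node FROM reach ORDER BY node
        """,
        (start,),
    ).fetchall()
    return [row[0] for row in rows]


print(reachable(db, "alice"))
print(reachable(db, "dave"))
```

This won't compete with a dedicated graph database on deep traversals over large graphs, but at low volume it keeps everything in one system.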

Yeah, off the top of my head, Postgres does NoSQL (JSONB), graph, and time series stuff either natively or through cheap or freely available add-ons. It really can do anything. It’s not gonna be the best at that non-relational stuff, but it will do a “good enough” job for most use cases until you introduce heavy scaling.