Hacker News new | ask | show | jobs
by moehm 1987 days ago
> One of things that has really helped our DB scale is the fact that we’ve moved a lot of sorting, merging, filtering from the DB into the code itself.

This goes against everything I have learned until now. Maybe an experienced DBA can explain when moving stuff out of the database is helpful? I mean sorting, merging and filtering is the core discipline of a RDBMS and many, many man hours went into it. How and when can a custom version be better?

7 comments

If you have 1m clients making DB requests and are hitting performance constraints you could do any of:

  1. Add more ram, CPU, etc to the DB host
  2. Create DB read replicas for higher read volume
  3. Shard the DB (reads and writes)
  4. Offload logic to stateless clients, who you can easily scale horizontally
So option 4 is reasonable if you don't want to or can't do the other options.
I would need to know a little more about those 1M requests before making any of your points valid. If 1M are solely reads, you can do real-time transient offload onto Redis without calling it cache and breaking your bank.
True. I would rephrase option 2 as "cache and/or replicate your query results somewhere".
It's very application-specific. If I'm building a message board/forum software where I'm paginating tens of thousands to millions of rows, I'm probably going to do my sorting in the DB. If I'm working with a smaller data set like a users music playlist or favorite movies, I may opt to bring that data in unsorted from the DB and do everything else in app code. Even with small result sets, databases often over-allocate memory for sorting (sort buffers are usually pre-allocated) and even though it's often tunable, a lot of concurrent sorts on small data sets can waste memory.
I can break it down quickly, here: - Scaling web servers is much easier. Queries are (mostly, on a single host) executed one after another, so if one is slow, you suddenly have a queued of queries waiting to be executed. The goal is to execute them as quickly as possible. - 2 separate queries, both using perfect indexes will be much faster (insanely faster on this scale) than 1 query with a join. So, we just join them in code - Sorting is often a problem, since in MySQL only one index can be used per query. - No foreign keys increases insert/update query speeds, and decreases server load. - Etc, etc.

And thanks for brining this up, I've added a disclaimer that these are not to be taken for granted, and they work for us, on our scale.

I did plan a whole new article about this. With all the benchmarks we've gathered over the years.

It is easier to scale up application servers than DBs. You want the DB to spend its CPU cycles on what its real value which is ACID.
A single DB machine can easily become a performance bottleneck.

It doesn't matter if it sorts 1000x faster than client code, when there aren't cycles left to perform this (in theory) very performant sorting.

This might not be your actual problem, but I've seen it hit a completely oblivious dev team before :)

They didn't say it's done more efficiently, they said it helped their DB scale.

(It's probably cheaper/easier to scale their application servers as well.)

I'd certainly be tempted to move anything I could to the client...the end user's web browser. I imagine a fair amount of the load is from activity that never generates any revenue.