by heyplanet 3747 days ago
I would assume that when you query billions of rows, the disk is the bottleneck, not the CPU. What am I missing?

Several things:

- disks (SSDs) are very fast now, so cores are saturated more easily (when queries actually process the data instead of just reading it)

- multiple parallel (random) reads will likely be faster than a single read stream on both HDD and SSD, to some extent (esp. on larger RAID setups)

- the best optimization is still lots of RAM and people have that these days, so 100% CPU utilization during queries happens more often than not (the benchmarking setup seems suitable for more than 1 billion rows...).
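To make the first and last points concrete, here is a rough back-of-envelope sketch. The function name and every number passed to it are hypothetical, chosen only for illustration and not measured on any real system: a scan is CPU-bound whenever the cores together process data more slowly than storage (or the page cache) can deliver it.

```python
def bottleneck(storage_gb_per_s, per_core_gb_per_s, cores):
    """Return which side limits a scan: "cpu" or "storage".

    All three inputs are assumptions supplied by the caller:
    - storage_gb_per_s: how fast data can be delivered (disk, RAID or page cache)
    - per_core_gb_per_s: how fast one core can process that data for this query
    - cores: number of cores working on the query in parallel
    """
    cpu_gb_per_s = per_core_gb_per_s * cores
    return "cpu" if cpu_gb_per_s < storage_gb_per_s else "storage"


# Illustrative inputs only, not benchmark results.
fast_storage = bottleneck(storage_gb_per_s=3.0, per_core_gb_per_s=0.5, cores=4)
slow_storage = bottleneck(storage_gb_per_s=0.2, per_core_gb_per_s=0.5, cores=4)
```

With data served from RAM, the delivery rate is far higher than any disk, which is why full CPU utilization shows up so often in that case.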

Query engines typically cache data aggressively in memory (mmap, CreateFileMapping or such). Their testbed has 256 GB of RAM, meaning that the cache is likely hot the majority of the time. Even if you don't have that much memory, there is a chance that the parallel workers are working on the same pages, so the number of I/O ops does not grow in proportion to the worker count.
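A minimal sketch of the mmap approach, assuming Python's standard `mmap` module and a non-empty file. The helper names (`scan_range`, `parallel_scan`, `make_demo_file`) are made up for this example and the byte sum stands in for real query work; this is not how any particular engine does it. All workers read through one shared mapping, so a page already faulted in (by an earlier query or by another worker) is served from the OS page cache without another disk read.

```python
import mmap
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def scan_range(mm, start, end):
    """Aggregate one byte range of the mapping (here: a plain byte sum)."""
    return sum(mm[start:end])


def parallel_scan(path, workers=4):
    """Split a memory-mapped file into ranges and scan them concurrently.

    Assumes the file is non-empty (an empty file cannot be mapped).
    """
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            step = max(1, -(-size // workers))  # ceiling division
            ranges = [(lo, min(lo + step, size)) for lo in range(0, size, step)]
            with ThreadPoolExecutor(max_workers=workers) as pool:
                # Every worker shares the same mapping and page cache.
                return sum(pool.map(lambda r: scan_range(mm, r[0], r[1]), ranges))


def make_demo_file(n_bytes=1 << 20):
    """Write a small throwaway file with a known byte pattern."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(bytes(i % 251 for i in range(n_bytes)))
    return path
```

Calling `parallel_scan` twice on the same file gives the same result, and on a machine with enough free RAM the second call is normally served from memory, which is the "hot cache" situation described above.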