| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by Jweb_Guru 3525 days ago

What you're asking for is just really difficult. Even with a fairly good compression ratio (e.g. 57 bits / tuple, which I chose at random but is similar to what's achievable on real data according to http://i.stanford.edu/~adityagp/courses/cs598/papers/constan...) you're talking 83.4 GB / s of read memory bandwidth just to read the tuples at all within 10 seconds for a single query. The state of the art in terms of what you're likely to be able to actually buy (IBM Power8 machines) can do about 91.5 GB / s. in Stream Triad benchmarks. That's still nowhere close to supporting 100s of queries a minute (more like 6 or 7) and you haven't even started calculating anything yet (Triad is pretty simplistic). It also assumes you can actually achieve that data rate while keeping everything in memory (which is why GPUs probably won't help much for your use case, despite NVLink; they don't have enough RAM).

Your biggest issue with aggregation / group by is going to be the memory bandwidth to/from a hash table to store all the results. Once the hash table no longer fits in L3, its access time will rise dramatically, so your query's performance will depend heavily on how many buckets you need (if you're grouping by 2 columns, maybe not that many; if you're grouping by 20, you'll probably have almost no buckets with more than one entry). Another potential issue is going to be having to look up the values for that column in a hash table if they are highly compressed (since you need to perform aggregation on them, presumably something like sum; if it's count, this doesn't matter).

If you can find a total attribute order such that the columns you group by are always to the left of the columns you aggregate by, you can sort the rows to enable efficient delta encoding; you can also then perform aggregation in the same order as your scan, which eliminates the hash table lookup problem for output (not input, though). You can also pre-materialize the results for common subsets.

To do this with multiple queries at once, you'd probably need to batch query execution (because of the memory bandwidth issue I alluded to earlier). While the data access requirements would remain similar on read, they'd get worse and worse on write (again, depending on how many buckets you had and how cleverly sorted your data were).

Alternately, you could get a bunch of Power8s (or commodity machines, but to maximize bang for your hardware buck you really want stuff with tons of memory bandwidth) and give each of them a slice of the data (but still apply all the above optimizations). The commodity version of this is the Redshift solution. If you went this route, you could also look at specialized solutions like GPUs with NVLink or the KNL Xeon Phis, which have "fast" memory with tons of extra bandwidth which help mitigate the aforementioned hash table / query result access costs.

I still haven't talked about how you're supposed to actually get the results back. If you're trying to do ethernet through anything commodity you're going to be very limited in terms of data rate. Even 100 Gbps Infiniband only gets you 12.5 GB/s out, and even if you hook two of those up to each machine you're still only at 25 GB/s out. So either you get even more machines, your clients process the data on the machine, or you have to limit the output somehow... and if you're thinking "we'll sort it!" guess what that's going to be bound by (unless you can store the input data presorted)? Probably memory bandwidth (assuming radix sort)!

tl;dr One way or another, you're paying out the nose to satisfy the requirements you just outlined. Also worth noting that you're paying for this read performance on write.

1 comments

mej10 3525 days ago

Yeah, that is what it seems like.

Using a column store and doing a bunch of pre-aggregation is the only way we've been able to to come close to these requirements, but I keep hoping there is a system that will do what we've done without having to write code. Trading space for computation is the only thing that has worked reliably.

We also have the benefit that most queries just want a small subset of time and have things sorted to take advantage of that. But occasionally people want to do something over a large range and if isn't a set of group-bys we've pre-aggregated then they just have to be patient.

link