Hacker News new | ask | show | jobs
by paulasmuth 3519 days ago
>> Well, yeah. That's 28,108.80 a month

If you can't (or don't want to!) afford that kind of money for data analytics, please consider giving the FOSS alternative EventQL [0] a try some time.

It's super simple to set up and tries to be efficient on commodity hardware, so you can run large clusters (>100TB scale) for a couple hundred dollars a month.

[0] https://eventql.io/

DISC: I'm one of the EventQL authors

2 comments

How fast are group bys? Like say I have 120 billion rows, 25 not-sparse columns, and want to group by between 2-20 columns (5 of which are varchar), aggregating the other 5 columns?

What kind of hardware would I need to do that interactively? Or consistently sub-10 second, with 100s of queries per minute.

I have built a thing on Redshift that can do some of this, but it has been new territory for me and I am not sure I've done it "right". Constantly looking for alternatives.

Have you tried this on BigQuery yet? It's built for this kind of extremely large dataset.

You can also look at MemSQL for a distributed relational database with a columnstore. Run enough nodes and you might be able to hit your performance goals.

BigQuery makes it pretty easy to calculate how much you'll spend based on your bandwidth needs; unfortunately, if you do the math for the above use case, the answer is pretty discouraging... $5 / TB data processed, and the size in this context is explicitly the uncompressed size. Even if we're extremely generous and assume 4 bytes / column (almost certainly an underestimate given how many bytes they reserve for integers and timestamps), that is potentially 12 TB for the database, so you're paying $60 for a single query that hits the entire table. If you have 500 queries per month that hit the entire table, you're already paying as much as you would be for Redshift on 8 nodes, without even looking at any of the other queries.

The flat rate may improve things dramatically, of course, but the documentation's ambiguity about what you get with a "slot" makes it hard to say. And none of this is taking required bandwidth into account, because AFAICT there are no promises made on query response time... so I don't know if any of this would satisfy the 10 second requirement.

Either way, no matter how Google, MemSQL, or anyone else might try to satisfy these requirements, they can't get around the required hardware costs. At best, they can amortize them by buying in bulk and partitioning all their clients across lots of servers.

I don't work for any of these database providers, BTW, or use any of their products; I have no skin in this game. And I'm not even saying that paying $60 for that kind of query is necessarily a bad deal (when you consider what may be required under the hood). I'm just saying if you're looking for a cheap solution here you're not going to find it.

> I'm just saying if you're looking for a cheap solution here you're not going to find it.

That's a given. I think in this context the user was asking how to meet performance goals with cost possibly secondary or not a concern... and in that case BQ has proven to be incredible at churning through large datasets within seconds.

What you're asking for is just really difficult. Even with a fairly good compression ratio (e.g. 57 bits / tuple, which I chose at random but is similar to what's achievable on real data according to http://i.stanford.edu/~adityagp/courses/cs598/papers/constan...) you're talking 83.4 GB / s of read memory bandwidth just to read the tuples at all within 10 seconds for a single query. The state of the art in terms of what you're likely to be able to actually buy (IBM Power8 machines) can do about 91.5 GB / s. in Stream Triad benchmarks. That's still nowhere close to supporting 100s of queries a minute (more like 6 or 7) and you haven't even started calculating anything yet (Triad is pretty simplistic). It also assumes you can actually achieve that data rate while keeping everything in memory (which is why GPUs probably won't help much for your use case, despite NVLink; they don't have enough RAM).

Your biggest issue with aggregation / group by is going to be the memory bandwidth to/from a hash table to store all the results. Once the hash table no longer fits in L3, its access time will rise dramatically, so your query's performance will depend heavily on how many buckets you need (if you're grouping by 2 columns, maybe not that many; if you're grouping by 20, you'll probably have almost no buckets with more than one entry). Another potential issue is going to be having to look up the values for that column in a hash table if they are highly compressed (since you need to perform aggregation on them, presumably something like sum; if it's count, this doesn't matter).

If you can find a total attribute order such that the columns you group by are always to the left of the columns you aggregate by, you can sort the rows to enable efficient delta encoding; you can also then perform aggregation in the same order as your scan, which eliminates the hash table lookup problem for output (not input, though). You can also pre-materialize the results for common subsets.

To do this with multiple queries at once, you'd probably need to batch query execution (because of the memory bandwidth issue I alluded to earlier). While the data access requirements would remain similar on read, they'd get worse and worse on write (again, depending on how many buckets you had and how cleverly sorted your data were).

Alternately, you could get a bunch of Power8s (or commodity machines, but to maximize bang for your hardware buck you really want stuff with tons of memory bandwidth) and give each of them a slice of the data (but still apply all the above optimizations). The commodity version of this is the Redshift solution. If you went this route, you could also look at specialized solutions like GPUs with NVLink or the KNL Xeon Phis, which have "fast" memory with tons of extra bandwidth which help mitigate the aforementioned hash table / query result access costs.

I still haven't talked about how you're supposed to actually get the results back. If you're trying to do ethernet through anything commodity you're going to be very limited in terms of data rate. Even 100 Gbps Infiniband only gets you 12.5 GB/s out, and even if you hook two of those up to each machine you're still only at 25 GB/s out. So either you get even more machines, your clients process the data on the machine, or you have to limit the output somehow... and if you're thinking "we'll sort it!" guess what that's going to be bound by (unless you can store the input data presorted)? Probably memory bandwidth (assuming radix sort)!

tl;dr One way or another, you're paying out the nose to satisfy the requirements you just outlined. Also worth noting that you're paying for this read performance on write.

Yeah, that is what it seems like.

Using a column store and doing a bunch of pre-aggregation is the only way we've been able to to come close to these requirements, but I keep hoping there is a system that will do what we've done without having to write code. Trading space for computation is the only thing that has worked reliably.

We also have the benefit that most queries just want a small subset of time and have things sorted to take advantage of that. But occasionally people want to do something over a large range and if isn't a set of group-bys we've pre-aggregated then they just have to be patient.

Hadn't heard of EventQL before. Took a look at the documentation, so I have a brief idea of what it offers and how it works. But I couldn't find anything in way of performance numbers.

While I might try it out myself to see how well it performs, it would be nice if some figures were readily available.

I've been keeping an eye on Postgres-XL and CitusDB for distributed SQL. Would be interesting to compare.

Add MemSQL to the list.
Proprietary. Even the free community edition is closed source. EventQL, Postgres-XL, and CitusDB are all open source.
Honestly, why is that a problem? What open-source benefits are you taking advantage of?
Running an open source stack means you're not completely at the mercy of someone else. I'd cite Windows 10 desktop editions as a paragon of "you buy it, but you don't own it, then you suffer" proprietary products, but it's not a server component. So I have to cite FoundationDB, the proprietary database, which was seriously awesome, but so awesome that the downloadable binaries vanished overnight.

I have at least the assurance that such a thing cannot happen with Postgres (there are a multitude of for-profit companies working on it full-time, in addition to pro-bono community contributors), or with Apache Cassandra; and it's a similar assurance that keeps users of RethinkDB able to continue to operate despite the company being bought and folded.

----

Other advantages of open source: the very freedom to inspect and modify the software we run. This freedom is how OpenResty was born. In fact, Postgres itself comes from another GPLed database.

On a minor note: this freedom is what allowed us at my last job to create a patch for an internally required function in Nginx, in ten minutes. Had Nginx been closed source, we'd have had to request the makers, wait days/weeks/months, and if accepted they'd include it.