Hacker News
by manigandham 3520 days ago
Have you tried this on BigQuery yet? It's built for this kind of extremely large dataset.

You can also look at MemSQL for a distributed relational database with a columnstore. Run enough nodes and you might be able to hit your performance goals.

BigQuery makes it pretty easy to calculate how much you'll spend based on how much data your queries process; unfortunately, if you do the math for the above use case, the answer is pretty discouraging. Pricing is $5 per TB of data processed, and the size in this context is explicitly the uncompressed size. Even if we're extremely generous and assume 4 bytes per column value (almost certainly an underestimate given how many bytes they reserve for integers and timestamps), that is potentially 12 TB for the database, so you're paying $60 for a single query that hits the entire table. If you have 500 queries per month that hit the entire table, you're already paying as much as you would be for Redshift on 8 nodes, without even looking at any of the other queries.
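
For anyone who wants to plug in their own numbers, here is a rough sketch of that arithmetic in Python. The $5 per TB rate, the 12 TB uncompressed size, and the 500 full-table queries per month are the figures from this comment; the function and constant names are made up for illustration and are not part of any BigQuery API.

```python
# Back-of-the-envelope cost of full-table scans billed per TB processed.
# All figures are the ones quoted in the comment above, not current pricing.

PRICE_PER_TB_USD = 5.0        # per-TB rate quoted in the comment
TABLE_SIZE_TB = 12.0          # estimated uncompressed table size
FULL_SCANS_PER_MONTH = 500    # queries per month that touch the whole table


def query_cost_usd(tb_scanned, price_per_tb=PRICE_PER_TB_USD):
    """Cost of one query that processes `tb_scanned` TB of uncompressed data."""
    return tb_scanned * price_per_tb


def monthly_cost_usd(tb_scanned, queries_per_month, price_per_tb=PRICE_PER_TB_USD):
    """Monthly cost of running the same scan `queries_per_month` times."""
    return query_cost_usd(tb_scanned, price_per_tb) * queries_per_month


if __name__ == "__main__":
    per_query = query_cost_usd(TABLE_SIZE_TB)
    per_month = monthly_cost_usd(TABLE_SIZE_TB, FULL_SCANS_PER_MONTH)
    print(f"one full-table query: ${per_query:,.2f}")
    print(f"{FULL_SCANS_PER_MONTH} full-table queries per month: ${per_month:,.2f}")
```

With those inputs, one full scan comes to $60 and 500 of them come to $30,000 per month. Queries that only touch a few columns would scan less and cost proportionally less, so treat this as the worst case for the full-table workload.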

The flat rate may improve things dramatically, of course, but the documentation's ambiguity about what you get with a "slot" makes it hard to say. And none of this is taking required bandwidth into account, because AFAICT there are no promises made on query response time... so I don't know if any of this would satisfy the 10 second requirement.

Either way, no matter how Google, MemSQL, or anyone else might try to satisfy these requirements, they can't get around the required hardware costs. At best, they can amortize them by buying in bulk and partitioning all their clients across lots of servers.

I don't work for any of these database providers, BTW, or use any of their products; I have no skin in this game. And I'm not even saying that paying $60 for that kind of query is necessarily a bad deal (when you consider what may be required under the hood). I'm just saying if you're looking for a cheap solution here you're not going to find it.

> I'm just saying if you're looking for a cheap solution here you're not going to find it.

That's a given. I think in this context the user was asking how to meet performance goals with cost possibly secondary or not a concern... and in that case BQ has proven to be incredible at churning through large datasets within seconds.