Hacker News new | ask | show | jobs
by Cidan 3517 days ago
"...8-node DC1.8XL Amazon Redshift cluster for the tests."

Well, yeah. That's 28,108.80 a month if you're running queries on demand and don't want a delay/coordination in Amazon instance creation/destruction.

BQ may or may not be as fast, but it's truly a managed service; I give it our data and it just works. I don't have to worry about instances, boot up time, maintenance, hourly costs, etc. It's silly to focus just on query speed when there's a whole layer of management and cost that comes with it.

6 comments

>> Well, yeah. That's 28,108.80 a month

If you can't (or don't want to!) afford that kind of money for data analytics, please consider giving the FOSS alternative EventQL [0] a try some time.

It's super simple to set up and tries to be efficient on commodity hardware, so you can run large clusters (>100TB scale) for a couple hundred dollars a month.

[0] https://eventql.io/

DISC: I'm one of the EventQL authors

How fast are group bys? Like say I have 120 billion rows, 25 not-sparse columns, and want to group by between 2-20 columns (5 of which are varchar), aggregating the other 5 columns?

What kind of hardware would I need to do that interactively? Or consistently sub-10 second, with 100s of queries per minute.

I have built a thing on Redshift that can do some of this, but it has been new territory for me and I am not sure I've done it "right". Constantly looking for alternatives.

Have you tried this on BigQuery yet? It's built for this kind of extremely large dataset.

You can also look at MemSQL for a distributed relational database with a columnstore. Run enough nodes and you might be able to hit your performance goals.

BigQuery makes it pretty easy to calculate how much you'll spend based on your bandwidth needs; unfortunately, if you do the math for the above use case, the answer is pretty discouraging... $5 / TB data processed, and the size in this context is explicitly the uncompressed size. Even if we're extremely generous and assume 4 bytes / column (almost certainly an underestimate given how many bytes they reserve for integers and timestamps), that is potentially 12 TB for the database, so you're paying $60 for a single query that hits the entire table. If you have 500 queries per month that hit the entire table, you're already paying as much as you would be for Redshift on 8 nodes, without even looking at any of the other queries.

The flat rate may improve things dramatically, of course, but the documentation's ambiguity about what you get with a "slot" makes it hard to say. And none of this is taking required bandwidth into account, because AFAICT there are no promises made on query response time... so I don't know if any of this would satisfy the 10 second requirement.

Either way, no matter how Google, MemSQL, or anyone else might try to satisfy these requirements, they can't get around the required hardware costs. At best, they can amortize them by buying in bulk and partitioning all their clients across lots of servers.

I don't work for any of these database providers, BTW, or use any of their products; I have no skin in this game. And I'm not even saying that paying $60 for that kind of query is necessarily a bad deal (when you consider what may be required under the hood). I'm just saying if you're looking for a cheap solution here you're not going to find it.

> I'm just saying if you're looking for a cheap solution here you're not going to find it.

That's a given. I think in this context the user was asking how to meet performance goals with cost possibly secondary or not a concern... and in that case BQ has proven to be incredible at churning through large datasets within seconds.

What you're asking for is just really difficult. Even with a fairly good compression ratio (e.g. 57 bits / tuple, which I chose at random but is similar to what's achievable on real data according to http://i.stanford.edu/~adityagp/courses/cs598/papers/constan...) you're talking 83.4 GB / s of read memory bandwidth just to read the tuples at all within 10 seconds for a single query. The state of the art in terms of what you're likely to be able to actually buy (IBM Power8 machines) can do about 91.5 GB / s. in Stream Triad benchmarks. That's still nowhere close to supporting 100s of queries a minute (more like 6 or 7) and you haven't even started calculating anything yet (Triad is pretty simplistic). It also assumes you can actually achieve that data rate while keeping everything in memory (which is why GPUs probably won't help much for your use case, despite NVLink; they don't have enough RAM).

Your biggest issue with aggregation / group by is going to be the memory bandwidth to/from a hash table to store all the results. Once the hash table no longer fits in L3, its access time will rise dramatically, so your query's performance will depend heavily on how many buckets you need (if you're grouping by 2 columns, maybe not that many; if you're grouping by 20, you'll probably have almost no buckets with more than one entry). Another potential issue is going to be having to look up the values for that column in a hash table if they are highly compressed (since you need to perform aggregation on them, presumably something like sum; if it's count, this doesn't matter).

If you can find a total attribute order such that the columns you group by are always to the left of the columns you aggregate by, you can sort the rows to enable efficient delta encoding; you can also then perform aggregation in the same order as your scan, which eliminates the hash table lookup problem for output (not input, though). You can also pre-materialize the results for common subsets.

To do this with multiple queries at once, you'd probably need to batch query execution (because of the memory bandwidth issue I alluded to earlier). While the data access requirements would remain similar on read, they'd get worse and worse on write (again, depending on how many buckets you had and how cleverly sorted your data were).

Alternately, you could get a bunch of Power8s (or commodity machines, but to maximize bang for your hardware buck you really want stuff with tons of memory bandwidth) and give each of them a slice of the data (but still apply all the above optimizations). The commodity version of this is the Redshift solution. If you went this route, you could also look at specialized solutions like GPUs with NVLink or the KNL Xeon Phis, which have "fast" memory with tons of extra bandwidth which help mitigate the aforementioned hash table / query result access costs.

I still haven't talked about how you're supposed to actually get the results back. If you're trying to do ethernet through anything commodity you're going to be very limited in terms of data rate. Even 100 Gbps Infiniband only gets you 12.5 GB/s out, and even if you hook two of those up to each machine you're still only at 25 GB/s out. So either you get even more machines, your clients process the data on the machine, or you have to limit the output somehow... and if you're thinking "we'll sort it!" guess what that's going to be bound by (unless you can store the input data presorted)? Probably memory bandwidth (assuming radix sort)!

tl;dr One way or another, you're paying out the nose to satisfy the requirements you just outlined. Also worth noting that you're paying for this read performance on write.

Yeah, that is what it seems like.

Using a column store and doing a bunch of pre-aggregation is the only way we've been able to to come close to these requirements, but I keep hoping there is a system that will do what we've done without having to write code. Trading space for computation is the only thing that has worked reliably.

We also have the benefit that most queries just want a small subset of time and have things sorted to take advantage of that. But occasionally people want to do something over a large range and if isn't a set of group-bys we've pre-aggregated then they just have to be patient.

Hadn't heard of EventQL before. Took a look at the documentation, so I have a brief idea of what it offers and how it works. But I couldn't find anything in way of performance numbers.

While I might try it out myself to see how well it performs, it would be nice if some figures were readily available.

I've been keeping an eye on Postgres-XL and CitusDB for distributed SQL. Would be interesting to compare.

Add MemSQL to the list.
Proprietary. Even the free community edition is closed source. EventQL, Postgres-XL, and CitusDB are all open source.
Honestly, why is that a problem? What open-source benefits are you taking advantage of?
Running an open source stack means you're not completely at the mercy of someone else. I'd cite Windows 10 desktop editions as a paragon of "you buy it, but you don't own it, then you suffer" proprietary products, but it's not a server component. So I have to cite FoundationDB, the proprietary database, which was seriously awesome, but so awesome that the downloadable binaries vanished overnight.

I have at least the assurance that such a thing cannot happen with Postgres (there are a multitude of for-profit companies working on it full-time, in addition to pro-bono community contributors), or with Apache Cassandra; and it's a similar assurance that keeps users of RethinkDB able to continue to operate despite the company being bought and folded.

----

Other advantages of open source: the very freedom to inspect and modify the software we run. This freedom is how OpenResty was born. In fact, Postgres itself comes from another GPLed database.

On a minor note: this freedom is what allowed us at my last job to create a patch for an internally required function in Nginx, in ten minutes. Had Nginx been closed source, we'd have had to request the makers, wait days/weeks/months, and if accepted they'd include it.

> 28,108.80 a month

maybe i'm completely out of touch, but i'm really wondering who can afford this kind of stuff without $20M series A money in the bank. and i'm also wondering what they're going to do when they run out of that money and have zero database expertise in-house because they outsourced everything to amazon.

Maybe I'm jaded, but 330k a year or so for the very backbone of your entire real time query analytics infrastructure just doesn't seem unreasonable? This is highly specialized software, tuned and architected for the purpose of running relatively complex analytics queries and aggregates fast enough to allow interactive real time exploratory analysis.

I dont think it's insane to believe that a profitable mature data informed business should be expected to spend 1/3 of a million a year for the ability to have all that data live and available without deep latency.

I can also agree that this is quite a bit for a startup, but if the unit economics of the startup require that this kind of data be available for interactive exploring, there's going to be a deep challenge down the road in scaling it I think?

To your point, if a startup runs through 20 million and their biggest problem is paying AWS bills, then they are probably a failed startup and folks should move on towards whatever next exciting thing is on the horizon.

With 10TB of data, you could also do it a hell of a lot cheaper and faster on premise.
One engineer/DBA to deal with your cluster is those 330k/year.
Baby startups don't have a DBA, they have a dev that kinda knows SQL.
Don't those computers cost a lot? Like $10k per blade? Granted it's probably amortized but still...

On the other hand the Redshift cluster doesn't run itself, despite what Amazon says. You still need at least a part time DBA guy.

Last time I worked with physical servers the rule was to avoid blades and run rackmount until space or termal was an issue.

And as long as you were not into GPUs you could get some really decent HP servers at somewhere around $2000 - $7000 depending on your storage needs.

In your part of the world, maybe. The Valley does not reflect average DBA salaries around the world.
there are DBA services that do an excellent job for a fraction of that price. they also do AWS stuff too.

full-time DBA's work for either enormous corporations, or database consulting firms. it's very rare you will see some random small-medium company with even one DBA on payroll.

You dont really need a full time DBA to handle 10TB these days.
I give you just one example: a security vendor collects multiple terabytes per day worth of antivirus data and sells the insight that can be gathered from this to large enterprises to support their security operations. I do not know exactly how much they make out of this but is guess it is in the 200K-500K / year / customer range.

Startups on the other side, rarely have the amount of data that could not be hosted on anything for more than 1000-2000 USD/month.

Redshift absolutely dominated the other benchmarks, so while the author chose the largest instance, it's not likely to be necessary to still beat BQ.

I don't work for or have any relationship with either, but I do use Redshift.

> That's 28,108.80 a month

That is right. We had ~100K / month with on prem Hadoop (energy, networking, OPS, enterprise support, etc. included) that could have been moved to Redshift to save money. However, when we did the comparison a different cloud based solution came out as the winer for our DWH needs. The point is, showing a single number without showing how much it would be with other solutions is pointless. There are several companies that can easily afford a ~29K/month DWH in the cloud.

edit: My original comment's data was misleading due to conversion error. I edited out the comment to avoid confusion for anyone else who might not spot the error and be misinformed. Thank you for those who pointed it out!
You're math is wrong - DC1.8XL costs $4.80 per hour, not $4800 (which would be ridiculous pricing).

8 instances of DC1.8XL = $38.4 which means you can run the redshift cluster for 28 hours (edit: for $1100, which is the cost of scanning 220TB of data with BQ, used in previous comment).

Redshift does win on price when you are continuously running queries and scanning entire datasets, although there is enterprise pricing available for BQ if you hit a certain amount.

BQ offers a flat rate pricing package now.

https://cloud.google.com/bigquery/pricing#flat_rate_pricing

That's what I meant by enterprise pricing.
Completely correct! I apologize, used to seeing a different notation and the extra 0 in 4.800 threw me off. I removed the misleading data because judging by the points, not everyone caught it.
> and a DC1.8XL costs $4800/hour to "compute" the data.

Is this a typo? A single DC1.8XL is $4.80/hour. 8, like the article, would be $38.40/hour.

As stated, on demand pricing was used. Both products have special pricing at higher usages and flat-pricing tiers. It would start to get compilicated to compare them at that point.

Numbers on https://aws.amazon.com/redshift/pricing/ used.

dc1.8xlarge - $4.800 per Hour

> $4.800 per Hour

That's four point eight dollars, not four thousand eight hundred.

Fixed! Used to seeing 4.800 being 4800 apologies!
I figured. It's easy to forget that there's really no international/intercultural agreement on number punctuation, particularly combined with a service that wants to price down to tenths of cents.
I don't understand how you calculated how much it costs to compute on the Redshift cluster. $1100 would be ~28 hours of the 8-node cluster.
I give it our data and it just works

Could you please talk a bit more about your use cases? what domain, volume of data, how you manage updates etc, if it is possible for you to share?