| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by Cidan 3564 days ago

"...8-node DC1.8XL Amazon Redshift cluster for the tests."

Well, yeah. That's 28,108.80 a month if you're running queries on demand and don't want a delay/coordination in Amazon instance creation/destruction.

BQ may or may not be as fast, but it's truly a managed service; I give it our data and it just works. I don't have to worry about instances, boot up time, maintenance, hourly costs, etc. It's silly to focus just on query speed when there's a whole layer of management and cost that comes with it.

6 comments

paulasmuth 3563 days ago

>> Well, yeah. That's 28,108.80 a month

If you can't (or don't want to!) afford that kind of money for data analytics, please consider giving the FOSS alternative EventQL [0] a try some time.

It's super simple to set up and tries to be efficient on commodity hardware, so you can run large clusters (>100TB scale) for a couple hundred dollars a month.

[0] https://eventql.io/

DISC: I'm one of the EventQL authors

mej10 3563 days ago

How fast are group bys? Like say I have 120 billion rows, 25 not-sparse columns, and want to group by between 2-20 columns (5 of which are varchar), aggregating the other 5 columns?

What kind of hardware would I need to do that interactively? Or consistently sub-10 second, with 100s of queries per minute.

I have built a thing on Redshift that can do some of this, but it has been new territory for me and I am not sure I've done it "right". Constantly looking for alternatives.

manigandham 3563 days ago

Have you tried this on BigQuery yet? It's built for this kind of extremely large dataset.

You can also look at MemSQL for a distributed relational database with a columnstore. Run enough nodes and you might be able to hit your performance goals.

Jweb_Guru 3563 days ago

BigQuery makes it pretty easy to calculate how much you'll spend based on your bandwidth needs; unfortunately, if you do the math for the above use case, the answer is pretty discouraging... $5 / TB data processed, and the size in this context is explicitly the uncompressed size. Even if we're extremely generous and assume 4 bytes / column (almost certainly an underestimate given how many bytes they reserve for integers and timestamps), that is potentially 12 TB for the database, so you're paying $60 for a single query that hits the entire table. If you have 500 queries per month that hit the entire table, you're already paying as much as you would be for Redshift on 8 nodes, without even looking at any of the other queries.

The flat rate may improve things dramatically, of course, but the documentation's ambiguity about what you get with a "slot" makes it hard to say. And none of this is taking required bandwidth into account, because AFAICT there are no promises made on query response time... so I don't know if any of this would satisfy the 10 second requirement.

Either way, no matter how Google, MemSQL, or anyone else might try to satisfy these requirements, they can't get around the required hardware costs. At best, they can amortize them by buying in bulk and partitioning all their clients across lots of servers.

I don't work for any of these database providers, BTW, or use any of their products; I have no skin in this game. And I'm not even saying that paying $60 for that kind of query is necessarily a bad deal (when you consider what may be required under the hood). I'm just saying if you're looking for a cheap solution here you're not going to find it.

manigandham 3563 days ago

> I'm just saying if you're looking for a cheap solution here you're not going to find it.

That's a given. I think in this context the user was asking how to meet performance goals with cost possibly secondary or not a concern... and in that case BQ has proven to be incredible at churning through large datasets within seconds.

Jweb_Guru 3563 days ago

What you're asking for is just really difficult. Even with a fairly good compression ratio (e.g. 57 bits / tuple, which I chose at random but is similar to what's achievable on real data according to http://i.stanford.edu/~adityagp/courses/cs598/papers/constan...) you're talking 83.4 GB / s of read memory bandwidth just to read the tuples at all within 10 seconds for a single query. The state of the art in terms of what you're likely to be able to actually buy (IBM Power8 machines) can do about 91.5 GB / s. in Stream Triad benchmarks. That's still nowhere close to supporting 100s of queries a minute (more like 6 or 7) and you haven't even started calculating anything yet (Triad is pretty simplistic). It also assumes you can actually achieve that data rate while keeping everything in memory (which is why GPUs probably won't help much for your use case, despite NVLink; they don't have enough RAM).

Your biggest issue with aggregation / group by is going to be the memory bandwidth to/from a hash table to store all the results. Once the hash table no longer fits in L3, its access time will rise dramatically, so your query's performance will depend heavily on how many buckets you need (if you're grouping by 2 columns, maybe not that many; if you're grouping by 20, you'll probably have almost no buckets with more than one entry). Another potential issue is going to be having to look up the values for that column in a hash table if they are highly compressed (since you need to perform aggregation on them, presumably something like sum; if it's count, this doesn't matter).

If you can find a total attribute order such that the columns you group by are always to the left of the columns you aggregate by, you can sort the rows to enable efficient delta encoding; you can also then perform aggregation in the same order as your scan, which eliminates the hash table lookup problem for output (not input, though). You can also pre-materialize the results for common subsets.

To do this with multiple queries at once, you'd probably need to batch query execution (because of the memory bandwidth issue I alluded to earlier). While the data access requirements would remain similar on read, they'd get worse and worse on write (again, depending on how many buckets you had and how cleverly sorted your data were).

Alternately, you could get a bunch of Power8s (or commodity machines, but to maximize bang for your hardware buck you really want stuff with tons of memory bandwidth) and give each of them a slice of the data (but still apply all the above optimizations). The commodity version of this is the Redshift solution. If you went this route, you could also look at specialized solutions like GPUs with NVLink or the KNL Xeon Phis, which have "fast" memory with tons of extra bandwidth which help mitigate the aforementioned hash table / query result access costs.

I still haven't talked about how you're supposed to actually get the results back. If you're trying to do ethernet through anything commodity you're going to be very limited in terms of data rate. Even 100 Gbps Infiniband only gets you 12.5 GB/s out, and even if you hook two of those up to each machine you're still only at 25 GB/s out. So either you get even more machines, your clients process the data on the machine, or you have to limit the output somehow... and if you're thinking "we'll sort it!" guess what that's going to be bound by (unless you can store the input data presorted)? Probably memory bandwidth (assuming radix sort)!

tl;dr One way or another, you're paying out the nose to satisfy the requirements you just outlined. Also worth noting that you're paying for this read performance on write.

mej10 3563 days ago

Yeah, that is what it seems like.

Using a column store and doing a bunch of pre-aggregation is the only way we've been able to to come close to these requirements, but I keep hoping there is a system that will do what we've done without having to write code. Trading space for computation is the only thing that has worked reliably.

We also have the benefit that most queries just want a small subset of time and have things sorted to take advantage of that. But occasionally people want to do something over a large range and if isn't a set of group-bys we've pre-aggregated then they just have to be patient.

pritambaral 3563 days ago

Hadn't heard of EventQL before. Took a look at the documentation, so I have a brief idea of what it offers and how it works. But I couldn't find anything in way of performance numbers.

While I might try it out myself to see how well it performs, it would be nice if some figures were readily available.

I've been keeping an eye on Postgres-XL and CitusDB for distributed SQL. Would be interesting to compare.

manigandham 3563 days ago

Add MemSQL to the list.

pritambaral 3563 days ago

Proprietary. Even the free community edition is closed source. EventQL, Postgres-XL, and CitusDB are all open source.

manigandham 3562 days ago

Honestly, why is that a problem? What open-source benefits are you taking advantage of?

pritambaral 3562 days ago

Running an open source stack means you're not completely at the mercy of someone else. I'd cite Windows 10 desktop editions as a paragon of "you buy it, but you don't own it, then you suffer" proprietary products, but it's not a server component. So I have to cite FoundationDB, the proprietary database, which was seriously awesome, but so awesome that the downloadable binaries vanished overnight.

I have at least the assurance that such a thing cannot happen with Postgres (there are a multitude of for-profit companies working on it full-time, in addition to pro-bono community contributors), or with Apache Cassandra; and it's a similar assurance that keeps users of RethinkDB able to continue to operate despite the company being bought and folded.

----

Other advantages of open source: the very freedom to inspect and modify the software we run. This freedom is how OpenResty was born. In fact, Postgres itself comes from another GPLed database.

On a minor note: this freedom is what allowed us at my last job to create a patch for an internally required function in Nginx, in ten minutes. Had Nginx been closed source, we'd have had to request the makers, wait days/weeks/months, and if accepted they'd include it.

beachstartup 3564 days ago

> 28,108.80 a month

maybe i'm completely out of touch, but i'm really wondering who can afford this kind of stuff without $20M series A money in the bank. and i'm also wondering what they're going to do when they run out of that money and have zero database expertise in-house because they outsourced everything to amazon.

earino 3564 days ago

Maybe I'm jaded, but 330k a year or so for the very backbone of your entire real time query analytics infrastructure just doesn't seem unreasonable? This is highly specialized software, tuned and architected for the purpose of running relatively complex analytics queries and aggregates fast enough to allow interactive real time exploratory analysis.

I dont think it's insane to believe that a profitable mature data informed business should be expected to spend 1/3 of a million a year for the ability to have all that data live and available without deep latency.

I can also agree that this is quite a bit for a startup, but if the unit economics of the startup require that this kind of data be available for interactive exploring, there's going to be a deep challenge down the road in scaling it I think?

To your point, if a startup runs through 20 million and their biggest problem is paying AWS bills, then they are probably a failed startup and folks should move on towards whatever next exciting thing is on the horizon.

tw04 3564 days ago

With 10TB of data, you could also do it a hell of a lot cheaper and faster on premise.

trhway 3564 days ago

One engineer/DBA to deal with your cluster is those 330k/year.

brianwawok 3564 days ago

Baby startups don't have a DBA, they have a dev that kinda knows SQL.

foobarian 3564 days ago

Don't those computers cost a lot? Like $10k per blade? Granted it's probably amortized but still...

On the other hand the Redshift cluster doesn't run itself, despite what Amazon says. You still need at least a part time DBA guy.

eitland 3563 days ago

Last time I worked with physical servers the rule was to avoid blades and run rackmount until space or termal was an issue.

And as long as you were not into GPUs you could get some really decent HP servers at somewhere around $2000 - $7000 depending on your storage needs.

rad_gruchalski 3564 days ago

In your part of the world, maybe. The Valley does not reflect average DBA salaries around the world.

beachstartup 3564 days ago

there are DBA services that do an excellent job for a fraction of that price. they also do AWS stuff too.

full-time DBA's work for either enormous corporations, or database consulting firms. it's very rare you will see some random small-medium company with even one DBA on payroll.

manigandham 3564 days ago

You dont really need a full time DBA to handle 10TB these days.

neom 3564 days ago

https://en.wikipedia.org/wiki/Fortune_500

StreamBright 3563 days ago

I give you just one example: a security vendor collects multiple terabytes per day worth of antivirus data and sells the insight that can be gathered from this to large enterprises to support their security operations. I do not know exactly how much they make out of this but is guess it is in the 200K-500K / year / customer range.

Startups on the other side, rarely have the amount of data that could not be hosted on anything for more than 1000-2000 USD/month.

dahdum 3564 days ago

Redshift absolutely dominated the other benchmarks, so while the author chose the largest instance, it's not likely to be necessary to still beat BQ.

I don't work for or have any relationship with either, but I do use Redshift.

StreamBright 3563 days ago

> That's 28,108.80 a month

That is right. We had ~100K / month with on prem Hadoop (energy, networking, OPS, enterprise support, etc. included) that could have been moved to Redshift to save money. However, when we did the comparison a different cloud based solution came out as the winer for our DWH needs. The point is, showing a single number without showing how much it would be with other solutions is pointless. There are several companies that can easily afford a ~29K/month DWH in the cloud.

mooman219 3564 days ago

edit: My original comment's data was misleading due to conversion error. I edited out the comment to avoid confusion for anyone else who might not spot the error and be misinformed. Thank you for those who pointed it out!

manigandham 3564 days ago

You're math is wrong - DC1.8XL costs $4.80 per hour, not $4800 (which would be ridiculous pricing).

8 instances of DC1.8XL = $38.4 which means you can run the redshift cluster for 28 hours (edit: for $1100, which is the cost of scanning 220TB of data with BQ, used in previous comment).

Redshift does win on price when you are continuously running queries and scanning entire datasets, although there is enterprise pricing available for BQ if you hit a certain amount.

cobookman 3564 days ago

BQ offers a flat rate pricing package now.

https://cloud.google.com/bigquery/pricing#flat_rate_pricing

manigandham 3563 days ago

That's what I meant by enterprise pricing.

mooman219 3564 days ago

Completely correct! I apologize, used to seeing a different notation and the extra 0 in 4.800 threw me off. I removed the misleading data because judging by the points, not everyone caught it.

mej10 3564 days ago

> and a DC1.8XL costs $4800/hour to "compute" the data.

Is this a typo? A single DC1.8XL is $4.80/hour. 8, like the article, would be $38.40/hour.

mooman219 3564 days ago

As stated, on demand pricing was used. Both products have special pricing at higher usages and flat-pricing tiers. It would start to get compilicated to compare them at that point.

Numbers on https://aws.amazon.com/redshift/pricing/ used.

dc1.8xlarge - $4.800 per Hour

zerocrates 3564 days ago

> $4.800 per Hour

That's four point eight dollars, not four thousand eight hundred.

mooman219 3564 days ago

Fixed! Used to seeing 4.800 being 4800 apologies!

zerocrates 3564 days ago

I figured. It's easy to forget that there's really no international/intercultural agreement on number punctuation, particularly combined with a service that wants to price down to tenths of cents.

mej10 3564 days ago

I don't understand how you calculated how much it costs to compute on the Redshift cluster. $1100 would be ~28 hours of the 8-node cluster.

vijayr 3563 days ago

I give it our data and it just works

Could you please talk a bit more about your use cases? what domain, volume of data, how you manage updates etc, if it is possible for you to share?