by mixedbit 4859 days ago
Which off-the-shelf RDBMS can handle queries over 3 billion rows?
5 comments

In 2007 I worked for a firm with a 4 billion row join table in PostgreSQL. Might've been 7 or 8, I don't recall which. It ran on a quad-core server with 16 GB of RAM. Joins going through this table took about 2-3 seconds to complete.
But I suspect the join must have been over an indexed column, so it did not touch all 4 billion rows; otherwise 2-3 seconds would be hard to believe. The group-by query in the article must access all 3 billion rows, which makes a huge difference.
All the columns were indexed.
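
To make the difference between an indexed lookup and a full-table group-by concrete, here is a minimal sketch. It uses SQLite (from Python's standard library) instead of PostgreSQL, and the table `link` and its columns are made up for illustration, so treat it as a toy and not as the setup described above. It asks the planner how it would run each query.

```python
import sqlite3

def plan_details(conn, sql):
    """Return the human-readable detail strings from EXPLAIN QUERY PLAN."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

# Hypothetical join table: 10,000 rows, indexed on a_id only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE link (a_id INTEGER, b_id INTEGER)")
conn.execute("CREATE INDEX link_a ON link (a_id)")
conn.executemany(
    "INSERT INTO link VALUES (?, ?)",
    [(i % 100, i) for i in range(10000)],
)

# Lookup through the index: only the matching rows are visited.
lookup_plan = plan_details(conn, "SELECT b_id FROM link WHERE a_id = 42")

# Group-by over the whole table: every row has to be read.
groupby_plan = plan_details(conn, "SELECT b_id, COUNT(*) FROM link GROUP BY b_id")

print(lookup_plan)
print(groupby_plan)
```

The first plan should report a SEARCH using the index, the second a SCAN of the whole table. The exact wording of the plan text differs between SQLite versions.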

I remember it well, because I was trying to explain why having tens of gigabytes of indexes wouldn't help them much if they only had 16 GB of RAM.

In terms of group-by performance, it depends a lot on the kind of data and how it's stored. For example, taking a sum on a columnar store is quite amenable to parallel solutions, and a lot of databases will do it that way.
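
A rough sketch of why a columnar sum parallelizes so easily: the column can be cut into independent chunks, each chunk summed on its own, and the partial sums combined at the end. The function names and the chunk size below are made up for illustration, and this is not how any particular database implements it. CPython threads also won't give a real CPU speedup here; the point is only the shape of the computation.

```python
from concurrent.futures import ThreadPoolExecutor

def chunked(column, chunk_size):
    """Yield consecutive slices of a column, like the blocks of a columnar store."""
    for start in range(0, len(column), chunk_size):
        yield column[start:start + chunk_size]

def parallel_sum(column, chunk_size=1000, workers=4):
    """Sum each chunk independently, then combine the partial sums."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(sum, chunked(column, chunk_size))
        return sum(partials)

# Example: one "column" of 10,000 integers.
values = list(range(10000))
total = parallel_sum(values)
```

Because addition is associative, the order in which the partial sums come back does not matter, which is what makes this kind of aggregate easy to spread across cores or machines.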

I can't think of an off the shelf RDBMS which can't handle queries on 3 billion rows.

SQL Server can

Oracle can

Postgres can

Even MySQL can (!)

The limitations are almost always in the hardware, not the software.

If you're looking at column based systems, you can look at Greenplum (does both row and column-based storage), InfiniDB (MySQL based), and all sorts of expensive but very fast appliance options like Netezza, Teradata, etc.

postgres?
Counter-question: Which startup has an actual data table with over 3 billion rows?
We have just crossed 2 billion items in our datastore. While not 3 billion yet, I expect that to happen later this year.

Too bad Redshift can't handle JSON files: Converting everything will be annoying.

Our idea is to convert from JSON continuously while loading into Redshift. http://www.hapyrus.com/pages/flydata-for-redshift
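
For anyone doing the conversion by hand, a minimal sketch of turning newline-delimited JSON into CSV with a fixed column order might look like the following. The function name and the column names in the example are hypothetical, it assumes one flat JSON object per line, and it is not a description of how the service linked above works.

```python
import csv
import io
import json

def json_lines_to_csv(json_lines, columns):
    """Convert newline-delimited JSON objects to CSV text with a fixed column order.

    Keys not listed in `columns` are dropped; missing keys become empty fields.
    """
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=columns, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for line in json_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        writer.writerow(json.loads(line))
    return out.getvalue()

# Example with made-up event records.
sample = ['{"id": 1, "event": "click"}', "", '{"id": 2, "extra": true}']
csv_text = json_lines_to_csv(sample, ["id", "event"])
```

Nested objects would need flattening first, which this sketch does not attempt.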
When you log every mousedown because the founder misunderstands A/B testing, 3 billion rows is easy to come by. Besides - you're busy changing the world, so you should expect to use the same technology as Facebook and Google.
SAP HANA would be one, but it is basically in-memory, so very, very expensive.
And it's not an RDBMS. It's basically the same technology as Redshift, but not cloud based (yet).
Actually HANA One is available in the AWS Marketplace: https://aws.amazon.com/marketplace/pp/B009KA3CRY/ref=mkt_ste...