	by mixedbit 4859 days ago
	Which off-the-shelf RDBMS can handle queries over 3 billion rows?

5 comments

jacques_chester 4859 days ago

In 2007 I worked for a firm with a 4 billion row join table in PostgreSQL. Might've been 7 or 8, I don't recall which. It ran on a quad-core server with 16 GB of RAM. Joins going through this table took about 2-3 seconds to complete.

mixedbit 4858 days ago

But I suspect the join must have been over an indexed column, so it did not touch all 4 billion rows; otherwise 2-3 seconds would be hard to believe. The GROUP BY query in the article must access all 3 billion rows, which makes a huge difference.
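
To make the distinction concrete, here is a minimal sketch using SQLite from Python's standard library. It is not the PostgreSQL setup discussed above, and the table, column, and index names are made up for illustration. It only asks the query planner how it would run an indexed join versus a GROUP BY over an unindexed column.

```python
import sqlite3

# Hypothetical schema: a small "join table" with an index on the join column.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE memberships (user_id INTEGER, group_id INTEGER);
    CREATE INDEX idx_memberships_user ON memberships (user_id);
""")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [(i, f"user{i}") for i in range(1000)],
)
conn.executemany(
    "INSERT INTO memberships VALUES (?, ?)",
    [(i % 1000, i % 50) for i in range(20000)],
)


def plan(sql, params=()):
    """Return the human-readable steps of SQLite's query plan."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[3] for row in rows]  # the fourth column holds the description


# Join through the indexed column for one user: index lookups only.
join_plan = plan(
    "SELECT u.name, m.group_id FROM users u "
    "JOIN memberships m ON m.user_id = u.id WHERE u.id = ?",
    (42,),
)

# Aggregate over every row: the planner has to scan the whole table.
groupby_plan = plan(
    "SELECT group_id, COUNT(*) FROM memberships GROUP BY group_id"
)

for step in join_plan + groupby_plan:
    print(step)
```

The exact wording of the plan varies between SQLite versions, but the join should be reported as index searches and the aggregate as a full table scan, which is the difference being pointed out here.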

jacques_chester 4858 days ago

All the columns were indexed.

I remember it well, because I was trying to explain why having tens of gigabytes of indexes wouldn't help them much if they only had 16 GB of RAM.

In terms of group-by performance, it depends a lot on the kind of data and how it's stored. For example, taking a sum on a columnar store is quite amenable to parallel solutions, and a lot of databases will do it that way.
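
A rough sketch of why a columnar sum parallelizes well: the column is one contiguous array, so it can be cut into chunks, each chunk summed independently, and the partial sums combined at the end. The function name and chunk count below are illustrative assumptions, not any particular database's implementation. Threads in CPython will not actually speed up this pure-Python arithmetic, so the code shows the shape of the computation rather than a benchmark.

```python
from array import array
from concurrent.futures import ThreadPoolExecutor


def parallel_sum(column, n_chunks=4):
    """Sum a column by splitting it into chunks and combining partial sums."""
    # Ceiling division so every element lands in exactly one chunk.
    chunk_size = max(1, (len(column) + n_chunks - 1) // n_chunks)
    chunks = [
        column[start:start + chunk_size]
        for start in range(0, len(column), chunk_size)
    ]
    # Each chunk is independent, so the partial sums can run concurrently.
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        partials = list(pool.map(sum, chunks))
    return sum(partials)


# A single "amount" column stored contiguously, as a column store would keep it.
amount_column = array("q", range(1000))
total = parallel_sum(amount_column)
print(total)
```

A row store would have to read every full row to get at the one field being summed, whereas here only the column of interest is touched.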

EwanToo 4858 days ago

I can't think of an off the shelf RDBMS which can't handle queries on 3 billion rows.

SQL Server can

Oracle can

Postgres can

Even MySQL can (!)

The limitations are almost always in the hardware, not the software.

If you're looking at column based systems, you can look at Greenplum (does both row and column-based storage), InfiniDB (MySQL based), and all sorts of expensive but very fast appliance options like Netezza, Teradata, etc.

jbverschoor 4859 days ago

postgres?

meritt 4859 days ago

Counter-question: Which startup has an actual data table with over 3 billion rows?

PanMan 4859 days ago

We have just crossed 2 billion items in our datastore. While not 3 billion yet, I expect that to happen later this year.

Too bad Redshift can't handle JSON files: Converting everything will be annoying.
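
For what it's worth, the conversion itself can be fairly mechanical if the records are flat. Below is a minimal sketch, assuming one JSON object per line and a known list of columns; the function name and field names are hypothetical and nothing here is Redshift-specific. It only produces delimited text of the kind a bulk loader could ingest.

```python
import csv
import io
import json


def json_lines_to_csv(lines, columns):
    """Convert newline-delimited JSON objects into CSV text with fixed columns."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        # Missing keys become empty fields so every row has the same width.
        writer.writerow([record.get(col, "") for col in columns])
    return out.getvalue()


# Hypothetical sample records; the second one lacks the "name" field.
sample = ['{"id": 1, "name": "a"}', '{"id": 2}']
csv_text = json_lines_to_csv(sample, ["id", "name"])
print(csv_text)
```

Nested objects and arrays are where this gets annoying, since they need to be flattened or split into separate tables before a row-per-record layout makes sense.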

fujibee 4859 days ago

Our idea is to convert the JSON continuously as it is loaded into Redshift. http://www.hapyrus.com/pages/flydata-for-redshift

nkohari 4859 days ago

We do. http://adzerk.com/

badgar 4859 days ago

When you log every mousedown because the founder misunderstands A/B testing, 3 billion rows is easy to come by. Besides - you're busy changing the world, so you should expect to use the same technology as Facebook and Google.

taligent 4859 days ago

SAP HANA would be one, but it is basically in-memory, so very, very expensive.

mbesto 4859 days ago

And it's not an RDBMS. It's basically the same technology as Redshift, but not cloud-based (yet).

res0nat0r 4859 days ago

Actually HANA One is available in the AWS Marketplace: https://aws.amazon.com/marketplace/pp/B009KA3CRY/ref=mkt_ste...
