	by kiyoto 4402 days ago
	> Need a beefy RDBMS for 15mm rows? Maybe if you want to store the whole denormalized table in memory, but if you're just indexing a small field (or even partial-indexing a larger field) you should have no problem.

	Good point. Honestly, I don't have much experience using row-based RDBMSs for analytics (my background is mostly in finance, where folks use expensive proprietary columnar databases, and in Hadoop). Any good resources on testing the limits of MySQL/PostgreSQL for analytics?
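For readers unfamiliar with the partial-indexing idea in the quoted comment, here is a minimal sketch. It uses Python's built-in `sqlite3` purely as a convenient stand-in, which is an assumption on my part: the thread is about MySQL/PostgreSQL, and of those two only PostgreSQL supports a `CREATE INDEX ... WHERE ...` partial index of this form. The `events` table, its columns, and the index name are all hypothetical.

```python
import sqlite3


def build_demo(n_rows=10_000):
    """Create an in-memory table (hypothetical schema) with a partial index."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE events (id INTEGER PRIMARY KEY, status TEXT, amount REAL)"
    )
    # Roughly 1% of rows are 'failed'; the rest are 'ok'.
    rows = (("failed" if i % 100 == 0 else "ok", float(i)) for i in range(n_rows))
    conn.executemany("INSERT INTO events (status, amount) VALUES (?, ?)", rows)
    # Partial index: only the 'failed' rows are indexed, so the index stays
    # small even when the table itself is large.
    conn.execute(
        "CREATE INDEX idx_failed_amount ON events (amount) WHERE status = 'failed'"
    )
    conn.commit()
    return conn


def query_plan(conn, sql):
    """Return the planner's description of how it would run the query."""
    return " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))


conn = build_demo()
sql = (
    "SELECT COUNT(*), SUM(amount) FROM events "
    "WHERE status = 'failed' AND amount > 5000"
)
plan = query_plan(conn, sql)
count, total = conn.execute(sql).fetchone()
print(plan)
print(count, total)
```

The query repeats the index's `status = 'failed'` condition, which is what lets a planner consider the partial index at all. Whether it actually picks it depends on the engine and its statistics, so check the printed plan rather than assuming.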

1 comments

zopf 4400 days ago

I've spoken to friends who've played with billion+ row Oracle RDBMS installs, and we (at Next Big Sound) have an offline snapshot MySQL instance with tables of up to about a hundred million rows (with over a hundred columns).

That said, I agree that distributed columnar stores end up being much more useful for large-scale analytics, and the power of high computation parallelism seals the deal. We've mostly moved on from those snapshot MySQL databases to Impala running on top of our Hadoop cluster, so you're preaching to the choir :)

That said, a hell of a lot of analytics can be done in a properly-structured SQL database, and schema changes aren't a big deal as long as you don't need to do them online in a production system.

More info: http://stackoverflow.com/questions/14733462/can-mysql-handle...

kiyoto 4398 days ago

Thanks a lot!

Yeah, I felt like a total n00b when I came to the web startup world a few years ago. This sounds ridiculous, but the one and only database I had used until that point was kdb+ (kx.com). I had no idea about the performance tradeoffs of any other databases.

I agree with you that properly-structured SQL databases can scale horizontally/vertically. That said, I've noticed that the set of people who know SQL performance well and the set of data analysts/statistically inclined folks do not overlap much (myself included), and frankly, data analysts should be able to focus on analysis, not SQL optimizations.

In a way, this is the problem Impala (and other MPP databases) solves at many companies: it's not that their data analysis cannot be handled with MySQL/Oracle, but that it's cheaper and quicker to throw all the data into HDFS and query it via Impala (aside from the cost of setting up and maintaining Impala).