by zopf 4402 days ago
Cool analysis. I wonder if you could show something like a LOESS curve fitted across all the articles' timeseries? Or if they're all roughly linear descents, I wonder if you could show the distribution of slopes - do some descend faster than others? Why?
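
To make the slope idea concrete, here is a minimal sketch in plain Python. It assumes each article's timeseries is a list of (minutes since posting, rank) pairs; that data shape and the function names `slope` and `slope_summary` are hypothetical, not taken from the post. It fits an ordinary least-squares line per article, which is a simpler stand-in for LOESS, and summarizes the spread of the slopes.

```python
from statistics import mean, median


def slope(points):
    """Least-squares slope of rank against time for one article.

    `points` is a list of (minutes_since_posting, rank) pairs (assumed shape).
    A larger rank number means a lower position, so a positive slope
    means the article is descending the page.
    """
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_bar, y_bar = mean(xs), mean(ys)
    denom = sum((x - x_bar) ** 2 for x in xs)
    if denom == 0:
        raise ValueError("need at least two distinct time points")
    return sum((x - x_bar) * (y - y_bar) for x, y in points) / denom


def slope_summary(series_by_article):
    """Return per-article slopes plus min/median/max across articles."""
    slopes = {title: slope(pts) for title, pts in series_by_article.items()}
    ordered = sorted(slopes.values())
    summary = {"min": ordered[0], "median": median(ordered), "max": ordered[-1]}
    return slopes, summary
```

If the descents really are roughly linear, a histogram of these slopes would show whether some articles fall faster than others.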

And then, a bone to pick:

Need a beefy RDBMS for 15mm rows? Maybe if you want to store the whole denormalized table in memory, but if you're just indexing a small field (or even partial-indexing a larger field) you should have no problem. The table will just spill to disk and page in as necessary, and you're mostly appending anyway so you shouldn't have much trouble. Plus, you could normalize the data: store the (large) article title in an Articles table with an id (hash of title?) and then just store the ranks in a Ranks table for less overall storage than the NoSQL database (thus needing a less-beefy machine).
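
The normalized layout described above can be sketched with Python's built-in `sqlite3` module. This is only an illustration under assumptions: the table and column names (`articles`, `ranks`, `position`, `observed_at`) and the helper functions are hypothetical, and it uses an integer primary key rather than a hash of the title.

```python
import sqlite3


def build_schema(conn):
    """Create the two tables: titles stored once, ranks stored per observation."""
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS articles (
            id    INTEGER PRIMARY KEY,
            title TEXT NOT NULL UNIQUE
        );
        CREATE TABLE IF NOT EXISTS ranks (
            article_id  INTEGER NOT NULL REFERENCES articles(id),
            observed_at INTEGER NOT NULL,  -- e.g. Unix timestamp
            position    INTEGER NOT NULL   -- rank on the page
        );
        -- Small index on narrow integer columns, not on the title text.
        CREATE INDEX IF NOT EXISTS ranks_by_article
            ON ranks (article_id, observed_at);
        """
    )


def record_rank(conn, title, observed_at, position):
    """Append one observation, inserting the article row only if it is new."""
    conn.execute("INSERT OR IGNORE INTO articles (title) VALUES (?)", (title,))
    (article_id,) = conn.execute(
        "SELECT id FROM articles WHERE title = ?", (title,)
    ).fetchone()
    conn.execute(
        "INSERT INTO ranks (article_id, observed_at, position) VALUES (?, ?, ?)",
        (article_id, observed_at, position),
    )
    return article_id


def rank_history(conn, title):
    """Return [(observed_at, position), ...] for one article, oldest first."""
    return conn.execute(
        """
        SELECT r.observed_at, r.position
        FROM ranks AS r JOIN articles AS a ON a.id = r.article_id
        WHERE a.title = ?
        ORDER BY r.observed_at
        """,
        (title,),
    ).fetchall()
```

Each rank observation then costs three integers, and the long title is stored only once per article.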

Nothing against modern Not-only-SQL solutions or document stores, but don't discount RDBMS. Schemas aren't so scary or unwieldy that you should never use them.

Anyway, thanks for an informative post!

2 comments

>Need a beefy RDBMS for 15mm rows? Maybe if you want to store the whole denormalized table in memory, but if you're just indexing a small field (or even partial-indexing a larger field) you should have no problem.

Good point. Honestly, I don't have much experience using row-based RDBMSs for analytics (my background is mostly in finance, where folks use expensive proprietary columnar databases, and in Hadoop). Any good resources on testing the limits of MySQL/PostgreSQL for analytics?

I've spoken to friends who've played with billion+ row Oracle RDBMS installs, and we (at Next Big Sound) have an offline snapshot MySQL instance with tables of up to about a hundred million rows (with over a hundred columns).

That said, I agree that distributed columnar stores end up being much more useful for large-scale analytics, and the power of high computation parallelism seals the deal. We've mostly moved on from those snapshot MySQL databases to Impala running on top of our Hadoop cluster, so you're preaching to the choir :)

Still, a hell of a lot of analytics can be done in a properly-structured SQL database, and schema changes aren't a big deal as long as you don't need to do them online in a production system.

More info: http://stackoverflow.com/questions/14733462/can-mysql-handle...

Thanks a lot!

Yeah, I felt like a total n00b when I came to the web startup world a few years ago. This sounds ridiculous, but the one and only database I had used until that point was kdb+ (kx.com). I had no idea about the performance and tradeoffs of any other databases.

I agree with you that properly-structured SQL databases can scale horizontally/vertically. That said, I've noticed that the set of people who know SQL performance well and the set of data analysts/statistically inclined folks do not overlap much (myself included), and frankly, data analysts should be able to focus on analysis, not SQL optimizations.

In a way, this is the problem Impala (and other MPP databases) solves at many companies: it's not that their data analysis cannot be handled with MySQL/Oracle, but it's cheaper and quicker to throw all the data in HDFS and query via Impala (sans some cost associated with setting up/maintaining Impala).

He works for Treasure Data. This post, while providing some information, is most likely a shill for their NoSQL platform.

If not, I genuinely hope the rest of the NoSQL crowd isn't so incredibly ignorant about what an RDBMS is capable of, and doesn't possess such a strong aversion to what would be a very straightforward normalized schema.

You can say it's a bit of an advertisement, but I wouldn't call it a "shill". If I were doing that, I would have pretended that I magically stumbled on Treasure Data.

Honestly, for data of this scale, I could have used any RDBMS. Scale really would not have been an issue. But I do like the schema flexibility that Treasure Data provides.

Then again, I could have used MongoDB (and this point is clearly indicated in the post).