Hacker News new | ask | show | jobs
by nelhage 4881 days ago
(I wrote MoSQL)

PostgreSQL scales surprisingly well for this purpose, and is much nicer for interactive queries than Hadoop/Hive. We use Impala[1] for some larger datasets, but Impala is comparatively new, and it's nice to have something as battle-tested as postgres here.

As for the "why do we need realtime?": In my mind the benefit of a near-realtime replica is not that you actually often need it, but that it means you never have to ask the question of "Was this snapshot refreshed recently enough?", and never end up having to wait several hours for an enormous dump/load operation, when you realize you did need newer data.

[1] http://blog.cloudera.com/blog/2012/10/cloudera-impala-real-t...

4 comments

Hey! Always cool when the author responds :)

I do agree that PostgreSQL would be nicer for interactive queries. Waiting for a M/R to spin off is a bit of a buzzkill.

With regards to your usecases, what sort of questions have you found yourself answering the most? Do you have analytics applications running off of this?

I agree with your points that PostgreSQL (or RDMS in general) is really good for certain type of reporting / analytics use cases while hadoop/hive is awesome for handling billions or rows + TBs of data.

How was your overall experience with impala ? Did you guys have a fairly new hive cluster to try it out or did you just spin up a new one since impala can only read certain file formats (i.e. no custom SerDe).

Also, for hive/hadoop datasets, is that more for just data exploration, while this PostgreSQL solution is for smaller datasets which return in a few seconds and would not perform well in hive due to the cost of setting up a mapreduce job ?

Nice work. I spent much time last year building a system that imported MongoDB data into Oracle, having to do everything to speed up bulk loading new data into Oracle (and archiving old data). Something like this would have worked much better, and I suspect it might not be that hard to make this tool work with Oracle.
Out of curiosity, did you guys consider running SQL on MongoDB using Postgres foreign tables? What were the pros and cons of that approach for your use-cases?

(In full disclosure, I wrote mongo_fdw for PostgreSQL.)

I actually prototyped our PostgreSQL solution using mongo_fdw (Incidentally, I throw together Debian packaging here, if you're interested: https://github.com/nelhage/mongo_fdw).

Our experience was that mongo_fdw doesn't (yet?) give postgres enough information and knobs to plan JOINs efficiently, which is one of the things we wanted. I got a decent amount of leverage out of using mongo_fdw and then cloning to native tables using SELECT INTO, though :)

That's neat. :) For mongo_fdw, we're expecting better join performance once we upgrade to PostgreSQL 9.2 FDW APIs and start collecting table statistics.

I'm sure there are other factors involved for MoSQL, but they are probably outside the scope of this post. I'd love chat about them offline.