Hacker News new | ask | show | jobs
by physcab 4886 days ago
This is pretty cool but I'm struggling to see what the use cases are, atleast for analysis. There might be quite a bit of benefits for running application code that I'm not aware of. With regards to analysis though, their own example question is "what happened last night?" but then they go on to say that it is a near real-time data store. Does it matter that it is a real-time mirror then?

I've always liked the paradigm of doing analysis on "slower" data stores, such as Hadoop+Hive or Vertica if you have the money. Decoupling analysis tools from application tools is both convenient and necessary as your organization and data scales.

3 comments

(I wrote MoSQL)

PostgreSQL scales surprisingly well for this purpose, and is much nicer for interactive queries than Hadoop/Hive. We use Impala[1] for some larger datasets, but Impala is comparatively new, and it's nice to have something as battle-tested as postgres here.

As for the "why do we need realtime?": In my mind the benefit of a near-realtime replica is not that you actually often need it, but that it means you never have to ask the question of "Was this snapshot refreshed recently enough?", and never end up having to wait several hours for an enormous dump/load operation, when you realize you did need newer data.

[1] http://blog.cloudera.com/blog/2012/10/cloudera-impala-real-t...

Hey! Always cool when the author responds :)

I do agree that PostgreSQL would be nicer for interactive queries. Waiting for a M/R to spin off is a bit of a buzzkill.

With regards to your usecases, what sort of questions have you found yourself answering the most? Do you have analytics applications running off of this?

I agree with your points that PostgreSQL (or RDMS in general) is really good for certain type of reporting / analytics use cases while hadoop/hive is awesome for handling billions or rows + TBs of data.

How was your overall experience with impala ? Did you guys have a fairly new hive cluster to try it out or did you just spin up a new one since impala can only read certain file formats (i.e. no custom SerDe).

Also, for hive/hadoop datasets, is that more for just data exploration, while this PostgreSQL solution is for smaller datasets which return in a few seconds and would not perform well in hive due to the cost of setting up a mapreduce job ?

Nice work. I spent much time last year building a system that imported MongoDB data into Oracle, having to do everything to speed up bulk loading new data into Oracle (and archiving old data). Something like this would have worked much better, and I suspect it might not be that hard to make this tool work with Oracle.
Out of curiosity, did you guys consider running SQL on MongoDB using Postgres foreign tables? What were the pros and cons of that approach for your use-cases?

(In full disclosure, I wrote mongo_fdw for PostgreSQL.)

I actually prototyped our PostgreSQL solution using mongo_fdw (Incidentally, I throw together Debian packaging here, if you're interested: https://github.com/nelhage/mongo_fdw).

Our experience was that mongo_fdw doesn't (yet?) give postgres enough information and knobs to plan JOINs efficiently, which is one of the things we wanted. I got a decent amount of leverage out of using mongo_fdw and then cloning to native tables using SELECT INTO, though :)

That's neat. :) For mongo_fdw, we're expecting better join performance once we upgrade to PostgreSQL 9.2 FDW APIs and start collecting table statistics.

I'm sure there are other factors involved for MoSQL, but they are probably outside the scope of this post. I'd love chat about them offline.

That's your preference, but I've not often found an occasion were someone (a client/stakeholder) said, "Yeah, go ahead and give that to me slower rather than faster." It just never seems to happen.

I'm thinking of this as something like polyglot memoization. Pretty cool when you think about it. Frequently need something that is slow in NoSQL, but fast in SQL? Memoize it to your SQL datastore. The alternative has always been to write it to two places. I kind of dig moving this out to the datastore to figure out.

I'm thinking that plenty of people will find this useful.

That's why I'm curious what sort of questions they are answering with this tool. If the bulk of their questions are a variant of SELECT COUNT(DISTINCT user_id) FROM table, then yes, this would be convenient to have. But if their questions start to revolve around transaction cohorting or path analysis where there are potentially hundreds of millions to billions of transaction_ids with some gnarly JOINs thrown in for good measure, I would be surprised to see this scale.
I totally agree with this. I don't understand why you would want to translate MongoDB to MySQL for? What are the advantages to doing a full data dump? There's an overhead converting from MySQL to MongoDB as the two formats are completely different. What about the edge cases when one format doesn't support the other.
They each have different advantages and strengths.

MongoDB is great for failover and for rapid development or prototyping. SQL is great for reporting or analytics, since you can do all kinds of aggregates and JOINs right in the database.

The edge cases where you can't represent the data perfectly aren't a huge deal for this use case -- because it's a one-way export, you don't have to be able to round-trip the data, and as long as you can export the data you want to run analysis on, it doesn't matter if there's some you can't get.