by lkrubner 4143 days ago
> Both examples would have been easier and faster in SQL.

That's easy, so long as we mean the entire project when we say "faster". When I worked at Timeout.com they were importing information about hotels from a large number of sources. For some insane reason, they were storing the data in MySQL. Processing had 2 steps:

1.) the initial import was done with PHP

2.) a later phase normalized all the data to the schema that we wanted, and this was written in Scala

The crazy thing was that, during the first phase, we simply pulled in the data and stored it in the form that the 3rd party was using. That meant that we had a separate schema for every 3rd party that we imported data from. I think we pulled data from 8 sources, so we had 8 different schemas. When a 3rd party changed their schema, we had to change ours. If we added a 9th source of information, then we would have to create a 9th schema in MySQL. We also checked the 3rd party schema at this phase, which struck me as silly, because it did not mean that step 2 could be innocent of the schema. Rather, both step 1 and step 2 had to know the structure of those foreign schemas. But the check was necessary because we were writing to a database that had a schema.

The system struck me as verbose and too complicated.

It's important to note that most of the work involved with step 1 could have been skipped entirely if we had used MongoDB. Simply import documents, and don't care about their schema. Dump all the data we get into MongoDB. Then we can move straight to step 2, which is taking all those foreign schemas and normalizing them to the schema that we wanted to use.

For ETL situations like this, NoSQL document stores offer a huge convenience. Just grab data and dump it somewhere. Simplify the process. Your transformation phase is the only phase that should have to know about schemas. The import phase should be allowed to focus on the details of getting data and saving it.
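
To make the split concrete, here is a rough Python sketch of what I mean. It is not the Timeout.com code: a plain list stands in for the document store, and the source names and field names (`source_a`, `hotel_name`, and so on) are invented for illustration.

```python
import json

# Stand-in for a document collection; a plain list here (assumption, not a real store).
raw_store = []

def import_raw(source, payload_text):
    """Import phase: parse and save the document without looking at its fields."""
    raw_store.append({"source": source, "doc": json.loads(payload_text)})

# Transformation phase: the only place that knows each foreign schema.
# All field names below are hypothetical.
NORMALIZERS = {
    "source_a": lambda d: {"name": d["hotel_name"], "city": d["city"]},
    "source_b": lambda d: {"name": d["title"], "city": d["location"]["city"]},
}

def normalize_all():
    """Map every stored raw document onto the schema we actually want."""
    return [NORMALIZERS[r["source"]](r["doc"]) for r in raw_store]

import_raw("source_a", '{"hotel_name": "Grand", "city": "London"}')
import_raw("source_b", '{"title": "Plaza", "location": {"city": "Paris"}}')
hotels = normalize_all()
```

Adding a 9th source would then mean adding one entry to `NORMALIZERS`; `import_raw` would not change.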

1 comment

You can do that in SQL databases too: store the XML / JSON / whatever in a blob. There is no need to have a normalized import schema, especially since you are doing the transformation using an external application (Scala program).
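
A minimal sketch of that idea, using Python's built-in `sqlite3` as a stand-in for MySQL. The table name `raw_import`, its columns, and the sample payloads are assumptions made up for this example.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")

# One generic table for every source: the payload is stored as-is, unparsed.
conn.execute(
    "CREATE TABLE raw_import ("
    "id INTEGER PRIMARY KEY, source TEXT NOT NULL, payload TEXT NOT NULL)"
)

def store_raw(source, payload_text):
    """Import phase: no per-source schema, just the source name and the raw text."""
    conn.execute(
        "INSERT INTO raw_import (source, payload) VALUES (?, ?)",
        (source, payload_text),
    )

# JSON and XML can sit side by side because the database never inspects them.
store_raw("source_a", '{"hotel_name": "Grand", "city": "London"}')
store_raw("source_b", "<hotel><title>Plaza</title></hotel>")

# The external transformation program reads the rows back and parses each format itself.
rows = conn.execute("SELECT source, payload FROM raw_import ORDER BY id").fetchall()
first = json.loads(rows[0][1])
```

The trade-off is that the database cannot validate or index the payload contents, which is fine when the external transformation step does all the interpretation anyway.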