by fauigerzigerk 3597 days ago

For instance, we need to retrieve statistical data on various macroeconomic indicators from a range of statistics offices and international organisations. There is considerable overlap in the fields they use, but it's rarely exact, and often you can't merge them because they do not refer to the exact same entity or the data uses incompatible units. It's impossible to properly model all of it before storing it because so much changes all the time and it's all noisy and partly broken.

A similar thing happens when you retrieve data on securities and companies from various exchanges, from the SEC, from national registries all over the world or you try to include XBRL from different countries.

And then you often have documents (like quarterly reports) that contain structured fields and tables but not in a formally specified syntax. You don't know exactly what fields will be in those documents before you parse them. So you parse the documents, store key/value pairs, and then you clean them up gradually.
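A minimal sketch of that parse, store, then clean-up-gradually loop, assuming the documents contain simple "key: value" lines and using SQLite purely for illustration. The report text, the `raw_fields` table and the function names are all hypothetical, not anything from our actual pipeline:

```python
import re
import sqlite3

# Hypothetical report text; the field names are invented for illustration.
REPORT = """\
Revenue: 1,250 USD m
Net income: 310 USD m
Employees: 4200
"""

LINE_RE = re.compile(r"^\s*([^:]+?)\s*:\s*(.+?)\s*$")


def extract_pairs(text):
    """Yield (key, raw_value) for every 'key: value' line; skip anything else."""
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m:
            yield m.group(1), m.group(2)


def store_pairs(conn, doc_id, pairs):
    """Store raw pairs untouched; clean_value stays NULL until a cleanup pass."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS raw_fields ("
        "doc_id TEXT, key TEXT, raw_value TEXT, clean_value REAL)"
    )
    conn.executemany(
        "INSERT INTO raw_fields (doc_id, key, raw_value) VALUES (?, ?, ?)",
        [(doc_id, k, v) for k, v in pairs],
    )


def clean_numeric(conn, key):
    """A later cleanup pass: fill clean_value for one key once its format is understood."""
    rows = conn.execute(
        "SELECT rowid, raw_value FROM raw_fields "
        "WHERE key = ? AND clean_value IS NULL",
        (key,),
    ).fetchall()
    for rowid, raw in rows:
        m = re.match(r"\d[\d,]*(?:\.\d+)?", raw)
        if m:
            number = float(m.group(0).replace(",", ""))
            conn.execute(
                "UPDATE raw_fields SET clean_value = ? WHERE rowid = ?",
                (number, rowid),
            )


conn = sqlite3.connect(":memory:")
store_pairs(conn, "q1-report", extract_pairs(REPORT))
clean_numeric(conn, "Revenue")  # other keys stay raw until someone looks at them
```

The point of keeping `raw_value` next to `clean_value` is that a cleanup rule can be wrong and rerun later without re-parsing the source documents.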

There are tons of situations like this in data integration. It's a never-ending cleanup and merge process. You can use an RDBMS for all of that, but it's not always the best tool for the job (though it is still my preferred tool most of the time).

1 comments

kedean 3596 days ago

Having worked on that sort of process many times, I'm of the opinion that a message queue is the ideal solution there, not a database. If you're storing the data for the purpose of processing it again later, it should probably be ephemeral and fast, rather than long-lived and flexible.


fauigerzigerk 3596 days ago

That doesn't work for us (beyond the first stage), because the fields we extract from the original source are not ephemeral.

We need to store the key/value pairs and explore them in a reasonably productive fashion (i.e. using queries) in order to come up with machine learning algorithms. And any new algorithms we write need access to all historical data.
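To give a rough idea of what exploring key/value pairs with queries can look like, here is a small sketch, again assuming SQLite and a made-up `fields` table with invented sample rows. One query shows how many documents carry each key; a helper then pivots a chosen set of keys into one row per document, which is roughly the shape a learning algorithm would want:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fields (doc_id TEXT, key TEXT, value TEXT)")
# Invented sample data: three documents with partly overlapping keys.
conn.executemany(
    "INSERT INTO fields VALUES (?, ?, ?)",
    [
        ("a", "revenue", "100"),
        ("a", "employees", "10"),
        ("b", "revenue", "250"),
        ("b", "turnover", "250"),
        ("c", "revenue", "75"),
    ],
)


def key_coverage(conn):
    """Return (key, number of documents containing it), most common first."""
    return conn.execute(
        "SELECT key, COUNT(DISTINCT doc_id) AS docs "
        "FROM fields GROUP BY key ORDER BY docs DESC, key"
    ).fetchall()


def feature_rows(conn, keys):
    """Pivot the chosen keys into one dict per document; missing keys are None."""
    placeholders = ",".join("?" * len(keys))
    rows = {}
    query = (
        "SELECT doc_id, key, value FROM fields "
        f"WHERE key IN ({placeholders}) ORDER BY doc_id"
    )
    for doc_id, key, value in conn.execute(query, keys):
        rows.setdefault(doc_id, dict.fromkeys(keys))[key] = value
    return rows


coverage = key_coverage(conn)
features = feature_rows(conn, ["revenue", "employees"])
```

Because everything historical stays in the table, a new algorithm can be run over all of it by changing only the list of keys passed to `feature_rows`.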
