Hacker News
by st553 3597 days ago
>Not everything needs a schema.

Any good examples?

3 comments

One example would be ingesting structured or semi-structured data from sources that you don't control.

You may know some invariants, but much can change without notice. So you want to be able to work with the structure you have without preventing non-conforming data from entering your system.

In some cases schema conformance is just delayed; in others it is never fully achieved, or is not even a goal.
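To make the "accept it now, conform later" idea concrete, here is a minimal Python sketch. The invariant set (`REQUIRED_FIELDS`), the `StoredRecord` type and the `ingest` function are all made up for illustration; the point is only that violations are recorded next to the data instead of causing a rejection.

```python
from dataclasses import dataclass, field
from typing import Any

# Hypothetical invariants: the few things we believe every record has.
REQUIRED_FIELDS = {"source", "observed_at"}


@dataclass
class StoredRecord:
    payload: dict[str, Any]                             # the data exactly as received
    problems: list[str] = field(default_factory=list)  # known invariant violations

    @property
    def conforms(self) -> bool:
        return not self.problems


def ingest(raw: dict[str, Any]) -> StoredRecord:
    """Accept every record; note violations instead of rejecting the record."""
    missing = sorted(REQUIRED_FIELDS - raw.keys())
    problems = [f"missing field: {name}" for name in missing]
    return StoredRecord(payload=dict(raw), problems=problems)


good = ingest({"source": "feed-a", "observed_at": "2015-01-01", "gdp": 1.2})
odd = ingest({"source": "feed-b", "unexpected": True})
```

Both records get stored. Only `odd` carries a note that `observed_at` is missing, so it can be repaired later or left alone.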

> one example would be ingesting data from structured or semi-structured sources that you don't control

Can you give a more specific example?

For instance, we need to retrieve statistical data on various macroeconomic indicators from various statistics offices and international organisations. There is considerable overlap in the fields they use, but it's rarely exact, and often you can't merge them because they do not refer to the exact same entity or the data uses incompatible units. It's impossible to properly model all of it before storing it because so much changes all the time and it's all noisy and partly broken.

A similar thing happens when you retrieve data on securities and companies from various exchanges, from the SEC, from national registries all over the world or you try to include XBRL from different countries.

And then you often have documents (like quarterly reports) that contain structured fields and tables but not in a formally specified syntax. You don't know exactly what fields will be in those documents before you parse them. So you parse the documents, store key/value pairs, and then you clean them up gradually.
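A rough sketch of that parse, store, clean-up loop, using Python's built-in `sqlite3` as a stand-in for whatever store you actually use. The table name `extracted_field`, the document ids and the field names are invented for the example, and the parsing step itself is assumed to have already happened.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE extracted_field (doc_id TEXT, key TEXT, value TEXT)")


def store_fields(doc_id, fields):
    """Store whatever key/value pairs the parser found, without a fixed schema."""
    conn.executemany(
        "INSERT INTO extracted_field (doc_id, key, value) VALUES (?, ?, ?)",
        [(doc_id, k, str(v)) for k, v in fields.items()],
    )


# Two hypothetical quarterly reports with overlapping but inconsistent fields.
store_fields("report-q1", {"Revenue": "1,200", "Net income": "300"})
store_fields("report-q2", {"revenue ": "1,350", "Employees": "42"})

# Gradual cleanup: normalise key spelling once you learn how the sources vary.
conn.execute("UPDATE extracted_field SET key = lower(trim(key))")


def keys_seen():
    """Explore which fields exist and how often, using an ordinary query."""
    rows = conn.execute(
        "SELECT key, COUNT(*) FROM extracted_field GROUP BY key ORDER BY key"
    )
    return dict(rows.fetchall())
```

After the cleanup pass, `keys_seen()` shows `revenue` twice and the other keys once each, which is the kind of exploration that tells you what to clean up next.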

There are tons of situations like this in data integration. It's a never-ending cleanup and merge process. You can use an RDBMS for all of that, but it's not always the best tool for the job (though it is still my preferred tool most of the time).

Having worked on that sort of process many times, I'm of the opinion that a message queue is the ideal solution there, not a database. If you're storing the data for the purpose of processing it again later, it should probably be ephemeral and fast, rather than long-lived and flexible.

That doesn't work for us (beyond the first stage), because the fields we extract from the original source are not ephemeral.

We need to store the key/value pairs and explore them in a reasonably productive fashion (i.e., using queries) in order to come up with machine learning algorithms. And any new algorithms we write need access to all historical data.

Metrics. A metric has a known source, a timestamp, a name and a value. It can also ship with any arbitrary number of descriptive fields.

Similarly, events.
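In Python terms, that shape might look like the sketch below: four fixed fields plus an open-ended bag of descriptive ones. The class name `Metric`, the `tags` field and the `to_row` helper are my own naming, not anything standard.

```python
import json
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any


@dataclass(frozen=True)
class Metric:
    # The part with a fixed, known shape.
    source: str
    timestamp: datetime
    name: str
    value: float
    # The open-ended part: any number of descriptive fields.
    tags: dict[str, Any] = field(default_factory=dict)


def to_row(m: Metric) -> tuple:
    """Fixed fields become columns; the descriptive fields travel as one JSON blob."""
    return (
        m.source,
        m.timestamp.isoformat(),
        m.name,
        m.value,
        json.dumps(m.tags, sort_keys=True),
    )


m = Metric(
    source="web-01",
    timestamp=datetime(2016, 1, 1, tzinfo=timezone.utc),
    name="cpu.load",
    value=0.75,
    tags={"core": 3, "region": "eu"},
)
```

An event would look much the same, with the `value` field swapped for whatever the event carries.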

The known source might very well be expressed as a relation between two entities: a metric entity and a source entity.

The source entity, more often than not, is also complemented with other data that needs to remain a part of the persistence layer.

Metadata can be quite variable. Library, catalogue, picture tags. The majority of terms are common, but some can be pretty specific and (as a developer) you'd need to store them. You might not have control over the schema or even have a "finite" set of possibilities.

Imagine you want to store random metadata from a digital camera picture, or perhaps even XML/HTML attributes. You can create another table and add each new attribute – join on query – but if you don't plan to search for that data directly, it's easier to skip normalisation and dump the original set into a JSON(B) or HStore field. You don't have to add every possible attribute to your data model or schema; you can carry the data along and not analyse it if it's not relevant to you.
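A small sketch of the "dump it in one field" approach. It uses Python's `sqlite3` with a plain text column holding JSON as a stand-in for a PostgreSQL JSON(B) or HStore column; the `picture` table, its columns and the sample metadata keys are hypothetical.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE picture ("
    " id INTEGER PRIMARY KEY,"
    " filename TEXT NOT NULL,"
    " raw_metadata TEXT NOT NULL)"  # the whole attribute set, serialised as JSON
)


def save_picture(filename, metadata):
    """Store the modelled column plus the untouched metadata blob."""
    cur = conn.execute(
        "INSERT INTO picture (filename, raw_metadata) VALUES (?, ?)",
        (filename, json.dumps(metadata)),
    )
    return cur.lastrowid


def load_metadata(picture_id):
    """Return the original attribute set, or None if the picture is unknown."""
    row = conn.execute(
        "SELECT raw_metadata FROM picture WHERE id = ?", (picture_id,)
    ).fetchone()
    return json.loads(row[0]) if row else None


pic_id = save_picture(
    "IMG_0001.jpg", {"Make": "ExampleCam", "ISO": 200, "VendorNote": "x"}
)
```

Nothing in the schema knows about `ISO` or `VendorNote`, yet both come back intact from `load_metadata(pic_id)`.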

At a previous company I worked at, there was a table that maxed out PostgreSQL's column limit. It did not need to be that wide. It was much better suited to a ~30-column table with a single hstore column (does the key exist? return the value; otherwise, null), as 99% of the values in those columns were empty, and PostgreSQL does not support sparse tables (the "right" solution here).
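The lookup rule in the parentheses is easy to sketch in Python, with a dict standing in for the hstore column. The column names in `DENSE_COLUMNS` and the `attr_*` keys are invented; in the real table there would be roughly 30 dense columns.

```python
# Hypothetical: the handful of columns that are populated on nearly every row.
DENSE_COLUMNS = ("id", "name", "created_at")


def split_row(wide_row):
    """Split one very wide row into dense columns plus a sparse key/value map."""
    dense = {col: wide_row.get(col) for col in DENSE_COLUMNS}
    sparse = {
        key: value
        for key, value in wide_row.items()
        if key not in DENSE_COLUMNS and value is not None  # drop the empty cells
    }
    return dense, sparse


def lookup(sparse, key):
    """Does the key exist? Return the value; otherwise None (i.e. null)."""
    return sparse.get(key)


wide = {"id": 1, "name": "a", "created_at": "2016-01-01",
        "attr_17": "x", "attr_18": None, "attr_900": None}
dense, sparse = split_row(wide)
```

Only `attr_17` ends up in the sparse map; asking for `attr_900` gives `None`, the same answer the empty column gave before.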