Hacker News new | ask | show | jobs
by ocdnix 4248 days ago
"JSON documents are stored relationally, not as a blob/jsonb."

How is the transformation designed, to go from a structured document to a flat set of tables, akin to what an object-relational mapper would do?

3 comments

The transformation is performed at the ToroDB layer. Each document is analyzed and several steps are performed:

- Document is received in BSON format (as per the MongoDB wire protocol) and transformed into a KVDocument. KVDocument is an internal implementation, an abstraction of the concrete representation of a JSON document (i.e., hierarchical, nested sets of key-pairs).

- Then, KVDocuments are split by levels (called sub-documents).

- Each subdocument is further split into a subdocument type and the data. The subdocument type is basically an ordered set of the data types of that subdocument.

- Subdocuments are matched 1:1 to tables. If there is an existing table for the given subdocument type, the document is stored directly there. If there isn't, a new table with that type is directly created. This means that there is also a 1:1 mapping between the attribute names (columns) and key names, and makes it very readable from a SQL user perspective.

- There is a table called structure that is basically a representation of the JSON objetct but without the (scalar) data. Think of the JSON object but only the braces and square brackets, plus all the keys (or entries in arrays) that contain objects. There is, per level, a key in this structure that cointains the name of the table where the data for this object is stored. This table uses a jsonb field to store this structure, but note that there's no actual data in this jsonb field.

- There's finally a root table which matches structure with the current document. This is used as structures are frequently re-used for many documents. This is in part one of the biggest factors which contributes to significantly reduce the storage required compared to, for example, MongoDB, as the "common" information of that "types of documents" is stored only once.

This information and more will be shortly added to the project's wiki. However, it's very easy to see if you run ToroDB and look at the created tables :)

Note: I'm one of the authors of ToroDB

Can you briefly explain the advantage of tearing apart the JSON document, rather than just storing it as JSONB and using the various JSONB functions/operators/indexes?
He did explain that - it can yield significant space savings where there is a common "type of document" because the field names do not need to be stored in each document instance, the way they would have to be in JSONB. Also, it means you can query these tables with normal Postgres query tools and they will actually make sense.
This is a pretty interesting approach, and it doesn't seem to be tied to any particular database implementation.

Would it be possible to factor this up into a JSON -> SQL translation function, which could than be used by various backends (effectively consume the JSON and spit out the CREATE TABLE and INSERT statements)?

Indeed.

It's more than a JSON to SQL translation, but it could definitely use various backends. It has some plpgsql Postgres code and some data types to speed operations (saving some round trips to the database), but it won't be hard to port it to other backends :)

Is it possible to quantify the savings of storage and IO?
There are some benchmarks in the following presentation: http://www.slideshare.net/8kdata/toro-db-pgconfeu2014 where ToroDB was presented (PostgreSQL Europe Conference, Madrid, 2014, 2014.pgconf.eu).
I'd be very interested to know more about this too. There is considerable overhead for each row in PostgreSQL (24 byte header, which must be aligned). Obviously you'll save a bit if there are repeated keys higher up, but you'll pay for the foreign key references to those rows in the children.
It is true that PostgreSQL has such a high row overhead. But the storage savings are significant, in any way. Please see the presentation pointed out earlier.
Can this be made to use HStore?
This is my question. How is the data actually broken down into tables/columns? I can't find a schema anywhere in the source, but maybe I just don't know where to look.

Also the code talks an awful lot about DB cursors, which indicates that this is not really taking advantage of either SQL or the relational model at all.

You might find Revenj (https://github.com/ngs-doo/revenj) interesting too. It maps DSL schema to various Postgres objects and offer 1:1 mapping between view in the DB and incoming/outgoing Json.
zrail, it's difficult to find the schema and/or tables in the source code as all of them are 100% dynamic. See my coments above about the internal workings of ToroDB. Or give it a try, run it and check how data is laid out!

Regarding the use of cursors, they are absolutely necessary, as there is -in MongoDB- the concept of a session, and queries may be asked in subsequent packets to return the next results. However, I don't see how this impedes to take advantages of the relational model. ToroDB definitely does that, if you look at the created tables schema.

I would also be very interested in how this kind of a breakdown occurs. It is not uncommon I have to store arbitrary structure, and if there is a better way than a blob, which can be abstracted away by an app, that would be quite useful on its own, without the MongoDB protocol.