Hacker News new | ask | show | jobs
by ahachete 4248 days ago
The transformation is performed at the ToroDB layer. Each document is analyzed and several steps are performed:

- Document is received in BSON format (as per the MongoDB wire protocol) and transformed into a KVDocument. KVDocument is an internal implementation, an abstraction of the concrete representation of a JSON document (i.e., hierarchical, nested sets of key-pairs).

- Then, KVDocuments are split by levels (called sub-documents).

- Each subdocument is further split into a subdocument type and the data. The subdocument type is basically an ordered set of the data types of that subdocument.

- Subdocuments are matched 1:1 to tables. If there is an existing table for the given subdocument type, the document is stored directly there. If there isn't, a new table with that type is directly created. This means that there is also a 1:1 mapping between the attribute names (columns) and key names, and makes it very readable from a SQL user perspective.

- There is a table called structure that is basically a representation of the JSON objetct but without the (scalar) data. Think of the JSON object but only the braces and square brackets, plus all the keys (or entries in arrays) that contain objects. There is, per level, a key in this structure that cointains the name of the table where the data for this object is stored. This table uses a jsonb field to store this structure, but note that there's no actual data in this jsonb field.

- There's finally a root table which matches structure with the current document. This is used as structures are frequently re-used for many documents. This is in part one of the biggest factors which contributes to significantly reduce the storage required compared to, for example, MongoDB, as the "common" information of that "types of documents" is stored only once.

This information and more will be shortly added to the project's wiki. However, it's very easy to see if you run ToroDB and look at the created tables :)

Note: I'm one of the authors of ToroDB

4 comments

Can you briefly explain the advantage of tearing apart the JSON document, rather than just storing it as JSONB and using the various JSONB functions/operators/indexes?
He did explain that - it can yield significant space savings where there is a common "type of document" because the field names do not need to be stored in each document instance, the way they would have to be in JSONB. Also, it means you can query these tables with normal Postgres query tools and they will actually make sense.
This is a pretty interesting approach, and it doesn't seem to be tied to any particular database implementation.

Would it be possible to factor this up into a JSON -> SQL translation function, which could than be used by various backends (effectively consume the JSON and spit out the CREATE TABLE and INSERT statements)?

Indeed.

It's more than a JSON to SQL translation, but it could definitely use various backends. It has some plpgsql Postgres code and some data types to speed operations (saving some round trips to the database), but it won't be hard to port it to other backends :)

Is it possible to quantify the savings of storage and IO?
There are some benchmarks in the following presentation: http://www.slideshare.net/8kdata/toro-db-pgconfeu2014 where ToroDB was presented (PostgreSQL Europe Conference, Madrid, 2014, 2014.pgconf.eu).
I'd be very interested to know more about this too. There is considerable overhead for each row in PostgreSQL (24 byte header, which must be aligned). Obviously you'll save a bit if there are repeated keys higher up, but you'll pay for the foreign key references to those rows in the children.
It is true that PostgreSQL has such a high row overhead. But the storage savings are significant, in any way. Please see the presentation pointed out earlier.
Can this be made to use HStore?