Hacker News new | ask | show | jobs
by leothekim 3575 days ago
"For datasets with many optional values, it is often impractical or impossible to include each one as a table column."

Honest question - what settings would many optional values be impractical or impossible? Is it purely space/performance constraints? If so, it doesn't sound like JSONB gives you wins in either of those cases.

7 comments

Let me give an example: We process some large forms (say 100+ fields). We care a lot about maybe 15 fields off these forms, but we use the others sometimes. These forms change every so often, with some fields disappearing and new fields joining in. The fields we care a lot about don't change, but the ones we care a little about do. A few possibilities here:

1. You could have a table for every variant of the form

2. You could have a table with every field that's ever been used as a nullable field and add new columns every time a new version of the form appears

3. You could have a table with columns for the important fields and a JSONB column for the variable data

Of the three options, 3 seems the most elegant to me. The other two are basically allowing the ugliness of the source data to spew complexity all over your database.

I use JSONB columns for similar use-cases, but to play devil's advocate, you can accomplish that a fourth way, which is almost certainly better than 1 or 2. A table for fields, one row per field. A table for forms, which has a many-to-many relation to fields. Entries in a link table compose a form of arbitrary fields. Answers can be stored in a separate responses table, indexed by form_id and column_id. I don't know enough about database implementation to speculate on how that would perform at scale, but conceptually that's how I think of the problem.
This is roughly EAV. EAV is where you have a with a schema:

    entity_id | attribute_id | value
EAV is typically considered to be an anti-pattern for several reasons: it becomes very expensive to rematerialize a entity with all of its attributes/values, it becomes difficult to have any kind of constraints on the values (this is also a problem with jsonb), and it's hard for the database to maintain good statistics on the values of different attributes as most databases don't keep multi-column statistics. Don't worry, I've had similar ideas before.
EAV is an anti-pattern?

If you don't have JSONB, EAV is the only remotely-reasonable way to implement user-defined fields (e.g. product-specific attributes in e-commerce).

If you want to allow user-defined fields in a relational database, your realistic are either EAV or stuff json into a text column. EAV, if done extremely carefully, can be a good solution, but 99% of the time, it's going to be a huge pain.
> stuff json into a text column

You did see the article was about JSONB, which is significantly more sophisticated than "json in a text column", yes?

Sounds kind of like the EAV relational data model: https://en.wikipedia.org/wiki/Entity%E2%80%93attribute%E2%80....
I would expect that to perform poorly at scale.

There is also a fifth option, though it is not very space efficient. That is to store every field in the JSONB column. Personally, that is what I would do.

This approach is known to be used successfully by many high scale companies. It ensures the highest degree of flexibility and still allows for full indexing of the fields. The schema would need to be enforced by your application, but that should happen anyway.

> 2. You could have a table with every field that's ever been used as a nullable field and add new columns every time a new version of the form appears

Depending on your RDBMS this is not actually so bad. SQL Server has sparse column support which helps to make this sort of schema practical. It comes with some strings attached, however [1].

[1] https://msdn.microsoft.com/en-AU/library/cc280604.aspx

When you say 1. You could have a table for every variant of the form do you actually mean one table for each version of the form, or more like one table for the 15 fields you care about+1 field for the form version # (sort of a base class) and then one extra table to capture extra fields when needed?

The latter is a sort of "inheritance" done in SQL and I had used it with good results in the past.

For me a great example is additional/diagnostic information related to other systems...

In one such example would be a transaction table, where the transaction info I care about are dedicated columns, but additional details from say paypal, or another payment processor would be stored as JSON, as there is little value to parsing the detailed information into a unified format, that will always have "extra" information for that processor, and without the need of dedicated tables for each payment processor.

Another example would be logging, where some information is consistent, but other additional information is mainly just stored. I also like jsonl (line terminated json, one record per line) in a gzip stream works very well... as an aside from a db server.

There are other considerations as well, in some cases a document store such as mongo or rethinkdb may be closer to what you ideally want, for that matter you may need something closer to Cassandra.

Space constraints can be a big problem with column keys, at sufficient scale.

Traditional SQL database have column names that are stored separately from the table data, but in JSON, the keys are in the data.

I've worked with MongoDB systems that held 100 terabytes of data. At that scale, we had to re-write all of the keys so they were only a single character. When JSON is small, it is pleasant to have keys that make sense, such as:

username : mary

but as JSON gets big, this needs to change to:

u : mary

If you have 10 million records, the difference (in this trivial example) is 70 million characters, and if you have 100 different collections with an average of 20 keys each, then you are talking about 140 billion characters saved.

Most of the companies that I know, if they work with JSON document stores at scale (such as MongoDB) eventually take steps to make the keys smaller.

I was also wondering about this one.

This may make sense if you have custom fields, i.e. the set of keys is user-defined and ever increasing.

Apart from that, there is no reason to do this. If it is all about avoiding having "too many" columns and/or "too may" null values, then I'd say: Don't worry. Just use as many columns as you need.

I would even go one step further and say: It is a common anti-pattern to introduce generic key-value stores (through a separate table, JSON structures or XML structures) without a compelling reason.

I'll add another reason: subtle application issues creep in with value states.

With JSONB I can receive a boolean element with five states: null, true, false, invalid (i.e. actually a string, number or array value), or simply nonexistent, and it's up to the application to deal with all such cases. You may have to treat the database as a source of potentially invalid data that must be sanitized.

With regular DB columns you can reasonably assume that a boolean not null column will a) exist for all records and b) return either true or false. You need only scan the database schema at application start to verify this.

On the other hand, if you're using JSONB for an options structure then the possibility of key nonexistence may actually be useful, since it implies "use system default". (Yes this can differ from the meaning of a NULL value, especially if you are merging options structures).

You might question whether it is wise to have a "boolean" key/value pair with four semantically distinct states. I can only say I have done this and I am not proud of it.

To be fair, you can write CHECK conditions which are arbitrary expressions involving JSONB values. Though at that point you might as well just use a normal column.
By the way, another related anti-pattern is the introduction of an additional table that has a 1:1 relationship to the original table. Because the original table was "full" of columns, or something, so a second table had to be opened to contain the remaining columns.

This is an anti-pattern because you can't model "1:1" that way. Instead, it will be "1:0..1", and now you have some nasty corner cases when the first table has an entry whose counterpart in the second table is missing. Also, when using a column now you always have to think about which table had it - the first or the second table?

I had to work with such a design in a real-world project and it was really annoying.

The way I have seen for the issue of which table is to introduce a view and setup triggers on the view to run a stored proc when you insert or update it to handle all the tables....
Ick. I use 1:0..1 tables commonly. But not 1:1. Ick.
Just a note that NULL columns take up very little space in Postgres. IIRC there is a NULL bitmap for every row. So yes, don't worry and just use columns, if the columns are actually part of your schema.
This is correct. Postgres keeps a bitmap in each row that designates which of the rows are null. Postgres initially allocates one byte for the bitmap so it will work for up to eight columns. Once your table has more than 8 columns, postgres will allocate eights bytes for an additional 64 columns. That means if you have more than 8 columns, you can keep making "optional" fields proper columns until you have a total of 72 columns because null values are completely free.
At Heap, we allow users to send custom event properties through our API. Since we don't know what properties users will send in advance, we need to use something like JSONB to store them.
Maybe I'm missing something, but I think of optional columns as nullable (but declared) values. It sounds like you use JSONB to store arbitrarily declared values. If so, then I'm still confused then by how you're able to hoist values from JSONB data to save on perf and space. That implies these values weren't that arbitrary to begin with.
My bad. As of right now, we use jsonb to store all of the properties of events besides the user_id, the event_id, and the time of the event. We have lots of builtin properties which are stored in some events but not others (for example, the text of a button that someone clicks on). We thought storing these properties in jsonb was a good idea because at the time we didn't know any of the downsides of jsonb. The issues brought up in Dan's post are things we wish we knew before deciding to use jsonb.

We are currently planning on moving all of the builtin fields into true postgres columns instead of keeping them in jsonb. All of the custom properties will remain in jsonb.

Got it. I got hung up on the meaning of "optional values", JSONB makes sense for storing arbitrary metrics. Thank you for clarifying!
The user-provided data is arbitrary but not random. Certain keys like "name", "email", "browser", etc will show up very often, and might even be included automatically by the client-side libraries and therefore be present in almost all rows. There may be other values that only appear very rarely - say, "my_super_special_id", and you want to be able to support those as well.
That's perfectly reasonable. But then you also let them query on those arbitrary custom properties and that's where the performance issues are? If so, that's a fairly hard problem to solve.

Taking the well-defined subset of searchable properties and making them columns, as described in the article, is the really the best solution.

As of right now, our schema is literally:

    user_id | event_id | time | data
where data is a JSONB blob that contains every other piece of information about an event. Currently, we get a row estimate of one for pretty much every query. We've been able to work around the lack of statistics by using a very specific indexing strategy (discussed about in a talk Dan gave[0]) that gives the planner very few options in terms of planning and additionally by turning off nested loop joins.

We are planning on pulling out the most common properties that we store in the data column, which will give us proper statistics on all of those fields. I am currently experimenting with what new indexing strategies we will be able to use thanks to better statistics.

[0] https://www.youtube.com/watch?v=NVl9_6J1G60

actually we have a table which would contain over 100 fields. while we only need to query 10.

actually we store only 6 things inside the table and the rest inside the jsonb.

however we sill miss like 8 values which we are using inside a list, which are slow, but materialized view to the rescue. however we may pull them out at some point, still need to figuring out since the jsonb data set is also the value of a hash. that checks the validity of the data inside their which we use for change detection against other stuff

You can index the contents of lists in Postgres using GIN indexes.
I would think of any system where users can store additional 'custom fields.' In these cases, it is impossible to include columns for these as they aren't known yet.