Hacker News new | ask | show | jobs
by chc 3575 days ago
Let me give an example: We process some large forms (say 100+ fields). We care a lot about maybe 15 fields off these forms, but we use the others sometimes. These forms change every so often, with some fields disappearing and new fields joining in. The fields we care a lot about don't change, but the ones we care a little about do. A few possibilities here:

1. You could have a table for every variant of the form

2. You could have a table with every field that's ever been used as a nullable field and add new columns every time a new version of the form appears

3. You could have a table with columns for the important fields and a JSONB column for the variable data

Of the three options, 3 seems the most elegant to me. The other two are basically allowing the ugliness of the source data to spew complexity all over your database.

3 comments

I use JSONB columns for similar use-cases, but to play devil's advocate, you can accomplish that a fourth way, which is almost certainly better than 1 or 2. A table for fields, one row per field. A table for forms, which has a many-to-many relation to fields. Entries in a link table compose a form of arbitrary fields. Answers can be stored in a separate responses table, indexed by form_id and column_id. I don't know enough about database implementation to speculate on how that would perform at scale, but conceptually that's how I think of the problem.
This is roughly EAV. EAV is where you have a with a schema:

    entity_id | attribute_id | value
EAV is typically considered to be an anti-pattern for several reasons: it becomes very expensive to rematerialize a entity with all of its attributes/values, it becomes difficult to have any kind of constraints on the values (this is also a problem with jsonb), and it's hard for the database to maintain good statistics on the values of different attributes as most databases don't keep multi-column statistics. Don't worry, I've had similar ideas before.
EAV is an anti-pattern?

If you don't have JSONB, EAV is the only remotely-reasonable way to implement user-defined fields (e.g. product-specific attributes in e-commerce).

If you want to allow user-defined fields in a relational database, your realistic are either EAV or stuff json into a text column. EAV, if done extremely carefully, can be a good solution, but 99% of the time, it's going to be a huge pain.
> stuff json into a text column

You did see the article was about JSONB, which is significantly more sophisticated than "json in a text column", yes?

Yes, the person I was replying to was asking about what to do if you don't have jsonb.
Sounds kind of like the EAV relational data model: https://en.wikipedia.org/wiki/Entity%E2%80%93attribute%E2%80....
I would expect that to perform poorly at scale.

There is also a fifth option, though it is not very space efficient. That is to store every field in the JSONB column. Personally, that is what I would do.

This approach is known to be used successfully by many high scale companies. It ensures the highest degree of flexibility and still allows for full indexing of the fields. The schema would need to be enforced by your application, but that should happen anyway.

> 2. You could have a table with every field that's ever been used as a nullable field and add new columns every time a new version of the form appears

Depending on your RDBMS this is not actually so bad. SQL Server has sparse column support which helps to make this sort of schema practical. It comes with some strings attached, however [1].

[1] https://msdn.microsoft.com/en-AU/library/cc280604.aspx

When you say 1. You could have a table for every variant of the form do you actually mean one table for each version of the form, or more like one table for the 15 fields you care about+1 field for the form version # (sort of a base class) and then one extra table to capture extra fields when needed?

The latter is a sort of "inheritance" done in SQL and I had used it with good results in the past.