Hacker News new | ask | show | jobs
by knucklesandwich 3574 days ago
Definitely have been bitten with the query statistics issue before. I worked with a colleague once who was adamant that we build our backend on MongoDB, but I was able to convince him to build on Postgres because of it's JSONB support. I don't get why, since schema updates are generally very cheap with databases like Postgres (adding columns without a default or deleting columns is basically just a metadata change), but some developers believe its worth the headache of going schema-less to avoid migrations. In a sense, that suggestion kind of bit me in the ass when we started having some painfully slow report generation queries that should have been using indexes, but were doing table scans because of the lack of table statistics. In a much larger sense, I'm still thankful we never used MongoDB.

Protip: Use the planner config settings[1] (one of which is mentioned in this article) with SET LOCAL in a transaction if you're really sure the query planner is giving you guff. On more structured data that Postgres can calculate statistics on, let it do its magic.

[1]: https://www.postgresql.org/docs/current/static/runtime-confi...

3 comments

You need some migrations anyway or you'll get cruft in the db or worse. Think of old documents with extra fields your sw don't use anymore or without fields that are needed. Multiply by embedded documents and you get ton of problems you can solve only by taking care of data.

This happens even in development before going live for the first time, and way more often as you keep changing sw. Even if you throw away the data every time you still have to update the seeding scripts (in a relational db you have seeding + schema changes).

Anyway, what did you do? Did you keep using JSONB with that planner config setting or did you extract some data to ordinary columns?

Totally agree. At a previous job, some of the senior engineers decided to use MongoDB as the main data store, and doing migrations was among the worst things about it. I think some engineers envision that they'll just be able to do read repair and things will magically work. In practice, you can only really do read repair when you have a workload oriented to reading single records at a time and you have strict controls on concurrent access to prevent weird A-B-A errors with read repair. Complex aggregate queries are almost always impossible with read repair. Even with single records, read repair is still a pain in the ass. You often have to maintain unmarshalling code for several versions of a record formats. In the end, one of the engineers ended up having to write some internal migration tool (which was of course strictly worse than migrations via Postgres, because schema changes did require rewriting a table with update queries, so we ended up needing a bit of downtime). Even with the migration tool, shipping always required a lot of people on call, since migrations would inevitably break during the release process due to frequently brittle migration code.

As for the above story, that engineer was sort of on his way out at the time, so I used the above method to provide query hints as a short term fix. After he left, I was able to restructure the event data schema to make more use of columns. Some of the ancillary attributes that weren't used for row selection stayed as jsonb, but things like timestamp, event name, user id, etc. were moved to columns.

> Protip: Use the planner config settings[1] (one of which is mentioned in this article) with SET LOCAL in a transaction

We've wanted to do this but the last I checked, Citus, the software we use to shard our postgres databases, isn't able to handle setting configs in a query.

Ah, bummer, yeah that's a convenient short term fix. It sounds like you all have this handled pretty well though, I think you'll definitely appreciate the move to a more traditional schema. For events data, JSONB support can be nice for infrequently accessed attributes, so its not an all or nothing proposition, but I found I had a lot less headaches after adding more table structure.
One of the tricks I have found for Postgres to manage analytics and unstructured data is using its inheritance with check constraints aka partitioning features. I mention it because not many people seem to know about this bad ass feature of postgres.

We use inheritance (with check constraints) [1] for both time partitioning as well as for custom (aka unstructured) events. Most events have several (100s in our case) common columns. When you see a new custom event with custom fields you want to separate out you can create a subtable and have it inherit from the base event table.

Now querying for those custom events is extremely fast and you can query from the base event type or the subtable directly.

[1]: https://www.postgresql.org/docs/current/static/ddl-partition...

How do you manage the partitioning? We've thought of partitioning events (by time and/or by type of event) but we haven't found an easy way to migrate to partitioned tables and we aren't exactly sure how we would maintain the partitioning.
It depends on how you partition. If you partition with no extra columns which is the typical case for partitioning by time you can use triggers and insert on the base table. The postgresql doc covers this.

However if you add columns to your subtables (which is usually the case for adding events by type) you will need to insert into the correct table.

You can enforce this by making it so you can never insert into the parent table

    -- FORCE programmatic dispatch to table (v9.2).
    ALTER TABLE event ADD CONSTRAINT event_no_insert CHECK (false) no inherit;
Now you have to insert in the chid tables. This requires business logic on your end.

As far as refactoring goes the typically approach is to create the new tables and insert the data into them and then do a name swap and then add the check constraints.

And what about time partitioning? Do you have a job that creates new tables and merges old ones together?
The nice thing about using postgres is that when you realize "Mongo style" was a bad idea you can write some migrations and take your common fields into proper columns quite easily.