| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by drob 3578 days ago

Author here. Curious what experiences y'all have had with JSONB.

We're in the process of switching to a more balanced schema (mentioned in this post) and the results have been pretty good so far.

Another win has been that the better stats make it possible to reliably get bitmap joins from the planner. Our configuration uses ~12 RAIDed ebs drives, so the i/o concurrency is really high and prefetching for a bitmap scan works particularly well.

4 comments

iEchoic 3578 days ago

I've found a few good applications for jsonb so far:

1) Using jsonb to store user-generated forms and submissions to those forms. As an example, you can create a form with text inputs, checkboxes, etc., and others can submit responses with that form. I find that these forms and their submissions are best stored as jsonb because their contents are largely opaque (I don't care about the contents except where they are rendered on the client), their structure is highly dynamic, and their schema changes frequently.

2) As a specialized case on #2, applying filters to user-generated form submissions. jsonb supports subset operators (@> and <@, if i remember correctly), which makes easy work of dynamically filtering form submissions on custom form fields even for complex filter conditions.

3) Storing/munging/slicing relatively low-volume log data is fantastic with jsonb. This is always for admin/diagnostic reasons, so it's not as performance-critical, and the ability to group on and do subset operations on jsonb fields makes slicing your data really easy.

mcdee 3578 days ago

I'm using JSONB and the downsides on performance are not noticeable for most cases. For those where there are real problems then crafting a custom index usually fixes the issue.

Using ->> (or ->) in a WHERE statement is generally a bad idea, and certainly a terrible idea without an explicit index. Use @> instead.

malisper 3578 days ago

Using @> instead of ->> only causes the selectivity estimate of the predicate to be a different hard coded estimate. It doesn't fix the underlying problem of Postgres not keeping statistics on JSONB.

mcdee 3578 days ago

True it doesn't solve the problem of not having statistics on the values, but it does bring the query response time down to the same order of magnitude as the non-JSON table.

malisper 3578 days ago

> but it does bring the query response time down to the same order of magnitude as the non-JSON table.

In the specific example given it might, but you will still wind up with a handful of queries that are planned wrong and are orders of magnitude slower.

mcdee 3578 days ago

There are two separate issues. The lack of statistics is one thing, but the use of ->> instead of @> is another. Look at https://explain.depesz.com/s/zJiT Vs https://explain.depesz.com/s/ihwk for the difference.

malisper 3578 days ago

Your queries are executing different plans. The first one is executing a nested loop join which filters out 1,246,035,384 intermediate rows. The second one is executing a index join which doesn't filter out any intermediate rows at all. This seems like it was caused either by the scientist_labs_pkey index not being there in the first trial or just random luck due to a difference in statistics.

ngrilly 3577 days ago

I have an unrelated question :-)

I read a presentation titled "Powering Heap" by Dan Robinson, Lead Engineer at Heap, which contains interesting info about how you use PostgreSQL. [1]

At Heap, do you try to keep rows belonging to the same customer_id contiguous on disk, in order to minimize disk seeks?

If yes, how do you it? Do you use something like pg_repack?

If no, don't you suffer from reading heap pages that contain only one or a few rows belonging to the requested customer_id?

[1] http://info.citusdata.com/rs/235-CNE-301/images/Powering_Hea...

malisper 3577 days ago

Most of our queries depend more on the time of the events rather than the user the events belong to. For example, let's say you want to know how many users signed up on Monday and logged in again before Friday. That query would fetch all sign up events and all log in events over the rest of the week, do a group by user_id, and use a custom udf to perform the aggregation. We never actually fetch multiple events from a user at a time. Instead we look for specific types of events in a given time range and group by the user. Clustering by time winds up being a much bigger win (benchmarks showed 10x compared to sorted by user for some uncached queries) here as almost all of our queries are constrained to a given time period.

Currently, maintaining the clustering has only been best-effort. We sort our data whenever we copy it from one location to another and the data comes in sorted by time, so it's fairly easy to maintain a high row correlation with time.

ngrilly 3577 days ago

My question was probably not clear enough... I'm asking about clustering by customers/tenants (i.e. Heap customers), not by users (i.e. the users of Heap customers).

drob 3577 days ago

The data is sharded by customer and then sub-sharded by end user within the customer. For all but the tiny customers, 100% of the data on a logical shard will belong to the same customer. That means our subqueries will never touch data from more than one customer unless the customer is very small. (And, if the customer is that small, it should be easy to make the query fast anyway.)

ngrilly 3577 days ago

Your answer is very useful. Thanks Dan!

May I ask how many logical shards do you have per physical shard/machine? And what is the average size of a logical shard on disk?

You wrote "the data is sharded by customer and then sub-sharded by end user within the customer", but malisper wrote above that "clustering by time winds up being a much bigger win". Isn't it contradictory?

malisper 3577 days ago

There's two parts to it. We first split up data by different customers. At some point customers get big enough that having a single table is slow. Once a customer reaches a certain size we split up the end users into ranges and have separate tables for each customer, range pair. W typically limit each table to 800k events which is about 800MB of data. Then when we query, we use citus which automatically sends the proper queries to the specific tables necessary and then aggregates the results. Each individual table in our cluster is sorted by time.

willlll 3578 days ago

Was the 30% disk saving over petabyte+ data set on a single-node Postgres or on your Citus cluster?

drob 3578 days ago

This isn't live yet, but we expect it to be across our citus cluster. The ~30% figure comes from the profiling we did on individual postgres nodes.