
by atombender 2872 days ago

Is TimescaleDB a good fit for events?

For example, we have data where each record is a tuple of (time, event, object), where the event is things like "viewed", "performedSearch", etc., and the object is event data as JSON. Let's say the object is a movie, in which case the payload might be something like:

    {"id": 123,
     "name": "The Godfather",
     "director_id": 456,
     "genre": "crime"}

Our reporting UI lets users do aggregations based on arbitrary dimensions, so we might do the equivalent of:

    select object->>'genre' as genre,
      object->>'director_id' as director_id,
      extract(month from time) as month,
      count(*) as count
    from events
    group by month, genre, director_id

Then we do things like pivot/nest values to display the groupings.

It's unclear to me whether TimescaleDB fits this use case, or whether we'd have to change how things are indexed. Right now we just index the whole object itself. Another scheme I've considered is indexing it with the values as keys; for example, the above event would be {"name:The Godfather": 1, "director_id:456": 1, "genre:crime": 1}, so each key essentially represents a count. A counting aggregation would then be rewritten as a sum(). But it's unclear to me how you do intersections here without also creating all the permutations (i.e. something like {"director_id:456/genre:crime": 1}) beforehand.
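
To make the intersection problem concrete, here is a minimal Python sketch of that scheme. It is an illustration only, not anything Elasticsearch or TimescaleDB does; the helper names `tokens` and `combined_tokens` and the sample events are made up for the example.

```python
from collections import Counter
from itertools import combinations

def tokens(obj):
    """Flatten an event payload into 'key:value' tokens, each worth 1."""
    return {f"{k}:{v}": 1 for k, v in obj.items()}

def combined_tokens(obj, max_dims=2):
    """Also emit one token per combination of up to max_dims dimensions."""
    items = sorted(obj.items())
    out = {}
    for n in range(1, max_dims + 1):
        for combo in combinations(items, n):
            out["/".join(f"{k}:{v}" for k, v in combo)] = 1
    return out

# Hypothetical sample events (only two dimensions each, for brevity).
events = [
    {"director_id": 456, "genre": "crime"},
    {"director_id": 456, "genre": "drama"},
    {"director_id": 789, "genre": "crime"},
]

# Summing single-dimension tokens gives per-value counts...
single = Counter()
for e in events:
    single.update(tokens(e))

# ...but "director 456 AND genre crime" cannot be derived from
# single["director_id:456"] and single["genre:crime"] alone, so the
# combined tokens have to be materialized per event.
pairs = Counter()
for e in events:
    pairs.update(combined_tokens(e))
```

Here both single counts are 2 while the combined count is 1, and the number of combined tokens per event grows combinatorially with the number of dimensions, which is the cost being worried about.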

We're currently using Elasticsearch for this. Performance is okay, but we're not entirely happy with it.

2 comments

cevian 2872 days ago

Yes, TimescaleDB handles event data. I'd need more information about the exact nature of your queries to give a good answer, but there are multiple possible designs here:

- keep the event as a JSON object and use a GIN index (if using TimescaleDB, this would then be combined with constraint exclusion on the time field for faster queries)

- convert each JSON key to a column and use multi-column indexes or bitmap scans

- normalize the unique object JSON out into a separate table with the columns (id, object JSON), and keep your time-series table as (time, event, object_id)
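
As a rough illustration of the third design, the following Python sketch splits (time, event, object) tuples into an object table and a time-series table. It only models the data layout, not Postgres itself; the surrogate `object_id` and the `normalize` helper are assumptions made for the example.

```python
import json

def normalize(events):
    """Split (time, event, object) tuples into an object table and a fact table."""
    object_ids = {}   # canonical JSON text -> object_id (assumed surrogate key)
    objects = []      # rows of (object_id, object)
    facts = []        # rows of (time, event, object_id)
    for time, event, obj in events:
        # sort_keys makes equal payloads serialize to the same text
        key = json.dumps(obj, sort_keys=True)
        if key not in object_ids:
            object_ids[key] = len(objects) + 1
            objects.append((object_ids[key], obj))
        facts.append((time, event, object_ids[key]))
    return objects, facts

# Hypothetical sample data: the same movie viewed twice.
movie = {"id": 123, "name": "The Godfather", "director_id": 456, "genre": "crime"}
objects, facts = normalize([
    ("2017-04-01T10:00:00Z", "viewed", movie),
    ("2017-04-01T10:05:00Z", "viewed", dict(movie)),
])
```

The fact rows stay narrow, and each distinct payload is stored once, at the price of a join when grouping by payload fields.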

Happy to talk more on our Slack support channel: https://slack.timescale.com/


atombender 2872 days ago

Thanks! As for the nature of our queries, almost all of our use cases are group-by with count(*) plus some constraints.

So we're interested, for example, in the number of views, grouped by a few dimensions, over a specific time interval (per month or quarter, typically), with some constraints, including time (last 12 months, often).

Usually the user selects a whole bunch of dimensions, and we display this in the UI as a table where we pivot or nest based on the dimensions. For example, if you group by month, by region and customer, then you might get the months as horizontal columns, the regions as the vertical column, with totals for each region, then the customer nested under the region, with totals for each customer within the region. (The underlying query gives us a flat table, which we convert to a kind of dimensional hypertable structure for display.)
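
The flat-to-nested conversion described above could look roughly like the following Python sketch. The row shape (month, region, customer, count), the `nest` helper, and the sample rows are assumptions for illustration, not the application's actual code.

```python
from collections import defaultdict

def nest(rows):
    """Nest flat (month, region, customer, count) rows as region -> customer,
    with per-month columns and a total at each level."""
    def node():
        return {"total": 0, "months": defaultdict(int)}

    regions = {}
    for month, region, customer, count in rows:
        r = regions.setdefault(region, {**node(), "customers": {}})
        c = r["customers"].setdefault(customer, node())
        for level in (r, c):          # roll the count up into both levels
            level["total"] += count
            level["months"][month] += count
    return regions

# Hypothetical flat result of a GROUP BY month, region, customer query.
rows = [
    ("2017-01", "EU", "acme", 3),
    ("2017-02", "EU", "acme", 2),
    ("2017-01", "EU", "globex", 5),
    ("2017-01", "US", "acme", 1),
]
table = nest(rows)
```

Each region carries its own totals and a nested map of customers with theirs, which maps directly onto the months-as-columns layout.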

Our application is schema-agnostic, which is why we use JSON. If we were to avoid JSON, the only realistic option would be for the app to use SQL to create tables, handle schema migrations, and generally control the schema. That would make it a somewhat different app.

Of course, for many time windows, we're talking about tens or hundreds of millions of events. Elasticsearch is very fast at aggregating data and lets us do queries that span a few months in just milliseconds, whereas grouping an entire dataset containing years of data typically takes maybe 4-5 seconds, still fast enough to be acceptable for a reporting UI. In my experience, Postgres isn't as fast at counting.


nh2 2872 days ago

I, too, would be very interested to find a Postgres-based replacement for Elasticsearch.

Specifically replacing ELK by Postgres-Kibana.

Requirements for that:

   * Events are (timestamp, {arbitrarily nested JSON object})
   * Filtering by timestamp must be fast
   * Full-text search on the object is required
   * Exact constraints on all object keys must be fast
   * It should be possible to define indices on the object's fields so that WHERE clauses are fast
   * Counting the number of results should be fast, or at least have fast reasonably accurate estimates
   * Support typical Kibana searches and filters

I have tried so far to implement Kibana's access patterns on Postgres, and got quite far, but never got past the problem of https://wiki.postgresql.org/wiki/Slow_Counting, which essentially means Postgres must read every matching row if you write a WHERE clause, even when using indices, because it has to double-check that the rows the index returned weren't actually deleted.


cevian 2872 days ago

This seems like it could fit well with TimescaleDB, but obviously it would take testing. My only concern would be full-text search on JSON, which I think is possible but have never done. I would start with a TimescaleDB hypertable on the Event table (time TimestampTz, object JSONB) with the following indexes (or some of them, depending on testing):

- BTREE(time DESC)

- BTREE(time DESC, object)

- GIN(object)

- some kind of full text index

I don't know why the slow counting problem would be a problem with WHERE clauses since indexes are highly optimized to work with MVCC (e.g. hint bits etc). The wiki article itself says this isn't much of a problem when using indexes. But maybe you can elaborate?


nh2 2872 days ago

I found that the full-text search on JSON worked remarkably well with Postgres 10 -- surprisingly this was the least of all problems.

The issue with slow counting is this from the wiki page:

> PostgreSQL will take advantage of available indexes against the restricted field(s) to limit how many records must be counted, which can greatly accelerate such queries. PostgreSQL will still need to read the resulting rows to verify that they exist; other database systems may only need to reference the index in this situation.

Typical scenario:

You make a trivial query that matches a lot of rows (even when using an index), and you want to count the number of results in order to tell the user how much they probably have to scroll through (quite important when digging through logs, to know whether you'll have to scroll through a doable 3 pages or an impossible 3000 pages).

    SELECT COUNT(*) FROM logs WHERE object->>'environment' = 'production'

Then the WHERE will match 100 million rows, and Postgres will scan them all for existence (for the reason quoted above), no matter whether `object->>'environment'` has an index on it or not.

This will take many minutes, even on SSDs, just for showing a COUNT.


atombender 2872 days ago

Yep, the reason Elasticsearch is fast here is that the underlying Lucene indexes essentially form a column-oriented database. This is superb for low-cardinality fields like "object->>'environment'"; if it has just a handful of values, then only those values are stored, each with a sorted list of postings. Intersections with other field-based constraints are vector operations and can be super fast.
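
As a much-simplified illustration of the idea (not of how Lucene is actually implemented), this Python sketch keeps one sorted postings list per (field, value) pair and counts an intersection with a linear merge. The `level` field and the sample documents are hypothetical.

```python
from collections import defaultdict

def build_postings(docs):
    """Map each (field, value) pair to a sorted list of document ids."""
    postings = defaultdict(list)
    for doc_id, doc in enumerate(docs):   # ids ascend, so lists stay sorted
        for field, value in doc.items():
            postings[(field, value)].append(doc_id)
    return postings

def intersect_count(a, b):
    """Count ids present in both sorted lists using a linear merge."""
    i = j = n = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            n += 1
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return n

# Hypothetical log events.
docs = [
    {"environment": "production", "level": "error"},
    {"environment": "production", "level": "info"},
    {"environment": "staging", "level": "error"},
    {"environment": "production", "level": "error"},
]
postings = build_postings(docs)
prod_errors = intersect_count(
    postings[("environment", "production")],
    postings[("level", "error")],
)
```

Counting a single constraint is just the length of its postings list, and a two-field constraint never touches the documents themselves, which is the contrast with the row-visibility checks discussed above.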

I suspect that to make a fast-counting time series mechanism for Postgres, you'd need to create a new index type that used a columnar approach (or even used Lucene underneath). I don't know much about what optimization options are available to Postgres extensions, but it doesn't sound like it would be impossible.


nh2 2872 days ago

Also, this is quite related (including gist I linked in comments):

https://stackoverflow.com/questions/16916633/if-postgresql-c...

Some of the issues magically got faster with `VACUUM ANALYZE`, but it would be great to know whether TimescaleDB can be tuned to support this out of the box so that it's always as fast as Elasticsearch.

I have already written some scripts to preload postgres with an example data set; if I get some help with it, I could make that run against TimescaleDB, so that it can easily be evaluated whether it solves this use case, or whether it improves over time.


StavrosK 2872 days ago

I will second this question. cevian, can you opine on whether TimescaleDB would be a good fit for this? My (very small) experience with it so far says yes, but I'd like the opinion of someone more knowledgeable.
