| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by cevian 2827 days ago

This seems like it could fit well with TimescaleDB but obviously would take testing. My only concern would be with Full-text search on JSON which I think is possible but I have never done. I would start with a timescaleDB hypertable on the Even table (time TimestampTz, object JSONB) with the following indexes (or some of them depending on testing):

- BTREE(time DESC)

- BTREE(time DESC, object)

- GIN(object)

- some kind of full text index

I don't know why the slow counting problem would be a problem with WHERE clauses since indexes are highly optimized to work with MVCC (e.g. hint bits etc). The wiki article itself says this isn't much of a problem when using indexes. But maybe you can elaborate?

1 comments

nh2 2827 days ago

I found that the full-text search on JSON worked remarkably well with Postgres 10 -- surprisingly this was the least of all problems.

The issue with slow counting is this from the wiki page:

> PostgreSQL will take advantage of available indexes against the restricted field(s) to limit how many records must be counted, which can greatly accelerate such queries. PostgreSQL will still need to read the resulting rows to verify that they exist; other database systems may only need to reference the index in this situation.

Typical scenario:

If you make a trivial query that matches a lot of rows (even when using an index). You want to count the number of results in order to tell the user how much they probably have to scroll through (quite important when digging through logs, to know whether you'll have to scroll through doable 3 pages, or impossible 3000 pages).

    SELECT COUNT(*) FROM logs WHERE object->>environment == 'production'

Then the WHERE will match 100 million rows, and postgres will scan them all for existence (due to reason quoted above), no matter if `object->>environment` has an index on it or not.

This will take many minutes, even on SSDs, just for showing a COUNT.

link

atombender 2827 days ago

Yep, the reason ElasticSearch is fast here is that the underlying Lucene indexes essentially form a column-oriented database. This is superb for low-cardinality fields like "object->>environment"; if it has just a handful of values, then only those values are stored, as a sorted list of postings. Intersection with other field-based constraint are vector operations and can be super fast.

I suspect that to make a fast-counting time series mechanism for Postgres, you'd need to create a new index type that used a columnar approach (or even used Lucene unerneath). I don't know much about what optimization options are available to Postgres extensions, but it doesn't sound like it would be impossible.

link

nh2 2827 days ago

Also, this is quite related (including gist I linked in comments):

https://stackoverflow.com/questions/16916633/if-postgresql-c...

Some of the issues magically got faster with `VACUUM ANALYZE`, but it would be great to know whether TimescaleDB can be tuned to support this out of the box so that it's always as fast as Elasticsearch.

I have already written some scripts to preload postgres with an example data set; if I get some help with it, I could make that run against TimescaleDB, so that it can easily be evaluated whether it solves this use case, or whether it improves over time.

link