Hacker News new | ask | show | jobs
by whalesalad 942 days ago
Can anyone comment on QuestDB vs Clickhouse vs TimescaleDB? Real world experience around ergonomics, ops, etc.

Currently using BigQuery for a lot of this (ingesting ~5-10TB monthly) but would like to begin exploring in-house tooling.

On the flip side, we still use PSQL/RDS a lot and I enjoy it for the low operations burden - but we're doing some time series stuff with it now that is starting to fall over. TimescaleDB is nice because it is postgres, but afaik cannot work inside RDS. Clickhouse is next on my list for a test deployment, but QuestDB looks pretty neat too.

4 comments

Can't answer your question directly but I've touched a few of those here and there.

Clickhouse was deployed to replace a home grown distributed storage system that at one point in time a long time ago was much cheaper and faster than the results the team was getting with BQ.

We evaluated clickhouse and druid for a few data stores for doing interactive queries on a fairly high throughput clickstream data pipeline.

Clickhouse won in terms of performance. We did some chaos testing on it in the form of simulating node outages and network partitions and we were happy with the results. My only complaint is that there are some other database options which don't require me to run a database and that's nice if I can get away with it.

One of the things you might try is dumping your data into parquet files on gcs. There are quite a few databases you can query that with and get good results depending on your indexing and partitioning needs. It's tough to get lower operational burden than "stick it in cloud storage and sometimes spin up some stateless compute to query it".

I think duckdb is super cool but for me at the moment it's a solution that I'm still in search of a problem for.

We’re also using duckdb in prod and the performance is phenomenal.

Haven’t tried using it to read parquet direct from cloud storage but that’s on our todo list. We currently use it to generate a massive table once a day and then throughout the day that’s queried to serve prod requests.

I looked at it every year for a few years, most recently 1-2 yrs ago. It really is 20-50x faster at a lot of workloads than something like postgres, and it really does have a relational core and decent time-series functionality. I'd love to use it more, but IMO it's far too buggy. However it got that way, nothing about query execution felt cleanly composable.

No single query directly corrupted any persistent data, but "complicated" queries (any more nesting than the classic "select from join where" SQL backbone) were prone to returning no results, wrong results, error conditions, or slowly. For a bit I worked around that by wrapping right-sized queries (big enough to do something meaningful and avoid comms overhead, small/simple enough to work as expected) with a normal turing-complete language, but it was painful. Plus the defaults were quirky (broken telemetry opt-out, connections time out, RAM limits being per-arcane-questdb-subcomponent rather than per-questdb, ...) IMO, not that I would have minded enough to look elsewhere if the queries were correct.

Colored by that experience, I'd caution to thoroughly go through the docs and get it configured exactly as you want before deploying to prod, testing that the features you need behave correctly (best practice for most software, but IME not strictly necessary for a lot of products if your application code is sane), and also do something to probe reliability like spending a day hammering it with a toy project (maybe a few 10M row tables so that it's a fast experiment but anything fishy still stinks appropriately) and checking the query results. The current GitHub issues look fairly tame, so maybe it's better by now.

Sorry to hear about your painful past experience and thanks for sharing this feedback. QuestDB improved a lot in the past two years and it's much more robust and production-ready now.

A couple of things the engineering team did to address this as it grew from the early days at Ycombinator:

QuestDB introduced fuzz tests for all major database components [1] which helped find and fix lots of bugs. Besides that, the SQLancer team [2] was kind enough to add QuestDB support and find bugs around tricky SQL edge cases. We do our best to squash all critical bugs and constantly grow our integration tests suite.

On the SQL side specifically, we brought many SQL-related fixes over the last few years, introduced EXPLAIN and query plan testing, and finally improved memory management of cached queries. We're prioritising new bugs should they arise on GitHub or elsewhere.

Large companies such as Airtel [3], Mizuho Bank, Airbus, NetApp, Yahoo [4], Central Group [5] (largest retailer in Asia), Cloudera use QuestDB in production today.

There is still a lot of work to be done to bring better functionality. If you're keen to try QuestDB for one of your projects, we'd love to hear your feedback and see how it works for you.

[1] https://questdb.io/blog/fuzz-testing-questdb/ [2] https://github.com/sqlancer/sqlancer/ [3] https://questdb.io/case-study/airtel-xstream-play/ [4] https://questdb.io/case-study/yahoo/ [5] https://questdb.io/case-study/central-group/

I'd be curious to hear how RDS is starting to fall over with time series data, is it a bottleneck on ingestion, queries, or both?
It’s working okay with a 30 day rolling average… (every day we truncate older rows) we read from it to generate on demand status for per-second performance of tasks in a big job processing engine. But long term we want to have all the data available for historical analysis, trend analysis etc.
What about iceberg tables and a lake approach on GCS and then picking a querying engine?
This is the way the industry is going. Table formats such as Delta, Hudi, Iceberg stored on cloud object stores.

Though it works amazingly well, it is certainly slower than ingesting the data to be stored and manipulated in native formats.

These are terms I’m sorta familiar with but not sure. Data lake = bunch of noise (everything), iceberg = generated tables or views to read relevant/hot data from the lake?
Basically that’s it. Yeah. If you can afford BigQuery just use that but otherwise building off of blob storage and bolting on query engines and catalogs makes for a flexible approach but I find BigQuery solves most problems rather well just throwing money at the problem lol