by jaysh 7 days ago
ClickHouse replacing Loki finally made our observability stack feel 'right'. It really is a powerhouse for logs and general analytical queries.

We are fully embracing LGTM ourselves, but this is really interesting. Loki has been great for us, though, so what is better about CH, other than maybe SQL being more expressive than LogQL?
Off the top of my head:

- substantially better performance on the same hardware, even more so for larger range queries (multiple days)

- no new query language to learn

- significantly more expressive as you said

- agents for scraping logs use way less CPU (I used to use grafana-agent, which used about 80%; Vector uses under 5%)

- very intuitive to manage TTLs - I can keep some logs for 10 years, and some for 1 week based on the event in the JSON

- more compact storage: I didn't measure it rigorously, but CH compresses our logs at least 4-5x better

- no running into maximum stream limits - struggled with these even on Grafana Cloud and didn't realise we silently lost a lot of logs
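
A rough sketch of what per-event TTLs could look like, assuming an `events` table with `timestamp` and `event` columns. The event names and retention periods here are made up for illustration, and the catch-all rule uses `NOT IN` so it does not also expire the long-lived rows:

```sql
-- Hypothetical event names and retention periods, for illustration only.
ALTER TABLE default.events
    MODIFY TTL
        -- noisy request logs: keep for one week
        timestamp + INTERVAL 1 WEEK DELETE WHERE event = 'http.request',
        -- audit-style events: keep for ten years
        timestamp + INTERVAL 10 YEAR DELETE WHERE event = 'audit.login',
        -- everything else: keep for six months
        timestamp + INTERVAL 6 MONTH DELETE
            WHERE event NOT IN ('http.request', 'audit.login');
```

Expired rows are removed during background merges, so deletion is not instantaneous once a row passes its TTL.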

Honestly: why wouldn't you? Loki always felt like a mistake to me: a brand new query language, really counter-intuitive configuration, a long ramp-up time for complex queries, lots of arguing about labels/cardinality, etc. It all goes away when you drop it. I think logging should not be exotic or behave in unexpected ways.

How do you use it for visualization? Do you use ClickStack? or something else?
Still via Grafana. I ran it side-by-side with Loki, and despite trying to optimise Loki and using ClickHouse out of the box, it really was shocking how much faster ClickHouse was for every single query (e.g. "in the last 12 hours, give me the frequency of logs with a particular JSON event", or even "find this log entry, then join back and find the number of times a different entry appears within the same correlation_id").
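
For illustration, those two kinds of query might look roughly like this. It is only a sketch: it assumes a `default.events` table with `timestamp`, `event`, and a separate `correlation_id` column, and the event names other than 'product.updated' are hypothetical. The second query uses an `IN` subquery as a simple stand-in for the "join back" step.

```sql
-- Frequency of one event over the last 12 hours, bucketed per minute.
SELECT
    toStartOfMinute(timestamp) AS minute,
    count() AS hits
FROM default.events
WHERE timestamp >= now() - INTERVAL 12 HOUR
  AND event = 'product.updated'
GROUP BY minute
ORDER BY minute;

-- Find correlation_ids that contain one entry, then count how often
-- a different entry appears within the same correlation_id.
SELECT
    correlation_id,
    count() AS follow_ups
FROM default.events
WHERE event = 'email.sent'                  -- hypothetical event name
  AND correlation_id IN (
      SELECT correlation_id
      FROM default.events
      WHERE event = 'order.created'         -- hypothetical event name
        AND timestamp >= now() - INTERVAL 12 HOUR
  )
GROUP BY correlation_id
ORDER BY follow_ups DESC;
```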
What does the layout in ClickHouse look like? Do the input logs need to have a very defined structure?
Not really, ClickHouse is super forgiving so you can do something like:

    CREATE TABLE default.events (
      `timestamp` DateTime,
      `event` String,   -- e.g. 'product.updated', or empty
      `message` String, -- human-readable message (String type assumed)
      `raw` String      -- the raw message (String type assumed); very useful when pushing logs that aren't JSON: leave `event` empty and dump the entire message here
    )
    ENGINE = MergeTree
    PARTITION BY toDate(timestamp)
    ORDER BY (timestamp, event)
    TTL timestamp + toIntervalMonth(6)
ClickHouse is extremely performant even for unindexed substring searches, e.g.: SELECT count(*) FROM `events` WHERE `raw` LIKE '%hello world%'

Of course, the more columns you splat out (e.g. correlation_id, user_id, order_id, etc.), the better you can index and the better you can expect those queries to perform. In general, though, I don't bother beyond the obvious core domain ones (like those shown above): the performance is so good that unindexed queries are significantly faster than indexed queries in Loki. I have reached the point where I extract JSON on the fly in the WHERE clause, even for very large queries, with no meaningful performance issues.
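
As a sketch of both approaches, assuming `raw` holds a JSON document with a `user_id` string field (the field name and value are hypothetical): the first query extracts on the fly, and the statements after it show one way to promote the field to its own column and index it later if it turns out to matter.

```sql
-- On-the-fly extraction in the WHERE clause; nothing is indexed here.
SELECT count()
FROM default.events
WHERE timestamp >= now() - INTERVAL 7 DAY
  AND JSONExtractString(raw, 'user_id') = '12345';

-- Optional: splat the field out into a column computed on insert...
ALTER TABLE default.events
    ADD COLUMN user_id String MATERIALIZED JSONExtractString(raw, 'user_id');

-- ...and add a data-skipping index on it.
ALTER TABLE default.events
    ADD INDEX idx_user_id user_id TYPE bloom_filter GRANULARITY 4;
```

`JSONExtractString` returns an empty string when the key is missing or `raw` is not valid JSON, so non-JSON log lines simply fail to match rather than raising an error.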

Interesting, so you can bind a ClickHouse table as an extension to Grafana? Would you make a little Gist / post about it to show?
You only need the plugin: https://clickhouse.com/docs/observability/grafana - then you get basically everything natively.
There is HyperDX. Search is not the fastest, but that could be down to something we do too. I haven't checked deeply whether high cardinality is a big issue with ClickHouse, but we do see some high-cardinality data in what we post.
I have used SigNoz https://signoz.io/ for that
Worth noting both HyperDX and Maple too, as other options for observability on ClickHouse. https://www.hyperdx.io/ https://maple.dev/
We recently moved to OpenObserve due to cost, but the visualisations are good enough too.
Same question here!
Just replied to that question! Let me know if you have other questions.