We are fully embracing LGTM ourselves, but this is really interesting. Loki has been great for us, though, so what is better about ClickHouse (CH), other than maybe SQL being more expressive than LogQL?
- substantially better performance on the same hardware, even more so for larger range queries (multiple days)
- no new query language to learn
- significantly more expressive as you said
- agents for scraping logs use way less CPU (I used to use grafana-agent, which used about 80%; vector uses under 5%)
- very intuitive to manage TTLs - I can keep some logs for 10 years, and some for 1 week based on the event in the JSON
- more compact storage: I didn't measure it scientifically, but CH compresses our logs at least 4-5x better
- no running into maximum stream limits - struggled with these even on Grafana Cloud and didn't realise we silently lost a lot of logs
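As a rough sketch of the per-event TTL point above: a MergeTree table TTL can carry several DELETE rules, each with its own WHERE. The table and column names (`default.events`, `timestamp`, `event`) and the event values are placeholders for illustration, and this assumes the event has already been pulled out of the JSON into its own column.

```sql
-- Hypothetical table: default.events with `timestamp` DateTime and `event` String.
-- The event names below are made up for illustration.
ALTER TABLE default.events
    MODIFY TTL
        -- noisy events: keep for one week
        timestamp + INTERVAL 1 WEEK  DELETE WHERE event = 'debug.trace',
        -- events that must be retained long term: keep for ten years
        timestamp + INTERVAL 10 YEAR DELETE WHERE event = 'audit.recorded',
        -- everything else: keep for six months
        timestamp + INTERVAL 6 MONTH DELETE
            WHERE event NOT IN ('debug.trace', 'audit.recorded');
```

Expired rows are dropped when parts are merged, not the instant they expire, so disk usage can lag a little behind the rules.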
Honestly: why wouldn't you. Loki always felt like a mistake to me. A brand new query language, really counter-intuitive configuration, large ramp-up time for complex queries, lots of arguing about labels/cardinality etc. It all goes away when you drop it. I think logging should not be exotic or behave in unexpected ways.
Still via Grafana. I ran it side-by-side with Loki and, despite trying to optimise Loki and using ClickHouse out of the box, it really was shocking how much faster ClickHouse was for every single query (e.g. "in the last 12 hours, give me the frequency of logs with a particular JSON event", or even "find this log entry, then join back and find the number of times a different entry appears within the same correlation_id").
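For a feel of what those two queries can look like in plain SQL, here is a minimal sketch. It assumes a table `default.events` with `timestamp`, `event` and a `correlation_id` column; the event names are hypothetical. The second query uses an IN subquery (a semi-join) rather than an explicit JOIN, which is one of several ways to express the "join back".

```sql
-- Frequency per minute of one event over the last 12 hours.
SELECT
    toStartOfMinute(timestamp) AS minute,
    count() AS hits
FROM default.events
WHERE timestamp >= now() - INTERVAL 12 HOUR
  AND event = 'product.updated'          -- hypothetical event name
GROUP BY minute
ORDER BY minute;

-- For every correlation_id that has an 'order.created' entry,
-- count how many 'payment.retried' entries share that correlation_id.
SELECT
    correlation_id,
    count() AS retries
FROM default.events
WHERE event = 'payment.retried'          -- hypothetical event name
  AND correlation_id IN (
      SELECT correlation_id
      FROM default.events
      WHERE event = 'order.created'      -- hypothetical event name
  )
GROUP BY correlation_id
ORDER BY retries DESC;
```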
Not really, ClickHouse is super forgiving so you can do something like:
CREATE TABLE default.events (
    `timestamp` DateTime,
    `event` String,   -- e.g. 'product.updated', or empty when the log has no event
    `message` String, -- human readable message
    `raw` String      -- the raw message - this is very useful when pushing logs that aren't JSON - you just leave `event` empty and dump the entire message here
)
ENGINE = MergeTree
PARTITION BY toDate(timestamp)
ORDER BY (timestamp, event)
TTL timestamp + toIntervalMonth(6)
ClickHouse is extremely performant even for a plain substring scan, e.g.: SELECT count(*) FROM `events` WHERE `raw` LIKE '%hello world%'
Of course, the more columns you splat out (e.g. correlation_id, user_id, order_id, etc.), the better you can index and the better you can expect those queries to perform. In general, though, I don't bother beyond the obvious core domain ones (like the examples above): the performance is so good that unindexed queries are significantly faster than indexed queries in Loki. I have reached the point where I extract JSON on the fly in the WHERE clause, even for very large queries, with no meaningful performance issues.
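Roughly, the two approaches look like this. This is only a sketch: it assumes `raw` holds a JSON object with a string field called `user_id`, and the value `'u-123'` and the index name `idx_user_id` are made up.

```sql
-- On-the-fly extraction in the WHERE clause, no schema change needed.
SELECT count()
FROM default.events
WHERE timestamp >= now() - INTERVAL 7 DAY
  AND JSONExtractString(raw, 'user_id') = 'u-123';

-- If a field turns out to be a core one, splat it out into its own column.
-- MATERIALIZED means it is computed from `raw` at insert time.
ALTER TABLE default.events
    ADD COLUMN IF NOT EXISTS user_id String
    MATERIALIZED JSONExtractString(raw, 'user_id');

-- Optional data-skipping index on the new column.
ALTER TABLE default.events
    ADD INDEX IF NOT EXISTS idx_user_id user_id TYPE bloom_filter GRANULARITY 4;
```

The new column and index only apply to parts written after the ALTER. Older parts compute the column on read and skip the index until you run MATERIALIZE COLUMN / MATERIALIZE INDEX on them.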
There is also HyperDX. Its search is not the fastest, but it could be something that we use too. I haven't checked closely whether high cardinality is a big issue with ClickHouse, but we do see some high-cardinality data in what we post.