Hacker News
by gbuk2013 756 days ago
Genuine question: does your job involve troubleshooting from logs on a regular basis? Because if it does, I would be surprised that you feel the way you do.

My experience is with ELK, but at least the Kibana interface is pretty decent for applying filter combinations to find the needle in a haystack of logs.

And in terms of ingestion, if you are in a container environment you can just configure stdout from the container to be ingested - no agent required.

Building a system that can ingest a few GB of logs a day, index them in near real time and keep them around for a few months while keeping search speed usable is not as easy as it might seem at a cursory glance.

But the real challenge is to get developers to write software that outputs structured logs that don’t suck. :)

And don’t even get me started on all the snowflake non-json log formats I’ve had to deal with …
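To make "structured logs" concrete, here is a minimal sketch using only Python's standard `logging` module: one JSON object per line on stdout, which a container log collector could then pick up. The key names (`ts`, `level`, `logger`, `msg`), the `fields` convention for extra data and the `make_logger` helper are assumptions for illustration, not a standard.

```python
import json
import logging
import sys
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
        }
        # Hypothetical convention: callers pass structured data via
        # extra={"fields": {...}}, which logging stores as record.fields.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


def make_logger(name: str, stream=sys.stdout) -> logging.Logger:
    """Build a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]  # replace any handlers from earlier calls
    logger.propagate = False     # avoid duplicate output via the root logger
    return logger


if __name__ == "__main__":
    log = make_logger("orders")
    log.info("payment failed", extra={"fields": {"order_id": "o-42", "attempt": 3}})
```

Keeping identifiers such as `order_id` as separate keys, rather than interpolating them into the message, is what makes the later filtering cheap.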


I used to spend a lot of time looking at logs from a complex state machine. I would pull up half a day of logs in less (maybe a few GB) and search for something I was interested in, like an id from an error message. This could be slow (tricks were disabling line numbers and searching backwards from the end). Then I would answer questions of the form ‘how long from this line until the next line matching x?’ or ‘what events of type y happened after the first event of type x after this log line?’ or ‘what events of type x containing y happened between these log lines?’ and suchlike. This is annoying to do with less; useful tricks were learning the shortcuts, taking advantage of regexes to highlight things or advance past long wrapped lines with /^, and copying interesting messages into an editor and writing notes.
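As a rough illustration of the first kind of question ("how long from this line until the next line matching x?"), here is a small Python sketch over plain text lines. It assumes every line starts with an ISO-8601 timestamp without fractional seconds followed by a space; that layout, the function names and the sample lines are made up for the example.

```python
import re
from datetime import datetime, timedelta
from typing import Iterable, Optional

# Assumed layout: "2024-01-01T10:00:05 rest of the message"
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})\s")


def parse_ts(line: str) -> Optional[datetime]:
    """Return the leading timestamp of a line, or None if it has none."""
    m = TS_RE.match(line)
    return datetime.fromisoformat(m.group(1)) if m else None


def time_to_next_match(
    lines: Iterable[str], start_pattern: str, next_pattern: str
) -> Optional[timedelta]:
    """Time from the first line matching start_pattern to the next
    later line matching next_pattern, or None if either is missing."""
    start_re = re.compile(start_pattern)
    next_re = re.compile(next_pattern)
    start_ts = None
    for line in lines:
        ts = parse_ts(line)
        if ts is None:
            continue  # skip continuation lines without a timestamp
        if start_ts is None:
            if start_re.search(line):
                start_ts = ts
        elif next_re.search(line):
            return ts - start_ts
    return None


if __name__ == "__main__":
    sample = [
        "2024-01-01T10:00:00 state=INIT id=abc",
        "2024-01-01T10:00:05 error id=abc timeout",
        "2024-01-01T10:00:07 state=RETRY id=zzz",
        "2024-01-01T10:00:12 state=RETRY id=abc",
    ]
    print(time_to_next_match(sample, r"error id=abc", r"state=RETRY.*id=abc"))
```

Because it takes any iterable of lines, the same function works on an open file handle, so a multi-GB file is read once, line by line, without loading it into memory.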

ELK/grafana don’t give great solutions here. Elastic/Kibana already struggles with the first part because it doesn’t make it ergonomic to separate the part of your query that is ‘finding the log file’ from the part that is ‘searching within the file’. For the rest, the UI tends to be clunkier and less information-dense than less, though if your data is sufficiently structured the Kibana tables help. In particular, you’re still copying notes somewhere, but next/before/after queries aren’t easy, quick or ergonomic to express (I think), and changing the query and waiting is pretty slow. In less, the typical search described above would be fast because the result wouldn’t be far away, and pressing n or N brings the next result exactly where you expect it on the screen, so you don’t need to hunt for it on a page.

I think SQL-like queries aren’t great here either, because they are verbose to write and require either very complicated self-joins/CTEs to find the rows to search between or after, or copying a bunch of timestamps back and forth.

Something people sometimes talk about is LTL (linear temporal logic) for log analysis. I think it can be useful for certain kinds of analysis but is less useful for interactive exploratory log querying. I don’t know how to do LTL queries against data in Loki or Elasticsearch.
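For anyone who hasn't seen the LTL idea applied to logs, a hand-rolled sketch of one common property, "every trigger event is eventually followed by a response event" (roughly G(trigger -> F response), read over a finite trace), looks like this in Python. It is not tied to Loki or Elasticsearch; the event dicts and the function name are invented for the example.

```python
from typing import Callable, Dict, List, Sequence

Event = Dict[str, object]
Pred = Callable[[Event], bool]


def unanswered_triggers(
    trace: Sequence[Event], trigger: Pred, response: Pred
) -> List[int]:
    """Indices of trigger events with no response event at or after them.

    An empty result means the property holds on this finite trace.
    """
    violations: List[int] = []
    response_ahead = False
    # Walk backwards so "is there a response later?" is a single flag.
    for i in range(len(trace) - 1, -1, -1):
        if response(trace[i]):
            response_ahead = True
        if trigger(trace[i]) and not response_ahead:
            violations.append(i)
    violations.reverse()
    return violations


if __name__ == "__main__":
    trace = [
        {"type": "request", "id": 1},
        {"type": "response", "id": 1},
        {"type": "request", "id": 2},
    ]
    print(unanswered_triggers(
        trace,
        trigger=lambda e: e["type"] == "request",
        response=lambda e: e["type"] == "response",
    ))
```

This is a batch check over a whole trace, which fits the point above: it suits "does this property ever fail?" analysis better than interactive poking around.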

To be clear, most of the reasons that elk/grafana don’t work for the use case I described vaguely above are problems with the frontend ui rather than fundamental issues with the backend. It may just be the kind of logs too – if your logs all look similar to logs of incoming http requests, the existing ui may work great.

In the past I spent a lot of time cutting up logs with grep, less, jq and Perl. It was an amazing UX that Kibana can't beat in terms of performance, assuming you already know the time window you are interested in (although I never learned enough gnuplot to be able to do visualisations, so Kibana wins there). However, all that went out of the window when I moved into a world of multi-instance micro-services in the cloud and SOC2 compliance. No more downloading of logs to my local machine and cutting them up with sensible tools. :(

That said, nothing that you outlined above is particularly difficult in Kibana, the main annoyance being the response time of the query API (somewhat mitigated by indexing). Based on your vague description my vague workflow would be:

  - filter for type x
  - limit time range to between occurrences of x
  - change filter to type y
  - at any point pick out the interesting fields of the log message to reduce noise in UI
  - save and reuse this query if it is something I do regularly
  - if your state machine has a concept of a flow then filter by a relevant correlation ID
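The workflow above maps roughly onto an Elasticsearch query body like the one built below. This is only a sketch: the field names (`event.type`, `@timestamp`, `correlation_id`) and the `build_query` helper are assumptions, and the `term` filters assume those fields are mapped as keywords.

```python
from typing import Any, Dict, List, Optional, Sequence


def build_query(
    event_type: str,
    start: str,
    end: str,
    correlation_id: Optional[str] = None,
    fields: Optional[Sequence[str]] = None,
) -> Dict[str, Any]:
    """Build a search body: events of one type inside a time window."""
    filters: List[Dict[str, Any]] = [
        # Hypothetical field names; adjust to the actual index mapping.
        {"term": {"event.type": event_type}},
        {"range": {"@timestamp": {"gte": start, "lte": end}}},
    ]
    if correlation_id is not None:
        filters.append({"term": {"correlation_id": correlation_id}})
    body: Dict[str, Any] = {
        "query": {"bool": {"filter": filters}},
        "sort": [{"@timestamp": "asc"}],
    }
    if fields:
        # Return only the interesting fields to cut noise.
        body["_source"] = list(fields)
    return body


if __name__ == "__main__":
    import json

    # Window bounds would come from two occurrences of type x found earlier.
    print(json.dumps(
        build_query("y", "2024-01-01T10:00:05Z", "2024-01-01T10:00:12Z",
                    correlation_id="abc", fields=["@timestamp", "message"]),
        indent=2,
    ))
```

The time window still has to be copied from the results of a first query for type x, which is the back-and-forth the parent comment complains about.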

Not sure what you mean by "finding the log file" since Elasticsearch is a document database where each log line is a document.

VictoriaLogs could fit your use case:

- It natively supports a 'stream' concept [1]: a stream is basically the logs received from a single application instance.

- It allows efficiently querying all the logs that belong to a single stream over a given time range via 'curl', and piping them to 'less' or to any other Unix command for further processing in a streaming manner [2].

[1] https://docs.victoriametrics.com/victorialogs/keyconcepts/#s...

[2] https://docs.victoriametrics.com/victorialogs/querying/#comm...
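For those who would rather script it than use curl, a rough Python equivalent using only the standard library might look like this. It assumes a VictoriaLogs instance on `http://localhost:9428` and the `/select/logsql/query` endpoint with a `query` form argument; the stream label `app="checkout"` and the helper names are made up, and the exact LogsQL filter syntax should be checked against the linked docs.

```python
import json
import urllib.parse
import urllib.request
from typing import Dict, Iterator


def build_stream_query(
    base_url: str, stream_filter: str, time_filter: str
) -> urllib.request.Request:
    """Build a POST request selecting one log stream over a time range."""
    # Assumed LogsQL shape: a stream filter plus a relative time filter.
    query = f"_stream:{stream_filter} _time:{time_filter}"
    data = urllib.parse.urlencode({"query": query}).encode()
    return urllib.request.Request(f"{base_url}/select/logsql/query", data=data)


def iter_logs(req: urllib.request.Request) -> Iterator[Dict[str, object]]:
    """Yield log entries as they arrive, assuming one JSON object per line."""
    with urllib.request.urlopen(req) as resp:
        for raw in resp:
            if raw.strip():
                yield json.loads(raw)


if __name__ == "__main__":
    req = build_stream_query("http://localhost:9428", '{app="checkout"}', "1h")
    for entry in iter_logs(req):  # needs a running VictoriaLogs instance
        print(entry.get("_time"), entry.get("_msg"))
```

Since the response is consumed line by line, the output can be processed or piped onward as it arrives rather than after the whole result set is downloaded.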

I am indeed excited about VictoriaLogs.