by umanwizard 808 days ago
I'm not really a cloud expert so maybe I'm fundamentally missing something about how I'm "supposed to work", but honestly all I have ever wanted to do, when looking at logs, is see the log from one process, from beginning to end, as a text file. You can of course do this using kubectl but only for the most recent two instances of a given pod which isn't helpful when investigating an incident that happened a while ago.

It seems nobody else cares about this use case and wants you to use LogQL and the incredibly clunky Grafana web UI instead, because it makes it possible to aggregate across many different processes, slice and dice by various labels, etc., which as I said, I have never (or almost never) actually wanted to do.

Hopefully this new UI is a step in the right direction as people won't need to futz around with LogQL anymore, but it seems like it still doesn't quite do what I want.

10 comments

Just want to chip in and say that I wholeheartedly agree with you. I'm not a cloud developer either, but I'm regularly forced into what's apparently called "Google Cloud's operations suite" to grovel through logs. Compared to working with Linux journals using the tried and true text manipulation tools, it feels like looking through a straw with oven mitts on. I'd happily download a 500 MB text file instead, but there is an arbitrary limit to how much I can grab (10k lines IIRC). Maybe we're just out of touch.
> but I'm regularly forced into what's apparently called "Google Cloud's operations suite" to grovel through logs

Is this Google Cloud Logging? If so, personally I quite like it, especially for looking through logs from multiple sources at the same time. Being able to put all your logs through there, and then search them with a simple query language, feels very convenient.

It's ridiculously slow. And compared to how expensive it is ... it's daylight robbery.
It's fairly risky to download 500 MB of logs and analyse them locally on your machine. I know people do it anyway. Just saying.
Risky how exactly? If it has data in it that it shouldn't it's a problem no matter where it resides.
In theory a logfile could contain privileged information (indeed it almost certainly will - IP addresses etc), putting that on a laptop increases risk of losing it.
It is not about what data is in the log. It is about the fact that once the data is downloaded, it is most likely going to stay on the machine.

Logs often contain privileged info, or at least reveal a bit about how the application behaves. It is risky to do that.

If logs contain privileged info, then the damage is done the moment they're published. Whether or not they're downloaded to a laptop is irrelevant, the risk impact is the same.
Fwiw this is how I use Loki most of the time. Pick an app label, pick a time period, look at raw logs. The LogQL for this ends up something like `{app="workload-foo"}`. Loki is excellent at that.

Then if I know which pod I'll filter down to a specific pod with `{pod="workload-foo-1234"}`, sometimes I'll search for a specific term (error message etc) with `{pod="workload-foo-1234"} |= "error message"` then look at the logs around that. There's really no point writing complicated queries unless you need to.
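
For what it's worth, here is a rough sketch of turning that kind of query into the plain text file the top comment asks for. It assumes Loki's `query_range` HTTP endpoint and its `streams` result shape (a list of streams, each holding `[timestamp_ns, line]` pairs). The names `build_query_url` and `flatten_streams`, the base URL, and the sample response are made up for illustration, and nothing here talks to a real server.

```python
from urllib.parse import urlencode


def build_query_url(base_url, logql, start_ns, end_ns, limit=5000):
    """Build a query_range URL; direction=forward asks for oldest entries first."""
    params = {
        "query": logql,
        "start": start_ns,
        "end": end_ns,
        "limit": limit,
        "direction": "forward",
    }
    return f"{base_url}/loki/api/v1/query_range?{urlencode(params)}"


def flatten_streams(response):
    """Merge every stream of a parsed JSON response into one time-ordered text blob."""
    entries = []
    for stream in response["data"]["result"]:
        for ts_ns, line in stream["values"]:
            entries.append((int(ts_ns), line))
    entries.sort(key=lambda entry: entry[0])  # stable sort: ties keep stream order
    return "\n".join(line for _, line in entries)


# Hypothetical response, shaped like a "streams" result (stdout and stderr split).
sample = {
    "status": "success",
    "data": {
        "resultType": "streams",
        "result": [
            {"stream": {"pod": "workload-foo-1234", "stream": "stderr"},
             "values": [["1700000000000000002", "error message: disk full"]]},
            {"stream": {"pod": "workload-foo-1234", "stream": "stdout"},
             "values": [["1700000000000000001", "starting up"],
                        ["1700000000000000003", "shutting down"]]},
        ],
    },
}

url = build_query_url("http://loki.example:3100", '{pod="workload-foo-1234"}', 0, 10)
text = flatten_streams(sample)
print(text)
```

Writing `text` to a file gets you something you can open in less or grep through, at least up to whatever entry limit the server allows per request.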

That will, if I understand correctly, get the logs for one pod, not for one process. For example if the pod restarted 10 times you will not get 10 separate files from that query.
You'd have the label shown in the output that indicates the log line in question is from a different process/pod/container/host/whatever.
How so? The pod, container, and host labels should be the same for a process that crashes and is automatically restarted, no?
Even more than that, if you are running multiple instances of the app in multiple pods concurrently, then all of those logs will be joined together.
I'm not sure I really understand this.

If you mean one instance in each pod, then each should be labelled differently and you can filter down to one instance.

If you mean running multiple instances in each pod (and container?), then the standard kubectl log output will also have them all joined together. For both of those, you would need to add another unique identifier to each line, or run each instance in a separate container so you can submit the logs with the pod name and container name combined being the unique identifier.
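
A rough sketch of the "add another unique identifier to each line" idea: if every line carries an instance tag, a merged stream can be split back into one log per process. The `instance=<id>` prefix and the `split_by_instance` helper are assumptions for illustration, not something kubectl or Loki emits on its own.

```python
import re
from collections import defaultdict

# Hypothetical convention: every line starts with "instance=<id> ".
INSTANCE_RE = re.compile(r"^instance=(\S+)\s+(.*)$")


def split_by_instance(lines):
    """Group log lines by instance id, preserving order within each instance."""
    per_instance = defaultdict(list)
    for line in lines:
        match = INSTANCE_RE.match(line)
        if match:
            per_instance[match.group(1)].append(match.group(2))
        else:
            # Untagged lines cannot be attributed, so keep them in their own bucket.
            per_instance["unknown"].append(line)
    return dict(per_instance)


merged = [
    "instance=a1 starting",
    "instance=b2 starting",
    "instance=a1 panic: out of memory",
    "instance=b2 ready",
    "line with no tag",
]
logs = split_by_instance(merged)
for instance_id, instance_lines in logs.items():
    print(instance_id, instance_lines)
```

A restart counter or process start time would work just as well as the id, as long as it changes every time the process is relaunched.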

That's definitely false
Why? If the pod is defined to spawn multiple containers, and each container runs the same application, then this seems true to me? Unless you would add an additional filter on the container name.
Well yes, obviously you have to filter on container if you want a single container (just like kubectl logs -l <...>). The parent comment was phrased as a limitation of Loki. Of course if you request all logs for an application you'll get all containers, and if you request all logs for an application or a namespace you will get exactly that.

Not being able to filter between multiple processes or multiple restarts of a container was a genuine issue; not being able to filter between pods of a deployment is not.

It's certainly true in my environment, maybe not others though? Apologies!
I'm an old fart so I use things like "cat" and "grep", and maybe "sed" and "cut" if the lines are particularly long.

I have one log file per day per host on my syslog server and can use "sort" to order across multiple files.

Loki was sold to me at FOSDEM a couple of years ago as this, but I still haven't got round to working it out; it seems a very high barrier to entry compared with running cat.

> seems a very high barrier to entry compared with running cat

It really isn't. It's a single binary with a relatively simple configuration file, you throw logs at it via an API (which a bunch of logging agents support, and syslogs can be sent to it).

Then the actual queries aren't all that complex, it's just a difference of cd-ing to the correct folder for the date/server to be able to cat and grep vs writing a query that selects by server name and filters by date.

The learning curve and maintenance of Loki are quite minimal, but the value add is quite significant in most cases. Being able to do cross-host queries, metrics from logs (how many times did error X occur in the logs), as well as easy visualisations is pretty useful.

"to be able to cat and grep"

Admittedly I learned how to use basic tools 25 years ago, but that's an investment that can be used for decades.

  cat *web*log | grep "34.5.22.4" | sort -n | less
is hardly a complex thing to learn. Sure you can then build on that pipeline -- "cut -b -10|uniq -c" and if you want something really complex then you can use awk, or perl, or python, and do all sorts of things with the data.

Will whatever today's favoured log query/filter/etc be around in 25 years? Last time I looked at this people were going on about logstash and elasticsearch. Nobody could show me how to do the above command without touching the mouse.

Now sure, cat and grep can be sluggish on millions of lines (which is the main reason I'm tempted by loki or similar), and there's always some twat that comes along with "useless use of cat" [0], but the kind of pipeline processing serves me well and it seems a very different way to think about things when you need to access things from a database. Maybe I'm in a local maximum, but it's good-enough for me to find out what's going on.

[0] https://stackoverflow.com/questions/11710552/useless-use-of-...

"it's just a difference of cd-ing to the correct folder for the date/server" to be able to cat/grep.

You have to connect to your server, get to the correct folder, and then run the cats and greps which are easy (if you have to do some more advanced filtering with awk it gets more complicated.)

Connecting to Grafana and running a simple label query is practically the same in terms of complexity and time, but with vastly more features available.

> Will whatever today's favoured log query/filter/etc be around in 25 years? Last time I looked at this people were going on about logstash and elasticsearch. Nobody could show me how to do the above command without touching the mouse.

You can run Elasticsearch queries via the API, and can still do it today. I don't know about Logstash, but Loki is a statically compiled binary with only optional external dependencies. You'd still be able to run it in 25 years just fine.

Loki has a CLI tool called LogCLI. It's passable for needle-in-haystack searches, and the label browser is handy. But Loki doesn't handle multiline searches well. I'm with you on the ease of grep, sort and uniq: it's pretty easy to fashion up a quick report, sorted numerically, no enterprise data analysis suite needed.
For analysing text logs lnav is pretty good, if you need to work with a live updated view of the log in response to commands.
You may be amazed at how hard these tools are to get started with relative to that. I have been thoroughly unimpressed with and unable to really get started with any of these tools because of the overemphasis on cloud. Not sure what people were doing before, but sshing to the prod box kinda sucks.
If you’re debugging something simple or non-distributed, this product isn’t for you.

If you’re working on anything distributed, log aggregation becomes a must. But, also, if you’re working on anything distributed and you’re looking at logs, you’re desperate. Distributed traces are so much higher quality.

When I formed these opinions I was working on Materialize, which is basically the polar opposite of "simple and non-distributed". However it was still quite common that I knew exactly which process was doing something weird and unexpected.
Maybe it’s the difference between tracking a bug (abnormal operation) vs understanding behavior of a complex system (normal operation)?
Yup and the reason no one markets something like "tail the logs for server X" is because, if you're talking in the context of an individual server, you're too small for anyone to care about.
I've got logs from hundreds of servers that I use standard tools to look at, and that's a small system. Centralising logs has been a thing for decades.
Which is fine, I'm just saying you're not the target market for the big observability vendors.

The current generation of observability tools is built for distributed systems that are basically too complex to reason about, and so you have other ways of monitoring and debugging them. When you have tens of thousands of ephemeral containers running hundreds of services, you can't just look at some logs for a server to understand what's going on (ignoring the fact that servers aren't even a primitive in this system).

Tens of GBs of logs a day just don't move the needle on pricing. They want the customers that are going to generate 7 figures in revenue, and those customers aren't talking about aggregating logs from a few hundred servers.

Sorry, did plenty of "distributed" tracing back in the day and this is just not the case. I can't help but feel like you're after-the-fact rationalizing as if you need this for diagnosing anything "distributed" or "complicated".

Distributed anything is actually easier in most cases because you will always have input and output. Sure, if you're debugging a complicated and coordinated "dance" between two concurrent threads/processes then yeah fully agreed, but then you're deep in uncharted territory and you need all the help you can get.

> maybe I'm fundamentally missing something about how I'm "supposed to work", but honestly all I have ever wanted to do, when looking at logs, is see the log from one process, from beginning to end, as a text file.

This is still a valid use case, but pretend for a minute you have thousands or millions of log lines to inspect. Even after filtering for ERROR level only, you still have too many errors that devs swear "are normal" (but do not fix). And maybe the data you need to diagnose isn't even in ERROR!

The solution? Use log queries to compare a normal and abnormal process or cluster, group them by some kind of fingerprint, then apply some Laplace smoothing or other Bayesian techniques to score fingerprints by strength of association with abnormal. This lets me rapidly identify problems at scale that would otherwise take hours of poring over logs to exclude stuff by hand.

This works any time you can divide logs into "good" and "bad." Example scenarios:

- canary analysis, comparing canary and baseline

- single faulty pod in a deploy, comparing the bad container to the n good ones

- one AZ or region in a multi-region deploy

- now versus yesterday, or versus an hour ago, etc

- Android versus iPhone
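
A minimal sketch of the "group them by some kind of fingerprint" step, assuming a fingerprint is just the log line with its variable parts masked out. The masking rules, the helper names, and the sample lines are illustrative guesses, not the method actually used above.

```python
import re
from collections import Counter


def fingerprint(line):
    """Collapse variable parts of a log line so similar lines share one key."""
    line = re.sub(r"0x[0-9a-fA-F]+", "<HEX>", line)  # mask hex first, before digits
    line = re.sub(r"\d+", "<N>", line)               # then any remaining digit runs
    return line.strip()


def count_fingerprints(lines):
    """Count how many lines fall under each fingerprint."""
    return Counter(fingerprint(line) for line in lines)


# Made-up lines from a "bad" population and a "good" one.
bad = [
    "timeout after 3000 ms talking to 10.0.0.7",
    "timeout after 3100 ms talking to 10.0.0.9",
    "request 17 ok",
]
good = ["request 12 ok", "request 13 ok"]

bad_counts = count_fingerprints(bad)
good_counts = count_fingerprints(good)
print(bad_counts)
print(good_counts)
```

The two counters are then the input for whatever scoring you apply to find fingerprints that show up mostly on the bad side.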

> then apply some Laplace smoothing or other bayesian techniques to score fingerprints by strength of association with abnormal
I would love to hear more about this process.
The simplest technique, and the one I currently use, is just "(n+bad)/(n+good)" where n is basically the strength of a prior belief that bad/good = 1. At some level I think this might replicate TF-IDF[1] but I haven't sat down to prove it or find where they diverge.

[1]: https://en.wikipedia.org/wiki/Tf%E2%80%93idf
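
A small sketch of that "(n+bad)/(n+good)" score applied to per-fingerprint counts. The function names, the counts, and the choice of n=10 are made up for illustration; only the formula itself comes from the comment above.

```python
def score(bad, good, n=1.0):
    """Smoothed ratio: n acts as the strength of the prior that bad/good = 1."""
    return (n + bad) / (n + good)


def rank(bad_counts, good_counts, n=1.0):
    """Score every fingerprint seen in either group, most bad-associated first."""
    keys = set(bad_counts) | set(good_counts)
    scored = {
        key: score(bad_counts.get(key, 0), good_counts.get(key, 0), n)
        for key in keys
    }
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)


# Hypothetical counts of each fingerprint in the bad and good populations.
bad_counts = {"timeout talking to <N>": 40, "request <N> ok": 900, "cache miss": 5}
good_counts = {"request <N> ok": 1000, "cache miss": 5}

ranked = rank(bad_counts, good_counts, n=10)
for key, value in ranked:
    print(f"{value:6.2f}  {key}")
```

A fingerprint that only appears on the bad side floats to the top, while one that is equally common on both sides scores about 1. Raw counts like these implicitly assume the two populations have similar log volume.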

But this still requires you to classify each line manually to determine bad or good, no?
Not manually, it just requires you to be able to group them along a dimension of interest. For example, if I get a page from us-east-1a, I can compare all the logs from that against us-east-1b. Or you can compare all the logs from the hour after the incident started against the same hour a day ago (or a week ago).

I pulled this technique from canary analysis and applied it to production outage analysis. In canary, you have a guaranteed random stable population that lets you perform accurate comparisons. Elsewhere, we can try to make that assumption but it might break down. For example, regional holidays can radically alter customer behavior over time or between regions. So it's not perfect but it's often good enough to provide me insights while on call.

And, it requires advanced log queries to perform all these filtering, grouping, counting and scoring functions.

OK, I'm starting to see where you're going with this - I also compare incident-affected logs with pre-incident logs, or HTTP requests that I want to debug with similar requests that are known good.

What tools are you using? For me it's often just grep and awk with temp files, maybe a touch of python occasionally.

> I'm not really a cloud expert so maybe I'm fundamentally missing something about how I'm "supposed to work", but honestly all I have ever wanted to do, when looking at logs, is see the log from one process, from beginning to end, as a text file.

That's the rub that I think you are missing. In distributed and/or cloud environments it is quite unusual for there to be a single end-to-end process, and thus we need new ways to trace across a system.

In harmony with tracing, we also need the aggregated view _across_ the estate to understand system hotspots, levels of throughput, redundant infrastructure, error rates, etc.

Dump the logs into Elastic, Loki or whatever, along with pod name as a label. Usually I use Kibana, so I don't want to speak for Loki, but it seems pretty straightforward.
You missed the key criterion, which is being able to see the logs from that process "as a text file", or the way I'd rephrase it "with the same ease of a text file."

Kibana is ok (definitely beats grep) when you want to look across a fleet and determine if a specific thing is happening. But when you have a specific symptom that happens on a particular instance, what you want to do is see logs in the order they happened, and Kibana isn't close. Querying and viewing logs are just slow and cumbersome relative to less/grep.

Well, honestly I don't understand what's missing - you just pick time window, instance and have logs displayed line-by-line as they happened.

It's best to configure a view for this to limit columns and maybe pre-configure some filters. Also annotate your logs with a timestamp, so you rely on the time of the event and not the time of ingestion.

But both of these are one-time configuration steps, and then you can simply scroll.

> Well, honestly I don't understand what's missing - you just pick time window, instance and have logs displayed line-by-line as they happened.

What's missing is that I don't want to learn and use some clunky web UI in order to do this. I want the UI to be "download this text file" and then use the tools I already know and understand (local text processing utilities and text editors)

This seems like a solution for pets. If you have a lot of pets, this sounds totally reasonable, but it isn't some universal truism. People are moving away from pets as they're often harder to work with than cattle. That also means you need observability aggregation which can make sense of what's happening everywhere, not just on one instance on one machine.
If the GUI is the main issue, you can use a CLI client to extract data from Elastic :-)
If you get a chance, please check out kubetail (https://github.com/kubetail-org/kubetail). It's an open source log viewer for Kubernetes. Currently you can use it to look at pod logs from beginning to end, grouped together by workload (e.g. Deployment, CronJob) with basic filtering available (e.g. node-id, AZ). It doesn't let you look at historical logs yet but that's where we're headed. We just launched so we're eager for feedback and we like to build out new features quickly.
Interesting. Will be following this tool.

There is a CLI tool with the same name that does something similar - https://github.com/johanhaleby/kubetail

Could LogQL do something like

  select * from stdout, stderr
  where session_id = 123456
? If not, why?
yes it can, if you tag your log stream correctly - either by having the stream externally tagged via attributes, or internally by following certain conventions in the log line.

You can also do something like

  select client_ip from requests where elapsed_ms > 10000

which is incredibly powerful
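
For comparison, a sketch of what that second query amounts to when done by hand over structured logs. It assumes one JSON object per line with `client_ip` and `elapsed_ms` fields, which is a made-up schema; `slow_client_ips` is a hypothetical helper, not part of any log tool.

```python
import json


def slow_client_ips(lines, threshold_ms=10000):
    """Rough equivalent of: select client_ip from requests where elapsed_ms > threshold."""
    ips = []
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON at all
        if not isinstance(record, dict):
            continue  # skip JSON values that are not objects
        elapsed = record.get("elapsed_ms")
        if isinstance(elapsed, (int, float)) and elapsed > threshold_ms:
            ips.append(record.get("client_ip"))
    return ips


# Made-up request log lines.
requests_log = [
    '{"client_ip": "203.0.113.5", "elapsed_ms": 12000}',
    '{"client_ip": "203.0.113.6", "elapsed_ms": 80}',
    "plain text line that is not JSON",
    '{"client_ip": "203.0.113.7", "elapsed_ms": 10001}',
]
slow = slow_client_ips(requests_log)
print(slow)
```

The query language earns its keep when the same filter has to run across many hosts at once; for one file, a loop like this or a grep is about the same effort.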

yep, with the caveat that you probably don't want to have the backend of whatever log system you use (not exactly sure how Loki does it) to have an index on something as high-cardinality as session id so that query could get slow.

But these log query systems can also optimize these queries, for instance by sampling, using distributed trace ids to ensure you get shown corresponding logs, allowing you to get only logs where at least one step in the trace errored, etc.

strace and gdb can trace and close and reopen process file handles 0,1,2.

ldpreloadhook has an example of hooking write() with LD_PRELOAD=, which e.g. golang programs built without libc don't support.

When systemd is /sbin/init, it already owns all subprocesses' file handles, so there's no need to close(0), time, open(0) with gdb.

Without having to logship (copy buffers that are flushed and/or have newline characters in the stream) to a network or to local Arrow database files and/or SQLite vtables, journalctl (journald) supports pattern matching with -t syslogidentifier, -u unit, and -g grepexpr on the MESSAGE= field:

  journalctl -u <TAB>
  journalctl -u init.scope --reverse
  journalctl -u unit.scope -g "Reached target"  # and then "/sleep" to search and highlight with less
  
  journalctl -u auditd.service

  # this is slow because it's a full table scan, because
  # journald does not index the logfiles;
  # and -g/--grep is case insensitive if the query is all lowercase:
  journalctl -g avc --reverse
  journalctl -g AVC --reverse

  # this is faster:
  journalctl -t audit -g AVC -r

  # this is still faster,
  # because it only searches the current boot:
  journalctl -b 0 -t audit -g AVC

  # these are equivalent:
  journalctl -b 0 --dmesg -t kernel
  journalctl -k

  # search this boot's user journal, case-insensitively, with 3 lines of context:
  journalctl -b 0 --user | grep -i -C 3 "xyz123"
There is a GNOME Logs viewer that has 'All' and a few mutually exclusive filter/reports in a side pane, and a search expression field to narrow a filter/report like All or Important.

There is a Grafana Loki Docker Driver that logships from all containers visible on that DOCKER_HOST docker socket to Grafana for querying with Loki: https://grafana.com/docs/loki/latest/send-data/docker-driver...

Podman with Systemd doesn't need the Grafana Docker Driver (or other logshippers like logstash, loggly, or fluentd) because systemd spawns containers and optionally pipes their stdout/stderr logs to journald.

Influx has Telegraf, InfluxDB, Chronograf, and Kapacitor. Chronograf is their WebUI which provides a query interface for configurable chart dashboards and InfluxQL.

Grafana supports SQL, PromQL, InfluxQL, and LogQL.

Graylog2 also indexes logfiles.

But you can't query stdout and stderr that you or /sbin/init haven't logged to a file.

I use LogQL a fair amount. Oftentimes even just negative filtering is quite useful.

I do a fair amount of tracking down of issues with LogQL. Looking for logs specific to a customer support ticket. Filtering for logs by a traceId for distributed traces.

I have serious doubts this new UI is something I will care about at all.

The Explore UI for setting labels is atrocious and painful, and I'd rather just be given the text input for LogQL*

*: Please FFS someone fix the Ctrl+f creating a vscode like find dialog that only finds inside the text input. I never want to do a find specifically isolated to my LogQL

> Please FFS someone fix the Ctrl+f creating a vscode like find dialog that only finds inside the text input. I never want to do a find specifically isolated to my LogQL

Quick update: that has been fixed and will be available in Grafana 11.1.