Preview of Explore Logs, a new way to browse your logs without writing LogQL (grafana.com)
193 points by matryer 808 days ago
13 comments

I'm not really a cloud expert so maybe I'm fundamentally missing something about how I'm "supposed to work", but honestly all I have ever wanted to do, when looking at logs, is see the log from one process, from beginning to end, as a text file. You can of course do this using kubectl but only for the most recent two instances of a given pod which isn't helpful when investigating an incident that happened a while ago.

It seems nobody else cares about this use case and wants you to use LogQL and the incredibly clunky Grafana web UI instead, because it makes it possible to aggregate across many different processes, slice and dice by various labels, etc., which as I said, I have never (or almost never) actually wanted to do.

Hopefully this new UI is a step in the right direction as people won't need to futz around with LogQL anymore, but it seems like it still doesn't quite do what I want.

Just want to chip in and say that I wholeheartedly agree with you. I'm not a cloud developer either, but I'm regularly forced into what's apparently called "Google Cloud's operations suite" to grovel through logs. Compared to working with Linux journals using the tried and true text manipulation tools, it feels like looking through a straw with oven mitts on. I'd happily download a 500 MB text file instead, but there is an arbitrary limit to how much I can grab (10k lines IIRC). Maybe we're just out of touch.
> but I'm regularly forced into what's apparently called "Google Cloud's operations suite" to grovel through logs

Is this google cloud logging? If so, personally I quite like it, especially for looking through logs from multiple sources at the same time. Being able to put all your logs through there, and then search them with a simple query language, feels very convenient.

it's ridiculously slow. and compared to how expensive it is ... it's robbery in daylight.
It's fairly risky to download 500 MB of logs and analyse them locally on your machine. I know people do it anyway. Just saying.
Risky how exactly? If it has data in it that it shouldn't it's a problem no matter where it resides.
In theory a logfile could contain privileged information (indeed it almost certainly will - IP addresses etc), putting that on a laptop increases risk of losing it.
It is not about what data is in the log. It is about the fact that once the data is downloaded, it is most likely going to stay on the machine.

Logs often contain privileged info, or at least reveal a bit about how the application behaves. It is risky to do that.

If logs contain privileged info, then the damage is done the moment they're published. Whether or not they're downloaded to a laptop is irrelevant, the risk impact is the same.
Fwiw this is how I use Loki most of the time. Pick an app label, pick a time period, look at raw logs. The LogQL for this ends up something like `{app="workload-foo"}`. Loki is excellent at that.

Then if I know which pod I'll filter down to a specific pod with `{pod="workload-foo-1234"}`, sometimes I'll search for a specific term (error message etc) with `{pod="workload-foo-1234"} |= "error message"` then look at the logs around that. There's really no point writing complicated queries unless you need to.
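For anyone who hasn't seen LogQL, the narrowing described above (pick a stream by label, then filter lines by substring) can be modelled in a few lines of Python. This is only a toy illustration of the semantics, not how Loki works internally; the `select` helper and the sample entries are made up for the example.

```python
# Toy model of a LogQL stream selector plus a `|=` line filter.
# Not Loki's implementation; `select` and the sample data are hypothetical.

def select(entries, labels, contains=None):
    """Return lines whose labels match and, optionally, contain a substring."""
    matched = []
    for entry_labels, line in entries:
        if all(entry_labels.get(k) == v for k, v in labels.items()):
            if contains is None or contains in line:
                matched.append(line)
    return matched


entries = [
    ({"app": "workload-foo", "pod": "workload-foo-1234"}, "starting up"),
    ({"app": "workload-foo", "pod": "workload-foo-1234"}, "error message: disk full"),
    ({"app": "workload-foo", "pod": "workload-foo-5678"}, "error message: timeout"),
    ({"app": "workload-bar", "pod": "workload-bar-0001"}, "starting up"),
]

# Roughly {app="workload-foo"}
app_lines = select(entries, {"app": "workload-foo"})

# Roughly {pod="workload-foo-1234"} |= "error message"
pod_errors = select(entries, {"pod": "workload-foo-1234"}, contains="error message")
```

The point is just that the simple queries are a label match plus an optional substring test, nothing more exotic than that.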

That will, if I understand correctly, get the logs for one pod, not for one process. For example if the pod restarted 10 times you will not get 10 separate files from that query.
You'd have the label shown in the output that indicates the log line in question is from a different process/pod/container/host/whatever.
How so? The pod, container, and host labels should be the same for a process that crashes and is automatically restarted, no?
Even more than that, if you are running multiple instances of the app in multiple pods concurrently, then all of those logs will be joined together.
I'm not sure I really understand this.

If you mean one instance in each pod, then each should be labelled differently and you can filter down to one instance.

If you mean running multiple instances in each pod (and container?), then the standard kubectl log output will also have them all joined together. For both of those, you would need to add another unique identifier to each line, or run each instance in a separate container so you can submit the logs with the pod name and container name combined being the unique identifier.

That's definitely false
Why? If the pod is defined to spawn multiple containers, and each container runs the same application, then this seems true to me? Unless you would add an additional filter on the container name.
It's certainly true in my environment, maybe not others though? Apologies!
I'm an old fart so I use things like "cat" and "grep", and maybe "sed" and "cut" if the lines are particularly long.

I have one log file per day per host on my syslog server and can use "sort" to order across multiple files.

Loki was sold to me at fosdem a couple of years ago as this, but I still haven't got round to working it out, seems a very high barrier to entry compared with running cat.

> seems a very high barrier to entry compared with running cat

It really isn't. It's a single binary with a relatively simple configuration file, you throw logs at it via an API (which a bunch of logging agents support, and syslogs can be sent to it).

Then the actual queries aren't all that complex, it's just a difference of cd-ing to the correct folder for the date/server to be able to cat and grep vs writing a query that selects by server name and filters by date.

The learning curve and maintenance of Loki are quite minimal, but the value add is quite significant in most cases. Being able to do cross-host queries, metrics from logs (how many times did error X occur in the logs), as well as easy visualisations is pretty useful.

"to be able to cat and grep"

Admittedly I learned how to use basic tools 25 years ago, but that's an investment that can be used for decades.

  cat *web*log | grep "34.5.22.4" | sort -n | less
is hardly a complex thing to learn. Sure you can then build on that pipeline -- "cut -b -10|uniq -c" and if you want something really complex then you can use awk, or perl, or python, and do all sorts of things with the data.
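Since Python is mentioned as the next step up, here is a rough sketch of the same idea there: keep the lines that mention an address, sort them, cut each to its first 10 characters and count the runs, much like `cut -b -10 | uniq -c`. It uses a plain lexical sort rather than `sort -n`, and the sample lines and the `count_by_prefix` name are invented for illustration.

```python
# Rough Python counterpart of: grep ... | sort | cut -b -10 | uniq -c
# The sample lines are invented; real logs would be read from files.
from itertools import groupby


def count_by_prefix(lines, needle, width=10):
    """Keep lines containing `needle`, sort them, and count runs of equal prefixes."""
    matching = sorted(line for line in lines if needle in line)
    prefixes = [line[:width] for line in matching]
    # groupby on sorted input mirrors what `uniq -c` does with adjacent duplicates
    return [(prefix, len(list(group))) for prefix, group in groupby(prefixes)]


sample = [
    "2024-05-02 10:00:01 34.5.22.4 GET /",
    "2024-05-01 09:00:00 34.5.22.4 GET /login",
    "2024-05-01 09:00:05 10.0.0.1 GET /",
    "2024-05-01 09:30:00 34.5.22.4 POST /login",
]

counts = count_by_prefix(sample, "34.5.22.4")
```

With these sample lines the result is a per-day hit count for that address, which is the kind of quick report the shell pipeline gives you.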

Will whatever today's favoured log query/filter/etc be around in 25 years? Last time I looked at this people were going on about logstash and elasticsearch. Nobody could show me how to do the above command without touching the mouse.

Now sure, cat and grep can be sluggish on millions of lines (which is the main reason I'm tempted by loki or similar), and there's always some twat that comes along with "useless use of cat" [0], but the kind of pipeline processing serves me well and it seems a very different way to think about things when you need to access things from a database. Maybe I'm in a local maximum, but it's good-enough for me to find out what's going on.

[0] https://stackoverflow.com/questions/11710552/useless-use-of-...

"it's just a difference of cd-ing to the correct folder for the date/server" to be able to cat/grep.

You have to connect to your server, get to the correct folder, and then run the cats and greps which are easy (if you have to do some more advanced filtering with awk it gets more complicated.)

Connecting to Grafana and running a simple label query is practically the same in terms of complexity and time, but with vastly more features available.

> Will whatever today's favoured log query/filter/etc be around in 25 years? Last time I looked at this people were going on about logstash and elasticsearch. Nobody could show me how to do the above command without touching the mouse.

You can run ElasticSearch queries via the API, and can still do it today. I don't know about Logstash, but Loki is a statically compiled binary with only optional external dependencies. You'd still be able to run it in 25 years just fine.

Loki has a CLI tool called LogCLI. It's passable for needle-in-haystack searches, and the label browser is handy. But Loki doesn't handle multiline searches well. I'm with you on the ease of grep, sort, and uniq: it's pretty easy to fashion a quick report, sorted numerically, with no enterprise data analysis suite needed.
For analysing text logs lnav is pretty good, if you need to work with a live updated view of the log in response to commands.
You may be amazed at how hard these tools are to get started with relative to that. I have been thoroughly unimpressed with and unable to really get started with any of these tools because of the overemphasis on cloud. Not sure what people were doing before, but sshing to the prod box kinda sucks.
If you’re debugging something simple or non-distributed, this product isn’t for you.

If you’re working on anything distributed, log aggregation becomes a must. But, also, if you’re working on anything distributed and you’re looking at logs, you’re desperate. Distributed traces are so much higher quality.

When I formed these opinions I was working on Materialize, which is basically the polar opposite of "simple and non-distributed". However it was still quite common that I knew exactly which process was doing something weird and unexpected.
Maybe it’s the difference between tracking a bug (abnormal operation) vs understanding behavior of a complex system (normal operation)?
Yup and the reason no one markets something like "tail the logs for server X" is because, if you're talking in the context of an individual server, you're too small for anyone to care about.
I've got logs from hundreds of servers that I use standard tools to look at, and that's a small system. Centralising logs has been a thing for decades.
Which is fine, I'm just saying you're not the target market for the big observability vendors.

The current generation of observability tools is built for distributed systems that are basically too complex to reason about, and so you have other ways of monitoring and debugging them. When you have 10's of k's of ephemeral containers running hundreds of services, you can't just look at some logs for a server to understand what's going on (ignoring the fact that servers aren't even a primitive in this system).

10's of GBs of logs a day just doesn't move the needle on pricing. They want the customers that are going to generate 7 figures in revenue and those customers aren't talking about aggregating logs from a few hundred servers.

Sorry, did plenty of "distributed" tracing back in the day and this is just not the case. I can't help but feel like you're after-the-fact rationalizing as if you need this for diagnosing anything "distributed" or "complicated".

Distributed anything is actually easier in most cases because you will always have input and output. Sure, if you're debugging a complicated and coordinated "dance" between two concurrent threads/processes then yeah fully agreed, but then you're deep in uncharted territory and you need all the help you can get.

> maybe I'm fundamentally missing something about how I'm "supposed to work", but honestly all I have ever wanted to do, when looking at logs, is see the log from one process, from beginning to end, as a text file.

This is still a valid use case, but pretend for a minute you have thousands or millions of log lines to inspect. Even after filtering for ERROR level only, you still have too many errors that devs swear "are normal" (but do not fix). And maybe the data you need to diagnose isn't even in ERROR!

The solution? Use log queries to compare a normal and abnormal process or cluster, group them by some kind of fingerprint, then apply some Laplace smoothing or other Bayesian techniques to score fingerprints by strength of association with abnormal. This lets me rapidly identify problems at scale that would otherwise take hours of poring over logs to exclude stuff by hand.

This works any time you can divide logs into "good" and "bad." Example scenarios:

- canary analysis, comparing canary and baseline

- single faulty pod in a deploy, comparing the bad container to the n good ones

- one AZ or region in a multi-region deploy

- now versus yesterday, or versus an hour ago, etc

- Android versus iPhone

  > then apply some Laplace smoothing or other bayesian techniques to score fingerprints by strength of association with abnormal
I would love to hear more about this process.
The simplest technique, and the one I currently use, is just "(n+bad)/(n+good)" where n is basically the strength of a prior belief that bad/good = 1. At some level I think this might replicate TF-IDF[1] but I haven't sat down to prove it or find where they diverge.

[1]: https://en.wikipedia.org/wiki/Tf%E2%80%93idf
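A minimal sketch of that scoring, assuming each log line has already been reduced to a fingerprint. Here the `fingerprint` helper is hypothetical and simply masks digits; the sample lines and the choice of n = 1 are illustrative too.

```python
import re
from collections import Counter


def fingerprint(line):
    """Hypothetical fingerprint: collapse runs of digits so similar lines group together."""
    return re.sub(r"\d+", "<n>", line)


def score_fingerprints(good_lines, bad_lines, n=1.0):
    """Score each fingerprint as (n + bad) / (n + good), highest first."""
    good = Counter(fingerprint(line) for line in good_lines)
    bad = Counter(fingerprint(line) for line in bad_lines)
    scores = {
        fp: (n + bad[fp]) / (n + good[fp])  # Counter returns 0 for unseen keys
        for fp in set(good) | set(bad)
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


good_lines = ["request 17 ok", "request 18 ok", "cache miss for key 4"]
bad_lines = ["request 19 ok", "timeout talking to db 3", "timeout talking to db 7"]

ranked = score_fingerprints(good_lines, bad_lines)
```

Fingerprints that only show up in the bad group float to the top, ones common to both hover around 1, and a larger n pulls everything toward 1 so that rare lines count for less.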

But this still requires you to classify each line manually to determine bad or good, no?
Not manually, it just requires you to be able to group them along a dimension of interest. For example, if I get a page from us-east-1a, I can compare all the logs from that against us-east-1b. Or, you can compare all the logs from the hour after the incident started with the same hour a day ago (or a week ago).

I pulled this technique from canary analysis and applied it to production outage analysis. In canary, you have a guaranteed random stable population that lets you perform accurate comparisons. Elsewhere, we can try to make that assumption but it might break down. For example, regional holidays can radically alter customer behavior over time or between regions. So it's not perfect but it's often good enough to provide me insights while on call.

And, it requires advanced log queries to perform all these filtering, grouping, counting and scoring functions.

> I'm not really a cloud expert so maybe I'm fundamentally missing something about how I'm "supposed to work", but honestly all I have ever wanted to do, when looking at logs, is see the log from one process, from beginning to end, as a text file.

That's the rub that I think you are missing. In distributed and/or cloud environments it is quite unusual for there to be a single end-to-end process, and thus we need new ways to trace across a system.

In harmony with tracing, we also need the aggregated view _across_ the estate to understand system hotspots, levels of throughput, redundant infrastructure, error rates, etc.

Dump the logs into elastic, loki or whatever, along with pod name as a label. Usually I use Kibana, so I don't want to speak for Loki, but it seems pretty straightforward.
You missed the key criterion, which is being able to see the logs from that process "as a text file", or the way I'd rephrase it "with the same ease of a text file."

Kibana is ok (definitely beats grep) when you want to look across a fleet and determine if a specific thing is happening. But when you have a specific symptom that happens on a particular instance, what you want to do is see logs in the order they happened, and Kibana isn't close. Querying and viewing logs are just slow and cumbersome relative to less/grep.

Well, honestly I don't understand what's missing - you just pick time window, instance and have logs displayed line-by-line as they happened.

Best to configure a view for this to limit columns and maybe pre-configure some filters. Also annotate your logs with a timestamp, so you rely on the time of the event and not the time of ingestion.

But both of these are one-time configuration steps, and then you can simply scroll.
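To illustrate why the event timestamp matters: if lines arrive out of order, sorting on a timestamp carried in the line itself restores the order in which things happened. This sketch assumes every line starts with an ISO 8601 timestamp followed by a space; the sample lines and the `sort_by_event_time` name are hypothetical.

```python
from datetime import datetime


def sort_by_event_time(lines):
    """Sort lines by the ISO 8601 timestamp assumed to start each line."""
    def event_time(line):
        stamp, _, _ = line.partition(" ")
        return datetime.fromisoformat(stamp)
    return sorted(lines, key=event_time)


# Lines in the order they were ingested (hypothetical sample data).
ingested = [
    "2024-05-01T10:00:02 worker finished job",
    "2024-05-01T10:00:00 worker picked up job",
    "2024-05-01T10:00:01 worker started job",
]

ordered = sort_by_event_time(ingested)
```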

> Well, honestly I don't understand what's missing - you just pick time window, instance and have logs displayed line-by-line as they happened.

What's missing is that I don't want to learn and use some clunky web UI in order to do this. I want the UI to be "download this text file" and then use the tools I already know and understand (local text processing utilities and text editors)

This seems like a solution for pets. If you have a lot of pets, this sounds totally reasonable, but it isn't some universal truism. People are moving away from pets as they're often harder to work with than cattle. That also means you need observability aggregation which can make sense of what's happening everywhere, not just in one instance on one machine.
If the GUI is the main issue, you can use a CLI client to extract data from elastic :-)
If you get a chance, please check out kubetail (https://github.com/kubetail-org/kubetail). It's an open source log viewer for Kubernetes. Currently you can use it to look at pod logs from beginning to end, grouped together by workload (e.g. Deployment, CronJob) with basic filtering available (e.g. node-id, AZ). It doesn't let you look at historical logs yet but that's where we're headed. We just launched so we're eager for feedback and we like to build out new features quickly.
Interesting. Will be following this tool.

There is a CLI tool with the same name that does something similar - https://github.com/johanhaleby/kubetail

Could LogQL do something like

  select * from stdout, stderr
  where session_id = 123456
? If not, why?
yes it can, if you tag your log stream correctly - either by having the stream externally tagged via attributes, or internally by following certain conventions in the log line.

You can also do something like

select client_ip from requests where elapsed_ms > 10000

which is incredibly powerful
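To make the SQL analogy concrete, here is a sketch that runs that same query over a handful of parsed log entries using Python's built-in `sqlite3` module. It only illustrates the relational view of logs and says nothing about how LogQL or Loki would evaluate such a filter; the table layout and the sample rows are assumptions.

```python
import sqlite3

# Hypothetical, already-parsed request log entries: (client_ip, elapsed_ms)
rows = [
    ("10.0.0.1", 120),
    ("10.0.0.2", 15000),
    ("10.0.0.3", 9800),
    ("10.0.0.4", 32000),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE requests (client_ip TEXT, elapsed_ms INTEGER)")
conn.executemany("INSERT INTO requests VALUES (?, ?)", rows)

# The query from the comment above, with an ORDER BY for stable output.
slow_clients = [
    ip
    for (ip,) in conn.execute(
        "SELECT client_ip FROM requests WHERE elapsed_ms > 10000 ORDER BY client_ip"
    )
]
conn.close()
```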

yep, with the caveat that you probably don't want to have the backend of whatever log system you use (not exactly sure how Loki does it) to have an index on something as high-cardinality as session id so that query could get slow.

But these log query systems can also optimize these queries, for instance by sampling, using distributed trace IDs to ensure you get shown corresponding logs, allowing you to get only logs where at least one step in the trace errored, etc.

strace and gdb can trace and close and reopen process file handles 0,1,2.

ldpreloadhook has an example of hooking write() with LD_PRELOAD=, which e.g. golang programs built without libc don't support.

When systemd is /sbin/init, it owns all subprocess' file handles already, so there's no need to close(0), time, open(0) with gdb.

Without having to logship (copy buffers that are flushed and/or have newline characters in the stream) to a network or local Arrow database files and/or SQLite vtables,

journalctl (journald) supports pattern matching with: -t syslogidentifier, -u unit; and -g grepexpr of the MESSAGE= field:

  journalctl -u <TAB>
  journalctl -u init.scope --reverse
  journalctl -u unit.scope -g "Reached target"  # and then "/sleep" to search and highlight with less
  
  journalctl -u auditd.service

  # this is slow because it's a full table scan, because
  # journald does not index the logfiles;
  # and -g/--grep is case insensitive if the query is all lowercase:
  journalctl -g avc --reverse
  journalctl -g AVC --reverse

  # this is faster:
  journalctl -t audit -g AVC -r

  # this is still faster,
  # because it only searches the current boot:
  journalctl -b 0 -t audit -g AVC

  # these are equivalent:
  journalctl -b 0 --dmesg -t kernel
  journalctl -k

  # search this boot's user journal with grep; -C needs a line count (3 here is arbitrary):
  journalctl -b 0 --user | grep -i -C 3 "xyz123"
There is a GNOME Logs viewer that has 'All' and a few mutually exclusive filter/reports in a side pane, and a search expression field to narrow a filter/report like All or Important.

There is a Grafana Loki Docker Driver that logships from all containers visible on that DOCKER_HOST docker socket to Grafana for querying with Loki: https://grafana.com/docs/loki/latest/send-data/docker-driver...

Podman with Systemd doesn't need the Grafana Docker Driver (or other logshippers like logstash, loggly, or fluentd) because systemd spawns containers and optionally pipes their stdout/stderr logs to journald.

Influx has Telegraf, InfluxDB, Chronograf, and Kapacitor. Chronograf is their WebUI which provides a query interface for configurable chart dashboards and InfluxQL.

Grafana supports SQL, PromQL, InfluxQL, and LogQL.

Graylog2 also indexes logfiles.

But you can't query stdout and stderr that you or /sbin/init haven't logged to a file.

I use LogQL a fair amount. Often times even just negative filtering is quite useful.

I do a fair amount of tracking down of issues with LogQL. Looking for logs specific to a customer support ticket. Filtering for logs by a traceId for distributed traces.

I have serious doubts this new UI is something I will care about at all.

The explore UI for setting labels is atrocious and painful, and I'd rather it just gave me the text input for LogQL*

*: Please FFS someone fix the Ctrl+f creating a vscode like find dialog that only finds inside the text input. I never want to do a find specifically isolated to my LogQL

> Please FFS someone fix the Ctrl+f creating a vscode like find dialog that only finds inside the text input. I never want to do a find specifically isolated to my LogQL

Quick update: that has been fixed and will be available in Grafana 11.1.

I recently set up Victoria Metrics + https://github.com/prometheus/snmp_exporter + Grafana to start tracking bandwidth on the top-of-rack switches in my datacenter rack, which has been a pretty awesome setup. The way you can auto-generate a config for your SNMP MIBs with SNMP Exporter was unexpectedly not a terrible experience.

My next task is to get centralized logging going with Victoria Logs + Vector, I'll have to check this out once I get everything set up. I believe I can use LogQL with Victoria Logs but I haven't tried it out yet. https://docs.victoriametrics.com/victorialogs/logsql/

This is what I've been doing on my cluster:

https://github.com/nklmilojevic/home/blob/main/kubernetes/ap...

https://github.com/nklmilojevic/home/tree/main/kubernetes/ap...

Here you have Vector in aggregator + agent mode and several sources. VictoriaLogs also recently added a Grafana datasource, so it is fairly easy to set up:

https://github.com/nklmilojevic/home/blob/main/kubernetes/ap...

I'm a big fan of VictoriaMetrics as well and we use it extensively in my company at high scale.

I've been eyeing a VictoriaLogs setup for my docker container fleet, but I haven't quite spotted where docker's remote logging export options overlap with VictoriaLogs ingestion options.

Wrinkle: two docker remote logging plugins I tried (e.g. loki, elastic) didn't seem to work on ARM processors out of the box.

Check out Vector for shipping logs from Docker. It might work out for you https://vector.dev/docs/reference/configuration/sources/dock...

I use Podman for all of my container stuff and there are issues with how Podman produces JSON logs

https://github.com/vectordotdev/vector/issues/6807 https://github.com/containers/podman/issues/16317

which needs to get fixed before I can use it for my workloads.

I'll preface this with the fact that I haven't looked at Loki in a bit, so maybe this has changed. But I found the documentation needing a lot of work and the configuration for promtail to be obtuse and not very user friendly. I haven't used it for those reasons, not because of the query language.
Our team uses loki and I have to say I think their collected helm charts are pretty easy to use - my problem is more that it seems to be quite slow to run on-prem. Very often my loki query times out and I have to do more work filtering down the log lines or selecting a narrower time range.

I'm kind of amazed the UI doesn't select small time ranges iteratively to build up the response, especially since I believe this is what the CLI does. Perhaps this is something their cloud offering provides and it is part of their marketing strategy. Not a good one, because if it came down to a decision I would start by looking for something else, having been p'ed off by Loki.

But I guess it still works pretty well considering it is free.
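The "small time ranges, iteratively" idea is easy to sketch: chop the requested range into windows, query each in turn, and stop once enough lines have come back. This is just an illustration of the approach, not how Grafana or LogCLI actually behave; `run_query` is a hypothetical callable standing in for a real backend call, and `fake_backend` exists only so the example runs.

```python
from datetime import datetime, timedelta


def split_range(start, end, step):
    """Yield consecutive (window_start, window_end) pairs covering [start, end)."""
    cursor = start
    while cursor < end:
        window_end = min(cursor + step, end)
        yield cursor, window_end
        cursor = window_end


def query_in_windows(run_query, start, end, step, limit):
    """Call `run_query` once per window and stop after `limit` lines.

    `run_query(window_start, window_end)` is a hypothetical callable that
    returns a list of log lines for that window.
    """
    lines = []
    for window_start, window_end in split_range(start, end, step):
        lines.extend(run_query(window_start, window_end))
        if len(lines) >= limit:
            return lines[:limit]
    return lines


def fake_backend(window_start, window_end):
    # Stand-in backend: one line per window, for demonstration only.
    return [f"line from {window_start:%H:%M} to {window_end:%H:%M}"]


start = datetime(2024, 5, 1, 0, 0)
end = datetime(2024, 5, 1, 1, 0)
windows = list(split_range(start, end, timedelta(minutes=25)))
result = query_in_windows(fake_backend, start, end, timedelta(minutes=25), limit=2)
```

Each small query has a much better chance of finishing before the timeout, and partial results can be shown while later windows are still being fetched.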

Loki UI in Grafana Explore seems to only select 1000 lines by default for me?

Also Loki on the backend splits/parallelizes requests if it can.

The Grafana backend Mimir / Loki / Tempo products all appear to be architected pretty similarly, and I'm more experienced operating Mimir, but the answer to read load often just has to do with right-sizing the deployment scale and using caches aggressively.

Having used Loki/promagent etc, it was sort of a pain/nonintuitive to set up.
It's also incredibly easy to shoot yourself in the foot and rack up huge cloud bills - something we recently hit: https://github.com/grafana/loki/issues/8756
They gotta sell their managed cloud service somehow, I have always assumed that this is part of the sales strategy
Loki OSS is just a sales pitch for their managed service. It doesn't work well without dedicating significant time tweaking and configuring it. Documentation is confusing at best if you want to do anything serious. You have to also be ready to handle support calls if you open it up for others to use, because it WILL have issues fairly regularly if you have a good volume of logs and query range is more than a day or two.

Unless you have the bank to go with their managed service, don't bother.

Why have explore logs as a separate app instead of bundled with Loki? It would be nice if Loki had the same kind of barebones querying/debugging functionality as Prometheus...
Loki is just the backend just like Prometheus is just the backend
Yeah but Prometheus has a web ui where you can run PromQL queries and it'll give you basic graphs back, which is handy for throwing a quick query at it before putting it into something more long-term like a Grafana dashboard or an alerting rule.
wow I always thought that Prom UI was just a tacked-on part of alertmanager or something because it’s so rudimentary. In my experience, everyone just uses Grafana Explore since that’s what Grafana was originally purpose-built for and it’s crazy easy to set up. Just pull down a container or helm chart or whatever.

Since Grafana built Loki, it doesn't make any sense why Grafana would create a separate querying UI app for Loki when they already have Grafana Explore. Prometheus (and I assume its UI) was created by Google [edit: sorry, created by SoundCloud, inspired by a Google Borg tool] before Grafana became the de facto Prometheus query UI, so it’s not really analogous.

Prometheus was created by SoundCloud, not Google, but was inspired by the Google Borgmon tool.
LogQL so far just does not click for me. I get that it's trying to be like Prometheus, but logs are not the same as time series - we have each and every log! So why am I forced to query it like a time series data source?

I want to query my logs like a SQL table, not a time series database.

I thought it was a standalone web app, but it's integrated into Grafana. I'm confused a bit. There's already Explore functionality in Grafana for Loki. Seems like spreading the efforts for no reason.
After having used Datadog for several years, going back to Grafana / Loki / Prometheus felt like regressing by two decades. As much as I appreciate free solutions, I feel like Grafana has really fallen behind when it comes to developer experience
Grafana cloud is better for querying logs. Grafana cloud is probably a bit better for querying metrics. Grafana cloud is terrible at finding traces or even loading them. Datadog is lightyears ahead. For alerting I feel datadog has better features but is overwhelming with all the different options.

grafana is very quirky for searching for traces. And has a huge learning curve.

Could you provide more details? Although I've never had the opportunity to use Datadog at any of my previous positions, I am quite familiar with Grafana and I'm generally pretty happy with it.

What's the TL;DR for why Datadog is better?

This is hard coded to searching for service_name in the query which doesn't return any data for me.

I will stop wasting my time here and try the Metric Explore panel

After going all in on tracing and ignoring logs entirely... I gotta say I'm really glad I never have to deal with logs.
LogQL is honestly fine, it's not SQL, but it's fine
I am a happy customer of Grafana Cloud, yet, I can't use their backends for logs and metrics as they are terribly expensive and slow.

Somehow, VictoriaMetrics manages to provide much better results.

While this is a step in the right direction, just let me write something closer to SQL. Influx did this correctly.
Influx used SQL, then deprecated it and made everyone use Flux, then deprecated that and moved back to SQL. They are definitely not "doing it correctly" when it comes to query language.
That’s what LogQL is, which is already in Loki. This is a new feature.
LogQL is nothing like SQL. Try aggregating and quickly you'll be asking yourself wtf an instant query is and how is that different.
I mean, time series data is different from generic tabular data, so obviously there are impedance differences that are reflected in the query languages. I can see how some people might feel more at home using something even more like SQL, but there are a lot of common use cases where SQL is awkward and/or more verbose.
Are logs time series data? That seems to be the thesis behind LogQL. But way more often than not I'm searching for a needle in a haystack, not charting trends over time.