by beardface 1721 days ago
I believe the complaints here are a case of 'not using it correctly'.

The 'Reverse index' (Lucene's inverted index) is a fundamental data structure used to enable very fast search. Other data structures, like KD trees, are used for non-text data types. If you're not doing full text search, don't use `text` fields. If you're not querying the data, why store it in the first place?
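For anyone who hasn't looked under the hood, here is a toy sketch of the inverted-index idea in Python. It is not Lucene's implementation; the tokenizer, the function names, and the sample log lines are all made up for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Toy analyzer: lowercase, then split on anything that is not a-z or 0-9."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every query term (AND semantics)."""
    terms = tokenize(query)
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


# Hypothetical log lines, keyed by document id.
logs = {
    1: "Disk full on /dev/sda1",
    2: "Service nginx has stopped",
    3: "User account alice is blocked",
    4: "Disk usage at 70% on /dev/sdb1",
}
index = build_inverted_index(logs)
print(search(index, "disk full"))  # {1}
```

The lookup cost depends on the number of query terms and the size of their posting sets, not on the total number of log lines, which is why this structure makes search fast.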

Full text search is incredibly useful for log files when combined with alerting. If you get log entries indicating that a disk is full, a service has stopped, or a user account is blocked, Elasticsearch can (with the right license) send emails or post on Slack.

Static mappings can be a pain but if you're constantly increasing the maximum field count for an index, use different indices for different log sources. Come up with an index pattern or alias that allows querying all those indices at the same time.
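As a rough sketch of the alias approach: the snippet below only builds the JSON body you would send to Elasticsearch's `POST /_aliases` endpoint. The alias name and the index patterns are hypothetical, and nothing here talks to a cluster.

```python
import json


def alias_actions(alias, index_patterns):
    """Build a request body for POST /_aliases that adds each pattern to one alias."""
    return {
        "actions": [
            {"add": {"index": pattern, "alias": alias}}
            for pattern in index_patterns
        ]
    }


# Hypothetical per-source indices, all reachable through a single alias.
body = alias_actions("logs-all", ["logs-nginx-*", "logs-app-*", "logs-audit-*"])
print(json.dumps(body, indent=2))
```

As far as I know, wildcard patterns in an alias action are resolved when the request runs, so indices created later would need to be added as well (an index template can take care of that). Queries then go to `logs-all/_search` instead of naming every index.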

The main task here is reconciling the different logs so the index mappings are easily searchable, effectively as a union. Elastic Common Schema helps a lot with this. Elasticsearch mappings are easier to build when you first consider the queries you're going to be running on the data. You can then design the mapping with the right structure, field types, and settings.

3 comments

My point was that by the time you filter on keyword fields (and other exact matching fields), the number of logs is small enough that an efficient full text search isn't necessary. That doesn't mean that full text search itself isn't useful, just that maintaining an inverted index is overkill in the logging case.
This has been my experience. Obviously different people use logs for different things, but in my case I'm usually looking for information about something bad that already happened, within a very specific window of time, and within a specific section of the application. 99% of the time, that means I am filtering until there are only a handful of entries that match, at which point I don't need full text search at all.
I am not really sure about this.

A few days ago, a colleague asked me why a certain Google cloud instance does not exist. I did not know either, so I searched for this name in the Google audit log, and found when and by whom it was decommissioned.

But it was a full-text search, given the instance name. I probably could do it (in theory) as a field match, if I knew which field it was, and which format it was in (I am talking about project/abc/location/xyz type of junk that precedes the actual instance name).

And yes it was slow (this instance was deleted months ago, and Google tries to search the most recent logs first).

This sounds like the 1% of my experiences not served by filtering.

Naturally your experiences will be different from mine!

Completely agree.

My gripe with ES is that it won't let you do post-pass filtering at all. If you create an index with a few keyword fields indexed and then some unindexed fields, you can't query the unindexed fields.

Grafana's Loki seems to be exactly what we are looking for, although I haven't played with it.

I guess what they want is to use the Elasticsearch query language but let it optionally do “expensive” non-indexed filtering like a SQL database would let you do.

Without knowing for sure, I imagine they originally expected the application side to handle this, but many of the current solutions don’t do that. Instead they expose the Elasticsearch query language as the primary search interface with no additional app logic. The Elasticsearch query “is” the search application.

I'm making some assumptions, but this might reconcile the different viewpoints on why it does or doesn’t make sense.

The problem with this thinking is that in most cases having the server send all of the data back to the client to do their own search is going to be far more expensive than running the search (even of unindexed data) on the server.

And I am only talking about server-side costs here, as moving data between server and client has costs both in serialization and transmission. Yes, I can make up regexes that wind up throwing this cost comparison out the window (e.g.: lookbacks), but in the vast majority of cases this is true.

I think the main reason that Elasticsearch does not do this is that they would have to provide either grep-like or regex support, and both would provide different answers than the lexical search system they provide otherwise. Explaining the differences to clients would be a nightmare.

Note: in most places I wind up using Elasticsearch I absolutely hate that it is lexical search rather than grep or regex... especially when I am looking for exact text. This is particularly a problem in Jira where I have to be very careful about word boundaries.
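To make the "different answers" point concrete, here is a small Python sketch comparing token-based (lexical) matching with a grep-style substring match. The tokenizer is a deliberately naive stand-in and not what any real Elasticsearch analyzer does (real analyzers may stem, drop stop words, and so on); the log line is invented.

```python
import re


def tokenize(text):
    """Naive analyzer: lowercase and split on non-alphanumeric characters."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def lexical_match(query, text):
    """True if every query token appears as a whole token in the text."""
    tokens = set(tokenize(text))
    return all(q in tokens for q in tokenize(query))


def grep_match(pattern, text):
    """True if the literal pattern occurs anywhere in the text (like grep -F)."""
    return pattern in text


line = "connection timeout while reconnecting to db-primary"

# Substring of a word: grep finds it, token matching does not.
print(lexical_match("connect", line), grep_match("connect", line))        # False True

# Same words in a different order: token matching finds it, grep does not.
print(lexical_match("primary db", line), grep_match("primary db", line))  # True False
```

Neither answer is wrong; they are answers to different questions, which is exactly what would be painful to explain to users if both modes lived behind one query box.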

You can update the index with the new field specification and reindex your content.

Your complaint really doesn't make sense. How would you query an unindexed field? Elasticsearch is a _search_ engine, which means it needs to index content that is to be discoverable. What you're saying with unindexed fields is you're completely fine with those not being included in any search or filtering.

Your response is, "Why can't you perfectly predict which columns/keywords will be necessary later on, or otherwise re-index the whole system at the drop of a hat for one query? And why would you think a search engine would be able to perform an unindexed, ad hoc search?"

Compared to my experience, you have a foundational difference of understanding with how systems are actually used.

You use the index to identify a subset of records and scan those for unindexed criteria.

It's ok to fail if the indexed criteria are not selective enough. In fact it's usually preferable to a long timeout.
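A minimal sketch of that two-phase idea in Python, assuming an in-memory keyword index; every name here (`two_phase_search`, `TooManyCandidates`, `max_scan`, the sample records) is hypothetical and not an Elasticsearch feature.

```python
class TooManyCandidates(Exception):
    """Raised when the indexed filters are not selective enough to scan safely."""


def build_keyword_index(records, fields):
    """Map (field, value) pairs to the set of record ids holding that value."""
    index = {}
    for rec_id, rec in records.items():
        for field in fields:
            index.setdefault((field, rec[field]), set()).add(rec_id)
    return index


def two_phase_search(records, keyword_index, indexed_filters, predicate, max_scan=1000):
    # Phase 1: use the index to narrow down to a candidate set.
    candidate_ids = None
    for field, value in indexed_filters.items():
        ids = keyword_index.get((field, value), set())
        candidate_ids = ids if candidate_ids is None else candidate_ids & ids
    if candidate_ids is None:
        raise ValueError("at least one indexed filter is required")

    # Fail fast instead of running a long brute-force scan.
    if len(candidate_ids) > max_scan:
        raise TooManyCandidates(f"{len(candidate_ids)} candidates exceed max_scan={max_scan}")

    # Phase 2: scan only the candidates for the unindexed criterion.
    return [records[i] for i in sorted(candidate_ids) if predicate(records[i])]


records = {
    1: {"service": "api", "level": "error", "message": "timeout talking to db-primary"},
    2: {"service": "api", "level": "info", "message": "request served"},
    3: {"service": "api", "level": "error", "message": "disk full"},
    4: {"service": "web", "level": "error", "message": "timeout rendering page"},
}
index = build_keyword_index(records, ["service", "level"])

hits = two_phase_search(
    records,
    index,
    {"service": "api", "level": "error"},
    lambda rec: "timeout" in rec["message"],  # unindexed, grep-style check
)
print([rec["message"] for rec in hits])  # ['timeout talking to db-primary']
```

The `max_scan` cutoff is the "ok to fail" part: the caller gets an immediate error telling them to add more selective filters rather than waiting on a timeout.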

Exactly, it's a search engine. It probably doesn't make sense to use it as a storage engine for logs unless you need to search all of them efficiently.
There's also cLoki. It's a new project that puts a Loki gateway over a ClickHouse backend store. We're looking at it and plan a presentation from the author(s) at the next ClickHouse SF Bay Area Meetup.

https://github.com/lmangani/cLoki

Will runtime fields help you with post-pass filtering? https://www.elastic.co/blog/introducing-elasticsearch-runtim...
I have little knowledge of the log aggregation domain, but generally indices are great for read mostly loads. It seems to me that for log aggregation writes are more frequent than searches; cheap writes and the occasional brute force search.

For alerting you might be better off running each new line against a set of filters/watchers. It seems wasteful to run it after indexing.

Again, no experience or knowledge on the domain, so I might be completely off.
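Something like the following is what checking each line at ingest time could look like. This is only a sketch: the watcher names, the regexes, and the `ingest` function are invented for illustration and are not part of any particular log shipper.

```python
import re

# Hypothetical watchers: a name mapped to a compiled pattern.
WATCHERS = {
    "disk_full": re.compile(r"disk (is )?full|no space left on device", re.IGNORECASE),
    "service_stopped": re.compile(r"service \S+ (has )?stopped", re.IGNORECASE),
}


def check_line(line, watchers=WATCHERS):
    """Return the names of all watchers whose pattern matches the line."""
    return [name for name, pattern in watchers.items() if pattern.search(line)]


def ingest(lines, store, alert):
    """Run every incoming line past the watchers, then append it to cheap storage."""
    for line in lines:
        for name in check_line(line):
            alert(name, line)
        store.append(line)


store, alerts = [], []
ingest(
    [
        "Service nginx has stopped",
        "GET /health 200",
        "write failed: No space left on device",
    ],
    store,
    lambda name, line: alerts.append(name),
)
print(alerts)  # ['service_stopped', 'disk_full']
```

The write path stays cheap (an append), and the alerting cost scales with the number of watchers per line rather than with index maintenance.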

> I have little knowledge of the log aggregation domain, but generally indices are great for read mostly loads.

Generally, you write and read to/from the same index in Elasticsearch. Where this falls apart is that you'll often want to change the configuration for an index based on whether it's write or read heavy. The main thing that changes in this scenario is the number of primary and replica shards (Lucene indices) for the Elasticsearch index.

Indices with a high write, low search workload will generally require more primary shards and fewer replicas. Low write, high search workloads require the opposite: fewer primaries and more replicas.
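To illustrate, here is a small Python sketch that builds index-creation bodies using the `number_of_shards` and `number_of_replicas` settings. The specific numbers are placeholders for illustration, not recommendations, and the helper names are made up.

```python
def index_settings(primaries, replicas):
    """Build an index-creation body with the given primary and replica counts."""
    return {
        "settings": {
            "number_of_shards": primaries,    # fixed when the index is created
            "number_of_replicas": replicas,   # can be changed on a live index
        }
    }


def total_shard_copies(body):
    """Each primary has `replicas` copies, so total = primaries * (1 + replicas)."""
    settings = body["settings"]
    return settings["number_of_shards"] * (1 + settings["number_of_replicas"])


# Illustrative values only: tune from monitoring, not from this snippet.
write_heavy = index_settings(primaries=6, replicas=1)
read_heavy = index_settings(primaries=1, replicas=3)

print(total_shard_copies(write_heavy))  # 12
print(total_shard_copies(read_heavy))   # 4
```

The total shard copy count is what the hosts actually have to carry, which is why turning both knobs up at once gets expensive quickly.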

The problem comes when you need high write and high search rates. Using a single cluster with lots of primaries and lots of replicas will overwhelm the hosts and you end up with terrible performance. The general pattern with Elasticsearch is to run two clusters. Index into one cluster, then use cross-cluster-replication (CCR) into a different cluster you run queries against.

There's an incredible amount of nuance to all of this. I've worked with many clusters and they all have different usage and configuration requirements. There's no magic formula for calculating configuration values; it all comes down to experience, monitoring, and experimentation.

At the core of Lucene, as you index a document, it first creates an index containing a single document, and everything else is merge operations (operating in log N, merging larger and larger chunks). So the nice thing is that you can use the same query language, in fact the exact same implementation, to run a search query in alerting mode. You would create this single document index (which you'd do anyway to make it searchable) and run the query against it before adding it to the other documents.
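Here is a toy model of that flow in Python. It is not Lucene's API or its segment format; the dictionary-based "segments", the simple AND queries, and all function names are assumptions made purely to show the order of operations: index one document, run the alert queries against that tiny index, then merge.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Toy analyzer: lowercase and split on non-alphanumeric characters."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def index_single(doc_id, text):
    """Build a tiny index (a 'segment') that holds exactly one document."""
    segment = defaultdict(set)
    for term in tokenize(text):
        segment[term].add(doc_id)
    return segment


def matches(segment, query):
    """Same query code for any segment size: ids containing every query term."""
    terms = tokenize(query)
    if not terms:
        return set()
    return set.intersection(*(segment.get(t, set()) for t in terms))


def merge(a, b):
    """Merge two segments by unioning their posting sets."""
    merged = defaultdict(set)
    for seg in (a, b):
        for term, ids in seg.items():
            merged[term] |= ids
    return merged


def ingest(doc_id, text, main_index, alert_queries):
    """Index one document, run alert queries on it, then merge it into the main index."""
    segment = index_single(doc_id, text)
    fired = [name for name, query in alert_queries.items() if matches(segment, query)]
    return merge(main_index, segment), fired


alert_queries = {"disk_full": "disk full", "blocked": "account blocked"}  # hypothetical
main = defaultdict(set)
main, fired_1 = ingest(1, "Disk full on /dev/sda1", main, alert_queries)
main, fired_2 = ingest(2, "User account bob is blocked", main, alert_queries)

print(fired_1, fired_2)       # ['disk_full'] ['blocked']
print(matches(main, "disk"))  # {1}
```

The point is that `matches` is the same function in both places: it serves alerting against the one-document segment and ordinary search against the merged index.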
Not indexing the log lines in Loki doesn't mean you can't run complex queries on Loki. I've made a video to explain this concept: https://youtu.be/UiiZ463lcVA