Hacker News
by vimda 1721 days ago
My point was that by the time you filter on keyword fields (and other exact-match fields), the number of logs is small enough that an efficient full-text search isn't necessary. That doesn't mean that full-text search itself isn't useful, just that maintaining an inverted index is overkill in the logging case.
2 comments

This has been my experience. Obviously different people use logs for different things, but in my case I'm usually looking for information about something bad that already happened, within a very specific window of time, and within a specific section of the application. 99% of the time, that means I am filtering until there are only a handful of entries that match, at which point I don't need full text search at all.

I am not really sure about this.

A few days ago, a colleague asked me why a certain Google Cloud instance no longer existed. I did not know either, so I searched for its name in the Google audit log, and found when and by whom it was decommissioned.

But it was a full-text search, given the instance name. I probably could do it (in theory) as a field match, if I knew which field it was, and which format it was in (I am talking about project/abc/location/xyz type of junk that precedes the actual instance name).

And yes it was slow (this instance was deleted months ago, and Google tries to search the most recent logs first).
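
To make the field-match problem concrete, here is a minimal sketch in Python. The entry layout, the field names (`action`, `resource`, `actor`), and the instance names are all made up for illustration and are not the real audit log schema. It only shows why an exact field match misses when the stored value carries a path prefix, while a scan over every value finds it.

```python
def field_match(entries, field, value):
    """Exact match on one named field (what a keyword filter does)."""
    return [e for e in entries if e.get(field) == value]


def full_text_match(entries, needle):
    """Substring scan over every value of every entry (slow, but needs no schema knowledge)."""
    return [e for e in entries if any(needle in str(v) for v in e.values())]


# Hypothetical audit entries; the path prefix imitates the
# "project/abc/location/xyz" junk that precedes the instance name.
entries = [
    {"action": "create", "resource": "project/abc/location/xyz/web-1", "actor": "alice"},
    {"action": "delete", "resource": "project/abc/location/xyz/db-7", "actor": "bob"},
]

if __name__ == "__main__":
    # The bare name does not equal the stored value, so this finds nothing.
    print(field_match(entries, "resource", "db-7"))
    # The scan finds the delete entry without knowing the field or format.
    print(full_text_match(entries, "db-7"))
```

The field match only works once you know both the field and the full prefixed value, which is exactly the knowledge I did not have.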

This sounds like the 1% of my experiences not served by filtering.

Naturally your experiences will be different from mine!

Completely agree.

My gripe with ES is that it won't let you do post-pass filtering at all. If you create an index with a few keyword fields indexed and then some unindexed fields, you can't query the unindexed fields.

Grafana's Loki seems to be exactly what we are looking for, although I haven't played with it.

I guess what they want is to use the Elasticsearch query language but let it optionally do “expensive” non-indexed filtering, like a SQL database would let you do.

Without knowing for sure, I imagine they originally expected the application side to handle this, but many of the current solutions don’t do that. Instead, they expose and overload the Elasticsearch query language as the primary search interface, with no additional app logic. The Elasticsearch query “is” the search application.

I'm making some assumptions here, but this might reconcile the different viewpoints on why it does or doesn’t make sense.

The problem with this thinking is that in most cases having the server send all of the data back to the client to do their own search is going to be far more expensive than running the search (even of unindexed data) on the server.

And I am only talking about server-side costs here, as moving data between server and client has costs both in serialization and transmission. Yes, I can make up regexes that wind up throwing this cost comparison out the window (e.g.: lookbacks), but in the vast majority of cases this is true.

I think the main reason that Elasticsearch does not do this is that they would have to provide either grep-like or regex support, and either one would give different answers than the lexical search system they provide otherwise. Explaining those differences to clients would be a nightmare.

Note: in most places where I wind up using Elasticsearch, I absolutely hate that it is lexical search rather than grep or regex, especially when I am looking for exact text. This is particularly a problem in Jira, where I have to be very careful about word boundaries.
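
A rough illustration of the mismatch, in Python. The tokenizer here is a toy assumption (lowercase, split on anything that is not a letter or digit); it is not Elasticsearch's actual analyzer, and the function names are made up. It just shows how token-based matching and grep-style matching can disagree on the same line.

```python
import re


def tokenize(text):
    """Toy analyzer: lowercase, then split on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def lexical_match(query, text):
    """True if every query token appears as a whole token in the text (order ignored)."""
    tokens = set(tokenize(text))
    return all(t in tokens for t in tokenize(query))


def substring_match(query, text):
    """grep-style: True if the query appears verbatim anywhere in the text."""
    return query in text


line = "connection timeout in worker-12 after retrying"

if __name__ == "__main__":
    # "retry" is not a whole token ("retrying" is), so lexical search misses it.
    print(lexical_match("retry", line), substring_match("retry", line))
    # Both tokens exist, but never as the literal string "12 worker".
    print(lexical_match("12 worker", line), substring_match("12 worker", line))
```

Each approach finds things the other misses, which is why offering both under one query box would be hard to explain.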

You can update the index with the new field specification and reindex your content.

Your complaint really doesn't make sense. How would you query an unindexed field? Elasticsearch is a _search_ engine, which means it needs to index content that is to be discoverable. What you're saying with unindexed fields is that you're completely fine with those not being included in any search or filtering.

Your response is, "Why can't you perfectly predict which columns/keywords will be necessary later on, or otherwise re-index the whole system at the drop of a hat for one query? And why would you think a search engine would be able to perform an unindexed, ad hoc search?"

Compared to my experience, you have a foundational difference of understanding with how systems are actually used.

You use the index to identify a subset of records and scan those for unindexed criteria.

It's ok to fail if the indexed criteria are not selective enough. In fact it's usually preferable to a long timeout.
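
Roughly what I mean, as a small in-memory Python sketch. `LogStore`, `TooManyCandidates`, `max_scan`, and the `message` field are all names invented for this example; this is not how Elasticsearch or any particular log store works internally. The index narrows the candidates, the scan handles the unindexed criterion, and the query fails fast if the index did not narrow things enough.

```python
from collections import defaultdict


class TooManyCandidates(Exception):
    """Raised when the indexed filters leave too many records to scan."""


class LogStore:
    def __init__(self, indexed_fields, max_scan=1000):
        self.max_scan = max_scan
        self.records = []
        # One posting map per indexed field: value -> set of record ids.
        self.index = {f: defaultdict(set) for f in indexed_fields}

    def add(self, record):
        rid = len(self.records)
        self.records.append(record)
        for field, postings in self.index.items():
            if field in record:
                postings[record[field]].add(rid)

    def search(self, indexed, contains=None):
        """Filter by exact match on indexed fields, then scan 'message' for a substring."""
        if not indexed:
            raise ValueError("at least one indexed filter is required")
        candidates = None
        for field, value in indexed.items():
            if field not in self.index:
                raise ValueError(f"field {field!r} is not indexed")
            ids = self.index[field].get(value, set())
            candidates = ids if candidates is None else candidates & ids
        # Fail fast instead of scanning a huge candidate set.
        if len(candidates) > self.max_scan:
            raise TooManyCandidates(
                f"{len(candidates)} candidates exceeds limit of {self.max_scan}"
            )
        hits = [self.records[i] for i in sorted(candidates)]
        if contains is not None:
            # Post-pass: the unindexed criterion is checked only on the survivors.
            hits = [r for r in hits if contains in r.get("message", "")]
        return hits
```

With `max_scan` set low, a query like `store.search({"service": "api"}, contains="timeout")` either returns quickly or raises `TooManyCandidates`, telling you to add another indexed filter such as a time window.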

Exactly, it's a search engine. It probably doesn't make sense to use it as a storage engine for logs unless you need to search all of them efficiently.

There's also cLoki. It's a new project that puts a Loki gateway over a ClickHouse backend store. We're looking at it and plan a presentation from the author(s) at the next ClickHouse SF Bay Area Meetup.

https://github.com/lmangani/cLoki

Will runtime fields help you with post-pass filtering? https://www.elastic.co/blog/introducing-elasticsearch-runtim...