by jcgrillo 712 days ago
This position has always confused me. IME log search tools (ELK and their SaaS ilk) are always far too restrictive and uncomfortable compared to Hadoop/Spark. I'd much rather have unfettered access to the data and have to wait a couple seconds for my query to return than be pigeonholed into some horrible DSL built around an indexing scheme. I couldn't care less about my log queries returning in sub-second time; it's just not a requirement. The fact that people index logs is baffling.

If you can limit your search to GBs of logs, I kind of agree with you. It's OK if a log search request takes 2s instead of 100ms, and the "grep" approach is more flexible.

Our users usually search across more than 1TB.

Let's imagine you have to search into 10TB (even after time/tag pruning). Spreading that scan across 10k cores so it finishes in 2 seconds is not practical and does not always make economic sense.

The question is why someone would need to search through TBs of data.

If you are not Google Cloud, with workers standing ready to stream all the data in parallel across some number of machines, I would enforce useful limits, and for broad searches I would add a background system.

Start your query, come back later or get streaming results.

On the other hand, if not too many people are searching in parallel all the time and you go with storage pods like Backblaze's, just add a little more CPU and memory and use the pods' CPUs for parallelization. That should still be much cheaper than putting it on S3 / cloud.

I guess I was a little too prescriptive with "a couple seconds". What I really meant was that a timescale of seconds to minutes is fine; five minutes is probably too long.

> Let's imagine you have to search into 10TB (even after time/tag pruning).

I'd love to know more about this. How frequently do users need to scan 10TB of data? Assuming it's all on one machine, on a disk that supports a conservative 250MB/s of sequential throughput (and your grep can also run at 250MB/s), that's about 11 hours, so you could get it down to about 4 minutes on a cluster with 150 disks.
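
For anyone who wants to check the napkin math, here is a minimal sketch. It assumes decimal units (1TB = 10^12 bytes), data spread evenly across disks, and perfectly linear scaling with no coordination overhead; the function name is just illustrative.

```python
def scan_seconds(total_bytes: float, bytes_per_sec_per_disk: float, disks: int = 1) -> float:
    """Seconds to sequentially scan total_bytes spread evenly over `disks` disks.

    Assumes every disk is read in parallel at the same rate and that the
    scanner (e.g. grep) keeps up with the disk.
    """
    return total_bytes / (bytes_per_sec_per_disk * disks)


TB = 10**12  # decimal terabyte (assumption)
MB = 10**6   # decimal megabyte (assumption)

one_disk = scan_seconds(10 * TB, 250 * MB)            # 40,000 seconds
cluster = scan_seconds(10 * TB, 250 * MB, disks=150)  # about 267 seconds

print(f"1 disk:    {one_disk / 3600:.1f} h")   # 1 disk:    11.1 h
print(f"150 disks: {cluster / 60:.1f} min")    # 150 disks: 4.4 min
```

Real clusters will do worse than this because of stragglers and network transfer, so treat these numbers as a lower bound.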

But I still have trouble believing they actually need to scan 10TB each time. I guess a real world example would help.

EDIT: To be clear, I really like quickwit, and what they've done here is really technically impressive! I don't mean to disparage this effort on its technical merits, I just have trouble understanding where the impulse to index everything comes from when applied specifically to the problem of logging and logs analysis. It seems like a poor fit.

It sounds like you are doing ETL on your logs. Most people want to search them when something goes wrong, which means indexing.

No, what I'm doing is analysis on logs. That could be as simple as "find me the first N occurrences of this pattern" (which you might call search) but includes things like "compute the distribution of request latencies for requests affected by a certain bug" or "find all the tenants impacted by a certain bug, whose signature may be complex and span multiple services across a long timescale".

Good luck doing that in a timely manner with Kibana. Indexed search is completely useless in this case, and it solves a problem (retrieval latency) I don't (and, I claim, you don't) have.
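
To make the kind of analysis described above concrete, here is a rough single-machine sketch in plain Python. Everything about the log format is an assumption for illustration: `key=value` lines with hypothetical `tenant`, `req`, and `latency_ms` fields, and a bug signature that is just a regex. In practice the same two passes would be a Spark or MapReduce job over the raw files.

```python
import re
from collections import Counter

# Hypothetical key=value log format (an assumption), e.g.:
#   tenant=acme req=r1 msg="cache miss fallback"
#   tenant=acme req=r1 latency_ms=250
FIELD = re.compile(r'(\w+)=("[^"]*"|\S+)')


def parse(line):
    """Turn one key=value log line into a dict, stripping quotes from values."""
    return {key: value.strip('"') for key, value in FIELD.findall(line)}


def analyze(lines, signature, bucket_ms=100):
    """Return (affected tenants, latency histogram) for requests hitting a bug.

    Pass 1 finds every request id with at least one line matching `signature`.
    Pass 2 collects tenants and latency buckets for all lines of those requests.
    """
    sig = re.compile(signature)
    records = [(parse(line), line) for line in lines]

    # Pass 1: request ids whose logs contain the bug signature.
    affected = {rec["req"] for rec, raw in records if "req" in rec and sig.search(raw)}

    # Pass 2: aggregate over every line belonging to an affected request.
    tenants = set()
    histogram = Counter()
    for rec, _ in records:
        if rec.get("req") in affected:
            if "tenant" in rec:
                tenants.add(rec["tenant"])
            if "latency_ms" in rec:
                bucket = int(float(rec["latency_ms"])) // bucket_ms * bucket_ms
                histogram[bucket] += 1
    return tenants, histogram
```

The point is that the join between "lines that match the signature" and "latency lines for the same request" is trivial with raw access to the data, and awkward to express in a search DSL.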

EDIT: Another way to look at this: at the companies I've worked at where I could actually do detailed analysis on the logs (they were stored sensibly, such that I could run MapReduce jobs over them), I never reached a point where a problem was unsolvable. These days, when we're often stuck with a restrictive "logs search solution as a service", I often run into situations where the answer simply isn't obtainable. Which situation is better for customers? I guess cynically you could say being unable to get to the bottom of an issue keeps me timeboxed and focused on feature development instead of fixing bugs. I don't think anyone but the most craven get-rich-quick money grubber would actually believe that's better, though.