by 392 793 days ago
This is all quite true, but a possibly faster way to prototype would be to use the Drain algorithm (there are Rust and Python impls that are easy to use) to determine the "log template". Then push the log template when it's first seen, and nothing but values after that, into a programmatically generated table in a common columnar or table format like Parquet or Iceberg. Then you can point any of the myriad data analysis tools like DuckDB, DataFusion, or the latest InfluxDB at it, and you've got your SQL on logs implemented. It can feel a bit Rube-Goldbergish, and it's a bit tricky to navigate the space uninformed because it's early days, but it can also handle all the other data your company uses in one platform, so there's no need to special-case "applications" from the "data analysis"/historical/log side. One place to handle permissions. Then there are tools like Dagster for managing this humongous single database in a straightforward way, rather than writing a web of applications that push to and pull from it without a complete picture being possible, or needing devs to remember their place in the system. Search up Uber CLP for prior art, or more generally the "modern data stack" (PRQL will be perfect for querying logs). And by piggybacking on big systems like this, you can take advantage of future advancements in the state of the art, like BtrBlocks https://www.google.com/url?sa=t&source=web&rct=j&opi=8997844.... Of course, if the savings/earnings are high enough, I guess you could start implementing this now.
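To make the "template once, values after" idea concrete, here is a rough sketch in plain Python. It is not Drain: it uses a much cruder rule (any token containing a digit is treated as a variable) purely as a stand-in, and the names `split_template` and `TemplateStore` are made up for illustration. The in-memory dict of rows stands in for what would really be one Parquet/Iceberg table per template.

```python
import re
from collections import defaultdict

# Crude stand-in for Drain: any token containing a digit counts as a variable.
_VARIABLE = re.compile(r"\d")


def split_template(line):
    """Return (template, values) for one log line."""
    template_tokens, values = [], []
    for token in line.split():
        if _VARIABLE.search(token):
            template_tokens.append("<*>")
            values.append(token)
        else:
            template_tokens.append(token)
    return " ".join(template_tokens), values


class TemplateStore:
    """Hypothetical store: one table of value rows per template."""

    def __init__(self):
        self.template_ids = {}           # template text -> small integer id
        self.tables = defaultdict(list)  # template id -> list of value rows

    def add(self, line):
        template, values = split_template(line)
        # The template text is recorded only the first time it is seen;
        # afterwards only the extracted values are appended.
        template_id = self.template_ids.setdefault(template, len(self.template_ids))
        self.tables[template_id].append(values)
        return template_id

    def columns(self, template_id):
        """Transpose one template's rows into columns (columnar layout)."""
        return [list(col) for col in zip(*self.tables[template_id])]


store = TemplateStore()
for line in [
    "user 42 logged in from 10.0.0.1",
    "user 7 logged in from 10.0.0.9",
    "disk usage at 91%",
]:
    store.add(line)

print(store.template_ids)
print(store.columns(0))
```

In a real setup each template's columns would be written out with a Parquet writer and queried from DuckDB or similar; that part is left out here so the sketch stays dependency-free.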
1 comments

Thanks, I wasn't aware of either Drain or BtrBlocks. CLP is very cool. Honestly I'm not sure what a good query experience looks like. I really enjoy the flexibility of MapReduce because there are no "unsolvable" problems: if a high-level DSL like Hive or Pig gets in the way, you just drop down a level to Spark or streaming Python MapReduce or whatever. So ultimately, rather than a "DSL for logs", I'd rather have something more like a "programming model for logs". I don't know what this looks like in 2024, hopefully not still actual Hadoop/EMR.
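For a sense of what "dropping down a level" means, here is a toy map/reduce over log lines in plain Python. The log format (level as the second whitespace-separated field) and the function names are assumptions for illustration only; the point is that the map and reduce steps are ordinary code, so nothing is out of reach the way it can be in a fixed DSL.

```python
from collections import Counter
from functools import reduce


def map_line(line):
    """Map step: emit (level, 1) for lines that have a level field."""
    parts = line.split()
    if len(parts) >= 2:
        # Assumed format: "<timestamp> <LEVEL> <message...>"
        yield parts[1], 1


def reduce_counts(acc, pair):
    """Reduce step: sum the counts per key."""
    key, n = pair
    acc[key] += n
    return acc


lines = [
    "2024-01-01T00:00:00 ERROR disk full",
    "2024-01-01T00:00:01 INFO started",
    "2024-01-01T00:00:02 ERROR timeout",
]

pairs = (pair for line in lines for pair in map_line(line))
counts = reduce(reduce_counts, pairs, Counter())
print(counts)
```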