by binarylogic
150 days ago
Yeah, it's funny, I never went down the regex rabbit hole until this, but I was blown away by Hyperscan/Vectorscan. It truly changes the game. Traditional wisdom tells you regex is slow.

> I'm surprised it's only 40%.

Oh, it's worse. I'm being conservative in the post. That number represents "pure" waste without sampling. You can see how we classify it: https://docs.usetero.com/data-quality/logs/malformed-data. If you get comfortable with sampling the right way (entire transactions, not individual logs), that number gets a lot bigger. The beauty of categories is you can incrementally root out waste in a way you're comfortable with.

> compare logs from known good to known bad

I think you're describing anomaly detection: diffing normal vs abnormal states to surface what's different. That's useful for incident investigation, but it's a different problem than waste identification. Waste isn't about good vs bad, it's about value: does this data help anyone debug anything, ever? A health check log isn't anomalous, it's just not worth keeping.

You're right that the dimensional analysis and pre-processing is where the real work is. That's exactly what Tero does. It compresses logs into semantic events, understands patterns, and maps meaning before any evaluation happens.
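
To make "entire transactions, not individual logs" concrete, here is a minimal sketch of one common way to do it: hash a per-transaction ID and make the keep/drop decision once per transaction. The `trace_id` field, the function names, and the record shape are assumptions for illustration, not anything from the post or from Tero.

```python
import hashlib


def keep_transaction(trace_id: str, sample_rate: float) -> bool:
    """Decide once per transaction, so its logs are kept or dropped together."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Treat the first 8 bytes as an integer in [0, 2**64) and compare it
    # against the sample rate scaled to the same range.
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < sample_rate * 2**64


def sample_logs(logs, sample_rate):
    # `trace_id` is a hypothetical field name; use whatever ties a
    # transaction's log lines together in your data.
    return [log for log in logs if keep_transaction(log["trace_id"], sample_rate)]
```

Because the decision depends only on the ID, every log line from a kept transaction survives, so you never end up with half a request to debug.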
|
Because it's uncomfortably easy to create catastrophic backtracking.
But just logical-ORing many patterns together isn't one of the ways to do that, at least as far as I'm aware.
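
A rough sketch of the difference, using Python's backtracking `re` module rather than Hyperscan. The log patterns are made up for illustration. Plain alternation of simple patterns stays cheap, while a nested quantifier such as `(a+)+$` is a textbook case that blows up on a near-miss input.

```python
import re
import time

# Hypothetical log patterns, for illustration only.
patterns = [r"GET /healthz", r"connection reset by peer", r"timeout after \d+ms"]

# Logical-OR the patterns with alternation. Each branch is simple, so the
# engine tries at most a few branches per position.
combined = re.compile("|".join(f"(?:{p})" for p in patterns))


def matches_any(line: str) -> bool:
    return combined.search(line) is not None


# Nested quantifiers: the run of "a"s can be split between the inner and
# outer "+" in exponentially many ways, and a backtracking engine tries
# them all before giving up on the trailing "b".
nested = re.compile(r"(a+)+$")


def time_nested(n: int) -> float:
    text = "a" * n + "b"
    start = time.perf_counter()
    nested.match(text)
    return time.perf_counter() - start


if __name__ == "__main__":
    # The time should grow sharply with each step of n.
    for n in (12, 14, 16, 18):
        print(n, f"{time_nested(n):.4f}s")
```

Automata-based engines avoid this class of problem by not backtracking at all, which is a large part of why ORing many patterns together can stay fast there.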