by bcjordan 1360 days ago
Always bugged me that highly repetitive logs take up so much space!

I'm curious, are there any managed services / simple-to-use setups that take advantage of something like this for massive log storage and search? (Most hosted log aggregators I've looked at charge by the raw text GB processed.)

3 comments

Check out https://github.com/parseablehq/parseable ... we are building a log storage and analysis platform in Rust. A columnar format helps a lot in reducing overall size, but then you have a little computational overhead to deal with for conversion and compression. This trade-off will always be there, but we are finding ways to minimise it with Rust.
It doesn't seem like your solution achieves columnar breakdowns for the unstructured parts of the log. E.g. they basically reverse engineer the printfs; you don't. Claiming to be similar is misleading.
ZFS as an underlying filesystem offers several compression algorithms and suits raw log storage well.
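
To make "reverse engineering printfs" concrete, here is a minimal sketch in Python. It is not how either project discussed here actually works: the regex heuristic (numbers and hex-like tokens count as variables), the `<*>` placeholder, and the names `split_line` and `to_columns` are all assumptions made for illustration.

```python
import re
from collections import defaultdict

# Assumed heuristic: numeric and hex-like tokens are the "printf arguments".
VARIABLE = re.compile(r"\b(?:0x[0-9a-fA-F]+|\d+(?:\.\d+)*)\b")


def split_line(line):
    """Split one log line into a static template and its variable values."""
    variables = VARIABLE.findall(line)
    template = VARIABLE.sub("<*>", line)
    return template, variables


def to_columns(lines):
    """Group variable values by template so each template is stored once."""
    columns = defaultdict(list)
    for line in lines:
        template, variables = split_line(line)
        columns[template].append(variables)
    return dict(columns)


logs = [
    "connected to 10.0.0.1 in 35 ms",
    "connected to 10.0.0.2 in 41 ms",
    "disk 3 usage at 91 percent",
]
result = to_columns(logs)
for template, rows in result.items():
    print(template, rows)
```

The repeated static text is kept once per template, and only the variable values are stored per line, which is what makes a column-per-variable layout possible for otherwise unstructured messages.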
Deduplication can literally save petabytes.
deduplication is probably the biggest "we don't do that here" in the ZFS world lol, at this point I think even the authors of that feature have disowned it.

it does what it says on the tin, but this comes at a much higher price than almost any other ZFS feature: you have to store the dedup tables in memory, permanently, to get any performance out of the system, so the rule of thumb is that you need at least 20GB of RAM per TB stored. In practice you only want to do it if your data is HIGHLY duplicated, and that's often a smell that building a layered image from a common ancestor using the snapshot functionality is going to be a better option.
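
To put that rule of thumb in numbers, a back-of-the-envelope sketch in Python. The 20GB-per-TB figure is just the one quoted above, not a measured value, and `dedup_ram_gb` is a hypothetical helper name; real dedup table size depends on block size and how much of the data is unique.

```python
# Rule-of-thumb figure quoted in the comment above, not a measurement.
GB_RAM_PER_TB = 20


def dedup_ram_gb(pool_tb, gb_per_tb=GB_RAM_PER_TB):
    """Estimate the RAM (in GB) needed to hold dedup tables for a pool."""
    return pool_tb * gb_per_tb


for pool_tb in (1, 8, 50):
    print(f"{pool_tb} TB stored -> roughly {dedup_ram_gb(pool_tb)} GB of RAM")
```

Under that assumption even a modest 8 TB home pool would want about 160 GB of RAM, which is why it only pays off for heavily duplicated data.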

and once you've committed to deduplication, you're committed... dedup metadata builds up over time and the only time it gets purged is if you remove ALL references to ANY dedup'd blocks on that pool. So practically speaking this is a commitment to running multiple pools and migrating them at some point. That's not a huge problem for enterprise, but most people want to run "one big pool" for their home stuff. All in all, even for enterprise, you have to really know that you want it and that it's going to produce big gains for your specific use-case.

in contrast, LZ4 compression is basically free (actually it's usually faster due to reduced IOPS) and still performs very well on things like column-oriented stores, or even just unstructured JSON blobs, and imposes no particular limitations on the pool: it's just compressed blocks.
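
A quick way to see how well repetitive logs compress is sketched below in Python. LZ4 is not in the Python standard library, so this uses `zlib` at its fastest level as a stand-in; the actual ratios and speeds of LZ4 on ZFS will differ. The synthetic JSON log lines and the `compression_ratio` helper are made up for illustration.

```python
import zlib


def compression_ratio(data: bytes, level: int = 1) -> float:
    """Return uncompressed size divided by compressed size (zlib stand-in)."""
    return len(data) / len(zlib.compress(data, level))


# Synthetic, highly repetitive JSON log lines; only the id changes per line.
lines = [
    f'{{"level":"info","msg":"request served","status":200,"id":{i}}}\n'
    for i in range(10000)
]
blob = "".join(lines).encode()

ratio = compression_ratio(blob)
print(f"{len(blob)} bytes raw, ratio {ratio:.1f}x")
```

Because nearly every byte of each line repeats, even a fast general-purpose compressor shrinks this kind of data substantially without any dedup tables.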

They would still charge you per raw GB processed regardless of compression used.

IIRC Elasticsearch compresses by default with LZ4