by jeffffff 1400 days ago
yeah that isn't free either, it adds significant bloat to your metadata. with most enterprise customers encrypting and/or compressing data before putting it into s3, it doesn't seem like there would be much benefit. s3 really isn't the right layer to implement compression. filesystems aren't either. it's better to leave it up to the application.

> yeah that isn't free either, it adds significant bloat to your metadata

yeah, 4 bytes for every megabyte

> s3 really isn't the right layer to implement compression. filesystems aren't either. it's better to leave it up to the application.

yeah, I'm sure you're right and Amazon have absolutely no idea what they're doing and like to spend unnecessary CPU cycles doing pointless work and add "significant bloat" to their metadata

... or, you're wrong (like in every previous comment in this chain)

https://www.reddit.com/r/programming/comments/wtd61q/aws_swi...

this tweet is not talking about compressing customer data in s3, i seriously doubt that aws compresses customer data in s3 for all the reasons i've already listed. i am right and amazon does know what they're doing, which is why they don't compress customer data in s3.

4 bytes per megabyte becomes significant at scale when you have to keep it in ram, which you have to do if you want to avoid the extra IO.
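
To put a number on "at scale", here is a rough back-of-the-envelope sketch. It assumes the figures quoted in this thread (one 4-byte entry per 1 MiB of data); the function name and defaults are made up for illustration, and whether the index has to live in RAM at all is exactly what is being argued here.

```python
MIB = 1 << 20
PIB = 1 << 50

def index_bytes(data_bytes, block_bytes=MIB, entry_bytes=4):
    """Total index size for data_bytes of stored data.

    Assumes one fixed-size entry per block (hypothetical layout,
    using the 4 bytes per MiB figure from the thread).
    """
    blocks = -(-data_bytes // block_bytes)  # ceiling division
    return blocks * entry_bytes

# Index size in GiB for 1 PiB of stored data.
print(index_bytes(PIB) / (1 << 30), "GiB of index per PiB")
```

So 1 PiB of data carries about 4 GiB of index under these assumptions, which is small relative to the data but not nothing if all of it had to stay cached.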

You only need a single part to calculate a specific offset, assuming you have part sizes stored in metadata already (a good idea).
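
A minimal sketch of that lookup, assuming the object metadata holds a plain list of uncompressed part sizes (`part_sizes` and `locate_part` are names invented here, not any real S3 or MinIO API):

```python
from bisect import bisect_right
from itertools import accumulate

def locate_part(part_sizes, offset):
    """Map an absolute object offset to (part_index, offset_within_part)."""
    # ends[i] is the absolute offset one past the last byte of part i.
    ends = list(accumulate(part_sizes))
    if not ends or not 0 <= offset < ends[-1]:
        raise ValueError("offset outside object")
    i = bisect_right(ends, offset)
    part_start = ends[i] - part_sizes[i]
    return i, offset - part_start

# Three parts of 100, 50 and 25 bytes: offset 120 falls in the second part.
print(locate_part([100, 50, 25], 120))
```

Only the index of the part that comes back needs to be read to serve the request.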

Each part can be max 5GiB as per S3 spec. That is 5120 index entries at one per MiB, so 5120 * 4 bytes = 20KiB.

Even if you unpack each entry to 8*2 bytes in memory when decoding (80KiB for a full 5GiB part), you are still not talking a huge amount of memory.
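
Roughly what the decoded form could look like, as a sketch only: this assumes each entry unpacks to a pair of 64-bit offsets (uncompressed, compressed) kept in two sorted parallel lists. It is not the actual on-disk or in-memory format of MinIO or any particular compressor, and the names are hypothetical.

```python
from bisect import bisect_right

def seek(uncompressed_offsets, compressed_offsets, target):
    """Find where to start reading to reach an uncompressed offset.

    Returns (compressed_offset, skip): start decompressing at
    compressed_offset and discard the first `skip` decoded bytes.
    Both lists are sorted and the first entry of each is 0.
    """
    i = bisect_right(uncompressed_offsets, target) - 1
    if i < 0:
        raise ValueError("target before start of data")
    return compressed_offsets[i], target - uncompressed_offsets[i]

# One entry per MiB of uncompressed data; compressed offsets are made up.
u = [0, 1 << 20, 2 << 20]
c = [0, 400_000, 750_000]
print(seek(u, c, 1_500_000))
```

The lookup is a binary search over at most 5120 entries per part, so the cost is in fetching the index, not in searching it.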

The on-disk space is ~0.0004% as blibble calculated, and should easily be offset by the compression achieved. In MinIO we don't store indexes for files < 8MiB, so for small files there is no overhead.

If the added metadata is a problem for whatever system you are looking at, then that is a characteristic of that system and not a general problem.

ah yes, "authoritative" comments from random reddit accounts

and you don't understand the algorithm if you think you need to keep the index in RAM, because you don't

if it's not in ram, you have to do an extra IO to look it up. i don't think you understand how precious metadata space is in a large scale storage system. if you pollute the metadata cache with useless junk like this, you can't cache as many things, your hit rate goes down, and you have to do more IO operations to service each request on average. name one popular distributed file system or object store that compresses everything by default like you are claiming. you won't be able to, because none of them do it, because it's better to leave it to the application.

> if it's not in ram, you have to do an extra IO to look it up

as has been explained to you several times, you don't

> if you pollute the metadata cache with useless junk like this

the overhead is 0.0004% with 1 index entry per megabyte, and if that's too much it can be reduced by 10/100/1000/10000x by making each index entry cover a larger block
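
The arithmetic behind that figure, as a quick sketch. It assumes a 4-byte entry per block; `overhead_percent` is just an illustrative helper, and a coarser index means more data to decompress and discard on each seek.

```python
MIB = 1 << 20

def overhead_percent(block_bytes, entry_bytes=4):
    """Index size as a percentage of the data it covers."""
    return 100 * entry_bytes / block_bytes

# Overhead for one entry per 1, 10, 100 and 1000 MiB.
for mult in (1, 10, 100, 1000):
    print(f"{mult:>4} MiB per entry: {overhead_percent(mult * MIB):.7f}%")
```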

as we're clearly now going around in circles, I won't be responding again.

well i tried. i wish you the best of luck in your continued ignorance.