| HN Mirror

	by jeffffff 1403 days ago
	Sure, but now you've added an extra layer of indirection, which can have a significant impact on performance.

klauspost 1403 days ago

It doesn't really have to impact performance. The index is generated easily as a side-effect of compression. And the index is only needed if you need to seek.

I implemented this as part of the MinIO server. See "Seeking Compressed Files" here: https://blog.min.io/transparent-data-compression/

We chose a compressor without literal compression for a faster baseline, but the concept remains the same.
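
A minimal sketch of the idea, not MinIO's actual format: it assumes independently compressed fixed-size blocks and swaps in `zlib` from the Python standard library as the compressor. The names `compress_with_index`, `read_range`, and the 1 MiB `BLOCK_SIZE` are made up for illustration. The index falls out of the compression loop for free, and it is only consulted when reading a byte range.

```python
import zlib

BLOCK_SIZE = 1 << 20  # assumed: 1 MiB of uncompressed data per block


def compress_with_index(data: bytes, block_size: int = BLOCK_SIZE):
    """Compress block by block and return (compressed_bytes, index).

    index[i] is the offset in the compressed stream where block i starts;
    the final entry is the total compressed length (an end sentinel).
    """
    out = bytearray()
    index = []
    for start in range(0, len(data), block_size):
        index.append(len(out))  # recorded as a side effect of compressing
        out += zlib.compress(data[start:start + block_size])
    index.append(len(out))
    return bytes(out), index


def read_range(compressed: bytes, index: list, offset: int, length: int,
               block_size: int = BLOCK_SIZE) -> bytes:
    """Return data[offset:offset + length], decompressing only the blocks needed."""
    if length <= 0:
        return b""
    first = offset // block_size
    last = min((offset + length - 1) // block_size, len(index) - 2)
    # Each block was compressed independently, so it can be decoded on its own.
    blocks = [
        zlib.decompress(compressed[index[i]:index[i + 1]])
        for i in range(first, last + 1)
    ]
    joined = b"".join(blocks)
    skip = offset - first * block_size  # position of `offset` inside the first block
    return joined[skip:skip + length]
```

A sequential read never touches `index`; a ranged read does one lookup and decompresses only the blocks that overlap the requested range.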

jeffffff 1403 days ago

But if you do need to seek, which is really common in data warehouse workloads for example, then unless you keep the index in RAM you have to do an extra IO on every seek to read the index.

blibble 1403 days ago

there's always going to be some metadata for the file that needs to be looked up before you can start seeking (ACLs, sector/extent/cluster location, etc)

the index goes in there, no extra seek needed

jeffffff 1402 days ago

yeah that isn't free either, it adds significant bloat to your metadata. with most enterprise customers encrypting and/or compressing data before putting it into s3, it doesn't seem like there would be much benefit. s3 really isn't the right layer to implement compression. filesystems aren't either. it's better to leave it up to the application.

blibble 1402 days ago

> yeah that isn't free either, it adds significant bloat to your metadata

yeah, 4 bytes for every megabyte

> s3 really isn't the right layer to implement compression. filesystems aren't either. it's better to leave it up to the application.

yeah, I'm sure you're right and Amazon have absolutely no idea what they're doing and like to spend unnecessary CPU cycles doing pointless work and add "significant bloat" to their metadata

... or, you're wrong (like in every previous comment in this chain)

jeffffff 1402 days ago

https://www.reddit.com/r/programming/comments/wtd61q/aws_swi...

this tweet is not talking about compressing customer data in s3. i seriously doubt that aws compresses customer data in s3, for all the reasons i've already listed. i am right and amazon does know what they're doing, which is why they don't compress customer data in s3.

4 bytes per megabyte becomes significant at scale when you have to keep it in ram, which you have to do if you want to avoid the extra IO.
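
For a sense of scale, here is the arithmetic behind the "4 bytes per megabyte" figure quoted in this thread, as a rough sketch. It assumes one 4-byte entry per 1 MiB block (a binary megabyte; the thread doesn't say which) and the helper name `index_bytes` is hypothetical. It says nothing about how S3 or MinIO actually lay out their metadata.

```python
BYTES_PER_ENTRY = 4    # figure quoted in the thread
BLOCK_SIZE = 1 << 20   # assumed: one index entry per MiB of data


def index_bytes(data_bytes: int) -> int:
    """Size of the seek index for data_bytes of stored data."""
    blocks = -(-data_bytes // BLOCK_SIZE)  # ceiling division
    return blocks * BYTES_PER_ENTRY


TiB = 1 << 40
PiB = 1 << 50

print(index_bytes(TiB))  # 4194304 bytes, i.e. 4 MiB of index per TiB
print(index_bytes(PiB))  # 4294967296 bytes, i.e. 4 GiB of index per PiB
```

Under those assumptions the index is roughly 0.0004% of the data. Whether that counts as significant comes down to how much of it has to stay resident in RAM at once.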
