by klauspost 1590 days ago

We considered TAR, but indexing requires reading back and decompressing the entire archive.

This may be feasible for small TAR files, and for a single PutObject you could index while uploading. However, for multipart objects, parts can arrive in any order, so you are forced to read the object back. This would lead to unpredictable response times.

Compare that to reading the directory of a ZIP, which even for big files is maybe a couple of megabytes at most.
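To illustrate the point about the ZIP directory, here is a minimal sketch using Python's standard `zipfile` module. It is not how MinIO does it; `build_sample_zip` and `list_directory` are hypothetical helper names made up for this example. Opening the archive only parses the central directory at the end of the file, so the listing is available without decompressing any member.

```python
import io
import zipfile


def build_sample_zip() -> bytes:
    """Create a small in-memory ZIP so the example is self-contained."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("a.txt", "hello " * 1000)
        zf.writestr("dir/b.txt", "world " * 1000)
    return buf.getvalue()


def list_directory(data: bytes):
    """Return (name, local header offset, compressed size, uncompressed size)
    for every entry, taken from the central directory only."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        return [
            (info.filename, info.header_offset, info.compress_size, info.file_size)
            for info in zf.infolist()
        ]


if __name__ == "__main__":
    for entry in list_directory(build_sample_zip()):
        print(entry)
```

Each entry carries the offset of its local header, so a single member can later be fetched with a ranged read instead of scanning the archive.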

On top of that, tar.gz requires you to decompress from the start to reach any offset. You could recompress while indexing, but an object store mutating your data is IMO a no-no.
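The gzip point can be sketched like this, assuming plain single-member gzip data and Python's standard `zlib`. The function name `compressed_bytes_needed` and the sample-data helper are hypothetical. The function feeds the stream to the decompressor from the beginning and reports how much compressed input had to be consumed before a given uncompressed offset became available.

```python
import gzip
import random
import zlib


def sample_gzip(n: int = 100_000) -> bytes:
    """Deterministic, poorly compressible payload, gzip-compressed."""
    rng = random.Random(0)
    payload = bytes(rng.getrandbits(8) for _ in range(n))
    return gzip.compress(payload)


def compressed_bytes_needed(gz_data: bytes, target_offset: int, chunk: int = 1024) -> int:
    """Count the compressed bytes that must be read, starting at byte 0,
    before uncompressed byte `target_offset` has been produced."""
    decomp = zlib.decompressobj(31)  # wbits=31 selects the gzip container
    produced = 0
    consumed = 0
    for start in range(0, len(gz_data), chunk):
        piece = gz_data[start:start + chunk]
        produced += len(decomp.decompress(piece))
        consumed += len(piece)
        if produced > target_offset:
            return consumed
    raise ValueError("offset is beyond the end of the stream")


if __name__ == "__main__":
    gz = sample_gzip()
    print(compressed_bytes_needed(gz, 10), compressed_bytes_needed(gz, 90_000))
```

There is no way to start in the middle: the cost of reaching an offset grows with the offset itself, which is what makes random access into a tar.gz expensive.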

remram 1590 days ago

S3 is "eventually consistent", so I don't think indexing in the background would be such a big deal. But yeah, like I said, this would only work for no compression or for schemes that are seekable (not gzip).

In any case it is definitely a lot more work than ZIP.

klauspost 1590 days ago

No, S3, like MinIO, has read-after-write consistency.

So indexing would block either writes or reads until it is done. We block when doing the ZIP indexing, but that is much more lightweight - and we limit the ZIP directory to 100MB. That way we don't risk long-blocking index operations.
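A rough sketch of how such a size check could look, not MinIO's actual code: the ZIP end-of-central-directory record at the tail of the file states the directory size, so it can be compared against a limit before any indexing starts. The names `read_eocd`, `directory_within_limit` and `MAX_DIR_BYTES` are made up for this example, and it assumes a non-ZIP64 archive.

```python
import io
import struct
import zipfile

EOCD_SIG = b"PK\x05\x06"           # end-of-central-directory signature
EOCD_LEN = 22                      # fixed part of the record, without comment
MAX_DIR_BYTES = 100 * 1024 * 1024  # the 100MB limit mentioned above


def read_eocd(tail: bytes):
    """Parse the EOCD record from the tail of an archive.
    Returns (entry count, directory size, directory offset)."""
    pos = tail.rfind(EOCD_SIG)
    if pos < 0 or pos + EOCD_LEN > len(tail):
        raise ValueError("no end-of-central-directory record found")
    fields = struct.unpack("<4s4H2LH", tail[pos:pos + EOCD_LEN])
    _sig, _disk, _cd_disk, _n_disk, n_total, cd_size, cd_offset, _comment_len = fields
    return n_total, cd_size, cd_offset


def directory_within_limit(tail: bytes, limit: int = MAX_DIR_BYTES) -> bool:
    """True if the central directory is small enough to index."""
    _count, cd_size, _offset = read_eocd(tail)
    return cd_size <= limit


def sample_zip() -> bytes:
    """Small in-memory archive so the example is self-contained."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("a.txt", "hello")
        zf.writestr("b.txt", "world")
    return buf.getvalue()


if __name__ == "__main__":
    data = sample_zip()
    # Only the tail is needed: the record plus the largest possible comment.
    print(read_eocd(data[-(EOCD_LEN + 65535):]))
```

Only the last few kilobytes of the object have to be read to make the decision, so the cost of rejecting an oversized directory stays small and predictable.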

remram 1589 days ago

I see. Indeed that is a potentially long time to block.