by klauspost 1590 days ago
We considered TAR, but indexing requires reading back and decompressing the entire archive.

This may be feasible for small TAR files, and for a single PutObject you could index while uploading. However, for multipart uploads the parts can arrive in any order, so you are forced to read the object back once it is complete. This would lead to unpredictable response times.

Compare that to reading the directory of a zip, which even for big files is a couple of megabytes at most.
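
To make the zip side concrete, here is a minimal Python sketch (not MinIO's code; `CountingBytesIO` and `list_zip_entries` are names made up for illustration). It lists every entry with its offset using the standard `zipfile` module and counts how many bytes of the archive were actually read, assuming the archive has no trailing comment or ZIP64 quirks:

```python
import io
import zipfile


class CountingBytesIO(io.BytesIO):
    """In-memory file that counts how many bytes are actually read from it."""

    def __init__(self, data: bytes):
        super().__init__(data)
        self.bytes_read = 0

    def read(self, size=-1):
        chunk = super().read(size)
        self.bytes_read += len(chunk)
        return chunk


def list_zip_entries(fileobj):
    """Return (name, header_offset, compressed_size) for every entry.

    Opening the archive parses only the end-of-central-directory record
    and the central directory itself; no entry data is read or decompressed.
    """
    with zipfile.ZipFile(fileobj) as zf:
        return [
            (info.filename, info.header_offset, info.compress_size)
            for info in zf.infolist()
        ]
```

After calling `list_zip_entries(src)` on a `CountingBytesIO`, `src.bytes_read` should be a tiny fraction of the archive size, since only the tail of the file is touched.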

On top of that, tar.gz requires you to decompress from the start to reach any offset. You could recompress while indexing, but an object store mutating your data is IMO a no-no.
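
For comparison, a rough sketch of what building a tar.gz index looks like in Python (illustrative only; `index_tar_gz` is a hypothetical helper, not anything MinIO ships). There is no directory to jump to, so the only way to learn where each member lives is to walk the whole stream from the first byte:

```python
import tarfile


def index_tar_gz(fileobj):
    """Walk a .tar.gz stream once and return [(name, data_offset, size)].

    The offsets are positions in the *uncompressed* tar stream. Finding
    them means decompressing everything in front of each member, and
    using them later still means decompressing from the start of the gzip.
    """
    index = []
    # "r|gz" is tarfile's forward-only streaming mode: the input is never
    # seeked, every byte is decompressed in order.
    with tarfile.open(fileobj=fileobj, mode="r|gz") as tar:
        for member in tar:
            if member.isfile():
                index.append((member.name, member.offset_data, member.size))
    return index
```

The cost of this loop grows with the size of the archive, not with the number of entries, which is where the unpredictable response times come from.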

S3 is "eventually consistent", so I don't think indexing in the background would be such a big deal. But yeah, like I said, this would only work for uncompressed archives or for compression schemes that are seekable (not gzip).

In any case it is definitely a lot more work than ZIP.

No, S3, like MinIO, has read-after-write consistency.

So indexing would block either writes or reads until it is done. We block when doing the zip indexing, but that is much more lightweight - and we limit the ZIP directory to 100MB. That way we don't risk long-blocking index operations.
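
A size cap like that can be checked before any real work is done, because the zip end-of-central-directory (EOCD) record states how big the directory is. The Python sketch below is an assumption about how such a guard could look, not MinIO's implementation; the function names are hypothetical, and it ignores ZIP64 archives, where the 32-bit size field is not authoritative:

```python
import struct

EOCD_SIGNATURE = b"PK\x05\x06"
# signature, disk no., directory disk, entries on disk, total entries,
# directory size, directory offset, comment length
EOCD_FORMAT = "<4sHHHHIIH"
EOCD_SIZE = struct.calcsize(EOCD_FORMAT)  # 22 bytes
MAX_COMMENT = 0xFFFF  # the EOCD may be followed by a comment of up to 64 KiB
MAX_DIRECTORY_BYTES = 100 * 1024 * 1024  # the 100MB cap from the comment above


def central_directory_size(tail: bytes) -> int:
    """Return the directory size recorded in the EOCD.

    `tail` is the last EOCD_SIZE + MAX_COMMENT bytes of the object (or the
    whole object if it is smaller). ZIP64 is not handled in this sketch.
    """
    pos = tail.rfind(EOCD_SIGNATURE)
    if pos < 0 or pos + EOCD_SIZE > len(tail):
        raise ValueError("no end-of-central-directory record found")
    fields = struct.unpack_from(EOCD_FORMAT, tail, pos)
    return fields[5]


def may_index(tail: bytes, limit: int = MAX_DIRECTORY_BYTES) -> bool:
    """Decide whether the directory is small enough to index while blocking."""
    return central_directory_size(tail) <= limit
```

Only the last ~64 KiB of the object has to be fetched to make this decision, so the blocking time is bounded regardless of how large the archive is.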

I see. Indeed that is a potentially long time to block.