Hacker News new | ask | show | jobs
by blacha 1578 days ago
This is basically exactly what we do we have created a cloud optimised tar (cotar)[1] by creating a hash index of the files inside the tar.

I work with serving tiled geospatial data [2] (Mapbox vector tiles) to our users as slippy maps where we serve millions of small (mostly <100KB) files to our users, our data only changes weekly so we precompute all the tiles and store them in a tar file in s3.

We compute a index for the tar file then use s3 range requests to serve the tiles to our users, this means we can generally fetch a tile from s3 with 2 (or 1 if the index is cached) requests to s3 (generally ~20-50ms).

To get full coverage of the world with map box vector tiles it is around 270M tiles and a ~90GB tar file which can be computed from open street map data [3]

> Though even that would only work with a subset of compression methods or no compression.

We compress the individual files as a work around, there are options for indexing a compressed (gzip) tar file but the benefits of a compressed tar vs compressed files are small for our use case

[1] https://github.com/linz/cotar (or wip rust version https://github.com/blacha/cotar-rs) [2] https://github.com/linz/basemaps or https://basemaps.linz.govt.nz [3] https://github.com/onthegomap/planetiler

1 comments

Why not upload those files separately, or in ZIP format?
> Why not upload those files separately,

Doing S3 put requests for 260M files every week would cost around $1300 USD/week which was too much for our budget

> or in ZIP format?

We looked at zip's but due to the way the header (well central file directory) was laid out it mean that finding a specific file inside the zip would require the system to download most of the CFD.

The zip CFD is basically a list of header entries where they vary in size of 30 bytes + file_name length, to find a specific file you have to iterate the CFD until you find the file you want.

assuming you have a smallish archive (~1 million files) the CFD for the zip would be somewhere in the order of 50MB+ (depending on filename length)

Using a hash index you know exactly where in the index you need to look for the header entry, so you can use a range request to load the header entry

  offset = hash(file_name) % slot_count
Another file format which is gaining popularity recently is PMTiles[1] which uses tree index, however it is specifically for tiled geospatial data.

[1] https://github.com/protomaps/PMTiles

Nice tools!

When it is serverside, reading a 50MB CFD is a small task. And once it is read we can store the zipindex for even faster access.

We made 'zipindex' to purposely be a sparse, compact, but still reasonably fast representation of the CFD - just enough to be able to serve the file. Typically it is around a 8:1 reduction on the CFD, but it of course depends a lot on your file names as you say (the index is zstandard compressed).

Access time from fully compressed data to a random file entry is around 100ms with 1M files. Obviously if you keep the index in memory, it is much less. This time is pretty much linear which is why we recommend aiming for 10K file per archive, which makes the impact pretty minimal.

You mean the cost of the PUT requests becomes significant. That makes sense since AWS doesn't charge for incoming bandwidth. Thanks!