by l3x 636 days ago
From the FAQs on GitHub [1]

> What about PMTiles?

> I would have loved to use PMTiles; they are a brilliant idea!

> Unfortunately, making range requests in 80 GB files just doesn't work in production. It is fine for files smaller than 500 MB, but it has terrible latency and caching issues for full planet datasets.

> If PMTiles implements splitting to <10 MB files, it can be a valid alternative to running servers.

[1] https://github.com/hyperknot/openfreemap

4 comments

That's an interesting claim. I make range requests to 100GB+ files (genomics) all the time for work and it works great. I've never considered total file size as directly related to latency in this respect, assuming you have some sort of an index of course.
You can test this claim directly against an AWS S3 bucket.

First 100KB of a 100GB+ file:

curl -H "Range: bytes=0-100000" https://overturemaps-tiles-us-west-2-beta.s3.amazonaws.com/2... --output tmp -w "%{time_total}"

First 100KB at the 100GB mark:

curl -H "Range: bytes=100000000000-100000100000" https://overturemaps-tiles-us-west-2-beta.s3.amazonaws.com/2... --output tmp -w "%{time_total}"
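
For anyone who wants to repeat the comparison at several offsets and request sizes, here is a rough Python sketch of the same measurement. The helper names `range_header` and `time_range_request` are made up for illustration, and you have to supply the full object URL yourself. Note that HTTP byte ranges are inclusive on both ends, so a request for N bytes ends at offset + N - 1.

```python
import time
import urllib.request


def range_header(offset: int, length: int) -> str:
    """Build a Range header value for `length` bytes starting at `offset`.

    HTTP byte ranges are inclusive on both ends, so the last byte
    requested is offset + length - 1.
    """
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    return f"bytes={offset}-{offset + length - 1}"


def time_range_request(url: str, offset: int, length: int) -> float:
    """Return the seconds taken to fetch one byte range from `url`."""
    request = urllib.request.Request(
        url, headers={"Range": range_header(offset, length)}
    )
    start = time.perf_counter()
    with urllib.request.urlopen(request) as response:
        response.read()  # include the body transfer in the timing
    return time.perf_counter() - start
```

Calling `time_range_request(url, 0, 100_000)` and `time_range_request(url, 100_000_000_000, 100_000)` against the same object, a few times each, should show whether the offset matters on a given storage platform.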

Here the requests are really small, averaging 405 bytes each. I guess in your genomics work you are making larger requests, so it's probably not much of an issue there.

BTW, we are discussing latency with bdon in this issue; it seems to be specific to Cloudflare: https://github.com/hyperknot/openfreemap/issues/16

I just tried @bmon's curl examples above with 100 byte requests. Similar results. I think the Cloudflare explanation is more likely.
If you store the PMTiles in S3 or any other object store that supports HTTP Range Requests, that's a no-brainer... On a normal disk on your own server, this might become interesting, yes.
OK, except "full planet datasets" make little sense for terrestrial features. Splitting (aka sharding) the files by continent would make SO much sense. Asia is big, but there would be no requests for Africa mixed in. Australia would be manageable?
PMTiles could come up with a version in the future where instead of one 90 GB file, they have 9 thousand 10 MB files. That would work well I believe.
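
As a quick sanity check on those numbers, here is the arithmetic as a small Python sketch. It assumes decimal units (1 GB = 1000 MB), and `shard_count` is a hypothetical helper, not anything from PMTiles.

```python
GB = 1000**3  # decimal gigabyte, in bytes
MB = 1000**2  # decimal megabyte, in bytes


def shard_count(total_bytes: int, shard_bytes: int) -> int:
    """Number of fixed-size shards needed to hold total_bytes (rounds up)."""
    return -(-total_bytes // shard_bytes)


print(shard_count(90 * GB, 10 * MB))  # 9000
```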
The latency for small files and ranges of large files is pretty similar on most storage platforms, but there are some exceptions like Cloudflare R2.

The main reason PMTiles is one file and not two or more files is that it enables atomic updates in-place (which every mature storage platform supports) as well as ETag content change detection in downstream caches. All of the server and serverless implementations at http://github.com/protomaps support this now for AWS, S3-compatible storage, Google Cloud, and Azure.
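
To illustrate the ETag point: with a single file, a downstream cache only has to track one ETag, and a change to it means every cached byte range is stale. The sketch below is a toy in-memory cache under that assumption; `RangeCache` is a hypothetical name and this is not how any particular Protomaps implementation is written.

```python
from typing import Dict, Optional, Tuple


class RangeCache:
    """Cache byte ranges of one remote file, invalidated when its ETag changes."""

    def __init__(self) -> None:
        self.etag: Optional[str] = None
        self.ranges: Dict[Tuple[int, int], bytes] = {}

    def store(self, etag: str, offset: int, length: int, data: bytes) -> None:
        """Record a fetched range together with the ETag the server returned."""
        if etag != self.etag:
            # The file was replaced upstream, so every cached range is stale.
            self.ranges.clear()
            self.etag = etag
        self.ranges[(offset, length)] = data

    def lookup(self, offset: int, length: int) -> Optional[bytes]:
        """Return the cached bytes for this exact range, or None on a miss."""
        return self.ranges.get((offset, length))
```

With many small files, each one carries its own ETag, so detecting that a whole dataset changed takes more bookkeeping than this.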

Now I'm curious, what causes the latency for range requests with R2?
I don't have any insight into this other than observing how their storage system works, but here are some scripts I made last year to test:

https://github.com/bdon/cloudflare-r2-latency

Range requests mean work and logic. Getting a file requires no logic.

Also, I'm pretty sure range requests are going to be difficult to cache. That implies going to origin on every request, which is bad.