by taldo 1007 days ago
A very simple optimization for those complaining about having to fetch a large file every time you need a little datapoint: if they promised the file was append-only, and used HTTP gzip/brotli/whatever compression (as opposed to shipping a zip file), you could use range requests to only get the new data after your last refresh. Throw in an extra checksum header for peace of mind, and you have a pretty efficient, yet extremely simple incremental API.

(Yes, this assumes you keep the state, and you have to pay the price of the first download + state-keeping. Yes, it's also inefficient if you just need to get the EUR/JPY rate from 2007-08-22 a single time.)
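For illustration, here is a minimal sketch of the client side of that idea in Python, using only the standard library. The names `range_header`, `merge`, and `refresh` are made up for this example, and it assumes the server honors `Range` on an append-only file. To keep byte offsets simple it requests the uncompressed representation (`Accept-Encoding: identity`), which sidesteps the compression part of the suggestion; how ranges interact with compressed responses depends on the server.

```python
import os
import urllib.error
import urllib.request


def range_header(offset: int) -> dict:
    """Ask for everything from byte `offset` to the end of the file."""
    return {"Range": f"bytes={offset}-"}


def merge(local: bytes, status: int, body: bytes) -> bytes:
    """Combine the local copy with the server's reply."""
    if status == 206:  # Partial Content: body is only the new tail
        return local + body
    if status == 200:  # server ignored Range: body is the whole file
        return body
    if status == 416:  # Range Not Satisfiable: nothing new since last refresh
        return local
    raise ValueError(f"unexpected status {status}")


def refresh(url: str, path: str) -> None:
    """Bring the local copy at `path` up to date from `url` (hypothetical helper)."""
    local = b""
    if os.path.exists(path):
        with open(path, "rb") as f:
            local = f.read()
    # Assumption: identity encoding, so offsets match the plain file on disk.
    headers = {**range_header(len(local)), "Accept-Encoding": "identity"}
    req = urllib.request.Request(url, headers=headers)
    try:
        with urllib.request.urlopen(req) as resp:
            status, body = resp.status, resp.read()
    except urllib.error.HTTPError as e:
        # urlopen raises on 4xx/5xx; 416 is handled in merge(), others raise there.
        status, body = e.code, b""
    with open(path, "wb") as f:
        f.write(merge(local, status, body))
```

The "state" mentioned above is just the local file: its length is the offset for the next request. A checksum header, if the server offered one, could be verified after `merge` to catch a file that was rewritten rather than appended to.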

3 comments

Absolutely! I have a plan for a client lib that uses ETags (+ other tricks) to do just that.

Very WIP but check out my current "research quality" code here: https://pypi.org/project/csvbase-client/

Also, on the topic of range requests: when a server supports range requests for zip files, the zip files are huge, and you need just a few files from them, you can download just the "central directory" and the compressed data of the files you need, without downloading the whole zip file:

https://github.com/gtsystem/python-remotezip
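To show why this works, here is a rough sketch using only Python's standard `zipfile` module. It is not the python-remotezip API. `CountingReader` and `build_archive` are made-up names, and the in-memory reader stands in for a remote file where each `read()` would be one range request. The point is that `zipfile` only touches the end of the archive (the central directory) and the bytes of the member you ask for.

```python
import io
import os
import zipfile


class CountingReader(io.BytesIO):
    """In-memory stand-in for a remote file; counts the bytes actually read."""

    def __init__(self, data: bytes):
        super().__init__(data)
        self.bytes_read = 0

    def read(self, size=-1):
        chunk = super().read(size)
        self.bytes_read += len(chunk)
        return chunk


def build_archive() -> bytes:
    """Build a test zip: one small file between two large incompressible ones."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("big-a.bin", os.urandom(500_000))
        zf.writestr("small.txt", b"small file\n")
        zf.writestr("big-b.bin", os.urandom(500_000))
    return buf.getvalue()


archive = build_archive()
remote = CountingReader(archive)
with zipfile.ZipFile(remote) as zf:
    # Only the central directory and this one member are read.
    content = zf.read("small.txt")

print(f"archive size: {len(archive)} bytes, bytes read: {remote.bytes_read}")
```

Replace the in-memory reader with a file-like object that turns `seek` + `read` into HTTP range requests and you get the same saving over the network.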

Or, just serve a bunch of diff files. Just having a single daily patch can drastically reduce the bandwidth required to keep the file up to date on your side.

That's if downloading a few hundred kB more per day matters to you. It probably doesn't.
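A toy sketch of the diff-file idea, assuming line-oriented data and a made-up patch format (a list of "copy these old lines" and "add these new lines" steps). `make_patch` and `apply_patch` are hypothetical names; a real service would more likely ship standard unified diffs.

```python
import difflib


def make_patch(old: list[str], new: list[str]) -> list[tuple]:
    """Server side: describe `new` as copies from `old` plus literal new lines."""
    patch = []
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            patch.append(("copy", i1, i2))     # client already has these lines
        elif tag in ("replace", "insert"):
            patch.append(("add", new[j1:j2]))  # only these lines need to travel
        # "delete" needs no entry: those old lines are simply never copied
    return patch


def apply_patch(old: list[str], patch: list[tuple]) -> list[str]:
    """Client side: rebuild the new file from the old copy and the patch."""
    out = []
    for op in patch:
        if op[0] == "copy":
            out.extend(old[op[1]:op[2]])
        else:
            out.extend(op[1])
    return out
```

For an append-only file the daily patch degenerates to one `copy` of everything you have plus one `add` with the new rows.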