| HN Mirror



	by taldo 1007 days ago
	A very simple optimization for those complaining about having to fetch a large file every time they need a single data point: if the publisher promised the file was append-only, and used HTTP gzip/brotli/whatever compression (as opposed to shipping a zip file), you could use range requests to fetch only the data added since your last refresh. Throw in an extra checksum header for peace of mind, and you have a pretty efficient, yet extremely simple, incremental API. (Yes, this assumes you keep state, and you have to pay the price of the first download plus the state-keeping. Yes, it's also inefficient if you just need to get the EUR/JPY rate from 2007-08-22 a single time.)
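
A rough sketch of the range-request idea, using only the Python standard library. The function names and the example URL are made up for illustration, and it assumes the server honours `Range` on the uncompressed file, so the local file size can be used directly as the byte offset (compressed transfers would need more care). The checksum header is left out, since the comment doesn't name one.

```python
import urllib.error
import urllib.request
from pathlib import Path


def range_header(offset: int) -> dict[str, str]:
    """Ask for every byte from `offset` to the end of the remote file."""
    return {"Range": f"bytes={offset}-"}


def apply_response(local: Path, status: int, body: bytes) -> None:
    """Merge one HTTP response into the local copy."""
    if status == 206:
        # Partial Content: the body holds only the bytes past our offset.
        with local.open("ab") as f:
            f.write(body)
    elif status == 200:
        # The server ignored the Range header and sent the whole file.
        local.write_bytes(body)
    else:
        raise ValueError(f"unexpected HTTP status {status}")


def refresh(url: str, local: Path) -> None:
    """Bring `local` up to date, assuming the remote file is append-only."""
    offset = local.stat().st_size if local.exists() else 0
    request = urllib.request.Request(url, headers=range_header(offset))
    try:
        with urllib.request.urlopen(request) as response:
            apply_response(local, response.status, response.read())
    except urllib.error.HTTPError as err:
        # 416 Range Not Satisfiable usually means there are no bytes past
        # our offset, i.e. nothing was appended since the last refresh.
        if err.code != 416:
            raise
```

Usage would be something like `refresh("https://example.com/rates.csv", Path("rates.csv"))`, run as often as you like; the state you keep is just the local file.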

3 comments

calpaterson 1007 days ago

Absolutely! I have a plan for a client lib that uses ETags (+ other tricks) to do just that.

Very WIP but check out my current "research quality" code here: https://pypi.org/project/csvbase-client/
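
For readers unfamiliar with the ETag trick, here is a minimal sketch of a conditional GET with the standard library. This is not csvbase-client's API; `conditional_headers` and `fetch_if_changed` are hypothetical names, and it assumes the server sends an `ETag` header and honours `If-None-Match`.

```python
import urllib.error
import urllib.request
from typing import Optional


def conditional_headers(etag: Optional[str]) -> dict[str, str]:
    """Send the cached ETag back so the server can answer 304 Not Modified."""
    return {"If-None-Match": etag} if etag else {}


def fetch_if_changed(url: str, etag: Optional[str]):
    """Return (body, new_etag), or None if the cached copy is still current."""
    request = urllib.request.Request(url, headers=conditional_headers(etag))
    try:
        with urllib.request.urlopen(request) as response:
            return response.read(), response.headers.get("ETag")
    except urllib.error.HTTPError as err:
        # urllib reports 304 as an HTTPError; here it means "nothing changed".
        if err.code == 304:
            return None
        raise
```

The client stores the returned ETag next to the cached file and passes it in on the next call, so an unchanged file costs one small round trip instead of a full download.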


acqq 1007 days ago

Also, on the topic of range requests: when a server allows range requests for zip files, the zip files are huge, and you need just a few files from them, you can download only the "central directory" and the compressed data of the files you need, without downloading the whole zip file:

https://github.com/gtsystem/python-remotezip
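
To see why this works, here is a small offline sketch with the standard `zipfile` module. It is not how the linked library is necessarily implemented: `CountingBytesIO` and `build_archive` are made-up helpers, and the in-memory buffer stands in for a remote file where each `read` would be one range request.

```python
import io
import zipfile


class CountingBytesIO(io.BytesIO):
    """In-memory stand-in for a remote file; counts the bytes actually read."""

    def __init__(self, data: bytes) -> None:
        super().__init__(data)
        self.bytes_read = 0

    def read(self, size=-1):
        chunk = super().read(size)
        self.bytes_read += len(chunk)
        return chunk


def build_archive() -> bytes:
    """Create a zip with one large member and one small one (made-up data)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_STORED) as zf:
        zf.writestr("big.bin", b"\0" * 1_000_000)
        zf.writestr("small.txt", b"hello")
    return buffer.getvalue()


archive = build_archive()
remote = CountingBytesIO(archive)
with zipfile.ZipFile(remote) as zf:
    # zipfile seeks to the central directory at the end of the archive,
    # then jumps straight to the requested member's data.
    content = zf.read("small.txt")

print(content, remote.bytes_read, len(archive))
```

The byte count printed for the reads should be a tiny fraction of the archive size, because the large member is never touched.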


GuB-42 1007 days ago

Or, just serve a bunch of diff files. Just having a single daily patch can drastically reduce the bandwidth required to keep the file up to date on your side.

That's if downloading a few hundred kB more per day matters to you. It probably doesn't.
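
For illustration, a minimal sketch of the daily-patch idea using `difflib` from the standard library. The patch format here (a list of `(start, end, replacement_lines)` tuples) is invented for this example and is not a unified diff; `make_patch` and `apply_patch` are hypothetical names.

```python
import difflib

# One entry per changed region: replace old[start:end] with the given lines.
Patch = list[tuple[int, int, list[str]]]


def make_patch(old: list[str], new: list[str]) -> Patch:
    """Server side: record only the regions where `new` differs from `old`."""
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    return [
        (i1, i2, new[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


def apply_patch(old: list[str], patch: Patch) -> list[str]:
    """Client side: rebuild the new file from the old copy plus the patch."""
    result = list(old)
    # Apply from the end so that earlier indices stay valid.
    for start, end, replacement in reversed(patch):
        result[start:end] = replacement
    return result


yesterday = ["date,rate", "d1,1.0"]           # made-up rows
today = ["date,rate", "d1,1.0", "d2,1.1"]
patch = make_patch(yesterday, today)
assert apply_patch(yesterday, patch) == today
```

For an append-only file the patch contains just the new rows, but unlike plain appending this also handles corrections to earlier lines.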
