by kshay 1254 days ago
> There's nothing in the underlying HTTP response from https://csvbase.com/meripaterson/stock-exchanges that tells me I can get an HTML or CSV version. Is there a JSON version available? What other variants exist? How do I know that this URL will deliver different responses?

Well, you can send a HEAD request with a given accept: header to find out what you’ll get without actually fetching the data. But it’s true that it would be nice to have the full set of possible responses advertised somehow.
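A minimal sketch of that HEAD probe in Python, using only the standard library. The helper names are made up for illustration, and it assumes the server answers HEAD with the same Content-Type it would use for GET, which this thread does not confirm for csvbase.

```python
import urllib.request


def build_head_request(url: str, accept: str) -> urllib.request.Request:
    """Build a HEAD request that asks for one specific media type."""
    return urllib.request.Request(url, headers={"Accept": accept}, method="HEAD")


def probe_content_type(url: str, accept: str) -> str:
    """Send the HEAD request and report the Content-Type the server chose."""
    request = build_head_request(url, accept)
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.headers.get("Content-Type", "")


# Example (performs network I/O, so it is left commented out):
# for media_type in ("text/csv", "text/html", "application/json"):
#     url = "https://csvbase.com/meripaterson/stock-exchanges"
#     print(media_type, "->", probe_content_type(url, media_type))
```

This only tells you what you get for the media types you thought to ask about; it does not enumerate the variants.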


I should have mentioned the HTTP 'Vary' response header, that servers can use to inform the client that its response was based upon some of the headers that it sent.

A 'Vary: accept' response header gives a hint that it could have supplied a different response had you given a different Accept: header. But I don't think that there's a way to actually list the variants available in the HTTP spec?

My reading of the HTTP spec suggests that csvbase.com is behaving incorrectly by not setting the Vary header properly (it sends: 'Vary: Accept-Encoding', but it should also list 'Accept' in there too). Potentially, a proxy server could decide to cache the CSV or HTML response, and then serve that version back to another client instead of the 'right' one, because the server didn't correctly report that the response varies based upon 'Accept'. In practice, this isn't likely to happen unless you've got a caching proxy that is also unwrapping the encryption between itself and your HTTP client.
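To make the mis-caching scenario concrete, here is a rough sketch of how a shared cache might derive its key from the Vary header. It is deliberately simplified (real caches also normalise header values and treat `Vary: *` specially), and the function name and the two sample clients are hypothetical.

```python
def cache_key(url, request_headers, vary):
    """Build a cache key from the URL plus the request headers named in Vary."""
    names = sorted(h.strip().lower() for h in vary.split(",") if h.strip())
    lowered = {k.lower(): v for k, v in request_headers.items()}
    return (url,) + tuple((name, lowered.get(name, "")) for name in names)


URL = "https://csvbase.com/meripaterson/stock-exchanges"

# Two hypothetical clients that differ only in what they accept.
html_client = {"Accept": "text/html", "Accept-Encoding": "gzip"}
csv_client = {"Accept": "text/csv", "Accept-Encoding": "gzip"}

# With the Vary value reported above, both requests share one cache entry,
# so whichever representation was stored first is served to both.
collides = (
    cache_key(URL, html_client, "Accept-Encoding")
    == cache_key(URL, csv_client, "Accept-Encoding")
)

# Adding Accept to Vary gives each representation its own entry.
distinct = (
    cache_key(URL, html_client, "Accept-Encoding, Accept")
    != cache_key(URL, csv_client, "Accept-Encoding, Accept")
)
```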

> My reading of the HTTP spec suggests that csvbase.com is behaving incorrectly by not setting the Vary header properly

This is a good idea, I will certainly look at this. There are some planned features WRT caching coming up.

To address the comments you made in GP:

> How do I know that I can get csv files from that URL

That is a good question. The web UI could be better, of course. But programmatically, how do you advertise alternate representations? I'm not sure. Suggestions appreciated.

> Will the website always default to csv files or will my app break when they decide that XML is superior? (Well, obviously not, especially for a site called csvbase!)

As you say: csvbase won't change :)

But the other thing is that the HTTP client you use could decide to change its default Accept header. If curl changed to "application/json,q=0.9;*/*" then suddenly you'd get json (I didn't mention in the blog post but that is also implemented)!

Oh dear. Perhaps a good idea to include the file extension or explicit Accept header when you're coding something that needs to last. But I do think it's nice to be able to copy and paste into pandas. That's my main usability case and I wanted that to be as smooth as possible.
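For illustration, a simplified sketch of server-side negotiation over the Accept header, showing why a changed client default would flip the result. This is not csvbase's actual code: the offered types, their order, and the fall-back-to-default behaviour are all assumptions, and the parser ignores every parameter except q.

```python
def parse_accept(header):
    """Return (media_range, q) pairs from an Accept header."""
    ranges = []
    for part in header.split(","):
        pieces = [p.strip() for p in part.split(";")]
        if not pieces[0]:
            continue
        q = 1.0  # q defaults to 1 when absent
        for param in pieces[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        ranges.append((pieces[0].lower(), q))
    return ranges


def matches(media_range, media_type):
    """True if a range such as text/* or */* covers the concrete type."""
    if media_range == "*/*":
        return True
    r_type, _, r_sub = media_range.partition("/")
    m_type, _, m_sub = media_type.partition("/")
    return r_type == m_type and r_sub in ("*", m_sub)


def specificity(media_range):
    """Exact types beat type/*, which beats */*."""
    if media_range == "*/*":
        return 0
    return 1 if media_range.endswith("/*") else 2


def quality(ranges, media_type):
    """q of the most specific range matching media_type, or 0 if none match."""
    candidates = [(specificity(r), q) for r, q in ranges if matches(r, media_type)]
    return max(candidates)[1] if candidates else 0.0


def choose(header, offered, default):
    """Pick the offered type with the highest q; ties go to the earliest offered."""
    ranges = parse_accept(header)
    best, best_q = None, 0.0
    for media_type in offered:
        q = quality(ranges, media_type)
        if q > best_q:
            best, best_q = media_type, q
    return best if best is not None else default


# Assumed server preference order, CSV first.
OFFERED = ["text/csv", "text/html", "application/json"]
```

With these assumptions, `choose("*/*", OFFERED, "text/csv")` gives CSV, while a client that starts sending `application/json, */*;q=0.9` gets JSON from the same URL.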

> This is a good idea, I will certainly look at this. There are some planned features WRT caching coming up.

While it is probably a bug, it's probably not a serious one that many people would run into nowadays. Now that https is ubiquitous, there aren't many caching proxies around to cause grief. Probably the only proxies people will experience are where they are behind a paranoid company's firewall, one that is configured to decrypt (and then re-encrypt) all their web traffic. And in those situations, they don't tend to do caching much now. (Because even though you can cache HTTP, you'll hit problems with misconfigured sites and users will blame your proxy for it.)

> But programmatically, how do you advertise alternate representations? I'm not sure. Suggestions appreciated.

Sorry, I don't have a good answer for this. I only nit-pick problems in web comments :)

You could set an HTTP header to list the available variants, but there isn't a standard AFAIK, so it would only help developers who spotted the header.
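One existing header that could plausibly carry such a list is Link (RFC 8288) with rel="alternate" and a type attribute, although clients would still have to know to look for it. A small sketch of building that header value follows; the helper name is hypothetical, and the ".json" URL is an assumption (only the file-extension form for CSV is mentioned in this thread).

```python
def alternate_links(base_url, variants):
    """Format a Link header value advertising one alternate URL per variant."""
    return ", ".join(
        f'<{base_url}.{ext}>; rel="alternate"; type="{media_type}"'
        for ext, media_type in variants.items()
    )


# Assumed extension-to-media-type mapping, for illustration only.
VARIANTS = {"csv": "text/csv", "json": "application/json"}

link_header = alternate_links(
    "https://csvbase.com/meripaterson/stock-exchanges", VARIANTS
)
```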

> But the other thing is that the HTTP client you use could decide to change its default Accept header. If curl changed to "application/json,q=0.9;*/*" then suddenly you'd get json (I didn't mention in the blog post but that is also implemented)!

That's cool! Aeons ago, I was involved in developing a web server, where we added support for properly handling all kinds of content negotiation (Accept-Encoding, Accept-Language, etc), where you could configure it to deliver the right file based on the user's language, file type preference, etc. It was a large chunk of code, but in the end, nobody really used it. In theory, web browsers and sites could co-operate to deliver the right page in the right language for all their users automatically. In practice though, it never works. No-one sets up their web browser to pick the language properly (who even knows how to change it?). As a result, multi-lingual sites offer to switch languages by clicking on a link, and if they choose a default language, they mostly do it based on IP address (and assumed location).

> That's my main usability case and I wanted that to be as smooth as possible.

I think it's the right choice for csvbase. My original comment reads far too critical in retrospect; it's neat that if you curl a URL, you get the csv. But if I were writing code to scrape some csv data, I would still always prefer to download URLs with a .csv extension, because you know what you are getting 100% of the time, and you avoid any unpleasant surprises if some 3rd-party library or tool changes its behaviour.

> Now that https is ubiquitous, there aren't many caching proxies around to cause grief.

Well, there are still CDNs. csvbase is designed so that some pages can be served from a public cache. I haven't done much on this except for the blog pages, which use the CDN a lot.

I also have vague plans for client libraries that include a caching forward proxy as my experience is that most people export the same tables repeatedly. Likely that will be based on etags though so that the cache is always validated.
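Roughly, an etag-validated cache could look like the sketch below. Everything here is hypothetical: the `ETagCache` class, the `fetch(url, headers)` transport hook, and the fake origin are illustrations of revalidation with If-None-Match and 304 Not Modified, not a description of any planned csvbase client.

```python
class ETagCache:
    """Cache bodies by URL, revalidating with the origin on every read."""

    def __init__(self, fetch):
        # fetch(url, headers) -> (status, response_headers, body)
        self._fetch = fetch
        self._store = {}  # url -> (etag, body)

    def get(self, url):
        headers = {}
        cached = self._store.get(url)
        if cached:
            # Ask the origin to send the body only if it has changed.
            headers["If-None-Match"] = cached[0]
        status, response_headers, body = self._fetch(url, headers)
        if status == 304 and cached:
            return cached[1]  # still fresh: reuse the stored body
        etag = response_headers.get("ETag")
        if status == 200 and etag:
            self._store[url] = (etag, body)
        return body


def fake_origin(url, headers):
    """Stand-in server holding one table version with a fixed ETag."""
    fake_origin.calls.append(dict(headers))
    if headers.get("If-None-Match") == '"v1"':
        return 304, {"ETag": '"v1"'}, b""
    return 200, {"ETag": '"v1"'}, b"a,b\n1,2\n"


fake_origin.calls = []

cache = ETagCache(fake_origin)
first = cache.get("https://example.invalid/table.csv")   # full 200 response
second = cache.get("https://example.invalid/table.csv")  # 304, served from cache
```

Every read still costs a round trip, but a repeated export of an unchanged table transfers no body.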

The designers of HTTP 1.1 clearly thought a lot about a lot of things, including caches.

Thanks for your thoughts. :) Keep in touch via email if you like (same goes for anyone else reading this): cal@calpaterson.com

Yeah, I guess maybe this is what a 300 Multiple Choices[1] response was intended for but that seems to be underspecified and I’ve never seen it used.

[1] https://www.rfc-editor.org/rfc/rfc7231#section-6.4.1

I vaguely remember a web server set up for language content negotiation failing to determine which version to send and giving me a list of links to the individual language versions instead.

I think it was Apache and the negotiable resource was called X.html while the individual linked versions had names like X.en.html etc.

Might that have been a 300 response?

Without sending "Vary: Accept" the server might have its response mis-cached by a proxy. A request from a browser could populate the cache with HTML, which could then serve HTML in response to a request that wants CSV. Any time you vary your response based on a request header, the spec says you should list it in your Vary response header.

In practice, with the move to HTTPS this rarely comes up anymore outside the sending company's internal infrastructure. Basically no one is running client-side caches that are shared between multiple consumers.