|
|
|
|
|
by zepearl
1829 days ago
|
|
All extremely useful: the overview, the examples and the comments. A few months ago while writing a bot/crawler I searched for hours for something like this, but I found only full specs or just bits and pieces scattered around that used different terminology and/or had different opinions. In the end I didn't even clearly understand what should be the max total URL length (e.g. mixed opinions here https://stackoverflow.com/questions/417142/what-is-the-maxim... - come on, a xGiB long URL?) => most of the time 2000 bytes is mentioned but it's not 100% clear. Writing a bot made me understand 1) why browsers are so complicated and 2) that the Internet is a mess (e.g. once I even found a page that used multiple character encodings...). My personal opinion is that everything is too lax. Browsers try to be the best ones by implementing workarounds for stuff that does not have (yet) or does not comply to a spec => this way it can only end up in a mess. A simple example is the HTTP-header "Content-Encoding" ( https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Co... ) which I think should only indicate what kind of compression is being used, but I keep seeing in there stuff like "utf8"/"image/jpeg"/"base64"/"8bit"/"none"/"binary"/etc... and all those pages/files work perfectly in the browsers even if with those values they should actually be rejected... . |
|
I guess what I’m getting at here is that the blame for the C-E weirdness lies in large part on the browsers, which could’ve made a clean break and improved the semantics at the same time by using T-E, but instead chose to initiate a chicken-and-egg dilemma out of a desire to support broken HTTP servers from the last century.
(The intended semantics is that C-E, an “end-to-end” header, says “this resource genuinely exists in this encoded form”, while T-E, a “hop-to-hop” header, says “the origin or proxy server you’re using incidentally chose to encode this resource in this form”; this is why sometimes the wrong combination of hacks in the HTTP server and the Web browser will lead you to downloading a tar file when you expected a tar.gz file.)
The use of “gzip” as the compression is also a wart, because it’s “deflate” (which is what you want: DEFLATE compression with a checksum) with a useless decompressed filename (wat?) + decompressed mtime (double wat?) header stacked on top.