by zepearl 1829 days ago
All extremely useful: the overview, the examples and the comments.

A few months ago while writing a bot/crawler I searched for hours for something like this, but I found only full specs or just bits and pieces scattered around that used different terminology and/or had different opinions.

In the end I didn't even clearly understand what the max total URL length should be (e.g. mixed opinions here https://stackoverflow.com/questions/417142/what-is-the-maxim... - come on, an x-GiB-long URL?) => most of the time 2000 bytes is mentioned, but it's not 100% clear.

Writing a bot made me understand 1) why browsers are so complicated and 2) that the Internet is a mess (e.g. once I even found a page that used multiple character encodings...).

My personal opinion is that everything is too lax. Browsers try to be the best ones by implementing workarounds for stuff that does not (yet) have a spec or does not comply with one => this way it can only end up in a mess. A simple example is the HTTP header "Content-Encoding" ( https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Co... ), which I think should only indicate what kind of compression is being used, but I keep seeing in there stuff like "utf8"/"image/jpeg"/"base64"/"8bit"/"none"/"binary"/etc., and all those pages/files work perfectly in the browsers even though with those values they should actually be rejected.
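A minimal sketch of the kind of strict check a crawler could apply to that header. The function name and the `KNOWN_CODINGS` set are assumptions made for illustration: the set is a hand-picked subset of commonly used content codings, not a copy of the official registry.

```python
# Hypothetical allow-list: a hand-picked subset of content codings,
# not the complete IANA registry.
KNOWN_CODINGS = {"gzip", "x-gzip", "compress", "x-compress", "deflate", "br", "zstd"}


def unknown_content_codings(header_value: str) -> list[str]:
    """Return the tokens in a Content-Encoding value that are not known codings."""
    # The header is a comma-separated list; coding names are case-insensitive.
    tokens = [token.strip().lower() for token in header_value.split(",")]
    return [token for token in tokens if token and token not in KNOWN_CODINGS]


if __name__ == "__main__":
    for value in ("gzip", "utf8", "image/jpeg", "gzip, base64"):
        print(repr(value), "->", unknown_content_codings(value))
```

A strict client would reject any response for which this returns a non-empty list; a lax one (like the browsers described above) would ignore the bogus tokens and carry on.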

3 comments

The use of Content-Encoding for compression is actually something of a historical wart: what was intended to be used for that purpose is Transfer-Encoding, but modern browsers don’t even send the TE header necessary to permit the HTTP server to use it (except for Transfer-Encoding: chunked, which every HTTP/1.1 client must accept), even though some servers are perfectly capable of it and all but the most broken will at least ignore it. Things like 7bit, 8bit, binary, or quoted-printable are not supposed to be in the HTTP Content-Encoding header, either, but their presence is at least somewhat understandable as they are valid in the MIME Content-Transfer-Encoding header, and HTTP originally shared much of its infrastructure with MIME (think Content-Disposition: attachment).

I guess what I’m getting at here is that the blame for the C-E weirdness lies in large part on the browsers, which could’ve made a clean break and improved the semantics at the same time by using T-E, but instead chose to initiate a chicken-and-egg dilemma out of a desire to support broken HTTP servers from the last century.

(The intended semantics is that C-E, an “end-to-end” header, says “this resource genuinely exists in this encoded form”, while T-E, a “hop-by-hop” header, says “the origin or proxy server you’re using incidentally chose to encode this resource in this form”; this is why sometimes the wrong combination of hacks in the HTTP server and the Web browser will lead you to downloading a tar file when you expected a tar.gz file.)

The use of “gzip” as the compression is also a wart, because it’s “deflate” (which is what you want: DEFLATE compression with a checksum) with a useless decompressed filename (wat?) + decompressed mtime (double wat?) header stacked on top.

Even though HTTP DEFLATE saves ~20 bytes compared to GZIP, it itself is a wart because of some vendor misunderstandings. HTTP DEFLATE is actually DEFLATE data wrapped in a zlib container, not raw DEFLATE. See https://en.wikipedia.org/wiki/HTTP_compression#Problems_prev... ; https://stackoverflow.com/questions/3932117/handling-http-co...

I just implemented decompression in my HTTP client this week.

I could not test that part because both servers I tried send raw deflate, without the zlib container.
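
For what it's worth, a minimal sketch of how a client might accept both forms: try the zlib-wrapped form first and fall back to raw DEFLATE if the header check fails. The function name is made up for illustration; the `zlib` calls are from Python's standard library.

```python
import zlib


def inflate_http_deflate(body: bytes) -> bytes:
    """Decode a 'Content-Encoding: deflate' body, tolerating raw DEFLATE."""
    try:
        # Default wbits: expects DEFLATE data inside a zlib container.
        return zlib.decompress(body)
    except zlib.error:
        # Negative wbits: treat the body as a bare DEFLATE stream
        # with no header and no checksum.
        return zlib.decompress(body, -zlib.MAX_WBITS)


if __name__ == "__main__":
    payload = b"hello hello " * 20

    wrapped = zlib.compress(payload)  # zlib container

    raw_compressor = zlib.compressobj(wbits=-zlib.MAX_WBITS)
    raw = raw_compressor.compress(payload) + raw_compressor.flush()  # raw DEFLATE

    print(inflate_http_deflate(wrapped) == payload)
    print(inflate_http_deflate(raw) == payload)
```

Generating both variants locally, as in the `__main__` part, is one way to test the zlib-container path without finding a server that sends it.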

The original filename is optional in gzip. It is not included in the response sent by, for example, Apache.

(There is a mandatory MTIME which is included, and an OS byte, but those only waste 5 bytes total. Far less than gzip will typically save.)
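
A rough sketch for checking this yourself: unpack the fixed 10-byte gzip header and compare the wrapper sizes of the three formats. The helper names are invented for illustration; per the format definitions, gzip adds a 10-byte header plus an 8-byte trailer, and zlib adds a 2-byte header plus a 4-byte checksum.

```python
import gzip
import struct
import zlib


def gzip_header_fields(blob: bytes) -> dict:
    """Unpack the fixed 10-byte header at the start of a gzip member."""
    magic, method, flags, mtime, _xfl, os_id = struct.unpack("<HBBIBB", blob[:10])
    return {
        "magic_ok": magic == 0x8B1F,         # bytes 0x1f 0x8b, read little-endian
        "method": method,                    # 8 means DEFLATE
        "has_filename": bool(flags & 0x08),  # FNAME flag
        "mtime": mtime,                      # 4 bytes, always present
        "os": os_id,                         # 1 byte, always present
    }


def wrapper_overhead(payload: bytes) -> dict:
    """Bytes each container adds on top of the same raw DEFLATE stream."""
    def deflate(wbits: int) -> bytes:
        compressor = zlib.compressobj(wbits=wbits)
        return compressor.compress(payload) + compressor.flush()

    raw_size = len(deflate(-15))  # negative wbits: raw DEFLATE
    return {
        "zlib": len(deflate(15)) - raw_size,  # zlib container
        "gzip": len(deflate(31)) - raw_size,  # 16 + 15: gzip container
    }


if __name__ == "__main__":
    print(gzip_header_fields(gzip.compress(b"hello", mtime=1234)))
    print(wrapper_overhead(b"hello hello " * 20))
```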

The spec is silent on length. 2000 bytes came from some web servers (old IIS comes to mind) that capped the URL at 2K or something close to that. So extra-long URLs were problematic (and a lot of early web apps went nuts with parameters). So, max length is up to the implementer. All I know is that I've had to fix lots of code where someone assumed that 255 characters is all you'll ever need for a URL.

255 characters is the default length for a variable-length string column in some databases. So if a developer did not pay attention, he just used the default, which is in some cases too short for a URL.

> old IIS comes to mind

And MSIE, for which it’s a hard limit, not just a default.

There is no single max total URL length. You probably shouldn't enforce one other than to prevent DoS.
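
A small sketch of what "only to prevent DoS" could look like in a crawler: one generous, configurable cap instead of a hard-coded 255 or 2000. The name `MAX_URL_BYTES` and its 64 KiB value are arbitrary assumptions, not figures from any spec.

```python
# Hypothetical safety cap; the value is an arbitrary choice, not from any spec.
MAX_URL_BYTES = 64 * 1024


def url_within_limit(url: str, max_bytes: int = MAX_URL_BYTES) -> bool:
    """Return True if the UTF-8 encoded URL fits within the configured cap."""
    # Measure bytes rather than characters, since limits are usually byte-based.
    return len(url.encode("utf-8")) <= max_bytes


if __name__ == "__main__":
    longish = "https://example.com/?q=" + "a" * 300
    print(url_within_limit(longish))                 # passes the generous default cap
    print(url_within_limit(longish, max_bytes=255))  # fails a 255-byte assumption
```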