Hacker News
by austincheney 2300 days ago
That is an excellent article and I learned a tremendous amount.

I do have one minor technical criticism though. It is so common for people to conflate parameters with the components of a query string that we don't give it a second thought. The specification, though, does distinguish these terms. See: https://tools.ietf.org/html/rfc3986#section-3.4 and the preceding paragraph.

Specifically, parameters are trailing data components of the path section of the URI (URL). The query string is separated from the path section by the question mark. URI parameters are rarely used, though, so this is a common mistake.
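
As a rough illustration of the distinction, here is a minimal Python sketch using the standard library's urlparse, which reports path parameters and the query string as separate fields. The ";type=a" parameter is a made-up example, and urlparse only splits parameters off the last path segment, so treat this as one library's reading rather than a full statement of the RFC:

```python
from urllib.parse import urlparse

# ";type=a" is a hypothetical path parameter added for illustration.
parts = urlparse("http://domain.com/path;type=a?name=data&tag=c%26t")

print(parts.path)    # /path
print(parts.params)  # type=a  (trailing data on the path, before the "?")
print(parts.query)   # name=data&tag=c%26t  (everything after the "?")
```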

Also encoding ampersands into a URI (URL) using HTML encoding schemes is also common, but that is incorrect. URI encoding uses percent-encoding as its only encoding scheme, such as %20 for a space. Using something like &amp;amp; will literally put 5 unencoded characters in the address, or may result in something like %26amp; in software that auto-converts characters into the presumed encoding.

* https://tools.ietf.org/html/rfc3986#section-2.1

* https://stackoverflow.com/questions/16622504/escaping-ampers...
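
A small Python sketch of the difference, assuming the value being encoded is "c&t" (a made-up value). It percent-encodes the raw value, then shows what happens when an HTML entity has leaked into the value first; other software may encode the stray entity differently than quote does here:

```python
from urllib.parse import quote, unquote

# Percent-encoding the raw value: "&" becomes %26.
encoded = quote("c&t", safe="")
print(encoded)  # c%26t

# If the value was HTML-encoded first, the entity's characters get
# percent-encoded as ordinary data and the entity never goes away.
mistaken = quote("c&amp;t", safe="")
print(mistaken)           # c%26amp%3Bt
print(unquote(mistaken))  # c&amp;t
```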

3 comments

It's important to use the jargon precisely, as you did, otherwise you end up with gibberish that nobody understands, like Python's "get_selector" function in urllib. Nobody knows what the heck a selector is, and the word does not even appear in RFC 3986.
I believe the discussion of encoding ampersands relates to printing them in the text of the page, where you would indeed want to use HTML entity encoding.
>Also encoding ampersands into a URI (URL) using HTML encoding schemes is also common, but that is incorrect.

To encode any string (for example a URL) containing & in HTML, you must HTML-encode that &. Using &amp;amp; in the value of the href attribute for an a-tag must result in a URL containing just & in place of the entire entity. This is a property of HTML that has nothing to do with URLs or URL encodings.

So let's say you have a raw address with an ampersand that needs to be encoded (the second one) so as not to confuse a URI parser into thinking there are 3 query string data segments when there are only 2, as the second ampersand is part of a value and not a separator:

    http://domain.com/?name=data&tag=c&t
You will need to encode that ampersand so that it is interpreted as something other than syntax:

    http://domain.com/?name=data&tag=c%26t
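
To see the two readings side by side, here is a minimal sketch using Python's parse_qsl on just the query portions of the two addresses above. keep_blank_values=True is set only so that the stray third segment is visible instead of being silently dropped:

```python
from urllib.parse import parse_qsl

raw = "name=data&tag=c&t"        # second "&" left unencoded
encoded = "name=data&tag=c%26t"  # second "&" percent-encoded

raw_pairs = parse_qsl(raw, keep_blank_values=True)
encoded_pairs = parse_qsl(encoded, keep_blank_values=True)

print(raw_pairs)      # [('name', 'data'), ('tag', 'c'), ('t', '')]
print(encoded_pairs)  # [('name', 'data'), ('tag', 'c&t')]
```
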
Now the first ampersand is not encoded but the second one is. You are correct that ampersands are also syntax characters in HTML/XML so if you wanted to place that address in HTML code it would need to be escaped in HTML:

    http://domain.com/?name=data&amp;amp;tag=c%26t
That address can now be inserted as the value of an HTML anchor tag as such:

    <a href="http://domain.com/?name=data&amp;tag=c%26t">somewhere</a>
The important point to distinguish is that addresses are often used in contexts outside of HTML, even in the browser. For example the address bar at the top of the browser is outside the context of the view port that displays HTML content, and so the appropriate text there is:

    http://domain.com/?name=data&tag=c%26t
This is because a URI has only one encoding scheme, which is percent-encoding.
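
The layering can be sketched in Python with only the standard library: percent-encode when building the address, HTML-escape only when placing it in markup, and undo the layers in reverse order. The variable names are illustrative, and this is one way to arrange it, not the only one:

```python
import html
from urllib.parse import parse_qs, urlencode, urlsplit

# Layer 1: URI encoding. urlencode percent-encodes the "&" inside the value.
url = "http://domain.com/?" + urlencode({"name": "data", "tag": "c&t"})
print(url)  # http://domain.com/?name=data&tag=c%26t

# Layer 2: HTML escaping, applied only when the address goes into markup.
href = html.escape(url, quote=True)
print(f'<a href="{href}">somewhere</a>')

# An HTML parser undoes layer 2 first, then a URI parser undoes layer 1.
recovered = html.unescape(href)
values = parse_qs(urlsplit(recovered).query)
print(values)  # {'name': ['data'], 'tag': ['c&t']}
```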