Hacker News
by ubernostrum 3567 days ago
> guarantee that the input is valid with a parser generator

OK, that works really well... until you learn how much non-RFC-specified behavior is built into web browsers. Simply building a parser to the RFC will leave you wide open to all sorts of nastiness!

The is_safe_url() internal function in Django is a bit of a historical dive into things we've learned about how browsers interpret (or, arguably, misinterpret) various types of oddball URLs:

https://github.com/django/django/blob/master/django/utils/ht...
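
To make the mismatch concrete, here is a minimal sketch of the kind of gap being described. It is not Django's code: `naive_is_local` and `stricter_is_local` are hypothetical helpers, and the two quirks handled (backslashes treated like slashes, extra leading slashes read as the start of a host) are, as far as I know, how mainstream browsers behave for http(s) URLs. Real-world checks need to cover far more cases than this.

```python
from urllib.parse import urlsplit


def naive_is_local(url: str) -> bool:
    """Hypothetical naive check: local if the parser sees no scheme and no host."""
    parts = urlsplit(url)
    return not parts.scheme and not parts.netloc


def stricter_is_local(url: str) -> bool:
    """Hypothetical stricter check that first mimics two browser quirks."""
    # Assumption: browsers treat "\" like "/" in http(s) URLs, so normalise it.
    candidate = url.strip().replace("\\", "/")
    # Assumption: two or more leading slashes are read as the start of a host.
    if candidate.startswith("//"):
        return False
    return naive_is_local(candidate)


# Print what each check decides for a local path and two oddball targets.
for target in ["/accounts/profile/", "///example.com", "/\\example.com"]:
    print(repr(target), naive_is_local(target), stricter_is_local(target))
```

The naive check accepts the last two targets because `urlsplit` finds no host in them, while a browser following the redirect would likely end up on example.com.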

3 comments

> non-RFC-specified behavior

I never said anything about limiting the parser to what's defined in an RFC. The acceptable input to "quirks mode" is just another (non-RFC) grammar, which still needs to be defined and validated.
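
As a rough illustration of what "define the grammar, then validate against it" could look like, here is a minimal sketch. The grammar itself is an assumption made up for the example (a single leading slash, then segments of unreserved characters separated by single slashes), and `matches_grammar` is a hypothetical name. It is an allow-list for local redirect targets, not a description of what any browser accepts.

```python
import re

# Hypothetical grammar for a local redirect target:
#   target  = "/" [ segment *( "/" segment ) [ "/" ] ]
#   segment = 1*( ALPHA / DIGIT / "." / "_" / "~" / "-" )
_SEGMENT = r"[A-Za-z0-9._~-]+"
_LOCAL_TARGET = re.compile(rf"/(?:{_SEGMENT}(?:/{_SEGMENT})*/?)?")


def matches_grammar(target: str) -> bool:
    """Accept only strings that match the whole grammar; reject everything else."""
    return _LOCAL_TARGET.fullmatch(target) is not None


# Print the verdict for one valid target and three that fall outside the grammar.
for target in ["/accounts/profile/", "//example.com", "/\\example.com", "/a//b"]:
    print(repr(target), matches_grammar(target))
```

The point of the design is that anything outside the grammar is rejected by default, so an oddball input has to be explicitly allowed rather than explicitly caught.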

Then I do wish you luck, but I don't think you'll ever be able to produce a suitably complete grammar since parts of it will require knowledge of undocumented proprietary internals of Internet Explorer.

Hence we scrape along doing our best with what we can figure out from observing behavior and collecting bug reports. But even with that, is_safe_url() is one of the most prone-to-security-issues functions in Django's codebase.

Hopefully the URL spec (https://url.spec.whatwg.org) is helpful here in finding other potentially unsafe behaviours that browsers have. That said, much of is_safe_url() seems to deal with the fact that urllib.urlparse doesn't match what browsers do in many, many ways, so the spec is probably of limited help. (Nobody really implements it yet; it's just an attempt at standardising the rough intersection of what browsers currently do. Eventually, however, it should suffice, once legacy browsers die.)

That URL spec is just "this is what chrome does, everyone repeat that".

They’re unwilling to modify anything, or standardize anything, but just want to cement the current piece of shit that URL parsing is for the future.

WHATWG standards are generally formed by starting from what the 4 major browsers (Chrome, Firefox, IE (Edge), Safari) do. Anything that all of them do in common gets specified without any problem. It's when they all differ that the editors try to come up with more reasoned algorithms.

> WHATWG standards are generally formed by starting from what the 4 major browsers (Chrome, Firefox, IE (Edge), Safari) do.

I thought WHATWG standards are formed by starting with what the four major browser vendors agree to do, not what they currently do (though usually at least one has an implementation before something gets proposed for standardization.)

Which is not really ideal.

Standards aren’t about documenting what is, but about defining what will be.

Given Chrome Canary currently fails a large number of tests, it seems like it's hardly just "this is what chrome does, everyone repeat that".

Because the standard was changed to clean up a bit of the stuff Google did.

But WHATWG only changes standards to include more, never to include less.

I do wonder: is there any browser that actually follows the RFCs fully? I checked a few (the mainstream desktop ones, but also links2 etc.), but so far they all seem to have glue to fix historical behavior.

Pretty much no, because it'd be practically useless. And I don't think anyone has the willingness to spend time or money on something that will essentially just be a toy.

There's been plenty of work on moving the standards so that there are actually implementations of them, instead of them being practically useless at best and misleading at worst (doing input validation based on a spec that nobody actually implements is just outright dangerous). HTML 5 and much of CSS 2.1 are leading that charge, though CSS 2.1 still has massive black holes: notably, table layout remains largely undefined, though that is finally being worked on.