Hacker News
by magnio 23 days ago
Never, ever, ever transform URIs and paths by string manipulation. If you think pulling in a library for this is overkill, it is not.

(Lesson learned from trying to quickly write my own function to make ".." go back one URL segment, which took 3 hours, and discovering that the URI spec contradicts my intuition depending on whether the URI is a URL or a filesystem path.)
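
A minimal sketch of that mismatch, using only Python's standard library. The example.com URL and the paths are made up for illustration. `urllib.parse.urljoin` resolves ".." with the RFC 3986 reference-resolution rules, while `posixpath.normpath` collapses it the way a filesystem path would:

```python
import posixpath
from urllib.parse import urljoin

# Hypothetical base URL, chosen only for illustration.
base = "https://example.com/a/b/c"

# URL rules: the last segment "c" is treated like a file name, so ".."
# is resolved relative to the "directory" /a/b/ and lands on /a/.
url_result = urljoin(base, "..")

# Filesystem rules: "c" is itself a directory entry, so ".." lands on /a/b.
path_result = posixpath.normpath("/a/b/c/..")

# A trailing slash on the base changes the URL answer.
url_result_slash = urljoin(base + "/", "..")

print(url_result)        # https://example.com/a/
print(path_result)       # /a/b
print(url_result_slash)  # https://example.com/a/b/
```

The same three segments give three different answers, which is roughly the surprise described above.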

3 comments

Differentials between different URI parsers are a huge source of bugs. The amount of shenanigans you can do inside URIs is bonkers, and trying to handle this by yourself with some regex and string splitting is absolutely insane.

For example, https://www.example.com:443@203569230:8080/ will send you to the IP address "12.34.56.78" (203569230 is that address written as a single decimal integer) on port 8080, using basic authentication with the domain and port as username and password. If your code tries to split by `:` or check that the URI starts with some specific string, then it won't be good enough. Indeed, use a library that you trust.
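
As a rough illustration, here is how Python's `urllib.parse.urlsplit` takes that URL apart, next to a naive prefix check. Two caveats: Python keeps the host as the string "203569230" and does not turn it into an address itself, so the conversion below is done by hand with `ipaddress`; and whether a given client treats a bare decimal host as IPv4 depends on that client (parsers following the WHATWG URL rules do).

```python
import ipaddress
from urllib.parse import urlsplit

url = "https://www.example.com:443@203569230:8080/"

# A naive prefix check is fooled: the string really does start this way.
naive_ok = url.startswith("https://www.example.com")

parts = urlsplit(url)

# Everything before "@" is userinfo, not the host.
username = parts.username   # "www.example.com"
password = parts.password   # "443"

# The real host and port come after the "@".
host = parts.hostname       # "203569230"
port = parts.port           # 8080

# Interpret the decimal host as a 32-bit IPv4 address, by hand.
ip = ipaddress.ip_address(int(host))

print(naive_ok, username, password, host, port, ip)
```

The prefix check passes even though the parsed host has nothing to do with www.example.com.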

I don't believe Python's urllib has a function that takes what HTTP terms an "origin-form" (an absolute path with possibly a query attached to it with "?") and parses it apart.

Still, RFC 9112, which defines HTTP/1.1 basics, requires that, for the purposes of URI reconstruction, "if there is no Host header field or if its field value is empty or invalid, the target URI's authority component is empty."

Yep, none of them are suitable for this use case; you need to validate the Host header and reconstruct the URI before parsing it.
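
A sketch of that order of operations, under stated assumptions: `reconstruct_target_uri` and `_HOST_RE` are hypothetical names, the host pattern is deliberately strict (no IPv6 literals), and the scheme is passed in by the caller. It covers only the origin-form case and the empty-authority fallback quoted above, not the full RFC 9112 reconstruction rules.

```python
import re
from urllib.parse import urlsplit

# Assumption: a deliberately strict host[:port] pattern for illustration.
# It rejects userinfo, IPv6 literals and anything else unusual.
_HOST_RE = re.compile(r"[A-Za-z0-9.-]+(?::[0-9]{1,5})?")


def reconstruct_target_uri(request_target, host_header, scheme="http"):
    """Hypothetical helper: build an absolute URI from an origin-form target."""
    if not request_target.startswith("/"):
        raise ValueError("expected origin-form (must start with '/')")
    # Missing, empty or invalid Host: the authority component is empty.
    valid = bool(host_header) and _HOST_RE.fullmatch(host_header) is not None
    authority = host_header if valid else ""
    return f"{scheme}://{authority}{request_target}"


# Parsed on its own, this origin-form target is mistaken for a
# network-path reference and "evil.example" becomes the host.
target = "//evil.example/x?y=1"
direct = urlsplit(target)

# Reconstructed first, the authority comes from the validated Host header.
rebuilt = urlsplit(reconstruct_target_uri(target, "example.com:8080"))

# An invalid Host header (here it smuggles userinfo) yields an empty authority.
fallback = urlsplit(reconstruct_target_uri("/index.html", "a@b"))

print(direct.netloc, rebuilt.netloc, rebuilt.path, repr(fallback.netloc))
```

Feeding the raw request target straight to `urlsplit` is the failure mode here: a path that begins with "//" gets read as an authority.
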
You kind of have to; it's not turtles all the way down. At some point the network is sending strings, my man.

You just have to not make mistakes. There's no silver bullet or instant cop-out like "this would never happen to me because I don't do one of the things in this multi-sub-system vuln".