by BoppreH 1076 days ago
Aside from Memory Management, there's another general category that always comes up in these lists, but is not talked about much: in-band signaling (i.e., "Strings are Evil"):

- Improper Neutralization of Input During Web Page Generation ('Cross-site Scripting') (#2)

- Improper Neutralization of Special Elements used in an SQL Command ('SQL Injection') (#3)

- Improper Neutralization of Special Elements used in an OS Command ('OS Command Injection') (#4)

- Improper Limitation of a Pathname to a Restricted Directory ('Path Traversal') (#8)

- Improper Neutralization of Special Elements used in a Command ('Command Injection') (#16)

- Improper Control of Generation of Code ('Code Injection') (#23)

All of these came from trying to avoid structured data, and instead using strings with "special characters". It's crazy how many times this mistake has been repeated: file paths, URLs, log files, CSV, HTML, HTTP (cookies, headers, query strings), domain names, SQL, shell commands, shell pipelines... One unescaped character, from anywhere in the stack, and it all blows up.
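To make the "one unescaped character" failure concrete, here is a minimal sketch using Python's standard `sqlite3` module. The table, rows, and attacker input are made up for illustration. The first query splices data into the command string (in-band); the second passes it as a separate parameter, so the data never gets parsed as SQL.

```python
import sqlite3

# Hypothetical table and rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, is_admin INTEGER)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [("alice", 0), ("root", 1)])

# Attacker-controlled "name": the quote ends the string literal early.
user_input = "nobody' OR is_admin = 1 --"

# In-band: data and command share one string, so the quote changes the query.
unsafe_sql = f"SELECT name FROM users WHERE name = '{user_input}'"
unsafe_rows = conn.execute(unsafe_sql).fetchall()

# Out-of-band: the value travels next to the query, not inside it.
safe_rows = conn.execute(
    "SELECT name FROM users WHERE name = ?", (user_input,)
).fetchall()

print(unsafe_rows)  # the admin row leaks
print(safe_rows)    # no user has that literal name
```

The spliced query returns the admin row, while the parameterized one returns nothing, because no user is literally named `nobody' OR is_admin = 1 --`.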

One could say "at least it's human-readable", but that's not reliable either. Take file names, for example. Two visually identical file names may map to different files (because confusables[1] or surrounding spaces), or two different names map to the same file (because normalization[2]), or the ".jpg" at the end may not actually be the extension (because right-to-left override[3]).
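A rough sketch of those three file name cases, using only Python's standard library. The file names themselves are invented examples, and whether two names really resolve to the same file depends on the file system, which this snippet does not touch.

```python
import os.path
import unicodedata

# 1. Confusables: Latin "a" (U+0061) vs Cyrillic "а" (U+0430) usually render alike.
latin_name = "paypal.txt"
cyrillic_name = "p\u0430ypal.txt"
looks_same_but_differs = latin_name != cyrillic_name

# 2. Normalization: precomposed "é" vs "e" + combining acute accent.
composed = "caf\u00e9.txt"
decomposed = "cafe\u0301.txt"
differ_as_strings = composed != decomposed
same_after_nfc = unicodedata.normalize("NFC", decomposed) == composed

# 3. Right-to-left override (U+202E): a bidi-aware renderer shows the tail
#    reversed, so this may display as "photoexe.jpg".
rlo_name = "photo\u202egpj.exe"
real_extension = os.path.splitext(rlo_name)[1]

print(looks_same_but_differs, differ_as_strings, same_after_nfc, real_extension)
```

In each case the bytes and the rendered text tell different stories, which is the sense in which "human-readable" is not reliable here.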

So the computer interpretation of a string might be wrong because a special character sneaked in. And even if everyone was perfectly careful, the human interpretation might still be wrong. For the sake of the next generations, I hope we leave strings for human text and nothing more.

[1] https://unicode.org/cldr/utility/confusables.jsp

[2] https://developer.apple.com/library/archive/qa/qa1173/_index...

[3] https://krebsonsecurity.com/2011/09/right-to-left-override-a...

6 comments

Out of this frustration I've built: https://github.com/Endava/cats. It's for APIs, but it mostly addresses exactly this case: don't use strings for everything. If you do choose to use them, make sure you add patterns for checking that things are valid, and make sure you think about all the corner cases and all the weird characters that can break your app.

And it's even worse when everything is a map, rather than specific object schemas.

What's the alternative though? For URLs for example, would you have to put a JSON structure into the browser? That's obviously not going to happen.

Sure, most of these decisions are too entrenched to be fixed.

But yes, URLs should have been structured. We already see paths rendered with breadcrumbs, the protocol replaced with an icon, `www` auto-inserted and hidden, and the domain highlighted. If that's not a structure, I don't know what is.

By cramming everything into the same string, we open ourselves to phishing attacks by domains like `www.google.com.evil.com`, malicious traversal, 404s from mangled relative paths, and much more.
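As a sketch of what the structured view buys you: once the host is a list of labels and the path is a list of segments, both problems become easier to see and to reject. This uses Python's `urllib.parse` and `posixpath`; `join_under` is a hypothetical helper written for this example, not a standard function.

```python
import posixpath
from urllib.parse import urlsplit

# Host as a list of labels, read from the root: "evil" is visibly in charge.
parts = urlsplit("https://www.google.com.evil.com/login")
labels_from_root = parts.hostname.split(".")[::-1]
print(labels_from_root)  # ['com', 'evil', 'com', 'google', 'www']

# String-joined path: the ".." segments escape the intended directory.
string_joined = posixpath.normpath("/static/" + "../../etc/passwd")
print(string_joined)  # /etc/passwd


def join_under(root_segments, user_segments):
    """Hypothetical helper: append user segments, refusing special ones."""
    for segment in user_segments:
        if segment in ("", ".", "..") or "/" in segment:
            raise ValueError(f"illegal path segment: {segment!r}")
    return list(root_segments) + list(user_segments)


ok = join_under(["static"], ["img", "logo.png"])
try:
    join_under(["static"], ["..", "..", "etc", "passwd"])
    traversal_rejected = False
except ValueError:
    traversal_rejected = True
```

Real code would also need a public suffix list to decide which labels form the registrable domain; that part is left out of this sketch.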

URLs are structured. But when you need to send them across the network or store them on disk or even just send them between different processes on the same machine you need to define what the byte level representation is.

I don't see how you can get away from having a defined serialisation format. People try to operate directly on the serialised data using ad-hoc implementations and run into trouble.

But I'm not sure exactly what you mean by "should have been structured". Eventually you've gotta define the bytes if you want to interoperate with other software.

> I don't see how you can get away from having a defined serialisation format.

Yep, that's exactly it. Your TLS certificate is not sent as string, and neither are your TCP packets, nor the images contained in them. Your URLs shouldn't be either, but it's probably too late for that.

> People try to operate directly on the serialised data using ad-hoc implementations and run into trouble.

That's a whole lot better than the current footgun we have, where

    http://http://http://@http://http://?http://#http://

is a valid URL. People don't operate directly on string URLs without trouble either, so at least the structured data is not inviting incorrect usage.
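For the curious, this is how Python's standard `urllib.parse.urlsplit` happens to carve up that string. It is only one parser's reading; other URL parsers may split the same bytes differently, which is part of the problem.

```python
from urllib.parse import urlsplit

weird = "http://http://http://@http://http://?http://#http://"
parts = urlsplit(weird)

# Each component below is what this particular parser decided, nothing more.
print("scheme:  ", parts.scheme)
print("netloc:  ", parts.netloc)
print("path:    ", parts.path)
print("query:   ", parts.query)
print("fragment:", parts.fragment)
```

Here the authority ends at the first `/`, so the netloc is just `http:` and everything up to the `?` lands in the path.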
> > I don't see how you can get away from having a defined serialisation format.

> Yep, that's exactly it. Your TLS certificate is not sent as string, and neither are your TCP packets, nor the images contained in them.

...all of those things mentioned have defined serialization. i expect all of them have had security issues because of problems with deserialization code.

Yes, of course. Everything that is stored or transmitted must have a defined serialization. And any piece of code as widely used as this is going to have security issues.

What is your point? That strings don't need defined formats? That they have fewer security issues?

Your certificate isn't entered by hand, though?

That is, it is easy to see that the reason we have URLs sent as strings is that we collect them from the user. And it makes perfect sense that we would collect strings of characters from users.

How many URLs, as a percent of all browser navigation, do you think are typed by hand? And I don't mean "news.ycombinator.com", I mean the full URL, like "https://news.ycombinator.com/news".

And in those rare cases, of course you can collect strings from the user. But then they have to be parsed, and that's what should be on the wire. IP addresses are also sometimes entered by hand, but we don't send those strings in TCP packets.
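The IP address case is easy to sketch with Python's standard `ipaddress` module: the typed string is parsed once at the edge, and what travels afterwards is four bytes with no special characters to escape. The address and the malformed input are arbitrary examples.

```python
import ipaddress

typed = "192.168.0.1"                  # what the human entered
addr = ipaddress.ip_address(typed)     # parsed and validated once
wire_bytes = addr.packed               # 4 raw bytes, as carried in an IPv4 header

# The structured form can always be rendered back for display.
round_trip = str(ipaddress.ip_address(wire_bytes))

# Anything that is not an address is rejected at the boundary.
try:
    ipaddress.ip_address("192.168.0.1; rm -rf /")
    junk_rejected = False
except ValueError:
    junk_rejected = True

print(wire_bytes, round_trip, junk_rejected)
```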

Humans think in strings so it's not surprising we carry this thinking to code where it blows up in our face.

Some humans think in strings. I don't, generally I think in pictures.

No, IMHO escaping is an elegantly simple concept; it's just that for some reason (like basic arithmetic) people don't seem to be taught enough about it to understand.

> Two visually identical file names may map to different files (because confusables[1]), or two different names map to the same file (because normalization[2]), or the ".jpg" at the end may not actually be the extension (because right-to-left override[3]),

Those are all because of Unicode, which is an even worse idea in general.

Escaping is a cute solution, but it doesn't belong in infrastructure.

> it's just that for some reason (like basic arithmetic) people don't seem to be taught enough about it to understand.

That's the same argument used to defend manual memory management. But education is not enough. Escaping is something you have to remember to do every time*, or it'll blow up spectacularly. Even knowledgeable professionals mess it up, or it wouldn't occupy 6 of the 25 spots in this list.

> Those are all because of Unicode, which is an even worse idea in general.

What's the alternative? Japanese speakers writing file names in ASCII? Unicode is a modern marvel, it's our fault we use it where it doesn't belong.

* Not necessarily every input/output, but at least every system that interacts with it.
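One way to stop relying on memory is to move the escaping into a type, so that plain strings cannot reach the output without passing through it. A minimal sketch in Python follows; `Html`, `text`, and `tag` are names invented for this example, and a real templating library would do far more.

```python
from dataclasses import dataclass
from html import escape


@dataclass(frozen=True)
class Html:
    """Markup that is already safe to emit (hypothetical wrapper type)."""
    markup: str


def text(raw: str) -> Html:
    # The only path from an untrusted str to Html goes through escape().
    return Html(escape(raw))


def tag(name: str, *children: Html) -> Html:
    # Refuse bare strings, so forgetting to escape is an error, not a hole.
    for child in children:
        if not isinstance(child, Html):
            raise TypeError("children must be Html; wrap strings with text()")
    inner = "".join(child.markup for child in children)
    return Html(f"<{name}>{inner}</{name}>")


user_input = "<script>alert(1)</script>"
page = tag("p", text(user_input))
print(page.markup)

try:
    tag("p", user_input)  # forgot to escape
    forgot_caught = False
except TypeError:
    forgot_caught = True
```

The point of the design is that the mistake becomes a loud failure at the call site rather than something a reviewer has to spot every time.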

You are going to be sorely disappointed with LLMs. :(

We make it look like it is a request response with a chat bot, but it is more realistic to say we are making a single document and having the model fill out the rest. That is, there is no out of band. There is only the document.