Hacker News
by drdaeman 601 days ago
It’s astonishing that handling and/or storing strings correctly is so hard that people actually suggest it’s somehow better to “just” stop such strings at the administrative level.

I find it harmful to assume that externally sourced data will match any particular format (e.g. contain only allowed characters), even if it’s really supposed to. (The inverse holds for outputs: one has to conform as strictly as one can.) Ignoring this leads to mental dismissal of validation and correct handling, and that’s how things start to crack at the seams. I have seen too many examples of “this can never be… oops”.

Add: The best one can safely assume when handling a string is that it’ll be composed of zero or more octets (because that’s what the OS/language would typically guarantee). Languages and frameworks usually provide a lot of tooling to ensure things are what they are expected to be. Ignoring the failure modes (even less probable ones, like a different Unicode collation than is conventional on a certain system) makes one sloppy, not practical.

3 comments

And assuming all your consumers are not sloppy is impractical.

We sanitise input all the time. This is not particularly unique. There isn't a great loss in this restriction of company names.

>We sanitise input all the time.

No we don't.

Companies like the aforementioned were made illegal because nobody sanitizes input.

SQL injection and other forms of malformed data entry are still among the most common attack vectors in the year 2024.

Isn't making it illegal a way of sanitizing it though?
Will making (non-)computer viruses illegal sanitize the world of them?
Bad analogy. In the company name case, there’s a registry (list) with a gatekeeper (filter) in front of it rejecting very simple inputs (small strings) that don’t conform to their standards. You literally can’t get your company name on this list if you don’t pass muster. One might even say the list is “sanitized”.
No
You probably want to say "correctly handle arbitrary input" rather than "sanitize" inputs.

If everybody sanitizes their inputs (in undefined ways) then companies like the one mentioned would be randomly blocked from administrative processes.

This is not what we (as a society) want.

If Bobby Tables isn't a valid name the legislation should make it invalid, instead of rubber-stamping it at the government registry and letting poor Bobby get random errors when making requests to various public bodies. ("Sorry, our school does not admit persons with semicolons in their names.")

Sanitising inputs would mean Bobby Tables would be able to use their name just fine.
> It’s astonishing that handling and/or storing strings correctly is so hard

Is it astonishing? "Don't sanitize your own strings; always use a library" is common advice for handling SQL and HTML, which implies to me that it is in fact pretty hard to do correctly.

Anything is hard, if the bar is low enough. Basic language transformations with a regular grammar (like escaping a string for use in an HTML document) are, IMHO, not particularly hard. The hardest part is actually recognizing what the language of your output is, and whether it mismatches the language of your string value.
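
To make the "language of your output" point concrete, here is a minimal sketch of escaping a value for HTML text or a double-quoted attribute. `escapeHtml` is a hypothetical helper and the company name is a made-up stand-in, not the real registry entry. The sketch assumes those two output contexts only; a value dropped into a script block, a URL, or a CSS rule is a different output language with different rules.

```javascript
// Hypothetical helper: escape a value for HTML text content or a quoted attribute value.
function escapeHtml(value) {
  const replacements = {
    '&': '&amp;',
    '<': '&lt;',
    '>': '&gt;',
    '"': '&quot;',
    "'": '&#39;',
  };
  return String(value).replace(/[&<>"']/g, ch => replacements[ch]);
}

// Made-up company name, for illustration only.
const companyName = '"><SCRIPT SRC=HTTPS://EXAMPLE.COM/X.JS></SCRIPT> LTD';

// The same value is escaped once per place where it enters the HTML output.
const html = `<td title="${escapeHtml(companyName)}">${escapeHtml(companyName)}</td>`;
console.log(html);
```

The stored name is never altered; only its representation in the HTML output changes.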

What's astonishing is the popularity of the view that producing the cheapest possible code that still works along the happy path (and simply doesn't fail too badly when it does fail) is not only a valid practice but even some business virtue that needs to be protected.

The more I think about it, the more I like the idea of EICAR-like records, like this SCRIPT one, in the official database. Such a record must be fully benign, of course (in the sense that the script source should point to the same agency and contain only a warning but no harmful code), and it must be well-known: effectively a test case for production systems. Rather than a pinky-swear "company names should be okay, don't worry" that allows neglect, it's a friendly "hey, this is a special weird case, there specifically to make sure you're doing things right" piece of guidance.

The fact that so many people were impacted by left-pad leads me to believe that people aren't using libraries because a problem is pretty hard, but rather because they don't even want to think about the problem that a library supposedly addresses. It can also often be a way to hand off responsibility, IMO.
I'm genuinely curious - where does this end? I once was curious about whether I should sanitize dynamodb inputs, and was surprised to see zero guidance for or against.

How about things like parsing strings for serializing to binary storage?

Can everything be an injection attack?

I think it's safe to put arbitrary data in DynamoDB (just use the proper API instead of concatenating it directly into a command string). It's the systems interacting with it you have to be careful about. In general, there is no silver bullet beyond "understand your system's capabilities and limitations". Formal verification also comes to mind.

> Can everything be an injection attack?

What does this question even mean? I guess we must say "for any system accepting arbitrary input: yes". Not even sure if the "arbitrary" qualifier is necessary.

> where does this end?

It never does, because abstractly speaking, there is no such thing as a secure computing system. This goes double for any computer that is switched on.

Practically speaking, it depends on how critical your application might be. If you're storing values for neurosurgery or automated dispersal of life-saving (or potentially life-ending) medication, you'd better be sanitizing on the way in, validating on the way out, and have some additional layers like audits and comparisons to known good values at rest. Look into defense in depth, and never trust the computer to make a decision, because the computer cannot be held accountable.

If you're storing quiz results for someone's favourite colour, or it's not internet connected, you can probably be a bit less paranoid about it.

> Can everything be an injection attack?

But yeah, anything and everything could be an injection attack if the attacker is determined enough. It's just a matter of how difficult you want to make it for them.

That advice is 90% because developers are lazy. Like we'll write

    const csv = rows.map(cols => cols.join(','))
                    .join('\n')
because we are too lazy to write the more correct,

    const esc = cell => `"${String(cell).replace(/"/g, '""')}"`
    const csv = rows.map(cols => cols.map(esc).join(','))
                    .join('\n')
(And perhaps something slightly more compact but slower that only quotes each cell when it needs to be escaped.)
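
One way the quote-only-when-needed variant might look, as a sketch. It assumes the usual RFC 4180 rule that a cell needs quoting when it contains a comma, a double quote, CR, or LF. `escCell` and `toCsv` are names made up for this example, and it does nothing about spreadsheet formula injection.

```javascript
// Quote a cell only when it contains a comma, a double quote, CR, or LF.
// No "g" flag, so test() keeps no state between calls.
const needsQuoting = /[",\r\n]/;

const escCell = cell => {
  const s = String(cell);
  return needsQuoting.test(s) ? `"${s.replace(/"/g, '""')}"` : s;
};

const toCsv = rows => rows.map(cols => cols.map(escCell).join(',')).join('\n');

const rows = [
  ['id', 'name'],
  [1, 'Plain Ltd'],
  [2, 'W""oopWoop; Ltd'],
  [3, 'Comma, Inc'],
];
console.log(toCsv(rows));
```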

I caught myself doing it the other day, Go has a JSON library and here I was too lazy to define a struct,

    w.WriteHeader(500)
    fmt.Fprintf(w, `{"error": %q}`, err.Error())
Is %q a JSON-compatible format? I have no idea without reading some source code! Almost certainly it won't \u-encode weird characters. That might be OK, I think the only stuff you really have to escape in JSON strings is newlines, backslashes, and double quotes? And %q probably handles those. Maybe it breaks on ASCII control characters...
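
For what it's worth, JSON (RFC 8259) requires escaping the double quote, the backslash, and every control character from U+0000 to U+001F, so newlines are not the only control characters that matter. As far as I can tell, %q emits Go-syntax escapes such as `\x01` or `\a` for some of those, which a JSON parser would reject. A sketch of the safer habit, in JavaScript rather than Go to match the CSV snippet above: let the serializer build the whole payload. `errorBody` is a hypothetical helper.

```javascript
// Hypothetical helper: build the error payload with the JSON serializer
// instead of formatting the string by hand.
const errorBody = err => JSON.stringify({ error: String(err.message) });

const message = 'bad "input"\n\u0001 at C:\\tmp';
const body = errorBody(new Error(message));
console.log(body);

// Parsing the payload recovers the original message unchanged.
console.log(JSON.parse(body).error === message);
```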

But yeah, we are meant to always use a library because we have deadlines and we are willing to compromise a whole lot of quality to deliver on them.

Both cases are the result of library/runtime/env designers not thinking about the crowd. If csv.esc(s) and json(x) were available right away, without imports even, you wouldn’t have to decide whether it’s fine. Fmt should just have %j.

Specifically json and unjson I make globally available in all my projects. If I used csv more often than once in a decade, I’d have csvesc(s) too.

Sometimes you read some stdlib reference and wonder what they were thinking with things like System.out.println and without one-line one-arg readtext(), tojson(), fetch() and so on. It’s like a kitchen with all appliances still in boxes and all utensils in a tight vacuum cover. Everything is there, but preparation friction makes it absolutely unusable.

I don't think the problem we are talking about is lazy programmers or the availability of libraries.

People think hard things should be easy and with less "friction". If I want to output a string why should I have to know what the difference between stdout and stderr is? If I write CSV to a file why do I need to know the difference between CRLF and LF, and UTF-8 and UTF-16 or what a BOM is? At the end of all of this you end up with a company named 'W""oopWoop;' crashing the banking industry.

So no, you should know all of that, and more or get the fuck out of my industry.

For me it is. I feel the friction and how it disrupts the parallel flow of multiple lines of thought on the code, cause you have to stop and implement a stupid method. Also have seen this many times in less experienced or less patient programmers, who inlined lots of code that should have been a library and cut corners in there due to time, mental and other pressures. Providing them a set of tools they could paste (poor platform) into a globally loaded module improved their jobs a lot.

I think the high horse here is a bad point cause it simply claims it must be hard for no good reason. It’s not even complexity-wise hard, you just have to (metaphorically) unpack your instruments every time you use them. That’s bs at all experience levels and it must be obvious to anyone who works in a shop. Ime, the problem isn’t knowledge, but inconvenience.

It's not hard to do correctly. If you employ people to write SQL who can't tell the difference between string concatenation and parameterised queries, then your bar is too low. This can be learned in under an hour[0], and is the most fundamental thing to bear in mind when writing a query.

[0] https://cheatsheetseries.owasp.org/cheatsheets/SQL_Injection...

> is common advice for handling SQL

Are we still passing SQL statements and data to the SQL back end as a single string instead of passing them separately? Why would you even need to escape SQL data in 2024?

One example that I found is that some libraries/databases don't allow DDL statements to be parameterised, so if you are managing tables and columns from code and those names come from end users, then you should be checking them.
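
A minimal sketch of that kind of check, assuming an allow-list rule for identifiers (lowercase letter first, then lowercase letters, digits, or underscores, at most 63 characters). The rule, `checkedIdentifier`, and `addColumnSql` are all made up for illustration; the real limits depend on the database in use.

```javascript
// Hypothetical allow-list rule for table and column names.
const IDENTIFIER = /^[a-z][a-z0-9_]{0,62}$/;

function checkedIdentifier(name) {
  if (typeof name !== 'string' || !IDENTIFIER.test(name)) {
    throw new Error(`rejected identifier: ${JSON.stringify(name)}`);
  }
  return name;
}

function addColumnSql(table, column) {
  // Identifiers are validated first, then double-quoted.
  // The column type is fixed here rather than taken from the user.
  return `ALTER TABLE "${checkedIdentifier(table)}" ADD COLUMN "${checkedIdentifier(column)}" TEXT`;
}

console.log(addColumnSql('customers', 'nickname'));

try {
  addColumnSql('customers', 'x"; DROP TABLE students; --');
} catch (e) {
  console.log(e.message);
}
```

Rejecting outright instead of trying to escape keeps the rule easy to reason about; ordinary data values still go through query parameters as usual.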
Agencies like this /already/ have plenty of other restrictions on what names are permissible, this is just a new one.

Most are to do with ones which could be misleading, eg you can’t have ‘bank’ in the name unless you are, well, an actual bank.