by underwater 3622 days ago
That's not actionable. You have to deal with user input at some point.

What if my language can't deal with strings properly? Maybe strpos has a buffer overflow on a carefully crafted input. Would I be wrong for using it?

"Never trust user input" doesn't mean "never use user input"; you need to use it carefully. Often that means restricting it to acceptable values and lengths, escaping it (appropriately!), and telling the functions you hand it to that it is user input (e.g. SQL placeholders). When a function does arcane things and gives you no way to flag its input as user-provided, that's a red flag.
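
To make the placeholder point concrete, here is a minimal sketch in Python with the standard sqlite3 module. The `users` table, the `find_user` and `demo` helpers, and the 64-character limit are all made up for illustration; they are assumptions, not anything from a real schema.

```python
import sqlite3

MAX_NAME_LEN = 64  # assumed limit; pick one that fits your own schema


def find_user(conn: sqlite3.Connection, name: str):
    """Look up a user by name, treating `name` as untrusted input."""
    # Restrict to acceptable types and lengths before doing anything else.
    if not isinstance(name, str) or not (0 < len(name) <= MAX_NAME_LEN):
        raise ValueError("name has an unacceptable length")
    # The "?" placeholder tells the driver this value is data, not SQL.
    cur = conn.execute("SELECT id, name FROM users WHERE name = ?", (name,))
    return cur.fetchall()


def demo() -> sqlite3.Connection:
    """Build a throwaway in-memory database with one hypothetical user."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("INSERT INTO users (name) VALUES (?)", ("alice",))
    return conn
```

With this, a classic injection string such as `' OR '1'='1` is just compared against the `name` column as a literal and matches nothing.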

Note that a cookie that you set, and are now getting back, IS user input unless you do something to validate that it's actually the value you set. (An HMAC is a good start.)
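
A rough sketch of the HMAC idea, using Python's standard hmac module. The key, the `value.mac` cookie layout, and the function names are assumptions for illustration; a real deployment would also want expiry and key rotation, which this leaves out.

```python
import hashlib
import hmac

# Placeholder only: use a long random secret that never leaves the server.
SECRET_KEY = b"replace-with-a-long-random-server-side-secret"


def _mac(value: str, key: bytes) -> str:
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def sign_cookie(value: str, key: bytes = SECRET_KEY) -> str:
    """Return the value with its HMAC appended after a dot."""
    return f"{value}.{_mac(value, key)}"


def verify_cookie(cookie: str, key: bytes = SECRET_KEY):
    """Return the original value if the MAC checks out, otherwise None."""
    value, sep, mac = cookie.rpartition(".")
    if not sep:
        return None  # no MAC at all
    try:
        expected = _mac(value, key)
    except UnicodeError:
        return None  # value cannot even be encoded; certainly not ours
    # compare_digest avoids leaking how many leading characters matched.
    if hmac.compare_digest(mac.encode("utf-8", "replace"), expected.encode("ascii")):
        return value
    return None
```

Usage: send `sign_cookie("user=42")` to the browser, and on the way back only trust what `verify_cookie` returns.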

If your language can't deal with strings properly, I strongly suggest you not expose it to strings provided by users. If you do expose it to strings from users, at least you should sandbox your application as much as possible.

I mentored someone doing their master's in information security. My student did their dissertation on input scrubbing. We did quite extensive research on the subject, and we found that a simple AWK program doing regular expression matching on the input, before passing it on to the conventional scrubbers inside languages like PHP, virtually eliminated attack vectors. For three months we tried our very best to craft some SQL code that would get past the AWK regex, and we couldn't. Lesson learned.
> Lesson learned.

Without meaning to be sarcastic (particularly because I found your post interesting), what lesson learned? A casual perusal of your post suggests the lesson "one can't craft SQL code to get by an AWK regex", but, of course, "what I can't do no-one can" is a bad lesson to learn in security.

The lesson we learned is that sometimes getting back to the roots (AWK) and using simple methods (regex) can be extremely effective. You are of course right that "what I can't do no-one can" is a bad thing.
IMO your lesson is to get to define a problem simply enough that you can apply a simple solution. This is not a given and usually needs serious design and project management skills.

Otherwise even your simple solution would be drowned in "Can you support multi-byte characters? Do you handle non-Unicode stuff? What if it leaks in your layers of code before reaching your AWK library?" and the other problems that abound in most mildly complex projects.

> your lesson is to get to define a problem simply enough that you can apply a simple solution.

Hear hear! So true. The problem is that making complex things simple is extremely difficult.

It is actionable, sort of, but it takes a lot of careful thinking about what you can safely do with data you do not trust.

E.g. you need to assume that every function you pass that data to will be subjected to malicious or accidentally broken input. I originally wrote "unless/until you have sanitised the content", but really, when building applications taking user data, just assume that you're dealing with malicious or accidentally broken input everywhere unless you have proven otherwise.

This goes from the trivial: Range check numbers; check length of strings, and verify any other constraints (encoding, character set limitations) that you may later depend on.
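
The trivial end of that spectrum might look something like the sketch below. The function names and the limits (1 to 1000, 10 characters, 256 bytes) are arbitrary assumptions; the point is only that every constraint you later depend on gets checked explicitly.

```python
def parse_quantity(raw: str, low: int = 1, high: int = 1000) -> int:
    """Parse an untrusted decimal string into an int within [low, high]."""
    if len(raw) > 10:
        raise ValueError("too many characters for a quantity")
    # isdigit() alone accepts non-ASCII digits, so require ASCII as well.
    if not (raw.isascii() and raw.isdigit()):
        raise ValueError("quantity must consist of ASCII digits only")
    value = int(raw)
    if not low <= value <= high:
        raise ValueError("quantity out of range")
    return value


def decode_text(raw: bytes, max_bytes: int = 256) -> str:
    """Decode untrusted bytes as strict UTF-8 with a length cap."""
    if len(raw) > max_bytes:
        raise ValueError("input too long")
    try:
        text = raw.decode("utf-8")  # strict mode: invalid sequences raise
    except UnicodeDecodeError as exc:
        raise ValueError("input is not valid UTF-8") from exc
    # Reject control characters and other non-printable code points.
    if any(not ch.isprintable() for ch in text):
        raise ValueError("input contains non-printable characters")
    return text
```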

To the very complex: Going to re-use end-user HTML? It may sound simple, but you basically need an HTML parser with explicit white-listing of tags and attributes, and if you allow CSS you need to parse and white-list CSS attributes too. The biggest risks: the chance of executing malicious JS in the context of another logged-in user, including in your admin interface, and the chance of causing unintended side effects if you allow triggering HTTP requests. As a minimum, even assuming nobody is stupid enough any longer to trigger side effects on GET requests, and assuming GET requests are all an attacker is able to trigger, it has privacy impacts, including the chance of leaking details about your admin systems or any third-party systems you pass the HTML on to.
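
To give a feel for the shape of the white-listing approach, here is a deliberately tiny sketch on top of Python's standard html.parser. The allowed tag and attribute sets and the `sanitize_html` name are assumptions for illustration. It does not balance tags, does not handle CSS at all, and emits the text inside dropped elements (such as script bodies) as escaped, inert text; treat it as a demonstration of the idea, not as something to deploy.

```python
from html import escape
from html.parser import HTMLParser

ALLOWED_TAGS = {"b", "i", "em", "strong", "p", "br", "a"}  # assumed allowlist
ALLOWED_ATTRS = {"a": {"href"}}                            # per-tag attributes
ALLOWED_SCHEMES = ("http://", "https://")
VOID_TAGS = {"br"}


class AllowlistSanitizer(HTMLParser):
    """Re-emit only allowlisted tags/attributes; everything else is dropped."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED_TAGS:
            return
        kept = []
        for name, value in attrs:
            if value is None or name not in ALLOWED_ATTRS.get(tag, ()):
                continue
            # Only absolute http(s) links survive; javascript: etc. are dropped.
            if name == "href" and not value.strip().lower().startswith(ALLOWED_SCHEMES):
                continue
            kept.append(f' {name}="{escape(value, quote=True)}"')
        self.out.append(f"<{tag}{''.join(kept)}>")

    def handle_endtag(self, tag):
        if tag in ALLOWED_TAGS and tag not in VOID_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Text is always re-escaped, so it can never turn back into markup.
        self.out.append(escape(data, quote=False))


def sanitize_html(untrusted: str) -> str:
    parser = AllowlistSanitizer()
    parser.feed(untrusted)
    parser.close()
    return "".join(parser.out)
```

The design choice that matters is that output is rebuilt from scratch out of known-good pieces, rather than trying to strip known-bad pieces out of the original string.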

In general it means you have to understand all the ways the type of data you allow can go from being an innocuous, inert sequence of bytes to triggering effects that may be under the control of a potentially malicious user. And if you don't know, you have to assume the format is not inert when passed to any given piece of code.

E.g., to take a much simpler example than HTML, consider passing arbitrary XML to an XML parser in order to validate it against a schema as a way of sanitising it. Could be a smart thing to do. Except, even assuming the schema is strict enough, consider that a malicious XML document passed to a parser that's not explicitly configured to prevent it may be able to make HTTP requests with a source IP on your internal network (by specifying a suitable URL for the doctype).
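
One conservative way to close that particular hole is to refuse any document that declares a doctype at all, before it gets anywhere near schema validation. Below is a sketch using Python's standard expat bindings; the `DoctypeForbidden` exception and the `parse_without_doctype` function are names invented for this example, and whether rejecting every doctype is acceptable depends on what your legitimate documents look like.

```python
import xml.parsers.expat


class DoctypeForbidden(ValueError):
    """Raised when an untrusted document contains a DOCTYPE declaration."""


def parse_without_doctype(xml_bytes: bytes) -> list:
    """Parse XML and return element names in order, rejecting any DOCTYPE."""
    elements = []

    def on_doctype(name, system_id, public_id, has_internal_subset):
        # Bail out as soon as the declaration is seen; nothing is fetched.
        raise DoctypeForbidden(f"doctype {name!r} is not allowed")

    parser = xml.parsers.expat.ParserCreate()
    parser.StartDoctypeDeclHandler = on_doctype
    parser.StartElementHandler = lambda name, attrs: elements.append(name)
    parser.Parse(xml_bytes, True)  # True marks this as the final chunk
    return elements
```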

Doesn't need to be malicious either - I've seen plenty of systems have throughput fall through the floor because someone didn't handle this case and suddenly got a bunch of XML documents with a doctype URL that took ages waiting for requests to a downed nameserver for some third-party domain to time out.

In this case you had also better be sure you don't have any services "protected" only by being behind a firewall that allow side effects via GET requests. (There's a good reason to never allow side effects via GET requests, and to not allow unauthenticated services even behind your firewall: assume that somewhere, sometime, you will slip in this area and let a user-supplied URL get retrieved from an internal IP, given the multitude of formats that can include URLs.)
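
If you do fetch user-supplied URLs, one partial mitigation is to check the address you are about to connect to. The sketch below uses Python's standard ipaddress module; `is_internal_address` is a made-up helper, and it only looks at an already-resolved IP address. It is on the caller to resolve the hostname, run this check on the result, and then connect to that same address, otherwise the name can resolve differently between the check and the fetch.

```python
import ipaddress


def is_internal_address(addr: str) -> bool:
    """True if `addr` is an IP address that should not be fetched for a user."""
    ip = ipaddress.ip_address(addr)  # raises ValueError if not an IP literal
    # Unwrap IPv4-mapped IPv6 (::ffff:a.b.c.d) so the IPv4 rules apply.
    if ip.version == 6 and ip.ipv4_mapped is not None:
        ip = ip.ipv4_mapped
    return (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local
        or ip.is_multicast
        or ip.is_reserved
        or ip.is_unspecified
    )
```

This does nothing for the services themselves; it just narrows the ways a slip elsewhere can reach them.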

And yes, if there's a risk of strpos having a buffer overflow, you are now SOL if you haven't validated your input in a way that prevents it, and while that's an unlikely case, it is an important illustration of the overall point:

All third-party data is unsafe until proven safe in the context of the code it will be passed to.

As a wider point, you should consider not only your own immediate usage, but whether or not a given piece of data may ever be passed on to a third party API etc., as whether or not you consider their own security lapses to be their problem, it can also harm you.

As a corollary, you should assume any data coming from a trusted partner is as unsafe as data passed to you from a known hacker.

It's with data as with unprotected sex: when you take data from someone, you're exchanging data not just with them, but with everyone with access to their systems and anyone they exchange data with.

Don't assume they're being safe - it takes just a single slip-up in their data handling before what you might think are "safe" data fields provided by your partner are actually unvalidated content provided by a malicious user. You may think you know the source of the data when taking a feed from a trusted partner, but you don't - not really.

This goes to the extent that you should not just treat individual fields as supplied by potentially malicious users; you should treat their entire data feed as supplied by a potentially malicious user. As for why, consider the equivalent of SQL injection applied to whatever format your partner is passing you. Or they may have been hacked.

The TL;DR boils down to pretty much the comment you replied to. Anything longer, including the above, needs to come with a big, huge caveat: It's NOT complete.

You can write books about the ways data validation can go wrong and the things to look for, and what I've written above just scratches the surface in a few very unsatisfactory ways (except, hopefully, by terrifying you). You need to always approach it assuming the worst.

    It's with data as with unprotected sex: when you take data from someone, you're exchanging data not just with them, but with everyone with access to their systems and anyone they exchange data with.
I'll start calling airgapped systems abstinence-only networking.
As we all know abstinence-only doesn't work, so maybe there are stronger parallels here than at first glance. ;)
Well, it works if you actually practice it...
In both cases, it's much easier said than done.