by segmondy 3619 days ago
Actually, the takeaway is not that "you should never use user input on unserialize." It is that you should NEVER TRUST USER INPUT. This rule is as old as computing itself and trust of user input has always been the beginning of a security vulnerability. You need user input, you will use user input, but you must understand how it's used and filter, strip everything that is not needed away.
4 comments

> you should NEVER TRUST USER INPUT

I think a lot of people read this and feel the advice is too broad: it tries to cover so much in so few words.

It's of the same caliber as "don't trust strangers" and "be responsible for your actions". There is undeniable truth in it, you cannot go wrong following it, and it's difficult to argue against. But I think that's what makes it counterproductive and basically devoid of meaning.

Effectively if the rule is as old as computing, but it still has to be voiced, it means it's not a simple rule to begin with.

As you put it "you must understand how it's used and filter, strip everything that is not needed away". This basically means every time you have user input, ideally you'd have to audit all the libraries and frameworks accessing the data to check how they use it, and filter accordingly. Pass a construct to a JSON library? First check what the JSON library does with it, then sanitize your input for everything that could be harmful to the library. This solution is voiced in one sentence, but would mean hours/days of library auditing in a real world scenario.

TL;DR: you can't cover for everything, you have to choose your battles. Knowing which libraries have known vulnerabilities is valuable info.

That's not actionable. You have to deal with user input at some point.

What if my language can't deal with strings properly? Maybe strpos has a buffer overflow on a carefully crafted input. Would I be wrong for using it?

Never trust user input doesn't mean never use user input; you need to use it carefully -- often that means restricting to acceptable values and lengths, (appropriate!) escaping, and passing through that it's user input to other functions (ex: sql placeholders). When functions do arcane things, and don't let you pass through that it's user provided, that's a red flag.
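
To make the "restrict, then pass it through as user input" part concrete, here is a minimal Python sketch using the standard `sqlite3` module. The table, the `ALLOWED_ROLES` allowlist, and the length limit are made up for illustration; the point is only that the `?` placeholders hand the values to the driver as data rather than splicing them into the SQL text.

```python
import sqlite3

ALLOWED_ROLES = {"reader", "editor"}  # hypothetical allowlist
MAX_NAME_LEN = 64  # hypothetical limit


def add_user(conn: sqlite3.Connection, name: str, role: str) -> None:
    """Validate the input, then hand it to the database as data, not SQL."""
    if not 1 <= len(name) <= MAX_NAME_LEN:
        raise ValueError("name length out of range")
    if role not in ALLOWED_ROLES:
        raise ValueError("role not in allowlist")
    # The ? placeholders mark these values as user-supplied data.
    conn.execute("INSERT INTO users (name, role) VALUES (?, ?)", (name, role))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
add_user(conn, "Robert'); DROP TABLE users;--", "reader")
rows = conn.execute("SELECT name, role FROM users").fetchall()
```

The hostile-looking name is stored verbatim as a string and the table is still there afterwards; a role outside the allowlist is rejected before it ever reaches the database.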

Note that a cookie that you set, and are now getting back IS user-input, unless you do something to validate that it's actually the value you set. (HMAC is a good start)
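
A rough sketch of the HMAC idea in Python, using only the standard `hmac` and `hashlib` modules. The `value.mac` cookie layout, the function names, and the key are assumptions for illustration; a real deployment would also want expiry and key rotation, which this leaves out.

```python
import hashlib
import hmac
from typing import Optional

SECRET_KEY = b"replace-with-a-random-server-side-secret"  # hypothetical key


def sign_cookie(value: bytes) -> bytes:
    """Append a hex HMAC-SHA256 tag so tampering can be detected later."""
    mac = hmac.new(SECRET_KEY, value, hashlib.sha256).hexdigest().encode("ascii")
    return value + b"." + mac


def verify_cookie(cookie: bytes) -> Optional[bytes]:
    """Return the original value if the tag matches, otherwise None."""
    value, sep, mac = cookie.rpartition(b".")
    if not sep:
        return None  # no tag present at all
    expected = hmac.new(SECRET_KEY, value, hashlib.sha256).hexdigest().encode("ascii")
    # compare_digest avoids leaking the match position through timing.
    return value if hmac.compare_digest(mac, expected) else None


signed = sign_cookie(b"user_id=42")
tampered = signed.replace(b"42", b"1", 1)
```

Until `verify_cookie` returns a value, the cookie is just another piece of user input.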

If your language can't deal with strings properly, I strongly suggest you not expose it to strings provided by users. If you do expose it to strings from users, at least you should sandbox your application as much as possible.

I advised someone doing their masters in information security as a mentor. My student did their dissertation on input scrubbing. We did quite extensive research on the subject, and we found out that a simple AWK program doing regular expression matching on the input, before passing it on to conventional scrubbers inside of languages like PHP virtually eliminated attack vectors. For three months we tried our very best to craft some SQL code to get by the AWK regex and we couldn't. Lesson learned.
> Lesson learned.

Without meaning to be sarcastic (particularly because I found your post interesting), what lesson learned? A casual perusal of your post suggests the lesson "one can't craft SQL code to get by an AWK regex", but, of course, "what I can't do no-one can" is a bad lesson to learn in security.

The lesson we learned is that sometimes getting back to the roots (AWK) and using simple methods (regex) can be extremely effective. You are of course right that "what I can't do no-one can" is a bad thing.
IMO your lesson is to define a problem simply enough that you can apply a simple solution. This is not a given and usually needs serious design and project management skills.

Otherwise even your simple solution would be drowned in "Can you support multi-byte characters? Do you handle non-Unicode stuff? What if it leaks in your layers of code before reaching your AWK library?" and other problems that abound in most mildly complex projects.

It is actionable, sort of, but it takes a lot of careful thinking about what you can safely do with data you do not trust.

E.g. you need to assume that every function you pass that data to will be subjected to malicious or accidentally broken input. I originally wrote "unless/until you have sanitised the content", but really, when building applications taking user data, just assume that you're dealing with malicious or accidentally broken input everywhere unless you have proven otherwise.

This goes from the trivial: Range check numbers; check length of strings, and verify any other constraints (encoding, character set limitations) that you may later depend on.
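
The trivial end can be sketched in a few lines of Python. The limits and function names below are invented for illustration; what matters is that each constraint (range, length, encoding, character set) is checked once at the boundary, before anything else depends on it.

```python
import unicodedata

MAX_QUANTITY_DIGITS = 9  # hypothetical limit
MAX_COMMENT_BYTES = 8000  # hypothetical limit


def parse_quantity(raw: str, lo: int = 1, hi: int = 100) -> int:
    """Parse a decimal integer from ASCII digits and range-check it."""
    if len(raw) > MAX_QUANTITY_DIGITS:
        raise ValueError("quantity too long")
    if not raw.isascii() or not raw.isdigit():
        raise ValueError("quantity must be ASCII digits")
    value = int(raw)
    if not lo <= value <= hi:
        raise ValueError("quantity out of range")
    return value


def decode_comment(raw: bytes) -> str:
    """Length-check, decode as strict UTF-8, and reject control characters."""
    if len(raw) > MAX_COMMENT_BYTES:
        raise ValueError("comment too long")
    text = raw.decode("utf-8")  # raises UnicodeDecodeError on invalid bytes
    for ch in text:
        # Category "Cc" is control characters; allow only newline and tab.
        if unicodedata.category(ch) == "Cc" and ch not in "\n\t":
            raise ValueError("control characters not allowed")
    return text
```

`UnicodeDecodeError` is a subclass of `ValueError`, so a caller can treat every rejection the same way.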

To the very complex: Going to re-use end-user HTML? May sound simple, but you basically need an HTML parser with explicit white-listing of tags and attributes, and if you allow CSS you need to parse and white-list CSS attributes too (biggest risks: chance of executing malicious JS in context of another logged in user, including in your admin interface; chance of causing unintended side-effects if you allow triggering HTTP requests - as a minimum, even assuming nobody is any longer stupid enough to trigger side-effects on GET requests and assuming that's all they are able to trigger, it has privacy impacts including the chance of leaking details about your admin systems or any third party systems you pass the HTML on to).

In general it means you have to understand all the ways the type of data you allow can go from being innocuous inert sequences of bytes to triggering effects that may be under the control of a potentially malicious user, and you have to assume that if you don't know, then the format needs to be assumed to not be inert when passed to any given piece of code.
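
To give a feel for what "parser plus explicit white-listing" looks like, here is a deliberately tiny Python sketch built on the standard `html.parser` module. The allowed tags, attributes, and URL schemes are assumptions chosen for illustration, it does not handle CSS at all, and it does not rebalance unclosed tags; it is nowhere near a complete sanitizer, so treat it as an outline of the approach rather than something to deploy.

```python
from html import escape
from html.parser import HTMLParser

ALLOWED_TAGS = {"b", "i", "em", "strong", "p", "a"}  # hypothetical allowlist
ALLOWED_ATTRS = {"a": {"href"}}  # hypothetical allowlist
ALLOWED_SCHEMES = ("http://", "https://")


class AllowlistSanitizer(HTMLParser):
    """Re-emit only allowlisted tags and attributes; escape all text."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED_TAGS:
            return  # drop the tag itself; its text content is still escaped
        kept = []
        for name, value in attrs:
            if (
                name in ALLOWED_ATTRS.get(tag, ())
                and value
                and value.lower().startswith(ALLOWED_SCHEMES)
            ):
                kept.append(f' {name}="{escape(value)}"')
        self.out.append(f"<{tag}{''.join(kept)}>")

    def handle_endtag(self, tag):
        if tag in ALLOWED_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(escape(data))


def sanitize(markup: str) -> str:
    parser = AllowlistSanitizer()
    parser.feed(markup)
    parser.close()  # flush any buffered text
    return "".join(parser.out)
```

Note the direction of the check: everything is dropped unless it is on the list, so an attribute or scheme nobody thought of is rejected by default.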

E.g. to take a much simpler example than HTML. Consider passing arbitrary XML to an XML parser in order to validate it against a schema to sanitise. Could be a smart thing to do. Except, even assuming the schema is strict enough, consider that a malicious XML document passed to a parser that's not explicitly configured not to, may be able to make HTTP requests with a source IP on your internal network (by specifying a suitable URL for the doctype).
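
One blunt way to close this particular hole is to refuse any document that carries a DOCTYPE at all, before it reaches the schema validator. A minimal Python sketch using the standard `xml.parsers.expat` module follows; the `DoctypeForbidden` exception and `check_no_doctype` function are names invented here, and this assumes you can live without DTDs entirely.

```python
from xml.parsers import expat


class DoctypeForbidden(ValueError):
    """Raised when an incoming document declares a DOCTYPE."""


def check_no_doctype(document: bytes) -> None:
    """Parse the document once and abort as soon as a DOCTYPE is seen."""
    parser = expat.ParserCreate()

    def reject_doctype(name, system_id, public_id, has_internal_subset):
        # Raising inside the handler aborts parsing and propagates to the caller.
        raise DoctypeForbidden(f"DOCTYPE {name!r} not allowed")

    parser.StartDoctypeDeclHandler = reject_doctype
    parser.Parse(document, True)  # True marks this as the final chunk


plain = b"<order><item>1</item></order>"
hostile = b'<!DOCTYPE order SYSTEM "http://10.0.0.5/internal.dtd"><order/>'
```

Malformed XML still raises `expat.ExpatError`, so the caller has to handle both kinds of rejection.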

Doesn't need to be malicious either - I've seen plenty of systems have throughput fall through the floor because someone didn't handle this case and suddenly got a bunch of XML documents with a doctype URL that took ages waiting for requests to a downed nameserver for some third-party domain to time out.

In this case you'd also better be sure you don't have any services that are "protected" only by being behind a firewall that allows side effects via GET requests (there's a good reason to never allow side effects via GET requests and not allow unauthenticated services even behind your firewall, on the assumption that somewhere, sometime, you will slip in this area and allow a user-supplied URL to get retrieved from an internal IP due to the multitude of formats that can include URLs).

And yes, if there's a risk of strpos having a buffer overflow, you are now SOL if you haven't validated your input in a way that prevents it, and while that's an unlikely case, it is an important illustration of the overall point:

All third-party data is unsafe until proven safe in the context of the code it will be passed to.

As a wider point, you should consider not only your own immediate usage, but whether or not a given piece of data may ever be passed on to a third party API etc., as whether or not you consider their own security lapses to be their problem, it can also harm you.

As a corollary, you should assume any data coming from a trusted partner is as unsafe as data passed to you from a known hacker.

It's with data as with unprotected sex: when you take data from someone, you're exchanging data not just with them, but with everyone with access to their systems and anyone they exchange data with.

Don't assume they're being safe - it takes just a single slip-up in their data handling before what you might think are "safe" data fields provided by your partner are actually unvalidated content provided by a malicious user. You may think you know the source of the data when taking a feed from a trusted partner, but you don't - not really.

To the extent that you should not just treat individual fields as supplied by potential malicious users. You should treat their entire supplied data feed as supplied by a potentially malicious user. As for why, consider the equivalent of SQL injection applied to whatever format your partner is passing you. Or they may have been hacked.

The TL;DR boils down to pretty much the comment you replied to. Anything longer, including the above, needs to come with a big, huge caveat: It's NOT complete.

You can write books about the ways data-validation can go wrong and things to look for, and what I've written above just scratches the surface in a few very unsatisfactory ways (except, hopefully, by terrifying you). You need to always approach it assuming the worst.

> It's with data as with unprotected sex: when you take data from someone, you're exchanging data not just with them, but with everyone with access to their systems and anyone they exchange data with.

I'll start calling airgapped systems abstinence-only networking.
As we all know abstinence-only doesn't work, so maybe there are stronger parallels here than at first glance. ;)
Well, it works if you actually practice it...
In both cases, it's much easier said than done.
Even better would be to not trust what one's application is returning back, and scrub the output in addition to scrubbing the input.
> you should NEVER TRUST USER INPUT

That is not clear at all and pretty useless. What does it mean? I should not accept any user input at all?

> strip everything that is not needed

That does not always work. What if I have a comment form that should accept any characters?

>What does it mean? I should not accept any user input at all?

No, it means you should never assume that user data is safe, or even sane. Assume, rather, that everything every user is sending you is malicious, all the time, and write your code accordingly.

>What if I have a comment form that should accept any characters?

First, you probably shouldn't, because your database and HTML should be using explicit character encodings, so a comment form that accepts anything doesn't make a lot of sense. How are you expecting to deal with "any characters"? What happens when they paste in a binary blob, or javascript code?

Secondly, assuming you want to do that, you still shouldn't trust the data. Add it to the database using parameterized queries, escape it when rendering, never mix it in to javascript variables and never serialize it into a format designed to unserialize executable objects.
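
The "escape it when rendering" and "don't serialize it into an executable format" parts can be sketched with Python's standard `html` and `json` modules. The function names and the HTML shape are made up for illustration; the idea is that the comment is stored as-is and only made safe for HTML at the moment it is written into HTML.

```python
import html
import json


def render_comment(author: str, body: str) -> str:
    """Escape user-supplied text at the point where it is rendered into HTML."""
    return (
        '<div class="comment">'
        f"<b>{html.escape(author)}</b>: {html.escape(body)}"
        "</div>"
    )


def store_comment(author: str, body: str) -> str:
    """Serialize to JSON, a data-only format; loading it yields plain data."""
    return json.dumps({"author": author, "body": body})


page = render_comment("mallory", '<script>alert("x")</script>')
restored = json.loads(store_comment("mallory", "any characters: <>&'\""))
```

The comment form can accept "any characters" this way: nothing is stripped, and the characters only get a chance to mean something in a context where they have been escaped for it.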

It's not an unreasonable burden to expect web developers to at least be aware and code defensively. Especially with PHP.

> it means you should never assume that user data is safe, or even sane

I'm curious if Haskell's purity helps developers focus on this issue and therefore makes it easier to mitigate. Given that all user input/state already has to be handled carefully (for ex: with monads). It will be obvious in the codebase which parts need to be zero'd in on for possible attack vectors.

Haskell's web frameworks help, but it's nothing to do with purity. In fact any web framework can do this, you segregate user-supplied data and ensure it can never be supplied to an untrusted function without explicit cleaning.

Perl and Ruby have included this as a 'tainted' flag, many functions cannot be called with a tainted string.
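
The same segregation can be approximated in a language without a taint flag by wrapping user-supplied strings in their own type. Here is a small Python sketch; the `Untrusted` and `SafeHtml` classes and the `render` function are invented for illustration and are not part of any framework.

```python
import html


class SafeHtml:
    """Markup that has already been escaped for an HTML context."""

    def __init__(self, markup: str) -> None:
        self.markup = markup


class Untrusted:
    """Wraps a user-supplied string so it cannot pass as a plain str by accident."""

    def __init__(self, raw: str) -> None:
        self._raw = raw

    def escaped_for_html(self) -> SafeHtml:
        # The only way out of the wrapper is through an explicit cleaning step.
        return SafeHtml(html.escape(self._raw))

    def __str__(self) -> str:
        raise TypeError("refusing to use untrusted input as a plain string")


def render(fragment: SafeHtml) -> str:
    """A 'trusted' function: it only accepts explicitly cleaned values."""
    if not isinstance(fragment, SafeHtml):
        raise TypeError("render() only accepts SafeHtml")
    return f"<p>{fragment.markup}</p>"


comment = Untrusted("<b>hi</b>")
rendered = render(comment.escaped_for_html())
```

Python won't stop a determined developer from reaching into `_raw`, so this is a convention enforced at runtime rather than a guarantee; a static type system can push the same check to compile time.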

>I'm curious if Haskell's purity helps developers focus on this issue and therefore makes it easier to mitigate

No, Haskell's type system does, not its purity.

>Given that all user input/state already has to be handled carefully (for ex: with monads)

What? Monads are not some mythical beast, there is nothing "handled carefully" about it. A monad is just a general interface.

I'm talking about clear separation of state via Monads making it easier to focus on the riskier input. For example when you are doing code reviews. That is about the developer, nothing to do with some inherent functionality of Monads. Not sure what it being a "general interface" has to do with that. Maybe I wasn't clear.
>Not sure what it being a "general interface" has to do with that

I was explaining what monads are. You have read some of the weird misconceptions about Haskell and monads and are now repeating them.

I'm not being a smartaleck, but "you shouldn't be writing code" with your attitude/approach. "NEVER trust user input" is an important security mantra to learn all on its own, like "wipe your butt/wash your hands" is in another context.

The guy is writing a valid point on Hacker(!) News. People writing comments on HN (especially to summarize a takeaway from a longer form article) are not required to accurately recapitulate entire dossiers of how to process input. It is completely valid to say "you should never trust user input". Somebody who is looking to make that "actionable" or "clear and pretty useful" can very very easily google the phrase and will turn up a lot of useful answers and information.

This is what is meant by the idea that the simplicity of the iPhone UI and/or automated IDEs has created a generation of helplessness and entitlement.

The good advice remains good advice: you should never trust user input. If you can't turn that into sound advice from Hacker News, your options become limited to, nobody should trust the code you write, you shouldn't write code, or you shouldn't read hacker news for advice.

But the idea that people need to write what you personally need to hear or they shouldn't write comments? That's nuts. Could I have written a more useful comment to you and to the community? I'll tell you this, I did think about it, and this is my best shot at what I thought you and the community could benefit from!

There used to be a guy on usenet news who posted all sorts of stuff, and had the name of his company in his .sig line, and he included the phrase "these ARE the opinions of my company" instead of that boring old boilerplate "none of the opinions I express are..."

> "NEVER trust user input" is an important security mantra to learn all on its own, like "wipe your butt/wash your hands" is in another context.

Even the contexts are not so different. DNA is an information carrier, life is an information system, hygiene and the immune system are information security mechanisms.

Though I am not sure who the user is in this analogy.

You're confusing "trust" with "use," which appears to be the cause of your apparent bewilderment. I could be wrong, however.

Suppose you were handling snakes. Some snakes are not poisonous, and don't need to be handled with the care you'd handle, say, a black mamba with. However, you are being advised to treat every snake you encounter as though it was the most poisonous snake known, and apply every care that you normally apply to snakes that you know are poisonous.

Will you handle the snakes? Yes, you're a snake handler, remember? But you handle all of them like they are deadly, even the ones you "know" to not be deadly.

On a side note; snakes are venomous, not poisonous. Your point is well taken though - and I've come closer than I'd like to a few tiger snakes in my area.

User input ought to be treated with the same kind of respect, although in terms of user input the option of giving them a wide berth isn't always as practical.