by jwecker 3574 days ago
I always assumed it was more a sanitization issue for security's sake. By allowing only a simple subset (the "common" kind of email address), you can be agnostic about which email server is running and how it reacts to the wide variety of specially crafted email addresses.

With no validation other than sending the email, you have to know, for example, what the server would do with an email address that claims to be @localhost. Now it becomes a problem, or at least a question and concern, for the backend system. Whether the backend interprets root@localhost as valid and does exactly what it's told or rejects it due to some configuration, it has become a backend complication and a DoS attack vector.

A simple policy of only handling a subset, the common class of email addresses, is one of the things that allows us to have a simple mental model of what the MTA is supposed to do. The fact that it sometimes caught a typo, or not, is incidental. "Invalid email" wasn't meant to imply the email address doesn't fit the spec; it was meant to imply that a particular site or service has chosen not to accept email addresses like that.

Or at least that's what I assumed :-)


> I always assumed it was more a sanitization issue for security's sake.

Sanitization is at best idiotic, at worst creates security problems. There is no such thing as "bad characters"; there is only broken code that incorrectly encodes stuff. If you ever find yourself modifying user input "for security reasons" (or really, for any reason at all), you are doing it wrong. The only sane thing to do is to make sure that the semantics of every single character of your user's input are preserved in whatever data format you need to represent it in.

An email address isn't a document though, it's a routing command. I don't mean sanitization in the sense of inserting backslashes. I mean sanitization in the sense of "we don't allow people to set their email address to a mailbox on localhost at our mail server."

1. Sanitization generally means changing information. As in, "removing bad characters", that kind of stuff. That's different from validation, which should result in rejection of bad input, and which can be perfectly fine. However, more often than not, validation is implemented badly and rejects perfectly fine input, which is why validation shouldn't be employed more than necessary either.

2. Rejecting @localhost addresses doesn't really make a whole lot of sense. People could just enter the public IP address or hostname of the server, or add a DNS A record under their own domain that points to 127.0.0.1, or an MX record that points to localhost, or any number of other weird things that you could not possibly validate anyway (if only because they could be changed at any point later on). Just configure your mail server properly and then send the damn email, and if it does get sent to root@localhost, and possibly forwarded to the admin--so what? People obviously could just sign up using your admin's email address anyway, not only at your site but at millions of sites out there, and you won't be able to stop them. There is nothing particularly dangerous about receiving unsolicited signup emails or about sending emails to yourself.

> There is nothing particularly dangerous about receiving unsolicited signup emails or about sending emails to yourself.

Depends on what you do with them, in the latter case. There could be an amplification attack there.

Validating domain parts to a certain extent isn't a bad idea, at least as far as non-routable domain names and RFC1918 ranges go. I've seen this done (actually implemented some of it, in fact) at a past employer, who were basically looking to cover the 90% case in terms of not getting hosed by a trivial attack. It doesn't take much effort and it makes 4chan's life harder. What's not to like?

> Depends on what you do with them, in the latter case. There could be an amplification attack there.

Huh? How would that work?

> Validating domain parts to a certain extent isn't a bad idea, at least as far as non-routable domain names and RFC1918 ranges go.

What do you mean by "non-routable domain names" and what do you gain by checking for RFC1918 ranges?

> I've seen this done (actually implemented some of it, in fact) at a past employer, who were basically looking to cover the 90% case in terms of not getting hosed by a trivial attack.

Why did you prefer that approach over a robust solution?

My idea of a robust solution: Have one central outbound relay that's firewalled off from connecting to anywhere but the outside world, make all servers that need to send email use that relay as a smarthost (so they never connect to anything but that relay, regardless of what the destination address is), and use TLS and SMTP AUTH with credentials per client server to prevent abuse of the relay by third parties.
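From the application server's side, the smarthost setup described above might look roughly like the following sketch. It uses Python's standard smtplib; the relay hostname, port, and function names are hypothetical placeholders, and the firewalling of the relay itself is a network configuration matter that code like this cannot show.

```python
import smtplib
import ssl
from email.message import EmailMessage

# Hypothetical values: the relay host and port are placeholders for illustration.
RELAY_HOST = "relay.internal.example"
RELAY_PORT = 587


def build_signup_message(sender: str, recipient: str, activation_link: str) -> EmailMessage:
    """Build the signup mail. The recipient is passed through untouched."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = "Confirm your address"
    msg.set_content(f"Follow this link to activate your account:\n{activation_link}\n")
    return msg


def send_via_relay(msg: EmailMessage, username: str, password: str) -> None:
    """Hand the message to the central relay.

    The application server only ever connects to RELAY_HOST, whatever the
    destination address is; delivery decisions are left to the relay.
    """
    with smtplib.SMTP(RELAY_HOST, RELAY_PORT, timeout=10) as smtp:
        smtp.starttls(context=ssl.create_default_context())  # encrypt before authenticating
        smtp.login(username, password)  # per-server credentials (SMTP AUTH)
        smtp.send_message(msg)
```

With this arrangement, an address like root@localhost is just another recipient string handed to the relay, which sits on a different machine.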

> What's not to like?

(a) that it's a lot easier to build a solution that's more robust, (b) it's extremely likely that your implementation is buggy, thus rejecting valid addresses, and (c) it's causing a maintenance burden (what happens when the first people drop IPv4 for their MXes? I'd pretty much bet that you don't check for AAAA records, so you'd probably suddenly start rejecting perfectly fine email addresses, thus making the transition to IPv6 unnecessarily harder, am I right?).

'Non-routable' as in a single label, or as in not resolvable. I don't think it is unreasonable to consider an address invalid when its domain part cannot be resolved. Checking for RFC1918 ranges means you don't try to send to another class of addresses that's never going to be received.

You would lose the bet. The product supported IPv6 from day one.
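For illustration only, a check along these lines could be sketched with Python's standard ipaddress module. This is not the product's actual code. The function name is made up, and treating IPv6 unique local addresses (fc00::/7) as the counterpart of the RFC1918 ranges is an assumption of this sketch. It only makes sense when applied to the addresses a domain resolves to at send time, which is the DNS objection raised elsewhere in this thread.

```python
import ipaddress

# The three private IPv4 ranges defined by RFC 1918.
RFC1918_NETS = [
    ipaddress.ip_network(n)
    for n in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]

# Assumption: IPv6 unique local addresses are treated like RFC 1918 space.
IPV6_ULA = ipaddress.ip_network("fc00::/7")


def is_unreachable_mail_target(addr: str) -> bool:
    """Return True if a resolved mail-host address is loopback or private."""
    ip = ipaddress.ip_address(addr)  # raises ValueError on malformed input
    if ip.is_loopback:
        return True
    if ip.version == 4:
        return any(ip in net for net in RFC1918_NETS)
    return ip in IPV6_ULA
```

For example, `is_unreachable_mail_target("10.1.2.3")` and `is_unreachable_mail_target("::1")` are both true, while a public address such as `8.8.8.8` passes.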

That is a robust, if somewhat complex, solution for a relatively small volume of mail. When you're sending ten million messages a day by the end of the first month, pushing everything into a single relay of any kind is asking for a lot of trouble.

Then reject email addressed to localhost. It shouldn't matter how the email got there. I'd suggest that especially given DNS trickery involving setting up a low TTL then redirecting to 127.0.0.1, you're probably not preventing this from happening or you'd have to invalidate any unrecognised domain. Better to solve that problem at a different layer -- validate the email by sending a validation link if you must...

True, but that's my point: it's a backend issue, not a front-end "help the user" issue.

And their point is, it's a backend issue, the backend being the mail client/server that already completely handles the sending/receiving of emails. Either the activation link gets clicked or it doesn't. The click is the only correct validation, and yes, the whole process happens on the "backend".
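A minimal sketch of that activation-link flow, assuming Python and an in-memory dict standing in for whatever datastore a real site would use. The URL and function names are hypothetical.

```python
import secrets

# Stand-in for a real datastore: maps unguessable tokens to unverified addresses.
pending = {}


def start_verification(address: str) -> str:
    """Store the address as entered and return the link to mail to it."""
    token = secrets.token_urlsafe(32)
    pending[token] = address
    return f"https://example.com/activate/{token}"  # hypothetical URL


def confirm(token: str):
    """Return the verified address, or None if the token is unknown or already used."""
    return pending.pop(token, None)
```

The address is never inspected here. It counts as valid only once someone who received the mail presents the token.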
I mostly agree but there are cases where some definition of sanitization is the only appropriate thing. For example, if you allow users to create content with a lightweight subset of HTML for the sake of formatting control and want to render that HTML in your page. And in such cases, the correct way to sanitize it is not via regexps but via a DOM parser that takes user input and builds a DOM and then emits rendered HTML according to a whitelist of available tags/attributes. You might argue that DOM parsing isn't sanitization and so still matches your assertion; in general, however, it's common and not really inaccurate to call this sanitization.

Well, it depends ... ;-)

The important thing is to not change information. "Sanitization" as it is commonly used means doing something that (potentially) changes information. Which is in contrast to decoding/encoding/parsing/unparsing/translation/..., which, if done correctly, change representation, but not information.

So, to make it a useful distinction, I would call anything that potentially changes the semantics of the processed data "sanitization", and avoid using the term for anything else.

So, simply parsing a string with an HTML parser, possibly checking for acceptable elements, and then serializing back into some sort of canonical form that is semantically equivalent to the input, that's perfectly fine, and I wouldn't call that sanitization, but rather validation and canonicalization.

If you simply start dropping elements, though, that's probably a bad idea, just as simply dropping "<" characters is a bad idea, because those elements presumably bear some semantic meaning, just as a "<" in a message presumably bears some semantic meaning.

Now, it is not always obvious which level of abstraction to evaluate the semantics (and thus the preservation of semantics) at. So, it might be perfectly fine, for example, to remove or replace some elements where the semantics are known and you can show that, say, removing emphasis still generally preserves the meaning of a text.

But a whitelist approach where you simply remove everything that isn't on the whitelist usually is a bad idea. If you want to have a whitelist, use it for validation, and reject anything that's not acceptable, so the user can transform their input in such a way as to avoid any constructs you don't want, while still retaining the meaning of what they are trying to say.
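As a rough sketch of using a whitelist for validation rather than for stripping, assuming Python's standard html.parser: the allowed tags and attributes below are invented for illustration, and a real check would also have to look at things this one ignores, such as URL schemes in href.

```python
from html.parser import HTMLParser

# Hypothetical whitelist, for illustration only.
ALLOWED_TAGS = {"p", "b", "i", "em", "strong", "a"}
ALLOWED_ATTRS = {"a": {"href"}}


class WhitelistValidator(HTMLParser):
    """Collects reasons to reject the input; never modifies it."""

    def __init__(self):
        super().__init__()
        self.problems = []

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED_TAGS:
            self.problems.append(f"element <{tag}> is not allowed")
            return
        for name, _value in attrs:
            if name not in ALLOWED_ATTRS.get(tag, set()):
                self.problems.append(f"attribute {name!r} is not allowed on <{tag}>")


def validate_markup(text: str) -> list:
    """Return a list of problems; an empty list means the input is acceptable."""
    validator = WhitelistValidator()
    validator.feed(text)
    validator.close()
    return validator.problems
```

The problems would be shown to the user, who can then rephrase the input. Nothing is silently dropped.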

I hear what you're saying and it represents an ideal. But there are circumstances where information really has to be removed. Perhaps because the user is no longer present and it was collected under circumstances that had more liberal validation. Or because you're handing information across a boundary of implementation ownership and can't trust the receiver to handle potentially dangerous information correctly. I agree that sanitization (in the sense of stripping bits out of a user data payload according to some security rules) shouldn't be the first tool in the toolbox, but I would really hesitate to say it's always the wrong thing to do.

Edit: here's a good example. I don't know if they still do this, but when I worked at Yahoo!, they used a modified version of PHP that applied a comprehensive sanitization process to all user inputs. As a frontend coder at Y!, all the information you pulled from request parameters, headers, etc. ran through this validation at the PHP level before your app code got to it. You could then literally splat this information into an HTML page raw, without any further treatment, and not expose the Y! property you worked for to XSS or other injection vectors. There were ways to obtain the raw input using explicit accessors when needed, and these workarounds were detectable by code monitoring tools and had to be reviewed and approved by security team(s). Overall this worked really quite well, in my opinion. Y! could hire junior frontend devs without deep knowledge of data encoding, security issues, etc. and rest easy. I think the principle of safe-by-default, even if it means destruction of user input in some cases via aggressive sanitization, is a good principle to apply to a frontend framework.

re edit:

No, that's just a terrible idea. It might work quite well in the sense that it prevents server security holes. But it makes for terrible usability, and potentially even security problems for the user. The user expects that their input is reproduced correctly, and if it isn't, that can potentially have catastrophic consequences because it might result in silent corruption.

See also: http://www.dkriesel.com/en/blog/2013/0802_xerox-workcentres_...

If you think that it should be possible to have developers who don't understand this problem and its solution, you might as well argue that it should be possible to have developers who don't know any arithmetic or who are illiterate.

If you want to have some help from the computer in avoiding injection attacks, the solution is a type system that ensures that you cannot accidentally use user input as HTML or SQL, for example, or possibly automates coercion when you need to insert pieces of one language into another.
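A minimal sketch of that idea, assuming Python: plain strings and HTML are kept as distinct types, and the only way from one to the other is an encoding step. The names `Html`, `text`, and `concat` are made up for this example. Since Python is dynamically typed, the mistake is caught at runtime here, whereas a statically typed language or a type checker could catch it earlier.

```python
import html
from dataclasses import dataclass


@dataclass(frozen=True)
class Html:
    """A string that is known to be HTML markup."""
    markup: str


def text(s: str) -> Html:
    """Encode plain text as HTML; every character keeps its meaning."""
    return Html(html.escape(s, quote=True))


def concat(*parts: Html) -> Html:
    """Join HTML fragments, refusing anything that was not encoded first."""
    for part in parts:
        if not isinstance(part, Html):
            raise TypeError(
                f"expected Html, got {type(part).__name__}; encode plain text with text()"
            )
    return Html("".join(part.markup for part in parts))
```

Usage: `concat(Html("<p>"), text(user_input), Html("</p>"))` works, while passing `user_input` directly raises a TypeError instead of producing an injection hole. The user's "<" survives as "&lt;" rather than being removed.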

And that's a great philosophy until your email gets rejected by some service that picked a different "common class of email addresses" than you did. This is precisely why we have written standards.

Back when SMTP servers still had remnants of UUCP etc., where the address could actually contain characters that specify intermediate servers to route to, I would have argued that front-end sanitization was as important as, for example, HTML sanitization of end-user input from a security perspective.

However, the IETF made lots of progress simplifying things, to the point where, at the very least, the standard tells us specifically that we should leave it up to the destination host to interpret the local part of the email address. That is, the thing to the right of the @ should be given the thing on the left unmolested, ideally even being ignored by intermediate relay servers. Since that's what most people complain about, any validation to the left of the @ should become extinct.

But off the top of my head that still leaves the thing on the right of the @ (such as localhost), buffer overflows by allowing longer strings than the standard allows (those limits do exist), and the problem with multiple @ which the MTA may or may not handle well... Since I'm not a security expert I'm going to go out on a limb and assume that I'm missing a bunch of other things.

My point is not that the article is wrong, though. My point is that if he wanted to convince me to only validate by sending the email on any string, he should convince me that those security concerns are not an issue, not that it's not good at catching typos.

I think there is a reasonable middle-ground for validating the domain side of an email address. There are RFCs on all this stuff; it's not just a total free-for-all. The RFCs just aren't nearly as strict as a lot of badly-designed validation regexps are, presumably because most people are unaware of the diversity of acceptable email addresses.

The currently operative RFC is 5322, specifically section 3.4.1: https://tools.ietf.org/html/rfc5322#section-3.4.1

There are some basic rules that you could safely apply to an address, which would prevent some attacks (buffer overflows, etc.) while also not blocking any legitimate addresses. Limiting the overall length to 255 characters, for instance, could be defensible practice. There are also well-defined rules for validating the domain portion, since it has to be a routable address by definition.

What nobody ought to be doing is looking too hard at the string to the left of the @ symbol, because it's designed purely as instructions to the recipient server. Nobody else needs to care about it; only the receiving mailserver needs to actually parse that part of the address, in order to put the message into the right mailbox. From what I've seen, the vast majority of false-positive validation failures occur because people are looking at the mailbox portion of an email address when they have no business doing so.
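A minimal sketch of a check in that spirit, assuming Python. The 255-character cap is the figure from this comment and has not been checked against the RFCs here, and the function name is made up. The local part is only required to be non-empty and is otherwise passed through untouched.

```python
MAX_ADDRESS_LENGTH = 255  # figure taken from the comment above, not from the RFCs


def split_address(address: str):
    """Split an address into (local_part, domain) without judging the local part."""
    if len(address) > MAX_ADDRESS_LENGTH:
        raise ValueError("address too long")
    # Split at the LAST "@": a quoted local part may itself contain "@".
    local, at, domain = address.rpartition("@")
    if not at or not local or not domain:
        raise ValueError("expected local-part@domain")
    return local, domain
```

So `split_address("user+tag@example.com")` gives `("user+tag", "example.com")`, and an unusual local part is accepted as-is. Any further checks would apply to the returned domain only.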

Your security model is garbage if you depend on controlling all apps that might send email to your mail server.