by glimcat 4704 days ago
As long as we're playing the "Falsehoods Programmers Believe About Names" game again, here's the relevant patio11 article:

http://www.kalzumeus.com/2010/06/17/falsehoods-programmers-b...

If you try to validate names, or if you don't safely escape names along with your other user-input strings, you're gonna have a bad time.


One thing that's annoying to me is that governments and employers increasingly believe many of these things, partly because they want to cross-reference names and match canonical forms.

My given names in English are Mark Jason, and that's on my birth certificate. In Greek, they're Μάρκος Ιάσονας, which are the equivalents, and that's on my municipal birth records there (registered as a foreign birth at the time of baptism). There seems to be a move towards wanting to use "accurate" transliterations, though, rather than the more traditional method of translating names to equivalents (Mark<->Markos, George<->Georgios, Paul<->Pavlos, etc.). Sometimes people desire that: maybe someone named Михаил in Russian really doesn't want to be turned into Michael, but wants to go by Mikhail. That's fine, if they prefer. But in my case, I consider each of these translated forms to be my name in the respective languages, and do not consider the transliterated forms to be my name.

But in trying to sort out some paperwork, it appears that what I am supposed to do is one of these two things: 1) change my name in English from Mark Jason to Markos Iasonas, the transliteration of my Greek name; or 2) change my name in Greek from Μάρκος Ιάσονας to Μαρκ Τζέισον, the transliteration of my English name. But I don't want to do either of those things. #2 in particular is ridiculous, because it doesn't decline properly, and is trying to approximate a 'j' sound with 'tz'.

Growing up, my parents called me by my middle name, as I share a first name with my dad. (I'd rather be an Edward than a Ralph anyway.) When giving my name to someone, I tell them I'm Edward <Lastname>, as telling them I'm R. Edward <Lastname> just sounds pretentious. But if I'm beginning a relationship with a doctor's office or lawyer, or filling in a tax form, it's Ralph E. Lastname, because that's what's on my birth certificate and SSA record. It is quite annoying when the phone rings, and I don't recognize the calling number, so I answer with a guarded, "This is Ed..." and hear the caller ask, "May I speak to Ralph?" and have to explain to them that I really am Ralph, even though I said I was Ed. But I have to say, my problems are nothing compared to yours!

CSB: My mom signed me up for a book club when I was 6 or 7. For the Firstname field, she wrote, "R Edward" for reasons known only to her. For the next three years, every couple months, I'd get a package addressed to Redward <Lastname>. I could just imagine the shipping clerk in that company reading my shipping label and saying to himself, "Redward... what a goofy name."

My name is Kim <Lastname> and I'm a male. Try convincing Americans (and other English speaking countries) about that...

One example: Many years ago I subscribed to TIME and filled out a form where I checked "Mr." Apparently the person who typed in my name decided to "correct" this error and I became a "Mrs."... and I wasn't even married :-)

The company I work at has offices in different cities, so most of the communication is done by email and instant messaging. I see a clear difference between the messages from people who know my gender and those who probably think I'm female. Even attempts at flirting...

I went to school with a male Kim in Aus. Never even realised it could be a girl's name until the 80s, when there were several female singers called Kim. Lots of male names seem to become girls' names. Ashley is another that seems to have been lost in living memory. Apparently Shirley was once a male name, and I am not joking. Between that and boys once wearing dresses until breeching, along with pink clothes and long hair, time travel must be really confusing.
All the Ashleys I went to school with in the 70s were boys. I often wonder if it's an issue for them these days...
Males named Kim go relatively unremarked in Australia at least, perhaps because there have been three well-known male national political figures named Kim in the last 40 years.

http://en.wikipedia.org/wiki/Kim_Edward_Beazley

http://en.wikipedia.org/wiki/Kim_Beazley

http://en.wikipedia.org/wiki/Kim_Carr

Kim, like Hilary and Evelyn, is a traditional boy's name that has somehow become exclusively female over the last half-century or so.
In Scotland there's a fair number of male Kellys and Lesleys, which also appear to have slowly evolved into female names in the rest of the English-speaking world (as far as I can see).
as telling them I'm R. Edward <Lastname> just sounds pretentious

Not to mention it casts serious doubt as to your organic nature.

Many years ago Safeway, a commonly seen grocery store in the US, started using a Safeway Card to gather data on its customers. Shoppers get lower prices if they use it, so I filled out my paper form. Some non-English speaking data entry clerk in Mexico somehow mistook my middle initial for an "Os" and prepended it to my last name, creating quite a tongue twister.

Safeway checkout clerks are apparently required to thank me by name, using the name that pops up on their screens when I swipe my card. For nearly twenty years now, all over the country, every harried Safeway checker has sent me on my way with, "Thank you Mr., uh, Asperger", or "Thanks, um, Mr. Ostrich", or whatever that bizarre cluster of letters randomly turned into on the way out of their mouths.

At first, I thought I should fix it, but I quickly grew to enjoy the show. I also enjoy the thought of them trying to cross-match Mr. Asperger with other consumer databases.

Some of us get to enjoy name-mangling theatre on all social occasions.
Try just answering the phone with "hello?" for unknown numbers.
At home, I only pick up for known numbers, and let the answering machine screen the rest. At work, they kind of expect you to identify yourself when you pick up the phone. But yes, what you suggest is a workable alternative, where allowed.
This is my situation as well. I can't count the number of times I've gotten the "Redward" equivalent.

To compound the problems caused by this, I switched to using my first name as my primary name around the same time as I switched coasts. People on the east coast know me by my middle name, people from the west coast know me by my first name. This can come in handy sometimes as it gives me a very quick indication of where I know somebody from, but gave me a good deal of trouble recently at a west-coast wedding with lots of east-coast people attending... The fact that some of my friends were introducing me to other people with just my last name made it quite... interesting.

It's an annoying thing to have to deal with on a regular basis. I'm in the same boat, in that my father and I both have the same first and last names, so I've always gone by my middle name.

Except now when I go to networking events or interviews, I tell people my middle name L* and then they point out that my name tag or application says my first name is M* and I have to go through the whole song and dance of explaining the situation.

I'm frustrated enough to be looking into getting it legally changed. You might want to consider that as well if only to not have to deal with those phone calls any more.

Same here. I get around this by never mentioning my first name unless I'm in a situation where it's legally required, like at a doctor's office or the DMV. That breaks down into a small number of cases:

1. If they're working for me, like at the doctor's office, I ask them to please call me by my middle name. They're generally respectful about it and are used to dealing with nicknames and other aliases anyway.

2. In the DMV and other situations, I just grit my teeth and answer by my first name. It's not worth the hassle of explaining and they don't care anyway.

3. If I'm being hired, I fill out my paperwork "officially" and give it to HR, with the explanation that I go by my middle name for all legal purposes.

4. Banks are kind of weird because they perform official government functions, but they're still ultimately working for me. I've only had one bank flat-out refuse to put my middle name on my debit card and checks, and I explained to the branch manager why I was walking out the door before we'd finished opening my account.

I've thought about it, but as soon as I do, I think about all the paperwork inevitably involved in making sure my medical history follows me, my pensions and other financial records get updated, and all that other nonsense, and, having nearly half a century of paperwork that ought to be updated, and being an essentially lazy old cuss, I decide that I can live with the annoyance.
Not sure why, but your story reminded me of a friend of mine who's always gone by the name "Mick" (a common shortening for the name Michael in Australia). The thing is, neither his first nor his middle name is Michael. It was just a nickname that stuck when he was a kid.

I kid you not, at his wedding when the celebrant said, "Do you Susan, take this man Brian..." his bride exclaimed, "Who's Brian?"

Yea, my daughter's name is a bit unusual, and so the transliteration produces a different name than in English. Her name is Thamina. It's an Arabic-origin name, so in Arabic it's ثمينة. Not that she's Arabic at all, she's part Russian, and was born in Russia, so her name is Тхамина in Russian. They transliterate her name in a standard way to Tkhamina on her Russian passport, but on her USA passport it's Thamina.
I feel your pain, although we Romanians use the (standard?) Latin alphabet. I had to spell 'Andrei Simionescu' over the phone so many times that I'm sure I could win a couple of spelling bees easily.

Speaking of which, why don't all companies just move to automated support systems already? These guys are doing it right http://www.zocdoc.com/

Human names are an excellent illustration of the reason that you shouldn't use 'real' data as a key in a database. If the key is arbitrary and meaningless then it doesn't need to be mutable.
Change it to Mapcock.
My name fails to register surprisingly often, even here in Brazil.

It is Hélder Maurício Gomes Ferreira Filho

Common reasons for failure are being too long and having non ASCII characters, but sometimes it fails for other reasons. For example: the form does not allow me to register without a middle name (I don't actually have one); I get confused about how to register Filho (it is not a family name, a surname, or a last name, but it is still part of my name. It means Son, my father has the same name as me, without the Filho part); or it breaks when it is cross-checked against some other record (for several reasons I ended up registering my name in several different ways, usually omitting Hélder, which I did not even know was in my name until I got to school and was forced to use it because of stupid rules that assume your first name is the name you go by).

> It means Son, my father has the same name as me, without the Filho part

Interesting, that's similar to Junior in the US, but there it generally isn't part of the "official" name, only informal.

Jr. is absolutely part of my official name in the USA. It is the only distinction between my and my father's name. Many forms have a specific spot for suffix.
Are you implying that two people (you and your father) can't have identical names in the US?
No, I'm stating that Jr. is an official part of Norman John Harman Jr., my name. It is on my birth certificate, filled out on my tax return, etc. In any case where I'm required to use my real name, if I used Norman John Harman Sr. I would be committing fraud. Likewise fraud if I left off the Jr. in an attempt to confuse with or impersonate my father.
"Norman John Harman Sr."

By your reasoning, there is no such person. If there can be a "Norman John Harman Sr." then there can also be a "Norman John Harman Jr." who does not have it listed on official documents.

When your father dies, and you've named your son Norman John Harman as well, don't you become senior? It can't be an immutable part of your name if your junior/senior status changes.
Of course not! That would break the Computer.
One memorable day at work a POS Kodak system decided that it wouldn't store a particular record. Nothing worked. This happened a few times until it became clear that the only thing these cases had in common was that the people's first names started with BRE. I can remember a rather sad-looking IT manager nodding in agreement, having tried everything, when I suggested we just get them to change their names. Bug is still there. It's just a historic archive now, thank god. Worst software ever.
> "having non ASCII characters"

Accented vowels are ASCII characters, but in the extended set, which people sometimes don't take into account.

Generally not. On the Internet, ASCII generally means ANSI_X3.4-1968, a 7-bit standard with 128 code points. (Run "man ascii" on a Unix system to see this.) There aren't any accented characters.

By contrast, there were national variants of ISO/IEC 646 (also a 7-bit character set, and essentially the internationalized version of ASCII) that included accented characters within those 128 code points. Generally these swapped out things like the at-sign (@) and the curly braces and vertical pipe character for accented vowels instead.

There were also lots of 8-bit character sets in ISO/IEC 8859 (e.g. Latin-1, or ISO/IEC 8859 part 1) that included accented characters within the "extended" set of code points 128-255.

There are a number of different "extended sets" (IBM code pages and ISO/IEC 8859 parts, for instance), and they're "extended" because they're not ASCII but supersets of it (as is UTF-8).

ASCII is the 7-bit encoding ANSI_X3.4-1968, composed of 95 printable and 33 control characters.
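
For anyone who wants to poke at this, here is a minimal Python sketch of the distinction. The sample name is borrowed from upthread and written with escapes so the source file's own encoding doesn't matter; the variable names are made up for illustration.

```python
# ASCII proper is 7-bit: code points 0..127 only.
name = "H\u00e9lder Maur\u00edcio"  # "Hélder Maurício"

# str.isascii() is True only if every character is below U+0080.
is_ascii = name.isascii()

# The accented letters have no ASCII encoding at all.
try:
    name.encode("ascii")
    encodable = True
except UnicodeEncodeError:
    encodable = False

# Three ASCII supersets encode the same text as different bytes.
latin1 = name.encode("latin-1")  # ISO/IEC 8859-1: one byte per accented letter
cp437 = name.encode("cp437")     # IBM code page 437: one byte, different value
utf8 = name.encode("utf-8")      # UTF-8: two bytes per accented letter

print(is_ascii, encodable)
print(latin1.hex(" "))
print(cp437.hex(" "))
print(utf8.hex(" "))
```

The plain letters come out as the same bytes in all three encodings, which is what "superset of ASCII" means here; only the accented ones differ.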

Or they account wrongly ;) (ie: from the wrong set)

I love how sometimes, even within the same company, each place accounts for ASCII differently.

I remember registering for an IM service: on one info screen my name was Maur&cio, on the site info screen it was Maur€cio, on the search screen it was Maur£cio, and so on...

ASCII is Obviously meant to use CP437 for character codes 128-254, duh...

Seriously though, most other code pages are pretty transparent to/from unicode... IBM PC-DOS extended ascii (classic ANSI-BBS) isn't so transparent.

Extended ASCII is not ASCII. There is ASCII, which has no accented characters, and there are other character encodings based on ASCII, which often do. Those other character encodings are not ASCII.
I found your comments on "Filho" interesting. In English, the equivalent is "Junior" with the father sometimes using "Senior." It is not part of your name per se, and thus would not be typed into the name field of a form. However, in places where naming sons after their fathers is common, there is often an additional drop-down box listing (Jr, Sr, I, II, III, and so on).
Oh yeah, these drop-down boxes piss me off, especially because here in Brazil there is BOTH Junior and Filho... And I am Filho, not Junior.

(Also, we have "Neto", which means Grandson. It is quite popular, I know a bunch of guys like that, and I don't think drop-down boxes in other countries will expect that.)

Prefix and Suffix should be free form text!
That's why the X.400 and X.500 standards had the concept of generational qualifiers.
It gets worse. A friend of mine has a hyphenated name, "Kerry-Jean". The hyphen alone often breaks things.
Hyphenated last names seem to break many customer service people - they'll do things like insist one is the "real" name, assume you are married and ask when, etc. I think it's that hyphen is such a simple thing, it screams "understand me!" instead of simply being entered verbatim.
Assuming hyphen = married = wife's father's name-husband's father's name is just so ignorant. I've known many people who have always had hyphenated names, a few of whom I went to primary school with. I have a good friend with an always-hyphenated last name who gets asked personal questions by strangers and near strangers about her name. How about "It isn't any of your business, I can have as many names as I want"?
My best friend in elementary school had the misfortune of a hyphenated last name that was exactly sixteen characters long. The systems at the school evidently limited you to fifteen characters, because I saw her name printed in a lot of places without the final character.
In this case, "Kerry-Jean" is her first name. (Just to be clear.)
As a Portuguese speaker, I can say I would never write Hélder correctly if I simply heard it.
For a long time I did not even know how to pronounce Hélder.

The thing is Dutch...

I am Hélder because of my father.

He is Hélder, because of the priest "Dom Hélder Câmara" (my Grandma was very Catholic)

Dom Hélder Câmara was named after the city of Den Helder in the Netherlands: http://en.wikipedia.org/wiki/Den_Helder

The result is kinda wonky (lots of people write it wrong, usually "Elder", which of course resulted in video-game-savvy friends nicknaming me "Mr. Scrolls").

In our app we neither validate nor escape user strings for any free form text (eg. "names" and descriptions)[1]. We only validate the max length.

If text is truly free form then you don't need to validate or white list anything. Just make sure it's valid UTF-8 (or whatever encoding you're using) and escape it when you display it. That combined with using prepared statements with bind variables (aka named parameters) and you don't have any issues with user inputs.
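
A minimal sketch of that approach in Python, using the standard-library sqlite3 and html modules. The table name, column name, helper names, and the 200-character limit are all assumptions made up for this example, not details of the commenter's app.

```python
import html
import sqlite3

# Hypothetical schema, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE things (id INTEGER PRIMARY KEY, name TEXT)")

MAX_NAME_LENGTH = 200  # assumed limit; length is the only thing validated


def save_name(raw):
    """Store the text exactly as entered, passing it as a bound parameter."""
    if len(raw) > MAX_NAME_LENGTH:
        raise ValueError("name too long")
    cur = conn.execute("INSERT INTO things (name) VALUES (?)", (raw,))
    return cur.lastrowid


def render_name(row_id):
    """Escape at display time, here for an HTML text context."""
    (name,) = conn.execute(
        "SELECT name FROM things WHERE id = ?", (row_id,)
    ).fetchone()
    return "<td>" + html.escape(name) + "</td>"


samples = [
    "O'Reilly",
    "<script>alert('Haxors!');</script>",
    "'; DROP TABLE things --",
]
ids = [save_name(s) for s in samples]
rendered = [render_name(i) for i in ids]
for line in rendered:
    print(line)
```

The database holds the strings untouched, including the apostrophes and the script tag. The only place anything gets rewritten is the output step, and a different output (JSON, CSV, a shell command) would need its own escaping there.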

One other benefit of this approach is that you end up with proper i18n support without doing anything special. From your app's perspective all text is the same. If users want to use Unicode characters or put HTML tags in their descriptions then let them. If you escape it then there's no XSS issue. Plus it's WYSIWYG[2] from a user's perspective.

Who am I to judge that a user putting "<script>alert('Haxors!');</script>" as the name of an object is a bad idea?

[1]: "Names" don't include usernames, which generally should have a whitelisted character set (ex: ASCII [a-z][a-z0-9+]), or email addresses (use a real validator ... not a regex!).

[2]: https://en.wikipedia.org/wiki/Wysiwyg

"Just make sure it's valid UTF-8 (or whatever encoding you're using) and escape it when you display it."

I've lately been coming around to the belief that anyone who uses the term "sanitize" in this domain, as in, "sanitize user input" really doesn't know what they are talking about (at least on average). The approach you describe is the generally correct approach; you need to ensure that the proper levels of escaping are being applied. Unfortunately this is nontrivial in practice, but it's still the correct solution.

The "sanitization" meme has resulted in me smacking down at least 3 commits from developers in my organization trying to "solve" XSS by scrubbing out all less than characters across all input from the user, or eliminating all quotes, apostrophes, less than, greater than, backticks (for shell interpolation problems), etc etc. Unfortunately, the problem is, these are in general all perfectly valid input values, and some of them really smack you in the face immediately. (For instance, names may contain apostrophes. You can't "sanitize" them away; you need to write your SQL layer to handle that correctly, such as with binding.) You handle them by managing your encoding layers correctly, not by "sanitizing" them.

(There's still some sanitization components in the resulting solution, I just don't think they are the way you should think about it. For instance, there are some characters that are flat-out forbidden in, say, an HTML attribute, and the right thing to do is just strip them out of any incoming string. But that should be thought of as a "sanitization" step being an important element of proper encoding, but not the actual "answer".)

It's a shame there's such a proximity in terminology between 'sanitize' and 'sanity check'. I wonder if that's where this whole confusion began in the first place. Yes, it is extremely unlikely that a user's given name contains a <script> tag, but there are few reasons why your software should really care about it on a technical level - least of all if the way you choose to care about it leads to it also complaining when someone claims their name is O'Reilly. The correct response to someone claiming their name is "'; DROP TABLE Users --" should, ideally, be to say "Are you really sure about that?" but defer to the human decision on whether it's really the right thing to do.
Relevant XKCD - http://xkcd.com/327/
> I've lately been coming around to the belief that anyone who uses the term "sanitize" in this domain, as in, "sanitize user input" really doesn't know what they are talking about (at least on average).

I've had this view for a long while. I think there's a common sense to it that either clicks or it doesn't. Plus people hear/read "escape your inputs!" so often it becomes a cargo cult.

> You can't "sanitize" them away; you need to write your SQL layer to handle that correctly, such as with binding.) You handle them by managing your encoding layers correctly, not by "sanitizing" them.

Exactly. Whitelisting the values that can be stored in a field should be done to maintain the data integrity of the field. It's not an approach to solve security problems or prevent SQL injection.

> For instance, there are some characters that are flat-out forbidden in, say, an HTML attribute, and the right thing to do is just strip them out of any incoming string. But that should be thought of as a "sanitization" step being an important element of proper encoding, but not the actual "answer".)

We ran into something like this in our app as well. When displaying meta data for an object we create related objects in the dom and reference them by id. Originally the ids were generated by simply escaping the name of the raw object but that doesn't work because as you mention there are additional restrictions on what can be used in an "id" field. The solution? Hash it! Obviously that's a very specific solution as we only cared about it being unique and tied to the other object on the same page but it worked.
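
Roughly what that looks like, as a Python sketch. The comment doesn't say which hash was used or how the ids were formatted, so SHA-256, the "obj-" prefix, and the 16-character truncation are all assumptions for illustration.

```python
import hashlib


def dom_id(name):
    """Derive an id from arbitrary text: stable, and limited to [a-z0-9-]."""
    digest = hashlib.sha256(name.encode("utf-8")).hexdigest()
    # Hypothetical format: a letter prefix so the id never starts with a
    # digit, plus the first 64 bits of the digest.
    return "obj-" + digest[:16]


print(dom_id("Kerry-Jean <script>"))
print(dom_id("H\u00e9lder"))
```

The same name always maps to the same id, so related elements on the page can find each other, and no character from the original name ever reaches the id attribute. Truncating the digest trades a tiny collision risk for shorter ids; keep the whole digest if that matters.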

If you're going to accept all characters by default, be prepared to sanitize the outputs for every use, not just your website.

Maybe you will output a data dump for someone else to print mailouts. Or you'll share the user database with a vendor's web forum. Or payment processing. Or any SaaS.

No. That completely doesn't work. This is really important: You CAN'T "sanitize" for every possible use. You can not correctly figure out in advance how to represent an input, because the different possibilities are numerous and actively self-contradictory.

To "sanitize" for "every possible use" is pretty much to remove everything that isn't an ASCII letter. Even unexpected spaces can cause crazy behavior. Commas can cause CSV-injections. And you might still have length problems even so. Oh, and you still can't guarantee something won't screw up even so! https://news.ycombinator.com/item?id=6140631

You can not, at the time input comes in to a system, even pretend to know where all the data might end up, someday, given the whims of who knows whom, and who knows when. The only thing that works is for each system to correctly encode its output as needed, and if you output the correct thing and a subsequent system blows it up, it's the subsequent system's fault. You can't prevent it. You only think you can, but you're wrong.

To be clear, if you could defend against those systems messing up, I'd be willing to consider it. But you can't. It's impossible, both in theory and in practice.

There's no easy answer to writing secure code. (Though it would help a lot if people used type systems to better effect in this problem.) Filtering out certain "dirty" characters isn't an easy answer either, on the grounds that it isn't even an answer. (It turns out to often become not easy, too, because as you gradually and inevitably learn exactly how it isn't working for you, the subsequent frantically flailing addition of heuristics becomes very not easy itself. It is easier in the long run to do it correctly.)

Perhaps I was unclear, but I did not claim that there could be one single sanitized version of the data, safe for all use cases. I was saying that you have to do different sanitization for every output.
That's not called 'sanitizing', it's called 'escaping' and 'encoding'.

The byte sequence I need to store to communicate the name "Kei$ha O'Shaughnessey, Jr." in a UTF-8 JSON string literal, a UTF-8 HTML attribute, a UTF-16 bigendian CSV file, or an ISO-8859 SQL parameter, are going to be different - but so long as all the characters I need to pass are representable in all of those domains all I have to do is perform the correct escaping and encoding. At no point do I need to 'sanitize' the name. It's a name, it's not dirty.

If there are characters there that I can't represent in the target domain, then I need to handle the loss of information.
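
To make that concrete, a small Python sketch that sends the same name to four destinations using only the standard library. One swap to be upfront about: it uses sqlite3 (which stores text as UTF-8) for the SQL parameter rather than an ISO-8859 database, and the `<input>` markup is just a made-up attribute context.

```python
import csv
import html
import io
import json
import sqlite3

name = "Kei$ha O'Shaughnessey, Jr."

# JSON string literal: json.dumps adds the quotes and any escapes needed.
as_json = json.dumps(name)

# HTML attribute value: quote=True also escapes " and '.
as_html_attr = '<input value="' + html.escape(name, quote=True) + '">'

# CSV field: the csv module quotes the field because it contains a comma.
buf = io.StringIO()
csv.writer(buf).writerow([name])
as_csv_text = buf.getvalue()
as_csv_utf16be = as_csv_text.encode("utf-16-be")  # encoding is a separate step

# SQL parameter: bound, so no escaping is written by hand at all.
conn = sqlite3.connect(":memory:")
(round_tripped,) = conn.execute("SELECT ?", (name,)).fetchone()

print(as_json)
print(as_html_attr)
print(repr(as_csv_text))
print(round_tripped)
```

Four different outputs, and the name itself is never modified. Each one decodes back to the original string.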

Somewhat related example anecdote: For several years, Vimeo was sending me newsletter emails addressed to "Dear Jarek_Piórkowski" (previously "Hi Jarek Pi??rkowski"). The ó that should be there shows up fine on the Vimeo website and I even cleared and re-input the name into my profile to give them a chance to re-encode it. Still continued.

I unsubscribed from the newsletter eventually.

And ó isn't even a difficult character, it's in ISO 8859-1 for crying out loud.

Perfect example. That indicates that at some point, your data passed through a system using Windows-1252 encoding.

http://www.i18nqa.com/debug/utf8-debug.html
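
The mechanism is easy to reproduce in Python. This is only a sketch of how this kind of damage arises in general; it is a guess at the shape of the problem, not a diagnosis of what Vimeo's pipeline actually did.

```python
surname = "Pi\u00f3rkowski"  # "Piórkowski", written with an escape

utf8_bytes = surname.encode("utf-8")  # the ó becomes two bytes: C3 B3

# Reading those UTF-8 bytes as Windows-1252 yields two characters for one.
misread = utf8_bytes.decode("cp1252")

# If the damage is exactly one wrong decode, reversing it recovers the text.
repaired = misread.encode("cp1252").decode("utf-8")

# An ASCII-only step that replaces each unknown byte produces two marks
# per accented letter, the same shape as "Pi??rkowski".
lossy = utf8_bytes.decode("ascii", errors="replace")

print(misread)
print(repaired)
print(lossy)
```

The first kind of damage is reversible as long as nothing else touched the bytes. The second is not: once a byte has been replaced by a placeholder, the original is gone.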

I expect Vimeo used a Linux system to collect your data, and I bet the thing that blasts emails out is ultimately Linux as well. So the Windows-1252 bungle probably happened in a third system in between, maybe a Windows system chosen for its ease of administration by the community managers.

Not that this is relevant to data sanitization (they're just being fuckups here) but it shows how complex this can get.

Just to be a bit pedantic, unfortunately you don't get "proper i18n support" just by putting everything in UTF-8.

Unicode lets you represent lots of abstract characters, from different languages and societies, in one character set. That doesn't quite tell you how to render the characters. For that, you need to know what language the text is in. Unicode wants you to provide that information out-of-band, e.g. in an HTML "lang" attribute, which the renderer can use to paint the proper glyphs.

For example, the Arabic digits 4 through 7 (۴ U+06F4 .. ۷ U+06F7) have different glyphs in Persian, Sindhi, and Urdu. And a character like 直 (U+76F4) has Chinese and Japanese glyphs that may not be mutually recognizable.

Bottom line: if you want an internationalized system that can store and render multilingual text, storing the text in Unicode is a good start, but you will need to store additional info (like the language) to be able to properly render the text.

I found http://en.wikipedia.org/wiki/Eastern_Arabic_numerals which shows examples of the differences in those numerals, but it looks like the different representations have different Unicode code points. So, there's no need for the lang attribute. (The page uses them, but if you take them off there's no difference in the display.)

You probably need to know the language to do things like sorting, comparison, regex, etc. But if you're just storing and displaying user-entered strings and your software has no need to understand the meaning of the strings, I think it's enough to do what the parent says.

Not quite. The Wikipedia article shows the difference between U+0660 .. U+0669 (Arabic-Indic digits) on the top row and U+06F0 .. U+06F9 (Eastern Arabic-Indic digits) on the bottom row.

But what I'm talking about are the different glyphs used to represent the bottom row (U+06F0 .. U+06F9) depending on whether the text is in Persian, Sindhi, or Urdu. See http://www.unicode.org/versions/Unicode6.2.0/ch08.pdf, table 8-2.

There is also the issue I mentioned about Chinese vs. Japanese glyphs for the same coded character, which is at least as important in practice.
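
For the digits, Python's unicodedata module shows the two rows, and why the code point alone can't carry the language. The record layout and the `to_html` helper are hypothetical, just one way of keeping a language tag next to the text.

```python
import html
import unicodedata

arabic_indic_four = "\u0664"  # from the U+0660..U+0669 row
extended_four = "\u06f4"      # from the U+06F0..U+06F9 row

names = [unicodedata.name(ch) for ch in (arabic_indic_four, extended_four)]

# Both rows carry the same numeric value; only the code point differs.
values = [unicodedata.digit(ch) for ch in (arabic_indic_four, extended_four)]

# Persian and Urdu text use the *same* code point for this digit, so the
# stored string is identical and the language has to travel separately.
records = [("fa", extended_four), ("ur", extended_four)]


def to_html(lang, text):
    """Hand the stored language tag to the renderer via the lang attribute."""
    return '<span lang="{}">{}</span>'.format(
        html.escape(lang, quote=True), html.escape(text)
    )


spans = [to_html(lang, text) for lang, text in records]
print(names, values)
print(spans)
```

Whether the two spans actually render differently depends on the fonts and the renderer, which is exactly the out-of-band part.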

This is an issue with CJK characters and probably just one more reason why UTF-8 adoption has been slow where JIS is good enough.
Regarding [1], in the favour of regexps: http://en.wikipedia.org/wiki/Regular_language

If you can't use a regexp to recognize the general case of email addresses, no finite automaton can..

Yes, but there is a point at which it's better to just hand-write some code which is equivalent to the automaton, rather than trying to use a regexp.

This is what a proper email-validation regexp looks like: http://www.ex-parrot.com/pdw/Mail-RFC822-Address.html

> If you escape it then there's no XSS issue.

Not XSS, but you need to be careful about allowing through things like the LTR/RTL override characters.
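
A sketch of what checking for those could look like in Python. The set below lists the explicit embedding, override, and isolate controls (U+202A..U+202E and U+2066..U+2069); the helper names are made up, and whether you strip, reject, or merely flag such input is a policy choice this example doesn't settle.

```python
# Explicit bidirectional formatting characters:
# LRE, RLE, PDF, LRO, RLO, then LRI, RLI, FSI, PDI.
BIDI_CONTROLS = {
    "\u202a", "\u202b", "\u202c", "\u202d", "\u202e",
    "\u2066", "\u2067", "\u2068", "\u2069",
}


def find_bidi_controls(text):
    """Return (index, code point label) for each bidi control found."""
    return [
        (i, "U+{:04X}".format(ord(ch)))
        for i, ch in enumerate(text)
        if ch in BIDI_CONTROLS
    ]


def strip_bidi_controls(text):
    """Drop the controls and keep every other character unchanged."""
    return "".join(ch for ch in text if ch not in BIDI_CONTROLS)


sample = "invoice\u202excod.exe"  # contains a right-to-left override
print(find_bidi_controls(sample))
print(strip_bidi_controls(sample))
```

Ordinary right-to-left text such as Arabic or Hebrew names passes through untouched, since it relies on the letters' own directionality rather than on these controls.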

This list is useless, because trying to follow it is impossibly ambitious. Which of these do I need to support for my system to work for X% of users with X+Y % being able to work around the limitations?
Logical fallacy (bifurcation): either you correctly implement all of the requirements, or it makes no sense trying at all. Note that the article even explicitly says "try to make _fewer_ of these assumptions," not "you MUST explicitly support all this."

Similar example: Do you lock your door, or does that make no sense to you? (Because if there's no absolute, perfect, 100% protection, there's apparently no difference at all between locked, closed and wide open; right?)

Logical fallacy (non sequitur), as the comment you're responding to said nothing of the kind, and argued specifically for a middle point in the second of only two sentences.
You don't need to explicitly support them if you just treat "name" as a freeform, unicode text field.
Well, you can still get bitten by "11. People’s names are all mapped in Unicode code points," as well as the sets 1-8 and 32-36 (people have exactly X names at a given point in time, where X>0); that's not to mention ordering and collation (12,13,18,30). But it's definitely the easiest option, and avoids many common pitfalls (if I had a nickel for every database using latin1 + latin1_swedish_ci because that's the first charset + collation in the list, I'd have a lot of nickels).
I can see 11, but as long as you're not using the name as a unique key but just as a label then the mutability, non singularity, and non-orderedness aren't such problems.
That makes sense - I was under the impression that you need to keep the name's history etc; even so it wouldn't be much of a problem.