by dfranke 554 days ago
Allowing purely numeric usernames seems like a terrible idea to me, because it creates ambiguity between what's a username and what's a UID. It's common for tools like ls or ps to display a username when one is found and fall back to displaying a UID if it isn't, and similarly tools like chown will accept either a UID or a username and disambiguate based on whether it's numeric or not. Now suppose there's a numeric username that doesn't match its own UID, but does match some other user's UID. It doesn't take a lot of imagination to see how this would lead to vulnerabilities.
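
To make the ambiguity concrete, here is a minimal sketch. It assumes a toy in-memory user database (the names and UIDs are made up) and two hypothetical helpers, `resolve_owner` and `display_owner`, that imitate the name-first lookup of chown-style tools and the name-or-UID display of ls-style tools. It is an illustration, not how any particular tool is implemented.

```python
# Hypothetical user database: username -> UID.
# Note the numeric username "1000", whose own UID is 1001.
USERS = {"root": 0, "alice": 1000, "1000": 1001}


def resolve_owner(spec: str) -> int:
    """Resolve an owner argument: try it as a username first,
    then fall back to treating it as a numeric UID."""
    if spec in USERS:
        return USERS[spec]
    if spec.isascii() and spec.isdigit():
        return int(spec)
    raise KeyError(f"unknown user: {spec}")


def display_owner(uid: int) -> str:
    """Show the username when one exists, otherwise the raw UID."""
    for name, user_uid in USERS.items():
        if user_uid == uid:
            return name
    return str(uid)


# Someone who means "UID 1000" (alice) gets the user *named* 1000 instead.
print(resolve_owner("1000"))   # 1001
# A file owned by UID 1001 is displayed as "1000", which reads like alice's UID.
print(display_owner(1001))     # 1000
```

With this database, anyone typing `1000` to mean alice's UID silently targets the other account, and a listing that shows `1000` cannot be told apart from an unresolved UID.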
6 comments

Talk to POSIX:

> A string that is used to identify a user; see also User Database. To be portable across systems conforming to POSIX.1-2017, the value is composed of characters from the portable filename character set. The <hyphen-minus> character should not be used as the first character of a portable user name.

* https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1...

The "portable filename character set" is defined as:

    A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
    a b c d e f g h i j k l m n o p q r s t u v w x y z
    0 1 2 3 4 5 6 7 8 9 . _ -
* https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1...

So only a hyphen as the first character is forbidden.
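
A rough sketch of that rule as a check, in Python. The function name `is_portable_username` is made up for illustration, and rejecting the empty string is my own assumption rather than something the quoted text states.

```python
import string

# The portable filename character set quoted above:
# A-Z, a-z, 0-9, period, underscore, hyphen.
PORTABLE_CHARS = frozenset(string.ascii_letters + string.digits + "._-")


def is_portable_username(name: str) -> bool:
    """True if `name` uses only portable characters and does not
    start with a hyphen. Empty names are rejected (an assumption)."""
    if not name or name.startswith("-"):
        return False
    return all(ch in PORTABLE_CHARS for ch in name)


print(is_portable_username("1234"))    # True: purely numeric is allowed
print(is_portable_username("-rf"))     # False: leading hyphen
print(is_portable_username("résumé"))  # False: outside the portable set
```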

Given that you can't necessarily control where usernames come from (e.g., LDAP lookups), properly speaking your system has to handle everything anyway, even if you don't allow local creation.

Yes, I'm aware, and POSIX has many such bugs that make command input or output unavoidably ambiguous if certain unexpected characters are present that they didn't think to prohibit. A lot of the revisions that went into POSIX 2024 were aimed at fixing some of these, such as standardizing find -print0 and xargs -0. The fact that this one got overlooked doesn't mean it's a good idea to make the situation worse and harder for future POSIX revisions to address.
It is time for POSIX to get with the times. Computers are used in more than the US and Canada (for the most generous interpretation of American in ASCII I'm including Canada, their French speakers will not be happy with that, not to mention First Nations of which I know nothing but imagine their written language needs more than ASCII). UTF8 has been standard for decades now, just state that as of POSIX 2025 all of UTF8 is allowed in all string contexts unless there is a specific list of exception characters for that context (that is they never do a list of allowed characters). They probably need to standardize on utf8 normalization functions and when they must be used in string comparisons. Probably also need some requirement that an alternate utf8 character entry scheme exist on all keyboards.

The above is a lot of work and will probably take more than a year to put into the standard, much less implement, but anything less is just user hostile. Sometimes committees need to lead from the front, not just write down existing practice.

Some practical concerns I have with UTF-8 are similar (or even the same, depending on font) characters which can be used in malicious ways (think package names, URLs, etc), not to even mention RTL text and other control characters. Every time I add logging code, I make sure that any "interesting" characters are unambiguously escaped or otherwise signaled out-of-band. Having English as an international writing standard is perfectly fine and I say that as a non-native speaker with a non-ascii name.
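
One possible way to do that kind of escaping, as a sketch only: the helper name `escape_for_log` is made up, and using Python's built-in `unicode_escape` codec is just one choice among several.

```python
def escape_for_log(value: str) -> str:
    """Return an ASCII-only rendering of `value` for log output.

    Printable ASCII passes through unchanged; non-ASCII characters,
    control characters and backslashes come out as backslash escapes.
    """
    return value.encode("unicode_escape").decode("ascii")


# Latin "paypal" versus a look-alike whose second letter is Cyrillic U+0430.
print(escape_for_log("paypal"))         # paypal
print(escape_for_log("p\u0430ypal"))    # p\u0430ypal
# A right-to-left override character becomes visible instead of reordering text.
print(escape_for_log("a\u202eb"))       # a\u202eb
```

The two look-alike names render identically on many screens but produce different log lines, which is the point.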
A good chunk of the world does not speak english or latin character based languages. They should be able to interact with computers completely in their own languages and alphabet sets, even if those are written right-to-left or top-to-bottom.

Of course, someone has to do the work to make this possible. And no one is obliged to do it. But suggesting that such work should not be done at all does not sit right.

This isn't quite black and white.

Right now, I can set up and use Linux in my language, have my display name in my script, but my username and password are ASCII-only and are available on the standard English keyboard anywhere. If I run into trouble, I can SSH in from any device in the world without any issue. I can just borrow a laptop from anyone, switch to English if needed, and jump right in.

Having a common denominator set of characters for such things is just really, really useful. I’d rather focus on all the other things that need to be localised.

"Without any issue" is a stretch: using a French keyboard is a bad enough experience for passwords, and not everyone uses standard English keyboards.
> A good chunk of the world does not speak english or latin character based languages.

nearly everyone in a first world country knows the English alphabet though. a vast majority of the developing world as well. just look at street view on Google maps in any country, there's going to be a ton of street signs using English characters, even in non-touristy areas.

> They should be able to interact with computers completely in their own languages and alphabet sets, even if those are written right-to-left or top-to-bottom.

if you're a typical android/ios end user you're interacting with a computer in your native language anyway. this discussion only applies to low level power users.

in that case: why? these aren't user-facing features. this is like saying that people should be able to use symbols native to their language rather than greek letters when writing math papers.

it might not be "fair" that English is overrepresented in computing but it also hasn't demonstrably been a barrier to entry. Japan, Korea and China have dominated, particularly in hardware.

if you think it should be fixed why stop at usernames? why represent uids with 1234 instead of 一二三四?

> nearly everyone in a first world country knows the English alphabet though

And not only 1st world. Actually, the bigger the country, the more everything is localized, from dubbed films to food packaging labels. In a small country one would see more English/Spanish/French etc. because they don't have the resources to localize everything.

> if you're a typical android/ios end user you're interacting with a computer in your native language anyway. this discussion only applies to low level power users.

I don't think you realize how poor this experience is. Part of the reason is that the underlying system is so English-focused that app developers have to do a lot of work to get things working.

> if you think it should be fixed why stop at usernames? why represent uids with 1234 instead of 一二三四?

I mean, if the computers had first been built in south east asia, they would have been.

I have the impression that people confuse learning English (which is hard unless your native language is a Germanic/Romance one) with learning to recognize and type Latin characters, which is easy; people around the world already use the Latin alphabet without knowing any English. You may escape the Latin alphabet if you have spent your whole life in a remote village, but for people living in cities around the world it should be familiar and not a barrier at all. It's hard to escape Latin characters in the modern world, and this ship has already sailed, like it or not (I mostly do).
Oh no please, I don’t want to have my linux username in Cyrillic. Thanks but no, thanks!

I know enough linux to see 10 ways in which it will make things worse at some point.

> similar (or even the same, depending on font) characters which can be used in malicious ways

These are called "confusables" and boy does that well run deep: https://www.unicode.org/Public/security/16.0.0/confusables.t...

> It is time for POSIX to get with the times.

"Be the change that you wish to see in the world." — Mahatma Gandhi

It's free to join:

* https://www.opengroup.org/austin/lists.html

* https://www.opengroup.org/austin/

NO. PLEASE DON'T. This wreaks havoc especially on East Asian users because Unicode is poorly supported in console on top of being binary non-canonical in both entry and display.

Meaning,

  - :potato: OR :potatoh: may display as :eggplant: OR :potato:    
  - isEqual(`:eggplant:`, `:eggplant:`) may fail OR succeed   
  - trying to type :sequence: breaks console until reboot  
  - typing :potato: may work but not :eggplant:  
  - users don't know how to spell :eggplant:  
  - etc. 
If you must, please fix Unicode first so that user entry and display would have 1:1 relationship. I do have Han Unification in mind, but I believe the problem isn't unique to the unification or East Asia.
Almost nobody supports string search and comparison API functions for unicode. The unicode security tables for unicode identifiers are hopelessly broken.

Not even the simplest tools, like grep, support Unicode yet. This hasn't happened in the last 15 years, even though there are patches and libs.

Wasn't one way to make grep faster setting LANG=C to avoid using language-aware string comparison? If so, shouldn't Unicode be supported by default or what would, say, de_DE.UTF-8 actually compare to make it slower?
yes it should. but the libunistring variant was too slow. And since LANG is run-time evaluated you cannot really provide pre-compiled, better search patterns.

sometime I'll come up with pre-computed optimized tables, but no time.

It's just a grep bug, ripgrep is fast and supports proper regex.
Sure, go ahead. Write the PR and make sure to test against all other things used in production.

Let's talk again in 30 years when you're done.

Oh, it's been closer to 20 years for the rest of the world to catch up to Unicode than 30. We aren't at "perfect" now but we're certainly down to the trickier corner cases that are difficult to even see how you solve the problems at all, let alone code the solutions, and that's just reality's ugly nose sticking in to our pristine world of numbers.

But there really isn't any other solution. Yes, there will be an uncomfortable transition. Yes, it blows. But there isn't any other solution that is going to work other than deal with it and take the hits as they come. The software needs to be updated. The presumption that usernames are from some 7-bit ASCII subset is simply unreasonable. We'll be chasing bugs with these features for years. But that's not some sort of optional aspect that we can somehow work around. It's just what is coming down the pike. Better to grasp the nettle firmly [1] than shy away from it.

At least this transition can learn a lot from previous transitions, e.g., I would mandate something like NFKC normalization applied at the operating system level on the way in for API calls: https://en.wikipedia.org/wiki/Unicode_equivalence Unicode case folding decisions can also be made at that point. The point here not being these specific suggestions per se, but that previous efforts have already created a world where I can reference these problems and solutions with specific existing terminology and standards, rather than being the bleeding-edge code that is figuring this all out for the first time.

[1]: https://www.phrases.org.uk/meanings/grasp-the-nettle.html
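
As a small illustration of what NFKC normalization "on the way in" buys you, here is a sketch using Python's standard `unicodedata` module. The wrapper name `normalize_name` is made up; where exactly an operating system would apply this is left open.

```python
import unicodedata


def normalize_name(name: str) -> str:
    """Apply NFKC so that equivalent spellings of a name compare equal."""
    return unicodedata.normalize("NFKC", name)


composed = "caf\u00e9"            # 'é' as a single code point
decomposed = "cafe\u0301"         # 'e' followed by a combining acute accent
fullwidth = "\uff41\uff42\uff43"  # fullwidth forms of 'abc'

print(composed == decomposed)                                  # False
print(normalize_name(composed) == normalize_name(decomposed))  # True
print(normalize_name(fullwidth))                               # abc
```

Without normalization the two spellings of "café" are different byte strings, so they would be different usernames that look the same.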

Don't get me wrong, I think using UTF-8 everywhere is how things should be.

But this is not a "let's just" or "why don't we" type of endeavor. This is a major undertaking, and as such people are needed who (A) think it is worth the effort and (B) are willing to follow through with all the consequences.

Open Source software lives from contributions and if you're not willing to do it, why should others spend years of their lives for it?

In the end this is a question of: are the benefits worth the effort? What do we win? Where do things get simpler? Where more complicated? How do you pull it off if half the distributions use UTF-8 and the other half use the legacy way? How would tooling deal with this split? etc.

To add a little bit of context:

You know what I think would be way worse than today's reduced-character-set usernames with some special rules, or "just" using utf-8 for them?

Both. Imagine a world where some usernames are UTF-8, some are not, and it is hard to figure out which is which. That would be worse than just leaving things as they are.

Avoiding that situation makes pulling the whole thing off even harder, since there needs to be a high amount of coordination between many projects, distros etc.

> Unicode case folding decisions can also be made at that point

Ok I will bite. How do you intend to do case folding without knowing the language the string is in? Will every filename or whatever also have its language as part of the string? I am not sure what the plan is there.
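
The usual example behind this question is the Turkish dotted and dotless i. A short sketch using only Python's built-in, language-independent case mappings (the variable names are just for illustration):

```python
dotless_i = "\u0131"       # LATIN SMALL LETTER DOTLESS I, a distinct letter in Turkish
dotted_capital = "\u0130"  # LATIN CAPITAL LETTER I WITH DOT ABOVE

# The default mappings send dotless i to plain 'I', and 'I' back to plain 'i',
# so a round trip through upper case does not return the original letter.
round_trip = dotless_i.upper().lower()
print(dotless_i.upper() == "I")   # True
print(round_trip == dotless_i)    # False

# Lowercasing the dotted capital gives 'i' plus a combining dot (two code points).
lowered = dotted_capital.lower()
print(len(lowered))               # 2
```

A Turkish-aware mapping would pair I with ı and İ with i instead, which is exactly the information a bare string does not carry.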

Unicode opens a whole can of worms. The world is already full of software which in theory supports non-ASCII text but in practice breaks for some use cases. It's easy to allow UTF-8; it's hard to test all possible use cases, and to foresee them so you know what to test. Nowadays I mostly use English, so I don't see localization bugs, but when I used my native language with software and the internet (~10 years ago) I encountered too many bugs and avoided non-ASCII in things like usernames, passwords, file names and other places where UTF-8 may be allowed but causes problems later. Just allowing UTF-8 is rarely enough. Localization is hard, so it's better to start with the places where it is important. Usernames are IMHO not one of them.
Sounds like lots of work and a lot of new bugs for no real value.
> Computers are used in more than the US and Canada

Even if you speak US (or Canadian) English exclusively, there are still some words that are just impossible to spell correctly in pure ASCII, e.g. résumé, café etc.

“correctly”. I don’t consider it “incorrect” English when someone writes “cafe” or “resume”. It seems to me a little bit pædantic to insist that those words must have the accent marks in order to be correct (when using them in English).
Yeah, loanwords are different words than the original word.

The correct plural of "baby" in German is "babys".

I would say it is not the place of posix to prescribe how things should be; the job of posix is to describe what is: a common operating environment. this is why posix is such a mess and why I feel it is not a big deal to deviate from posix, however posix fills an important role in getting everyone on the same page for interoperability.

In my opinion the way to improve this is bottom up, not top down. Start with linux (these days posix is largely "what does linux do?"), get a patch in that changes the definition of the user name from a subset of ascii to a subset of utf-8. what subset? that is a much harder problem with utf-8 than ascii, good luck. get a similar patch in for a few of the bsds. then you tell posix what the os's are doing. and fight to get it included.

On the subject of what unicode subset. perhaps the most enlightened thing to do is the same as the unix filesystem and punt. one neat thing about the unix filesystem is that names are not defined in an encoding but as a sequence of bytes. This has problems and has made many people very mad. but it does mean your file system can be in whatever encoding you want, transitioning to utf-8 was easy (mainly due to the clever backwards compatible nature of utf-8) and we were not locked into a problematic encoding like on windows. perhaps just define that the name is an array of bytes and call it a day. that sounds like the unix way to me.

"however posix fills an important role in getting everyone on the same page for interoperability."

Isn't that exactly what the posix username rules are doing? Specifying a set of characters which are portable across systems to allow for interoperability between current and legacy unix systems along with most non-unix systems.

"Start with linux"

Which linux? Debian/Ubuntu, Redhat/Fedora, shadow-utils, and systemd all differ.

"get a patch in that changes the defination of the user name from a subset of ascii to a subset of utf-8"

ASCII is a subset of UTF-8 so the POSIX definition already specifies a subset of UTF-8.

Honestly, I just don't care. UTF8 is excessively complicated. ASCII is simple.
> properly speaking your system has to handle everything anyway, even if you don't allow local creation.

Honestly, I try not to be a pessimist, but this sounds like the opening narration to some dystopian doomsday movie. Titled something like You're Not Wrong, I suppose.

At the meatspace level, purely numeric usernames are problematic.

I was working as a contractor at a Fortune 500 firm several years ago when they introduced a new ERP system which apparently encouraged the company to switch to numeric system IDs. Fortunately the technical teams, especially Linux support, objected and it was overruled, but I was just as worried about the communications problems that would result.

When everyone has a system ID that matches a consistent pattern, like “YZ12345”, IDs are easy to recognize in documentation and data. An ID like “1234567” could be practically anything.

I really like the concept of adding some redundancy to ids, like a prefix. It helps to disambiguate things (kind of like static typing). A good example is also bank account numbers, which must be one more than a multiple of 97, enabling fast client-side validation against typos.
Could you give a reference on this 97 rule? I’m intrigued.
I was also intrigued, so I searched, and on Wikipedia ( https://en.wikipedia.org/wiki/International_Bank_Account_Num... ), in the section "Validating the IBAN", it is written:

    Interpret the string as a decimal integer and compute the remainder of that number on division by 97
    If the remainder is 1, the check digit test is passed and the IBAN might be valid
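
For the curious, a minimal Python sketch of that check. The two quoted steps are the last ones; the preparation before them (move the first four characters to the end, then replace letters with numbers, A=10 through Z=35) is from my recollection of the same Wikipedia section. The function name `iban_check_passes` is made up, and country-specific length checks are deliberately left out.

```python
def iban_check_passes(iban: str) -> bool:
    """Run the mod-97 check on an IBAN. A pass only means it *might* be valid."""
    compact = iban.replace(" ", "").upper()
    if len(compact) < 5 or not (compact.isascii() and compact.isalnum()):
        return False
    # Move the country code and check digits (first four characters) to the end.
    rearranged = compact[4:] + compact[:4]
    # int(ch, 36) maps '0'-'9' to 0-9 and 'A'-'Z' to 10-35.
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    # Interpret as a decimal integer and take the remainder modulo 97.
    return int(digits) % 97 == 1


# The example IBAN used on Wikipedia, and the same one with a single-digit typo.
print(iban_check_passes("GB82 WEST 1234 5698 7654 32"))  # True
print(iban_check_passes("GB82 WEST 1234 5698 7654 33"))  # False
```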
It’s pretty common in places that handle Tax data.

At the end of the day, pushing opinionated bullshit doesn’t belong in utilities. If there’s a security vulnerability, sell that and push for incorporation into NIST standards.

I am also worried about more subtle bugs caused by usernames that are not strictly only-numeric, such as “10e2” or “0xDEADBEEF”.
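
A sketch of why those names are worrying, in Python. Both helper names are made up; the "loose" one imitates a script that asks "does this parse as a number?" with a permissive parser, which is an assumption about how such bugs would arise rather than a claim about any specific tool.

```python
def looks_numeric_strict(name: str) -> bool:
    """Only plain ASCII decimal digits count as numeric."""
    return name.isascii() and name.isdigit()


def looks_numeric_loose(name: str) -> bool:
    """Anything a permissive parser accepts counts as numeric."""
    for parse in (lambda s: int(s, 0), float):
        try:
            parse(name)
            return True
        except ValueError:
            pass
    return False


for name in ("1234", "10e2", "0xDEADBEEF", "alice"):
    print(name, looks_numeric_strict(name), looks_numeric_loose(name))
```

The two checks agree on `1234` and `alice` but disagree on `10e2` and `0xDEADBEEF`, so two tools using different checks would treat the same name differently. With Python's `float`, even `inf` and `nan` pass the loose check.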
It shouldn't be a problem as long as the system disallows a numeric username to be the same as an existing UID (excepting the case where the matching UID is assigned to said username).
still makes historic data garbage, both users and pids can be created/destroyed over time.
> Allowing purely numeric usernames seems like a terrible idea to me

"I'm not a number, i am a free man. Ha ha ha ha ha"

“Who is UID 0?”

“You are UID 6.”

You have an off-by-one error. But I honestly don’t know which one you should change to stay in the spirit of the show.
There’s lots of dumb things that you can do. Where do the safety bumpers stop?
wherever each community puts them?