| HN Mirror


by a3_nm 4153 days ago

What about research to determine to what extent usernames containing words in a certain language tend to go with passwords containing words from the same language? (More generally, is there any connection between the bi- or trigram distribution of usernames and that of passwords? In fact, do they just look the same, or could you tell, given a string, whether it's more likely a username or a password?)
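
A minimal sketch of the n-gram comparison, assuming the data is available as two plain lists of strings. The function names and the choice of Jensen-Shannon divergence as the distance are illustrative assumptions, not something the dataset prescribes.

```python
from collections import Counter
from math import log2

def ngram_distribution(strings, n=2):
    """Relative frequency of each character n-gram across all strings (case-folded)."""
    counts = Counter()
    for s in strings:
        s = s.lower()
        counts.update(s[i:i + n] for i in range(len(s) - n + 1))
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()} if total else {}

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical distributions, 1 for disjoint ones."""
    def kl(a, m):
        return sum(pa * log2(pa / m[g]) for g, pa in a.items() if pa > 0)
    # m is the midpoint distribution over the union of observed n-grams.
    m = {g: 0.5 * (p.get(g, 0.0) + q.get(g, 0.0)) for g in set(p) | set(q)}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical toy data; a real run would use the two columns of the dump.
usernames = ["mickael", "marie", "julien"]
passwords = ["mickael69", "soleil", "azerty"]
distance = js_divergence(ngram_distribution(usernames), ngram_distribution(passwords))
```

A value near 0 would suggest usernames and passwords "just look the same" at the bigram level; a value near 1 would suggest a string's n-grams are enough to tell which column it came from.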

Do usernames of people with weaker passwords have something in common? How do they differ from those of people with stronger passwords? In France there is a practice of picking names like "foobar42" or "foobardu42", where "foobar" is a first name and 42 a "département" (country subdivision) number, which I would associate with casual users. Here one could quantify whether people with usernames of this form tend to pick weaker passwords. Insert your favorite prejudice here about lame and skilled username patterns, and quantify how the password diversity of this group fares in comparison with others.
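
As a rough sketch of that comparison: the regular expression below is an assumption about what "foobar42" / "foobardu42" usernames look like (letters, an optional "du", two digits), and "diversity" is taken to mean distinct passwords divided by accounts. Neither definition comes from the dataset.

```python
import re

# Assumed pattern: a run of letters, an optional "du", then a two-digit number.
# It ignores special cases such as Corsica (2A/2B) and three-digit overseas codes.
DEPT_STYLE = re.compile(r"^([a-z]+?)(?:du)?(\d{2})$")

def looks_departement_style(username):
    return DEPT_STYLE.match(username.lower()) is not None

def password_diversity(pairs):
    """Distinct passwords divided by number of accounts (1.0 means nobody shares)."""
    passwords = [pw for _, pw in pairs]
    return len(set(passwords)) / len(passwords) if passwords else 0.0

def compare_groups(pairs):
    """Return (diversity of matching usernames, diversity of all the others)."""
    matching = [p for p in pairs if looks_departement_style(p[0])]
    others = [p for p in pairs if not looks_departement_style(p[0])]
    return password_diversity(matching), password_diversity(others)
```

The same `compare_groups` shape works for any other username predicate you want to test a prejudice about.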

Is it true that the most common passwords are associated with usernames that are also common? Does username frequency correlate with password frequency? Are there more people with unique usernames or people with unique passwords?
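
Both questions reduce to counting. The sketch below assumes the dump is a list of `(username, password)` tuples and uses a hand-rolled Pearson correlation over per-account frequencies; a rank correlation might be a better fit for heavy-tailed counts like these.

```python
from collections import Counter

def uniqueness_summary(pairs):
    """Count accounts whose username, respectively password, occurs exactly once."""
    user_counts = Counter(u for u, _ in pairs)
    pass_counts = Counter(p for _, p in pairs)
    unique_users = sum(1 for u, _ in pairs if user_counts[u] == 1)
    unique_passes = sum(1 for _, p in pairs if pass_counts[p] == 1)
    return unique_users, unique_passes

def frequency_correlation(pairs):
    """Pearson correlation between each account's username frequency and password frequency."""
    user_counts = Counter(u for u, _ in pairs)
    pass_counts = Counter(p for _, p in pairs)
    xs = [user_counts[u] for u, _ in pairs]
    ys = [pass_counts[p] for _, p in pairs]
    n = len(pairs)
    if n == 0:
        return None
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return None  # undefined when either frequency is constant
    return cov / (vx * vy) ** 0.5
```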

In some countries it is customary to annotate usernames with the user's year of birth. Filtering on such usernames could give insight into the correlation between age and password quality, or identify which passwords are more or less popular for a given user age. You could try to check the correctness of the filter using the fact that some of those people may have used their birthdate (including the year) as a password.
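
A sketch of the filter and of the sanity check, assuming a birth year shows up as a trailing four-digit number between 1940 and 2009. That range is a guess chosen for illustration; it would need tuning against the actual data.

```python
import re

# Assumed range of plausible birth years: 1940-2009.
YEAR_SUFFIX = re.compile(r"(19[4-9]\d|200\d)$")

def birth_year_from_username(username):
    """Return the trailing year as an int, or None if the username has none."""
    m = YEAR_SUFFIX.search(username)
    return int(m.group(1)) if m else None

def filter_agreement(pairs):
    """Among usernames ending in a plausible year, the share whose password contains that same year."""
    dated = [(birth_year_from_username(u), pw) for u, pw in pairs]
    dated = [(year, pw) for year, pw in dated if year is not None]
    if not dated:
        return 0.0
    confirmed = sum(1 for year, pw in dated if str(year) in pw)
    return confirmed / len(dated)
```

If the agreement rate is clearly higher than the rate at which a random year appears in a random password, that is some evidence the suffix really is a birth year.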

If a seemingly rare password in the dataset occurs for only two distinct usernames, then maybe those two usernames actually correspond to the same user. Do such usernames have a low edit distance? Could you use this to learn general rules to determine, given two usernames, whether they seem to correspond to the same person?
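
One way to sketch this, assuming `(username, password)` tuples again: group usernames by password, keep the passwords shared by exactly two usernames, and compute the Levenshtein distance between them. The helper names are made up for the example.

```python
from collections import defaultdict

def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]

def same_password_pairs(pairs):
    """Username pairs that are the only two users of a password, closest pairs first."""
    users_by_password = defaultdict(set)
    for username, password in pairs:
        users_by_password[password].add(username)
    results = []
    for users in users_by_password.values():
        if len(users) == 2:
            a, b = sorted(users)
            results.append((a, b, levenshtein(a, b)))
    return sorted(results, key=lambda r: r[2])
```

Pairs with a small distance ("jsmith" / "jsmith1") would be the natural training examples for a same-person rule.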

I just gave those off the top of my head, and I'm not at all working in this field, but I'd have no trouble imagining interesting applications for this data that would not have been possible with the passwords alone.

meowface 4153 days ago

I feel like most of those research questions could be answered if the release were a "username -> password strength" mapping, plus a hash of each password to study duplicate trends, rather than just "username -> password". Obviously there is no objective ranking of "password strength", but a decent approximation could be provided.
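
To make that concrete, here is a rough sketch of what one redacted row could look like. The strength estimate is a deliberately crude length-times-character-pool heuristic, and the duplicate marker is an HMAC-SHA256 under a secret key that the publisher would discard. Both are assumptions for illustration. A plain unsalted hash would not be enough, since common passwords could be recovered by hashing a wordlist.

```python
import hashlib
import hmac
import math
import string

def crude_strength_bits(password):
    """Very rough estimate: length * log2(size of the character pool in use)."""
    pool = 0
    if any(c in string.ascii_lowercase for c in password):
        pool += 26
    if any(c in string.ascii_uppercase for c in password):
        pool += 26
    if any(c in string.digits for c in password):
        pool += 10
    if any(c not in string.ascii_letters + string.digits for c in password):
        pool += 33  # assumed count of printable ASCII symbols, including space
    return len(password) * math.log2(pool) if pool else 0.0

def redact(pairs, key):
    """Replace each password with (strength estimate, keyed hash for duplicate detection)."""
    rows = []
    for username, password in pairs:
        tag = hmac.new(key, password.encode("utf-8"), hashlib.sha256).hexdigest()
        rows.append((username, round(crude_strength_bits(password), 1), tag))
    return rows
```

Equal passwords get equal tags, so duplicate trends survive, but without the key the tags cannot be checked against a dictionary.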

There are serious risks to having your username and password in a public list. Yes, all of these usernames and passwords were already technically publicly released, but to a lazy and ignorant script kiddie, finding or even being aware of those lists can be outside their grasp.

By aggregating everything into one list, you 1) increase the search engine visibility for all credentials, which means someone Googling the username of, say, an Internet commenter who pissed them off may find a plaintext password they could use to impact the person's life with much higher probability (I work in information security and have seen that happen on many occasions), 2) encourage script kiddies and fraudsters to spend time working through the list to find working accounts that other criminals have missed in the past decade, and 3) undo any work that paste sites like Pastebin and file sharing sites like Mediafire have done to remove copies of the database dumps. Point 1) may not apply if it strictly remains a torrent, but it'll probably be floating around public paste sites within a few days, which would likely mean search engine visibility for every username on it.

If even 0.01% of the users on this list have accounts compromised due to its release, then I don't think that cost justifies the research benefits relative to a more redacted version of the list.

belorn 4152 days ago

> I feel like most of those research questions could be answered if

If the person who releases this kind of information has the foresight to know what the questions are going to be, they could provide the answers directly rather than go half-way and modify the data. It would likely be less work than trying to produce anonymized data that is both useful and secure.

What I see used in cases like this is one of two options: either full public access, or restricted access where only a select few get the chance to do the research. The 0.01% misuse should thus be weighed against that choice, rather than against the theoretical case of anonymized data.

m8urn 4153 days ago

As I explained in the article, I seriously doubt that more than a tiny number of these passwords are still valid. And there is no reason for them to be, having already been widely available, indexed (and cached) by every search engine, archived at archive.org, and downloaded by thousands or tens of thousands of people. Anyone who would use this data maliciously probably already has it.

Much of this data is the same data monitored by sites like haveibeenpwned.com and a dozen others. Facebook scrapes these. LastPass will send you alerts. The risk here is minimal; the research value is much greater than you realize.

meowface 4153 days ago

>Anyone who would use this data maliciously probably already has it.

You might be surprised. The fact that these dumps are supposedly quite old certainly mitigates the risk, but I've seen cases of primary email accounts being taken over using a plaintext password from a dump 5+ years old. No one had tried the password on the email account before, because the email address wasn't in the dump and wasn't identical to the username, though it was very close.

Aggregators like haveibeenpwned.com and LastPass use the passwords they scrape responsibly; they don't release them all in a big batch like this. Many cybercriminals do the same kind of scraping and share these aggregated lists privately, but they're always going to be missing things, so there's no question they're all going to be pulling in your list, too. And odds are your list includes at least one dump that a lot of them missed.

I do understand there is some research benefit here, but even in the best possible scenario I don't think the value from the research outweighs the costs.

m8urn 4153 days ago

First of all, a good number of these passwords were simply gathered through Google. Some were gathered via the archive.org archive of Pastebin pastes and their normal web page archive. Some were from forums that were located via Google. This data is already out there; being aggregated doesn't make it any easier to hack these people.

Try searching for "Cucum01:Ber02" or "shawman:badman" and you will see how many passwords are indexed. I have hundreds of searches like these that I monitor and scrape.

Second, I regularly share my data with the owners of password checking sites such as haveibeenpwned so that users can find out about these breaches. Releasing this data isn't something I have taken lightly; I debated it for years. I have weighed the risks and felt it was important to release the raw data, although not everyone will agree with me on this. I made a good effort to minimize the risks to actual users.

Finally, keep in mind that most users are already at risk simply because they have bad passwords. Ten percent of users have a password on the top 1000 list. A large percentage of users are at risk because the websites they are on don't have proper security. This is how people get hacked, not because of a password found on this list.
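
For anyone who wants to reproduce a figure like that on their own data, a minimal sketch follows. It assumes the top list is derived from the same password column it is measured against, which may not match how a published top-1000 list was built.

```python
from collections import Counter

def share_using_top_passwords(passwords, top_n=1000):
    """Fraction of accounts whose password is among the top_n most common in this list."""
    if not passwords:
        return 0.0
    counts = Counter(passwords)
    top = {pw for pw, _ in counts.most_common(top_n)}
    return sum(1 for pw in passwords if pw in top) / len(passwords)
```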

jschwartzi 4153 days ago

Still, the whole purpose of a password is to remain secret. He's certainly doing these users a disservice by releasing this list regardless of the hypothetical likelihood of the data already being available. Basically the arguments for doing this all seem to boil down to "they should already know their passwords are compromised" which nobody can guarantee is the case.

I agree that having a crappy password puts you at risk, but what about the people who genuinely tried to use some common sense but are on this list anyway? Is it their fault for not religiously keeping up with the latest indexed password lists?

pbreit 4153 days ago

OK, I'll bite: can you give us some ideas on how this would lead to a genuine advancement in user authentication (that we wouldn't have with username/pw de-linked)?

totony 4153 days ago

Example:

Username: mickael

Password: mickael69

EDIT: Just to be more precise, there is a correlation here, and with so much data a lot can be known. Patterns can then be forbidden from password fields so the website is less prone to dictionary attacks.
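
A minimal sketch of such a check at signup time. The function name, the message, and the three-character minimum (to avoid flagging very short usernames that occur inside ordinary words) are all illustrative assumptions.

```python
def password_policy_issue(username, password, min_name_len=3):
    """Return a human-readable reason if the password embeds the username, else None."""
    name = username.lower()
    # Case-insensitive substring test catches "mickael69", "Mickael69!", "xmickaelx", ...
    if len(name) >= min_name_len and name in password.lower():
        return "password contains the username"
    return None
```

With the example above, `password_policy_issue("mickael", "mickael69")` returns the reason string, which a site could show to the user instead of silently rejecting the password.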

pbreit 4152 days ago

So what would you do here? Disallow "mickael" from the password? That's pretty user-hostile and almost completely pointless.

totony 4152 days ago

Is it pointless to reduce the attack vector against your website? And, no, for a banking system, it is not that user-hostile to say things like "we have found that using <pattern> in your password makes it easy for people to guess, please choose a more complicated password".

pbreit 4153 days ago

All possibly interesting questions (certainly not to me) but I fail to see how they would lead to any genuine advancements in authentication.
