I feel like most of those research questions could be answered if it were a "username -> password strength" mapping, plus a hash for studying duplicate trends, rather than just "username -> password". Obviously there is no objective ranking of "password strength", but a decent approximation could be provided.
There are serious risks to having your username and password in a public list. Yes, all of these usernames and passwords had technically already been released publicly, but for a lazy and ignorant script kiddie, finding those lists, or even being aware of them, can be out of reach. By aggregating everything into one list, you 1) increase the search engine visibility of all the credentials, which means someone Googling the username of, say, an Internet commenter who pissed them off is much more likely to find a plaintext password they could use to impact that person's life (I work in information security and have seen that happen on many occasions), 2) encourage script kiddies and fraudsters to spend time working through the list to find working accounts that other criminals have missed in the past decade, and 3) undo any work that paste sites like Pastebin and file sharing sites like Mediafire have done to remove copies of the database dumps.
Point 1) may not apply if it strictly remains a torrent, but it'll probably be floating around public paste sites within a few days, which would likely mean search engine visibility for every username on it. If even 0.01% of the users on this list have accounts compromised due to its release, then I don't think the research benefits justify that cost, relative to a more redacted version of the list.
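To make the "username -> password strength" plus hash idea concrete, here is a rough Python sketch of what a redacted release could look like. Everything in it is my own assumption, not something the dataset's author did: the function names (`estimate_strength_bits`, `redact`), the character-pool entropy estimate (a crude stand-in for a proper strength estimator), and the use of a keyed HMAC for the duplicate tag. The key is the important part: a plain unsalted hash of a common password can be reversed with a dictionary, so the key would have to be random and thrown away after the list is built.

```python
import hashlib
import hmac
import math
import secrets
import string


def estimate_strength_bits(password: str) -> float:
    """Crude strength proxy: length * log2(size of the character pool used)."""
    if not password:
        return 0.0
    pool = 0
    if any(c in string.ascii_lowercase for c in password):
        pool += 26
    if any(c in string.ascii_uppercase for c in password):
        pool += 26
    if any(c in string.digits for c in password):
        pool += 10
    if any(not c.isascii() or not c.isalnum() for c in password):
        pool += 32  # arbitrary allowance for symbols and other characters
    return round(len(password) * math.log2(pool), 1)


def redact(records, key: bytes):
    """Map (username, password) pairs to rows that contain no plaintext.

    dup_tag is a keyed HMAC, so equal passwords get equal tags, but the
    tag cannot be brute-forced without the (discarded) key.
    """
    rows = []
    for username, password in records:
        tag = hmac.new(key, password.encode("utf-8"), hashlib.sha256).hexdigest()
        rows.append({
            "username": username,
            "strength_bits": estimate_strength_bits(password),
            "dup_tag": tag[:16],
        })
    return rows


if __name__ == "__main__":
    key = secrets.token_bytes(32)  # generate once, never publish, then discard
    sample = [("alice", "hunter2"), ("bob", "hunter2"), ("carol", "Tr0ub4dor&3")]
    for row in redact(sample, key):
        print(row)
```

Duplicate trends can still be studied by grouping on `dup_tag`, and strength trends by looking at `strength_bits`, without any row being directly usable to log in. Note that even a strength score leaks a little about each password, so this reduces the risk rather than removing it.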
If the person who releases this kind of information has the foresight to know what the questions are going to be, they could provide the answers directly rather than go half-way and modify the data. It would likely be less work than trying to produce anonymized data that is both useful and secure.
What I see in cases like this is one of two options: either full public access, or restricted access where only a select few get the chance to do the research. The 0.01% misuse should thus be weighed against that choice, rather than against the theoretical case of anonymized data.