by royce 3039 days ago
The ranking in Troy's list is based entirely on how common the passwords are. Here are the top 10, with the number of times each appears in the corpus:

  7c4a8d09ca3762af61e59520943dc26494f8941b:123456 (20760336)
  f7c3bc1d808e04732adf679965ccc34ca7ae3441:123456789 (7016669)
  b1b3773a05c0ed0176787a4f1574ff0075f7521e:qwerty (3599486)
  5baa61e4c9b93f3f0682250b6cf8331b7ee68fd8:password (3303003)
  3d4f2bf07dc1be38b20cd6e46949a1071f9d0e3d:111111 (2900049)
  7c222fb2927d828af22f592134e8932480637c0d:12345678 (2680521)
  6367c48dd193d56ea7b0baad25b19455e529f5ee:abc123 (2670319)
  e38ad214943daad1d64c102faec29de4afe9da3d:password1 (2310111)
  20eabe5d64b0e216796e834f52d61fd0b70332fc:1234567 (2298084)
  8cb2237d0679ca88db6464eac60da96345513964:12345 (2088998)

So ... what is the "right" threshold for N?

  $ for topx in 1 100 1000 5000 10000 20000 50000 100000 200000 500000 1000000 2000000 10000000; do \
      echo -n "$topx: "; head -n "$topx" pwned-passwords-2.0.txt | tail -n 1; \
    done

  1: 7C4A8D09CA3762AF61E59520943DC26494F8941B:20760336
  100: 482FA19D5C487CB69ACDA19EEE861CC69D82CC94:272371
  1000: 5B9FE558F673D63309BEB13BFA5DA6C30A3CA1BF:64912
  5000: FE648FC459A6F6EF6CD347BEE3D494766239BBB5:19860
  10000: 2682A3DBA7A1452EE7EE9980F195C6A768055DA6:11055
  20000: 53490A3C8567342B57B6A4FF24908DF73182B357:6309
  50000: 7517CD23A308BBCD05E5AD24AA6AD054237ED470:3153
  100000: BA6D6A41B9548C523833627A8B0E5170558BE1EA:1752
  200000: E50E6893264519636E90E95B6B1A85D0A691E0B1:931
  500000: AF8DF653177BBB3FEE2DA68D314B94CB5281B4F3:381
  1000000: BDD57A4CAA691A3441C1190C6F087B58B2EE3EF6:186
  2000000: C824AF24AA8F2FD99AD6842DC0E4B49100D96161:93
  10000000: 352DB7177AB7848DF1C102234401097FE40EB87D:22

The last field on each line indicates how common the password is in the corpus (for example, the single most common password, "123456", appears in the corpus 20,760,336 times).

So ... based on this data ... what is a reasonable value for that count, such that if the value is exceeded, the user should be disallowed from using the password? How much real-world online or offline resistance is provided by disallowing, say, passwords used at least 186 times in the corpus (roughly a million passwords, though 5201 passwords are at the 186 mark)? (The answer should be self-evident; if it isn't, I can provide more background).
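
For anyone who wants to check how many entries sit at or above a given count, here is a minimal sketch. It assumes the count-ordered file with one `HASH:COUNT` entry per line; the function name `count_at_least` is mine, not part of any tool.

```shell
# count_at_least FILE MIN_COUNT
# Hypothetical helper: reports how many entries have a count >= MIN_COUNT,
# and how many of those sit exactly at MIN_COUNT (the ties at the cutoff).
count_at_least() {
  awk -F: -v min="$2" '
    $2 + 0 >= min { n++ }     # "+ 0" forces a numeric compare, even with a trailing CR
    $2 + 0 == min { tie++ }
    END { printf "%d entries at or above %d (%d exactly at it)\n", n, min, tie }
  ' "$1"
}
```

Usage would be something like `count_at_least pwned-passwords-2.0.txt 186`.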

Put another way ... if the corpus were only 1M in size, those right-hand values would be much smaller. How could you determine the threshold then? What I'm trying to illustrate here is that it's not the absolute value of that commonality number that matters; it's the relative rank. But that relative rank can't be determined via the API; you must analyze the entire corpus directly - and then discard the vast majority of it for blacklisting purposes.
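
A rough sketch of what "analyze the corpus, keep the top N by rank, discard the rest" could look like. It assumes the count-ordered `HASH:COUNT` file with uppercase hex hashes, and GNU `sha1sum` (on macOS, `shasum` would be the substitute). The function names are hypothetical.

```shell
# build_blacklist FILE N
# Prints the hashes of the N highest-ranked entries (the file is already
# sorted by descending count, so rank is just the line number).
build_blacklist() {
  head -n "$2" "$1" | cut -d: -f1
}

# is_blacklisted BLACKLIST_FILE PASSWORD
# Succeeds (exit 0) if the SHA-1 of PASSWORD appears in the blacklist.
is_blacklisted() {
  digest=$(printf '%s' "$2" | sha1sum | cut -d' ' -f1 | tr 'a-f' 'A-F')
  grep -qxF "$digest" "$1"
}
```

For example: `build_blacklist pwned-passwords-2.0.txt 10000 > top10k.txt`, then `is_blacklisted top10k.txt "$candidate"` at password-change time. The cutoff of 10000 is only a placeholder here, not a recommendation.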

I totally get that the threshold might vary per implementation. But it varies much less once the hash is slow enough, and the authentication service is suitably rate-limited. In other words, any system that would get real benefit from a 1-million-word blacklist is one that needs to be improved elsewhere instead.
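
One way to judge the diminishing returns for yourself is to look at what share of all password occurrences in the corpus the top N entries account for. This is only a sketch under the same assumptions as before (count-ordered `HASH:COUNT` file); `coverage` is a made-up name, and the share it reports is relative to this corpus only.

```shell
# coverage FILE RANK [RANK ...]
# Hypothetical helper: for each requested rank N, prints the percentage of
# all occurrences in FILE that the top N entries account for.
coverage() {
  file=$1; shift
  awk -F: -v ranks="$*" '
    BEGIN { n = split(ranks, r, " "); for (i = 1; i <= n; i++) want[r[i]] = 1 }
    { total += $2; if (NR in want) cum[NR] = total }   # running sum at each wanted rank
    END {
      for (i = 1; i <= n; i++)
        if (r[i] in cum) printf "%d: %.2f%%\n", r[i], 100 * cum[r[i]] / total
    }
  ' "$file"
}
```

Something like `coverage pwned-passwords-2.0.txt 1000 10000 100000 1000000` would show how little each extra order of magnitude adds.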

But Troy didn't provide any guidance about that, or even how to judge for yourself what the threshold might be. He just provided an API to blacklist a corpus of passwords that is three orders of magnitude larger than a properly designed system would ever need.

1. https://blogs.dropbox.com/tech/2012/04/zxcvbn-realistic-pass...