Hacker News new | ask | show | jobs
by chinesempire 2306 days ago
using ETS with Erlang (or Elixir) I get sub 50μs (30μs on avergae) lookup times.

Memory usage is quite high, around 95 bytes per element, bu I'm sure that by spending more than 5 minutes on it, like I did, one can take it down considerably

For reference, this is the code I used

I converted the SHA hashes to MD5 to save memory, given we don't care about collisions (which are very unlikely anyway), we just want to know if the password was there or not.

    defmodule Pwned do
      def load do
        :ets.new(:table, [:named_table, :set])

        File.stream!("pwned-passwords-sha1-ordered-by-hash-v5.txt")
        |> Stream.each(fn line ->
          <<hash::binary-size(40), ":", _rest::binary>> = String.trim_trailing(line)
          hash = :crypto.hash(:md5, Base.decode16!(hash))
          :ets.insert(:table, {:binary.copy(hash), true})
        end)
        |> Stream.run()
      end

      def lookup_hash(hash) do
        hash = :crypto.hash(:md5, Base.decode16!(hash))

        case :ets.lookup(:table, hash) do
          [] -> false
          _ -> true
        end
      end

      def lookup_password(password) do
        lookup_hash(:crypto.hash(:sha, password) |> Base.encode16())
      end
    end
1 comments

I agree these performance numbers don't seem great. Using q (another interpreted language; not compiled) I get 5µsec on my Macbook Air:

Here's my load script:

    \wget https://downloads.pwnedpasswords.com/passwords/pwned-passwords-sha1-ordered-by-hash-v5.7z
    \7z -so e pwned-passwords-sha1-ordered-by-hash-v5.7z pwned-passwords-sha1-ordered-by-hash-v5.txt | cut -c1-40 | xxd -r -p > hibp.input
    `:hibp 1: `s#0N 20#read1 `:hibp.input
I can then shut down this process, and start a new one:

    q)hibp:get`:hibp; / this mmaps the artefact almost instantly
    q)\t:1000 {x~hibp[hibp bin x]} .Q.sha1 "1234567890"
    5
It's so fast I need to run it 1000 times to take just 5msec (5µsec average lookup time!). I imagine converting to md5 would be substantially faster since there's a 16-byte scalar type in q I would be able to use.