Hacker News new | ask | show | jobs
by orware 1942 days ago
As an aside (or perhaps related note), over this past weekend I was doing my first real-world usage of SQLite, since I typically rely on conventional databases for my day-to-day work.

I've read the article shared here in this thread in the past and read about the usefulness of SQLite generally, particularly for read-heavy situations, so I thought it would be a nice experiment to pull in the "Have I Been Pwned Password" List and import it into the SQLite and see how fast it could be queried.

At first I didn't find too many resources out there of the same thing having been done, but after searching some more throughout the weekend I did find out some nice nuggets of information that helped (in particular, I learned that adding the "WITHOUT ROWID" option at the end of a SQLite CREATE TABLE statement will help to reduce the overall database size considerably).

The desire to put things into SQLite for the "Have I Been Pwned Password" actually stemmed from a misinterpretation of the API page since it seemed at first that the API was not free (although I had thought it was), since one of the first things I read was the need for a $3.50/mo charge. Later on, I then figured out that the password checking API is still free, it's just the other pieces of the API that require the API Key / monthly charges (doh!).

If you run a query against Troy's API, it's pretty damn fast all things considered, so I'll likely revert to using his API directly, but I was thinking of uploading the SQLite versions to some cloud hosting somewhere and potentially share it with others if anybody is interested.

On to some quick stats collected over the weekend (using an import process powered by some PowerShell):

  50 million took ~6 hours 22 minutes to complete (with ROWID):
  Start Time: Sunday, February 14, 2021 9:12:29 AM
  End Time: Sunday, February 14, 2021 3:34:46 PM

  50 million took ~5 hours 46 minutes to complete (without ROWID):
  Start Time: Sunday, February 14, 2021 7:51:38 PM
  End Time: Monday, February 15, 2021 2:05:30 AM
File size comparison:

  5.28 GB (50 million rows, with ROWID)
  2.61 GB (50 million rows, without ROWID)
Search Time comparison:

Example performance of Select-String (line-by-line search):

  TotalMilliseconds : 216245.7246
Example performance of the SQL query:

  TotalMilliseconds : 20.8543
I then ended up finding this implementation of a Binary Search option (rather than the naive approach Select-String takes above) that Chris Dent put together here: https://www.indented.co.uk/powershell-file-based-binary-sear...

Example performance of this approach was more in line with the SQLite database:

  TotalMilliseconds : 40.3749
(For these runs comparing against the Binary Search option, the SQLite query ran in):

  TotalMilliseconds : 4.9491
All in all, now that I have discovered the Binary Search option that one would be workable without needing to do anything at all (except to make sure to download the "Ordered By Hash" version of the hashed password list) and use it directly, although I believe once the import has completed of the full password file into SQLite (using the WITHOUT ROWID option) the file size for the database should be fairly close to the actual txt file size (the text file size is 25.1 GB currently with close to 630 million lines in it...based on the smaller batch conversions to SQLite, I think the full import should be in the low 30 GB range).

Tagging along with the top comment in this thread however, I do see what they mean about insert performance since I did try a few experiments there as well, but went ahead and kept things with the 10k batch inserts at a time (I think with individual inserts I was calculating things to potentially take 216 days or something ridiculuous, and switching to batch inserts made it go down to about 3 days, but based on some comments I read elsewhere over the weekend, even that seemed like it might be kind of high...at the moment I'm trying a full import outside of my PowerShell script using DBeaver's functionality for doing so, so I'll see if it completes faster...seems like it might since based on the current database's size it seems like it's more than 50% complete and it's probably only been about 12 hours).