| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by willcodeforfoo 1705 days ago

Tangentially-related: What's the state of the art for storing a bunch[1] of hashes like the OP (or pHash, etc.) in PostgreSQL and querying by hamming distance in a reasonable time?[2]

pg_similarity? pg_trgm? cube?

[1]: 10–50 Million

[2]: < 200ms

3 comments

Twixes 1705 days ago

AFAIK there isn't any great way. pg_similarity etc. offer useful functions, but that doesn't help with big-O here. And the indexing capabilities that exist are only useful for geometry/geography, not abstract metric spaces (the mathematical sense of "metric space", examples of metrics being Hamming distance or Levenshtein distance). I haven't found a DBMS optimized for metric spaces at all actually. The second best solution I've got is just using a columnar DBMS like ClickHouse, which still needs to scan all values, but at least that reads values from a blob which _only stores that specific column_ – hugely faster than parsing whole rows. The lack of the ideal solution is why I'm building Emdrive, an RDBMS with first class support for similarity search, based on indexing with an M-tree variation. Still very early stages ;) https://github.com/Twixes/emdrive

link

willcodeforfoo 1704 days ago

This looks like a really interesting project! I'm going to check it out.

link

innagadadavida 1705 days ago

One option when your DB doesn't have those primitives is to convert the bits to word 011 -> "unsetbit2 setbit1 setbit0" then treat that column to have a text index - this is equivalent to doing hamming distance search. I did this with MySQL for 20M gifs for near duplicate detection and it worked very well.

link

willcodeforfoo 1704 days ago

Interesting idea, I'll have to give this a shot!

link

varelaz 1705 days ago

I don't know if it makes sense to query hamming distance for hash. Closest hashes don't guarantee closest images at all. You can check for amount of parts matching by query like: select video_id from video_hashes where hash in (...) group by video_id order by count(distinct hash) desc limit 10

technically it can be fast since selection on hash could be very narrow. You need only index by hash, video_id.

link

CaveTech 1705 days ago

OP is referring to phashes aka perceptual hashes, where closest hashes should indeed indicate similarity.

link