Hacker News new | ask | show | jobs
by LuisMondragon 1725 days ago
Hi, my interest got piqued. I'm developing a similarity feature where I compare embeddings of a sentence and its translation. I wanted to know if the hashing method would be faster that the pytorch multiplication by which I get the sentence similarities. Going from strings to bytes, hashing and comparing is very fast. But if I get the embeddings, turn them into bytes, hash them and compare them, both methods take almost the same time.

I used this Python library: https://github.com/trendmicro/tlsh.

1 comments

You pay a decent cost to do the hash, it’s a compression algorithm of sorts. But the data is a fraction of the size and comparison is way faster. If you do many of these or compare the same ones more than once you amortise the cost very quickly