| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by notpublic 530 days ago

Instead of doing a diff, curious if Normalized compression distance (NCD)[1] will yield a better result. It is very simple algorithm:

to compare two images, i1 and i2

  l1  = length(gzip(i1))
  l2  = length(gzip(i2))
  l12 = length(gzip(concatenate(i1, i2))

  ncd = (l12 - min(l1, l2))/max(l1, l2)

Here is a nice article where I found out about this long ago.

https://yieldthought.com/post/95722882055/machine-learning-t...

From the article:

"Basically it states that the degree of similarity between two objects can be approximated by the degree to which you can better compress them by concatenating them into one object rather than compressing them individually."

[1] https://en.wikipedia.org/wiki/Normalized_compression_distanc...

1 comments

johnisgood 530 days ago

Oh interesting, I remember comparing images before, I think I was doing a diff as well, so I suppose this would have worked? Nice to know! They were very small images though.

It probably would have added the overhead from compression which in my case would have been detrimental.

link

notpublic 530 days ago

Do try it. We use it for text search in one of our apps and works remarkably well. Basically to find which chunks contain the given text. Since the text can span multiple chunks, a simple string search will not work.

link