Hacker News
by jrfinkel 5171 days ago
Merging is pretty simple, and could probably use a little more TLC. When a new document comes into the system, if its similarity score is above some threshold for documents in two different clusters, we consider merging those clusters. We then make the yes/no decision by comparing random documents from both clusters and averaging the scores. The threshold we use here is a bit lower than the one for non-merging decisions, since we have the additional information that this new document does a good job of linking the clusters.
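
For anyone who wants to see the shape of this, here is a rough sketch of the merge step as described above. It is not the actual implementation. The function names, the `sample_size` parameter, the default threshold values, and the choice to put the new document into the first linked cluster are all assumptions made for illustration. `similarity` is any caller-supplied function returning a score, and clusters are assumed to be non-empty lists.

```python
import random
from itertools import product


def should_merge(cluster_a, cluster_b, similarity, merge_threshold,
                 sample_size=5, rng=None):
    """Yes/no merge decision: average the similarity over randomly
    sampled cross-cluster document pairs and compare to the threshold.
    sample_size is a hypothetical knob, not something from the post."""
    rng = rng or random.Random()
    sample_a = rng.sample(cluster_a, min(sample_size, len(cluster_a)))
    sample_b = rng.sample(cluster_b, min(sample_size, len(cluster_b)))
    scores = [similarity(a, b) for a, b in product(sample_a, sample_b)]
    return sum(scores) / len(scores) >= merge_threshold


def linked_clusters(new_doc, clusters, similarity, link_threshold):
    """Indices of clusters holding at least one document whose
    similarity to new_doc is above link_threshold."""
    return [i for i, cluster in enumerate(clusters)
            if any(similarity(new_doc, d) > link_threshold for d in cluster)]


def ingest(new_doc, clusters, similarity,
           link_threshold=0.8, merge_threshold=0.6, rng=None):
    """Add new_doc, merging clusters it links when should_merge agrees.
    merge_threshold is set lower than link_threshold (assumed values)
    because the new document is itself evidence for the link.
    Mutates the cluster lists and returns the updated cluster list."""
    linked = linked_clusters(new_doc, clusters, similarity, link_threshold)
    if not linked:
        # Nothing similar enough: start a new singleton cluster.
        clusters.append([new_doc])
        return clusters

    target = clusters[linked[0]]
    merged_away = set()
    for i in linked[1:]:
        # Decide before adding new_doc, so only existing members are sampled.
        if should_merge(target, clusters[i], similarity,
                        merge_threshold, rng=rng):
            target.extend(clusters[i])
            merged_away.add(i)

    # Assumption: the new document joins the first linked cluster.
    target.append(new_doc)
    return [c for i, c in enumerate(clusters) if i not in merged_away]
```

Sampling keeps the decision cheap for large clusters, at the cost of some noise in the averaged score. Passing a seeded `random.Random` as `rng` makes the decision reproducible.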
1 comment

This doesn't work in the case where two (and only two) similar documents get ingested into the system as new singleton clusters at the same time. That case is very rare, so I guess it's not a big issue for you.