by antirez
430 days ago
Thank you, tptacek. Thanks to the Internet Archive copy of the post from 3 years ago, I was able to verify that the entries are quite similar in the case of "pg". Consider that it captures just the statistical patterns in very common words, so you are not likely to see users that you believe are "similar" to yourself. Notably, montrose may well be a secondary account of PG, and was also found as a cross reference in the original work of three years ago. Also note that vector similarity is not reciprocal: A can have B as its top-scoring item, yet B may have many other items that are nearer to it than A, like in 2D space when you have a cluster of points plus one point nearby but a bit apart. Unfortunately I don't think this technique works very well for actually discovering duplicate accounts, because people often post just a few comments from fake accounts. So there is not enough data, except in the case where someone consistently uses another account to cover their identity. EDIT: at the end of the post I added the visual representations of pg and montrose.
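To make the "not reciprocal" point concrete, here is a minimal sketch in 2D using cosine similarity. The point names and coordinates are made up for illustration and have nothing to do with the real HN data: three points form a tight cluster and a fourth sits a bit apart.

```python
import math

# Hypothetical 2D points: a tight cluster (a, b, c) and one point set apart.
points = {
    "a": (1.0, 0.00),
    "b": (1.0, 0.05),
    "c": (1.0, 0.10),
    "outlier": (1.0, 0.60),
}

def cosine(u, v):
    """Cosine similarity between two 2D vectors."""
    dot = u[0] * v[0] + u[1] * v[1]
    return dot / (math.hypot(u[0], u[1]) * math.hypot(v[0], v[1]))

def top_match(name):
    """Return the name of the most similar other point."""
    others = (other for other in points if other != name)
    return max(others, key=lambda other: cosine(points[name], points[other]))

print("top match for outlier:", top_match("outlier"))
print("top match for c:", top_match("c"))
```

The outlier's best match is a member of the cluster, but that member's own best match is another cluster member, so the relation does not go both ways.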
I worked on a search engine for patents that used the first. Our evaluations showed it was much better than other patent search engines, and we had no trouble selling it because customers could feel the difference in demos.
I tried dimensionality reduction on the BERT vectors, and in every case I tried it made relevance worse. (BERT has already learned a lot, which is being thrown away; there isn't more to learn from my particular documents.)
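One rough way to check this kind of effect yourself is to compare each vector's nearest neighbours before and after reduction. The sketch below is only an illustration under assumptions: it uses random vectors as a stand-in for real embeddings, a plain truncated SVD projection (PCA without mean-centering) as the reduction, and helper names (`svd_reduce`, `top_k`, `overlap_after_reduction`) that are made up here. It is not the evaluation described above, and random data says nothing about how BERT vectors behave.

```python
import numpy as np

def svd_reduce(vectors, dims):
    """Project vectors onto their top `dims` right singular directions."""
    _, _, vt = np.linalg.svd(vectors, full_matrices=False)
    return vectors @ vt[:dims].T

def top_k(vectors, query_index, k):
    """Indices of the k most cosine-similar rows to the query row."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    scores = unit @ unit[query_index]
    scores[query_index] = -np.inf  # never return the query itself
    return set(np.argsort(-scores)[:k].tolist())

def overlap_after_reduction(vectors, dims, k=10):
    """Mean fraction of top-k neighbours that survive the reduction."""
    reduced = svd_reduce(vectors, dims)
    overlaps = [
        len(top_k(vectors, i, k) & top_k(reduced, i, k)) / k
        for i in range(len(vectors))
    ]
    return float(np.mean(overlaps))

# Stand-in data (assumption): 200 random 64-dimensional vectors.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 64))

for dims in (64, 16, 4):
    print(dims, overlap_after_reduction(embeddings, dims))
```

With real embeddings you would swap in your own matrix and, ideally, measure against relevance judgments rather than neighbour overlap alone.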
I don't think either of these helps with "finding articles authored by the same person", because one assumes the same person always uses the same words, whereas documents about the same topic use synonyms that will be turned up by (1) and (2). There is a large literature on determining authorship based on style:
https://en.wikipedia.org/wiki/Stylometry
[1] With https://sbert.net/ this is so easy.