Hacker News new | ask | show | jobs
by microtonal 647 days ago
It’s quite interesting that we end up using cosine similarity. Most networks are trained with a softmax layer at the end (e.g. next word prediction). Given the close relation between softmax and logistic regression, it might make more sense to use σ(u.v) as the similarity function.