by escanor 1296 days ago
great work! just a note regarding tf-idf, where you mention log10: i think you're missing the reason for the log and, most importantly, for base 10. namely, log10 gives us a perspective on the number of digits in the term/document frequency. if a term "A" occurs 23 times and a term "B" occurs 50 times, they will have very close representations (log10 is roughly 1.36 and 1.70), because both counts are two-digit numbers.
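
to make the digit intuition concrete, here is a minimal python sketch. the helper names (`digit_count`, `log_scaled`) and the extra term "C" with a count of 500 are made up for illustration and are not from the submission; the only library call is `math.log10` from the standard library.

```python
import math


def digit_count(n: int) -> int:
    """Number of decimal digits in a positive integer count."""
    return len(str(n))


def log_scaled(n: int) -> float:
    """Base-10 log of a positive count, as used in log-scaled tf/idf terms."""
    return math.log10(n)


# Hypothetical counts: A and B come from the comment above, C is an added
# three-digit example for contrast.
counts = {"A": 23, "B": 50, "C": 500}

for term, n in counts.items():
    # floor(log10(n)) + 1 equals the number of digits of n
    digits_from_log = math.floor(log_scaled(n)) + 1
    print(term, n, digit_count(n), digits_from_log, round(log_scaled(n), 2))
```

A and B land within the same unit interval of log10 (between 1 and 2), while C moves up by a full unit, which is the "number of digits" view of the scaling.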

anyway, thanks for the submission