by turtlesoup
2223 days ago
There are some subtleties (e.g. hyphens, derived forms, bigrams), but the biggest problem is that most English dictionaries lack entries for many scientific words and pieces of internet slang. I ended up tokenizing Wikipedia for a blacklist and still missed a lot :(
That sounds like an impressive project in itself :)