Hacker News
by pdyck 2724 days ago
I had the same experience when trying to do NER on customer support requests. My model performed great on research datasets, but it was mediocre at best on my own dataset. Do you have any suggestions on how to achieve better results in domains where mistakes, bad punctuation, etc. are common?
1 comment

Label more training data.

Do more clustering.

Label more training data.

Strip out more garbage.

Label more training data.

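PS: You can get an idea of how much value additional training data will give you by training models on growing subsets of your training set (e.g. 10%, 20%, ...), evaluating each of them against the same held-out test set, and plotting the results. If the curve is still climbing at 100%, more labels should help; if it has flattened out, they probably won't.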
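For what it's worth, here is a rough sketch of that subset experiment in plain Python. Everything in it is made up for illustration: `learning_curve`, `fit_token_tagger`, `token_accuracy` and `make_toy_data` are hypothetical names, and the "model" is just a per-token lookup table standing in for a real NER model. You would swap in your own training and scoring functions.

```python
import random
from collections import Counter, defaultdict


def learning_curve(train, test, fit, score, fractions=(0.1, 0.2, 0.4, 0.7, 1.0), seed=0):
    """Train on growing subsets of `train`, always scoring on the same `test`."""
    rng = random.Random(seed)
    shuffled = train[:]
    rng.shuffle(shuffled)  # shuffle once so each subset contains the previous one
    results = []
    for frac in fractions:
        n = max(1, round(len(shuffled) * frac))
        model = fit(shuffled[:n])
        results.append((n, score(model, test)))
    return results


# Toy stand-in for a real NER model (assumption): remember the most common tag per token.
def fit_token_tagger(examples):
    counts = defaultdict(Counter)
    for token, tag in examples:
        counts[token][tag] += 1
    return {token: c.most_common(1)[0][0] for token, c in counts.items()}


def token_accuracy(model, examples):
    # Unseen tokens fall back to the "outside" tag "O".
    hits = sum(model.get(token, "O") == tag for token, tag in examples)
    return hits / len(examples)


# Synthetic (token, tag) pairs, purely illustrative; use your labeled data instead.
def make_toy_data(n, seed):
    rng = random.Random(seed)
    vocab = [(f"name{i}", "PER") for i in range(200)]
    vocab += [(f"word{i}", "O") for i in range(200)]
    return [rng.choice(vocab) for _ in range(n)]


train = make_toy_data(2000, seed=1)
test = make_toy_data(500, seed=2)

curve = learning_curve(train, test, fit_token_tagger, token_accuracy)
for n, acc in curve:
    print(f"{n:5d} training examples -> accuracy {acc:.3f}")
```

Feed the `(n, score)` pairs into whatever plotting tool you like. The important part is that the test set never changes between runs, otherwise the points aren't comparable.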