by rsmith49 2652 days ago
That is a good point. It helps to think of the precision and recall (and by extension the F1-score) measured on your test data as random variables: each is drawn from a distribution that gives the probability of observing a particular value in your sample, given some "true" underlying precision/recall value. I won't go too deep into the math, but this was part of the approach in the confidence calculations towards the end of the paper: being able to factor the uncertainty of your classification metrics into the confidence calculations.

To formally answer your question, the main things that matter in determining how stable your F1-Score from your test set is are:
- Size of the test set
- % of test set that has the label (in our case feedback tag)
- the values found for precision and recall
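If you want a feel for how those three factors interact, here is a rough simulation sketch. To be clear, this is not the method from the paper, just a quick Monte Carlo under assumptions I am making up for illustration: each test example is independent, the "true" precision and recall are fixed, and the function name `simulate_f1_spread` and its parameters are hypothetical. It draws many fake test sets and reports the mean and standard deviation of the resulting F1 scores.

```python
import random
import statistics


def simulate_f1_spread(n, prevalence, precision, recall, trials=500, seed=0):
    """Return (mean, stdev) of F1 over simulated test sets of size n.

    Assumes independent examples and fixed "true" precision/recall.
    """
    rng = random.Random(seed)
    # False-positive rate that produces the requested precision in expectation:
    # precision = p*r / (p*r + (1 - p)*fpr), solved for fpr.
    fpr = prevalence * recall * (1 - precision) / (precision * (1 - prevalence))
    if not 0 <= fpr <= 1:
        raise ValueError("precision/recall/prevalence combination is not achievable")

    scores = []
    for _ in range(trials):
        tp = fp = fn = 0
        for _ in range(n):
            if rng.random() < prevalence:   # example truly has the label
                if rng.random() < recall:
                    tp += 1                 # correctly tagged
                else:
                    fn += 1                 # missed
            elif rng.random() < fpr:        # unlabeled example tagged anyway
                fp += 1
        denom = 2 * tp + fp + fn
        # Convention chosen here: F1 is 0.0 when there is nothing to score.
        scores.append(2 * tp / denom if denom else 0.0)
    return statistics.mean(scores), statistics.stdev(scores)


if __name__ == "__main__":
    for size in (100, 1000):
        mean, sd = simulate_f1_spread(size, prevalence=0.1, precision=0.8, recall=0.7)
        print(f"n={size}: mean F1={mean:.3f}, stdev={sd:.3f}")
```

Varying `n` and `prevalence` while holding the rest fixed should show the spread shrinking as the number of labeled examples in the test set grows, which is the intuition behind the first two bullets.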