by rsmith49
2652 days ago
That is a good point. It helps to think of the precision and recall (and, by extension, the F1-score) measured on your test data as random variables: each one is a sample from a distribution that gives the probability of observing that value given some "true" underlying precision/recall. I won't go too deep into the math, but this was part of the approach in the confidence calculations towards the end of the paper: factoring the uncertainty of your classification metrics into the confidence calculations. To formally answer your question, the main things that determine how stable the F1-score from your test set is are:
- Size of the test set
- % of the test set that has the label (in our case, the feedback tag)
- The values found for precision and recall
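
To get a feel for how those three factors interact, here is a rough Monte Carlo sketch. To be clear, this is not the calculation from the paper, just a simulation I'm assuming is a reasonable stand-in: it pretends the classifier has fixed "true" precision and recall, draws many random test sets of a given size and label frequency, and looks at how much the measured F1-score moves around. The function name `simulate_f1` and all of its parameters are made up for illustration.

```python
import random
import statistics


def simulate_f1(n, prevalence, precision, recall, trials=200, seed=0):
    """Return the F1-scores observed on `trials` simulated test sets of size n.

    Assumption: each example is labeled independently with probability
    `prevalence`, and the classifier behaves according to fixed "true"
    precision and recall values.
    """
    if not (0 < prevalence < 1 and 0 < precision <= 1 and 0 <= recall <= 1):
        raise ValueError("need prevalence in (0,1), precision in (0,1], recall in [0,1]")
    # False-positive rate implied by the requested precision at this prevalence:
    # precision = prev*recall / (prev*recall + (1 - prev)*fpr), solved for fpr.
    fpr = prevalence * recall * (1 - precision) / (precision * (1 - prevalence))
    if fpr > 1:
        raise ValueError("precision/recall not reachable at this prevalence")

    rng = random.Random(seed)
    scores = []
    for _ in range(trials):
        tp = fp = fn = 0
        for _ in range(n):
            is_positive = rng.random() < prevalence
            predicted = rng.random() < (recall if is_positive else fpr)
            if is_positive and predicted:
                tp += 1
            elif predicted:
                fp += 1
            elif is_positive:
                fn += 1
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); treat the degenerate empty case as 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return scores


if __name__ == "__main__":
    # Compare the spread of measured F1 for a small and a larger test set.
    for n in (100, 1000):
        scores = simulate_f1(n, prevalence=0.2, precision=0.8, recall=0.7)
        print(n, round(statistics.mean(scores), 3), round(statistics.stdev(scores), 3))
```

The standard deviation printed for each test-set size is a rough measure of how stable the F1-score is; rerunning with a lower `prevalence` or different precision/recall values shows the effect of the other two factors.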