by actuary 4886 days ago
This is absolutely not true for insurance data, where the task is to predict expected losses per policy and (in any given year) perhaps only 1% of policies will have any losses at all. Even if your statement were true, this sort of analysis has nothing to do with business intelligence. The goal is to minimize adverse selection in a competitive marketplace. There is no such thing as "good enough". (If there were, I would be out of a job.)
1 comment

Try stratified sampling. Removing records without claims only increases the variance of the denominator, which is much less variable than the numerator (the losses). You can actually eliminate the majority of the data and get results that match to several decimal places. Note that this only works with very large datasets that have no extremely high-cardinality variables.

That said, 50,000 records is too few. For a dataset of this size, 20 million records is likely more reasonable. The actual answer depends on the variance of the individual predictors and their correlation with each other.
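
For what it's worth, here is a rough sketch of the stratified idea on made-up data: keep every policy with a claim, keep only a fraction of the no-claim policies, and weight the survivors by the inverse of the keep rate. Everything here is an assumption for illustration (the synthetic portfolio, the 1% claim rate, the 10% keep rate, and the names `stratified_sample` and `weighted_mean_loss`); it is not taken from any real insurance dataset or pricing model.

```python
import random


def stratified_sample(losses, keep_rate, rng):
    """Keep every policy with a loss; keep no-loss policies with probability keep_rate.

    Returns a list of (loss, weight) pairs. Kept no-loss policies get weight
    1 / keep_rate so that they stand in for the ones that were dropped.
    """
    sample = []
    for loss in losses:
        if loss > 0:
            sample.append((loss, 1.0))
        elif rng.random() < keep_rate:
            sample.append((loss, 1.0 / keep_rate))
    return sample


def weighted_mean_loss(sample):
    """Weighted estimate of expected loss per policy."""
    total_loss = sum(loss * weight for loss, weight in sample)
    total_weight = sum(weight for _, weight in sample)
    return total_loss / total_weight


rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical portfolio: roughly 1% of policies have a loss in the year.
policies = [
    rng.expovariate(1 / 5000) if rng.random() < 0.01 else 0.0
    for _ in range(200_000)
]

full_mean = sum(policies) / len(policies)
sample = stratified_sample(policies, keep_rate=0.1, rng=rng)
sampled_mean = weighted_mean_loss(sample)

print(f"records kept: {len(sample)} of {len(policies)}")
print(f"full mean loss:    {full_mean:.2f}")
print(f"sampled mean loss: {sampled_mean:.2f}")
```

The numerator (total losses) is preserved exactly because no claim record is dropped; only the denominator (the weighted policy count) picks up sampling noise. With a toy portfolio this small the two means agree only approximately, and closer agreement would need far more records, as noted above.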