
I'm the author of the post. I've wanted a tool like this for a while, and I happened to come across the SMOTE paper recently. The algorithm is simple enough to throw together a prototype in a few hours, and it requires very little understanding of the data set.

I was looking for a balance between speed/effort and statistical robustness. I wanted a big data set for testing Pilosa performance, not for training ML models or anything else that really cares about the statistics. However, hundreds of repeated records can make histograms look glitchy, so I wanted to avoid the naive approach of just duplicating records. Something like SMOTE fit that need well.
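For anyone who hasn't read the paper, here is a rough sketch of the core SMOTE idea, not the code from the post. Each synthetic record is a random point on the line segment between an existing record and one of its k nearest neighbors, so new records land between the originals instead of repeating them. The function name `smote_like` and its parameters are made up for illustration, and it assumes purely numeric records and at least two of them.

```python
import math
import random


def smote_like(records, n_new, k=5, seed=0):
    """Generate n_new synthetic records by interpolating between neighbors.

    Hypothetical helper for illustration: records is a list of equal-length
    numeric tuples with at least two entries.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(records)
        # k nearest neighbors of base by Euclidean distance,
        # excluding the base record itself (brute force, O(n log n) per sample).
        neighbors = sorted(
            (r for r in records if r is not base),
            key=lambda r: math.dist(base, r),
        )[:k]
        neighbor = rng.choice(neighbors)
        # Random position along the segment from base to neighbor.
        t = rng.random()
        synthetic.append(
            tuple(b + t * (n - b) for b, n in zip(base, neighbor))
        )
    return synthetic


if __name__ == "__main__":
    data = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
    for row in smote_like(data, n_new=3, k=2):
        print(row)
```

The brute-force neighbor search is fine for a prototype on a small seed data set; for a large one you would probably want a proper nearest-neighbor index. Categorical fields also need separate handling, since interpolating between them makes no sense.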