How to generate an arbitrarily large amount of test data


	How to generate an arbitrarily large amount of test data (pilosa.com)
	5 points by skingkong 2496 days ago

1 comments

akalmans 2496 days ago

The SMOTE algorithm is fairly old - isn't there something newer that may be more relevant?


alanbernstein 2496 days ago

I'm the author of the post. I've wanted a tool like this for a while, and I coincidentally discovered the SMOTE paper recently. It's simple enough to throw together a prototype in a few hours, and it requires very little understanding of the data set.

I was looking for something with a certain balance between speed/effort and statistical robustness. I wanted a big data set for testing pilosa performance, not for training ML models, or anything that really cares about the statistics. However, hundreds of repeated records can make histograms look glitchy, so I wanted to avoid that naive approach. Something like SMOTE fit that need well.
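For readers who have not seen SMOTE, the core idea can be sketched in a few lines. This is not the code from the post: the function name `smote_like_records`, its parameters, and the sample records are hypothetical, and the sketch assumes purely numeric fields and a brute-force neighbor search. Each synthetic record is a random point on the line segment between a real record and one of its nearest neighbors, which is why the output avoids exact repeats.

```python
import math
import random


def smote_like_records(records, n_new, k=3, seed=0):
    """Generate n_new synthetic numeric records by interpolating between
    a randomly chosen record and one of its k nearest neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(records)
        # Rank the other records by Euclidean distance to the base record.
        others = [r for r in records if r is not base]
        others.sort(key=lambda r: math.dist(base, r))
        neighbor = rng.choice(others[:k])
        # Pick a random point on the segment between base and neighbor.
        gap = rng.random()
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbor))
        )
    return synthetic


# Hypothetical seed data: five two-field records.
seed_records = [(1.0, 10.0), (2.0, 12.0), (3.0, 9.0), (8.0, 30.0), (9.0, 28.0)]
generated = smote_like_records(seed_records, n_new=1000)
```

The neighbor search here re-sorts the whole seed set for every new record, which is fine for a small seed set but would need an index or a precomputed neighbor table for a large one.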


juandes 2489 days ago

I agree with you. I have a bit of experience with SMOTE, and what keeps me using it is its simplicity and versatility. Like you, I wrote a small prototype a couple of days ago, in my case to balance an already synthetic dataset, and I was very satisfied with the results. I'll share it here in case you are interested:

https://kite.com/blog/python/smote-python-imbalanced-learn-f...
