Not_notMNIST: Generate your own datasets
1 points by RafazZ 3437 days ago
[Teaser](http://zafar.cc/images/letters.png)

[Personal Blog](http://zafar.cc/not-notmnist-dataset-generation/)

[GitHub Link](https://github.com/zafartahirov/not_notMNIST)

I wrote a little script that you can use to generate datasets for classification (like MNIST or notMNIST).

It takes the fonts you have installed and creates images plus a label/features pickle that you can load into Python.
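To give a rough idea of what "a label/features pickle" can look like on the Python side, here is a minimal sketch using only the standard library. It is not the script's actual output format: the key names `features` and `labels`, the helper names, and the tiny 2x2 "images" are all assumptions made up for illustration. Check the GitHub repository for the real layout.

```python
import os
import pickle
import tempfile


def pack_dataset(samples):
    """Group (label, pixels) pairs into one dict of parallel lists.

    The "features"/"labels" key names are hypothetical, not taken
    from the not_notMNIST script.
    """
    features, labels = [], []
    for label, pixels in samples:
        features.append(list(pixels))
        labels.append(label)
    return {"features": features, "labels": labels}


def save_dataset(dataset, path):
    """Write the dataset dict to disk as a pickle."""
    with open(path, "wb") as fh:
        pickle.dump(dataset, fh, protocol=pickle.HIGHEST_PROTOCOL)


def load_dataset(path):
    """Read a dataset dict back from a pickle file."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


def roundtrip_demo():
    """Save two made-up 2x2 grayscale glyphs and load them back."""
    samples = [("A", [0, 255, 255, 0]), ("B", [255, 0, 0, 255])]
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "letters.pickle")
        save_dataset(pack_dataset(samples), path)
        return load_dataset(path)
```

Keeping features and labels as parallel lists means the i-th label always describes the i-th image, which is the shape most classifiers expect. Only unpickle files you trust, since loading a pickle can execute arbitrary code.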

A more detailed explanation is here: http://zafar.cc/not-notmnist-dataset-generation/

I would really appreciate any critique, issues, and pull requests on GitHub: https://github.com/zafartahirov/not_notMNIST

The benefit I personally see is that you can test your classifier on datasets that involve Unicode characters. The catch is that you need a lot of fonts to generate a decent dataset. If you have a lot of fonts for your language, I would appreciate it if you could share the dataset :) I generated one using Hiragana, but I don't have licenses for many fonts, so it is more of a demo (check GitHub). I would really love to have datasets for Chinese, Arabic, Hebrew, Cyrillic, etc.
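For anyone who wants to try another script, picking the character set is the easy part. The sketch below lists the letters in a Unicode code point range and assigns each one an integer class label. The range used here (U+3041 to U+3096) covers the Hiragana letters; the function name and the label mapping are assumptions for illustration, not how the script on GitHub selects characters.

```python
import unicodedata


def characters_in_range(first, last):
    """Return the letter characters between two code points, inclusive.

    Code points whose general category is not a letter (unassigned
    slots, marks, punctuation) are skipped.
    """
    chars = []
    for code_point in range(first, last + 1):
        ch = chr(code_point)
        if unicodedata.category(ch).startswith("L"):
            chars.append(ch)
    return chars


# Hiragana letters; swap in another range for a different script.
hiragana = characters_in_range(0x3041, 0x3096)

# Hypothetical class labels: one integer per character, in code point order.
label_of = {ch: index for index, ch in enumerate(hiragana)}
```

Each character in the list would then be rendered once per font, so the number of fonts sets how many samples each class gets.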

1 comments

Great stuff!