Hacker News
by rhdunn 761 days ago
According to the TinyStories dataset card [1], the dataset was generated by GPT-3.5 and GPT-4. Judging by the discussions in the community tab [2], it contains a lot of incomplete or misspelled words, incorrect grammar, and even Chinese characters.

As such, I'd be wary of using that dataset to train or evaluate models.
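
For anyone who wants to gauge how common the stray Chinese characters are, here is a minimal sketch of a check over a list of story strings. The helper name `find_cjk_stories` and the sample strings are made up for illustration and are not taken from the dataset or its tooling. It only looks at the basic CJK Unified Ideographs block, so it would miss rarer extension blocks.

```python
import re

# Hypothetical helper, not part of any TinyStories tooling.
# Matches the basic CJK Unified Ideographs block (U+4E00 to U+9FFF) only.
CJK_PATTERN = re.compile(r"[\u4e00-\u9fff]")


def find_cjk_stories(stories):
    """Return the indices of stories containing at least one CJK ideograph."""
    return [i for i, text in enumerate(stories) if CJK_PATTERN.search(text)]


# Made-up sample strings for illustration only, not real dataset rows.
samples = [
    "Once upon a time, there was a little dog.",
    "The cat was very 快乐 and played all day.",
]
print(find_cjk_stories(samples))
```

The same function could be run over the actual story texts once they are loaded, though how the rows are loaded is left out here.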

[1] https://huggingface.co/datasets/roneneldan/TinyStories

[2] https://huggingface.co/datasets/roneneldan/TinyStories/discu...

1 comment

It’s only used to check that the implementation is correct. It’s just a toy dataset, so it doesn’t matter if it has misspelled words.