

by rhdunn 807 days ago

According to the TinyStories dataset card [1], the dataset was generated by GPT-3.5 and GPT-4. Reading the discussions in the community tab [2], it looks like the dataset contains a lot of incomplete or misspelled words, incorrect grammar, and even Chinese characters.

As such, I'd be wary of using that dataset to train or evaluate models.

[1] https://huggingface.co/datasets/roneneldan/TinyStories

[2] https://huggingface.co/datasets/roneneldan/TinyStories/discu...
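For anyone who wants to spot-check the Chinese-character issue on their own copy of the data, here is a rough sketch. It assumes the stories are already loaded as a list of strings, and it treats "Chinese characters" as the CJK Unified Ideographs block only. The function name and the sample strings are made up for illustration and are not taken from TinyStories.

```python
import re

# Assumption: "Chinese characters" means the CJK Unified Ideographs block
# (U+4E00 to U+9FFF). Extension blocks and CJK punctuation are not covered.
CJK_PATTERN = re.compile(r"[\u4e00-\u9fff]")


def find_cjk_samples(texts):
    """Return (index, text) pairs for samples containing a CJK character."""
    return [(i, t) for i, t in enumerate(texts) if CJK_PATTERN.search(t)]


# Made-up sample strings for illustration, not real dataset rows.
samples = [
    "Once upon a time, there was a little dog.",
    "The cat said 你好 to the bird.",
    "Lily went to the park",
]

for index, text in find_cjk_samples(samples):
    print(index, text)
```

This only flags one of the reported problems. Misspellings and truncated words would need a different check, such as a dictionary lookup.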

1 comment

nwoli 807 days ago

It’s just used to check that the implementation is correct. The dataset is just a toy dataset, so it doesn’t matter if it has misspelled words.
