by Der_Einzige 1843 days ago
I'm still upset about the Pile (the dataset used to train this model), because I opened an issue asking whether they would include my dataset, DebateSum, in it[1]. They said that they planned to and even opened another issue indicating that it would be included in the Pile.

Months went by, and I asked why the dataset was still not included. They told me that the Pile is complete and will not be updated anymore. I didn't have time to write my own PR for this and had no idea that there was a time limit to get it in.

Makes me sad. I don't know why this company has decided to stop updating the Pile...

[1] https://github.com/EleutherAI/the-pile/issues/56

2 comments

They're not a company. They decided not to add more to the Pile because they consider it complete enough. However, you could go to the Discord and propose restarting the Pile project; they allow that. On GitHub, it looks like the next step after a restart would be the Pile v2, a multilingual dataset. Good luck!
You should fork it! I'd use your fork.