Hacker News
by patelajay285 863 days ago
Hi everyone, there are no easy tools for generating synthetic data or for training and aligning LLMs in Python. Most of what's out there is messy ad hoc scripts.

DataDreamer is an open-source Python package from the University of Pennsylvania that does all of this behind a nice API, and we’re actively developing it. I'll be here to answer questions.

https://github.com/datadreamer-dev/DataDreamer

1 comments

The API looks nice, congratulations. I'll experiment with it. One small, silly question: why did you choose to specify the dependencies inside the src dir in the requirements format, rather than inside pyproject.toml?
Thanks! It makes it easier to run with the existing run scripts I have on our large university GPU cluster. No other reason. :)