Hacker News
by thrwayaistartup 890 days ago
Or just... write 100 good prompt-response pairs yourself.

2024 will be the year of synthetic data. 2025 will be the year of "you know you can use your own brain and type out 100 datapoints faster and cheaper than generating and filtering assloads of synthetic data, right?"

Maybe we can even skip 2024 :)

2 comments

Databricks had their employees write up 15,000 of them. https://www.databricks.com/blog/2023/04/12/dolly-first-open-...
Favorite part of this piece:

> We were initially skeptical whether we would get to 10,000 results. But with nightly leaderboard gamification, we managed to break 15,000 results within a week. Out of fear of eating into our productivity, we closed the contest.

I've hosted a few of these corporate data labeling events. If they're sufficiently gamified and the UX is good enough, they can be surprisingly engaging. It helps a lot if you have a large employee base though. Spreading the work over 5,000 employees is far easier than over even 50; in practice the gap is bigger than the two orders of magnitude would suggest.

I’ve worked at plenty of places where we did a ton of labeling by hand.

People concerned with data quality from LLMs should really see the inconsistencies we came up with!

Anybody have this downloaded and can paste a few examples?

Yes and no. For text-type stuff? Yes, you're right. But I think in the vision space synthetic data will remain useful for a lot of things. I'm currently building a pipeline for personal projects that goes from CAD models of an environment to segmented training data. So far it looks almost as useful as real-world data at a fraction of the cost of manual labeling.