HN Mirror

by thrwayaistartup 890 days ago

Or just... write 100 good prompt-response pairs yourself.

2024 will be the year of synthetic data. 2025 will be the year of "you know you can use your own brain and type out 100 datapoints faster and cheaper than generating and filtering assloads of synthetic data, right?"

Maybe we can even skip 2024 :)

2 comments

sp332 890 days ago

Databricks had their employees write up 15,000 of them. https://www.databricks.com/blog/2023/04/12/dolly-first-open-...

icyfox 890 days ago

Favorite part of this piece:

> We were initially skeptical whether we would get to 10,000 results. But with nightly leaderboard gamification, we managed to break 15,000 results within a week. Out of fear of eating into our productivity, we closed the contest.

I've hosted a few of these corporate data labeling events. If they're sufficiently gamified and the UX is good enough, they can be surprisingly engaging. It helps a lot if you have a large employee base, though. Distributing the work over 5,000 employees is far easier than over even 50; in practice, the gap is bigger than the two orders of magnitude suggest.

code_runner 890 days ago

I’ve worked at plenty of places where we did a ton of labeling by hand.

People concerned with data quality from LLMs should really see the inconsistencies we came up with!

ThrowawayTestr 890 days ago

Anybody have this downloaded and can paste a few examples?

sp332 890 days ago

You can browse them here: https://huggingface.co/datasets/databricks/databricks-dolly-...

schreiaj 890 days ago

Yes and no. For text-type stuff? Yes, you're right. But I think in the vision space synthetic data will remain useful for a lot of things. I'm currently building a pipeline for personal projects that goes from CAD models of an environment to segmented training data. So far it looks almost as useful as real-world data at a fraction of the cost of manual labeling.
