Hacker News new | ask | show | jobs
by buyucu 483 days ago
Opening data is an invitation to lawsuits. That is why even the most die-hard open source enthusiasts are reluctant. It is also why people train a model and generate data with it, rather than sharing the original datasets.

These datasets are huge, and it's practically impossible to make sure they are clean of illegal or embarrassing stuff.

2 comments

Sounds like a job for AI.
I understand the reasoning and I hope there is legislation in the future that basically goes "If you can't produce the data, you can't charge more than this for it". Basically, LLM producers will have to treat their product as a commodity product that can only be priced based on the compute resources plus some overhead.