Hacker News
by nradov 546 days ago
There is an enormous "iceberg" of untapped non-public data locked behind paywalls or licensing agreements. The next frontier will be spending money and human effort to get access to that data, then transform it into something useful for training.
1 comment

ah yes the beautiful iceberg of internal documentation, legal paperwork, and meeting notes.

the highest quality language data that exists is in the public domain