Hacker News
by gcollard- 547 days ago
All the publicly accessible sources you mentioned have already been scraped or licensed to avoid legal issues. This is why it’s often said, “there’s no public data left to train on.”

For evidence of this, watch young children (ages 2–6) who don’t speak English use ChatGPT’s voice mode. The multimodal model often transcribes much of their speech as “thank you for watching my video,” reflecting child-like speech patterns learned from YouTube videos.