Hacker News
by smj-edison 120 days ago
Could it be because a lot of the training data is captions? At least if I'm remembering correctly, that's what they use to create the association between what's seen and what's said.