Hacker News
by smj-edison, 120 days ago
Could it be because a lot of the training data is captions? At least if I'm remembering correctly, that's what they use to create the association between what's seen and what's said.