Hacker News
by preetsojitra, 38 days ago
Meta's Perception Encoder Audio-Visual is CLIP-like but covers three modalities: audio, video, and text.