by preetsojitra 38 days ago
Meta's Perception Encoder Audio-Visual. It's CLIP-like but has three modalities: audio, video, and text.