Subformer: Multilingual video dubbing with speaker diarization and voice cloning

Hi HN,

We built Subformer (https://subformer.com), a web app that dubs videos into other languages while keeping speaker identity intact.

Most “AI dubbing” pipelines are just ASR → translation → TTS, which breaks as soon as you have multiple speakers. We instead run:

- VAD + speaker diarization
- Audio demixing
- Global speaker clustering
- Per-segment ASR + translation
- Per-speaker TTS (voice cloning or synthetic)
- Timeline-aligned remuxing back into the video
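
As a rough illustration of the global speaker clustering step, here is a minimal sketch. It is an assumption about how such a step can work, not necessarily what Subformer runs. It assumes diarization is done per chunk, so each chunk has its own local labels, and that every local speaker comes with an embedding vector. The function name `cluster_speakers`, the input layout, and the 0.75 threshold are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def cluster_speakers(local_speakers, threshold=0.75):
    """Greedy centroid clustering of per-chunk speakers.

    local_speakers: list of (chunk_id, local_label, embedding).
    Returns a dict mapping (chunk_id, local_label) -> global speaker id.
    The threshold is an illustrative value, not a tuned one.
    """
    centroids = []  # one (vector_sum, count) pair per global speaker
    mapping = {}
    for chunk_id, label, emb in local_speakers:
        best, best_sim = None, threshold
        for gid, (vec_sum, count) in enumerate(centroids):
            centroid = [v / count for v in vec_sum]
            sim = cosine(emb, centroid)
            if sim >= best_sim:
                best, best_sim = gid, sim
        if best is None:
            # No existing speaker is similar enough: start a new one.
            centroids.append((list(emb), 1))
            best = len(centroids) - 1
        else:
            # Fold this embedding into the matched speaker's running sum.
            vec_sum, count = centroids[best]
            centroids[best] = ([s + e for s, e in zip(vec_sum, emb)], count + 1)
        mapping[(chunk_id, label)] = best
    return mapping


# Toy example: local labels "A" and "B" are swapped between chunk 0 and chunk 1.
local = [
    (0, "A", [1.0, 0.0]),
    (0, "B", [0.0, 1.0]),
    (1, "A", [0.1, 0.99]),
    (1, "B", [0.98, 0.05]),
]
speaker_ids = cluster_speakers(local)
```

In the toy example, chunk 1's "A" ends up with the same global id as chunk 0's "B", which is the relabeling a per-speaker TTS stage needs in order to keep one voice per person across the whole video.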

The tricky parts were diarization drift on long videos, timing mismatches after translation, and keeping costs sane when doing multilingual TTS at scale.
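
To make the timing mismatch concrete: translated speech is often longer than the original segment, so the dubbed audio has to be squeezed into its slot somehow. The sketch below shows one simple policy, under assumed names and limits (`fit_segment`, the 1.25x speed cap). It first borrows any silence before the next segment, then speeds up within a cap, and reports whatever still overflows. This is an illustration, not a description of Subformer's actual aligner.

```python
def fit_segment(slot_start, slot_end, tts_duration, next_start=None, max_speedup=1.25):
    """Decide how to place a synthesized segment on the original timeline.

    All times are in seconds. Returns (tempo, played_duration, overflow):
      tempo            playback speed factor to apply to the TTS audio (>= 1.0)
      played_duration  duration of the TTS audio after the speed change
      overflow         seconds still spilling past the available window
    max_speedup is a hypothetical cap chosen for illustration.
    """
    slot = slot_end - slot_start
    if slot <= 0 or tts_duration <= 0:
        raise ValueError("slot and tts_duration must be positive")

    # Available window: the original slot, extended into the silence
    # before the next segment when that start time is known.
    available = slot
    if next_start is not None:
        available = max(slot, next_start - slot_start)

    # Speed up only as much as needed, and never beyond the cap.
    tempo = min(max_speedup, max(1.0, tts_duration / available))
    played = tts_duration / tempo
    overflow = max(0.0, played - available)
    return tempo, played, overflow


# Fits by borrowing the 0.5 s gap before the next segment, no speed change.
fits = fit_segment(10.0, 12.0, 2.4, next_start=12.5)

# Too long even at the cap: the caller must shorten the translation or trim.
too_long = fit_segment(0.0, 2.0, 3.0, next_start=2.0)
```

A nonzero overflow is the signal to fall back to something else, for example asking for a shorter translation of that segment. The tempo factor itself could be applied with a time-stretch step such as ffmpeg's `atempo` filter.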

It’s still early, but it already works well for things like interviews, TV clips, and YouTube videos with multiple speakers.

Would love feedback from people who work on audio, speech, or localization.

https://subformer.com