by EarlyOom 742 days ago

TLDR: There are dozens of audio transcription APIs, but nothing for video and visual transcriptions. So we built one.

If you want visual chaptering, summarization, OCR / text-extraction, audio transcriptions, and sentiment analysis on your videos, there’s really nothing out there. We tried stitching this together with several audio/video understanding APIs but kept running into rate limits, hallucinations, high costs, and poor accuracy.

Analyzing Audio Podcasts: https://vlm-docs.nos.run/guides/guide-audio-podcasts

Understanding Video Podcasts: https://vlm-docs.nos.run/guides/guide-video-podcasts

1 comments

arthurdelerue 742 days ago

I'm not sure why you say that current video transcriptions are bad. I use Whisper on NLP Cloud for video transcription (https://docs.nlpcloud.com/#automatic-speech-recognition) and it works very well.

As far as I understand, video transcription is a no-brainer as long as you install ffmpeg.


EarlyOom 741 days ago

Hi Arthur! There's a bit of confusion here. It looks like you're referring to _audio_ transcription; that is, passing the audio component into an ASR pipeline (like Whisper, Otter, etc.) to generate a transcript of any spoken words. Our pipeline is meant for fine-grained 'transcriptions' of the _visual_ content of the video: for instance, any text on screen, the contents of plots and graphs, the clothing worn by any participants, etc. (though we do transcribe the audio as well; it's a multimodal pipeline!).
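
To make the audio/visual distinction concrete, here is a minimal sketch (not the pipeline described above) of the two different ffmpeg preprocessing steps involved. The function names, the 16 kHz mono audio settings, the one-frame-per-second sampling rate, and the output naming pattern are all illustrative assumptions; only the ffmpeg flags themselves are standard.

```python
# Illustrative only: builds ffmpeg argument lists for the two kinds of input.
# Function names, sample rate, fps, and file patterns are assumptions.

def audio_extraction_cmd(video: str, wav_out: str) -> list[str]:
    """Command that drops the video stream and writes mono 16 kHz audio,
    the kind of input typically handed to an ASR model."""
    return ["ffmpeg", "-i", video, "-vn", "-ac", "1", "-ar", "16000", wav_out]


def frame_sampling_cmd(video: str, frames_dir: str, fps: float = 1.0) -> list[str]:
    """Command that samples still frames at a fixed rate, the kind of input
    a visual model (OCR, captioning, etc.) would consume."""
    pattern = f"{frames_dir}/frame_%05d.png"
    return ["ffmpeg", "-i", video, "-vf", f"fps={fps}", pattern]


if __name__ == "__main__":
    # Print the commands instead of running them, so ffmpeg is not required.
    print(" ".join(audio_extraction_cmd("talk.mp4", "talk.wav")))
    print(" ".join(frame_sampling_cmd("talk.mp4", "frames")))
```

The first command is what "video transcription" usually means in practice (audio only); the second produces the frames that on-screen text, plots, and other visual content would have to be read from.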
