Hacker News
by rememberlenny 1711 days ago
I would strongly advise against using Google's ML APIs.

First, at my company Milk Video, we are huge fans of Assembly AI. The quality, speed and cost of their transcription is galaxies beyond the competition.

Having worked in machine learning focused companies for a few years, I have been researching this exact question. I'm curious how I can better forecast the amount of ML talent I should expect to build into our team (we are a seed stage company), and how much I can confidently outsource to best-in-class.

A lot of the ML services we use now are utilities that we don't want to manage (speech-to-text, video content processing, etc.) but do want to see improve. We took a lot of time to decide who to outsource these things to, like working with AssemblyAI, because we were very conscious of the pace of improvement in speech-to-text quality.

When we were comparing products, the most important questions were:

1. How accurate is the speech-to-text API

1.a Word error rate

1.b Time attributed to start/end word

2. How fast does it process our content

3. How much does it cost
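
For anyone who wants to run the accuracy comparison in 1.a themselves, here is a minimal sketch of word error rate, assuming the usual definition (word-level edit distance divided by the number of words in a human-made reference transcript). The function name and the lowercase/whitespace normalization are my own assumptions, not something any vendor prescribes.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / words in reference."""
    # Assumption: lowercase + whitespace split is enough normalization here.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)
```

Run the same reference clip through each service and compare the scores; lower is better, and the number can exceed 1.0 when the hypothesis has many extra words.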

AssemblyAI was the only tool that used modern web patterns (i.e., not Google's horrible API, or non-tech companies trying to provide transcript services), which made it easy to integrate in a short Sunday morning. The API is also surprisingly better than other speech-to-text services because it's trained for the kind of audio/video content being produced today (instead of old call center data or perfect audio from studio-grade media).

Google's API forced you to manage your asset hosting in GCP and handle tons of unnecessary configuration around auth/file access/identity, and it's insanely slow and inaccurate. Some other transcription services we used were embarrassingly horrible from a developer experience perspective, in that they also required you to actually talk to a person before giving you access.

The reason Assembly is so great is that you can literally make an API request with a media file URL (video or audio), and boom, you get a nice, intuitive JSON-formatted transcript response. You can also add params to get speakers, topic analysis, and personal information detection; it's just a matter of changing the payload in the first API request.
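
To make that concrete, here is a rough sketch of what the first request can look like. The endpoint URL, the `audio_url` and `speaker_labels` field names, and the `authorization` header are my assumptions about the v2 REST API; check them against the current AssemblyAI docs before relying on them. `build_transcript_request` is a hypothetical helper, and it only builds the request so you can inspect it without touching the network.

```python
import json
import urllib.request

# Assumed endpoint; verify against the provider's current documentation.
TRANSCRIPT_ENDPOINT = "https://api.assemblyai.com/v2/transcript"


def build_transcript_request(media_url, api_key, speaker_labels=False):
    """Build (but do not send) a POST request asking for a transcript."""
    # Assumed field name for the publicly reachable audio/video file URL.
    payload = {"audio_url": media_url}
    if speaker_labels:
        # Assumed flag for speaker diarization; extra features are just
        # extra keys in this same payload.
        payload["speaker_labels"] = True
    return urllib.request.Request(
        TRANSCRIPT_ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "authorization": api_key,
            "content-type": "application/json",
        },
        method="POST",
    )
```

Sending it is then a single `urllib.request.urlopen(request)` call and a `json.load` on the response. Turning on another feature means adding one more key to `payload`, which is the whole appeal.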

I'm very passionate about this because I spent so much time fighting previously implemented transcript services, and want to help anyone avoid the pain because Assembly really does it correctly.

1 comments

How good is their speaker labeling? We've been using the Google API but their diarization has been basically unusable for our application (transcripts of group conversations).

Dylan from Assembly here. If you want to send me one of your audio files (my email is in my profile) I'd be happy to send you back the diarized results from our API.

You can also sign up for a free account and test from the dashboard without having to write any code if that's easier.

Other than lots of crosstalk in your group conversations, is there anything else challenging about your audio (e.g., distance from microphones, background noise, etc.)?