by jazzyjackson 885 days ago
On the contrary, I'm really disappointed in how long it's taking anything to get into production.

Whisper and self-hostable LLMs had a Cambrian explosion about a year ago. I attended a GPT-4 hackathon last March and in 48 hours saw people hook up Speech2Text -> LLM -> Text2Speech pipelines for their live demos. I thought we would all have a babelfish by June.
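For anyone wondering why the hackathon demos came together so quickly, here is a minimal sketch of how a Speech2Text -> LLM -> Text2Speech chain can be wired up. Everything in it is an assumption for illustration: `make_pipeline` and the three `fake_*` stages are hypothetical stand-ins, not what any hackathon team actually built, and a real version would wrap an actual recognizer, model call, and synthesizer.

```python
from typing import Callable

# Each stage is just a function. The types below are illustrative only.
SpeechToText = Callable[[bytes], str]
TextToText = Callable[[str], str]
TextToSpeech = Callable[[str], bytes]


def make_pipeline(stt: SpeechToText, llm: TextToText, tts: TextToSpeech) -> Callable[[bytes], bytes]:
    """Chain the three stages into one audio-in, audio-out function."""
    def run(audio: bytes) -> bytes:
        text = stt(audio)    # audio -> transcript
        reply = llm(text)    # transcript -> response (or translation)
        return tts(reply)    # response -> audio
    return run


# Toy stand-ins (hypothetical) so the wiring can be run without any models.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(text: str) -> str:
    return text.upper()


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = make_pipeline(fake_stt, fake_llm, fake_tts)
```

The glue really is that small, which is probably why it fits in a 48-hour demo. Latency, streaming, error handling, and installation are where the remaining time goes.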

Months later I attended some conferences with international speakers who really wanted live, translated-on-the-fly captions, but there wasn't anything off the shelf they could use. I found a helpful repo for using Whisper with rolling transcription but struggled to get the Python prerequisites installed (it involved hardlinking to a TensorFlow repo for my particular version of M1 CPU). It was humbling and also hype-busting to realize that it takes time to productize, and that the LLMs are not magic that can write these applications themselves.
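For context, by "rolling transcription" I mean roughly the following: keep a sliding window of the most recent audio and re-transcribe it every time a new chunk arrives. This is only a sketch of the general idea, not the code from the repo I used. `RollingTranscriber`, `window_chunks`, and `fake_transcribe` are names made up for illustration, and the real transcriber would be a call into a speech model.

```python
from collections import deque
from typing import Callable, Deque, List


class RollingTranscriber:
    """Re-transcribe a sliding window of the most recent audio chunks."""

    def __init__(self, transcribe: Callable[[List[float]], str], window_chunks: int = 3) -> None:
        self.transcribe = transcribe
        # deque(maxlen=...) silently drops the oldest chunk once the window is full.
        self.chunks: Deque[List[float]] = deque(maxlen=window_chunks)

    def feed(self, chunk: List[float]) -> str:
        """Add one chunk of samples and return the transcript of the current window."""
        self.chunks.append(chunk)
        window = [sample for c in self.chunks for sample in c]
        return self.transcribe(window)


# Hypothetical stand-in transcriber: reports how much audio it was given.
def fake_transcribe(samples: List[float]) -> str:
    return f"{len(samples)} samples"


rt = RollingTranscriber(fake_transcribe, window_chunks=2)
first = rt.feed([0.0] * 4)   # window holds 1 chunk
second = rt.feed([0.0] * 4)  # window holds 2 chunks
third = rt.feed([0.0] * 4)   # oldest chunk dropped, still 2 chunks
```

Overlapping windows give the model context across chunk boundaries, at the cost of transcribing the same audio more than once.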

In the meantime, even Google hasn't bothered to run the improved transcription models on YouTube videos. The captions still use old, 80%-accurate tech that's useless for anyone with an accent.

7 comments

> On the contrary, I'm really disappointed in how long it's taking anything to get into production.

I agree. I was thinking about making a Jarvis-like bot, which should be pretty easy at this point. The main problem was that my iPhone doesn’t easily allow for pressing a button to start listening. You always need to unlock first, at which point the whole screen gets unlocked too. Maybe these kinds of GUI-focused interfaces are blocking a lot of ideas? At the same time, it’s great that people will come up with new devices, and these will compete somewhat with phones.

The tap on the back might work without unlocking, and I think it can be set to a custom shortcut.
Smart. I just played around with it: I can use the Shortcuts app to create an action (like turning off the lamp or 'identify music via Shazam') and then go to Accessibility -> Touch settings and use double tap to trigger the action.

It works without unlocking, but the phone has to be awake, so I can hit the power button once to wake it and then double-tap the back to trigger a shortcut. Confirmed to work without needing to identify me via Face ID or anything.

There is also an accessibility shortcut for triple-clicking the side button but it only allows for toggling accessibility features.

Check out WhisperLive: https://github.com/collabora/WhisperLive

If you're grappling with the slow march from cool tech demos to real-world language model apps, you might wanna check out WhisperLive. It's this rad open-source project that’s all about leveraging Whisper models for slick live transcription. Think real-time, on-the-fly translated captions for those global meetups. It's a neat example of practical, user-focused tech in action. Dive into the details on their GitHub page.

> I'm really disappointed in how long it's taking anything to get into production.

> It was humbling and also hype-busting to realize that it takes time to productize

Yep, looks like you found out why it’s taking so long to get this new tech into production. The gap between nothing and a proof of concept is, in some ways, much smaller than the gap between proof of concept and commercial product.

I built this last March. It captures audio from a live HLS stream and transcribes and translates into 18 languages on the fly. Used by a customer with about 25K international employees for their internal events. Works surprisingly well.
Fabulous, guess that's the other part of productizing: a paying customer!
I'd be interested if you ever dig anything up for this. I hacked together a kind of crude tool to snapshot audio and translate / caption it on the fly:

https://captioner.richardson.co.nz/

I would very much like to improve on this, but live translation / captioning still has a way to go.

Source was here: https://github.com/Rodeoclash/captioner

I was going to suggest looking into Vosk but... clearly that suggestion isn't very useful to you. :)
I have a similar frustration with the lack of tooling around all this stuff.

Like, you had the time to train a bajillion-parameter model with a ton of attendant code, but an installation script was a bridge too far. I get that Python dependency management sucks, but you had to do it at least once for yourself.

Of course, here I am reinstalling cuDNN for the umpteenth time because this software is provided free of charge and it sprinkles magical fairy dust on my GPU, so perhaps I shouldn't whine about it.

You're focused on Whisper/voice stuff...

I was making a more general statement... I haven't even had time to personally look at any voice stuff...

Too many Shiny Things and too much ADHD in the Kool-Aid.