| HN Mirror


	by samstave 886 days ago
	Aside: is it just me, or is anyone else just as dumbfounded with how quickly literally every aspect of AI and LLMs and Models and blah blah blah is going? Am I weird in just having my head spin - even though I've also been at leading edge tech before, but this is just me yelling at these new algos on my lawn?

2 comments

jazzyjackson 886 days ago

on the contrary I'm really disappointed in how long it's taking anything to get into production.

Whisper and self-hostable LLMs had a Cambrian explosion about a year ago. I attended a GPT-4 hackathon last March and in 48 hours saw people hook up Speech2Text -> LLM -> Text2Speech pipelines for their live demos. I thought we would all have a Babel fish by June.

Months later I attended some conferences with international speakers who really wanted live, translated-on-the-fly captions, but there wasn't anything off the shelf they could use. I found a helpful repo for using Whisper with rolling transcription but struggled to get the Python prerequisites installed (involving hardlinking to a TensorFlow repo for my particular version of M1 CPU). It was humbling and also hype-busting to realize that it takes time to productize, and that LLMs are not magic that can write these applications themselves.
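For anyone wondering what "rolling transcription" means in practice, here is a minimal sketch of the idea, not the code from the repo mentioned above. It assumes audio arrives in fixed-size chunks and that `transcribe` is whatever speech-to-text call you have available; `rolling_transcribe`, `fake_transcribe`, and `window_chunks` are hypothetical names made up for illustration.

```python
from collections import deque
from typing import Callable, Deque, Iterable, Iterator, List


def rolling_transcribe(
    chunks: Iterable[List[float]],
    transcribe: Callable[[List[float]], str],
    window_chunks: int = 3,
) -> Iterator[str]:
    """Yield an updated caption every time a new audio chunk arrives."""
    # Keep only the most recent chunks; older audio falls off the left side.
    window: Deque[List[float]] = deque(maxlen=window_chunks)
    for chunk in chunks:
        window.append(chunk)
        # Re-transcribe the whole window so earlier words can be revised
        # once more context is available.
        samples = [s for c in window for s in c]
        yield transcribe(samples)


def fake_transcribe(samples: List[float]) -> str:
    # Stand-in for a real model call (assumption: a real version would
    # pass the samples to a speech-to-text model here).
    return f"{len(samples)} samples"


if __name__ == "__main__":
    audio = [[0.0, 0.0] for _ in range(4)]  # four tiny dummy chunks
    for caption in rolling_transcribe(audio, fake_transcribe):
        print(caption)
```

The loop is the easy part. The hard parts are the ones the comment describes: getting a model installed, keeping latency low, and deciding when a caption is final.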

In the meantime even Google hasn't bothered to run the improved transcription models on YouTube videos. The captions are still the old 80%-accurate tech that's useless on anyone with an accent.

huijzer 886 days ago

> on the contrary I'm really disappointed in how long it's taking anything to get into production.

I agree. I was thinking about making a Jarvis-like bot, which should be pretty easy at this point. The main problem was that my iPhone doesn’t easily allow for pressing a button upon which it starts listening. You always need to unlock first, at which point the whole screen gets unlocked too. Maybe these kinds of GUI-focused interfaces are blocking a lot of ideas? At the same time it’s great that people will come up with new devices and these will compete somewhat with phones.

Havoc 886 days ago

The tap on back might work without unlock, and I think that can be set to a custom shortcut.

jazzyjackson 886 days ago

Smart. I just played around with it: I can use the Shortcuts app to create an action (like turn off the lamp or 'identify music via Shazam') and then go to Accessibility -> Touch settings and use double tap to trigger the action.

It works without unlocking, but the phone has to be awake. So I can hit the power button once to wake it and then double tap the back to trigger a shortcut. Confirmed to work without needing to identify me via Face ID or anything.

There is also an accessibility shortcut for triple-clicking the side button but it only allows for toggling accessibility features.

vineet202 886 days ago

Check out WhisperLive: https://github.com/collabora/WhisperLive

If you're grappling with the slow march from cool tech demos to real-world language model apps, you might wanna check out WhisperLive. It's this rad open-source project that’s all about leveraging Whisper models for slick live transcription. Think real-time, on-the-fly translated captions for those global meetups. It's a neat example of practical, user-focused tech in action. Dive into the details on their GitHub page.

taneq 886 days ago

> I'm really disappointed in how long it's taking anything to get into production.

> It was humbling and also hype-busting to realize that it takes time to productize

Yep, looks like you found out why it’s taking so long to get this new tech into production. The gap between nothing and a proof of concept is, in some ways, much smaller than the gap between proof of concept and commercial product.

ricketycricket 886 days ago

I built this last March. It captures audio from a live HLS stream and transcribes and translates into 18 languages on the fly. Used by a customer with about 25K international employees for their internal events. Works surprisingly well.

jazzyjackson 886 days ago

Fabulous, guess that's the other part of productizing: a paying customer!

Rodeoclash 886 days ago

I'd be interested if you ever dig anything up for this. I hacked together a kind of crude tool to snapshot audio and translate / caption it on the fly:

https://captioner.richardson.co.nz/

I would very much like to improve on this but the live translation / captioning still has some more work to go in this space.

Source was here: https://github.com/Rodeoclash/captioner

follower 886 days ago

I was going to suggest considering looking into vosk but... clearly that suggestion isn't very useful to you. :)

pksebben 886 days ago

I have a similar frustration with the lack of tooling around all this stuff.

Like, you had the time to train a bajillion-parameter model with a ton of attendant code, but an installation script was a bridge too far. I get that Python dependency management sucks, but you had to do it at least once for yourself.

Of course, here I am reinstalling cuDNN for the umpteenth time because this software is provided free of charge and it sprinkles magical fairy dust on my GPU, so perhaps I shouldn't whine about it.

samstave 886 days ago

You're focused on Whisper/voice stuff...

I was making a more general statement... I haven't even had time to personally look at any voice stuff...

Too many Shiny Things and too much ADHD in the Kool-Aid.

colechristensen 886 days ago

This is the structure of revolutions, particularly of this kind. Exponential growth looks like this.

In particular with the generation / recognition abilities of ML models, they have this feature of being a curiosity but not quite useful... so if a speech recognition program goes from 50% accuracy to 75% accuracy it's a huge accomplishment but the program is still approximately as useless when it's done. Going from 98% to 99% accuracy on the other hand still cuts the errors in half, but it's super impressive going from something that's useful but makes mistakes to making half as many mistakes. Once you hit the threshold of minimum usefulness the exponential growth seems like it's sudden and amazing when it's actually been going on for a long time.
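To put rough numbers on that, here is a small sketch of the arithmetic. It assumes "accuracy" means the fraction of words transcribed correctly, which the comment doesn't specify; the function names are made up for illustration.

```python
def errors_per_thousand(accuracy: float, words: int = 1000) -> int:
    """Expected number of wrong words in a transcript of `words` words."""
    return round((1.0 - accuracy) * words)


def error_reduction(before: float, after: float) -> float:
    """Fraction of errors eliminated when accuracy improves."""
    return 1.0 - (1.0 - after) / (1.0 - before)


if __name__ == "__main__":
    for before, after in [(0.50, 0.75), (0.98, 0.99)]:
        print(
            f"{before:.0%} -> {after:.0%}: "
            f"{errors_per_thousand(before)} -> {errors_per_thousand(after)} "
            f"errors per 1000 words, "
            f"{error_reduction(before, after):.0%} of errors removed"
        )
```

Both jumps halve the error rate, but 250 wrong words per thousand is still unreadable, while 10 per thousand is something you can actually ship.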

At the same time, we've had a few great improvements in methodology in how models are designed (like transformers). The first iterations showed how impressive things could be but were full of inefficiencies, and we're watching those go away rather quickly.

lioeters 886 days ago

> structure of revolutions

For anyone who hasn't heard of it, this phrase is a reference to the theory of paradigm shifts in scientific progress, introduced in the book "The Structure of Scientific Revolutions" by Thomas Kuhn.

https://en.wikipedia.org/wiki/The_Structure_of_Scientific_Re...
