HN Mirror

by kkielhofner 1082 days ago

> Yes, for my demo I am using whisper.cpp however there is an implementation that also uses faster-whisper.

The benchmarks I referenced above show a GTX 1070 beating a Threadripper PRO 5955WX by at least 5x. Our inference server implementation runs CPU-only as well and is based on the same core as faster-whisper (ctranslate2), but our feature extractor and audio handling make it slightly faster than faster-whisper. The general point is that GPUs are so vastly different, architecturally and physically - a $1K CPU can barely do large-v2 in realtime, while a $1K RTX 3090 is 17x realtime (a 4090 is 27x realtime).
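To make the realtime multiples concrete, here is a rough back-of-the-envelope sketch. It assumes a 3 second utterance and treats "barely realtime" on the CPU as roughly 1.0x; the function and dictionary names are made up for illustration, and the factors are just the figures quoted above, not new measurements.

```python
def transcription_seconds(audio_seconds: float, realtime_factor: float) -> float:
    """Time to transcribe `audio_seconds` of audio at a given multiple of realtime."""
    if realtime_factor <= 0:
        raise ValueError("realtime_factor must be positive")
    return audio_seconds / realtime_factor


# Realtime factors for large-v2 as quoted in the comment.
# The CPU value is an assumption: "barely realtime" taken as ~1.0x.
REALTIME_FACTORS = {
    "$1K CPU (assumed ~1.0x)": 1.0,
    "RTX 3090": 17.0,
    "RTX 4090": 27.0,
}

if __name__ == "__main__":
    utterance = 3.0  # seconds of speech, an assumed example length
    for name, factor in REALTIME_FACTORS.items():
        seconds = transcription_seconds(utterance, factor)
        print(f"{name}: {seconds:.2f} s to transcribe {utterance:.0f} s of speech")
```

This only covers the speech-to-text step; the LLM and TTS stages add their own latency on top.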

Many online demos that feature local Whisper use tiny - we've found that in the real world, under real conditions, Whisper medium is the minimum for quality speech recognition with these tasks, and many of our users end up using large-v2. Using the same benchmarks above, this puts the floor for response time at 1.5 seconds (medium) for ~3 seconds of speech on CPU - just to get the transcript. I understand you're early, but if you can eventually break five seconds with this all-local on any CPU in the world I would be very, very surprised and impressed! I suspect you'll find that even with the worst internet connection in the world OpenAI is still faster than llama.cpp, etc. on CPU (they use GPUs, of course).

With highly tuned Whisper, LLM, and TTS, our inference server is around three to four seconds all in for this task on an RTX 3090, and I don't consider that usable (the LLM is almost all of that). Imagine trying to have a conversation with a person and every time you say something they stare at you blankly for 5-10 seconds (or more). Frustrating to say the least...

I suppose the point is that for these tasks Apple, Amazon, Google, OpenAI, etc. all use GPUs (or equivalent) for their commercial products, and that is the benchmark in terms of user expectations - and it's often still not fast enough and merely tolerated. For these tasks, if you're bringing a CPU to a GPU fight you're going to lose - an RTX 3090 (for example) has over 10,000 CUDA cores and roughly 936 GB/s of memory bandwidth. All of the software tricks and optimization in the world can't make CPUs compete with that.

That said, what Apple is doing with the Apple Neural Engine is very exciting, but that's another accessibility issue - outside of HN most people don't have the latest and greatest Apple hardware (or Apple hardware at all). Not that many people have GPUs lying around either, but for today and the foreseeable future, given the fundamental physical realities, "it is what it is" - you either have specialized hardware or you wait.

Accessibility is important to us as well (it's why we support CPU-only), but I question the value of accessibility if it isn't close to practical. For many people, waiting at least several seconds for a voice response puts these kinds of tasks in "take out your phone and do it there" territory or, if you're already on a desktop, "open a browser tab, type it out, and read it" territory.

I DM'd you on twitter from @toverainc - let's do something!

1 comment

jzombie 1081 days ago

There's a lot to read here, and maybe this is already implemented, but my initial thoughts were it was waiting for the full response to be generated before starting to read it.
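A minimal sketch of that idea, assuming the LLM exposes its output as a stream of text chunks: buffer the chunks, and hand each sentence to TTS as soon as it is complete instead of waiting for the whole reply. The function name is hypothetical, and the sentence splitting is deliberately naive (it will break on abbreviations like "e.g. "); it is not how any of the projects mentioned here actually do it.

```python
import re
from typing import Iterable, Iterator

# Split on whitespace that follows sentence-ending punctuation.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as they appear in a token stream."""
    buffer = ""
    for token in tokens:
        buffer += token
        parts = _SENTENCE_END.split(buffer)
        # Every part except the last is a finished sentence.
        for sentence in parts[:-1]:
            if sentence:
                yield sentence
        buffer = parts[-1]
    # Flush whatever is left when the stream ends.
    if buffer.strip():
        yield buffer.strip()


if __name__ == "__main__":
    # Stand-in for a streaming LLM response (hypothetical chunks).
    fake_stream = ["Hello", " there.", " How", " are", " you?", " Fine"]
    for sentence in sentences_from_tokens(fake_stream):
        # In a real pipeline this is where the sentence would go to TTS.
        print(sentence)
```

With this shape, the time to first audio depends on how long the first sentence takes to generate rather than the full response.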