

by blopker 5 days ago

Nice! I really like how many variations on this idea are coming out. MacWhisper used to be great, but is kind of a buggy mess now.

I'm making my own, for personal use. I did a survey of many and they all (that I could find) skip the fundamentals.

The major issues that I've run into:

- Crash recovery. Most of these apps are incredibly buggy and crash all the time, taking the recorded audio with them. MacWhisper is incredibly bad at this.

- Disk space. Many of these apps save wav files to disk. After a few hours of meetings, you may end up with gigabytes eaten.

- Microphone bleed. People don't always use headphones, so the system mic picks up the speaker output, causing (approximately) duplicate transcriptions.
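
For a rough sense of scale on the disk space point, here is a back-of-the-envelope sketch. The sample rates, bit depth, channel counts, and the 64 kbps comparison bitrate are all assumptions picked for illustration, not what any particular app actually records:

```python
def pcm_bytes(seconds, sample_rate_hz, bits_per_sample, channels):
    """Size of uncompressed PCM audio; a WAV file adds only a small header."""
    return seconds * sample_rate_hz * (bits_per_sample // 8) * channels


def compressed_bytes(seconds, bitrate_bps):
    """Size of a constant-bitrate compressed stream, ignoring container overhead."""
    return seconds * bitrate_bps // 8


HOUR = 3600

# Assumed: 48 kHz, 16-bit, stereo capture.
wav_per_hour = pcm_bytes(HOUR, 48_000, 16, 2)

# Assumed: mic and system audio are saved as two separate files.
two_streams_per_hour = 2 * wav_per_hour

# Assumed: a 64 kbps compressed stream for comparison.
compressed_per_hour = compressed_bytes(HOUR, 64_000)

for label, size in [
    ("one WAV stream", wav_per_hour),
    ("mic + system WAV", two_streams_per_hour),
    ("64 kbps compressed", compressed_per_hour),
]:
    print(f"{label}: {size / 1e9:.2f} GB per hour")
```

Under those assumptions a single uncompressed stream is roughly 0.7 GB per hour, so a few hours of meetings with two streams lands in multi-gigabyte territory.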

I've yet to find a solution that handles all these correctly, let alone having high quality transcriptions.

Anyway, most of these apps are built around https://github.com/FluidInference/FluidAudio, if anyone is curious. Their readme has a big list of similar apps as well.

6 comments

AG342 5 days ago

Crash recovery is definitely something that I want to spend a bit more time on. I'm not entirely sure how Trace handles crashing right in the middle of a recording, so I'm going to put a bit of time aside in the next few days to properly explore this and see if I can come up with an elegant solution to it.

I think I've got the other two bits covered. I pushed an update yesterday that adds active echo cancellation so that audio playing through the speakers (or leaky headphones) won't get transcribed twice if it is picked up by the microphone. It can be disabled in preferences, but it's on by default.

The disk space issue is one that I considered as well. By default, Trace deletes the actual audio recordings as soon as transcription is successfully completed, so the idea is you keep just the markdown transcript rather than the gigabytes of raw audio. If you want, there's a preference to disable the auto-deletion. There's a bit more on the support page here https://traceapp.info/support (search for "Auto-deletion of audio").

FluidAudio is a big part of this and is actually used in two places during a session. It runs the Parakeet EOU model for the instant recap (which isn't hugely accurate, but it's good enough for the job) and after the call it's also used to transcribe the recording, depending on which engine you've selected (Trace offers a fast and an accurate one). If the fast engine is selected, we use FluidAudio with the Parakeet-TDT 0.6b v3 model for transcription, which then goes through Pyannote and WeSpeaker for diarization. If the accurate engine is selected, we use WhisperKit with the Whisper large-v3-turbo model for transcription, and SpeakerKit for diarization.

kstenerud 4 days ago

For crash resilient data, you have a few options:

- Journaling file structures (telegraph what you're about to write, then write it, then signal completion)

- mmap your important data structures to a file (the OS will flush them to disk no matter how your app dies, short of a kernel panic or power loss)

- post-crash dump (put last-minute writers in a crash handler to save it to disk)

A journaling file structure is the most secure, because it's designed with the assumption that writing will eventually fail. Memory-mapped structs are easy and cheap, and get you 99% of the way there (only a kernel panic or power loss will lose your data). Crash-time writing is doable with a crash handler like KSCrash, but there are many ways an app can crash without triggering a crash handler (thermal kill, exceeding quota, memory jetsam, etc). You also need to write your data in an async-signal-safe manner.
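
To make the journaling option a bit more concrete, here is a minimal sketch. It is a simplified append-only variant rather than a full journal: the header announces the length and checksum of what is about to be written, and a matching checksum stands in for the explicit completion signal. The record layout and the names `append_record` and `recover` are made up for this illustration:

```python
import os
import struct
import zlib

# Assumed record layout: 4-byte payload length, 4-byte CRC32, then the payload.
HEADER = struct.Struct("<II")


def append_record(f, payload: bytes) -> None:
    """Append one length-prefixed, checksummed record and push it to disk."""
    f.write(HEADER.pack(len(payload), zlib.crc32(payload)))
    f.write(payload)
    f.flush()             # hand the bytes to the OS
    os.fsync(f.fileno())  # ask the OS to write them to storage


def recover(path: str) -> list:
    """Return every complete record, stopping at the first torn or corrupt one."""
    with open(path, "rb") as f:
        data = f.read()
    records = []
    offset = 0
    while offset + HEADER.size <= len(data):
        length, crc = HEADER.unpack_from(data, offset)
        start = offset + HEADER.size
        payload = data[start:start + length]
        if len(payload) < length or zlib.crc32(payload) != crc:
            break  # the write was interrupted here; ignore the tail
        records.append(payload)
        offset = start + length
    return records
```

After a crash, `recover("audio.journal")` would return every chunk that was fully written and drop a half-written tail. Surviving a process crash only needs the bytes to have reached the OS; the `fsync` is there for harder failures, and on macOS a full flush to the drive additionally needs `F_FULLFSYNC` via `fcntl`.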

scosman 5 days ago

I had the same experience so started building my own. All problems are solvable, just working on the polish.

- crash recovery: part one is to use ADTS AAC (even if the process crashes, audio is saved up until it does). Part two is isolating the transcription/summaries in separate XPC services.

- disk space: AAC 64kbps mono solves it. Could use Opus for further reduction but both are small.

- speaker bleed: macOS voice isolation processing solves this. It’s a nightmare to get set up, but works great once done.

- library: using the Argmax SDK - by a bunch of ex-Apple on-device AI folks.
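
The reason ADTS helps with crash recovery is that the stream is just a sequence of self-contained frames, each with its own header that includes the frame length, so there is no trailer or index that has to be finalized at the end. A rough sketch of how a recovery pass could find the usable part of a file after a crash (not my actual implementation; the function name is made up, and it assumes the file starts on a frame boundary):

```python
def usable_adts_length(data: bytes) -> int:
    """Return how many leading bytes of `data` consist of whole ADTS frames."""
    offset = 0
    while offset + 7 <= len(data):  # 7 bytes is the minimum ADTS header size
        # The header starts with a 12-bit syncword (0xFFF); the 2 layer bits are 0.
        if data[offset] != 0xFF or (data[offset + 1] & 0xF6) != 0xF0:
            break
        # 13-bit frame length (header included), spread over bytes 3 to 5.
        frame_len = (
            ((data[offset + 3] & 0x03) << 11)
            | (data[offset + 4] << 3)
            | (data[offset + 5] >> 5)
        )
        if frame_len < 7 or offset + frame_len > len(data):
            break  # torn final frame: the process died mid-write
        offset += frame_len
    return offset
```

Truncating the file to `usable_adts_length(data)` bytes drops at most the one frame that was in flight when the process died.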

If it wasn’t for CoreAudio, I’d say it was easy to make. Argmax, Whisper, and llama.cpp - wrapped in the right architecture, mostly just work.

I’m having fun nerding out on the details like custom vocabulary (get the names of the people in the meeting right), inferring speaker names from transcript, calendar integration, nice UI, etc.

jv22222 5 days ago

Nice tip on FluidAudio, that's the kind of thing I've been looking for. Thanks!

victorbjorklund 4 days ago

Handy works well with crash recovery (mostly from me turning off the computer mid-recording because I forgot about the recording)

highmastdon 5 days ago

I’m using MacParakeet these days. If your language is supported, definitely give it a try. It’s much faster and has a lower footprint.

Folcon 5 days ago

> I've yet to find a solution that handles all these correctly, let alone having high quality transcriptions.

Wait really? I honestly would have thought this was a solved problem by now, especially high quality transcriptions bit, just out of curiosity, is the problem that the quality isn't high enough?

blopker 5 days ago

There are still a few unsolved problems that require tuning for specific applications. Applications that own the video call have a much easier time, since they have access to each individual audio stream. Applications like this, however, have to deal with overlapping voices from a single stream. If the app is trying to attribute each utterance to an individual, separating the voices is tough and can lead to confusing transcripts. There are many little problems like this which make it a tough problem in real-world usage. Domain-specific terms and proper nouns are another source of inaccuracy.

sofixa 4 days ago

> Wait really? I honestly would have thought this was a solved problem by now, especially high quality transcriptions bit, just out of curiosity, is the problem that the quality isn't high enough?

If I had to guess, all of those apps are probably vibecoded, hence the variable quality.
