by autoencoders 1672 days ago
Hey HN!

I like podcasting, but I hate editing. I tend to stutter and use a lot of filler words in my podcast. That's why I created Cleanvoice: to spend less time editing. Cleanvoice is an ML tool that removes filler words, mouth sounds, stuttering and dead air from your podcast. To use it, just upload your podcast, wait a few minutes, and download the cleaned audio.

It's still not perfect, but it's at a stage where I can blindly use it on every single one of my podcasts.

I would love to hear your feedback!

4 comments

Neat! I love products that come out of a personal need.

Is it possible for you to do a live, personal demo? No logins or anything. I'm thinking something where you tell people to start up their audio and then give them a quick prompt like "Describe your breakfast yesterday." Record for 30 seconds, and then let them play back the original and cleaned versions. You could limit them to, say, 5 goes, with a different prompt each time.

I suggest it because a) a little personal investment makes it more likely they'll give you their email address for signing up, and b) many potential customers underestimate how much they need something like this.

I like your idea, makes sense.

My biggest fear is that without a login, people will start abusing it in ways that I don't expect. Definitely considering it. Thank you!

That's a good fear to have. That's the kind of thing I would set up some monitoring for and then wait to see. You might get a few jerks. But those same jerks might also be the sort of people who would sign up with a bunch of fake emails, so gating on an email address may not be much better than gating on a fresh-issued cookie.

Thanks for listening, and good luck with your project!

Have you compared this to other commercial options such as Descript? Looks really great at a glance, thanks for sharing!
I tried to use Descript for my podcast, but it has some issues.

1) It doesn't work well if you have a strong accent. As a non-native speaker, I found the transcriptions were quite bad, which made the editing quite bad too.

2) Cleanvoice works with multiple languages; Descript doesn't.

3) Cleanvoice can remove stutters (not always, but it tries) and mouth sounds like lip smacking and teeth clicking. Descript can't. This is not a big deal for most, but since I stutter a lot, this was essential.

My approach is different from Descript's. They use a transcription service, and then they edit the audio based on the text. I work directly at the phonetic level, which gives me more control over the audio.
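
To make the transcript-driven approach concrete, here is a minimal sketch. It assumes a transcription service that returns word-level timestamps; the `FILLERS` set, the `(text, start, end)` tuple format, and the `keep_intervals` helper are all hypothetical and are not taken from Descript or Cleanvoice.

```python
# Hypothetical illustration of transcript-driven editing.
# Assumption: the transcript is a list of (text, start_s, end_s) tuples.
FILLERS = {"um", "uh", "umm", "er"}  # assumed filler vocabulary


def keep_intervals(words, total_duration, fillers=FILLERS):
    """Return the (start, end) spans of audio to keep after dropping fillers.

    words must be sorted by start time; times are in seconds.
    """
    keep = []
    cursor = 0.0  # start of the span we are currently keeping
    for text, start, end in words:
        if text.lower().strip(".,!?") in fillers:
            if start > cursor:
                keep.append((cursor, start))
            cursor = max(cursor, end)  # skip past the filler word
    if cursor < total_duration:
        keep.append((cursor, total_duration))
    return keep


words = [("So", 0.0, 0.3), ("um", 0.4, 0.8), ("hello", 0.9, 1.4)]
print(keep_intervals(words, 1.5))  # [(0.0, 0.4), (0.8, 1.5)]
```

The cuts are only as good as the transcript: if an accent causes a filler to be missed or mislabelled, the edit is wrong, which matches point 1 above.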

Depending on the needs, either one is better. I guess you should try it for yourself and compare.

I will try it, thanks! We work with lots of accents and we've found the same with Descript: it fails, for example, with a strong French accent. Translation is really key for us also; looking forward to seeing systems trained on more accents.
I use Descript and it is absolutely lovely. There are a bunch of products in this space that I would not be surprised to see merged or acquired. Would love to see Descript & GetWelder merging together.

While Cleanvoice has some niche features that Descript doesn't offer, I would not be surprised to find them rolling these features out in their next major release. IMO the founder of Cleanvoice should sell to or join Descript.

Without giving away your secret sauce, what are your approaches to the cleaning process? Is it a combination of different passes of algos or is it something more generic and "sausage machine-like" like a neural network?
The audio is edited in several phases. It uses different algorithms, but most of them are deep learning based. It is surely overengineered, but as a Data Scientist, ML is the most fun part for me.
How is the latency and, if it's sufficiently low, could this realistically be applied to "nearly live" content?

That scenario seems really appealing for conferences, even if it just quietens down the verbal tics, but I'm guessing that if the lag is too great it would end up like a bad lip-sync issue.

How does real time make sense in the first place for an algorithm that gets 1 minute of audio and gives you back 50 seconds? You are going to have to fill the gaps anyway with something not meaningful.
An awareness of your point was precisely why I mentioned "quietening down verbal tics" (i.e. 1 minute in gives you 1 minute back, but with the tics removed or muffled).

To me this seems like it could be worthwhile even if it results in silence or less prominent umms and other filler. I've sat through enough conference talks by technically gifted people who I very much wanted to hear but who unfortunately made their talks much harder to follow because of the tics. It might even help relax some nervous speakers if they knew that any tics that crept in were being suppressed.
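
The difference between the two modes discussed here can be sketched as follows. This is only an illustration on a plain list of samples, assuming the filler spans have already been detected somehow; `cut`, `attenuate`, and the `gain` value are hypothetical names and numbers.

```python
def cut(samples, spans):
    """Drop samples inside spans [(start_idx, end_idx), ...]; output is shorter."""
    out, cursor = [], 0
    for start, end in sorted(spans):
        out.extend(samples[cursor:start])
        cursor = max(cursor, end)
    out.extend(samples[cursor:])
    return out


def attenuate(samples, spans, gain=0.1):
    """Scale samples inside spans by gain; output length is unchanged."""
    out = list(samples)
    for start, end in spans:
        for i in range(start, min(end, len(out))):
            out[i] *= gain
    return out


samples = [1.0] * 10
spans = [(2, 5)]  # assumed detected filler region, in sample indices
print(len(cut(samples, spans)))        # 7
print(len(attenuate(samples, spans)))  # 10
```

Cutting changes the duration, so a live stream would drift or need padding; attenuating keeps input and output aligned, which is what the "1 minute in, 1 minute back" idea relies on.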

I understand now. Very niche, but I applaud the effort to give voice to people who have something to say, instead of those who know how to talk in public.
Silence is meaningful, but pretty awkward when not deliberate!
Tools like this are designed to remove awkward silences.

What it sounds like the GP is after is something more like hiss and pop removal (to use an old vinyl analogy), and that's a different and also simpler problem to solve. I'd wager there are already tools on the market for that.

Very insightful :). Now I need an AI to tell me when silence is deliberate or not. :)
It would be a huge engineering endeavour, which I wouldn't be capable of doing. That said, things like background noise and some sounds can be removed. See Krisp.ai
Nvidia RTX Voice does something similar. Like other technology in this space, though, it focuses more on removing background noise. It actually works very well. It would definitely be interesting to see it also filter speech itself. But I feel like this would be hard to do without introducing extra latency. If someone is saying "umm" or some other filler before a word, you kind of need to know what that word will be to determine whether it's filler or not. So it almost can't be done without introducing latency, as it would need some future speech to make that call.
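
The latency argument above can be sketched as a streaming filter that holds each frame back until some future context has arrived. Everything here is an assumption for illustration: `stream_with_lookahead`, the `is_filler` callback, and the toy rule (treat "um" and immediately repeated words as filler) stand in for a real classifier.

```python
from collections import deque


def stream_with_lookahead(frames, is_filler, lookahead):
    """Yield frames in order, replacing those judged filler with None.

    Each decision waits until `lookahead` later frames have arrived,
    so the output lags the input by that many frames.
    """
    buffer = deque()
    for frame in frames:
        buffer.append(frame)
        if len(buffer) > lookahead:
            current = buffer.popleft()
            yield None if is_filler(current, list(buffer)) else current
    # End of stream: flush the remaining frames with whatever context is left.
    while buffer:
        current = buffer.popleft()
        yield None if is_filler(current, list(buffer)) else current


def toy_is_filler(current, future):
    # Hypothetical rule: "um" is filler, and so is a word that is
    # immediately repeated (a crude stand-in for a stutter).
    return current == "um" or (bool(future) and future[0] == current)


frames = ["I", "I", "um", "think"]
print(list(stream_with_lookahead(frames, toy_is_filler, lookahead=1)))
# [None, 'I', None, 'think']
```

The added delay is roughly the lookahead multiplied by the frame duration, so the more future speech the classifier needs, the further the output falls behind the speaker.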
To do this, the speaker would have to wear an EEG cap. You're talking about cutting the mic before a verbal tic happens.

With an EEG cap, though, I bet a smart person familiar with the methods could bash something together in a day that would work.

True. You don't even need a full cap, just some channels over the visual cortex (with more advanced AI). So you would just need to wear a headband or one of those more elegant-looking EEG devices.
Izotope plugins already do some of these things but not all. In particular their de-clicking algorithm is pretty good but definitely not automatic or low latency.
Do you do any audio segmentation to remove the filler words and such?
Based on the OP's username, surely one of the deep learning algorithms is a denoising autoencoder, right?
I literally just bought your product, thank you very much, I needed this and wondered why no one had made it yet.
I appreciate it! If you have any issues or need help, feel free to reach out. (You can use the chat in the app.)