Hacker News
by nickjj 1672 days ago
As someone who has personally edited over a hundred 1-2 hour podcasts with a new guest every time, I can say that removing umms, ahhs, dead air and filler words is soul crushing. It has gotten to the point where after 2 years of running my podcast[0] I'm seriously considering stopping the show because I'm getting burnt out from editing and without sponsors it's not feasible to hire an editor, but even with the show making no money I would happily pay triple your asking price if I could click a button and have the problem solved in a way that matched a human's ability to edit out filler words.

It really is the difference between being able to edit a 1 hour episode in 1 real life hour (editing at 2x speed) vs literally spending 5 hours to edit 1 hour when there's a lot of filler words or ums. That's due to having to stop every few seconds, think about when to cut it and perform the cut. This is using a heavily optimized keyboard shortcut focused workflow too.

I hope you don't mind constructive criticism but in my opinion your "after" version doesn't sound natural. This isn't an attack on your service specifically, because the outcome is the same with all of the automated tools I've tried. I haven't tried them all but I did play with a few of them.

For example in your case the pause between "Removing" and "filler" doesn't match the pace of the rest of the sentence and the transition from "very" to "time" has a very hard cut. This is also a 10 word clip that's about 6 seconds. If you listened to a 1 hour podcast episode that was edited like this it would be much more noticeable.
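
One standard way audio editors soften a hard cut like the "very" to "time" transition is a short crossfade, where the end of one segment is blended into the start of the next instead of butting them together. The sketch below only illustrates that general technique on plain lists of samples. It is not how this service or any tool mentioned in the thread works, and the function name `crossfade` and the linear ramp are assumptions chosen for brevity.

```python
def crossfade(a, b, overlap):
    """Join two lists of samples, blending the last `overlap` samples of `a`
    with the first `overlap` samples of `b` using a linear ramp."""
    if overlap <= 0:
        return a + b  # plain hard cut
    if overlap > min(len(a), len(b)):
        raise ValueError("overlap is longer than one of the segments")

    head = a[:-overlap]   # untouched part of the first segment
    tail = b[overlap:]    # untouched part of the second segment
    mixed = []
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)  # weight of `b`, rising towards 1
        mixed.append(a[len(a) - overlap + i] * (1 - w) + b[i] * w)
    return head + mixed + tail


# Toy data: a constant "loud" segment followed by a silent one.
joined = crossfade([1.0] * 4, [0.0] * 4, overlap=3)
print(joined)  # [1.0, 0.75, 0.5, 0.25, 0.0]
```

A crossfade only hides the click at the join. It does nothing for the pacing mismatch between "Removing" and "filler", which is a decision about where to cut rather than how.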

There are so many intricate and subtle details around when and what to cut to remove these things in a way where it's not noticeable. Are there any paths moving forward in AI / ML that can lead to this being indistinguishable from being humanly edited?

I debated deleting this comment before posting it because it's a combination of feedback but also saying the service isn't something I would buy in its current state but I'd like to think it's more beneficial to post this to show there is a real demand for this service if it can be executed flawlessly.

[0]: https://runninginproduction.com/

7 comments

I use the editing software “descript” and this process (removing ums) takes just a few minutes even for a long show, because you just delete the words in text. They even have a button, remove “ums”. It’s a game changer.
Meta, but your comment was (IMHO) a great example of constructive criticism. Show HN is about that, not just staying silent and letting the user's work die.
Funnily enough I was about to start building this then found descript[1]. It transcribes the text and allows you to edit the transcription then export it as audio.

[1] https://www.descript.com/
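
The text-based editing described in these comments can be pictured roughly like this: a transcript with per-word timestamps goes in, and deleting a word in the text turns into dropping that word's time range from the audio. This is a minimal sketch under assumptions, not Descript's actual implementation or API. The `Word` type, the `FILLERS` set and `keep_segments` are hypothetical names, and the timestamps are made up.

```python
from dataclasses import dataclass

# Hypothetical filler list; a real tool needs a much richer notion of "filler".
FILLERS = {"um", "umm", "uh", "ah", "ahh"}


@dataclass
class Word:
    text: str
    start: float  # seconds from the start of the recording
    end: float    # seconds from the start of the recording


def keep_segments(words, total_duration, fillers=FILLERS):
    """Return the (start, end) spans of audio left after dropping filler words."""
    segments = []
    cursor = 0.0  # start of the span currently being kept
    for w in words:
        if w.text.lower().strip(".,!?") in fillers:
            if w.start > cursor:
                segments.append((cursor, w.start))
            cursor = max(cursor, w.end)  # resume after the filler word
    if cursor < total_duration:
        segments.append((cursor, total_duration))
    return segments


# Made-up transcript of a 2.5 second clip.
words = [
    Word("So", 0.0, 0.3),
    Word("um", 0.5, 0.9),
    Word("we", 1.0, 1.2),
    Word("uh", 1.3, 1.6),
    Word("deployed", 1.7, 2.3),
]
segments = keep_segments(words, total_duration=2.5)
print(segments)  # [(0.0, 0.5), (0.9, 1.3), (1.6, 2.5)]
```

Note that this cuts exactly at the word boundaries and joins the pieces with no regard for pacing or breath, which is the kind of result the parent comment says does not sound natural.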

revoldiv.com has a similar feature set
I’ve not edited anywhere near as much as you have but I agree, it’s so tedious and by the end of an editing session you can really start to resent the guest and all their verbal tics. I find I get a good idea for what the waveforms look like for some noises and can see them coming and preemptively split the track at the start with a decent success rate.

Using RiversideFM to get two local recordings is also a big help.

I was sat next to an audio editor and producer at a wedding recently and we got on to this topic and he said “your number one job when editing an interview is to make the host sound good and then just do the minimum on the guest, otherwise you’ll waste too much time”.

Doing that kind of editing 8 hours a day, I can see why he says that.

Yeah it's weird. I have these in-depth technical conversations with every guest where it's great, I love this part. The frequency of verbal tics and filler content really takes an edit from "this isn't too bad" to "what the fuck am I doing with my life?" all based on how many times you need to remove filler content within the first 5 minutes of editing a 90 minute show.

I'm kind of surprised that wedding producer openly said that. My philosophy has always been the opposite. One of my main goals of the show is to make the guest walk away thinking this was the best podcast experience they ever had from start to finish as well as do everything I can to make them come off as good as possible.

I rarely cut content but most episodes have hundreds of manual edits to remove filler content and create a more concise flow by removing long pauses because my 2nd main goal is to optimize for the listener. I keep the edits organic at the same time by leaving in some filler content and subtle things like a deep inhale or a sigh because there's a lot of meaning around that when it comes to sentiment and tone, the same can be said for sometimes leaving in an extra 500ms pause to amplify the meaning behind something. At the same time, sometimes filler content gets left in because it flowed too quickly into the next word so cutting it sounds too unnatural as if it clipped.

This is why I think it's a crazy hard problem to get a machine to be able to make decisions like this.
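
For what it's worth, two of the rules of thumb above are mechanical enough to write down, even though the judgment around them is not: shorten pauses only when they are long, and leave a filler word alone when it runs straight into its neighbours. The sketch below is an illustration under assumptions and not the commenter's actual workflow. Both function names and all three thresholds (`max_pause_ms`, `target_ms`, `min_gap_ms`) are invented for the example.

```python
def pause_trim_ms(gap_ms, max_pause_ms=700, target_ms=350):
    """How many milliseconds to remove from a silence of `gap_ms`.

    Pauses up to `max_pause_ms` are left alone so a deliberate beat survives;
    longer ones are shortened to `target_ms`. Both values are hypothetical.
    """
    if gap_ms <= max_pause_ms:
        return 0
    return gap_ms - target_ms


def should_cut_filler(gap_before_ms, gap_after_ms, min_gap_ms=60):
    """Cut a filler word only if there is a little silence on both sides.

    A filler that flows directly into the next word tends to sound clipped
    when removed, so it is kept. `min_gap_ms` is a hypothetical threshold.
    """
    return gap_before_ms >= min_gap_ms and gap_after_ms >= min_gap_ms


print(pause_trim_ms(500))           # 0: short enough to keep as is
print(pause_trim_ms(1500))          # 1150: shortened to the 350 ms target
print(should_cut_filler(200, 10))   # False: runs into the next word
print(should_cut_filler(200, 150))  # True: room on both sides
```

What this cannot encode is the part the comment is really about: knowing that a particular sigh, laugh or extra pause carries meaning and should stay.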

I do use separate recordings (we each record our track locally), it definitely helps eliminate the few cases where we talk over each other or being able to lower the volume of a laugh so it doesn't overpower what the other person said while still keeping it in because it's a good part of a conversation and a snort or laugh can easily be the difference between a listener wondering if the guest was offended or happily agreeing with something.

The edit on the page is not the best. I agree! Mainly, if your recording is unnatural (like that one), the edit is also unnatural. However, the tool works better on an interview podcast. I would strongly recommend just uploading a sample; you would see a big difference.

As for whether ML editing will be indistinguishable from human editing: hard to tell. I think it will be like self-driving cars in the future: 98% good edits, 2% bad edits.

This is a super cool product, congratulations. I especially like the extremely clear value proposition on the homepage. I know what this is and who it is for right away.

My first impression of the unnatural recording was that it must be that way to make it easier to get a good result, but then the result doesn't sound natural either. I think a lot of this is that the drawn-out utterances made the speaker vary their pitch/cadence a lot more than usual. Once edited to remove the gap, the sudden change is very noticeable.

I don't think that's due to your software, but just a fact of the unnatural source audio. I think a different, more realistic source audio could let you have a really awesome example, without it being disingenuous or not representative of real-world results.

Thanks for jumping into the ring and answering questions in here!

Thank you for the suggestion and your impression. I agree, a better example would highlight it better.
Is it possible you’ve set the bar too high for yourself? What if you timebox the editing effort and just focus on the most egregious issues? Certainly you would get more complaints but how much impact would there really be?
It might not make much of a difference with a hobbyist's podcast, but filler/mouth sounds won't fly in professional productions for a variety of reasons (time constraints, professional standards, wanting to make hosts/guests sound good, etc.)
Professional productions also have deadlines and budgets and obsessive grooming won’t fly either.

I’m not suggesting no edits, just relaxing things a bit so the burden doesn’t become an existential threat to the podcast.

> Is it possible you’ve set the bar too high for yourself?

Probably but I have no way to turn this off and be happy with myself.

I try to approach everything I do from the angle of "what needs to be done to make this as good as it can be with my current skill set?". From a listener's perspective if I had to listen to something with a bunch of mouth noises, ums every 3 seconds or long pauses I would end up focusing on that instead of the topics being covered. It would give off a wrong impression that conflicts with my core values.

>Probably but I have no way to turn this off and be happy with myself.

I can appreciate this, but...

>It has gotten to the point where after 2 years of running my podcast I'm seriously considering *stopping the show* because I'm getting burnt out from editing and without sponsors it's not feasible to hire an editor, but even with the show making no money I would happily pay triple your asking price if I could click a button and have the problem solved in a way that matched a human's ability to edit out filler words.

(emphasis mine)

I don't think it's actually the case, but extrapolating a list of priorities from this, I can only arrive at the following:

Priority #1 - no aahs, umms, slurps or smacks

Priority #2 - no ads or obvious sponsors

Priority #3 - surfacing hard-won lessons from experienced folks for the world to learn from

Maybe that resonates, maybe it doesn't, but to me it seems upside down.

I'm only commenting because what you're describing used to be me. I used to do this type of editing for recordings of live audio production and I've gone down the rabbit hole you're describing above. The problem is there's no obvious point of 'done', and chasing perfection in the output can become a pathological obsession. You can get so lost in mating phase angle at each end of a trim or taking an eraser to get rid of a sleeve drag across the desk that you lose sight of the totality of it. Ultimately you end up in a weird uncanny valley, like those folks that keep 'fixing' their face with plastic surgery. Once you get to that point, you can no longer identify specific issues to correct, you just fall into a diffuse unease.

For me podcasts are a way to join a conversation that I wouldn't otherwise have an opportunity to listen to. I don't see them as a show or corporate media product, and the more they start moving in that direction the less inclined I am to listen to them. Julia Child had a quote that I've found oddly applicable in this context: 'It's so beautifully arranged on the plate, you know someone's fingers have been all over it.'

Hope this doesn't come across as negative. Good luck!

Thanks a lot for the reply.

> I don't think it's actually the case

Do you mean the editing process isn't what's making me want to stop the show?

For perspective, phrases like sleeve drag aren't even in my vocabulary. I mainly do my best to quickly get rid of filler content without it sounding like there are hard cuts. I'm not chasing absolute perfection, where I'm zoomed into the waveform so much it looks like an oscilloscope while I hem and haw about there being a 35ms or 50ms pause between 2 words, or agonize over whether I should leave an um in there so things don't sound over-processed.

Here's a screenshot while editing an episode where the guest was extremely fluent and I didn't have to edit much filler content: https://i.imgur.com/7CBZ1yc.jpg, for context the episode was 90 minutes long but I zoomed into the point where you can see a ~10 minute chunk (normally I'm zoomed in much more while actively editing). This is a best case scenario where I "only" had to do 305 cuts for a 90 minute show. In the worst case scenario it's gone as high as 1,800 cuts for 90 minutes.
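
To put those cut counts in perspective, here is the arithmetic, assuming the cuts are spread evenly over the 90 minutes (in practice they almost certainly cluster). `cut_stats` is just a hypothetical helper name.

```python
def cut_stats(cuts, minutes):
    """Return (cuts per minute, seconds of audio between cuts)."""
    per_minute = cuts / minutes
    seconds_between = minutes * 60 / cuts
    return per_minute, seconds_between


best = cut_stats(305, 90)     # the fluent guest from the screenshot
worst = cut_stats(1800, 90)   # the worst case mentioned above

print(f"best case:  {best[0]:.1f} cuts/min, one every {best[1]:.1f} s")
print(f"worst case: {worst[0]:.1f} cuts/min, one every {worst[1]:.1f} s")
# best case:  3.4 cuts/min, one every 17.7 s
# worst case: 20.0 cuts/min, one every 3.0 s
```

The worst case works out to a cut every 3 seconds of audio, which lines up with the earlier description of having to stop every few seconds.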

I try to keep things organic while being respectful to listeners. All of the cuts you see there are related to removing filler content (umms, ahhs, mouth noises and long pauses). I also remove their dead air when I talk to avoid any of their mic's background noise overlapping my voice since it's all recorded in an uncontrolled environment.

The before and after is pretty staggering even with a fairly minimal amount of filler editing. To be honest I would feel embarrassed posting the unedited version of most episodes.

It's also very interesting because in a way I think posting a much less edited version where I kept all of the filler content in wouldn't save me much time in the end. Not to sound overconfident, but I'm really confident in my ability to perform quality assurance of each episode while I'm doing the editing. I haven't listened to a single episode in its final form because I've gone through each sentence and phrase multiple times during the editing process. For example I'll start playing it, hit a cut point, make the cut, rewind a bit and ensure things flow smoothly, then continue onwards.

If I did a much less edited approach I would still need to listen to the show at 2x speed, so no matter what I'm spending 30 minutes listening to 1 raw hour. However I'm also creating timestamped show notes like you see here https://runninginproduction.com/podcast/99-a-custom-electron... along the way while editing so I have to pause to write these down.

Basically I would still be spending quite a lot of time to produce things and I don't think I can outsource that because it would involve finding someone who is not just an audio editor but they would need a ton of domain knowledge around 100 different assorted technologies. A lot of those timestamped notes aren't verbatim quotes. I'm mixing quotes with trying to keep it concise to fit into 1 line. I'm also making judgment calls on what to include because not everything is worth making a note over, otherwise there would be one every 30 seconds (I used to do this in earlier episodes).

Personally I would rather have a transcript with timestamped links where each guest is broken up into their own paragraphs but to have them done right costs a lot of money. Every machine generated transcript service I used had really bad grammar issues and mistakes. A human reviewed one would be well over $100 per episode to make which is a lot when the show already has a net loss on every episode (hosting).

That quote you mentioned was really good by the way. I'd like to think my editing style is more on the side of someone occasionally using their hand to make sure the food doesn't slide off the plate while you run the plate over from the kitchen to the customer. That's how I feel during the editing process. I'm trying to get through it as fast as possible but taking great care to ensure a high quality meal arrives to the customer. I'm optimizing for folks wanting to come back to their favorite restaurant on a regular basis, not serve an artificial feeling $10,000 plate to a king.

> > I don't think it's actually the case

> Do you mean the editing process isn't what's making me want to stop the show?

No this is just confusing language on my part. What I meant was that I don't actually think those are your list of priorities in order, but that is how they could be extrapolated based on which part has to give.

OK so after your description of your workflow I think I was reading too much into where you were at specifically with regards to the content clean-up. I was worried that you were hovering over every sentence trying to optimize it and was just trying to talk you down off the ledge. :) For some reason I tend to gravitate towards jobs where I'm at my best when nobody knows I did anything at all. Editing is probably one of the best examples of this and, as a result, it's hard for anyone that hasn't done it to truly appreciate how much work there is behind it.

(Some of this is selfishly motivated btw, I've been following your podcast since the spring and don't want it to go offline lol. If I have to listen to some CTO's lips smack every time he gets ready to talk I'll allow it. :) )

> For some reason I tend to gravitate towards jobs where I'm at my best when nobody knows I did anything at all.

Yes, this is perfectly said. It's exactly how I feel and what I strive for. I think most folks would be surprised if they listened to a before / after even if all that was done was occasionally remove filler content and mouth noises. It's like that one business analogy iceberg picture with "success" being the 10% that's above water and the other 90% is buried with all sorts of things you never hear about.

> Some of this is selfishly motivated btw, I've been following your podcast since the spring and don't want it to go offline lol

That really means a lot and I'm happy to hear you like the show but unless a big pile of money falls from the sky to afford hiring a dedicated editor and human reviewed transcripts then I have to pull the plug. I've already been feeling this way for 3-4 months but tried to power through it. I've reached the point of feeling resentment and disgust just thinking about opening my editing tool of choice and it's taking its toll. It sucks because I would love to record the show until the day I die but these are the cards I'm dealt and I have to choose sanity over suffering at this point.

There's no middle ground due to the last half of my previous reply.

What post-processing do you do already to catch the low hanging fruit? iZotope? I reckon that if you've put in 100 hours of editing and still can't get an hour of audio edited in under an hour, there is something which could be optimised out quite quickly.
> What post-processing do you do already to catch the low hanging fruit?

None, everything is manual.

I use DaVinci Resolve to do the editing where both the guest and myself have separate tracks. Then I line up the tracks (only takes a few seconds) and start playing things from the beginning at 2x speed. I stop to make cuts mostly to remove filler content.

Throughout this process of editing I'm also creating show notes as I go. An example of the end result is here https://runninginproduction.com/podcast/103-great-question-m.... Basically every few minutes I recap what was said into a 1 sentence bullet point with a timestamp. Along the way I list out techs used as tags and list out reference links / libraries into a Markdown document. Then once I'm done editing the show I write a few paragraphs which is a TL;DR of the episode.

All in all if the guest uses minimal filler words or noises it takes about 1 real life hour per 1 hour of recorded content to do all of the above. For context, the episode I linked has someone who I would bucket into a category of speaking very fluently with minimal filler content. I was able to blaze through that one.

I also have a 2560x1440 display and use the "always on top" feature of most window managers to layer the Markdown document and a preview of the page just above the waveform in DaVinci Resolve so I can quickly make cuts and update the notes with minimal mouse movement. Almost everything is keyboard driven.

What tools can be used to speed up that process?

It sounds like the show notes are the most costly part I would assume? I imagined you were exhausting yourself on scrubbing through manually and editing little clicks, lip smacks, inhales out slowly. The former is much harder to automate away but the latter is definitely easy with some commercial audio plugins.
I've timed myself going through episodes where the guest spoke very fluently vs guests where I had to stop every few seconds to cut a filler word. The latter takes multiple hours longer which makes me think the time consuming part isn't the show notes, but the mechanical editing. Each note only takes about 30 seconds based on listening to the last few minutes of what was said.

It is mentally taxing though, it means during the whole editing process my brain is constantly identifying and removing filler content, listening for specific tech choices to tag, listening for specific references that could be interesting to link, listening for mentions of libraries to link and also digesting the main takeaway of what's being said to sum it up into a note. All of this happens in 1 pass during the editing process. I tried doing it in 2 passes where I only focused on mechanical editing the first time around and doing the show notes on the 2nd but it took longer in the end.