If only this weren't necessary - a lot of online material was recorded in a single 'take', with minimal editing to remove such time-wasting utterances or silences. And once you start to notice these habits in a poor speaker, they can be a complete deal-breaker in terms of actually learning something.
I have quite literally gone through long lectures and edited out such filler words (Audacity is good for this), where the material is sufficiently compelling (an extreme rarity).
Releasing unedited material full of fillers shows a telling level of contempt for the audience. I'm not at all a fan of the "one take, FI/SI" school of podcasts, and will bail out of virtually anything that features this.
For vapid voiceovers, I'll often just watch the video with the sound off. My response is similar to how Douglas Adams described Marvin the Android hearing people count.
Not sure how well it removes "ehm"s, but I use unsilence[1] a lot for lectures. It removes the silent bits from a video file. It isn't a browser plugin, however: you have to download the lecture before converting it.
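
For anyone curious what "removing the silent bits" involves, here is a rough sketch of the general idea in Python. It is not unsilence's actual implementation; the function names, the amplitude threshold, and the minimum silence length are all assumptions made up for illustration, and it works on a plain list of audio samples rather than a real video file.

```python
def silent_spans(samples, rate, threshold=0.02, min_len_s=0.5):
    """Return (start, end) sample-index spans that are quiet for long enough.

    samples   -- list of floats in [-1.0, 1.0] (hypothetical mono audio)
    rate      -- samples per second
    threshold -- assumed amplitude below which a sample counts as silent
    min_len_s -- assumed minimum duration (seconds) for a span to be cut
    """
    min_len = int(min_len_s * rate)
    spans = []
    start = None
    for i, s in enumerate(samples):
        if abs(s) < threshold:
            if start is None:
                start = i  # a quiet run begins here
        else:
            if start is not None and i - start >= min_len:
                spans.append((start, i))
            start = None
    # Handle a quiet run that lasts until the end of the audio.
    if start is not None and len(samples) - start >= min_len:
        spans.append((start, len(samples)))
    return spans


def drop_silence(samples, rate, threshold=0.02, min_len_s=0.5):
    """Return a copy of samples with the long quiet spans removed."""
    kept = []
    pos = 0
    for start, end in silent_spans(samples, rate, threshold, min_len_s):
        kept.extend(samples[pos:start])  # keep the audible part before the span
        pos = end                        # skip the silent span
    kept.extend(samples[pos:])
    return kept


if __name__ == "__main__":
    rate = 10  # toy sample rate so the numbers stay small
    audio = [0.5] * 5 + [0.0] * 10 + [0.5] * 5 + [0.0] * 2 + [0.5] * 3
    print(silent_spans(audio, rate))
    print(len(audio), "->", len(drop_silence(audio, rate)))
```

Short pauses (here, anything under half a second) are left alone, so normal speech rhythm survives. Filler words like "ehm" are not silent, so a plain amplitude threshold like this would not catch them.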