We haven't found a good provider yet to do this properly for our use case, but SpeakerText, Koemei and VoiceBase are examples of companies that offer these functionalities.
Unfortunately, SpeakerText doesn't offer prices for non-post-processed transcripts, Koemei has integrated it into their own product, and VoiceBase didn't offer post-processing on request, which we would need for integration into our product.
Which format will become mainstream probably depends on HTML5 adoption, which is detailed here: http://www.3playmedia.com/how-it-works/how-to-guides/html5-v... Currently, WebVTT seems to be in the lead.
Those formats don't accommodate timestamps per spoken word, though, which would be possible with machine transcription and which I would pay a premium for.
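
To illustrate what per-word timing could look like, here is a rough sketch of one possible workaround: writing every word as its own WebVTT cue. It assumes the transcription service hands back a list of (word, start, end) tuples with times in seconds; that input shape and the function names are made up for illustration and are not any provider's actual API.

```python
def format_timestamp(seconds: float) -> str:
    """Render seconds as a WebVTT timestamp (HH:MM:SS.mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def words_to_webvtt(words):
    """Build a WebVTT document with one cue per word.

    `words` is a hypothetical list of (text, start_seconds, end_seconds)
    tuples, as a machine transcription service might return them.
    """
    lines = ["WEBVTT", ""]  # required header followed by a blank line
    for text, start, end in words:
        lines.append(f"{format_timestamp(start)} --> {format_timestamp(end)}")
        lines.append(text)
        lines.append("")  # a blank line ends each cue
    return "\n".join(lines)


if __name__ == "__main__":
    sample = [("hello", 0.0, 0.4), ("world", 0.45, 0.9)]
    print(words_to_webvtt(sample))
```

One cue per word gets verbose quickly and isn't how these files are normally used for captions, so treat it as a stopgap for search or click-to-seek rather than a display format.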