Despite the green user, I've actually worked with these guys and seen this project up close and in person, and the amount of data they're harvesting here is pretty overwhelming.
There are a few other places doing transcription nowadays, but that's all they do; this is a richer API that lets you tailor the results to your source data.
Either way, if you're looking for a way to add audio transcription to your podcast or vlog, this is a cool service. If you're looking to make that audio searchable with the fewest number of steps, this is probably the coolest service around.
This is actually really cool! I'm guessing it is English only though? I don't see examples of any other languages and due to the complexity of word->audio matching I imagine other languages aren't supported.
I think a better title would be "Audio search engine for Podcasts using FluidDATA". It had briefly gotten my hopes up that I'd be able to make a search engine for my music, based just on the title.
Cool project, and a mammoth undertaking in terms of scraping and data processing. Would you be able to share any details on what your ingestion infrastructure looks like?
We were planning on writing up a blog post to go over what our backend looks like. But essentially we have written a crawler to discover audio on the internet and a distributed processing framework to download, extract metadata, and transcribe the audio.
We've iterated through a few storage solutions and have settled on using GlusterFS+zfs running on Storinators. So far we have about 350TB of data indexed in our collection.
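For anyone trying to picture the download, extract metadata, transcribe flow described above, here is a minimal Python sketch. This is not FluidDATA's actual code: the `AudioItem` type, the stage functions, and the thread pool standing in for a distributed framework are all hypothetical, and the download and transcription steps are placeholders rather than real network or speech-to-text calls.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field


@dataclass
class AudioItem:
    """Hypothetical record passed from stage to stage."""
    url: str
    audio: bytes = b""
    metadata: dict = field(default_factory=dict)
    transcript: str = ""


def download(item):
    # Placeholder: a real worker would fetch item.url over HTTP.
    item.audio = f"audio-for-{item.url}".encode()
    return item


def extract_metadata(item):
    # Placeholder: a real worker would read duration, codec, tags, etc.
    item.metadata = {"size_bytes": len(item.audio)}
    return item


def transcribe(item):
    # Placeholder for a speech-to-text step.
    item.transcript = f"(transcript of {item.metadata['size_bytes']} bytes)"
    return item


# Stages run in order for each discovered URL.
STAGES = (download, extract_metadata, transcribe)


def process(url):
    item = AudioItem(url=url)
    for stage in STAGES:
        item = stage(item)
    return item


def run_pipeline(urls, workers=4):
    # A local thread pool stands in for a distributed worker fleet.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process, urls))
```

Calling `run_pipeline(["http://example.com/a.mp3"])` returns one `AudioItem` with its metadata and placeholder transcript filled in. In a real deployment each stage would more likely be its own queue-backed worker so the stages can scale independently.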
That's pretty neat. After you download the audio and process it, do you delete the data, or store it for safekeeping? 350TB is a healthy chunk of data.
We are co-locating some of our infrastructure. The backend that does the data processing is running in a rack on our own hardware. The user-facing portions are hosted in GCE.
Are you looking for a specific sound, or is it a word in the English language? These cats can help if it's a word in the English language: I'm not sure if they can search for specific sounds, although I'm sure it's possible down the road.
If you haven't stumbled across it yet, you can check out the FluidDATA web search, which lets you search millions of podcasts by phrase or mention, here: https://fluiddata.com/
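To give a rough idea of what searching audio "by phrase" can mean once a transcript exists, here is a small Python sketch. It assumes a transcript is available as a list of `(word, start_seconds)` pairs; that format and the `find_phrase` helper are assumptions for illustration, not the FluidDATA API or its data model.

```python
def find_phrase(words, phrase):
    """Return the start times (in seconds) where `phrase` occurs.

    `words` is a list of (word, start_seconds) pairs in spoken order.
    Matching is case-insensitive and on exact whole words.
    """
    target = phrase.lower().split()
    if not target:
        return []
    tokens = [word.lower() for word, _ in words]
    hits = []
    # Slide a window the length of the phrase across the transcript.
    for i in range(len(tokens) - len(target) + 1):
        if tokens[i:i + len(target)] == target:
            hits.append(words[i][1])
    return hits


# Hypothetical transcript snippet with word-level timestamps.
transcript = [
    ("Welcome", 0.0), ("to", 0.4), ("the", 0.5), ("show", 0.7),
    ("today", 1.2), ("on", 1.6), ("the", 1.7), ("show", 1.9),
]
print(find_phrase(transcript, "the show"))
```

The returned timestamps are what let a player jump straight to the spot where the phrase was spoken. A service indexing millions of episodes would presumably use an inverted index rather than a linear scan like this.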
Thanks, I just messed around with it a bit and enjoyed the discovery. You all seem to have a ton of content from across the web processed; it's very interesting.