| HN Mirror



	by chatmasta 2992 days ago
	Cool project, and a mammoth undertaking in terms of scraping and data processing. Would you be able to share any details on what your ingestion infrastructure looks like?


mathatoms 2992 days ago

We were planning on writing up a blog post to go over what our backend looks like, but essentially we have written a crawler to discover audio on the internet and a distributed processing framework to download the audio, extract its metadata, and transcribe it.

We've iterated through a few storage solutions and have settled on GlusterFS on top of ZFS, running on Storinators. So far we have about 350TB of data indexed in our collection.
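
For readers who want a concrete picture of the download, extract-metadata, transcribe flow described above, here is a minimal sketch. It is not the poster's actual code: the names `AudioJob`, `download`, `extract_metadata`, `transcribe`, and `run_pipeline` are hypothetical, the stage bodies are placeholders, and a local thread pool stands in for the distributed framework.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field


@dataclass
class AudioJob:
    """One discovered audio URL moving through the pipeline (hypothetical type)."""
    url: str
    audio: bytes = b""
    metadata: dict = field(default_factory=dict)
    transcript: str = ""


def download(job: AudioJob) -> AudioJob:
    # Placeholder: a real worker would fetch job.url over the network.
    job.audio = f"audio-bytes-for:{job.url}".encode()
    return job


def extract_metadata(job: AudioJob) -> AudioJob:
    # Placeholder: a real worker would parse tags, duration, codec, and so on.
    job.metadata = {"size_bytes": len(job.audio)}
    return job


def transcribe(job: AudioJob) -> AudioJob:
    # Placeholder: a real worker would call a speech-to-text system here.
    job.transcript = f"<transcript of {job.url}>"
    return job


# Stages run in this order for every job.
STAGES = (download, extract_metadata, transcribe)


def process(url: str) -> AudioJob:
    """Run a single URL through every stage in sequence."""
    job = AudioJob(url=url)
    for stage in STAGES:
        job = stage(job)
    return job


def run_pipeline(urls, workers: int = 4):
    """Process URLs concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process, urls))
```

Keeping each stage as a plain function that takes and returns a job makes it easy to rerun only one stage later, for example transcription, against audio that is already stored.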


dandancanfly 2992 days ago

That's pretty neat. After you download the audio and process it, do you delete the data, or store it for safekeeping? 350TB is a healthy chunk of data.


mathatoms 2992 days ago

We have enough storage to hold on to the data. We keep the data around so we can retranscribe files as we update our language models.


chatmasta 2992 days ago

Wow! Sounds awesome, I would love to read a blog post on that.

Are you co-locating the hardware? What is bandwidth pricing like?


mathatoms 2992 days ago

Thanks. Our blog is located at https://blog.bitplatter.com.

We are co-locating some of our infrastructure. The backend that does the data processing runs in a rack on our own hardware. The user-facing portions are hosted in GCE.
