by rspeer 3384 days ago
When I was an undergrad freshman, I took a job with a research group as a data annotator. My job was to go through the Switchboard corpus (recordings of hour-long phone calls that people agreed to have recorded, in exchange for having the long-distance charges paid) and label features such as who was speaking, whether the pitch of the voice was rising or falling, whether the vowels were elongated, whether there was vocal fry, and stuff like that.

But the most time-consuming and mind-numbing part of it was just annotating the words in the sound file.

The interface for all of this was a terrible GUI hacked on top of some Solaris sound editor. It couldn't do anything for you: it couldn't find the moments where words began, and it couldn't say "hey, the pitch is obviously falling here," even though frequency tracking is a thing computers can do.
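To make the "frequency tracking is a thing computers can do" point concrete, here is a minimal sketch of autocorrelation-based pitch tracking in plain Python. It is only an illustration, not what that annotation tool or aeneas does. The function names, the 8 kHz sample rate, the 75-400 Hz search range, and the thresholds are all assumptions picked for the example. Real speech would also need voicing detection and smoothing.

```python
import math


def synth_tone(freq_hz, n_samples, sample_rate):
    """Generate a pure sine tone (a stand-in for one voiced speech frame)."""
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate) for i in range(n_samples)]


def autocorr_pitch_hz(frame, sample_rate, fmin=75.0, fmax=400.0):
    """Estimate the fundamental frequency of one frame by autocorrelation."""
    min_lag = int(sample_rate / fmax)
    max_lag = int(sample_rate / fmin)
    window = len(frame) - max_lag  # same number of terms for every lag
    if window <= 0:
        raise ValueError("frame is too short for the requested fmin")
    scores = [
        sum(frame[i] * frame[i + lag] for i in range(window))
        for lag in range(min_lag, max_lag + 1)
    ]
    peak = max(scores)
    if peak <= 0:
        return None  # silence, or no periodicity in the search range
    # Take the first strong local peak so that a multiple of the true period
    # is not picked by mistake. Peaks exactly at the search boundary are ignored.
    for k in range(1, len(scores) - 1):
        if scores[k] >= 0.9 * peak and scores[k - 1] <= scores[k] >= scores[k + 1]:
            return sample_rate / (min_lag + k)
    return None


def pitch_trend(samples, sample_rate, frame_len=400, tolerance_hz=5.0):
    """Label a clip 'rising', 'falling', or 'flat' by comparing the pitch
    of its first and last frames that yield an estimate."""
    frames = [
        samples[start:start + frame_len]
        for start in range(0, len(samples) - frame_len + 1, frame_len)
    ]
    estimates = (autocorr_pitch_hz(frame, sample_rate) for frame in frames)
    pitches = [p for p in estimates if p is not None]
    if len(pitches) < 2:
        return "unknown"
    delta = pitches[-1] - pitches[0]
    if delta > tolerance_hz:
        return "rising"
    if delta < -tolerance_hz:
        return "falling"
    return "flat"


if __name__ == "__main__":
    sr = 8000  # assumed sample rate, typical of telephone audio
    clip = synth_tone(200, 400, sr) + synth_tone(160, 400, sr) + synth_tone(120, 400, sr)
    print(pitch_trend(clip, sr))
```

Summing over a fixed window keeps every lag on equal footing, so shorter lags don't win just because they have more terms.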

There's still a lot more voice data to annotate in the world, and maybe having a flexible Python tool like this will make the next undergrad doing the grunt work much more effective at it.

1 comment

I agree with most of your observations.

However, please note that other tools are better suited than aeneas if one wants to align at the phoneme level: gentle, Kaldi, SPPAS, etc.

aeneas aims to cover as many languages as possible, to compute quickly, and to target (sub)sentence granularity (e.g., ebook-audiobook or closed captions). Phoneme-level annotation really requires more sophisticated techniques, like HMM/GMM/NN as implemented by the tools mentioned above. Still, aeneas can be used to quickly bootstrap an alignment that is then, e.g., reviewed manually.
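As a rough illustration of the "bootstrap, then review manually" workflow, the sketch below takes a sentence-level sync map and flags fragments whose speaking rate looks implausible, so a reviewer can start with those. It assumes a JSON layout with a top-level `fragments` list whose entries carry `id`, `begin`, `end` (seconds, as strings) and `lines`. Check that against the output of your own aeneas version. The function name and the rate thresholds are made up for the example.

```python
import json


def fragments_to_review(sync_map, min_chars_per_sec=5.0, max_chars_per_sec=25.0):
    """Return (id, begin, end, text) for fragments with a suspicious speaking rate."""
    flagged = []
    for frag in sync_map["fragments"]:
        begin, end = float(frag["begin"]), float(frag["end"])
        text = " ".join(frag.get("lines", []))
        duration = end - begin
        # A zero or negative duration is always suspicious.
        rate = len(text) / duration if duration > 0 else float("inf")
        if not (min_chars_per_sec <= rate <= max_chars_per_sec):
            flagged.append((frag.get("id"), begin, end, text))
    return flagged


# Hypothetical sync map, written by hand for this example.
EXAMPLE_SYNC_MAP = """
{"fragments": [
  {"id": "f000001", "begin": "0.000", "end": "2.000", "lines": ["hello there how are you"]},
  {"id": "f000002", "begin": "2.000", "end": "2.100", "lines": ["this one got squeezed"]},
  {"id": "f000003", "begin": "2.100", "end": "12.100", "lines": ["okay"]}
]}
"""

if __name__ == "__main__":
    for frag_id, begin, end, text in fragments_to_review(json.loads(EXAMPLE_SYNC_MAP)):
        print(f"{frag_id}  {begin:.3f}-{end:.3f}  {text!r}")
```

Characters per second is a crude proxy, but it is cheap and language-agnostic, which fits a tool that targets many languages.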