Hacker News
by cjy 3314 days ago
I'm also learning Mandarin and was wondering if this was possible (for a different show) just the other week! Thanks for the article, will be looking forward to Part 2 and 3. Also, is there an easy way to extract all the frames with unique subtitles?
2 comments

Once you have the text corresponding to each frame, you can de-dupe it with its neighbors based on Levenshtein distance (can't use exact-match because of recognition errors). I found that for this show subtitles generally hang on-screen for 1-3 seconds, so you wouldn't have to do many comparisons.
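For what it's worth, here is a minimal sketch of that de-duping step. It assumes you already have one OCR string per sampled frame, in order. The function names and the `max_ratio` threshold of 0.3 are my own placeholders, not anything from the article, and the threshold would need tuning against real recognition errors.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via dynamic programming, keeping only one row at a time."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string on the inner loop
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


def dedupe_frames(frame_texts, max_ratio=0.3):
    """Collapse runs of consecutive frames whose text is nearly identical.

    Returns (first_frame_index, text) pairs, one per run. max_ratio is an
    assumed tolerance: edit distance divided by the longer string's length.
    """
    kept = []
    for index, text in enumerate(frame_texts):
        if kept:
            last = kept[-1][1]
            longest = max(len(last), len(text), 1)  # avoid dividing by zero
            if levenshtein(last, text) / longest <= max_ratio:
                continue  # same subtitle as the last kept frame
        kept.append((index, text))
    return kept
```

Each frame is only compared against the most recently kept one, so the cost stays linear in the number of frames, which fits the observation that subtitles only stay up for a few seconds.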
cjy - could you please help me find more dual subtitles? (Chinese and English, synced)

I have a program to add spaces between Chinese words, colours for the tones, pinyin, and a literal translation.

http://pingtype.github.io

I already made a feature to list all the unique words in a movie, sort them by their frequency, and make a study sheet. I also made a bash script generator that uses ffmpeg to cut the movie at the subtitle timestamps.
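To give an idea of what the script generator does, here is a rough sketch in Python rather than the actual Pingtype code. It assumes standard srt timing lines (`HH:MM:SS,mmm --> HH:MM:SS,mmm`). The function name, `movie.mp4`, and the `clip_0001.mp4` naming scheme are made up for illustration. It only builds the command strings and does not run ffmpeg.

```python
import re
import shlex

# Matches an srt timing line, e.g. "00:01:02,500 --> 00:01:04,000"
TIMING = re.compile(
    r"(\d{2}:\d{2}:\d{2}),(\d{3})\s*-->\s*(\d{2}:\d{2}:\d{2}),(\d{3})"
)


def ffmpeg_commands(srt_text, movie="movie.mp4", prefix="clip"):
    """Return one ffmpeg command string per subtitle cue found in srt_text."""
    commands = []
    for match in TIMING.finditer(srt_text):
        # srt uses a comma before the milliseconds; ffmpeg expects a dot
        start = f"{match[1]}.{match[2]}"
        end = f"{match[3]}.{match[4]}"
        out = f"{prefix}_{len(commands) + 1:04d}.mp4"
        commands.append(
            f"ffmpeg -i {shlex.quote(movie)} -ss {start} -to {end} "
            f"{shlex.quote(out)}"
        )
    return commands
```

Writing the returned lines into a `.sh` file gives a script that cuts one clip per cue. Placing `-ss` and `-to` after the input re-encodes each clip, which is slower than stream copying but cuts at the requested times.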

All I need to do now is recombine the subtitles based on the words, to make videos with lots of example sentences.
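The recombining step is not written yet, so the following is only a sketch of one way it might work: build an index from each word to the cues that contain it, then pull a few cues per word as example sentences. It assumes each cue is a `(start, end, text)` tuple whose text already has spaces between words. Both function names are hypothetical.

```python
from collections import Counter, defaultdict


def index_cues_by_word(cues):
    """Map each word to the cue numbers containing it, and count word frequency.

    cues: list of (start, end, text) where text is space-separated words.
    """
    index = defaultdict(list)
    counts = Counter()
    for cue_number, (_start, _end, text) in enumerate(cues):
        words = text.split()
        counts.update(words)
        for word in set(words):  # list each cue once per word
            index[word].append(cue_number)
    return index, counts


def examples_for(word, cues, index, limit=5):
    """Return up to `limit` cues that contain `word`, in movie order."""
    return [cues[i] for i in index.get(word, [])[:limit]]
```

The start and end times of the returned cues could then be used to cut and concatenate one clip per example sentence.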

It's much easier to study with a real English translation though, instead of a literal word-for-word translation. If you could help me get more input data (names of movies or songs, srt files), that would be wonderful!