Hacker News
by olsgaarddk 2492 days ago
That was my initial goal, but I had a lot of trouble with vanilla MeCab failing to parse much of the text. This was before NEologd, though, so I think it would work better now.

I don’t have the source code on me, but I scraped the data from a website that publishes subtitles. The scraping was easy; the cleaning was not, and I believe this spreadsheet was generated from my first attempt at cleaning.

A lot of sources in Japanese NLP and linguistics have a bad habit of changing their URLs often, so links bitrot easily. Sorry.