Hacker News
by msiemens 3319 days ago
Did you consider putting the "center" of the detector somewhere other than in the middle of the vector? What would happen if you had 6 before and 2 after, or 5 before and 3 after? Another thought I had: for performance reasons, it might be nice to have something more compact than a one-hot vector for each letter. Have you looked at determining sets of characters which have a similar impact on hyphenation, and encoding them together?
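To make the question concrete, here is a rough sketch of what an off-center window could look like. It is not taken from the project: the alphabet, the padding symbol `_`, and the names `one_hot` and `window_features` are all made up for illustration, and the 8-letter window is only inferred from the "6 before, 2 after" numbers above.

```python
import string

# Hypothetical alphabet: a padding symbol, lowercase ASCII, and German letters.
ALPHABET = "_" + string.ascii_lowercase + "äöüß"
INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def one_hot(ch):
    """Return a one-hot list for ch; unknown characters map to the padding slot."""
    vec = [0] * len(ALPHABET)
    vec[INDEX.get(ch, 0)] = 1
    return vec


def window_features(word, pos, before=4, after=4):
    """Encode the letters around a candidate break between word[pos-1] and word[pos].

    `before` letters are taken from the left of the break and `after` letters
    from the right, so before=4, after=4 is the centered case and
    before=6, after=2 shifts the "center" toward the end of the window.
    """
    padded = "_" * before + word.lower() + "_" * after
    # In `padded`, word[pos] sits at index pos + before, so the window starts at pos.
    window = padded[pos:pos + before + after]
    features = []
    for ch in window:
        features.extend(one_hot(ch))
    return window, features


# Candidate break in "sil|be" (pos=3), centered vs. shifted window.
centered, _ = window_features("silbe", 3, before=4, after=4)
shifted, feats = window_features("silbe", 3, before=6, after=2)
print(centered, shifted, len(feats))
```

The feature vector has the same length either way (window size times alphabet size), so only the `before`/`after` split would need to change to try the idea. Grouping similar characters would amount to swapping `one_hot` for a smaller encoding, which is not attempted here.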

These are interesting suggestions! It sure would be interesting to do actual research on how to optimize the hyphenation even more. It also would be interesting to play with the hyperparameters and network architecture to see what impact they have on the hyphenation accuracy. Alas, I'm a student so time is rather scarce.

PS: Do you have the extracted list of Wiktionary hyphenations sitting in a text file somewhere that you could put up? I'm fixin' to quickly compare the accuracy to TeX's German hyphenation (once the 30+ GiB TeX Live repository finishes downloading).

Sure! The GitHub repository actually contains a Rust program to process a Wiktionary XML dump into a word list for training, but if you want to skip straight ahead, I've uploaded the dataset I used to https://gist.githubusercontent.com/msiemens/2aac63cf8d1b88c4... [6 MB, licensed under CC BY-SA 3.0].

PPS: You could improve the display of code blocks on your site on desktop by adding [...]

Thanks for the suggestion, I'll look into it!