by Qwertious 684 days ago
This is great news, it's been needed for ages - handwriting is more than just funky OCR, it's OCR as applied to vector lines with a defined stroke order. So for instance, a lowercase e and c might render to the exact same pixels due to the 'loop' of the e overlapping itself, but if we know the stroke started in the middle of the line and then retreads itself, we can know for sure we're looking at an 'e'. That's simply not possible in e.g. Tesseract.

This project is just funky OCR, i.e. "offline" handwriting recognition that operates on the pixels of the final image only. That means it works on scans, but can't take stroke order information into account.

What you're talking about would be "online" handwriting recognition, where timing information about each stroke is available.
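To make the distinction concrete, here is a minimal Python sketch. It is a toy illustration and not code from the plugin or from Tesseract: the `rasterize` and `directions` helpers, the `(x, y, t)` point layout, and the example path are all assumptions made up for this example. It draws the same path forwards and backwards, then compares what an offline system sees (pixels) with what an online system sees (pen movement in order).

```python
from typing import List, Set, Tuple

Point = Tuple[float, float, float]  # (x, y, t): position in [0, 1] plus a timestamp
Stroke = List[Point]


def rasterize(strokes: List[Stroke], size: int = 8) -> Set[Tuple[int, int]]:
    """Offline view: the set of pixels touched by the sampled points.

    Order and timing are discarded. For brevity only the sample points are
    marked, not the line segments between them.
    """
    pixels = set()
    for stroke in strokes:
        for x, y, _t in stroke:
            col = min(int(x * size), size - 1)
            row = min(int(y * size), size - 1)
            pixels.add((row, col))
    return pixels


def directions(stroke: Stroke) -> List[Tuple[float, float]]:
    """Online view: pen movement between consecutive samples, in drawing order."""
    return [
        (x1 - x0, y1 - y0)
        for (x0, y0, _t0), (x1, y1, _t1) in zip(stroke, stroke[1:])
    ]


# A hypothetical path, drawn once from start to end and once from end to start.
path = [(0.2, 0.5), (0.8, 0.5), (0.8, 0.2), (0.2, 0.2), (0.2, 0.8), (0.8, 0.8)]
forward = [(x, y, float(i)) for i, (x, y) in enumerate(path)]
backward = [(x, y, float(i)) for i, (x, y) in enumerate(reversed(path))]

same_image = rasterize([forward]) == rasterize([backward])
same_dynamics = directions(forward) == directions(backward)
print("same pixels:", same_image)
print("same pen dynamics:", same_dynamics)
```

Both drawings cover the same pixels, so a model that only sees the image cannot tell them apart, while the ordered pen movements differ. That ordering is the extra signal an online recognizer can use.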

Yes, that's totally correct! The current version of the plugin supports only so-called "offline" HTR, which operates on images. This is ultimately determined by the underlying machine learning model.

However, I have developed another model (based on a fairly recent Google paper, Carbune et al. 2020) that operates on pen dynamics and thereby implements online HTR; see here:

https://github.com/PellelNitram/OnlineHTR

This model is open-source as well and will be part of the HTR system for Xournal++ in the future. Feel free to give it a try yourself locally.

One question that has been bothering me for a long time, and that has so far kept me from using online HTR, is how to find text on a page in the temporal domain (i.e. in the online domain rather than the offline domain). If you have any ideas on that, please do let me know; I would greatly appreciate it! One possible way is a transformer model, but that feels like overkill and introduces a context length limit.

Well, that explains why I could never find a decent stroke-order-aware HWR system that wasn't a service. Sigh. What idiot invented this terminology?

Yes, you're right, stroke-order-aware HWR systems are hard to find. One reason for that is the lack of good datasets for training machine learning models!

For that reason, my stroke-order-aware attempt over at https://github.com/PellelNitram/OnlineHTR/ uses a dataset from 2000 with around 12,000 samples. In contrast, the internal Google dataset is reported to feature around 16,000,000 samples :-D.

This is a great observation!

Currently, the machine learning model only supports offline HTR (i.e. using images), but online HTR (i.e. using pen time series data) is in the works; see here:

https://github.com/PellelNitram/OnlineHTR/