by kimburgess 1975 days ago
Relying purely on regex misses so much of the context available from a document. I've been working on some tooling [1] in this space recently, and a core epiphany was noting that you can model written language as an AST and then reason about it in this form rather than as opaque blocks of text (or flat, sequential text fragments, as with Typerighter). An even better realisation was that others had already noted this too and built a mature ecosystem based on this concept [2].

[1]: https://github.com/place-labs/orthograph-err

[2]: https://textlint.github.io/
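
To make the AST idea concrete, here is a minimal sketch in Python. It is not how orthograph-err or textlint actually represent text: the node types (`Document`, `Paragraph`, `Str`, `Quote`), the toy `parse` function and the `check` helper are all assumptions made up for illustration. The point is only that once text is a tree, a rule can choose which kinds of node it applies to.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    type: str                      # hypothetical node types: "Document", "Paragraph", "Str", "Quote"
    value: str = ""
    children: List["Node"] = field(default_factory=list)


def parse(text: str) -> Node:
    """Toy parser: blank lines separate paragraphs, double-quoted spans become Quote nodes."""
    doc = Node("Document")
    for block in re.split(r"\n\s*\n", text.strip()):
        para = Node("Paragraph")
        # Splitting on a capture group keeps the quoted spans in the result.
        for part in re.split(r'("[^"]*")', block):
            if not part:
                continue
            kind = "Quote" if part.startswith('"') else "Str"
            para.children.append(Node(kind, part))
        doc.children.append(para)
    return doc


def walk(node: Node):
    """Yield every node in the tree, depth first."""
    yield node
    for child in node.children:
        yield from walk(child)


def check(doc: Node, pattern: str, suggestion: str):
    """Apply a spelling rule to Str nodes only, leaving Quote nodes untouched."""
    hits = []
    for node in walk(doc):
        if node.type == "Str":
            for m in re.finditer(pattern, node.value):
                hits.append((m.group(0), suggestion))
    return hits


text = 'The color was wrong.\n\nShe said "my favorite color is red" today.'
doc = parse(text)
hits = check(doc, r"\bcolor\b", "colour")
```

Here the rule only ever sees the prose outside the quotation, so the quoted speaker's spelling is never flagged. A real parser would of course need to handle nested quotes, markup and sentence boundaries.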

2 comments

This is definitely true – in this sense, our initial corpus of regexes is the booster stage for this project, in that it enabled us to produce something useful for journalists in a reasonable timeframe. Typerighter's built as a platform for matching text, so we're not tied to regex – at the moment, we're migrating many rules to LanguageTool, which is a part of our pool of matchers and has a more sophisticated set of NLP tools. (And a great project – thanks LT maintainers!)

Thanks for sharing these projects. Other suggestions are very welcome – we'd be interested in adding new matchers based on different tech if they were a good fit for the use case.

Will you be (or are you) contributing any of the rules back to TL? Or are they too specific to your org?
Taking a look at the corpus, the rules we have currently migrated are very specific to our style guide, and we'd likely be unable to contribute large chunks of the corpus for IP reasons. But this certainly seems possible for more general grammar or style corrections if there was a need – although LT's lists of rules are already quite comprehensive!
I had this same thought, but I wonder if it really matters for this use-case. The rules are actually quite simple much of the time – they're spelling and stylistic corrections.

I suspect the biggest problem with using regexes is over-suggestion (trying to correct American English spellings inside a quote, for example), but I think this is a pretty good balance of features, usability, and correctness.
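
As a rough illustration of that over-suggestion, here is a small Python sketch. The rule, the sample sentence and the function names are hypothetical and not taken from Typerighter's corpus. It shows a flat regex flagging words inside a quotation, plus a cheap mitigation that stays in regex territory by discarding matches that fall inside double-quoted spans.

```python
import re

# Hypothetical house-style rule: flag American spellings.
RULE = re.compile(r"\b(color|favorite)\b")
# Naive quote detector: straight double quotes only, no nesting.
QUOTE = re.compile(r'"[^"]*"')


def flat_matches(text):
    """Every rule match as (offset, word), with no knowledge of context."""
    return [(m.start(), m.group(0)) for m in RULE.finditer(text)]


def matches_outside_quotes(text):
    """Same matches, minus any whose offset falls inside a quoted span."""
    quoted = [m.span() for m in QUOTE.finditer(text)]
    return [
        (pos, word)
        for pos, word in flat_matches(text)
        if not any(start <= pos < end for start, end in quoted)
    ]


text = 'The color scheme changed. "My favorite color is red," she said.'
flagged_everywhere = [word for _, word in flat_matches(text)]
flagged_outside = matches_outside_quotes(text)
```

The flat version flags all three words, including the two the speaker actually said; the filtered version keeps only the one in the writer's own prose. It also stays easy to explain: a suggestion is suppressed only if it sits between a pair of double quotes.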

One issue that comes with more complex systems like you mention is that the bugs become more complex. I'd imagine it's fairly easy for a journalist using this tool to know why an incorrect suggestion has been made, and that makes it easy for them to disregard it. While the error rate may improve with more complex analysis, those errors that do still happen are likely to be less understandable.