

by kimburgess 1975 days ago

Relying purely on regex misses so much of the context available from a document. I've been working on some tooling [1] in this space recently, and a core epiphany was noting that you can model written language as an AST and then reason about it in that form rather than as opaque blocks of text (or flat, sequential text fragments as with Typerighter). An even better realisation was that others had already noted this too and built a mature ecosystem based on this concept [2].

[1]: https://github.com/place-labs/orthograph-err

[2]: https://textlint.github.io/
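To make the AST point concrete, here is a minimal sketch, not taken from either project linked above. The `Node` class, its node types (`Paragraph`, `Str`, `Quote`) and the "color" rule are all hypothetical, chosen only to show how a tree lets a rule see context that a flat string hides:

```python
import re
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical prose AST node (not the textlint node format)."""
    type: str                      # e.g. "Paragraph", "Str", "Quote"
    value: str = ""                # text content, used by "Str" nodes
    children: list["Node"] = field(default_factory=list)


# Hypothetical house-style rule: flag the spelling "color".
RULE = re.compile(r"\bcolor\b")


def flat_matches(text: str) -> list[str]:
    """Regex over the raw string: no idea whether a hit is inside a quote."""
    return RULE.findall(text)


def tree_matches(node: Node, inside_quote: bool = False) -> list[str]:
    """Same regex, applied per text node, skipping anything under a Quote."""
    inside_quote = inside_quote or node.type == "Quote"
    found: list[str] = []
    if node.type == "Str" and not inside_quote:
        found.extend(RULE.findall(node.value))
    for child in node.children:
        found.extend(tree_matches(child, inside_quote))
    return found


flat_text = 'The color of the sky changed. "I love that color"'
doc = Node("Paragraph", children=[
    Node("Str", "The color of the sky changed. "),
    Node("Quote", children=[Node("Str", "I love that color")]),
])

print(len(flat_matches(flat_text)))  # both occurrences are flagged
print(len(tree_matches(doc)))        # only the one outside the quote
```

The regex itself is unchanged between the two functions. The only difference is that the tree walk knows which node it is in, so it can leave quoted speech alone.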

2 comments

js_herbert 1975 days ago

This is definitely true – in this sense, our initial corpus of regexes is the booster stage for this project, in that it enabled us to produce something useful for journalists in a reasonable timeframe. Typerighter's built as a platform for matching text, so we're not tied to regex – at the moment, we're migrating many rules to LanguageTool, which is a part of our pool of matchers and has a more sophisticated set of NLP tools. (And a great project – thanks LT maintainers!)

Thanks for sharing these projects; other suggestions are very welcome – we'd be interested in adding new matchers based on different tech if they were a good fit for the use case.
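As a rough illustration of the "pool of matchers" idea, here is a minimal sketch. It is not Typerighter's actual API: the `Matcher` protocol, both matcher classes, `run_pool` and the example rule are hypothetical names invented for this example. The only point is that regex becomes one interchangeable matcher among several:

```python
import re
from typing import Protocol

# A match is (start offset, end offset, message).
Match = tuple[int, int, str]


class Matcher(Protocol):
    """Hypothetical interface: anything with a check() method can join the pool."""
    def check(self, text: str) -> list[Match]: ...


class RegexMatcher:
    """Flags every occurrence of a single regex pattern."""
    def __init__(self, pattern: str, message: str) -> None:
        self.pattern = re.compile(pattern)
        self.message = message

    def check(self, text: str) -> list[Match]:
        return [(m.start(), m.end(), self.message)
                for m in self.pattern.finditer(text)]


class RepeatedWordMatcher:
    """A non-regex matcher: flags a word that repeats the previous one."""
    def check(self, text: str) -> list[Match]:
        results: list[Match] = []
        position = 0
        previous = None
        for word in text.split():
            start = text.index(word, position)
            end = start + len(word)
            if previous is not None and word.lower() == previous.lower():
                results.append((start, end, f"Repeated word: {word}"))
            previous = word
            position = end
        return results


def run_pool(text: str, pool: list[Matcher]) -> list[Match]:
    """Run every matcher over the text and merge results in document order."""
    matches = [m for matcher in pool for m in matcher.check(text)]
    return sorted(matches)


pool: list[Matcher] = [
    # Hypothetical style rule, for illustration only.
    RegexMatcher(r"\bjudgement\b", "Style: check spelling of 'judgement'"),
    RepeatedWordMatcher(),
]
for start, end, message in run_pool("The the judgement was final.", pool):
    print(start, end, message)
```

A matcher backed by a more sophisticated NLP tool would slot into the same list, as long as it returns offsets and a message in the same shape.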


RobAley 1975 days ago

Will you be (or are you) contributing any of the rules back to TL? Or are they too specific to your org?


js_herbert 1974 days ago

Taking a look at the corpus, the rules we have currently migrated are very specific to our style guide, and we'd likely be unable to contribute large chunks of the corpus for IP reasons. But this certainly seems possible for more general grammar or style corrections if there was a need – although LT's lists of rules are already quite comprehensive!


danpalmer 1975 days ago

I had this same thought, but I wonder if it really matters for this use-case. The rules are actually quite simple much of the time – they're spelling and stylistic corrections.

I suspect the biggest problem with using regexes is over-suggestion (trying to correct American English spellings inside a quote, for example), but I also suspect this approach is a pretty good balance of features, usability, and correctness.

One issue that comes with more complex systems like you mention is that the bugs become more complex. I'd imagine it's fairly easy for a journalist using this tool to know why an incorrect suggestion has been made, and that makes it easy for them to disregard it. While the error rate may improve with more complex analysis, those errors that do still happen are likely to be less understandable.
