by mdpacer 3763 days ago

> What kinds of NLP technique does this system use?

It depends on your interpretation of NLP. In a sense, all of the rules are hard-coded: proselint does string and token processing informed by contributors' interpretations of style guides' usage rules. Thus, most of the NLP has been performed by the human programmers interpreting those rules.
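To make "hard-coded string processing" concrete, here is a minimal sketch of that style of check. It is not proselint's actual code or API: the `RULES` table, the `lint` function, and both example rules are hypothetical, chosen only for illustration.

```python
import re

# Hypothetical rule table: (compiled pattern, message, source of the advice).
# These are illustrative stand-ins, not rules taken from proselint.
RULES = [
    (re.compile(r"\bvery unique\b", re.IGNORECASE),
     "'unique' is absolute; drop 'very'.", "hypothetical style guide"),
    (re.compile(r"\bin order to\b", re.IGNORECASE),
     "Prefer 'to'.", "hypothetical style guide"),
]


def lint(text):
    """Return (offset, matched_text, message) for every rule match, sorted by offset."""
    hits = []
    for pattern, message, _source in RULES:
        for match in pattern.finditer(text):
            hits.append((match.start(), match.group(0), message))
    return sorted(hits)


if __name__ == "__main__":
    for offset, matched, message in lint("This is very unique. In order to win, try."):
        print(offset, repr(matched), message)
```

All of the linguistic judgment lives in the patterns and messages a person wrote; the program itself only matches strings.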

We are interested in extensions toward robust machine NLP approaches that can still meet the other goals of proselint, and this is an active area of research, but it presents many challenges (including some I mention in response to your third question).

> Is it possible to specify new rules in a high-level way?

In short, no, but we are actively working on a rule-templating engine for exactly this purpose. "High-level" is subjective, though, so there may always be someone who wants a level higher than whatever interface we provide at the time.

> Can it learn from examples?

In a sense, yes: all of the rules have been learned by people from the example text in guides and translated into linting rules. But I do not think that was your intended question.

If instead you mean that you would provide a set of examples of your writing and it would induce a rule, then no, it does not do that currently, and may not for quite some time.

Stylistic rule induction is a difficult – though interesting – problem (as is rule induction more generally). It is not something we are intrinsically opposed to, but the simplest version of learning from examples would violate two core principles of the design of proselint.

First, our rules are taken from and organised around the advice provided by respected authors in their writing on linguistic style.

Second, any inductive method will be intrinsically uncertain about the rules that it induces. This uncertainty will always be opposed to our aim of having a low false alarm rate, making inductive methods possible but subject to extensive tuning and testing. This suggests that further development of a test set outside of the examples provided would be needed, to ensure coverage of any of the rules that the examples would suggest inducing.
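As a rough illustration of the testing burden, the sketch below measures how often a rule fires on text known to be fine. Everything in it is an assumption for the sake of the example: the `false_alarm_rate` helper, the over-general `induced_rule`, and the four labeled sentences are hypothetical, and "false alarm rate" is taken here to mean the fraction of clean examples that get flagged.

```python
def false_alarm_rate(checker, labeled_examples):
    """Fraction of known-clean examples that the checker flags anyway.

    labeled_examples is a list of (text, has_error) pairs.
    """
    clean = [text for text, has_error in labeled_examples if not has_error]
    if not clean:
        raise ValueError("need at least one clean example")
    false_alarms = sum(1 for text in clean if checker(text))
    return false_alarms / len(clean)


def induced_rule(text):
    # Hypothetical over-general rule that an inductive method might produce:
    # flag every use of "literally", whether or not it is a misuse.
    return "literally" in text.lower()


# Hypothetical held-out test set, separate from the examples used for induction.
EXAMPLES = [
    ("I literally died laughing.", True),
    ("Literally nobody came to the party.", True),
    ("She translated the phrase literally.", False),
    ("The report is finished.", False),
]

if __name__ == "__main__":
    print(false_alarm_rate(induced_rule, EXAMPLES))
```

The induced rule catches both misuses but also flags one of the two clean sentences, which is the kind of result a held-out test set would need to surface before such a rule could ship.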

Additionally, almost all state-of-the-art machine learning systems would require a set of relevant labeled examples of usage errors and non-errors that would somehow generalise to the examples that you would like to provide. Even specifying the data format would be difficult. If you have any insights into how this could be done, please share them below; they can only help progress in this direction.
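Purely as a starting point for discussion, one possible shape for such labeled data is a span-level record stored as JSON lines. This is a sketch, not a proposal from the proselint project: the `LabeledSpan` type, its field names, and the helper functions are all hypothetical.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class LabeledSpan:
    """One labeled usage example (hypothetical format)."""
    text: str        # the full sentence or paragraph, kept for context
    start: int       # character offset where the span of interest begins
    end: int         # character offset just past the span
    is_error: bool   # True for a usage error, False for an acceptable use
    reference: str   # the authority the label is based on

    def span(self):
        return self.text[self.start:self.end]


def to_jsonl(examples):
    """Serialize examples as one JSON object per line."""
    return "\n".join(json.dumps(asdict(e), sort_keys=True) for e in examples)


def from_jsonl(blob):
    """Parse the output of to_jsonl back into LabeledSpan objects."""
    return [LabeledSpan(**json.loads(line)) for line in blob.splitlines() if line.strip()]


if __name__ == "__main__":
    sentence = "She translated the phrase literally."
    start = sentence.index("literally")
    example = LabeledSpan(sentence, start, start + len("literally"), False,
                          "hypothetical style guide")
    print(to_jsonl([example]))
```

Recording non-errors alongside errors, with the surrounding text and a reference, matters because a learner needs both to tell an acceptable use from a misuse of the same word.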

> Does it work on a sentence-by-sentence basis only, or does it "grasp" complete paragraphs?

I think the easiest way to answer this question is to see it in action at this website: http://proselint.com/write/

I should mention that longer-range dependencies require greater computational power, which brushes up against another aim of proselint: to be fast enough to run on reasonably large files as a real-time linter. That constraint may not hold for every instantiation of proselint, but for now it does.

If you have paragraph-level rules that you might want to suggest (like the issue I just created while writing this response: https://github.com/amperser/proselint/issues/310), please do! It is even more helpful if you can find an authoritative reference to include as part of your issue, because one will be needed to incorporate the rule into proselint.