by GertG 5238 days ago
I like the idea, but have some remarks/questions:

1. I don't understand some of the rules at all, possibly because I haven't read "Style: Toward Clarity and Grace". In particular, the that/which rules seem to detect any that or which after two words. Why?

2. With just the rules that you've implemented and from your example output, it's already very clear that this approach is too crude to be practical. It simply gives too many false positives. For example, of the 5 "smells" found by the a/an rule, 4 are false positives.

To fix this rule, and many others, in a satisfying manner, you'll need to go beyond "knowledge-free" regular expression parsing and add linguistic knowledge to the system (word lists, part of speech tagging, pronunciation, syntactical parsing etc.).

For one example where you'd really need not just annotated word lists but a proper grammar, look at the false positive for the "No serial comma" rule: "In traditional markets, buyers and sellers are responsible for making".

3. I really like the fact that the rules are easily extensible. It might be challenging to maintain that while allowing for the kind of linguistically rich rules proposed above.

4. Even in its current form, the system could do with some indication of weight and/or certainty. Obviously, not all rules are equally important, nor equally reliable. After adding linguistic knowledge, this would become all the more important, since linguistic ambiguity makes many rules probabilistic by definition.
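To make point 4 concrete, here is a minimal sketch of what per-rule weight and certainty could look like. Everything in it is an assumption for illustration: the `Rule` and `Hit` names, the two regexes, and the numeric values are made up and are not taken from the actual tool.

```python
import re
from dataclasses import dataclass


@dataclass
class Rule:
    name: str
    pattern: str
    weight: float      # hypothetical: how much the issue matters, 0..1
    confidence: float  # hypothetical: how often a hit is a real problem, 0..1


@dataclass
class Hit:
    rule: str
    text: str
    score: float


# Illustrative rules only. The a/an regex is deliberately "knowledge-free":
# it also flags correct phrases such as "a university".
RULES = [
    Rule("a/an", r"\ba\s+[aeiouAEIOU]\w*", weight=0.8, confidence=0.2),
    Rule("that/which", r"\b(?:that|which)\b", weight=0.3, confidence=0.1),
]


def find_hits(text, rules=RULES, min_score=0.0):
    """Return hits sorted by weight * confidence, highest first."""
    hits = []
    for rule in rules:
        score = rule.weight * rule.confidence
        if score < min_score:
            continue  # the whole rule falls below the reporting threshold
        for match in re.finditer(rule.pattern, text):
            hits.append(Hit(rule.name, match.group(0), score))
    return sorted(hits, key=lambda h: h.score, reverse=True)
```

Calling `find_hits(text, min_score=0.1)` would then hide the low-certainty that/which hits while still reporting a/an. With real linguistic knowledge the confidence would presumably be computed per hit rather than fixed per rule.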


Thanks - a few responses:

1. For that/which:

http://www.kentlaw.edu/academics/lrw/grinker/LwtaThat_Versus...

I should note that some of these "rules" are bogus (a point made by Williams) and are regularly broken by great writers. Williams argues that writers should feel free to break them, but that they should be aware of them and of the possibility that they will be judged by others who are not aware of these rules' dubious status.

2. I agree it would be impractical if someone mechanically applied all the "hits" and treated them as changes to be made, but that's not what it is for. The idea is that the writer reviews the hits and makes a judgement call in each case. For example, in the serial comma case, a smart writer would see that "oh, this is fine" and continue. So I think of the high false positive rate as a plus, esp. for the kind of writing this is most useful for (i.e., stuff you're going to publish & need to proofread/edit the heck out of).

I definitely agree that more linguistic knowledge would be awesome (and my Mathematica version had some of this).

that/which is probably there to highlight the dubious restrictive vs non-restrictive "rule": http://andromeda.rutgers.edu/~jlynch/Writing/t.html#that
yep - and I agree about the dubiousness (see my comment above).
To be anything close to useful, this system will need to accept input from reviewers and build a library of white-listed exceptions to whatever the final formulation of these rules is. Those exceptions need to be weighted, too, since many of the rules are stylistic or subjective in nature.
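
A rough sketch of how such an exception library might work, assuming the simplest possible weighting: each reviewer dismissal counts as one vote, and a phrase is white-listed once it collects enough votes. The `ExceptionLibrary` class, the `a_an_hits` helper, the regex, and the threshold are all hypothetical and not part of the tool under discussion.

```python
import re
from collections import Counter

# Naive a/an check: "a" followed by a word starting with a vowel letter.
A_BEFORE_VOWEL = re.compile(r"\ba\s+([aeiou]\w*)", re.IGNORECASE)


class ExceptionLibrary:
    """Collects reviewer dismissals; the vote count acts as the weight."""

    def __init__(self, threshold=2):
        self.votes = Counter()
        self.threshold = threshold  # hypothetical cut-off for white-listing

    def dismiss(self, rule, phrase):
        # A reviewer marked this hit as a false positive.
        self.votes[(rule, phrase.lower())] += 1

    def is_whitelisted(self, rule, phrase):
        return self.votes[(rule, phrase.lower())] >= self.threshold


def a_an_hits(text, library):
    """Return a/an hits whose flagged word has not been white-listed."""
    hits = []
    for match in A_BEFORE_VOWEL.finditer(text):
        word = match.group(1)
        if not library.is_whitelisted("a/an", word):
            hits.append(match.group(0))
    return hits
```

With this, "a university" keeps being reported until two reviewers have dismissed it, after which only genuine candidates such as "a apple" remain. A real system would likely want per-reviewer trust and per-context exceptions rather than a bare vote count.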