| The "analyze" feature works pretty well. My comments underindex on "this" - because I have drilled into my communication style never to use pronouns without clear one-word antecedents, meaning I use "this" less frequently than I would otherwise. They also underindex on "should" - a word I have drilled OUT of my communication style, since it is judgy and triggers a defensive reaction in others when used. (If required, I prefer "ought to.") My comments also underindex on personal pronouns (I, my). Again, my thought on good, interesting writing is that these are to be avoided. In case anyone cares. |
I suppose it's possible the "analyze"-reported proportions are a lot more precise and reliably diagnostic than I imagine. I haven't yet looked in detail at the statistical method.
Also, of course, it would require integration with NLP tooling such as WordNet (or whatever's SOTA there something like a decade and a half on) and a bit of Porter stemming to do part-of-speech tagging. If one 0.7GB dataset is heavyweight where this is running, that could be a nonstarter; stemming is trivial, and I recall WordNet being acceptably fast, if maybe memory hungry, on a kinda crappy laptop a decade ago, but I could see it requiring some expensive materialization just to get datasets to inspect. (How exactly do we define "more common" for, e.g., "smooth"? Versus semantic words, all words, both, or some combination? Do we need another dataset filtered to semantic words? Etc.)
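To make the denominator question concrete, here is a rough sketch of how the same word can index differently depending on whether its rate is measured against all words or only against semantic (content) words. Everything in it is an assumption for illustration: the `FUNCTION_WORDS` list is a tiny hand-picked stand-in for real part-of-speech tagging, the function names are made up, and none of this is claimed to match what "analyze" actually computes.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny function-word list. A real version would
# lean on a part-of-speech tagger or a proper stopword list instead.
FUNCTION_WORDS = {
    "the", "a", "an", "this", "that", "i", "my", "is", "it",
    "of", "to", "and", "should",
}

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def rate(word, tokens, content_only=False):
    """Share of tokens equal to `word`.

    With content_only=True, function words are dropped first, so the
    denominator counts only the (assumed) semantic words.
    """
    if content_only:
        tokens = [t for t in tokens if t not in FUNCTION_WORDS]
    if not tokens:
        return 0.0
    return Counter(tokens)[word] / len(tokens)

def index_ratio(word, mine, baseline, content_only=False):
    """Ratio of my rate to the baseline rate.

    Above 1 reads as overindexed, below 1 as underindexed. Returns None
    when the baseline never uses the word, since the ratio is undefined.
    """
    base = rate(word, baseline, content_only)
    if base == 0:
        return None
    return rate(word, mine, content_only) / base

if __name__ == "__main__":
    mine = tokenize("the smooth road is smooth")
    baseline = tokenize("this is a smooth and quiet road")
    print(index_ratio("smooth", mine, baseline))
    print(index_ratio("smooth", mine, baseline, content_only=True))
```

On that toy input, "smooth" comes out at about 2.8x the baseline against all words but 2.0x against content words only, so the choice of denominator visibly changes how "more common" a word looks.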
If we're dreaming and I can also have a pony, then it would be neat to see three flavors: the current one, one focused on semantics as above, and one focused specifically on syntax, which is how the current one coincidentally often seems to behave. I would be tempted to offer an implementation, but I'm allergic to Python this decade.