by sugarfactory 3985 days ago
This might also be useful for combating attempts to reveal authors' identities using forensic linguistics.
3 comments

I've thought about that problem a bit.

First of all, forensic linguistics is much, much less powerful than it's made out to be. In particular, there's no really good way to get a confidence estimate out of the prediction. All you can get is a likelihood ratio between N different suspects, which ranks the candidates you supplied but says nothing about whether the real author is among them. You can't really get "definitely a match" vs. "I don't know".
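
To make the "likelihood ratio between N suspects" point concrete, here is a rough sketch, not any real forensic tool. It assumes a toy unigram model with add-one smoothing and a uniform prior; the names `log_likelihood` and `relative_posteriors` are made up for illustration. The point is only that the normalized scores always sum to 1 over the suspects you supplied, so there is no built-in "none of the above".

```python
import math
from collections import Counter


def tokenize(text):
    # Deliberately crude: lowercase and split on whitespace.
    return text.lower().split()


def log_likelihood(tokens, model_counts, vocab_size):
    # Log-probability of the tokens under a unigram model with add-one smoothing.
    total = sum(model_counts.values())
    return sum(
        math.log((model_counts[t] + 1) / (total + vocab_size)) for t in tokens
    )


def relative_posteriors(disputed, suspects):
    """suspects maps a name to a writing sample; a uniform prior is assumed."""
    tokens = tokenize(disputed)
    models = {name: Counter(tokenize(text)) for name, text in suspects.items()}
    vocab = set(tokens)
    for counts in models.values():
        vocab.update(counts)
    scores = {
        name: log_likelihood(tokens, counts, len(vocab))
        for name, counts in models.items()
    }
    # Subtract the best score before exponentiating to avoid underflow.
    top = max(scores.values())
    weights = {name: math.exp(s - top) for name, s in scores.items()}
    z = sum(weights.values())
    return {name: w / z for name, w in weights.items()}


suspects = {
    "alice": "the cat sat on the mat and the dog slept",
    "bob": "whereas i would argue that stocks rise whenever rates fall",
}
# Written by neither suspect, yet the probabilities still sum to 1.
unrelated = relative_posteriors("quantum flux capacitors hum loudly", suspects)
```

Whatever text you feed in, one of the suspects comes out "most likely", which is exactly the missing "I don't know" answer.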

Anyway. The best solution would be to adopt some constraint, e.g. being forced to write in haikus, being forced to write in "upgoer 5 style" Basic English, or being forced to write only sentences that Google Translate stably round-trips English<->French.
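
A restricted-vocabulary constraint like the "upgoer 5 style" one could be enforced mechanically. The sketch below is only an illustration: `ALLOWED` is a tiny hypothetical stand-in for a real word list, and `words_outside_vocabulary` is a made-up helper name.

```python
import re

# Hypothetical stand-in; a real constraint would load a full allowed-word list.
ALLOWED = {
    "the", "a", "is", "are", "it", "this", "that", "and", "of", "to",
    "big", "small", "thing", "good", "bad", "make", "go", "up", "not",
}


def words_outside_vocabulary(text, allowed=ALLOWED):
    # Return each disallowed word once, in order of first appearance.
    seen = set()
    flagged = []
    for word in re.findall(r"[a-z']+", text.lower()):
        if word not in allowed and word not in seen:
            seen.add(word)
            flagged.append(word)
    return flagged


flagged = words_outside_vocabulary("This enormous thing is not good")
```

An editor plugin could refuse to post until the flagged list is empty, forcing a rewrite of any sentence that leans on the author's habitual vocabulary.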

The accuracy of a forensic linguistic algorithm trained on normal text and run on the stylistically constrained text is completely unknown. Hopefully this evidence would be inadmissible even by forensic science standards. Then again, maybe not.

Not really. Authorship attribution methods typically rely on function words (the most frequent words), which cannot be readily substituted, as that would require rewriting the whole sentence.
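
For anyone who hasn't seen function-word features, here is a minimal sketch of the idea. The short `FUNCTION_WORDS` list and both helper names are assumptions for illustration; real attribution systems use far longer lists and more careful statistics than plain cosine similarity.

```python
import math
import re
from collections import Counter

# Illustrative subset only; real systems use hundreds of function words.
FUNCTION_WORDS = [
    "the", "of", "and", "to", "a", "in", "that", "is",
    "it", "for", "with", "as", "but", "on", "not",
]


def function_word_profile(text, words=FUNCTION_WORDS):
    # Relative frequency of each function word among all tokens in the text.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1
    return [counts[w] / total for w in words]


def cosine_similarity(p, q):
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0


# Swapping content words leaves the function-word profile untouched.
original = function_word_profile("the cat sat on the mat")
substituted = function_word_profile("the dog lay on the rug")
```

Here `original` and `substituted` come out identical, which is the problem in miniature: synonym substitution changes the nouns and verbs but not the features these methods look at.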
How would you ensure that you replace the right parts?
I don't think you can be sure. You could train authorship attribution on many kinds of features. But checking against common methods would probably go a long way toward avoiding detection.