Hacker News
by deathwarmedover 2170 days ago
When the author started with "this is not fool-proof by any means" the first thing that came to my mind was linguistic fingerprinting.

He does advise using Google Translate to translate the content from the original language into a different one and then back to alleviate this.
I think this is a great first stab at the problem, but for two reasons I think a robust solution needs more work:

- The first is that, as someone else pointed out, Google is almost certainly logging your translation queries.

- Secondly, even if you do it offline (as someone else suggested) the approach itself might not work. Success in linguistic forensics isn't based (as we might naively assume) on catching obscure words that a particular individual has a tendency to overuse. It's based on subtle shifts in the relative frequency of function words. Depending on the proximity of the source and target language, round-trip machine translation might not change this.
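To make the function-word point concrete, here is a minimal sketch of how one might compare two texts by the relative frequency of a handful of common English words. The word list, the function names, and the distance measure are all assumptions chosen for illustration; real stylometric work uses much longer lists and more careful statistics.

```python
import re
from collections import Counter

# Small illustrative set of English function words (an assumption;
# real studies typically use lists of a few hundred words).
FUNCTION_WORDS = ("the", "of", "and", "to", "a", "in", "that", "it", "but", "on")

def function_word_profile(text):
    """Return the relative frequency of each function word in `text`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty input
    return [counts[w] / total for w in FUNCTION_WORDS]

def profile_distance(text_a, text_b):
    """Mean absolute difference between two profiles (smaller = more similar)."""
    pa = function_word_profile(text_a)
    pb = function_word_profile(text_b)
    return sum(abs(x - y) for x, y in zip(pa, pb)) / len(FUNCTION_WORDS)
```

If a round trip through a translator leaves `profile_distance(original, round_tripped)` close to zero, the translation has probably not hidden much of this particular signal.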

In forensic linguistics you typically measure a lot of metrics, not just word frequencies: use of punctuation and whitespace, sentence lengths and structures, etc. Attribution also isn't the only use of forensic linguistics. You can also look at influences: ideas, people, publications, etc. For instance, you can do this to infer something about the reader or to analyze influence networks.
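As a rough sketch of what a few of those other metrics might look like in code, assuming a naive sentence splitter and three features picked purely for illustration (the function name and feature names are hypothetical):

```python
import re

def style_features(text):
    """Compute a few simple, illustrative style features for `text`."""
    # Naive sentence split on runs of ., ! or ? (an assumption; it
    # mishandles abbreviations, decimals, and so on).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"\w+", text)
    n_sent = len(sentences) or 1  # avoid division by zero
    return {
        "avg_sentence_len": len(words) / n_sent,
        "commas_per_sentence": text.count(",") / n_sent,
        # Whitespace habit: two spaces after a period.
        "double_space_after_period": text.count(".  "),
    }
```

Features like these would normally be combined with word frequencies rather than used on their own.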

I got interested in forensic linguistics many years ago when an article in a somewhat shady publication mentioned me. I got curious and started reading anything I could find on the topic. I was eventually able to identify the author, but mostly by tricking him into admitting it after I had a ranked list of candidates. He was second on a list of about 4-5 people (out of a candidate set of perhaps 300). Not half bad for the rather crude methods I used. I was rather pleased with myself.

I've used similar techniques later to look at influence networks in companies.

Interestingly, at Google Translate now:

Upcoming changes to history

Translation history will soon only be available when you are signed in and will be centrally managed within My Activity. Past history will be cleared during this upgrade, so make sure to save translations you want to remember for ease of access later.

Ha, "Hey Google, NSA here, do you have the server logs of people translating this passage?"

Wait, why ask Google, they probably can just look in their own surveillance database.

Geez, if Google Translate queries are logged, that's... a lot of information.

I guess you could skirt around this by using something to tag the various parts of speech in your original text (using something like Python's NLTK) and replace them with randomly picked synonyms from a thesaurus?

Pretty sure it would obscure the original writer although possibly at the cost of obscuring the original meaning.
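A minimal sketch of the synonym-swapping idea, with some loud assumptions: it skips part-of-speech tagging entirely and uses a tiny hand-written thesaurus (`TOY_THESAURUS` is hypothetical) so that it runs without any downloads. With NLTK you would presumably tag with `nltk.pos_tag` and look up candidates in its WordNet corpus instead.

```python
import random
import re

# Hypothetical toy thesaurus for illustration only; a real one would be
# much larger and ideally keyed by part of speech.
TOY_THESAURUS = {
    "big": ["large", "huge"],
    "quick": ["fast", "rapid"],
    "said": ["stated", "remarked"],
}

def swap_synonyms(text, thesaurus=TOY_THESAURUS, rng=None):
    """Replace each word found in `thesaurus` with a randomly picked synonym."""
    rng = rng or random.Random()

    def replace(match):
        word = match.group(0)
        options = thesaurus.get(word.lower())
        if not options:
            return word  # no known synonym: leave the word untouched
        choice = rng.choice(options)
        # Preserve a leading capital letter.
        return choice.capitalize() if word[0].isupper() else choice

    return re.sub(r"[A-Za-z']+", replace, text)
```

Note that a thesaurus like this only touches content words. Words such as "the", "of" and "but" have no useful synonyms, so their relative frequencies survive the swap unchanged.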

If you use wordpress.com from Tor and use Google Translate from Tor, what more do they learn about you than if you just used wordpress.com?

(I have no clue)

I think what we’re concluding here is that using Google to obscure the linguistic style is flawed, because a state actor could obtain the original linguistic style from Google records, or from their own records of snooped traffic.

In other words: the blog should find a way to obscure linguistic style offline.

They can see the original text, which can then be analyzed.
Wonderful. Then someone DEFINITELY won't want to read your blog.