Very interesting tool, although storing the data as typos does seem a bit visible and prone to being 'corrected' by mistake. Other approaches to consider might be:
* Changing punctuation for visually identical, but different characters. This would not work for printed documents however.
* Encoding only 'believable' typos, e.g. it's/its. You could encode a binary stream across all instances of it(')s, or other substitutions.
* Encoding the stream in whitespace, e.g. two vs. one space after a full stop. Printed documents would be lossy though (full stops at line endings would be ambiguous), but error detection/correction codes can help with that.
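For what it's worth, the whitespace idea can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the function names `embed_bits` and `extract_bits` are made up, one space means 0 and two spaces means 1, and full stops at the end of a line are skipped because they carry no visible spacing.

```python
import re

# A full stop followed by one or more spaces and then more text on the same line.
# Full stops at line endings do not match, so they never carry a bit.
CARRIER = re.compile(r"\. +(?=\S)")

def embed_bits(text, bits):
    """Rewrite each carrier as '. ' (bit 0) or '.  ' (bit 1)."""
    remaining = iter(bits)

    def replace(match):
        bit = next(remaining, None)
        if bit is None:
            return match.group(0)  # payload exhausted: leave spacing untouched
        return ".  " if bit else ". "

    return CARRIER.sub(replace, text)

def extract_bits(text):
    """Read one bit per carrier: two or more spaces -> 1, one space -> 0."""
    return [1 if len(m.group(0)) > 2 else 0 for m in CARRIER.finditer(text)]

stego = embed_bits("One. Two. Three. Four.", [1, 0, 1])
print(repr(stego))
print(extract_bits(stego))
```

The receiver needs to know the payload length (or a terminator), since any carriers past the end of the payload still decode as bits.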
Yeah, I need to work on making the displacements and replacements a bit more context-aware (& probably linguistically aware). There are cases where it can "replace" a character with the same character, for example.
I do like your idea about visually similar but distinct character replacement. That would be a really fun one to implement.
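A rough sketch of how that could look, not how the tool does it: the mapping below uses three Latin letters and their Cyrillic look-alikes (U+0430, U+0435, U+043E), and the names `embed_homoglyphs` / `extract_homoglyphs` are hypothetical. It assumes the cover text contains no Cyrillic to begin with.

```python
# Latin letter -> visually similar Cyrillic letter (an illustrative subset).
HOMOGLYPHS = {"a": "\u0430", "e": "\u0435", "o": "\u043e"}
REVERSE = {look_alike: latin for latin, look_alike in HOMOGLYPHS.items()}

def embed_homoglyphs(text, bits):
    """Each a/e/o carries one bit: keep the Latin letter for 0, swap it for 1."""
    remaining = iter(bits)
    out = []
    for ch in text:
        if ch in HOMOGLYPHS:
            bit = next(remaining, 0)  # pad with zeros once the payload runs out
            out.append(HOMOGLYPHS[ch] if bit else ch)
        else:
            out.append(ch)
    return "".join(out)

def extract_homoglyphs(text):
    """Recover one bit per carrier character, in order."""
    bits = []
    for ch in text:
        if ch in HOMOGLYPHS:
            bits.append(0)
        elif ch in REVERSE:
            bits.append(1)
    return bits

stego = embed_homoglyphs("a cat ate", [1, 0, 1])
print(extract_homoglyphs(stego)[:3])
```

One caveat: anything that normalises or spell-checks the text is likely to flag or undo the swapped characters, so this is fragile in the same way the typo approach is.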
I worked on something very similar; my version also mutated punctuation, swapped common phrases/words for synonyms, and re-ordered sentences. Instead of steganography, the purpose was to create identifiable mutations in the text, acting as a canary to tie disclosures back to specific recipients. Each party receiving a confidential document got slight mutations unique to their copy, and a copy/paste of even a fairly small fragment could be used to identify the owner of that version.
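To give a flavour of the idea (this is a toy sketch, not the system described above): assume a fixed list of synonym pairs, a master text that only uses the first word of each pair, and a numeric recipient ID whose bits decide which pairs get swapped. `PAIRS`, `mark`, and `identify` are names invented for the example.

```python
# Hypothetical synonym pairs; bit i of the recipient ID selects the variant of pair i.
PAIRS = [("big", "large"), ("quick", "fast"), ("begin", "start")]

def mark(text, recipient_id):
    """Produce the recipient's copy by swapping in the synonyms their ID selects."""
    words = text.split()
    for i, (plain, alt) in enumerate(PAIRS):
        if (recipient_id >> i) & 1:
            words = [alt if w == plain else w for w in words]
    return " ".join(words)

def identify(fragment, recipient_ids):
    """Return every recipient whose copy is consistent with the leaked fragment."""
    seen = set(fragment.split())
    matches = []
    for rid in recipient_ids:
        consistent = True
        for i, (plain, alt) in enumerate(PAIRS):
            # The variant this recipient should NOT have in their copy.
            unexpected = plain if (rid >> i) & 1 else alt
            if unexpected in seen:
                consistent = False
        if consistent:
            matches.append(rid)
    return matches

doc = "we begin with a big change and a quick fix"
leaked = mark(doc, 5)
print(leaked)
print(identify(leaked, range(8)))
print(identify("a large change and a quick fix", range(8)))
```

A fragment that covers only some of the pairs narrows the candidates rather than pinning down one, so the mutations need to be dense enough that a short excerpt still covers enough of them.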
No, sorry. It was built to catch an employee leaking confidential company information to the media. I don't know how you could make this into a product and still maintain its reliability: the more widely known the mutations are, the easier it would be to mitigate the watermarking.
Oh, very cool! I like the data model for the changes. I've been thinking about adding an analysis pass using something similar to make it possible to implement more sophisticated strategies. The tricky bit will be retaining the stream-based approach.