by jinseokim 1604 days ago
Be aware that deleting metadata is never enough: there are too many ways to hide fingerprints in a PDF document.

All you need to do is download two documents and diff them, and delete anything that doesn't overlap.
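A minimal sketch of the diff-and-strip idea above, assuming the two copies are available as raw bytes and that any per-copy marks appear as differing byte runs. The function name `common_parts` and the sample data are made up for illustration. It ignores PDF structure (compressed streams, cross-reference offsets), so on a real file the output would likely not be a valid PDF.

```python
from difflib import SequenceMatcher


def common_parts(a: bytes, b: bytes) -> bytes:
    """Keep only the byte runs that appear, in order, in both copies."""
    # autojunk=False disables difflib's "popular element" heuristic,
    # which could otherwise skip frequently occurring bytes on long inputs.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return b"".join(a[m.a:m.a + m.size] for m in matcher.get_matching_blocks())


# Hypothetical stand-ins for two downloads of the same document.
copy_one = b"%PDF-1.7 title=Paper user=alice body"
copy_two = b"%PDF-1.7 title=Paper user=bob body"

stripped = common_parts(copy_one, copy_two)
print(stripped)  # b'%PDF-1.7 title=Paper user= body'
```

Anything embedded identically in both copies survives this step, which is the weakness the replies further down pick at.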
You'd need to download from different identities; if I were them I'd be injecting user, IP, organization, date, and a signed hash thereof (tamper evidence in case someone does something like change a digit in the IP).
The signed hash doesn't matter because you only need to de-identify the document, not pass it off as someone else's. If the organization finds a document with all of the identifying information removed, they know that someone fucked with their DRM but they don't know who.
My thought was that if the publisher is trying to hunt people sharing copies, and they have such a copy, it would be useful to be confident that the metadata you embedded is actually accurate; sure, it's obvious if, say, the IP field is zeroed out, but what if they just changed the last octet to 7, and that results in you spending weeks leaning on an ISP to give you the identity of the wrong person? Granted, that's probably more care than Elsevier is likely to take, but the point is that they're passing data through hostile hands, so it'd be sensible to do something for integrity checking.
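A rough sketch of that kind of integrity check, assuming the publisher keeps a secret key and stores a tag next to the embedded record. It uses an HMAC (a keyed hash) rather than a public-key signature, which is a substitution on my part; the key, field layout, and values are all hypothetical.

```python
import hashlib
import hmac

SECRET_KEY = b"publisher-secret"  # hypothetical key, known only to the publisher


def sign(record: str) -> str:
    """Return a hex HMAC-SHA256 tag over the embedded record."""
    return hmac.new(SECRET_KEY, record.encode(), hashlib.sha256).hexdigest()


def verify(record: str, tag: str) -> bool:
    """Check the tag in constant time; False means the record was altered."""
    return hmac.compare_digest(sign(record), tag)


# Hypothetical embedded record (203.0.113.0/24 is a documentation range).
record = "user=alice;ip=203.0.113.42;org=Example University;date=2021-01-01"
tag = sign(record)

# Someone changes the last octet to 7 before sharing the file.
tampered = record.replace("203.0.113.42", "203.0.113.7")

print(verify(record, tag))    # True
print(verify(tampered, tag))  # False
```

The tag only tells the publisher whether the record is intact. As noted above, it does nothing against someone who removes the record and the tag together.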
Not guaranteed to work. Look up steganography.
Applying SHA256 to 2 different copies of a PDF and receiving the same hash is deterministic proof that uniquely identifying steganographic techniques have not been used.
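For illustration, the comparison could look something like this, assuming both copies are in memory as bytes and that SHA-256 collisions are not practically findable. The helper names and sample bytes are made up.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Hex SHA-256 digest of the raw file contents."""
    return hashlib.sha256(data).hexdigest()


def copies_identical(a: bytes, b: bytes) -> bool:
    """True if the two copies hash to the same value."""
    return sha256_hex(a) == sha256_hex(b)


# Hypothetical copies: the second carries one extra hidden byte.
clean = b"%PDF-1.7 same content"
marked = b"%PDF-1.7 same content "

print(copies_identical(clean, clean))   # True
print(copies_identical(clean, marked))  # False
```

Matching hashes only show that these two copies are byte-for-byte the same, so any mark present in both goes undetected.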
That doesn't account for any overlaps in tracking data for groups of users.

Instead of a single per-user unique value, I could use several values that track different groups of users. The set of values together would uniquely identify a user, but for any 2 PDFs there would be at least one shared group value that would exist in both.

Using your method, leaking a single PDF would identify a group containing the 2 users of the PDFs you compared. If the groups are randomized for each new article, every PDF you leak would further identify you as the common member of the leaking groups.
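A toy model of that narrowing, assuming three binary marker fields per article and four users. The field names, users, and assignments are invented for illustration. In each article every user has a unique combination, yet any two users share exactly one field value.

```python
def implicated(assignments: dict, copy_a: dict, copy_b: dict) -> set:
    """Users whose markers match every value that survives diffing two copies."""
    # A diff strips fields that differ, so only shared values remain in the leak.
    shared = {f: v for f, v in copy_a.items() if copy_b[f] == v}
    return {
        user
        for user, marks in assignments.items()
        if all(marks[f] == v for f, v in shared.items())
    }


# Hypothetical per-article marker assignments, reshuffled between articles.
article_1 = {
    "alice": {"f1": 0, "f2": 0, "f3": 0},
    "bob":   {"f1": 0, "f2": 1, "f3": 1},
    "carol": {"f1": 1, "f2": 0, "f3": 1},
    "dave":  {"f1": 1, "f2": 1, "f3": 0},
}
article_2 = {
    "alice": {"f1": 0, "f2": 0, "f3": 0},
    "carol": {"f1": 0, "f2": 1, "f3": 1},
    "dave":  {"f1": 1, "f2": 0, "f3": 1},
    "bob":   {"f1": 1, "f2": 1, "f3": 0},
}

# alice diffs her copy against bob's for article 1, and against carol's for article 2.
leak_1 = implicated(article_1, article_1["alice"], article_1["bob"])
leak_2 = implicated(article_2, article_2["alice"], article_2["carol"])

print(sorted(leak_1))           # ['alice', 'bob']
print(sorted(leak_2))           # ['alice', 'carol']
print(sorted(leak_1 & leak_2))  # ['alice']
```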

This opens up the opportunity for some kind of distributed file submission tool where you can compare hashes of segments of your document with everyone else's documents in some kind of zero-knowledge way, so that no actual piracy happens until enough people submit their document information for the system to create a de-DRMed copy of the document.
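A very rough sketch of the segment-hash comparison, with big caveats: plain hashes are not zero-knowledge, fixed-size segments assume the marks never change a document's length, and the segment size, quorum, and sample documents are all hypothetical. It only shows the bookkeeping for deciding which segments enough submitters agree on.

```python
import hashlib
from collections import Counter

SEGMENT_SIZE = 8  # hypothetical; real documents would need a smarter split


def segment_hashes(doc: bytes, size: int = SEGMENT_SIZE) -> list:
    """SHA-256 of each fixed-size segment; only these hashes get submitted."""
    return [hashlib.sha256(doc[i:i + size]).hexdigest()
            for i in range(0, len(doc), size)]


def agreed_segments(submissions: list, quorum: int) -> list:
    """Per segment position: did at least `quorum` submitters send the same hash?"""
    result = []
    for hashes_at_position in zip(*submissions):
        _, count = Counter(hashes_at_position).most_common(1)[0]
        result.append(count >= quorum)
    return result


# Hypothetical copies that differ only in the middle 8-byte segment.
docs = [
    b"AAAAAAAAuser0001BBBBBBBB",
    b"AAAAAAAAuser0002BBBBBBBB",
    b"AAAAAAAAuser0003BBBBBBBB",
]
submissions = [segment_hashes(d) for d in docs]

print(agreed_segments(submissions, quorum=3))  # [True, False, True]
```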
This is true, but you have to realize there is a built-in tradeoff regarding specificity. The more "resilient" this approach is to being found out by a hash, the less specific the identification will be.
The point is that it is easily defeated by steganography (i.e., your hashes would all be different).
You just need to strip out the parts of the documents that are different until they hash to the same hash.