Hacker News new | ask | show | jobs
by karyon 2264 days ago
There's also one inherent data privacy problem in MVCC that I've been running into.

Suppose you have an app that lets people anonymously vote or comment on stuff, but only once. The vote in the DB must not have any connection to the person. So, you give the person a flag whether or not they voted already, and store the vote separately.

Now, you'd want to set both values in the same transaction for obvious reasons. But, since Postgres uses MVCC, the two tuples that are added to the database both contain the same transaction ID (XID), so there's the connection between user and vote again.

There seems to be no way to instruct postgres to "clean" those XIDs in any way. What we're doing now is periodically and manually updating every tuple in each affected table with dummy changes, essentially duplicating all tuples with all new XIDs, and then running VACUUM to delete the tuples with the old, potentially-deanonymizing XIDs. We haven't found anything easier...

4 comments

> Now, you'd want to set both values in the same transaction for obvious reasons. But, since Postgres uses MVCC, the two tuples that are added to the database both contain the same transaction ID (XID), so there's the connection between user and vote again.

The simple solution is that at each change, you rewrite the whole election, not just the new votes, and clear out all outdated tuples (basically, that you make the "periodic and manual" process you are currently doing automatic and integrated with the "real" transactions rather than additional side process.)

Alternatively, you don't do the changes in the same database transaction but in the same business domain transaction which is managed outside the database, and where any database artifacts related to the management of the business transaction are deleted and vacuumed after the transaction is completed.

Re "same business transaction": That would probably still allow for correlating the votes with the flags because they'll still be added with some XID, which are monotonically increasing. Of course, it would require more effort and probably even guesswork, but still :)

Rewriting the election in the same transaction sounds smarter than the periodic solution. I need to discuss that with the team, thanks!

Interesting issue, never thought about xid being used in that way before!

Rather than store the user ID, can you store a bcrypt hash of it?

What’s the attack vector you are addressing? I guess access is given to the database for outside auditing?

I’m just thinking if the db or app server is compromised anyway then someone could install a trigger to log the table changes, or change the app logic to log the vote somewhere else, etc.

The attack vector would be an attacker getting their hands at a DB dump including XIDs. Without any particular measures, they'd be able to de-anonymize every vote ever made, which is kind of a worst case scenario for an anonymity-focused app. Granted, since usually there are no dumps including XIDs lying around (e.g. backups don't contain them), this would effectively require an attacker to gain access to the database. When they're that far they could of course do all sorts of bad things, but most of those could probably be fixed with recreating the VM with a backup. Deanonymizing existing votes cannot be fixed :)

We also thought about hashing user IDs, but with any practical number of users in the database it would be possible for an attacker to just hash all existing user IDs and checking them all. So that would be more an obfuscation, making de-anonymization harder, but not impossible.

edit: we considered putting the user's password into that hash as well. that would also enable us to let the user edit their vote later, while still retaining anonymity. But then we'd need to ask them for their PW when voting, or we put a hash of the PW into their session data, and we'd need to restrict changing the PW until the election is over, and it seemed not worthwhile.

The other problem is that PostgreSQL doesn't rewrite data in place, so an attacker can determine the order of the user votes from the physical order of data in the database, and the order of issue votes from constantly scraping your website, thus allowing to deanonimize everything.

You need to use another database for this, specifically one designed to always overwrite data in place, and erase the WAL immediately after commit: it should be easy to write it yourself, assuming the dataset fits in RAM and so you don't need any data structures on disk other than a simple array of records.

Also you need to ensure that higher storage layers don't keep snapshots and don't do copy-on-write.

Could also look into a cryptography-based solution, although not sure if there is a feasible one.

Good points. We believe we have fixed the physical-order-and- scraping-issue by using random uuids as primary keys, showing stuff in the UI always ordered by PK, and periodically doing a CLUSTER, which physically rearranges the table after some index (pk in our case).

We haven't thought about higher-order storage layers. I guess we should do that... Thanks!

> There seems to be no way to instruct postgres to "clean" those XIDs in any way

There is, VACUUM FREEZE

Sadly, since 9.4, that does not clear the XID anymore, see that blue box here: https://www.postgresql.org/docs/9.4/routine-vacuuming.html#V...
Thanks, that was news to me. Interestingly replacing XID with frozen one still makes a valid tuple, so it might be just a few lines patch for you to get what you want.