| This is a topic that I'm intimately familiar with, thanks to a bizarre set of circumstances (and a ton of reverse engineering). Story above, technical details below:

Part 1: A couple years ago, I noticed that the number of photos I was tagged in kept going up and down, as a couple of people I knew would disable their accounts occasionally and re-enable them a couple weeks later. I manually saved the images from them, but wanted a way to automatically scrape any images I was tagged in, so I wouldn't need to keep doing this by hand. I got myself a Facebook Graph API key and created a sample app with full account permissions, only to discover that Facebook won't let you export photos you're tagged in (that you didn't take). The numbers the API reports are wrong, and there's no indication that they're being purposely redacted. As a result, I wrote a tool that crawls a profile given a set of authenticated cookies, and essentially clicks the download link automatically on every photo. This worked decently well for a couple years, and continues to work to this day.

Part 2: I had some spare time on my hands in December 2019, and wanted to write a tool to browse chat logs from across a variety of services (Facebook Chat, Hangouts, SMS), such that you'd be able to click a name and see a chronological discussion, regardless of what service it was on. I downloaded the Facebook data dump, figuring that was the easiest way to get access to my Messenger data. The Messenger dump revealed a few things that surprised me:
* The character encoding is messed up (UTF-8 text that was mis-read as Latin-1), and fixing it requires re-encoding as Latin-1, then decoding as UTF-8
* Some messages are straight up missing, despite being in the UI. The dump is supposed to include attachments (images are included), but is missing audio messages / voice snippets, presumably among others.
* If a user has deleted their Facebook account, the username will appear solely as 'Facebook User', so now you need to figure out who you were actually talking to. Some conversations were very obvious, but others wasted a ton of time and involved dumb techniques (like finding Adium logs of the same chat from an old computer).

To identify certain conversations, I started scrolling back through certain Facebook posts (which I wrote), to figure out who had been at certain events with me (to narrow things down). I read a bunch of comment threads that didn't appear to make much sense to me, until I realized that anyone who deletes their account also has their comments removed, so basically all old comment threads are somewhat nonsensical if anyone in the conversation has since deleted their account. For comparison, deleting a reddit account changes the ownership of a comment/post to [deleted], which seems much more appropriate. Presumably wall posts (including happy birthday messages) from people who have since deleted their accounts are also removed, which is exceedingly shitty - if someone sends you a greeting card and then dies several years later, it's not like the post office comes to your house to take your cards back in the middle of the night.

Part 3: Because of this, I figured that the only way to mitigate future data loss on Facebook is to consistently archive things. Since the 'download your data' tool is basically useless, I started work on a tool that scrapes the site and "decompiles" pages into raw directed graph DB rows, which can be re-rendered into a new version of the site.
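
Going back to the encoding problem in the Messenger dump list above: a minimal Python sketch of one way to repair it, assuming the dump is JSON and the damage is UTF-8 bytes that were read as Latin-1. In Python terms that means encoding each string back to Latin-1 bytes and decoding those bytes as UTF-8. The function names and the sample record (including its field names) are made up for illustration, not taken from the tool described here.

```python
import json


def fix_mojibake(text: str) -> str:
    """Undo UTF-8 bytes that were mis-decoded as Latin-1."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Text that is already correct (or damaged some other way) is left alone.
        return text


def fix_all(value):
    """Recursively repair every string in a parsed JSON structure."""
    if isinstance(value, str):
        return fix_mojibake(value)
    if isinstance(value, list):
        return [fix_all(v) for v in value]
    if isinstance(value, dict):
        return {fix_all(k): fix_all(v) for k, v in value.items()}
    return value


# Hypothetical sample record; the field names are assumptions.
raw = (
    '{"sender_name": "Jos\\u00c3\\u00a9", '
    '"content": "caf\\u00c3\\u00a9 \\u00e2\\u0080\\u0093 ok"}'
)
fixed = fix_all(json.loads(raw))
print(fixed)
```

The try/except matters because a string that is already clean but contains non-ASCII characters will usually fail the round trip, and should be returned unchanged rather than crash the import.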
The tool features a reasonably complete implementation of Facebook's TAO (https://www.facebook.com/notes/facebook-engineering/tao-the-...) on top of PostgreSQL, and works decently well - notably, it also maintains things like proper links to profiles and stores all assets offline. Writing a bug-compatible "decompiler"/"recompiler" taught me several things about how the site works (or rather, doesn't). Here's a small list of errata I've discovered along the way:

* Objects can have multiple FBIDs
* FBIDs can contain comments/reactions
* Since there may exist multiple FBIDs for a given object, it's quite common for multiple comment threads to exist for a given item, such that commenters on one don't see the responses on the other (and vice versa). Several of my friends have confirmed finding disjointed discussions on their posts after discovering this bug.
* Facebook has several types of deprecated reactions that they store in the DB, which can no longer be viewed from the site by normal means. Sucks to be you if you reacted to something that way.
* Certain objects can get lost in their UI, with no easy way to find them. Uploading a photo in a post will put it in your Timeline Photos album, but uploading a photo as a comment to someone else's post will basically make it impossible to find again.
* The number of reactions/comments on a given post is often wrong - this isn't the traditional bug due to eventual consistency, but rather is due to not adjusting the counts for items when a person deletes their account. To a certain degree, this will show you how many people who interacted with something have departed the site. |
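
To make the "multiple FBIDs, multiple comment threads" and "wrong counts" points above more concrete, here is a rough sketch of an object/association store in the general spirit of TAO. Everything in it is an assumption for illustration: it uses in-memory SQLite rather than PostgreSQL so it runs anywhere, and the table layout, the `aliases` table, and all method names are hypothetical rather than the schema of the tool described above.

```python
import json
import sqlite3

SCHEMA = """
CREATE TABLE objects (id INTEGER PRIMARY KEY, otype TEXT NOT NULL, data TEXT NOT NULL);
CREATE TABLE assocs (
    id1 INTEGER NOT NULL, atype TEXT NOT NULL, id2 INTEGER NOT NULL,
    time INTEGER NOT NULL, data TEXT NOT NULL,
    PRIMARY KEY (id1, atype, id2)
);
-- Hypothetical: maps an extra FBID to the object it really refers to.
CREATE TABLE aliases (fbid INTEGER PRIMARY KEY, canonical INTEGER NOT NULL);
"""


class GraphStore:
    def __init__(self):
        self.db = sqlite3.connect(":memory:")
        self.db.executescript(SCHEMA)

    def obj_add(self, oid, otype, **data):
        self.db.execute("INSERT INTO objects VALUES (?, ?, ?)",
                        (oid, otype, json.dumps(data)))

    def alias_add(self, fbid, canonical):
        self.db.execute("INSERT INTO aliases VALUES (?, ?)", (fbid, canonical))

    def assoc_add(self, id1, atype, id2, time, **data):
        # Edges are stored under the FBID they were scraped from.
        self.db.execute("INSERT INTO assocs VALUES (?, ?, ?, ?, ?)",
                        (id1, atype, id2, time, json.dumps(data)))

    def fbids_for(self, fbid):
        """Return every known FBID that refers to the same object."""
        row = self.db.execute(
            "SELECT canonical FROM aliases WHERE fbid = ?", (fbid,)).fetchone()
        canonical = row[0] if row else fbid
        rows = self.db.execute(
            "SELECT fbid FROM aliases WHERE canonical = ?", (canonical,)).fetchall()
        return sorted({canonical, *(r[0] for r in rows)})

    def assoc_range(self, id1, atype, merge_aliases=False):
        """Return target ids newest first, optionally across all aliases."""
        ids = self.fbids_for(id1) if merge_aliases else [id1]
        marks = ",".join("?" * len(ids))
        rows = self.db.execute(
            f"SELECT id2 FROM assocs WHERE atype = ? AND id1 IN ({marks}) "
            "ORDER BY time DESC", (atype, *ids)).fetchall()
        return [r[0] for r in rows]


def departed_estimate(reported_count, visible_count):
    """Rough guess at interactions whose authors have since left."""
    return max(reported_count - visible_count, 0)


store = GraphStore()
store.obj_add(100, "photo", caption="party")
store.alias_add(101, 100)  # a second FBID for the same photo
store.obj_add(1, "comment", text="nice")
store.obj_add(2, "comment", text="who is this?")
store.assoc_add(100, "comment", 1, time=10)
store.assoc_add(101, "comment", 2, time=20)

split_a = store.assoc_range(100, "comment")
split_b = store.assoc_range(101, "comment")
merged = store.assoc_range(101, "comment", merge_aliases=True)
print(split_a, split_b, merged, departed_estimate(5, len(merged)))
```

Keeping each edge under the FBID it was found on preserves the split threads as the site shows them, while the alias lookup lets a re-rendered view merge them at query time. The difference between a reported count and the visible rows is only a rough estimate, as noted above.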
I used "Social Book Manager" from the Chrome store to (painfully, repeatedly) flush out all content from my account, down to the last Like. Walk away and your account appears 100% empty until a friend reactivates their account, then all your stuff linked to their content magically reappears for you to try and delete again. So I sat on it for another year, periodically logging back in to flush out whatever hidden content had reappeared.
Having now (this Jan.) deleted my supposedly "empty" 10yr old account, it's good to know the hidden content is being removed properly, as my intent was to 100% scrub myself from their service and boy did I try.