

by mudkip 2303 days ago

There are a few issues that currently prevent me from releasing all of it:

The code is currently in two repositories. The first is a generic runtime I wrote that provides certain subsystems to a bunch of unrelated plugins (only a small percentage of the plugins relate to social networking), so I would either need to open source that repository as well, or port those plugins to something else. While I legally own the code for the first repository, it currently forms the basis of a company I founded (the social network plugins were written as a joke for personal reasons, and have no business use case - it was just easier to do it this way), so I'd rather not release that just yet. I'd either need to replace the runtime calls in those plugins (not too difficult, just annoying), or release the plugins with those calls left unpatched, along with a disclaimer saying something like "drop replacement scheduling service here". Almost all of the runtime calls used by the social networking plugins are just for scheduling how frequently they should re-scrape things; they use almost none of the other features.
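To make the "drop replacement scheduling service here" idea concrete, here is a minimal sketch of what such a stand-in could look like. It is not the runtime described above: the class name `RescrapeScheduler`, its methods, and the plugin names and intervals in the usage note are all hypothetical, and it assumes the plugins only need "run this job every N seconds".

```python
import heapq
from typing import Callable, List, Tuple

# Hypothetical stand-in for the runtime's scheduling service.
# Assumption: plugins only need "re-run this job every `interval` seconds".
class RescrapeScheduler:
    def __init__(self) -> None:
        # Min-heap of (next_run_time, sequence, interval, name, job).
        # The unique sequence number breaks ties so jobs are never compared.
        self._queue: List[Tuple[float, int, float, str, Callable[[], None]]] = []
        self._seq = 0

    def every(self, interval: float, name: str, job: Callable[[], None],
              start: float = 0.0) -> None:
        """Register a job that first runs at `start`, then every `interval` seconds."""
        if interval <= 0:
            raise ValueError("interval must be positive")
        heapq.heappush(self._queue, (start, self._seq, interval, name, job))
        self._seq += 1

    def run_due(self, now: float) -> List[str]:
        """Run every job that is due at time `now`; return the names that ran, in order."""
        ran: List[str] = []
        while self._queue and self._queue[0][0] <= now:
            _, seq, interval, name, job = heapq.heappop(self._queue)
            job()
            ran.append(name)
            # Reschedule relative to `now` so a late run does not cause a backlog.
            heapq.heappush(self._queue, (now + interval, seq, interval, name, job))
        return ran
```

A caller would register each plugin once, for example `scheduler.every(3600, "facebook", scrape_facebook)`, and then call `scheduler.run_due(time.time())` from a loop or a cron-style trigger. Passing the clock in as an argument, rather than reading it inside the class, keeps the sketch easy to test.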

The second repository is in charge of actually viewing/browsing the data, and doesn't concern itself with repeatedly obtaining it. This repository will likely be open sourced soon, once I make the UI nicer and fix some bugs. It also contains some 'one and done' import code, for loading data from sources that only need to be imported once (legacy chat systems like Hangouts, Windows Phone's SMS DB, Adium, Pidgin, and a few IRC clients I no longer use).

As for the social networking plugins in the first repo, they fall into two categories. The first is "things that access legitimate APIs, or otherwise just grab public data (e.g. YouTube)". The second, which is mostly just Facebook, is "this logs in as you with your username/password and downloads a bunch of shit by actively impersonating you". The first category isn't really encumbered in any way; only the second one is.

In the case of Facebook, there are a few specific issues that I have to deal with. The first is that it uses a bunch of exploits in order to scrape everything, and they could theoretically be patched if someone at Facebook sees what I'm doing (which is also partially why I'm not using my real name here). The second is that some of the code may or may not be correct - I've had situations while developing it where I made a basic assumption that turned out to be wrong (like objects having one primary key, comments only going two layers deep, or guessing the author of a post from the username in the URL). Since this is mostly guesswork, I occasionally have to mark a bunch of rows as 'untrusted', and invalidate them if I find that an assumption I made turns out to be very wrong. This has happened a few times, and it wouldn't be nice to tell people that their DB dumps have a bunch of errors and they should just throw them all away. Scraping is hugely problematic if you don't fully understand what you're grabbing (as an example, if you just download the HTML for a given page, you can't necessarily grab the images later, as all the URLs have expiry tokens in them, so the HTML scraper also needs to be aware of photos). There are certain object types that I haven't fully figured out how to decode, and I don't want to have tons of people constantly re-indexing the same URLs over and over again because some code I wrote is buggy.
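The 'untrusted' bookkeeping described above could be sketched roughly as follows. This is an illustration under assumptions, not the actual schema: the table name `scraped`, the `assumption` and `trusted` columns, and the helper functions are all hypothetical. The idea is to tag each row with the parsing assumption it depends on, so that when an assumption turns out to be wrong only the affected rows are flagged and re-scraped.

```python
import sqlite3
from typing import List

def open_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical `scraped` table."""
    db = sqlite3.connect(":memory:")
    db.execute(
        """CREATE TABLE scraped (
               id         INTEGER PRIMARY KEY,
               url        TEXT NOT NULL,
               payload    TEXT NOT NULL,
               assumption TEXT NOT NULL,              -- parsing assumption this row relies on
               trusted    INTEGER NOT NULL DEFAULT 1  -- 1 = trusted, 0 = untrusted
           )"""
    )
    return db

def store(db: sqlite3.Connection, url: str, payload: str, assumption: str) -> None:
    """Save a scraped object together with the assumption used to parse it."""
    db.execute(
        "INSERT INTO scraped (url, payload, assumption) VALUES (?, ?, ?)",
        (url, payload, assumption),
    )

def mark_untrusted(db: sqlite3.Connection, assumption: str) -> int:
    """Flag every still-trusted row that depends on a disproven assumption.

    Returns the number of rows that were newly flagged.
    """
    cur = db.execute(
        "UPDATE scraped SET trusted = 0 WHERE assumption = ? AND trusted = 1",
        (assumption,),
    )
    return cur.rowcount

def urls_to_rescrape(db: sqlite3.Connection) -> List[str]:
    """List the distinct URLs whose rows are currently untrusted."""
    rows = db.execute(
        "SELECT DISTINCT url FROM scraped WHERE trusted = 0 ORDER BY url"
    ).fetchall()
    return [row[0] for row in rows]
```

Keeping the untrusted rows around, instead of deleting them, means the data can be compared against a fresh scrape before anything is thrown away.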