Reverse Engineering OKCupid

	Reverse Engineering OKCupid (davidshimel.com)
	44 points by dsshimel 4898 days ago

5 comments

jacques_chester 4898 days ago

In terms of the backend, OKC have their own webserver, OKWS[0]. It's been discussed on HN before[1].

It's a clever and IMNSHO insufficiently copied architecture with interesting performance and security characteristics.

[0] https://github.com/okws/okws
[1] http://news.ycombinator.com/item?id=2077484

jonathanjaeger 4898 days ago

About two years ago, some clever techies used the OkCupid subreddit to "hack" the OkCupid frontend. If I remember correctly, they used JavaScript to display the number of messages someone received per day and how many they replied to (among other things). Eventually OkCupid came across the info and started moving everything server-side.

dsshimel 4898 days ago

What information were they getting from the subreddit? Just usernames or . . . ?

oijaf888 4898 days ago

The stoplight indicator and a few other things were done via client-side JS with non-obvious variable names. Once that Greasemonkey script came out, it was quickly moved to the server side since there was no reason not to do it there. I believe it was only on the client side for ease of development when it was being built.

jonathanjaeger 4898 days ago

There's an OkCupid subreddit (http://www.reddit.com/r/okcupid) where people ask for advice and exchange funny stories about dates. When more data was stored on the frontend, people would make scripts so you could search for people based on attractiveness and other parameters. For example: http://www.reddit.com/r/OkCupid/comments/qi8iw/understanding...

cup 4898 days ago

You could see how long ago the person received their last message, how many messages they were getting on average, and also, I think, some kind of scale from which you could extrapolate how frequently they wrote back to messages. It was a long time ago but incredibly interesting / invasive.

chacham15 4898 days ago

This method is well known and there are ways of making it more difficult. For example, servers add a hidden request cookie, which is a random number embedded in the page that the user is coming from. This forces you to actually parse the page. Then they can move it into JavaScript, making it even more difficult.
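
A minimal sketch of the "parse the page" step, assuming the random number arrives as a hidden form input. The field name request_token and the sample markup are made up for illustration; a real site will use its own names, and the value may live somewhere else entirely.

```python
from html.parser import HTMLParser


class HiddenInputParser(HTMLParser):
    """Collects name/value pairs from <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.hidden = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        # Attribute values can be None when the attribute has no value.
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.hidden[attr["name"]] = attr.get("value") or ""


def extract_hidden_fields(page_html):
    """Return a dict of every hidden input found in the page."""
    parser = HiddenInputParser()
    parser.feed(page_html)
    return parser.hidden


# Hypothetical page fragment; "request_token" is an invented field name.
SAMPLE_PAGE = """
<form action="/message" method="post">
  <input type="hidden" name="request_token" value="8412907731">
  <input type="text" name="body">
</form>
"""

fields = extract_hidden_fields(SAMPLE_PAGE)
print(fields)  # {'request_token': '8412907731'}
```

The extracted value would then be sent back with the next request, so each request needs a fresh fetch and parse of the page it claims to come from.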

kysol 4898 days ago

I would just like to point out that not all sites do the following (and I'm unsure if OKC does), but watch out for tripwires as chacham15 said. Some sites that I've had the misfortune of "getting to know" use insignificant or blank inputs as a form of detecting unauthorized access.

One rule I follow is to: Retrieve, Analyse and Regurgitate Everything.
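
One way to read that rule, as a rough sketch: collect every named input on the form, blank and hidden ones included, and send them all back, changing only the fields you mean to change. The form below and its field names (website, session_hint) are hypothetical, and this only covers plain input elements, not select or textarea.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class FormFieldParser(HTMLParser):
    """Records every named <input> in document order, including blank ones."""

    def __init__(self):
        super().__init__()
        self.fields = []  # list of (name, value) pairs

    def handle_starttag(self, tag, attrs):
        if tag == "input":
            attr = dict(attrs)
            if attr.get("name"):
                self.fields.append((attr["name"], attr.get("value") or ""))


def build_post_body(page_html, overrides):
    """Echo back every field as served, replacing only the overridden ones."""
    parser = FormFieldParser()
    parser.feed(page_html)
    body = [(name, overrides.get(name, value)) for name, value in parser.fields]
    return urlencode(body)


# Hypothetical form: "website" is a blank, hidden-by-CSS input of the kind
# that could act as a tripwire; "session_hint" is an invented hidden field.
SAMPLE_FORM = """
<form action="/login" method="post">
  <input type="text" name="username" value="">
  <input type="password" name="password" value="">
  <input type="text" name="website" value="" style="display:none">
  <input type="hidden" name="session_hint" value="a1b2c3">
</form>
"""

body = build_post_body(SAMPLE_FORM, {"username": "alice", "password": "s3cret"})
print(body)
# username=alice&password=s3cret&website=&session_hint=a1b2c3
```

The blank website field is sent back empty rather than dropped or filled in, which is the point: the request looks like what a browser would have submitted.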

dsshimel 4898 days ago

By inputs do you mean CAPTCHAs or something else?

simon_weber 4898 days ago

I do a lot of this kind of work for gmusicapi, and I still keep a Windows VM around just to use Fiddler.

Does anyone have recommendations for other tools? I came away from Burp and Charles disappointed in the past, but that was some time ago.

goodside 4898 days ago

The most universally effective solution I've seen is Sikuli running in a VM (which is the only sane way to run it, since it hijacks your input devices). Everything else fails in some edge case. What other tool can scrape an interface that uses both HTML and Flash and is only served over HTTPS?

It is brittle, in that it can be broken by cosmetic UI changes, but the maintenance is generally trivial. Also, it's slow as all hell. But sometimes you really need that sledgehammer.

berlinbrown 4898 days ago

Linus responds...

hnriot 4898 days ago

He does, but sadly his comment is way off. Anyone that's done any amount of HTML scraping will use BeautifulSoup over lxml. The former is easier and more tolerant of HTML's nuances. The latter is brittle for anything less well formed than XHTML.

mickeyp 4897 days ago

Sorry, but lxml with ETree will handle any amount of broken HTML you throw at it. Add in XPath and I find lxml to be a far superior, and more memory-efficient, option.

Source: former professional web scraper.

berlinbrown 4898 days ago

I didn't want to say it but yea I thought BeautifulSoup has way more development.

I wonder, if you disagree with him, whether he will unleash his wrath upon ye.

dsshimel 4898 days ago

I was stunned when I got the notification email to moderate him =D

aleyan 4898 days ago

I was surprised by his recommendation. I found lxml, with its weak XPath implementation, to be a poor tool for parsing XML. How can it be a good tool for parsing non-conforming HTML?

Perhaps my reply to Linus was snarky, but in my experience Beautiful Soup was easy to use for parsing HTML, whereas lxml for parsing XML was not. Granted, the XML I had to parse used namespaces, but there was no reason for it to be difficult, and there was no reason to document so poorly which XPath features are supported.
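
For anyone hitting the same namespace wall, a small sketch of the usual workaround: map your own prefixes to the namespace URIs and use those prefixes in the path. This uses the standard library's xml.etree.ElementTree rather than lxml (lxml's find/findall accept a similar namespaces mapping, and its xpath() takes a namespaces= keyword). The sample document and its URIs are invented.

```python
import xml.etree.ElementTree as ET

# Invented sample: a default namespace plus a second, prefixed one.
SAMPLE_XML = """<feed xmlns="http://example.com/ns/feed" xmlns:m="http://example.com/ns/meta">
  <entry><title>First</title><m:score>7</m:score></entry>
  <entry><title>Second</title><m:score>9</m:score></entry>
</feed>
"""

# Prefixes here are our own choice; only the URIs must match the document.
NS = {"f": "http://example.com/ns/feed", "m": "http://example.com/ns/meta"}

root = ET.fromstring(SAMPLE_XML)


def titles_and_scores(root):
    """Return (title, score) pairs for every entry in the feed."""
    rows = []
    for entry in root.findall("f:entry", NS):
        title = entry.findtext("f:title", namespaces=NS)
        score = entry.findtext("m:score", namespaces=NS)
        rows.append((title, int(score)))
    return rows


print(titles_and_scores(root))  # [('First', 7), ('Second', 9)]

# The common trap: an unprefixed path does not match namespaced elements.
print(root.findall("entry"))  # []
```

The second print is the part that usually causes the confusion: elements under a default namespace still need a prefix in the query, even though they have none in the document.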

hnriot 4898 days ago

No, you were right and he was wrong. Just because he made a copy of Unix doesn't mean he knows how to parse HTML with Python. Anyone that's done a lot of web scraping or (as in my case) crawling/indexing will see the merit of your response.

rosenjon 4898 days ago

Was that actually Linus Torvalds? I just assumed it was someone trolling as him. For web scraping, PhantomJS is much better than Beautiful Soup or any xml parsing library. Lots of stuff happens in JS these days, so you need programmatic access to the DOM to really grab data efficiently. In fact, lots of sites hide important values inside JS in order to thwart libraries that post-process html/xml like Beautiful Soup.
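
To make the "values hidden inside JS" point concrete, here is a rough sketch of the simplest case only: the value sits in the page source as a literal assigned in an inline script, so a regex plus a JSON parser can reach it without a browser. The variable name profileData and the page are invented, and this is a swapped-in shortcut, not PhantomJS; anything computed at run time would still need a real JS engine with DOM access.

```python
import json
import re

# Invented page: the data never appears in the HTML markup itself,
# only inside an inline script.
SAMPLE_PAGE = """
<html><body>
<div id="profile"></div>
<script>
  var profileData = {"username": "example_user", "match_pct": 87};
  render(profileData);
</script>
</body></html>
"""

# Assumes the literal is valid JSON and the assignment ends with "};".
ASSIGNMENT = re.compile(r"var\s+profileData\s*=\s*(\{.*?\})\s*;", re.DOTALL)


def extract_profile_data(page_html):
    """Return the object assigned to profileData, or None if absent."""
    match = ASSIGNMENT.search(page_html)
    if match is None:
        return None
    return json.loads(match.group(1))


print(extract_profile_data(SAMPLE_PAGE))
# {'username': 'example_user', 'match_pct': 87}
```

An HTML parser alone would see an empty div here, which is why this kind of page defeats tools that only walk the markup.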

berlinbrown 4898 days ago

It is interesting when you read a lot of his stuff. It seems that he absolutely did not want to worry about the "user" space. User software.

I always wondered, did he ever write user-level code? He did write 'git' eventually. But I wonder what else, what does he think of web application development, etc?

icelancer 4897 days ago

>But I wonder what else, what does he think of web application development, etc?

He doesn't.
