About two years ago, some clever techies on the OkCupid subreddit "hacked" the OkCupid frontend. If I remember correctly, they used JavaScript to display the number of messages someone received per day and how many they replied to (among other things). Eventually OkCupid came across the info and started moving everything server-side.
The stoplight indicator and a few other things were done via client-side JS with non-obvious variable names. Once that Greasemonkey script came out, it was quickly moved to the server side, since there was no reason not to do it there. I believe it was only on the client side for ease of development when it was being built.
You could see how long ago the person received their last message, how many messages they were getting on average, and also, I think, some kind of scale from which you could extrapolate how frequently they wrote back to messages. It was a long time ago but incredibly interesting / invasive.
This method is well known, and there are ways of making it more difficult. For example, servers can add a hidden request "cookie": a random number embedded in the page that the user is coming from, which has to be sent back with the next request. This forces you to actually parse the page. Then they can move it into JavaScript, making it even more difficult.
I would just like to point out that not all sites do the following (and I'm unsure if OKC does), but watch out for tripwires, as chacham15 said. Some sites that I've had the misfortune of "getting to know" use insignificant or blank inputs as a way of detecting unauthorized access.
One rule I follow: Retrieve, Analyse and Regurgitate Everything.
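A rough sketch of what that rule can look like in practice, in Python with only the standard library. The idea is to collect every input the server sent (hidden token, blank fields and all) and send them back unchanged, overriding only the fields a real user would type into. The form, the field names (`request_token`, `website`) and the helper names are all made up for illustration; a real site will differ.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class FormFieldCollector(HTMLParser):
    """Collect every <input> name/value pair, including hidden and blank ones."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        name = attr.get("name")
        if name:
            # Keep the server-supplied default even when it is empty;
            # a blank field may be a tripwire that must stay blank.
            self.fields[name] = attr.get("value") or ""


def build_post_body(page_html, overrides):
    """Echo back all fields from the page, changing only the overrides."""
    collector = FormFieldCollector()
    collector.feed(page_html)
    fields = dict(collector.fields)
    fields.update(overrides)
    return urlencode(fields)


# Hypothetical login form; none of these names come from a real site.
SAMPLE_PAGE = """
<form action="/login" method="post">
  <input type="hidden" name="request_token" value="8675309">
  <input type="text" name="username">
  <input type="password" name="password">
  <input type="text" name="website" value="" style="display:none">
</form>
"""

body = build_post_body(SAMPLE_PAGE, {"username": "alice", "password": "s3cret"})
print(body)
```

The point of the design is that the scraper never decides which fields matter: the random token and the invisible `website` field go back exactly as they arrived, so neither a missing token nor a filled-in tripwire gives you away.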
The most universally effective solution I've seen is Sikuli running in a VM (which is the only sane way to run it, since it hijacks your input devices). Everything else fails in some edge case. What other tool can scrape an interface that uses both HTML and Flash and is only served over HTTPS?
It is brittle, in that it can be broken by cosmetic UI changes, but the maintenance is generally trivial. Also, it's slow as all hell. But sometimes you really need that sledgehammer.
He does, but sadly his comment is way off. Anyone that's done any amount of HTML scraping will use Beautiful Soup over lxml. The former is easier and more tolerant of HTML's nuances; the latter is brittle for anything less well-formed than XHTML.
Sorry, but lxml with ETree will handle any amount of broken HTML you throw at it. Add in XPath and I find lxml to be a far superior, and more memory-efficient, option.
I was surprised by his recommendation. I found lxml, with its weak XPath implementation, to be a poor tool for parsing XML. How can it be a good tool for parsing non-conforming HTML?
Perhaps my reply to Linus was snarky, but in my experience Beautiful Soup was easy to use for parsing HTML, whereas lxml for parsing XML was not. Granted, the XML I had to parse used namespaces, but there was no reason for that to be difficult, and no excuse for poorly documenting which XPath features are supported.
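For what it's worth, the namespace pain usually comes down to one thing: XPath 1.0 has no notion of a default namespace, so every namespaced element needs a prefix that you bind yourself through an explicit mapping. A minimal sketch of that, using the stdlib `xml.etree.ElementTree` rather than lxml so it runs anywhere (lxml's `xpath()` accepts a similar `namespaces` mapping). The document and the namespace URIs are made up:

```python
import xml.etree.ElementTree as ET

# Made-up document; both namespace URIs are placeholders.
SAMPLE_XML = """<?xml version="1.0"?>
<feed xmlns="http://example.com/ns/feed" xmlns:m="http://example.com/ns/meta">
  <entry><title>First</title><m:score>7</m:score></entry>
  <entry><title>Second</title><m:score>9</m:score></entry>
</feed>
"""

# The prefixes here are local to the query; they do not have to match
# the prefixes used in the document, only the URIs do.
NS = {"f": "http://example.com/ns/feed", "m": "http://example.com/ns/meta"}

root = ET.fromstring(SAMPLE_XML)

# The common mistake: an unprefixed name matches nothing, because the
# elements actually live in the default namespace.
unqualified = root.findall("entry")

# With the mapping, the same query works.
rows = [
    (
        entry.findtext("f:title", namespaces=NS),
        int(entry.findtext("m:score", namespaces=NS)),
    )
    for entry in root.findall("f:entry", NS)
]

print(unqualified)
print(rows)
```

Not saying this excuses thin documentation, just that the empty result from the unprefixed query is the usual first stumbling block.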
No, you were right and he was wrong. Just because he made a copy of Unix doesn't mean he knows how to parse HTML with Python. Anyone that's done a lot of web scraping or (as in my case) crawling/indexing will see the merit of your response.
Was that actually Linus Torvalds? I just assumed it was someone trolling as him. For web scraping, PhantomJS is much better than Beautiful Soup or any XML parsing library. Lots of stuff happens in JS these days, so you need programmatic access to the DOM to really grab data efficiently. In fact, lots of sites hide important values inside JS in order to thwart libraries that post-process HTML/XML, like Beautiful Soup.
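For the simple case where the value is just sitting in an inline script as a literal, you can sometimes get away without a browser. A rough sketch in Python (stdlib only): pull the text of each `<script>` block, then regex out a `var NAME = {...};` assignment and parse it as JSON. The variable name `profileData`, the page, and the helper names are made up, and the regex assumes the object is valid JSON with no `};` inside it. It breaks as soon as the value is computed at runtime, which is exactly where something like PhantomJS earns its keep.

```python
import json
import re
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collect the raw text of every <script> element on the page."""

    def __init__(self):
        super().__init__()
        self._in_script = False
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        if self._in_script:
            self.scripts[-1] += data


def extract_js_object(page_html, var_name):
    """Return the object literal assigned to `var_name`, or None if absent."""
    collector = ScriptCollector()
    collector.feed(page_html)
    pattern = re.compile(
        r"var\s+" + re.escape(var_name) + r"\s*=\s*(\{.*?\})\s*;",
        re.DOTALL,
    )
    for script in collector.scripts:
        match = pattern.search(script)
        if match:
            return json.loads(match.group(1))
    return None


# Hypothetical page; the data is invented for the example.
SAMPLE_PAGE = """
<html><body>
<div id="profile"></div>
<script>
  var profileData = {"user_id": 42, "reply_rate": "often"};
  render(profileData);
</script>
</body></html>
"""

data = extract_js_object(SAMPLE_PAGE, "profileData")
print(data)
```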
It is interesting when you read a lot of his stuff. It seems that he absolutely did not want to worry about "user" space, i.e. user software.
I always wondered: did he ever write user-level code? He did write git eventually. But I wonder what else, and what does he think of web application development, etc.?
It's a clever and IMNSHO insufficiently copied architecture with interesting performance and security characteristics.
[0] https://github.com/okws/okws
[1] http://news.ycombinator.com/item?id=2077484