Hacker News
by prezjordan 4934 days ago
This should be stressed: sites like Facebook do exactly this. Constant changes mean constantly updating your scraper. And when it comes to A/B testing, your scraper needs to intelligently find the data, which might not always be in the same place.

Sidenote: I wonder if any webapps use randomly generated IDs and class names (linked in the CSS) to prevent scraping. I guess this would be a caching nightmare, though.

4 comments

> I wonder if any webapps use randomly generated IDs and class names (linked in the CSS) to prevent scraping.

In my spare time, I've been playing around with "scrapers" (I like to call them web browsers, personally) that don't even look at markup.

My first attempt used a short list of heuristics that proved eerily successful for what I was after. I could throw random websites with similar content (discussion sites, like HN) but vastly dissimilar structures at it, and it would return what I expected about 70% of the time in my tests.

After that, I started introducing some machine learning in an attempt to replicate how I determine what blocks are meaningful. My quick prototype showed mixed results, but worked well enough that I feel with some tweaking it could be quite powerful. Sadly, I've become busy with other things and haven't had time to revisit it.

With that, swapping variables and similar techniques to thwart crawlers seems like it would be easily circumvented.
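
The comment above does not say which heuristics were used, so the following is only a guess at what a markup-agnostic rule could look like. It is a minimal sketch, assuming a "text density" heuristic: score each block by how much non-link text it holds and ignore IDs and class names entirely. The names `BlockScorer`, `BLOCK_TAGS`, and `best_block` are made up for illustration, and the parser assumes well-nested tags.

```python
from html.parser import HTMLParser

# Assumed set of tags that count as candidate content blocks.
BLOCK_TAGS = {"div", "p", "td", "li", "article", "section"}

class BlockScorer(HTMLParser):
    """Scores blocks by non-link text length; never reads id/class."""

    def __init__(self):
        super().__init__()
        self.stack = []    # open blocks: [text_chars, link_chars, text_parts]
        self.blocks = []   # finished blocks as (score, text)
        self.in_link = 0   # depth of currently open <a> tags

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.stack.append([0, 0, []])
        elif tag == "a":
            self.in_link += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.in_link:
            self.in_link -= 1
        elif tag in BLOCK_TAGS and self.stack:
            text_chars, link_chars, parts = self.stack.pop()
            # Score = characters of text that are not inside links.
            self.blocks.append((text_chars - link_chars, " ".join(parts)))

    def handle_data(self, data):
        data = data.strip()
        if not data or not self.stack:
            return
        block = self.stack[-1]  # text is credited to the innermost open block
        block[0] += len(data)
        if self.in_link:
            block[1] += len(data)
        block[2].append(data)

def best_block(html):
    """Return the text of the highest-scoring block, or '' if none."""
    parser = BlockScorer()
    parser.feed(html)
    parser.close()
    if not parser.blocks:
        return ""
    return max(parser.blocks, key=lambda b: b[0])[1]
```

Because nothing here looks at attribute values, renaming or randomizing classes does not change the result; a navigation bar made only of links scores zero regardless of what it is called.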

I would be really interested in knowing which heuristics or machine learning techniques produced decent results. That's if I can't convince you to open source the code. I'm working on the same problem at the moment.

What about something like http://tubes.io?

We're fine with scrapers and scraping infrastructure, although tubes.io is a very interesting idea.

I'm more interested in what I can do to write fewer scrapers, since the content is, at a high level, relatively similar. I've just started with experiments writing "generic" scrapers that try to extract the data without depending on markup. It will eventually work well enough, but getting the error rate down to an acceptable level is going to take a lot of tweaking and trial and error.

There are a few papers on this, but not much out there. That's why I was interested in someone else working on the same problem in a different space.

> Sidenote: I wonder if any webapps use randomly generated IDs and class names (linked in the CSS) to prevent scraping. I guess this would be a caching nightmare, though.

These guys do a stellar job on the IP addresses: http://www.hidemyass.com/proxy-list -- the good thing is the data is available for an amazing price.

Other sites I have come across will use large images and CSS sprites to mask price data.

I write a lot of scrapers for fun, rarely for profit, just for the buzz.

I bet you would only need to randomly shuffle between a few alternatives for all of them. You'd need a dedicated effort to work that one out, and the cache implications could be managed. There's no getting around the trade-off between the number of possible page alternatives and cache nightmare-ness though, and doing that to JSON APIs would get ugly fast.
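
To make the "few alternatives" idea concrete, here is a rough sketch, not anything a real site is known to run. It assumes a hypothetical alias table (`ALIASES`) with a fixed number of page variants (`N_VARIANTS`), so a cache only ever has to hold that many copies of a page. All names and class strings are invented for illustration.

```python
import html
import random

# Hypothetical alias table: each logical class has a few interchangeable names.
ALIASES = {
    "price": ["a1x", "k9q", "zz3"],
    "title": ["b7m", "p0d", "w4e"],
}
N_VARIANTS = 3  # bounded variant count keeps the cache to N_VARIANTS copies

def variant_mapping(variant):
    """Map logical class names to the aliases used by one page variant."""
    return {logical: names[variant % len(names)]
            for logical, names in ALIASES.items()}

def render(variant, title, price):
    """Render markup and matching CSS for a given variant number."""
    cls = variant_mapping(variant)
    css = f".{cls['title']}{{font-weight:bold}} .{cls['price']}{{color:green}}"
    return (
        f"<style>{css}</style>"
        f'<h2 class="{cls["title"]}">{html.escape(title)}</h2>'
        f'<span class="{cls["price"]}">{html.escape(price)}</span>'
    )

def render_random(title, price, rng=random):
    """Pick one of the N_VARIANTS page variants at random."""
    return render(rng.randrange(N_VARIANTS), title, price)
```

The variant number can double as a cache key, which is the trade-off described above: more variants make class-based scraping more annoying but multiply the cached copies.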

At least it's easier to code these tricks than to patch a scraper to get around them.

Yes, Facebook used to do that. I had to scrape it once and was surprised by randomly changing classes around input fields.

But who cares, no one can beat XPath :)
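
For anyone wondering what that looks like in practice, here is a small sketch using the limited XPath subset in Python's standard `xml.etree.ElementTree`. It assumes well-formed markup (a real page would typically need an HTML-tolerant parser) and a made-up form layout; `find_inputs` is a hypothetical helper. The path selects inputs by tree position and by the `name` attribute, so the class values never matter.

```python
import xml.etree.ElementTree as ET

def find_inputs(xhtml):
    """Return the name of every <input> nested as form/div/input."""
    root = ET.fromstring(xhtml)
    # Structural path plus an attribute-presence predicate; no class names used.
    return [el.get("name") for el in root.findall(".//form/div/input[@name]")]

# Two renderings of the same (invented) form with different random classes.
page_a = ('<body><form><div class="x1"><input class="r4nd" name="email"/></div>'
          '<div class="x2"><input class="q8" name="pass"/></div></form></body>')
page_b = ('<body><form><div class="m7"><input class="zz" name="email"/></div>'
          '<div class="k3"><input class="pp" name="pass"/></div></form></body>')
```

Both `find_inputs(page_a)` and `find_inputs(page_b)` pick out the same two fields, since only the classes differ between the pages.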