HN Mirror


by randomdata 4982 days ago

I wonder if any webapps use randomly generated IDs and class names (linked in the CSS) to prevent scraping.

In my spare time, I've been playing around with "scrapers" (I like to call them web browsers, personally) that don't even look at markup.

My first attempt used a short list of heuristics that proved eerily successful for what I was after. I could throw random websites with similar content (discussion sites, like HN) but vastly dissimilar structures at it, and it would return what I expected about 70% of the time in my tests.
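
The post does not say which heuristics were used, so the following is only a guess at the general shape of such an approach: split the page into text blocks, ignore `id` and `class` attributes entirely, and keep blocks that are long enough and not dominated by link text. The names `BlockCollector` and `meaningful_blocks` and the thresholds `min_chars` and `max_link_density` are made up for this sketch.

```python
from html.parser import HTMLParser

# Tags whose text is never reader-facing content.
SKIP_TAGS = {"script", "style"}
# Tags treated as block boundaries; their attributes (id, class) are never read.
BLOCK_TAGS = {"p", "div", "td", "li", "article", "section", "blockquote"}


def visible_len(text):
    """Number of non-whitespace characters in text."""
    return len("".join(text.split()))


class BlockCollector(HTMLParser):
    """Splits a page into text blocks and counts how much of each is link text."""

    def __init__(self):
        super().__init__()
        self.blocks = []  # list of (text, link_chars)
        self._parts = []
        self._link_chars = 0
        self._in_link = 0
        self._skip = 0

    def flush(self):
        text = " ".join("".join(self._parts).split())
        if text:
            self.blocks.append((text, self._link_chars))
        self._parts = []
        self._link_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip:
            self._skip -= 1
        elif tag == "a" and self._in_link:
            self._in_link -= 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if self._skip:
            return
        self._parts.append(data)
        if self._in_link:
            self._link_chars += visible_len(data)


def meaningful_blocks(html, min_chars=40, max_link_density=0.3):
    """Return blocks that are long enough and mostly not link text (assumed thresholds)."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    parser.flush()
    kept = []
    for text, link_chars in parser.blocks:
        total = visible_len(text)
        if total >= min_chars and link_chars / total <= max_link_density:
            kept.append(text)
    return kept
```

Because nothing here reads attribute values, renaming every `class` or `id` on the page leaves the result unchanged.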

After that, I started introducing some machine learning in an attempt to replicate how I determine which blocks are meaningful. My quick prototype showed mixed results, but worked well enough that I feel with some tweaking it could be quite powerful. Sadly, I've become busy with other things and haven't had time to revisit it.
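
The post gives no detail on the learning approach, so this is just one simple possibility, not the prototype described: a nearest-centroid classifier trained on hand-labeled blocks. The two features (log of visible length, link density) and all function names are assumptions chosen for illustration. It needs Python 3.8+ for `math.dist`.

```python
import math


def features(text, link_chars):
    """Two hypothetical features: log of visible length, and link density."""
    visible = len("".join(text.split()))
    density = link_chars / visible if visible else 1.0
    return (math.log1p(visible), density)


def centroid(rows):
    """Component-wise mean of a non-empty list of feature tuples."""
    n = len(rows)
    return tuple(sum(col) / n for col in zip(*rows))


def train(labeled):
    """labeled: list of ((text, link_chars), is_meaningful) pairs, both classes present."""
    pos = [features(*block) for block, label in labeled if label]
    neg = [features(*block) for block, label in labeled if not label]
    return centroid(pos), centroid(neg)


def is_meaningful(model, text, link_chars):
    """Classify a block by whichever class centroid is closer."""
    pos, neg = model
    x = features(text, link_chars)
    return math.dist(x, pos) < math.dist(x, neg)


# Tiny hand-labeled training set (made-up examples).
labeled = [
    (("A long comment that explains the approach in a couple of sentences.", 0), True),
    (("Another reply with enough prose to count as real discussion content.", 0), True),
    (("new | ask | show | jobs", 14), False),
    (("link", 4), False),
]
model = train(labeled)
```

A real version would need far more labeled blocks and probably richer features; the point is only that the labels come from a person marking which blocks matter, not from the markup.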

Given that, swapping variable names and similar techniques for thwarting crawlers seem like they would be easily circumvented.

1 comment

freshhawk 4982 days ago

I would be really interested in knowing which heuristics or machine learning techniques produced decent results. That's if I can't convince you to open source the code. I'm working on the same problem at the moment.


rohamg 4981 days ago

What about something like http://tubes.io?


freshhawk 4981 days ago

We're fine with scrapers and scraping infrastructure, although tubes.io is a very interesting idea.

I'm more interested in what I can do to write fewer scrapers, since the content is, at a high level, relatively similar. I've just started experimenting with "generic" scrapers that try to extract the data without depending on markup. It will eventually work well enough, but getting the error rate down to an acceptable level is going to take a lot of tweaking and trial and error.
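
One way to make that tweaking loop measurable, sketched under the assumption that a handful of pages have been labeled by hand with the blocks that should come out: score any candidate extractor by precision and recall over those pages. `score_extractor` and `toy_extract` are hypothetical names, and the toy extractor is only a stand-in so the example runs.

```python
def score_extractor(extract, labeled_pages):
    """Return (precision, recall) of extract(html) against hand-labeled blocks.

    labeled_pages: list of (html, expected_blocks) pairs.
    """
    tp = fp = fn = 0
    for html, expected in labeled_pages:
        predicted = set(extract(html))
        wanted = set(expected)
        tp += len(predicted & wanted)   # blocks found and wanted
        fp += len(predicted - wanted)   # blocks found but not wanted
        fn += len(wanted - predicted)   # blocks wanted but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def toy_extract(html):
    """Stand-in extractor: keep any line with at least 20 characters."""
    return [line.strip() for line in html.splitlines() if len(line.strip()) >= 20]


labeled_pages = [
    ("nav | links\nThis is a real comment body.\n", ["This is a real comment body."]),
    ("A short one\nfooter text goes here, copyright\n", ["A short one"]),
]
precision, recall = score_extractor(toy_extract, labeled_pages)
```

Rerunning this after each change to a heuristic shows whether the change helped across all labeled sites or just the one being stared at.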

There are a few papers on this, but not much else out there. That's why I was interested in someone else working on the same problem in a different space.
