by randomdata
4934 days ago
I wonder if any webapps use randomly generated IDs and class names (linked in the CSS) to prevent scraping. In my spare time, I've been playing around with "scrapers" (I like to call them web browsers, personally) that don't even look at markup. My first attempt used a short list of heuristics that proved eerily successful for what I was after, to the point that I could throw random websites with similar content (discussion sites, like HN) but vastly dissimilar structures at it, and it would return what I expected, I'd say, about 70% of the time in my tests. After that, I started introducing some machine learning in an attempt to replicate how I decide which blocks are meaningful. My quick prototype showed mixed results, but it worked well enough that I feel, with some tweaking, it could be quite powerful. Sadly, I've become busy with other things and haven't had time to revisit it. Given all that, swapping variables and similar techniques to thwart crawlers seem like they would be easily circumvented.
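To give a rough idea of what a markup-free heuristic pass could look like, here is a minimal sketch. It is not the code described above: the function names, the scoring rules, and the threshold are all made up for illustration. It assumes you already have the rendered text of a page, with visual blocks separated by blank lines, and it never inspects tags, IDs, or class names.

```python
import re


def split_blocks(text):
    """Split rendered page text into blocks separated by blank lines."""
    return [b.strip() for b in re.split(r"\n\s*\n", text) if b.strip()]


def score_block(block):
    """Score how much a block reads like prose (hypothetical heuristics)."""
    words = block.split()
    if not words:
        return 0.0
    # Count sentence-ending punctuation followed by whitespace or end of block.
    sentence_marks = len(re.findall(r"[.!?](?:\s|$)", block))
    avg_word_len = sum(len(w) for w in words) / len(words)

    score = 0.0
    score += min(len(words), 50) / 50    # longer blocks look like comments
    score += min(sentence_marks, 3) / 3  # sentence punctuation suggests prose
    if avg_word_len > 12:                # very long "words" are often URLs
        score -= 1.0
    return score


def meaningful_blocks(text, threshold=1.0):
    """Keep only blocks whose score reaches the (arbitrary) threshold."""
    return [b for b in split_blocks(text) if score_block(b) >= threshold]


if __name__ == "__main__":
    page = (
        "Home | New | Login\n\n"
        "I wonder if any webapps randomize their class names. "
        "It seems easy to get around. Has anyone tried it?\n\n"
        "reply | flag"
    )
    for block in meaningful_blocks(page):
        print(block)
```

Since nothing here depends on element names, randomizing IDs and classes wouldn't change what it returns. A learned model could replace `score_block` while keeping the same block-in, score-out shape.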