by andrew_eit 2105 days ago
I've actually had the opposite experience.

After 'scraping' some forums via their APIs for weeks, I ended up realising that the data and metadata given to me by the API was so restricted (e.g. providing 'recent comments' instead of all comments) that a pure vanilla web scraping approach became the preferred option.

I agree with all the points you mention about the shortcomings, though, and your argument is sound. My opinion just runs in the other direction: APIs come with an element of trust.


You both are right. APIs are extortion and scraping is fragile. I once amused myself by creating random invisible divs when generating a server-side HTML page. It made scraping the resulting files impossible, and made no difference to the look of the page.
Eh, you might be surprised how effective it is to just use regular expressions instead of something that parses the DOM. Usually there is something to key off of, and while regular expressions aren't good for parsing HTML, they still work just as well as they always have for matching text patterns, which is often what scraping ends up being.
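To make the regex point concrete, here is a minimal sketch, assuming a made-up page where the visible label "Price:" is the thing to key off. The markup, class names, and the `PRICE` pattern are all hypothetical; the idea is only that the pattern matches text and skips whatever tags sit in between, so reshuffled wrappers and decoy divs don't matter.

```python
import re

# Hypothetical markup: the wrapper divs and class names are noise a regex can ignore.
html = """
<div class="x9f2"><div style="display:none">decoy</div>
<span class="a81">Price:</span> <b>$19.99</b></div>
<div class="k3"><span class="zz0">Price:</span> <b>$5.00</b></div>
"""

# Key off the visible label "Price:" and skip any tags between it and the amount.
PRICE = re.compile(r"Price:\s*(?:<[^>]+>\s*)*\$(\d+\.\d{2})")

prices = PRICE.findall(html)
print(prices)
```

This is text matching, not HTML parsing, so it breaks as soon as the label or number format changes.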
Somewhere, an evil pony is twitching his ears.
A better approach is dynamically generated CSS, absolute positions, and randomized HTML output. In my SEO days, I wrote a tool to do this exact thing for a different reason.
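A rough sketch of what "dynamically generated CSS, absolute positions and randomized HTML output" could look like. This is not the tool described above, just an illustration under assumptions: `render_randomized` is a hypothetical helper, the row height of 24px is arbitrary, and labels and values are assumed to be safe to embed without escaping.

```python
import random

ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789"

def render_randomized(items, rng):
    """Render (label, value) rows with random class names, shuffled DOM order,
    and absolute positions that restore the intended visual order."""
    rules, nodes = [], []
    for row, (label, value) in enumerate(items):
        cls = "c" + "".join(rng.choices(ALPHABET, k=8))
        # The visual position comes from CSS, not from where the node sits in the DOM.
        rules.append(f".{cls}{{position:absolute;left:0;top:{row * 24}px}}")
        nodes.append(f'<div class="{cls}">{label}: {value}</div>')
    rng.shuffle(nodes)
    rng.shuffle(rules)
    return "<style>" + "".join(rules) + "</style>" + "".join(nodes)

items = [("Name", "Widget"), ("Price", "$19.99"), ("Stock", "12")]
page_a = render_randomized(items, random.Random(1))
page_b = render_randomized(items, random.Random(2))
print(page_a != page_b)
```

Two renders of the same data share no class names and may differ in node order, while a browser would lay them out identically.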

We had a network of sites that all looked exactly the same to the user, but to Google each site had a completely different structure. That kept the network safe for years, until a Google employee (or so we assumed, given the @google.com email) signed up for the service without us knowing and discovered the entire network by placing a large order, which gave him links across all of it. Within a week of that signup, our entire network of 10k domains was dead, and everything they linked to was delisted from Google. We had to shut down the network and refund all unused credits to our customers.

Is rule 1 of trying to game PageRank not to avoid Google email addresses?
I didn't say I was smart. Also, if you block @google.com emails, they will just sign up with another email address, so why even try to block them? And once they pay for a service, you can't not deliver, because that would constitute fraud.

We tried to just fly under their radar, and avoid any automated trigger that would arouse their suspicion.

And your conscience still allowed you to sleep at night?

All this seems like a nice illustration of how the web ecosystem encourages parasitic behavior on so many fronts. It's sad.

What do you mean? We sold links on websites filled with unique content we paid for...

We were masking those domains from Google because Google penalizes selling backlinks to justify paying for their ads. My conscience is quite clear. When Google delisted our network, we refunded our customers and moved on to a smaller invite-only network that ran well for years. I left that company 6 or 7 years ago, but I'm sure they are still making some money off hosting and managing private blog networks.

> And your conscience still allowed you to sleep at night?

The average American endorses slavery in their clothes, Christmas decorations and electronics. By comparison creating some bad links in a search engine is so low on my list of moral failings it doesn't even register.

> And once they pay for a service, you can't not deliver, because that would constitute fraud.

Underperform, and then offer them a goodwill refund if they ask for it?

SEO is already too snake-oily to do something like that. We had principles: we did our best for our customers and our clients. If we failed while trying our best, at least we could sleep knowing we weren't actively trying to screw over our paying customers.
That's so awesome lol
I'm surprised I don't see this discussed more in the context of web scraping, but XPath is not only much more powerful, but can also be made robust against such techniques.

Sure, if you change the page structure enough you could defeat it, but it would require more than just adding a few divs. XPath easily lets you mix and match matching against not just CSS classes, but also the page's structure itself, inner text, attributes, and so on. As a result, you can get some really powerful queries without having any kind of complex post-processing of the results.

XPath is one of the dinosaur technologies that I didn't learn until some time ago, and man, was it a great way to find the right element and pass it to the tool doing the traversing and other things. Being able to find a div that contains a string in a forest of divs was so damn nice.
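For anyone who hasn't tried it, a small sketch of mixing attributes, structure, and inner text in one query. It uses Python's standard `xml.etree.ElementTree`, which supports only a limited subset of XPath (no `contains()`, for instance; a full XPath 1.0 engine such as lxml would be needed for that) and requires well-formed markup, which real pages often aren't. The page fragment, the `data-id` attribute, and the class names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment with decoy divs mixed in.
page = ET.fromstring("""
<html><body>
  <div class="r4nd0m"><div class="q1">decoy</div></div>
  <div class="post" data-id="42">
    <div class="zz"><span>Author</span><a href="/user/alice">alice</a></div>
    <div class="body">Hello world</div>
  </div>
</body></html>
""")

# Match on an attribute plus structure plus text: any div carrying data-id,
# then a child div that has a span whose text is "Author", then the link in it.
author_links = page.findall(".//div[@data-id]/div[span='Author']/a")
authors = [(a.text, a.get("href")) for a in author_links]
print(authors)
```

Nothing in the query depends on the random-looking class names, so adding or renaming wrapper divs elsewhere on the page leaves it working.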
I doubt it was impossible. For example, the first step a scraper might perform is to delete all invisible divs.
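A minimal sketch of that first step, using only the standard library. It assumes the decoy divs are hidden with an inline `style` attribute; divs hidden through a stylesheet class would need the CSS resolved as well. `VisibleText` and `visible_text` are hypothetical names.

```python
import re
from html.parser import HTMLParser

HIDDEN = re.compile(r"display\s*:\s*none|visibility\s*:\s*hidden", re.I)

class VisibleText(HTMLParser):
    """Collect text, skipping any <div> hidden by an inline style (and its subtree)."""

    def __init__(self):
        super().__init__()
        self.stack = []    # one bool per open <div>: True if that div is hidden
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            style = dict(attrs).get("style") or ""
            self.stack.append(bool(HIDDEN.search(style)))

    def handle_endtag(self, tag):
        if tag == "div" and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        # Keep text only when no enclosing div is hidden.
        if not any(self.stack) and data.strip():
            self.chunks.append(data.strip())

def visible_text(html):
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)

sample = ('<div class="a">Real <div style="display: none">decoy '
          '<div>nested decoy</div></div>content</div>')
print(visible_text(sample))
```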
Random invisible divs aren't likely to defeat a moderately motivated scraper, though. Depending on what they're looking for, it could be as simple as getting the inner text of a sufficiently high-up element and matching a regex.

More complex scrape-defeating measures I've seen are blobs of JS that need evaluating in order to generate URL parameters, or captchas that need defeating. For the first, all that needs doing is to extract the JS and run it in a JS engine if you don't want to drive a headless browser (with care, of course!). For the second, just buy some deathbycaptcha API calls.

I've seen this approach backfire a bit too. Rather than having to scrape web content, my work is reduced to pulling out my favorite sandboxed JS interpreter bindings, running the snippet, and extracting the rich object they just created with exactly the data I wanted. You only need a headless browser if there's a meaningful interplay between the JS and the rest of the site.
My favorite is when they provide JSON structures of the data in the included page JavaScript. That's easy mode scraping. :)
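A sketch of why this is easy mode, assuming the page assigns a strict JSON literal to a global. The variable name `window.__DATA__` and the page itself are invented for the example; a JS object literal with unquoted keys would not parse this way.

```python
import json
import re

# Hypothetical page: the variable name window.__DATA__ is an assumption for illustration.
html = """
<html><body><div id="app"></div>
<script>
  window.__DATA__ = {"thread": {"id": 7, "comments": [{"user": "a", "text": "hi; there"}]}};
  render(window.__DATA__);
</script>
</body></html>
"""

def extract_embedded_json(page, marker=r"window\.__DATA__\s*=\s*"):
    """Find the assignment, then let the JSON decoder decide where the object ends."""
    match = re.search(marker, page)
    if match is None:
        return None
    # raw_decode parses one JSON value starting at the given index and ignores
    # whatever follows it, so no regex has to balance braces.
    obj, _end = json.JSONDecoder().raw_decode(page, match.end())
    return obj

data = extract_embedded_json(html)
print(data["thread"]["comments"][0]["text"])
```

Letting the decoder find the end of the object avoids the usual trap of a regex stopping at the first `;` or `}` inside a string.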
Haha, that'd be even better for sure.
I add a server-side delay for IPs that seem scrapey and throw heavy JavaScript at them to burn through their resources. So far, it seems to work well in some cases.
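One way the delay part might be sketched: a per-IP sliding-window counter that returns how long to stall the response. This is a guess at the general shape, not the setup described above; `ScrapeThrottle`, its thresholds, and the example address are all hypothetical, and a real server would also need to evict idle IPs.

```python
from collections import defaultdict, deque

class ScrapeThrottle:
    """Sliding-window request counter per IP; returns how long to stall a response."""

    def __init__(self, window=10.0, free_requests=5, delay_per_extra=0.5, max_delay=8.0):
        self.window = window
        self.free_requests = free_requests
        self.delay_per_extra = delay_per_extra
        self.max_delay = max_delay
        self.hits = defaultdict(deque)   # ip -> timestamps of recent requests

    def delay_for(self, ip, now):
        recent = self.hits[ip]
        recent.append(now)
        # Drop timestamps that have fallen out of the window.
        while recent and now - recent[0] > self.window:
            recent.popleft()
        extra = max(0, len(recent) - self.free_requests)
        return min(self.max_delay, extra * self.delay_per_extra)

throttle = ScrapeThrottle()
# Ten requests from one address within one second.
delays = [throttle.delay_for("203.0.113.7", now=i * 0.1) for i in range(10)]
print(delays)
```

The request handler would sleep for the returned number of seconds before responding, so ordinary visitors see no delay while a burst from one address gets progressively slower.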