by learc83 2944 days ago
A company I consulted for was using a paid API to handle search.

Even though the entire site was available in an easy-to-scrape XML format, scrapers kept using the search feature.

They were trying very hard to overcome my countermeasures--they had a seemingly limitless pool of IPs, they were rotating user agent strings, and they tried to randomize search behavior.

Every time I implemented a new countermeasure, they'd try to find a way around it. It was maddening because we made everything available to them through the XML feed. They just wouldn't use it.

6 comments

An explicit message to use that feed as part of the countermeasures might be useful. Did/do you do this?
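To make the suggestion concrete, here is a minimal sketch of what such a block response could look like. Everything in it is an assumption for illustration: the feed URL is a placeholder, `blocked_response` is a hypothetical helper, and the thread does not say which status code or headers the site actually used.

```python
# Hypothetical sketch: the URL, function name, and status code are
# illustrative assumptions, not details from the thread.
FEED_URL = "https://example.com/feed.xml"  # placeholder feed location


def blocked_response(feed_url: str = FEED_URL, retry_after_seconds: int = 3600):
    """Build a (status, headers, body) tuple for a client flagged as a scraper."""
    body = (
        "Automated search requests are blocked.\n"
        f"The full dataset is available as XML at {feed_url}\n"
    )
    headers = {
        "Content-Type": "text/plain; charset=utf-8",
        # Tells well-behaved clients how long to back off.
        "Retry-After": str(retry_after_seconds),
        # Advertises the feed in a machine-readable way as well.
        "Link": f'<{feed_url}>; rel="alternate"; type="application/xml"',
    }
    return 429, headers, body
```

The point is only that the message reaches both a human reading the response body and a tool inspecting headers; whether a scraper's operator ever looks at either is another matter.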
That is kinda sad to hear. The approach should always be to take the path of least resistance and smallest effect on the website. So for example, if a company has an API that can be used instead of scraping their website, then it's always preferred to use the API. Same would go for the XML you mentioned.

It's bad that not everyone works like this; there are quite a lot of people who would rather brute-force a solution than think about it.

The path of least resistance for the bots appears to be that they have a tool that scrapes search results, and nothing to talk to an API.
Interestingly, XML is not easy to parse with a lot of these scraping tools that rely on JavaScript... sure, the tools can easily parse HTML and convert it to JSON or CSV, but taking XML in an arbitrary format and doing the same is rather difficult.

It may have been better to just publish the site in HTML format with an easy-to-find link on the front page to access it.
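For comparison, here is a rough sketch of the XML-to-CSV step using only Python's standard library. The feed shape (flat `<item>` records under a root element) and the names `SAMPLE_FEED` and `feed_to_csv` are assumptions for illustration; the thread never describes the real feed's schema.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Made-up sample; the real feed's structure is not given in the thread.
SAMPLE_FEED = """<catalog>
  <item><id>1</id><title>First</title><price>9.99</price></item>
  <item><id>2</id><title>Second</title></item>
</catalog>"""


def feed_to_csv(xml_text: str, record_tag: str = "item") -> str:
    """Flatten each <record_tag> element's direct children into one CSV row."""
    root = ET.fromstring(xml_text)
    records = [
        {child.tag: (child.text or "").strip() for child in rec}
        for rec in root.iter(record_tag)
    ]
    # Collect column names in first-seen order so the header is stable.
    columns = []
    for rec in records:
        for key in rec:
            if key not in columns:
                columns.append(key)
    out = io.StringIO()
    # restval fills in fields that a given record is missing.
    writer = csv.DictWriter(out, fieldnames=columns, restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return out.getvalue()
```

This only covers flat records. Nested elements, attributes, and namespaces each need their own mapping decisions, which is presumably where generic scraping tools struggle.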

We have kind of the same thing. All the data is in an API which costs less than a cheap VPS, and we display messages to them about using the API if they get blocked, but they just come back with a new IP every time.
You had a paid API, and people wanted the information for free....

Not unexpected I guess.

After searching "algolia" mentioned below, I figured out the misunderstanding. The company was paying somebody else per search made on their web site. So every time a scraper called the website's search function, it cost the website money.
You've misunderstood: the site used something like Algolia (which the site paid for) to index. The scrapers were hitting that service (which was costing the site) rather than parsing the XML (which already had everything).
The paid API was just to handle the search of the company database (couldn't change that for political reasons).

They weren't getting any information that they couldn't get through the XML.

Maybe it was sabotage.