by fauigerzigerk 2943 days ago
>Anyone can go build a crawler and scrape the web the way Google scrapes it so they can compete with Google.

Unfortunately, that is not the case. Many paywalled sites will let Googlebot index their content but block other crawlers.
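
As a rough illustration of that asymmetry, here is a minimal sketch of one way such a policy could be expressed, assuming the site uses robots.txt rules. The rules, the `allowed` helper, the `ExampleBot` name, and the example.com URL are all hypothetical; real sites may instead block at the server level by user agent or IP address, and robots.txt is only advisory.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: Googlebot may fetch everything,
# every other crawler is asked to stay away.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""


def allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the given robots.txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for bot in ("Googlebot", "ExampleBot"):
        print(bot, allowed(bot, "https://example.com/article"))
```

Under rules like these, a new crawler that honors robots.txt cannot index the site at all, while Googlebot can.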

They may have good reasons for doing that in some cases, but as a consequence the level playing field you're talking about no longer exists.

Also, the purpose of using Google as part of some automated process is usually not to compete with Google's search engine, but to complete some specific and limited task.

I don't understand why Google does not have a general search API offering. I'm sure many people would happily pay for it.


I have had web crawlers from China crawl my site multiple times a day but never send me traffic. Same with Yandex. I like the Bing search engine, but often it does not like my site. If they don't send any traffic, why let them run up my AWS bill?

I understand that, but I think there are good reasons why we shouldn't always act in the narrowest sense of our self-interest (provided we have enough financial wiggle room).

A search monopoly is not good for website owners. It makes us very dependent on the whims of that monopolist.

If you block all crawlers that don't already have a large market share and send back a lot of traffic, you're killing any possibility for new competitors to get a foot in the door.

Also, you're killing any chance for something unexpected to happen, such as someone having a great idea based on crawled data that could change all our lives for the better without ever sending traffic to your site.

Now, I'm not telling you what you can and cannot afford. If crawlers cost me a ton of money that I don't have, I would certainly act exactly like you suggested.

> It makes us very dependent on the whims of that monopolist.

Very true. Only allow Google and you are helping them to build their monopoly. And if they have a full monopoly, they can do what they want, including asking you for money to be included in the search results.

I can't even imagine how many businesses would be ecstatic about the ability to do this. Might as well cut out the SEO middleman.

>Many paywalled sites will let Googlebot index their content but block other crawlers.

Doesn't that infringe upon Google's own rules? I always thought Google didn't like it when sites served its crawler content that's different from what users get when they follow Google's link.

That's why many paywalled sites give you a few free articles per month if you're coming from a Google search results page.

But it no longer works on all sites. Maybe the rule has been dropped now that paywalls are becoming more popular (with publishers, that is).