Hacker News
by userbinator 2945 days ago
For example, for google.com, you can typically make only around 300 requests per day, and if you reach this limit, you will see a CAPTCHA instead of search results.

300 is pretty easy to achieve if you're "Googling hard enough" (make 5 slightly different queries, go through the 20 pages of results it's willing to show you, repeat 3 times...), and I've seen it trigger far before that if you are searching for more obscure things. It seems almost hostile to those searching for IC part numbers, specific and very exact phrases, and just "non mainstream" content in general.

How sad it is then, that we are told and have internalised the notion that we should use search engines like Google to find things, and yet it prevents us from "trying too hard" to find what we're looking for...

8 comments

From my experience at blekko, 99.9% of the "people" who go deep into the results pages for a single query are actually bots. You're a very unusual user, and there are a lot of bots.
There's a difference between going deep into the results, and progressively refining a query. The former is pretty indicative of bot behavior -- humans rarely go past even the first page of results. I do the latter all the time, and this frequently gets me Google's captcha, especially if I'm doing something like using site: and inurl: operators.
I've managed to trigger Google's bot detector too while trying to find documentation for a certain bank API (legitimate reasons, we were supposed to integrate and their docs didn't make sense).
I run up against this all the time. My browser is fast from blocking all third party trackers and scripts. My searches are faster, so it thinks they're automated.
Use DDG. It's fine for most things. Use Google as a fallback.
>Use Google as a fallback.

To do this from DDG, prefix your query with !g to search Google.

Minor FYI, the bang doesn't have to come at the beginning of the query. When refining a search to fall back to Google, you can save the keystrokes and just append it to the end of the query.
This is major, thx.
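
A minimal sketch of the bang trick described above, assuming the usual `https://duckduckgo.com/?q=` query URL. The helper name `ddg_search_url` and the sample query are made up for illustration; the bang can go at either end of the query.

```python
from urllib.parse import urlencode


def ddg_search_url(query: str, bang: str = "", bang_first: bool = True) -> str:
    """Build a DuckDuckGo search URL, optionally adding a bang such as "!g".

    Hypothetical helper: only the ?q= parameter is used.
    """
    parts = [bang, query] if bang_first else [query, bang]
    q = " ".join(p for p in parts if p)
    return "https://duckduckgo.com/?" + urlencode({"q": q})


# Bang at the start, then the same query with the bang appended instead.
print(ddg_search_url("LM317 datasheet", "!g"))
print(ddg_search_url("LM317 datasheet", "!g", bang_first=False))
```

Both URLs carry the same bang, so appending `!g` while refining a query should be equivalent to retyping it in front.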
Yes it is, and since it's IP based, it's even easier if you are for example working from an office and there are multiple people using google.

But that is why they only show reCAPTCHA: you fill it in and you get an exemption cookie for 30 more requests :D

> But that is why they only show reCAPTCHA: you fill it in and you get an exemption cookie for 30 more requests :D

Does that actually work? Whenever I have been searching for some obscure things and managed to get the captcha after 6-10 pages, it just goes into a loop where it keeps showing the captcha over and over. It does stop if I change the search terms, though.

With a VPN on Brave on iOS, Google will only show me infinite captchas.
Instagram is the worst I have come across. If you are on a page with 1000+ pictures trying to find something near the bottom, you have to let it load each new group sequentially, then after a while it starts timing you out for 60 seconds or longer every couple of times you load more. God forbid you accidentally navigate away while scrolling, because then you have to start all over again from the top.

Due to recent events it seems they got scared, locked down their API, then tightened down the request limit to prevent scraping to the point it is hardly usable on desktop anyway.

LinkedIn is even worse, imo. Go to any random company page and it'll show you a page asking you to log in. Refresh it and it'll show you the page without asking you to log in.
That is probably because the LinkedIn Authwall algorithm is being A/B tested. They do lots of machine learning to stop bots, but I would say scraping LinkedIn is mostly impossible, even at small volumes.
This sounds like an issue that is specific to Javascript-controlled browsers. If using a traditional, non-Javascript tcp/tls/http client it is trivial to extract the image urls and other information from the page using a single HTTP request (and from each successive page using more HTTP requests in a single connection, if "has_next_page" is "true"). No "API" needed. Can you provide an example of a single page with 1000+ images?
https://www.instagram.com/ryuji513

It looks like it just hits https://www.instagram.com/graphql/query/... every time you scroll down so if you scroll too fast it just hammers it and throttles your requests to that endpoint.

1. Fetch 1st page.

Note id of user (e.g. 1954202703). This is the "id": value in the url.

Note end_cursor. This is used for the "after": value in the url.

Note rhx_gis. This is used to create the "X-Instagram-GIS:" header.

Looking at archive.org, it seems as recently as last year, end_cursor was once all that was needed.

2. Fetch js from ProfilePageContainer url in 1st page (e.g., https://www.instagram.com/static/bundles/base/ProfilePageCon...)

Note queryId (e.g. 42323d64886122307be10013ad2dcc44)

This is used for "query_hash" in the url.

3. Create header "X-Instagram-GIS:"

Apparently this is some MD5 hash of rhx_gis and the query string variables according to this source:

https://www.diggernaut.com/blog/how-to-scrape-pages-infinite...

However a little experimentation revealed that generation of rhx_gis or this hash must also incorporate the user-agent string -- change any character in the user-agent string and the request will fail.

They also put IP address and a Unix time value in a cookie but the cookie can be deleted and the request still succeeds.

For example the final url for the first 12 photos is:

https://www.instagram.com/graphql/query/?query_hash=42323d64...

Overall, seems not too much work for someone who really wants to automate retrieval of Instagram photos. These requests for successive groups of 12 can be RFC 2616 pipelined over a single connection. Not long ago and for some number of years, it was even easier (e.g. just use end_cursor value as "max_id" in url).
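
A rough sketch of the paging loop the steps above describe, with the network call stubbed out so it runs offline. The "id" and "after" keys, query_hash, end_cursor and has_next_page come from the thread; the "first" key for the group size, the shape of the page dict, and all function names are assumptions for illustration, and the real endpoint may have changed since.

```python
import json
from urllib.parse import urlencode, urlsplit, parse_qs

GRAPHQL_URL = "https://www.instagram.com/graphql/query/"


def build_variables(user_id, end_cursor=None, page_size=12):
    # "first" is an assumed key name for the group size of 12.
    variables = {"id": user_id, "first": page_size}
    if end_cursor:
        variables["after"] = end_cursor
    return json.dumps(variables, separators=(",", ":"))


def build_query_url(query_hash, variables):
    return GRAPHQL_URL + "?" + urlencode(
        {"query_hash": query_hash, "variables": variables})


def iter_pages(fetch_page, query_hash, user_id):
    """Yield one group of items per request until has_next_page is false.

    fetch_page(url) is supplied by the caller and returns a dict with
    'items', 'has_next_page' and 'end_cursor' (hypothetical shape).
    """
    cursor = None
    while True:
        url = build_query_url(query_hash, build_variables(user_id, cursor))
        page = fetch_page(url)
        yield page["items"]
        if not page["has_next_page"]:
            break
        cursor = page["end_cursor"]


# Stub standing in for the HTTP request: cursor -> (items, next cursor).
FAKE_PAGES = {None: (["a", "b"], "c1"), "c1": (["c", "d"], "c2"), "c2": (["e"], None)}


def fake_fetch(url):
    variables = json.loads(parse_qs(urlsplit(url).query)["variables"][0])
    items, next_cursor = FAKE_PAGES[variables.get("after")]
    return {"items": items, "has_next_page": next_cursor is not None,
            "end_cursor": next_cursor}


pages = list(iter_pages(fake_fetch, "42323d64886122307be10013ad2dcc44", "1954202703"))
print(pages)
```

A real fetcher would also have to send the "X-Instagram-GIS:" header from step 3 with the same user-agent string it used for the first page.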

They recently removed the User-Agent and CSRF token from the signature creation process. Right now only the rhx_gis parameter and the URL-decoded variables from the query string are used to generate the MD5 signature. However, your findings about user agents look interesting. I assume they may use the user agent to generate rhx_gis. That could explain why auth doesn't work if you change a single character in the user agent.
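
For illustration, a sketch of what that signature might look like. The thread only says it is an MD5 over rhx_gis and the URL-decoded variables; joining the two with a colon is an assumption based on how the scheme was commonly described at the time, and the rhx_gis value and the "first" key below are placeholders, not real values.

```python
import hashlib
import json


def instagram_gis(rhx_gis: str, variables: str) -> str:
    # Assumed scheme: MD5 hex digest of "<rhx_gis>:<variables>".
    # The ":" separator is a guess, not confirmed by the thread.
    return hashlib.md5(f"{rhx_gis}:{variables}".encode("utf-8")).hexdigest()


# Placeholder rhx_gis; the real one is read from the first page.
variables = json.dumps({"id": "1954202703", "first": 12}, separators=(",", ":"))
headers = {"X-Instagram-GIS": instagram_gis("0123456789abcdef", variables)}
print(headers)
```

The variables string has to be byte-for-byte the same one that is sent in the URL (before URL encoding), otherwise the hash will not match.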
The rule set must be more complex. I often use VPN which results in captchas on many pages but I never get one on Google. I guess the 300 queries/IP only count if other parameters indicate crawling.
But it's a bit clunky. I was running searches through an embedded WebBrowser control in a C# application, which is really an embedded Internet Explorer, and was very quickly presented with a captcha. A human was viewing the results, but a script was constructing the query string, and that was enough to be labelled as a crawler.
Often all it takes is to be using WebBrowser. I would love to use WebBrowser for small search utilities because it's so easy to use, but it seems to be a magnet for problems.
Yeah, it was a very general example, since there is at least one rule that is based on rate limiting too, and this 300/IP limit is what I have seen on average.
With a search as you type feature (Google instant?) that limit could be hit in a few searches...
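
To make the per-IP arithmetic concrete, here is a toy sliding-window counter. It is not Google's actual mechanism (nobody in the thread knows the real rule set); the 300/day figure is just the number quoted above, and the class name and the example IP are invented.

```python
from collections import defaultdict, deque


class PerIpLimiter:
    """Toy sliding-window limiter: at most `limit` hits per IP per window."""

    def __init__(self, limit=300, window_seconds=86400):
        self.limit = limit
        self.window = window_seconds
        self.hits = defaultdict(deque)  # ip -> timestamps of allowed hits

    def allow(self, ip, now):
        q = self.hits[ip]
        # Drop hits that have fallen out of the window.
        while q and now - q[0] >= self.window:
            q.popleft()
        if len(q) >= self.limit:
            return False  # this is where a captcha would be shown
        q.append(now)
        return True


# Ten people behind one office IP, 31 queries each, one query per second.
limiter = PerIpLimiter()
office_ip = "203.0.113.7"
allowed = sum(limiter.allow(office_ip, now=t) for t in range(10 * 31))
print(allowed)  # 300
```

With a shared IP, each person only gets about 30 queries before everyone starts seeing captchas, and a search-as-you-type feature that fires one request per keystroke would burn through that even faster.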
Often when I fire up my VPN I get CAPTCHAed on the first search.
Just be a proud false positive.