It looks like it just hits https://www.instagram.com/graphql/query/... every time you scroll down, so if you scroll too fast it hammers that endpoint and your requests to it get throttled.
They recently removed the User-Agent and CSRF token from the signature creation process. Right now only the rhx_gis parameter and the URL-decoded variables from the query string are used to generate the MD5 signature.
However, your findings about user agents look interesting. I assume they may use the user agent to generate rhx_gis. That could explain why auth doesn't work if you change a single character in the user agent.
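A rough sketch of that signature step, in case it helps. The thread only says the MD5 covers rhx_gis and the URL-decoded variables; joining them with a colon, the compact JSON encoding, the "first" key, and the function name instagram_gis are all assumptions here, and the input values are placeholders.

```python
import hashlib
import json


def instagram_gis(rhx_gis: str, variables: str) -> str:
    """Return a hex MD5 digest over rhx_gis and the decoded variables string."""
    # Assumption: the two parts are joined with a colon before hashing.
    payload = f"{rhx_gis}:{variables}"
    return hashlib.md5(payload.encode("utf-8")).hexdigest()


# Placeholder inputs for illustration only; "first" is an assumed key name.
variables = json.dumps(
    {"id": "1954202703", "first": 12, "after": "EXAMPLE_CURSOR"},
    separators=(",", ":"),  # assumption: compact JSON, no spaces
)
signature = instagram_gis("example_rhx_gis", variables)

# The digest would be sent as the X-Instagram-GIS request header.
headers = {"X-Instagram-GIS": signature}
```

The string that gets hashed has to be byte-for-byte the same variables string that ends up (URL-encoded) in the query string, otherwise the digest will not match.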
Note the id of the user (e.g. 1954202703). This is the "id": value in the url.
Note end_cursor. This is used for the "after": value in the url.
Note rhx_gis. This is used to create the "X-Instagram-GIS:" header.
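Pulling those three values out of the first page could look something like the sketch below. The sample_page string is a made-up fragment (the thread does not show the real page layout), and the helper name extract_first and the regex patterns are assumptions that would need adjusting against the actual HTML.

```python
import re
from typing import Optional


def extract_first(pattern: str, text: str) -> Optional[str]:
    """Return the first capture group of pattern in text, or None if absent."""
    match = re.search(pattern, text)
    return match.group(1) if match else None


# Hypothetical fragment standing in for the JSON embedded in the first page.
sample_page = (
    '{"rhx_gis":"abc123","user":{"id":"1954202703",'
    '"page_info":{"has_next_page":true,"end_cursor":"EXAMPLE_CURSOR"}}}'
)

# Caveat: on a real page the first "id" match may not be the profile owner's.
user_id = extract_first(r'"id":"(\d+)"', sample_page)
end_cursor = extract_first(r'"end_cursor":"([^"]+)"', sample_page)
rhx_gis = extract_first(r'"rhx_gis":"([^"]+)"', sample_page)
```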
Looking at archive.org, it seems that as recently as last year, end_cursor was all that was needed.
2. Fetch the JS from the ProfilePageContainer url found in the 1st page (e.g., https://www.instagram.com/static/bundles/base/ProfilePageCon...)
Note queryId (e.g. 42323d64886122307be10013ad2dcc44)
This is used for "query_hash" in the url.
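Finding the queryId in the fetched bundle could be done with a regex along these lines. This is only a sketch: the sample_js string is invented, and the assumption is that the bundle contains the id as queryId:"<32 hex chars>". A real bundle may hold several such ids, so the function returns all of them.

```python
import re
from typing import List


def find_query_ids(js_source: str) -> List[str]:
    """Return every 32-character hex queryId found in the JS source."""
    # Assumption: ids appear in the form queryId:"<32 lowercase hex chars>".
    return re.findall(r'queryId:"([0-9a-f]{32})"', js_source)


# Invented snippet standing in for the ProfilePageContainer bundle.
sample_js = 'var a={queryId:"42323d64886122307be10013ad2dcc44",other:1};'
query_ids = find_query_ids(sample_js)
```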
3. Create header "X-Instagram-GIS:"
Apparently this is some MD5 hash of rhx_gis and the query string variables according to this source:
https://www.diggernaut.com/blog/how-to-scrape-pages-infinite...
However, a little experimentation revealed that generation of rhx_gis or this hash must also incorporate the user-agent string -- change any character in the user-agent string and the request will fail.
They also put the IP address and a Unix time value in a cookie, but the cookie can be deleted and the request still succeeds.
For example the final url for the first 12 photos is:
https://www.instagram.com/graphql/query/?query_hash=42323d64...
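Assembling a url of that shape could be sketched as below. The "id" and "after" keys come from the notes above; the "first" key for the page size of 12, the "variables" parameter name, and the helper name build_query_url are assumptions, and the cursor is a placeholder.

```python
import json
from urllib.parse import parse_qs, urlencode, urlsplit

BASE_URL = "https://www.instagram.com/graphql/query/"


def build_query_url(query_hash, user_id, end_cursor, page_size=12):
    """Return (url, variables) for one page of results."""
    # Assumption: "first" carries the page size; "id" and "after" are
    # the values noted from the first page.
    variables = json.dumps(
        {"id": user_id, "first": page_size, "after": end_cursor},
        separators=(",", ":"),
    )
    query = urlencode({"query_hash": query_hash, "variables": variables})
    return BASE_URL + "?" + query, variables


url, variables = build_query_url(
    "42323d64886122307be10013ad2dcc44",  # queryId from step 2
    "1954202703",                        # example user id
    "EXAMPLE_CURSOR",                    # placeholder end_cursor
)

# URL-decoding the query string gives back the exact variables string.
decoded = parse_qs(urlsplit(url).query)["variables"][0]
```

For the next group of 12, the same function would be called again with the end_cursor taken from the previous response.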
Overall, it does not seem like too much work for someone who really wants to automate retrieval of Instagram photos. These requests for successive groups of 12 can be pipelined (RFC 2616) over a single connection. Not long ago, and for some number of years, it was even easier (e.g. just use the end_cursor value as "max_id" in the url).