| 1. Fetch the first page.

Note the id of the user (e.g. 1954202703). This is the "id": value in the url.

Note end_cursor. This is used for the "after": value in the url.

Note rhx_gis. This is used to create the "X-Instagram-GIS:" header. Looking at archive.org, it seems that as recently as last year end_cursor was all that was needed.

2. Fetch the js from the ProfilePageContainer url in the first page (e.g., https://www.instagram.com/static/bundles/base/ProfilePageCon...)

Note queryId (e.g. 42323d64886122307be10013ad2dcc44). This is used for "query_hash" in the url.

3. Create the "X-Instagram-GIS:" header.

Apparently this is some MD5 hash of rhx_gis and the query string variables, according to this source: https://www.diggernaut.com/blog/how-to-scrape-pages-infinite...

However, a little experimentation revealed that generation of rhx_gis or of this hash must also incorporate the user-agent string: change any character in the user-agent string and the request will fail. They also put the IP address and a Unix time value in a cookie, but the cookie can be deleted and the request still succeeds.

For example, the final url for the first 12 photos is: https://www.instagram.com/graphql/query/?query_hash=42323d64...

Overall, this seems like not too much work for someone who really wants to automate retrieval of Instagram photos. The requests for successive groups of 12 can be pipelined (RFC 2616) over a single connection.

Not long ago, and for a number of years before that, it was even easier (e.g. just use the end_cursor value as "max_id" in the url). |
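
A rough Python sketch of step 3 and the final url, for illustration only. It assumes the hash is the MD5 of rhx_gis and the variables JSON joined by a colon, which is my reading of the linked post and not something confirmed here. The function names and the "first" variable are made up for the example; "id", "after" and "query_hash" are the values noted in the steps above. Nothing is sent over the network.

```python
import hashlib
import json
from urllib.parse import urlencode


def build_variables(user_id, after, first=12):
    # "id" and "after" come from the first page; "first" (page size) is an assumed name.
    # Compact separators keep the signed string identical to the one sent in the url.
    return json.dumps({"id": user_id, "first": first, "after": after},
                      separators=(",", ":"))


def gis_signature(rhx_gis, variables):
    # Assumption: MD5 over "<rhx_gis>:<variables>". The exact recipe is unverified.
    return hashlib.md5(f"{rhx_gis}:{variables}".encode("utf-8")).hexdigest()


def build_request(query_hash, rhx_gis, user_id, after, user_agent):
    variables = build_variables(user_id, after)
    url = "https://www.instagram.com/graphql/query/?" + urlencode(
        {"query_hash": query_hash, "variables": variables})
    headers = {
        "X-Instagram-GIS": gis_signature(rhx_gis, variables),
        # Must be the same user-agent that fetched the first page (see above).
        "User-Agent": user_agent,
    }
    return url, headers
```

The url and headers returned by build_request can be handed to whatever HTTP client you like. Since rhx_gis seems to be tied to the user-agent, the same user-agent string has to be used for the first page and for every later request.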