| HN Mirror


by textmode 2943 days ago

1. Fetch 1st page.

Note the id of the user (e.g. 1954202703). This is the "id": value in the url.

Note end_cursor. This is used for the "after": value in the url.

Note rhx_gis. This is used to create the "X-Instagram-GIS:" header.

Judging from archive.org snapshots, as recently as last year end_cursor was all that was needed.
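A rough sketch of pulling the three step 1 values out of the page source. It assumes they appear in the HTML as plain JSON key/value pairs in the shapes matched below, and that the first "id" match is the user's, neither of which is confirmed here. The function name is made up for illustration.

```python
import re


def extract_page_values(html: str) -> dict:
    """Return id, end_cursor and rhx_gis found in the page source (None if absent)."""
    # Assumed shapes of the embedded JSON; adjust if the page differs.
    patterns = {
        "id": r'"id":"(\d+)"',
        "end_cursor": r'"end_cursor":"([^"]+)"',
        "rhx_gis": r'"rhx_gis":"([0-9a-f]+)"',
    }
    values = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, html)
        values[name] = match.group(1) if match else None
    return values
```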

2. Fetch the js from the ProfilePageContainer url found in the 1st page (e.g., https://www.instagram.com/static/bundles/base/ProfilePageCon...)

Note queryId (e.g. 42323d64886122307be10013ad2dcc44)

This is used for "query_hash" in the url.
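Something like the following could pick queryId out of the downloaded js. It assumes the bundle contains it as queryId:"<32 hex chars>" and that the first occurrence is the one wanted; the bundle may well hold several, so treat this as a starting point only.

```python
import re
from typing import Optional


def extract_query_id(js_source: str) -> Optional[str]:
    """Return the first 32-character hex queryId in the js text, or None."""
    # Assumed form: queryId:"42323d64..." (optional whitespace after the colon).
    match = re.search(r'queryId:\s*"([0-9a-f]{32})"', js_source)
    return match.group(1) if match else None
```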

3. Create header "X-Instagram-GIS:"

According to this source, it is apparently an MD5 hash of rhx_gis and the query string variables:

https://www.diggernaut.com/blog/how-to-scrape-pages-infinite...

However, a little experimentation revealed that generation of rhx_gis or of this hash must also incorporate the user-agent string: change any character in the user-agent string and the request will fail.

They also put the IP address and a Unix time value in a cookie, but the cookie can be deleted and the request still succeeds.
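A minimal sketch of the header computation, assuming the hash input is rhx_gis and the compact JSON variables string joined by a colon. The colon separator and the exact JSON formatting are assumptions, not something verified here. The user-agent does not appear in the input, so if it matters it presumably has to match the one used when the 1st page (and its rhx_gis) was fetched.

```python
import hashlib
import json
from typing import Tuple


def instagram_gis(rhx_gis: str, variables: dict) -> Tuple[str, str]:
    """Return (variables_json, signature) for the X-Instagram-GIS header."""
    # Compact JSON with no spaces; the same string must be sent in the url.
    variables_json = json.dumps(variables, separators=(",", ":"))
    # Assumed input format: "<rhx_gis>:<variables_json>".
    payload = f"{rhx_gis}:{variables_json}".encode("utf-8")
    return variables_json, hashlib.md5(payload).hexdigest()
```

The returned signature would go into the request as "X-Instagram-GIS: <signature>", sent with the same user-agent string as the 1st page fetch.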

For example the final url for the first 12 photos is:

https://www.instagram.com/graphql/query/?query_hash=42323d64...
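For illustration, the url could be assembled roughly like this. The "id" and "after" keys come from step 1; the "first" key for the page size of 12 and the name of the "variables" parameter are assumptions, since the example url above is truncated.

```python
import json
from urllib.parse import urlencode


def build_query_url(query_hash: str, user_id: str, end_cursor: str, page_size: int = 12) -> str:
    """Build the graphql url for one group of photos."""
    # "first" as the page-size key is assumed, not taken from the post.
    variables = json.dumps(
        {"id": user_id, "first": page_size, "after": end_cursor},
        separators=(",", ":"),
    )
    # urlencode percent-encodes the JSON so it is safe in a query string.
    query = urlencode({"query_hash": query_hash, "variables": variables})
    return "https://www.instagram.com/graphql/query/?" + query
```

The end_cursor returned in each response would then be fed back in as end_cursor for the next group of 12.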

Overall, it seems like not too much work for someone who really wants to automate retrieval of Instagram photos. These requests for successive groups of 12 can be pipelined (RFC 2616) over a single connection. Until not long ago, and for a number of years, it was even easier (e.g. just use the end_cursor value as "max_id" in the url).

1 comment

diggernaut 2943 days ago

They recently removed the User Agent and CSRF token from the signature creation process. Right now only the rhx_gis parameter and the URL-decoded variables from the query string are used to generate the MD5 signature. However, your findings about user agents look interesting. I assume they may use the user agent to generate rhx_gis. That could explain why auth doesn't work if you change a single character in the user agent.
