Hacker News
by textmode 2944 days ago
This sounds like an issue that is specific to JavaScript-controlled browsers. With a traditional, non-JavaScript TCP/TLS/HTTP client it is trivial to extract the image URLs and other information from the page using a single HTTP request (and from each successive page using more HTTP requests over a single connection, if "has_next_page" is "true"). No "API" needed. Can you provide an example of a single page with 1000+ images?

https://www.instagram.com/ryuji513

It looks like it just hits https://www.instagram.com/graphql/query/... every time you scroll down, so if you scroll too fast you hammer that endpoint and your requests to it get throttled.

1. Fetch 1st page.

Note the id of the user (e.g. 1954202703). This is the "id": value in the url.

Note end_cursor. This is used for the "after": value in the url.

Note rhx_gis. This is used to create the "X-Instagram-GIS:" header.

Looking at archive.org, it seems that as recently as last year end_cursor was all that was needed.

2. Fetch js from ProfilePageContainer url in 1st page (e.g., https://www.instagram.com/static/bundles/base/ProfilePageCon...)

Note queryId (e.g. 42323d64886122307be10013ad2dcc44)

This is used for "query_hash" in the url.

3. Create header "X-Instagram-GIS:"

Apparently this is some MD5 hash of rhx_gis and the query string variables according to this source:

https://www.diggernaut.com/blog/how-to-scrape-pages-infinite...

However a little experimentation revealed that generation of rhx_gis or this hash must also incorporate the user-agent string -- change any character in the user-agent string and the request will fail.

They also put IP address and a Unix time value in a cookie but the cookie can be deleted and the request still succeeds.
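A rough sketch of step 3, assuming the formula from the linked source: an MD5 over rhx_gis and the variables string from the query. The colon separator and the function names are my assumptions and are not confirmed anywhere above. Because of the user-agent observation, the sketch sends the same User-Agent string that was used to fetch the 1st page.

```python
import hashlib


def gis_signature(rhx_gis: str, variables: str) -> str:
    """MD5 hex digest over rhx_gis and the variables string.

    Assumption: the two parts are joined with ":" before hashing.
    """
    payload = f"{rhx_gis}:{variables}".encode("utf-8")
    return hashlib.md5(payload).hexdigest()


def gis_headers(rhx_gis: str, variables: str, user_agent: str) -> dict:
    """Headers for a query request; user_agent must match the 1st fetch."""
    return {
        "X-Instagram-GIS": gis_signature(rhx_gis, variables),
        "User-Agent": user_agent,
    }


# Made-up values for illustration only, not real Instagram data.
example_headers = gis_headers(
    rhx_gis="abc123",
    variables='{"id":"1954202703","after":"CURSOR"}',
    user_agent="example-client/1.0",
)
print(example_headers["X-Instagram-GIS"])
```

If the separator guess is wrong, only the f-string in gis_signature needs to change.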

For example the final url for the first 12 photos is:

https://www.instagram.com/graphql/query/?query_hash=42323d64...
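The url above is truncated, so here is a sketch of how such a url could be assembled from the noted values. "query_hash", "id" and "after" come from the steps above; the "variables" parameter name is taken from the reply about the signature, and the "first" key for the page size of 12 is my assumption. The cursor value is a placeholder.

```python
import json
from urllib.parse import urlencode

GRAPHQL_URL = "https://www.instagram.com/graphql/query/"


def build_variables(user_id: str, end_cursor: str, page_size: int = 12) -> str:
    """Compact JSON string carrying the id and after values.

    Assumption: the page size is passed under the key "first".
    """
    return json.dumps(
        {"id": user_id, "first": page_size, "after": end_cursor},
        separators=(",", ":"),
    )


def build_query_url(query_hash: str, variables: str) -> str:
    """Percent-encode the parameters and append them to the endpoint."""
    query = urlencode({"query_hash": query_hash, "variables": variables})
    return f"{GRAPHQL_URL}?{query}"


variables = build_variables("1954202703", "PLACEHOLDER_CURSOR")
url = build_query_url("42323d64886122307be10013ad2dcc44", variables)
print(url)
```

The unencoded variables string is kept separate from the url because, per the reply below, the signature is computed over the URL-decoded variables.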

Overall, this does not seem like too much work for someone who really wants to automate retrieval of Instagram photos. The requests for successive groups of 12 can be RFC 2616 pipelined over a single connection. Not long ago, and for some number of years, it was even easier (e.g. just use the end_cursor value as "max_id" in the url).
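The walk over successive groups of 12 could be sketched like this. It assumes "has_next_page" and "end_cursor" appear as plain JSON key/value pairs in each response body, and it takes a caller-supplied fetch_after function (hypothetical) that performs the actual HTTP request for a given cursor. The requests here are sequential, not pipelined.

```python
import re
from typing import Callable, Iterator, Optional, Tuple


def read_page_info(body: str) -> Tuple[bool, Optional[str]]:
    """Return (has_next_page, end_cursor) found in a response body.

    Assumption: both appear as JSON key/value pairs in the text.
    """
    nxt = re.search(r'"has_next_page"\s*:\s*(true|false)', body)
    cur = re.search(r'"end_cursor"\s*:\s*"([^"]*)"', body)
    has_next = nxt is not None and nxt.group(1) == "true"
    return has_next, (cur.group(1) if cur else None)


def iter_pages(
    first_body: str,
    fetch_after: Callable[[str], str],
    max_pages: int = 100,
) -> Iterator[str]:
    """Yield the 1st body, then each following one until pages run out."""
    body = first_body
    for page_number in range(max_pages):
        yield body
        has_next, cursor = read_page_info(body)
        if not has_next or cursor is None or page_number + 1 >= max_pages:
            return
        body = fetch_after(cursor)


# Made-up bodies standing in for real responses.
fake_pages = {
    "A": '{"has_next_page":true,"end_cursor":"B"}',
    "B": '{"has_next_page":false,"end_cursor":null}',
}
first = '{"has_next_page":true,"end_cursor":"A"}'
for body in iter_pages(first, lambda cursor: fake_pages[cursor]):
    print(body)
```

A real fetch_after would build the url and the X-Instagram-GIS header for the given cursor and reuse one open connection. The max_pages cap is there so a bad cursor cannot loop forever.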

They recently removed the User Agent and CSRF token from the signature creation process. Right now only the rhx_gis parameter and the URL-decoded variables from the query string are used to generate the MD5 signature. However, your findings about user agents look interesting. I assume they may use the user agent to generate rhx_gis. That could explain why auth doesn't work if you change a single char in the user agent.