by 0x00000000 2944 days ago
Instagram is the worst I have come across. If you are on a page with 1000+ pictures and trying to find something near the bottom, you have to let it load each new group sequentially; after a while it starts timing you out for 60 seconds or longer every couple of times you load more. God forbid you accidentally navigate away while scrolling: you have to start all over again from the top.

Due to recent events it seems they got scared, locked down their API, then tightened the request limit to prevent scraping, to the point that it is hardly usable on desktop anyway.


LinkedIn is even worse, imo. Go to any random company page and it'll show you a page asking you to log in. Refresh it and it'll show you the page without asking you to log in.
That is probably because LinkedIn's Authwall algorithm is being A/B tested. They do lots of machine learning to stop bots, and I would say scraping LinkedIn is close to impossible, even at small volumes.
This sounds like an issue that is specific to JavaScript-controlled browsers. If using a traditional, non-JavaScript TCP/TLS/HTTP client, it is trivial to extract the image urls and other information from the page using a single HTTP request (and from each successive page using more HTTP requests in a single connection, if "has_next_page" is "true"). No "API" needed. Can you provide an example of a single page with 1000+ images?
https://www.instagram.com/ryuji513

It looks like it just hits https://www.instagram.com/graphql/query/... every time you scroll down, so if you scroll too fast it just hammers that endpoint and your requests to it get throttled.

1. Fetch 1st page.

Note id of user (e.g. 1954202703). This is the "id": value in the url.

Note end_cursor. This is used for the "after": value in the url.

Note rhx_gis. This is used to create the "X-Instagram-GIS:" header.

Looking at archive.org, it seems that as recently as last year end_cursor was all that was needed.

2. Fetch js from ProfilePageContainer url in 1st page (e.g., https://www.instagram.com/static/bundles/base/ProfilePageCon...)

Note queryId (e.g. 42323d64886122307be10013ad2dcc44)

This is used for "query_hash" in the url.

3. Create header "X-Instagram-GIS:"

Apparently this is an MD5 hash of rhx_gis and the query string variables, according to this source:

https://www.diggernaut.com/blog/how-to-scrape-pages-infinite...

However a little experimentation revealed that generation of rhx_gis or this hash must also incorporate the user-agent string -- change any character in the user-agent string and the request will fail.

They also put IP address and a Unix time value in a cookie but the cookie can be deleted and the request still succeeds.
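
A rough sketch of step 3, assuming the header is just the MD5 hex digest of rhx_gis and the variables string. The ":" separator, the "first" key, the function name gis_signature and the placeholder values are my guesses for illustration, not something confirmed above:

```python
import hashlib
import json


def gis_signature(rhx_gis: str, variables: str) -> str:
    """Return an MD5 hex digest over rhx_gis and the variables string.

    Assumption: the two inputs are joined with ":" before hashing.
    The thread only says "some MD5 hash of rhx_gis and the query
    string variables", so the exact layout may differ.
    """
    payload = f"{rhx_gis}:{variables}"
    return hashlib.md5(payload.encode("utf-8")).hexdigest()


# Placeholder values; a real rhx_gis and end_cursor come from the 1st page.
variables = json.dumps(
    {"id": "1954202703", "first": 12, "after": "EXAMPLE_CURSOR"},
    separators=(",", ":"),  # compact JSON, no spaces
)
headers = {"X-Instagram-GIS": gis_signature("EXAMPLE_RHX_GIS", variables)}
```

The same variables string would have to be used both in the url and in the hash, otherwise the signature cannot match.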

For example the final url for the first 12 photos is:

https://www.instagram.com/graphql/query/?query_hash=42323d64...
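
For illustration, a url of this general shape could be assembled as below. The "variables" parameter name, the "first" key and the helper name build_query_url are assumptions on my part; only "query_hash", "id" and "after" are named above:

```python
import json
from urllib.parse import parse_qs, urlencode, urlsplit


def build_query_url(query_hash: str, user_id: str, end_cursor: str,
                    page_size: int = 12) -> str:
    """Assemble a graphql query url from the values noted in steps 1 and 2.

    Assumption: the JSON goes into a parameter called "variables" and the
    page size key is "first"; neither name is confirmed in the thread.
    """
    variables = json.dumps(
        {"id": user_id, "first": page_size, "after": end_cursor},
        separators=(",", ":"),
    )
    query = urlencode({"query_hash": query_hash, "variables": variables})
    return "https://www.instagram.com/graphql/query/?" + query


# queryId and user id are the examples from above; the cursor is a placeholder.
url = build_query_url("42323d64886122307be10013ad2dcc44",
                      "1954202703", "EXAMPLE_CURSOR")
```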

Overall, this does not seem like too much work for someone who really wants to automate retrieval of Instagram photos. These requests for successive groups of 12 can be RFC 2616 pipelined over a single connection. Not long ago, and for some number of years, it was even easier (e.g. just use the end_cursor value as "max_id" in the url).
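
The cursor-following part can be sketched without touching the network. Here fetch_page is a hypothetical callable that takes a cursor and returns one parsed page; the "items" and "page_info" keys are an assumed layout, and the fake data stands in for real responses:

```python
from typing import Callable, Dict, Iterator, List, Optional


def iter_pages(fetch_page: Callable[[Optional[str]], Dict],
               max_pages: int = 1000) -> Iterator[List]:
    """Yield each group of items, following end_cursor while has_next_page is true.

    fetch_page(None) stands for the 1st page; later calls pass the previous
    end_cursor. max_pages is a safety cap against a cursor that never ends.
    """
    cursor: Optional[str] = None
    for _ in range(max_pages):
        page = fetch_page(cursor)
        yield page["items"]
        info = page["page_info"]
        if not info["has_next_page"]:
            break
        cursor = info["end_cursor"]


# Fake three-page profile keyed by cursor (hypothetical response layout).
FAKE_PAGES = {
    None: {"items": [1, 2], "page_info": {"has_next_page": True, "end_cursor": "c1"}},
    "c1": {"items": [3, 4], "page_info": {"has_next_page": True, "end_cursor": "c2"}},
    "c2": {"items": [5], "page_info": {"has_next_page": False, "end_cursor": None}},
}

pages = list(iter_pages(lambda cursor: FAKE_PAGES[cursor]))
```

A real fetch_page would issue the signed request for each cursor, ideally reusing one connection as described above.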

They recently removed the User-Agent and CSRF token from the signature creation process. Right now only the rhx_gis parameter and the URL-decoded variables from the query string are used to generate the MD5 signature. However, your finding about the user agent looks interesting. I assume they may use the user agent to generate rhx_gis. That could explain why auth doesn't work if you change a single character in the user agent.