by lxgr 888 days ago
Would that explain getting past an auth wall though, i.e. loading the HTML page as if the user were logged in but without auth headers and cookies?

You may be slightly misreading the write-up. Note the following two bits:

> What we found were user agents purporting to be from a range of devices including mobile devices, all only ever loading a single page without any existing state like cookies.

> The behavior itself is also strange, how did it load these pages which were often behind an authwall without ever logging in or having auth cookies?

I don't think they mean to say that pages behind authentication were successfully loaded without authenticating. If cookies are required to load the page, you aren't loading it without them. So I read this as "The sessions weren't authenticated, so where on earth did they even find these URLs?"

The answer is that there's a real, authenticated user behind a firewall, and every unknown URL this user visits is getting queued up for the crawler to classify later, query string and all. So the crawler's behavior looks like the user's, but offset by a few seconds and without any state. Presumably the auth wall is doing its job and rejecting these requests.
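
To make that mechanism concrete, here is a minimal sketch of how such a URL-filtering box might behave. Everything in it is an assumption for illustration (the `Request` and `UrlFilterProxy` names, the in-memory queue, the placeholder user agent); it is not how any particular vendor implements it. The point is only that the user's request passes through untouched, while the crawler later replays the bare URL with none of the user's state.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Request:
    url: str
    headers: dict = field(default_factory=dict)


class UrlFilterProxy:
    """Hypothetical inline proxy: forwards user traffic, queues unknown URLs."""

    def __init__(self):
        self.known = set()    # URLs already seen (classified or pending)
        self.queue = deque()  # URLs waiting for the crawler

    def observe(self, req: Request) -> Request:
        # Only the URL is remembered, query string and all; cookies are not.
        if req.url not in self.known:
            self.known.add(req.url)
            self.queue.append(req.url)
        return req  # the user's own request is forwarded unchanged

    def crawl_pending(self) -> list:
        # The crawler replays each queued URL later, with no user state.
        replays = []
        while self.queue:
            url = self.queue.popleft()
            replays.append(Request(url, {"User-Agent": "hypothetical-mobile-ua"}))
        return replays


proxy = UrlFilterProxy()
user_req = Request("https://example.com/orders?uid=42", {"Cookie": "session=s3cr3t"})
proxy.observe(user_req)
for replay in proxy.crawl_pending():
    print(replay.url, replay.headers)
```

From the server's side, the replayed request has the same URL as the user's (user ID included) but arrives with no cookies, which matches the description in the write-up.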

OP here, I was trying to say that these pages were behind an authwall and loading with userids from a specific user but without any of their cookies to support that auth.

This led us to believe the page was being MitM'd rather than crawled directly (since a crawler would not be able to impersonate the user).

That's how I read it also. If the ids you're referring to were in the URL, it's almost certainly URL Filtering. The URLs are fed to the crawler via MITM, so you were basically right.

> I don't think they mean to say that pages behind authentication were successfully loaded without authenticating.

Hm, are you sure? From the article:

> Would render and execute all scripts on that page as if it was that user

> [...] scans pages by grabbing the page contents, sending it to a render queue and then processing it [...]

I know a system that fits the bill for the observed behavior: https://news.ycombinator.com/item?id=39051083

But apparently PAN can do it too: https://news.ycombinator.com/item?id=39051077

> Hm, are you sure? From the article:

> > Would render and execute all scripts on that page as if it was that user

If there is a valid user ID (or other user/session identifier) in the request URL or body, but no valid auth cookies, the system may respond with a page that references the same scripts the user would get, but with no data. In that case the scripts would run as they would for the user, perhaps requesting further resources directly or by placing things that reference them into the DOM (which is how they know the scripts ran), but would just render a “no data” message where the information would be.
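
As a rough illustration of that server-side behavior, here is a sketch of a handler that always returns the page shell and its script tag, but fills in data only when the session cookie matches the user ID in the URL. The `uid` parameter, the `SESSIONS` store, the `render` function and `/app.js` are all made up for the example; nothing here is taken from the article's actual system.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical session store: session cookie value -> user ID.
SESSIONS = {"s3cr3t": "42"}


def render(url: str, cookies: dict) -> str:
    """Return the page shell; include data only for the authenticated owner."""
    params = parse_qs(urlsplit(url).query)
    user_id = params.get("uid", [None])[0]
    authed = user_id is not None and SESSIONS.get(cookies.get("session")) == user_id
    body = f"<p>Orders for user {user_id}</p>" if authed else "<p>No data</p>"
    # The script tag is emitted either way, so a headless renderer
    # would still fetch and execute /app.js.
    return f'<html><body>{body}<script src="/app.js"></script></body></html>'


url = "https://example.com/orders?uid=42"
print(render(url, {"session": "s3cr3t"}))  # real user: data plus script
print(render(url, {}))                     # stateless replay: "No data" plus script
```

Under these assumptions the stateless replay never sees the user's data, yet any analytics or resource loads triggered by the script would still show up as if the page had been rendered "as that user".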