by gwern 582 days ago
I don't think that would make it any clearer. Why would 'doc.gwern.net' be more obviously just a random document than 'gwern.net/doc/www/'?

Regardless, I am puzzled how OP got this URL in the first place. He wasn't supposed to; he was supposed to get the canonical Arxiv PDF link, because this is one of the cache mirrors/local archives†, rather than a regular hosted document. We block everything in /doc/www/ in robots.txt & with HTTP no-archive/crawl/mirror/etc headers, and we use JS to swap out the local URL for the original URL whenever the reader clicks or mouse-overs or otherwise interacts with a link to the URL in a web page (and that is the only place they should be publicly listed or accessible). If OP read it on gwern.net by seeing a link to it, and he wanted to copy the URL elsewhere, he should have just gotten the canonical "https://arxiv.org/pdf/2108.07686#page=85"... But somehow he didn't.
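
(For readers who want the two mechanisms above spelled out, here is a rough sketch in Python. It is only an illustration: the real site does the swap in browser JS, and the robots.txt rule, the mirror path `/doc/www/arxiv.org/example.pdf`, the lookup table, and both function names are assumptions made up for the example, not the actual implementation.)

```python
from urllib.robotparser import RobotFileParser

# Assumed rule for illustration; the real robots.txt may differ.
ROBOTS_TXT = """\
User-agent: *
Disallow: /doc/www/
"""

# Hypothetical table from a local mirror path to the original URL.
MIRROR_TO_ORIGINAL = {
    "/doc/www/arxiv.org/example.pdf": "https://arxiv.org/pdf/2108.07686#page=85",
}


def crawl_allowed(url: str) -> bool:
    """Check a URL against the assumed robots.txt rules."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch("*", url)


def url_for_reader(local_path: str) -> str:
    """Return the original URL for a known mirror path, else the path unchanged."""
    return MIRROR_TO_ORIGINAL.get(local_path, local_path)
```

With those assumptions, a well-behaved crawler is refused anything under /doc/www/, and a reader who interacts with a mirrored link ends up with the Arxiv URL. A reader whose browser never runs the swap keeps the local path, which is the failure being discussed here.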

OP, do you remember how exactly you grabbed this URL? Is this an old link from before our URL swapping was implemented, or did you deliberately work around it, or did you find some place we forgot to swap, or what?

(If anyone is wondering why I mirror Arxiv PDFs like this in the first place: it's for the PDF preview feature in the popups. Because Arxiv blocks itself from being loaded in iframes, we need local mirrors for PDF preview to work at all; local mirrors save a new domain lookup and speed up the PDF preview a lot because we compress the PDF more thoroughly and Arxiv servers are always overloaded; and because readers can potentially pop up many Arxiv PDFs easily, it saves Arxiv a lot of bandwidth and avoids burdening their servers further, so it's just the responsible thing to do.)

† https://gwern.net/archiving#preemptive-local-archiving

2 comments

Not OP, but the HN crowd here often browses without JS. Quickly testing a no-JS session, I do see your archive URLs instead of arxiv ones.
Yes, without the swapping JS, you wouldn't get the canonical URL. But browsing Gwern.net these days without JS is pretty painful. And in this particular case, there is only one place on Gwern.net that the link exists where you could see it without JS; in the other 5 or 6 links, you could only get there via JS and thus the swapping should've happened. So it is not a safe assumption that OP simply browsed with NoScript.
Hi Gwern, I'm honestly not sure. I have some Firefox extension that skips trackers and other redirects. I have like 100 Firefox extensions, actually. I'm not sure how most of them work nor what they do exactly, I just trust that they make my browser more "secure" and I tend to download things at random -- especially if I see ads or want certain features in my client (e.g. a browser that auto-rejects cookies).

Happy to try and help you figure this out, but when I revisit this specific hyperlink I'm still getting the gwern URL & not the arxiv one.

Hm. So you're getting the raw URL but you don't have NoScript / block JS specifically? Can you check in an incognito window and if it still happens, ablate all your extensions in a fresh profile (https://support.mozilla.org/en-US/kb/profile-manager-create-...), which is something you might want to keep handy if you really have 100 extensions running and no idea what they all do...
> Why would 'doc.gwern.net' be more obviously just a random document than 'gwern.net/doc/www/'?

HN only shows the domain next to the title. So now when browsing the front page we only see gwern.net as the source of the doc and initially assume it's some work from you.

I don't think HN shows third-level domains, so the point is moot. There may be exceptions for web services that lend out subdomains like Github[1], but doc.gwern.net would probably still show as gwern.net[2]. If you're willing to see the URL in the browser statusbar or addressbar, then the URL path makes very clear that the actual source is arxiv.org.

[1] Example: gliimly.github.io -> gliimly.github.io https://news.ycombinator.com/item?id=42148808

[2] Example: www.researchgate.net -> researchgate.net https://news.ycombinator.com/item?id=42181345

You're right, I didn't realize that the third-level domains that show up may be due to some kind of whitelisting.

The [2] was not a convincing example because www sounds like something that'd get special treatment, but then I found this one:

tech.marksblogg.com -> marksblogg.com (https://news.ycombinator.com/item?id=42182519)

which proves you right. TIL.

It's not even really about domains. It's about displaying the "author"/"source" of the content in the link:

https://news.ycombinator.com/item?id=23537725

> https://twitter.com/Foone/status/1011692979877105664

turns to

> (twitter.com/foone)
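
The examples in this thread are consistent with a rule roughly like the sketch below. To be clear, this is a guess reconstructed from those four examples, not HN's actual code: the two lists and the function name `site_label` are hypothetical, and taking the last two labels of the hostname is a simplification that would mishandle suffixes like co.uk.

```python
from urllib.parse import urlsplit

# Hypothetical lists, inferred only from the examples in this thread.
KEEP_SUBDOMAIN_SUFFIXES = ("github.io",)   # hosts that lend out subdomains
APPEND_FIRST_SEGMENT = ("twitter.com",)    # hosts where the path names the author


def site_label(url: str) -> str:
    """Guess the short 'source' label shown next to a submission title."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    labels = host.split(".")
    if any(host.endswith("." + suffix) for suffix in KEEP_SUBDOMAIN_SUFFIXES):
        site = ".".join(labels[-3:])       # keep the user's subdomain
    else:
        site = ".".join(labels[-2:])       # naive: drop everything else
    if site in APPEND_FIRST_SEGMENT:
        segments = [s for s in parts.path.split("/") if s]
        if segments:
            site += "/" + segments[0].lower()
    return site
```

Under this guess, both `https://gwern.net/doc/www/...` and a hypothetical `https://doc.gwern.net/...` would be labelled gwern.net, which is the point made upthread.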

That brings up the second question though, which is why someone would assume that doc.gwern.net links to a document not by Gwern.
That's why I'm trying to think of a better subdomain.

- archive.gwern.net?

- static.gwern.net?

- thirdparty.gwern.net?

- localarchive.gwern.net?