Ask HN: Why aren’t the Wayback Machine archived pages indexed by search engines?
4 points by hosa 1936 days ago
I have had this question for some time now. Not indexing archived pages contributes to a shittier, more hassle-filled web.
5 comments

If the archive competed with the originals for clicks, that would (a) make a lot of site owners cross and (b) serve stale content to users whenever the original page is still up and being updated.
Search engines are designed to give you the best result on the web today.

The Wayback Machine / archive.org is a snapshot in time of the web.

If search engines combined the current web and the old web, it would be an interesting experiment but possibly a diff nightmare.

Maybe it’s something that could be a point of differentiation for a new search engine compared to Google.

For anyone wanting to take this on, maybe start with Common Crawl [0]

0. https://commoncrawl.org/the-data/

I once used a Firefox extension that lets you view the archived version of any URL (provided the site allowed the WM crawler in its robots.txt). It worked both for live URLs and for URLs that 404'd or no longer existed (bitrotted pages).
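For anyone curious about the robots.txt condition, here is a rough sketch of how such a check could look using Python's standard `urllib.robotparser`. The user-agent token `ia_archiver` and the sample robots.txt are assumptions for illustration only; I have not verified which token the WM crawler actually uses, or whether it still honors these rules.

```python
from urllib.robotparser import RobotFileParser


def crawler_allowed(robots_txt: str, url: str, agent: str = "ia_archiver") -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text.

    The default agent name is an assumption, not a confirmed WM crawler token.
    """
    parser = RobotFileParser()
    # parse() accepts an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Hypothetical robots.txt: blocks the assumed archive crawler from /private/ only.
SAMPLE_ROBOTS = """User-agent: ia_archiver
Disallow: /private/

User-agent: *
Disallow:
"""

if __name__ == "__main__":
    print(crawler_allowed(SAMPLE_ROBOTS, "https://example.com/private/page"))
    print(crawler_allowed(SAMPLE_ROBOTS, "https://example.com/blog/post"))
```

A page under a disallowed path would simply never have been captured, which is one way an archive lookup can come back empty.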
Yes, I know that extension. But my purpose is different: for research and education, there are pages that only exist in the Internet Archive, so I do not want to search for the same thing twice in two different places.
Yeah, but if the site returns a 404, you simply right-click and the extension shows you the archived copy. Sadly, many sites fail to provide a decent 404 page when a resource doesn't exist; they just spam redirects or, in the worst case, return a 503 error [0].

(But I understand your need for text search on these services)

[0] https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/503
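
If you would rather script the 404 fallback than rely on an extension, a minimal sketch using the Wayback Machine availability endpoint (`https://archive.org/wayback/available`) might look like the following. The function names are made up for illustration, and the response fields assumed here (`archived_snapshots`, `closest`, `available`, `url`) should be checked against the Internet Archive's own documentation before relying on them.

```python
import json
from typing import Optional
from urllib.parse import urlencode
from urllib.request import urlopen

AVAILABILITY_API = "https://archive.org/wayback/available"


def availability_url(page_url: str, timestamp: Optional[str] = None) -> str:
    """Build the availability query for a page, optionally near a timestamp."""
    params = {"url": page_url}
    if timestamp:
        params["timestamp"] = timestamp  # e.g. "20200101" (YYYYMMDD)
    return f"{AVAILABILITY_API}?{urlencode(params)}"


def closest_snapshot(payload: dict) -> Optional[str]:
    """Extract the closest archived snapshot URL from a decoded JSON response."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def archived_copy_if_missing(page_url: str, status_code: int) -> Optional[str]:
    """Query the archive only when the live site answered 404."""
    if status_code != 404:
        return None
    with urlopen(availability_url(page_url), timeout=10) as resp:
        return closest_snapshot(json.load(resp))
```

Note that this only helps when the site really answers with a 404; a redirect to the home page or a 503 would need its own handling.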

Perhaps this article can help you somewhat: https://www.netforlawyers.com/content/archive-wayback-machin...
Are you asking technically why or philosophically why?
Both. Shoot me