| HN Mirror


	by niteshade 1319 days ago
	For "a next-generation crawling and spidering framework", it's a little surprising to see no support for the WARC [1] format.

	[1]: https://en.wikipedia.org/wiki/Web_ARChive

2 comments

pr337h4m 1319 days ago

I wonder why the Internet Archive never tried to build a web search engine. Their crawls of the entire web could be more comprehensive than Google's (assuming Google doesn't archive old copies of websites).

dredmorbius 1319 days ago

Brewster Kahle, the IA's founder, did. It is called Alexa Internet, and was sold to Amazon:

<https://en.wikipedia.org/wiki/Alexa_Internet>

<https://help.archive.org/help/wayback-machine-general-inform...>

A condition of that sale was that Alexa would continue to provide the results of its crawls, after a delay, to the Internet Archive. Those crawls form a substantial portion of IA's Wayback Machine archive.

I'm not certain that those archives are ongoing, as Alexa seems to have been largely shut down.

IA are a bit cagey on details, but I believe that there is a general IA-based archival service. There's certainly the "Save Page Now" feature:

  https://web.archive.org/save/<URL>

And the independent but closely cooperating ArchiveTeam (led by Jason Scott) builds crawlers tailored to specific endangered / vulnerable websites, run through its Warrior software:

<https://wiki.archiveteam.org/>
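
For anyone wanting to script the "Save Page Now" pattern above, here is a minimal sketch. It assumes only the `https://web.archive.org/save/<URL>` form quoted in this comment; the helper name `save_page_now_url` is hypothetical, and it only builds the URL without sending any request.

```python
from urllib.parse import urlsplit

# Prefix taken from the URL pattern quoted in the comment above.
SAVE_PREFIX = "https://web.archive.org/save/"


def save_page_now_url(target: str) -> str:
    """Build a Save Page Now URL for an absolute http(s) target (hypothetical helper)."""
    parts = urlsplit(target)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError(f"expected an absolute http(s) URL, got {target!r}")
    # The target is appended verbatim, mirroring the <URL> placeholder.
    return SAVE_PREFIX + target


if __name__ == "__main__":
    print(save_page_now_url("https://example.com/page"))
```

Actually fetching that URL is left out on purpose, since rate limits and any required headers are not covered by anything stated here.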

lakomen 1318 days ago

Interesting, from a consumer's perspective I never liked Alexa. But from a hoster's perspective it was awesome. Especially when you're in the top 1000. It helped my site get more popular.

graypegg 1319 days ago

That’s both really intriguing, and horrifying!

It’s already _technically_ impossible to erase something from the internet, but if they removed the barrier of needing to know where something used to be in order to find it in the archive, it would be truly impossible in every sense of the word.

ddorian43 1319 days ago

Crawling should be the easiest part.

marginalia_nu 1318 days ago

I don't know if there is an easy part in search. Almost every aspect of it has unique challenges.

Large scale crawling is primarily a challenge in balancing the logistics in a way that is kind to both the crawler and the data consumers.

Distributed crawling, if you go that way, is also non-trivial, as you're effectively juggling a shared, rapidly mutating state in the dozens of gigabytes.
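
To illustrate the shared-state problem, one common way to shrink it (an assumption for illustration, not something described in this comment) is to assign every host to exactly one worker by hashing the hostname, so per-host queues and rate-limit state stay local to that worker. The names `worker_for` and `partition` are hypothetical.

```python
import hashlib
from collections import defaultdict
from urllib.parse import urlsplit


def worker_for(url: str, num_workers: int) -> int:
    """Map a URL to a worker index based only on its hostname."""
    host = (urlsplit(url).hostname or "").lower()
    # A stable hash (unlike built-in hash()) gives the same answer on every machine.
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_workers


def partition(urls, num_workers: int) -> dict:
    """Group URLs into per-worker shards; all URLs of one host share a shard."""
    shards = defaultdict(list)
    for url in urls:
        shards[worker_for(url, num_workers)].append(url)
    return dict(shards)


if __name__ == "__main__":
    frontier = [
        "https://example.com/a",
        "https://example.com/b",
        "https://example.org/",
    ]
    for worker, urls in sorted(partition(frontier, 4).items()):
        print(worker, urls)
```

This only sidesteps part of the problem: newly discovered links still have to be routed to the worker that owns their host, and changing the number of workers reshuffles ownership.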

marginalia_nu 1318 days ago

At a guess, WARC wants headers and stuff that are at the very least inconvenient to get at with your usual headless browser drivers. I also have a hunch WARC may not be entirely well defined when archiving js-rendered websites.
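
To make the "headers and stuff" concrete, here is a rough sketch of a single WARC/1.0 `response` record. The record block is the raw HTTP response, status line and headers included, which is the part a headless browser driver does not readily hand over. This is hand-rolled for illustration only; the function name `warc_response_record` is hypothetical, and a real crawler would more likely rely on an established WARC library.

```python
import uuid
from datetime import datetime, timezone


def warc_response_record(target_uri: str, raw_http_response: bytes, when=None) -> bytes:
    """Serialize one WARC/1.0 'response' record (illustrative sketch).

    raw_http_response must hold the full HTTP message: status line,
    headers, blank line, and body. 'when' should be timezone-aware.
    """
    when = (when or datetime.now(timezone.utc)).astimezone(timezone.utc)
    header_lines = [
        "WARC/1.0",
        "WARC-Type: response",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"WARC-Date: {when.strftime('%Y-%m-%dT%H:%M:%SZ')}",
        f"WARC-Target-URI: {target_uri}",
        "Content-Type: application/http; msgtype=response",
        f"Content-Length: {len(raw_http_response)}",
    ]
    head = "\r\n".join(header_lines).encode("utf-8") + b"\r\n\r\n"
    # Each record is terminated by two CRLFs after the block.
    return head + raw_http_response + b"\r\n\r\n"


if __name__ == "__main__":
    raw = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html></html>"
    print(warc_response_record("https://example.com/", raw).decode("utf-8"))
```

Note that nothing in the record has a natural slot for a DOM produced after JavaScript has run, which fits the hunch about js-rendered pages.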
