Hacker News
by prox 1574 days ago
I kind of wonder how we can make it searchable again. Is this included in this archiving effort?

In any case wonderful work.

There is a standard set of tooling for indexing archives: CDX files. [1]

They index WARC archives and can be used to quickly find records. You can build on top of this (and some systems do) to make a proper search front-end.

But in general, these archives are NOT geared towards full-blown search, because it would be pretty expensive to keep the indexes in hot cache. Plus you would need to deal with multiple historical versions of each record, which search UIs normally don't handle.
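To make the "quickly find records" part concrete, here is a rough sketch of a CDX lookup. It assumes the classic 11-field, space-separated CDX layout (header ` CDX N b a m s k r M S V g`), and the sample rows, digests and filename below are made up for illustration. The names `Capture`, `CdxIndex` and `lookup` are hypothetical, not taken from any ArchiveTeam tool.

```python
import bisect
from typing import List, NamedTuple


class Capture(NamedTuple):
    urlkey: str      # canonicalized, sort-friendly URL key
    timestamp: str   # capture time, YYYYMMDDhhmmss
    original: str    # URL as originally fetched
    length: int      # compressed record size in bytes
    offset: int      # byte offset of the record inside the WARC file
    filename: str    # WARC file that holds the record


def parse_line(line: str) -> Capture:
    # Assumed field order: urlkey timestamp original mimetype status
    # digest redirect metatags length offset filename
    f = line.split()
    return Capture(f[0], f[1], f[2], int(f[8]), int(f[9]), f[10])


class CdxIndex:
    def __init__(self, text: str) -> None:
        rows = [parse_line(l) for l in text.splitlines()
                if l.strip() and not l.lstrip().startswith("CDX")]
        # Sorting by (urlkey, timestamp) is what makes binary search possible.
        self.rows = sorted(rows, key=lambda r: (r.urlkey, r.timestamp))
        self.keys = [r.urlkey for r in self.rows]

    def lookup(self, urlkey: str) -> List[Capture]:
        # All captures of one URL are adjacent, oldest first.
        lo = bisect.bisect_left(self.keys, urlkey)
        hi = bisect.bisect_right(self.keys, urlkey)
        return self.rows[lo:hi]


# Made-up sample data, for illustration only.
SAMPLE = """ CDX N b a m s k r M S V g
com,example)/about 20190601000000 http://example.com/about text/html 200 CCCC - - 300 1152 sample-00000.warc.gz
com,example)/ 20200101000000 http://example.com/ text/html 200 BBBB - - 640 512 sample-00000.warc.gz
com,example)/ 20190101000000 http://example.com/ text/html 200 AAAA - - 512 0 sample-00000.warc.gz
"""

index = CdxIndex(SAMPLE)
for capture in index.lookup("com,example)/"):
    print(capture.timestamp, capture.filename, capture.offset, capture.length)
```

Each hit gives a filename, offset and length, so a reader can seek straight to the record instead of scanning the whole WARC. Note that one URL key returns several captures, which is the "historical versions" problem: this is a lookup by exact URL, not a content search.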

[1] https://wiki.archiveteam.org/index.php/The_WARC_Ecosystem#CD...

Ah, is the WARC format the reason it's called 'Warrior'? It seems like a very strange name for an archival program.
ArchiveTeam seems very guerrilla in their operations.

I always imagined the Warrior as a camo-faced archivist operating under cover of darkness, preserving data even in the most hostile Yahoo-occupied territory.

Thank you for that information!