by marcua 1619 days ago
This is great! It looks like you can even grab WARCs using `wget`: https://golangexample.com/put-a-web-archive-warc-on-an-s3-bu...

I'm curious: what sort of fidelity have you seen in `grab-site`'s rewritten static asset URLs? Having to fix URLs that weren't properly rewritten ended up taking me the most time.


grab-site doesn't rewrite URLs. It captures the full request and response of each HTTP request for an asset and stores them as-is for archival purposes. The quality of any URL rewriting or other transformation is therefore governed by the tooling used to consume the generated WARCs.
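To make the "stored as-is" point concrete, here is a rough sketch of what a consumer sees when it reads a WARC. It is not grab-site's code: `iter_warc_records` and `build_record` are hypothetical helpers, the sample records are hand-built and omit headers a real WARC carries (such as `WARC-Record-ID` and `WARC-Date`), and it assumes an uncompressed stream. Real archives are usually gzip-compressed and better read with a dedicated library such as warcio.

```python
import io


def iter_warc_records(stream):
    """Yield (headers, block) pairs from an uncompressed WARC byte stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if line.strip() == b"":
            continue  # blank lines separate one record from the next
        if not line.startswith(b"WARC/"):
            raise ValueError(f"expected a WARC version line, got {line!r}")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # an empty line ends the record headers
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # The block is read by length, so its bytes are never interpreted.
        block = stream.read(int(headers["Content-Length"]))
        yield headers, block


def build_record(rec_type, uri, block):
    """Hypothetical helper: build a simplified WARC record for the demo."""
    head = (
        "WARC/1.0\r\n"
        f"WARC-Type: {rec_type}\r\n"
        f"WARC-Target-URI: {uri}\r\n"
        f"Content-Length: {len(block)}\r\n"
        "\r\n"
    ).encode("utf-8")
    return head + block + b"\r\n\r\n"


# Made-up request/response pair for a stylesheet on example.com.
request = b"GET /style.css HTTP/1.1\r\nHost: example.com\r\n\r\n"
response = (
    b"HTTP/1.1 200 OK\r\nContent-Type: text/css\r\n\r\n"
    b"body { background: url(/img/bg.png); }"
)
uri = "http://example.com/style.css"
sample = build_record("request", uri, request) + build_record("response", uri, response)

records = list(iter_warc_records(io.BytesIO(sample)))
for headers, block in records:
    print(headers["WARC-Type"], headers["WARC-Target-URI"], len(block))
```

The `url(/img/bg.png)` reference in the response body comes back exactly as the server sent it, so resolving it against the archive is left to whatever replays or extracts the WARC.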

More detail on the WARC format can be found below:

https://www.loc.gov/preservation/digital/formats/fdd/fdd0002...

https://en.wikipedia.org/wiki/Web_ARChive

http://fileformats.archiveteam.org/wiki/WARC

This is really helpful! Thank you so much!