by tamnd 3 days ago
It seems this repo only saves one web page?

What I'm implementing here is mirroring a whole website, with all its subpages, so you can browse it all offline. For example, all essays from paulgraham.com.
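To make "a whole website, with all its subpages" concrete, here is a rough sketch of a same-host breadth-first crawl. It is not the project's actual code: `mirror`, `LinkParser`, `fetch`, and `max_pages` are names made up for illustration, and `fetch` is assumed to be any callable that maps a URL to an HTML string (or `None` on failure). It only follows `<a href>` links and does not rewrite links or download images, CSS, or scripts, which a real mirroring tool would need to do.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkParser(HTMLParser):
    """Collects href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def mirror(start_url, fetch, max_pages=1000):
    """Breadth-first crawl restricted to the host of start_url.

    fetch is a hypothetical callable: URL -> HTML string, or None on failure.
    Returns a dict {url: html} for every page that was reached.
    """
    host = urlparse(start_url).netloc
    start_url = urldefrag(start_url).url  # drop any #fragment
    seen = {start_url}
    queue = deque([start_url])
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and strip fragments so that
            # page.html#a and page.html#b count as the same page.
            link = urldefrag(urljoin(url, href)).url
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

Because `fetch` is passed in, the crawl logic can be tried against an in-memory dict of pages (`mirror(start, site.get)`) before pointing it at a real HTTP client.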


Oh, I see. In that case, feature-wise, it is actually a modern alternative to HTTrack.

I think the misunderstanding stems from the reference to the browser's "Save As" in the description, which is misleading: you use "Save As" to save a single page, not an entire website.

Also, the description lacks a clear explanation of the project's purpose. It would be helpful to include a sentence explaining that the program downloads an entire website, not just a single page.

SingleFile supports scoped recursive crawls too: https://github.com/gildas-lormeau/single-file-cli#:~:text=an...

I highly recommend reading the SingleFile source or https://archiveweb.page/ to see how they handle closed shadow DOMs, cross-origin iframes, WebSockets, media URLs, deduping large assets, etc.

> For example, all essays from paulgraham.com

Not the same thing, but I made a clone of pg’s website which can be used for exactly that: https://github.com/shawwn/pg

https://shawwn.github.io/pg/

If you want to read all the essays, just clone the repo and open any of the .html files, or any of the .page files they were generated from.