by lappa
380 days ago
I use the SingleFile extension to archive every page I visit. It's easy to set up, but be warned: it takes up a lot of disk space.

$ du -h ~/archive/webpages
1.1T /home/andrew/archive/webpages

https://github.com/gildas-lormeau/SingleFile

1. Find a way to dedup media.
2. Make sure your content blockers are working well.
3. For news articles, run the page through Readability and store the Markdown instead. If you wanted to be really fancy, you could programmatically build a "template" for each site where you've visited multiple pages, so the shared styling is kept once instead of being stored with every page. Alternatively, a good compression algorithm could do this for you, if your directory looked like /home/andrew/archive/boehs.org.tar.gz, with all the boehs.org pages you visited saved inside the tar.
4. Add full-text search (FTS) and embeddings over the pages.
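
To make the per-site tarball idea in point 3 concrete, here is a rough Python sketch. It packs all pages of one site into a single .tar.gz and compares that with gzipping each page on its own. Everything named here (pack_site, separate_size, make_demo_pages, the fake page contents) is made up for illustration and is not part of SingleFile. One caveat: gzip only looks back about 32 KiB, so on real SingleFile output with large embedded media you would probably want a compressor with a bigger window, such as xz; the structure of the code would stay the same.

```python
import gzip
import hashlib
import io
import tarfile


def pack_site(pages: dict[str, bytes]) -> bytes:
    """Return a .tar.gz archive (as bytes) holding every page of one site."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, html in sorted(pages.items()):
            info = tarfile.TarInfo(name=name)
            info.size = len(html)
            tar.addfile(info, io.BytesIO(html))
    return buf.getvalue()


def separate_size(pages: dict[str, bytes]) -> int:
    """Total size if each page were gzip-compressed on its own."""
    return sum(len(gzip.compress(html)) for html in pages.values())


def make_demo_pages(count: int = 5) -> dict[str, bytes]:
    """Hypothetical stand-in for saved pages: shared CSS plus a short body."""
    # Fake site-wide stylesheet; hashed colours keep it from being
    # trivially compressible within a single page.
    css = "".join(
        f".c{i}{{color:#{hashlib.sha256(str(i).encode()).hexdigest()[:6]}}}\n"
        for i in range(200)
    )
    return {
        f"post-{n}.html": (
            f"<style>{css}</style><p>Unique text of post {n}</p>"
        ).encode()
        for n in range(count)
    }


if __name__ == "__main__":
    pages = make_demo_pages()
    print("one archive per site:", len(pack_site(pages)), "bytes")
    print("each page on its own:", separate_size(pages), "bytes")
```

Because the shared stylesheet shows up again within the compressor's window, later pages in the archive mostly turn into back-references to the first one, which is the "template" effect without having to build templates by hand.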