by al_borland 815 days ago
I use read-it-later type services a lot, and save more than I read. On many occasions I've gone back to finally read things and find that the pages no longer exist. I'm thinking moving to some kind of offline archival version would be a better option.
5 comments

I've used ArchiveBox in the past and it's been great for this purpose: https://github.com/ArchiveBox/ArchiveBox
Hi! This seems amazing and sustainable since it leverages industry-standard tools such as yt-dl and headless Chrome.

Now I'm curious, what made you stop using it?

I just found myself archiving fewer things over time and it’s been a while since I’ve saved anything. There’s nothing wrong with it though. In fact, I still have it on my machine.
This is great, thanks!
I used to have almost 10k bookmarks that I kept from circa 2010 to 2017, only to realize the majority of them were now useless. Some kind of tool like this is way overdue to become widespread.
Having a browser (Firefox?) actually innovate in this area instead of just reducing functionality to a carbon copy of Chrome is what is really overdue.
(They missed a chance to have a link to a download of the mtnl file of the github page haha)

Archive.org and the Wayback Machine should ask people to submit snaps of pages using this tool directly into the archive, especially during world events.

This would allow digital archaeologists to grok the sentiment of the world during that era...

(aside: when I interviewed at twitter they asked me what I thought twitter was, and I said I thought it was a global sentiment engine...)

But kudos to the world for where we are now, with AI being born onto the global internet: the Wayback Machine, coupled with AIs and LLMs and this tool, will allow one to ask questions about history in ways that will be very interesting.

--

"What was the general media coverage of [topic] in [decade] with respect to how we currently look at it, and are there articles covering [SUBJECT] in this topic for that time period?"

etc...

https://github.com/palewire/savepagenow

https://github.com/jjjake/internetarchive

The Internet Archive cannot trust arbitrary content that someone else archived earlier, so it is better to have whatever archival tool or operation you're running also ask the Wayback Machine to take its own snapshot at the same time.

If you’re bookmarking something, archive it too!
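A minimal sketch of "bookmark it, archive it too", assuming the public Save Page Now endpoint at `https://web.archive.org/save/` accepts a plain GET for the page URL. The function names and the in-memory bookmark list are hypothetical, and real use should expect rate limits and slow responses. The savepagenow package linked above wraps the same service.

```python
import urllib.request

# Assumption: Save Page Now takes the target URL appended to this prefix.
WAYBACK_SAVE_PREFIX = "https://web.archive.org/save/"


def wayback_save_url(page_url):
    """Build the request URL that asks Wayback to snapshot page_url."""
    return WAYBACK_SAVE_PREFIX + page_url


def bookmark_and_archive(page_url, bookmarks, request_snapshot=True, timeout=60):
    """Record a bookmark and, optionally, ask Wayback for a snapshot.

    Returns the HTTP status of the snapshot request, or None if no
    request was made or the request failed.
    """
    bookmarks.append(page_url)  # hypothetical bookmark store: a plain list
    if not request_snapshot:
        return None
    req = urllib.request.Request(
        wayback_save_url(page_url),
        headers={"User-Agent": "bookmark-archiver-sketch"},
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except OSError:
        # URLError, HTTPError and socket timeouts all subclass OSError.
        # The bookmark is kept even if the snapshot request fails.
        return None
```

Calling `bookmark_and_archive(url, my_bookmarks)` keeps the local bookmark regardless of whether the snapshot request succeeds, so archiving stays a best-effort side step.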

Yes, that's a better version of what I meant...
I have a lot of old unsorted bookmarks of "I want to look into this, but don't have time now". Newer stuff is more organized, but I exported the old stuff and haven't looked at them in about five years.

Last week I started organizing them a bit, and it's shocking how many are a 404. Even from major newspapers and such. I have no idea why anyone would take down old content (outside of some specific and rare reasons). Some are also on neither the Internet Archive nor archive.today.
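For anyone doing the same cleanup, here is a rough sketch of checking whether a dead bookmark has a Wayback snapshot. It assumes the Wayback availability endpoint at `https://archive.org/wayback/available` and its JSON shape (`archived_snapshots` containing an optional `closest` entry). The helper names are made up for illustration, and archive.today is not covered.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the availability API lives here and takes a `url` parameter.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_query(page_url):
    """Build the availability lookup URL for one bookmarked page."""
    return AVAILABILITY_ENDPOINT + "?" + urllib.parse.urlencode({"url": page_url})


def closest_snapshot(response_body):
    """Return the snapshot URL from an availability response, or None."""
    data = json.loads(response_body)
    closest = data.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def find_snapshot(page_url, timeout=30):
    """Ask the availability endpoint about page_url (performs a network call)."""
    with urllib.request.urlopen(availability_query(page_url), timeout=timeout) as resp:
        return closest_snapshot(resp.read().decode("utf-8"))
```

Looping `find_snapshot` over the URLs that return 404 would split the list into "recoverable from the archive" and "gone".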

I assume when it happens at big sites it's from a major site redesign that doesn't care to keep backward compatibility with old links.
How many programmer-hours are required to have a separate page that translates between URI schemes?

Your comment, to me, implies that the 404 links' content still exists but is not at a canonical URI anymore. I'm assuming converting stuff like /2018/08/foo.html to /newscheme/fetch?foo or whatever isn't that difficult? This whole thing is one of the reasons I haven't ever set up a blog or even a website that has dynamic content, because I can't be assed to decide on a URI scheme that will "just work" with any future engine.

Someone has to have written converters, right? I know you can import some blogs to WordPress (and vice versa, export WP to other engines...)
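To put a rough size on the "separate page that translates between URI schemes" idea, here is a sketch of the /2018/08/foo.html to /newscheme/fetch?foo conversion. Both schemes are just the hypothetical ones from the comment above, and the pattern assumes old paths always look like /YYYY/MM/slug.html.

```python
import re

# Assumption: every old-style path is /<4-digit year>/<2-digit month>/<slug>.html
OLD_PATH = re.compile(r"^/(?P<year>\d{4})/(?P<month>\d{2})/(?P<slug>[^/]+)\.html$")


def translate(old_path):
    """Map an old-scheme path to the new scheme, or return None if it does not match."""
    match = OLD_PATH.match(old_path)
    if match is None:
        return None
    # The hypothetical new scheme only needs the slug; year and month are dropped.
    return "/newscheme/fetch?" + match.group("slug")
```

A site would call something like this from its 404 handler and answer with a permanent redirect whenever `translate` returns a path, so old links keep resolving.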

https://omnivore.app basically entirely filled that void in my life. 100% recommend.
Does it archive / save web pages? I am using Omnivore too and I did not find this option.
I believe it saves a reader-mode version of whatever you feed it, by default? I also pull a copy into my Obsidian vault using the plugin/API, but it's easy to implement with the API if you don't want Obsidian too. Makes it very easy to refer to articles from notes later! (or just rip out everything except the part I cared about.)

I've saved shopping carts and logged-in pages regularly, so the markdown reader version in the apps should definitely be independent of the article/page itself being up.

How does it compare to Pocket?
Can't say I've used Pocket, but I think the newsletter-saving (generated email addresses), open source/self-hostability, and API were the differentiators that made me actually start using Omnivore. I wouldn't trust something closed-source with premium options for something like this.
Yeah I do like the idea of hosting it myself.

If there was a KOReader integration it would be amazing.

But if it's self-hosted, then that integration could simply be an SFTP / SSH server that accesses the files.

I use a locally hosted YaCy instance with cached results to work around this scenario. Much of the content I am interested in is kept locally, so it’s good enough. When I have a bunch of “read later” tabs that pile up, I copy all their URLs into the crawler form with “Store to Web Cache” checked and it accomplishes what I described. Just another option to consider.