| HN Mirror

	by al_borland 815 days ago
	I use read-it-later type services a lot, and save more than I read. On many occasions I've gone back to finally read things and found that the pages no longer exist. I'm thinking that moving to some kind of offline archival version would be a better option.

5 comments

nelsonfigueroa 815 days ago

I've used ArchiveBox in the past and it's been great for this purpose: https://github.com/ArchiveBox/ArchiveBox

hu3 815 days ago

Hi! This seems amazing and sustainable since it leverages industry standard tools such as yt-dl and chrome headless.

Now I'm curious, what made you stop using it?

nelsonfigueroa 815 days ago

I just found myself archiving fewer things over time and it’s been a while since I’ve saved anything. There’s nothing wrong with it though. In fact, I still have it on my machine.

saganus 815 days ago

This is great, thanks!

mateo1 815 days ago

I used to have almost 10k bookmarks that I kept from circa 2010 to 2017, only to realize the majority of them were now useless. Some kind of tool like this is way overdue to become widespread.

account42 813 days ago

Having a browser (Firefox?) actually innovate in this area instead of just reducing functionality to a carbon copy of Chrome is what is really overdue.

samstave 815 days ago

(They missed a chance to link to a download of the HTML file of the GitHub page, haha)

Archive.org and wayback machine should ask for people to submit snaps of pages using this tool directly into the archive - especially during world events.

This would allow digital archaeologists to grok the sentiment of the world during that era...

(aside: when I interviewed at twitter they asked me what I thought twitter was, and I said I thought it was a global sentiment engine...)

But kudos to the world for having us here at the birth of AI on the global internet: a wayback machine, coupled with AIs and LLMs and this tool, will allow one to ask questions about history in ways that will be very interesting.

"What was the general media coverage of [topic] in [decade] with respect to how we currently look at it, and are there articles covering [SUBJECT] in this topic for that time period?"

etc...

toomuchtodo 815 days ago

https://github.com/palewire/savepagenow

https://github.com/jjjake/internetarchive

The Internet Archive cannot trust arbitrary content previously archived, so it is better to have whatever archival tool or operation you’re running also ask Wayback to take a snapshot at the same time.

If you’re bookmarking something, archive it too!
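
A minimal sketch of that "bookmark it, archive it too" habit, assuming the Wayback Machine's public Save Page Now endpoint at `https://web.archive.org/save/<url>`. The function names, the `bookmarks` list, and the `opener` parameter are hypothetical stand-ins, not part of either project linked above; the savepagenow package wraps the same idea with more care around errors and rate limits.

```python
import urllib.request

SAVE_ENDPOINT = "https://web.archive.org/save/"  # public Save Page Now endpoint


def save_request_url(url: str) -> str:
    """Build the Save Page Now URL that asks Wayback to snapshot `url`."""
    return SAVE_ENDPOINT + url


def bookmark_and_archive(url: str, bookmarks: list, opener=urllib.request.urlopen) -> int:
    """Record the bookmark locally, then ask Wayback for a snapshot.

    `bookmarks` and `opener` are hypothetical stand-ins for your own
    bookmark store and HTTP client. Returns the HTTP status of the request.
    """
    bookmarks.append(url)
    request = urllib.request.Request(
        save_request_url(url),
        headers={"User-Agent": "bookmark-archiver-sketch"},
    )
    with opener(request, timeout=60) as response:
        return response.status
```

Calling `bookmark_and_archive("https://example.com/post", my_bookmarks)` would add the URL to the list and fire one snapshot request. Passing the HTTP client in as `opener` keeps the sketch easy to try without touching the network.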

samstave 815 days ago

Yes, that's a better version of what I meant...

arp242 815 days ago

I have a lot of old unsorted bookmarks of "I want to look in to this, but don't have time now". Newer stuff is more organized, but I exported the old stuff and haven't looked at them in about five years.

Last week I started organizing them a bit, and it's shocking how much is a 404. Even from major newspapers and such. I have no idea why anyone would take down old content (outside of some specific and rare reasons). Some are also on neither the Internet Archive nor archive.today.
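
For anyone sorting through a similar pile, here is a rough sketch of checking whether a dead bookmark has a Wayback copy, assuming the Internet Archive's availability API at `https://archive.org/wayback/available`. The helper names are made up for illustration, and archive.today is not covered because this sketch does not assume any API for it.

```python
import json
import urllib.parse
import urllib.request

AVAILABILITY_API = "https://archive.org/wayback/available"


def availability_url(url: str) -> str:
    """Build the availability query for one bookmarked URL."""
    return AVAILABILITY_API + "?" + urllib.parse.urlencode({"url": url})


def closest_snapshot(payload: dict):
    """Return the closest snapshot URL from an availability response, or None."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def find_snapshot(url: str, opener=urllib.request.urlopen):
    """Ask the API about `url`; `opener` is a hypothetical injection point."""
    with opener(availability_url(url), timeout=30) as response:
        return closest_snapshot(json.load(response))
```

Looping `find_snapshot` over an exported bookmark list would separate the links that can still be recovered from the ones that are gone for good.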

al_borland 815 days ago

I assume when it happens at big sites it’s from a major site redesign that doesn’t bother to keep backward compatibility with old links.

genewitch 815 days ago

How many programmer-hours are required to have a separate page that translates between URI schemes?

Your comment, to me, implies that the 404 links' content still exists but is not at a canonical URI anymore. I'm assuming converting stuff like /2018/08/foo.html to /newscheme/fetch?foo or whatever isn't that difficult? This whole thing is one of the reasons i haven't ever set up a blog or even a website that has dynamic content, because i can't be assed to decide on a URI scheme that will "just work" with any future engine.

Someone has to have written converters, right? I know you can import some blogs to wordpress (and vice versa, export WP to other engines...)
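
To put a rough size on it, here is a sketch of such a translation layer, assuming the old scheme is exactly `/YYYY/MM/slug.html` and the new one is the made-up `/newscheme/fetch?slug` from above. The pattern and function names are hypothetical, and it only works when the slug survives the migration; otherwise you'd need a lookup table exported from the old system.

```python
import re

# Hypothetical old scheme: /YYYY/MM/slug.html
OLD_POST = re.compile(r"^/(?P<year>\d{4})/(?P<month>\d{2})/(?P<slug>[\w-]+)\.html$")


def translate(old_path: str):
    """Map an old-style path to the new scheme, or None if it does not match."""
    match = OLD_POST.match(old_path)
    if match is None:
        return None
    return "/newscheme/fetch?" + match.group("slug")


def redirect(old_path: str):
    """Return (status, location) the way a tiny redirect handler might."""
    new_path = translate(old_path)
    if new_path is None:
        return 404, None  # unknown path: nothing to translate
    return 301, new_path  # permanent redirect so old links keep working
```

With this, `redirect("/2018/08/foo.html")` gives a 301 to `/newscheme/fetch?foo`, and anything that doesn't look like an old post falls through to a 404.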

Martinussen 815 days ago

https://omnivore.app basically entirely filled that void in my life. 100% recommend.

avinassh 815 days ago

does it archive / save web pages? I am using Omnivore too and I did not find this option.

Martinussen 815 days ago

I believe it saves a reader-mode version of whatever you feed it, by default? I also pull a copy into my Obsidian vault using the plugin/api, but it's easy to implement with the api if you don't want Obsidian too. Makes it very easy to refer to articles from notes later! (or just rip out everything except the part I cared about.)

I've saved shopping carts and logged-in pages regularly, so the markdown reader version in the apps should definitely be independent of the article/page itself being up.

jcul 815 days ago

How does it compare to pocket?

Martinussen 815 days ago

Can't say I've used Pocket, but I think the newsletter-saving (generated email addresses), open source/self-hostability, and API were the differentiators that made me actually start using Omnivore - I wouldn't trust something closed source with premium options for a job like this.

jcul 814 days ago

Yeah I do like the idea of hosting it myself.

If there was a KOReader integration it would be amazing.

But if it's self-hosted, then that integration could simply be an SFTP / SSH server that accesses the files.

amcpu 815 days ago

I use a locally hosted YaCy instance with cached results to work around this scenario. Much of the content I am interested in is kept locally, so it’s good enough. When I have a bunch of “read later” tabs that pile up, I copy all their URLs into the crawler form with “Store to Web Cache” checked and it accomplishes what I described. Just another option to consider.
