HN Mirror


by Udo 4736 days ago

This is excellent, thank you for making this! I'm using it right now to make an offline archive of my Reader stuff.

My only gripe would be the tool's inability to continue after a partial run, but since I won't be using this more than once that's probably OK.

All web services should have a handy CLI extraction tool, preferably one that can be run from a cron job. On that note, I'm very happy with gm_vault, as well.

Edit: getting a lot of XML parse errors, by the way.


mihaip 4736 days ago

The tool caches the API responses (in the _raw_data directory), so if you're re-running it, most of the initial requests will be served from the cache.
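For readers curious how this kind of response cache might work, here is a minimal sketch. It is not the tool's actual code: the `cached_fetch` name, the `fetch` callback, and the SHA-1 file naming are all assumptions made for illustration. Only the `_raw_data` directory name comes from the comment above.

```python
import hashlib
import os


def cached_fetch(url, fetch, cache_dir="_raw_data"):
    """Return the response body for url, reusing a copy on disk if one exists.

    fetch is a hypothetical callable taking a URL and returning bytes.
    """
    os.makedirs(cache_dir, exist_ok=True)
    # Assumption: cache entries are named after a hash of the request URL.
    key = hashlib.sha1(url.encode("utf-8")).hexdigest()
    path = os.path.join(cache_dir, key)

    if os.path.exists(path):
        with open(path, "rb") as f:
            return f.read()

    body = fetch(url)
    # Write to a temporary file and rename it, so an interrupted run
    # does not leave a truncated entry that later looks like a cache hit.
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as f:
        f.write(body)
    os.replace(tmp_path, path)
    return body
```

With something like this in place, a re-run after a partial run only hits the network for requests that never completed.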

If the XML parse errors are listing any item IDs, feel free to email them to me (mihai at persistent dot info) and I'll see if there's any workaround from my side.

Edit: If it's "XML parse error when fetching items, retrying with high-fidelity turned off" messages that you're seeing, then those are harmless (assuming no follow-up exceptions). The retry must have succeeded.

ivank 4736 days ago

Have you tried the JSON API? (See the requests that Google Reader itself makes.) It requires no cookies and supports getting up to 1000 items per continuation.
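A continuation-based fetch loop of the kind described here could look roughly like the sketch below. The `fetch_page` callback is hypothetical, and the `items` and `continuation` field names are assumptions about the response shape rather than a documented contract.

```python
def iter_items(fetch_page, page_size=1000):
    """Yield items page by page until no continuation token is returned.

    fetch_page is a hypothetical callable: (continuation, page_size) -> dict,
    where the dict is assumed to hold an "items" list and, when more
    results remain, a "continuation" token.
    """
    continuation = None
    while True:
        page = fetch_page(continuation, page_size)
        yield from page.get("items", [])
        continuation = page.get("continuation")
        if not continuation:
            break
```

The caller never sees the token; it just iterates, and the loop stops on the first page that comes back without one.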

mihaip 4736 days ago

I wrote most of Reader's JSON API in 2006-2007 :)

The tool uses the "high-fidelity" Atom output mode for getting at item bodies. That preserves namespaced XML elements and other extra data from the feed. It uses JSON for everything else, and will fall back to regular Atom output if the high-fidelity output is not well-formed (the mode was added in late 2010, as things were winding down, and thus never got a lot of testing).
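The fallback described above can be sketched as follows. This is an illustration only, not the tool's implementation: `fetch_items` and its `fetch` callback are hypothetical names, and well-formedness is checked here simply by whether Python's standard XML parser accepts the response.

```python
import logging
import xml.etree.ElementTree as ET


def fetch_items(fetch, item_ids):
    """Parse item bodies, preferring high-fidelity Atom output.

    fetch is a hypothetical callable: (item_ids, high_fidelity) -> bytes.
    Returns (root_element, used_high_fidelity).
    """
    try:
        return ET.fromstring(fetch(item_ids, True)), True
    except ET.ParseError:
        # The high-fidelity response was not well-formed XML, so retry
        # the same items with regular Atom output.
        logging.warning(
            "XML parse error when fetching items, "
            "retrying with high-fidelity turned off"
        )
        return ET.fromstring(fetch(item_ids, False)), False
```

If the second parse also fails, the exception propagates, which matches the note earlier in the thread that the warning is harmless only when no follow-up exception appears.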
