Getting all your data out of Google Reader

	Getting all your data out of Google Reader (blog.persistent.info)
	86 points by kellegous 4736 days ago

9 comments

sp332 4736 days ago

ArchiveTeam is extracting all the data from Google Reader and uploading it to the Internet Archive. Help out by submitting your OPML file: https://news.ycombinator.com/item?id=5958119

nod 4736 days ago

Thanks mihaip!

Worked successfully in Windows CMD for me, without using the \bin shell script:

  cd C:\mihaip-readerisdead
  set PYTHON_HOME=C:\mihaip-readerisdead
  C:\path-to-py27 reader_archive\reader_archive.py --output-directory C:\mystuff

Locked up at 251K out of 253K items for me, though. Restarting... success! Looks like it might have locked up trying to start the "Fetching comments" section on my first try.

TheShiningOne 4735 days ago

Thanks very much!

I know next to nothing about the Windows command prompt and Python. I tried to apply your method, and "path-to-py27" is not recognised.

Does it mean that I should put the path to the Python 2.7 program directory? In what form exactly? (On my computer, it is at C:\Python27.)

Or does it just mean the path to the directory where the app is already?

I tried without C:\path-to-py27, just typing "reader_archive\reader_archive.py --output-directory C:\mystuff", and got the following response:

  Traceback (most recent call last):
    File "C:\mihaip-readerisdead set PYTHON_HOME=C:\mihaip-readerisdead\reader_archive\reader_archive.py", line 12, in <module>
      import base.api
  ImportError: No module named base.api

Any idea?

kcvv 4735 days ago

Copy the folder named 'base' to c:\python27\lib folder. That did the trick for me.

After that, you should be able to just run "reader_archive\reader_archive.py --output-directory C:\mystuff"

TheShiningOne 4735 days ago

Thanks!

stalled 4735 days ago

Common typo: there are no underscores in PYTHONHOME and PYTHONPATH

(and you should use PYTHONPATH in this case)

http://docs.python.org/2/using/cmdline.html#environment-vari...
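
The point about variable names can be checked with a small experiment. The sketch below is only an illustration, not part of the tool: it builds a stand-in checkout containing an empty `base/api.py`, then starts a child interpreter with either `PYTHONPATH` or the misspelled `PYTHON_HOME` pointing at it. The helper names `make_fake_checkout` and `can_import_base_api` are hypothetical, and the code is written for Python 3 rather than the Python 2.7 the tool targets, although the environment variables behave the same way.

```python
import os
import subprocess
import sys
import tempfile


def make_fake_checkout(root):
    """Create a stand-in for the checkout: <root>/base/api.py (empty files)."""
    os.makedirs(os.path.join(root, "base"))
    for name in ("__init__.py", "api.py"):
        open(os.path.join(root, "base", name), "w").close()


def can_import_base_api(env_var_name, package_root):
    """Return True if a child interpreter can import base.api when
    env_var_name is set to package_root."""
    env = dict(os.environ)
    # Drop any inherited search path so only the variable under test matters.
    env.pop("PYTHONPATH", None)
    env[env_var_name] = package_root
    with tempfile.TemporaryDirectory() as empty_cwd:
        # Run from an unrelated directory so the import cannot succeed
        # by accident through the current working directory.
        result = subprocess.run(
            [sys.executable, "-c", "import base.api"],
            env=env,
            cwd=empty_cwd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    return result.returncode == 0


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as checkout:
        make_fake_checkout(checkout)
        # PYTHONPATH is read by the interpreter; PYTHON_HOME is not.
        print("PYTHONPATH:", can_import_base_api("PYTHONPATH", checkout))
        print("PYTHON_HOME:", can_import_base_api("PYTHON_HOME", checkout))
```

If the interpreter ignores the underscore spelling, as the comment above says, only the `PYTHONPATH` run should be able to import the module.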

kcvv 4736 days ago

I'm trying this on Windows and seem to be missing the base.api module. I can't seem to find this module either. Does anyone have a clue where I can get it?

ccera 4736 days ago

In addition to the above methods, you can copy the base folder and paste it into the same folder as reader_archive.py -- that's what I did and it worked fine.
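
This works because Python puts the directory of the script being run at the front of its module search path, so a `base` folder sitting next to reader_archive.py is importable without any environment variables. A minimal sketch of the copy step, assuming the layout mentioned in this thread (`base` and `reader_archive` side by side in the checkout); the function name is hypothetical:

```python
import os
import shutil


def copy_base_next_to_script(checkout_root):
    """Copy <checkout_root>/base into <checkout_root>/reader_archive/base.

    Assumes the checkout has 'base' and 'reader_archive' as sibling folders.
    """
    source = os.path.join(checkout_root, "base")
    target = os.path.join(checkout_root, "reader_archive", "base")
    # copytree refuses to overwrite an existing directory, so skip if present.
    if not os.path.isdir(target):
        shutil.copytree(source, target)
    return target
```

The copy is a duplicate rather than a link, so it would need to be refreshed if the checkout is updated.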

ijk 4736 days ago

It's in the \base folder. Set the main folder to be the Python root, as in the grandfather post, and it should be able to find it.

kcvv 4736 days ago

Thanks! For whatever reason, setting the Python root did not help, but I just copied the 'base' folder to the Python lib folder, and that seems to have done the trick.

daniel_reetz 4736 days ago

I also had success (on W7) using this method. Thank you, mihaip, I am truly grateful.

case 4736 days ago

Similar thing happened to me — it locked up when it was almost to the end, so I killed and restarted the process, and it finished successfully.

Thanks, Mihai!

TheShiningOne 4735 days ago

Never mind, it seems to be working fine now! I just have to wait till it finishes downloading.

ccera 4736 days ago

Warning to other impatient users:

I didn't read the instructions too well, so the half hour I spent carefully deleting gigantic/uninteresting feeds out of my subscriptions.xml file was all for naught. Because I didn't know I needed to specify the opml_file on the command line, the script just logged into my Reader account (i.e., it walked me through the browser-based authorization process) and downloaded my subscriptions from there -- including all the gigantic/uninteresting subscriptions that I did NOT care to download.

So now I've gone and downloaded 2,592,159 items, consuming 13 GB of space.

I'm NOT complaining -- I actually think it's AWESOME that this is possible -- but if you don't want to download millions of items, be sure to read the instructions and use the opml_file directive.
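
For anyone who would rather not trim subscriptions.xml by hand, the pruning step can be sketched in a few lines. This assumes the file is standard OPML, where each subscription is an `<outline>` element carrying its feed address in an `xmlUrl` attribute; the function name `prune_opml` and the example URLs are made up for illustration.

```python
import xml.etree.ElementTree as ET


def prune_opml(opml_text, unwanted_urls):
    """Return OPML text with every <outline> whose xmlUrl is in
    unwanted_urls removed. Folder outlines (no xmlUrl) are kept."""
    root = ET.fromstring(opml_text)
    # ElementTree has no parent pointers, so walk each element and
    # remove matching children from it directly.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == "outline" and child.get("xmlUrl") in unwanted_urls:
                parent.remove(child)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    sample = (
        '<opml version="1.0"><body>'
        '<outline title="News">'
        '<outline type="rss" xmlUrl="http://example.com/huge.xml"/>'
        '<outline type="rss" xmlUrl="http://example.com/keep.xml"/>'
        '</outline>'
        '</body></opml>'
    )
    print(prune_opml(sample, {"http://example.com/huge.xml"}))
```

The pruned file only has an effect if it is then passed to the script through the opml_file option described in the instructions; otherwise the subscriptions are fetched from the account, as happened above.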

Udo 4736 days ago

This is excellent, thank you for making this! I'm using it right now to make an offline archive of my Reader stuff.

My only gripe would be the tool's inability to continue after a partial run, but since I won't be using this more than once that's probably OK.

All web services should have a handy CLI extraction tool, preferably one that can be run from a cron job. On that note, I'm very happy with gmvault, as well.

Edit: getting a lot of XML parse errors, by the way.

mihaip 4736 days ago

The tool caches the API responses (in the _raw_data directory), so if you're re-running it, most of the initial requests will be served from the cache.
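
As a rough illustration of why a re-run is cheap, here is a generic on-disk response cache. It is not the tool's actual code: how `_raw_data` is laid out is not described in this thread, so the file naming (a SHA-1 of the URL) and the names `cached_fetch` and `fetch` are assumptions made for the sketch.

```python
import hashlib
import os


def cached_fetch(url, cache_dir, fetch):
    """Return the response body for url, reading it from cache_dir when a
    previous run already stored it. `fetch` is any callable url -> bytes."""
    os.makedirs(cache_dir, exist_ok=True)
    # Assumption: one file per URL, named by a hash of the URL.
    key = hashlib.sha1(url.encode("utf-8")).hexdigest()
    path = os.path.join(cache_dir, key)
    if os.path.exists(path):
        with open(path, "rb") as cached:
            return cached.read()
    body = fetch(url)
    with open(path, "wb") as out:
        out.write(body)
    return body
```

With this shape, a second run only goes to the network for requests that never completed the first time.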

If the XML parse errors are listing any item IDs, feel free to email them to me (mihai at persistent dot info) and I'll see if there's any workaround from my side.

Edit: If it's "XML parse error when fetching items, retrying with high-fidelity turned off" messages that you're seeing, then those are harmless (assuming no follow-up exceptions). The retry must have succeeded.

ivank 4736 days ago

Have you tried the JSON API? (See the requests that Google Reader itself makes.) It requires no cookies and supports getting up to 1000 items per continuation.

mihaip 4736 days ago

I wrote most of Reader's JSON API in 2006-2007 :)

The tool uses the "high-fidelity" Atom output mode for getting at item bodies. That preserves namespaced XML elements and other extra data from the feed. It uses JSON for everything else, and will fall back to regular Atom output if the high-fidelity output is not well-formed (the mode was added in late 2010, as things were winding down, and thus never got a lot of testing).
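
The fallback described here follows a common pattern: try the richer format first and retry with the plainer one if parsing fails. A minimal sketch of that pattern, not the tool's actual implementation; `fetch_high_fidelity` and `fetch_regular` are hypothetical callables standing in for the two API requests.

```python
import xml.etree.ElementTree as ET


def fetch_items(fetch_high_fidelity, fetch_regular):
    """Parse the high-fidelity response, falling back to the regular one
    if it is not well-formed XML.

    Both arguments are callables returning an XML string. Returns the
    parsed root element and a label saying which mode was used.
    """
    try:
        return ET.fromstring(fetch_high_fidelity()), "high-fidelity"
    except ET.ParseError:
        # The richer output was malformed; retry with the plain output.
        return ET.fromstring(fetch_regular()), "regular"
```

The regular request is only issued when the first parse fails, which matches the "retrying with high-fidelity turned off" message quoted earlier in the thread.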

DecoPerson 4736 days ago

Thank you for this! Now I can procrastinate on my own reader app for much longer :)

Should we be concerned with errors like this?

    [W 130629 03:11:54 api:254] Requested item id tag:google.com,2005:reader/item/afe90dad8acde78b (-5771066408489326709), but it was not found in the result

I'm getting ~1-2 per "Fetch N/M item bodies" line.
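
A side note on reading that warning: the two identifiers in it are the same number written two ways. The hex suffix `afe90dad8acde78b`, interpreted as a signed 64-bit integer, is -5771066408489326709. The sketch below converts between the two forms; that every item ID follows this scheme is an assumption drawn from this one log line, and the function names are hypothetical.

```python
ITEM_ID_PREFIX = "tag:google.com,2005:reader/item/"


def item_id_to_signed_decimal(atom_id):
    """Convert the hex suffix of an item ID to a signed 64-bit integer."""
    if not atom_id.startswith(ITEM_ID_PREFIX):
        raise ValueError("unexpected item ID format: %r" % atom_id)
    value = int(atom_id[len(ITEM_ID_PREFIX):], 16)
    # Values with the top bit set represent negative numbers (two's complement).
    if value >= 1 << 63:
        value -= 1 << 64
    return value


def signed_decimal_to_item_id(value):
    """Convert a signed 64-bit integer back to the long hex form."""
    return ITEM_ID_PREFIX + format(value & 0xFFFFFFFFFFFFFFFF, "016x")


if __name__ == "__main__":
    long_id = "tag:google.com,2005:reader/item/afe90dad8acde78b"
    print(item_id_to_signed_decimal(long_id))  # -5771066408489326709
```

This can help when matching IDs from the log against files or records that use the other notation.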

mihaip 4736 days ago

Usually nothing to worry about, see https://github.com/mihaip/readerisdead/commit/19d3159c985b6e...

pixsmith 4732 days ago

This is an impressive bit of work. I have had an interesting thing happen, though: it's apparently trying to pull in every single item from explore and from suggested items, to the extent that I get a message saying I have 13 million items, and it's still going strong. It pulled down about 5 or 6 GB of data.

Is there some way to avoid all the years of explore and suggested items with reader_archive? I tried limiting the maximum number of items to 10,000, but it was still running and growing after 12 hours. It's interesting, though, what it was able to accomplish in that time.

skilesare 4736 days ago

If this does what I think it does (and it seems to be doing it now on my machine), then this is truly, truly awesome.

Thank you, mihaip. If you are ever in Houston, I will buy you a beer and/or a steak dinner.

dmtelf 4734 days ago

I'm getting "ImportError: No module named site"

echo %pythonpath% gives c:\readerisdead

I copied 'base' from the readerisdead zipfile to c:\python27\lib & also copied the base folder into the same folder as reader_archive.py

C:\readerisdead\reader_archive\reader_archive.py --output-directory C:\googlereader gives "ImportError: No module named site"

What am I doing wrong? How can I get this to work?
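
One possible cause, not confirmed anywhere in this thread: "No module named site" usually means the interpreter cannot find its own standard library, and a common reason is a PYTHONHOME variable pointing at a folder that is not a Python installation (for example, the readerisdead folder). The sketch below builds an environment with PYTHONHOME removed and PYTHONPATH pointing at the checkout. The function name is hypothetical, and the interpreter path in the usage comment is an assumption based on the C:\python27 folder mentioned above.

```python
import os


def env_for_reader_archive(checkout_root, base_env=None):
    """Return a copy of the environment that is safe for running the script:
    PYTHONHOME removed, checkout_root placed first on PYTHONPATH."""
    env = dict(os.environ if base_env is None else base_env)
    # A stray PYTHONHOME makes Python look for its standard library there.
    env.pop("PYTHONHOME", None)
    existing = env.get("PYTHONPATH")
    if existing:
        env["PYTHONPATH"] = checkout_root + os.pathsep + existing
    else:
        env["PYTHONPATH"] = checkout_root
    return env


# Possible usage (paths are assumptions, adjust to your machine):
#
#   import subprocess
#   subprocess.call(
#       [r"C:\Python27\python.exe", r"reader_archive\reader_archive.py",
#        "--output-directory", r"C:\googlereader"],
#       cwd=r"C:\readerisdead",
#       env=env_for_reader_archive(r"C:\readerisdead"),
#   )
```

Running `echo %pythonhome%` in the same command prompt would show whether that variable is set at all.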

drivebyacct2 4736 days ago

I guess archived RSS data for me isn't terribly important since most people seem to hide the rest of their content behind a "More" link to get those precious ad views.

ivank 4736 days ago

Really? Pretty much all the feeds I've seen are full-text feeds. For the few that aren't, http://fulltextrssfeed.com/ and http://fullrss.net/ are around.

drivebyacct2 4736 days ago

Wow! Thank you very much!
