by citelao 1098 days ago
I would be extremely interested in a Google Takeout viewer if you ever end up releasing one.

I dealt with Google Takeout, trying to export my photos to Apple Photos (when Google was planning to charge money for old Google Workspace accounts), and I found it extremely difficult to deal with the file format. The script I wrote (https://github.com/citelao/google_photos_takeout_to_apple_ph...) ended up being decently reliable, but there were a ton of weird mismatches between the EXIF data in Google Photos metadata and the EXIF data in the photos themselves. Although some of that wonkiness was Apple Photos, not Google.

I'd love to see software that could wrangle the mess :)


It's called Timelinize (might rename it?), and you can follow it here: https://twitter.com/timelinize (Click "Media" on the Twitter account to view a few screenshots for a preview. More to come!) (There's no website or project page yet because I've been busy developing.)

If you want an invite to try out an early dev preview today, follow @timelinize on Twitter and tweet at it, and I'll see about getting you into the Discord.

Some background:

Saving a local copy of my Google Photos has been a passion project of mine since ~2014 (before Google Photos even!). For years it was only focused on downloading the data using APIs -- but then we found out that Google strips location data (from your own photos!) if using the API, so I added Takeout support.

The problem is there was no viewer. Well, in 2019 I finally started working on one. It has evolved a lot, as it's a very ambitious project and there's nothing quite like it.

It's not just Google Photos: it's any photos and videos. It's also for your text messages and emails. And your location history. And contact list. And chat apps. And really, any files you have. It also supports Facebook, Twitter, and Instagram account exports. Oh, and iPhone backups.

Timelinize is entity-aware, and it can map identities across data sources (with enough info, or with a manual mapping, or some optional heuristics). It's not just a photo gallery.

It's basically a really detailed view of your life and online history. It's neat because I have my family pictures, my text messages between me and my wife when we were dating (and after of course), and there are different views to explore: map, timeline, conversations, gallery, and more to come (calendar, etc).

We can even place non-geolocated data on a map since we can correlate timestamp and entity. So when we went on our honeymoon, I can see text messages received from friends while we were driving to a beach.
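To illustrate the idea of placing non-geolocated items by timestamp, here is a minimal Python sketch. It is not Timelinize's actual code; the function name `nearest_location`, the `(time, lat, lon)` track layout, and the 30-minute cutoff are all assumptions made for the example. It simply looks up the location sample closest in time to a message.

```python
from bisect import bisect_left
from datetime import datetime, timedelta


def nearest_location(track, when, max_gap=timedelta(minutes=30)):
    """Return (lat, lon) of the track point closest in time to `when`.

    `track` is a list of (datetime, lat, lon) tuples sorted by time
    (a hypothetical layout for one person's location history).
    Returns None if the closest point is further away than `max_gap`.
    """
    if not track:
        return None
    times = [t for t, _, _ in track]
    i = bisect_left(times, when)
    # Only the neighbours on either side of the insertion point can be closest.
    candidates = track[max(i - 1, 0):i + 1]
    best = min(candidates, key=lambda p: abs(p[0] - when))
    if abs(best[0] - when) > max_gap:
        return None
    return best[1], best[2]


# Example: a text message received at 12:10 borrows the 12:00 location.
track = [
    (datetime(2020, 1, 1, 12, 0), 1.0, 2.0),
    (datetime(2020, 1, 1, 13, 0), 3.0, 4.0),
]
print(nearest_location(track, datetime(2020, 1, 1, 12, 10)))
```

A real implementation would also have to decide whose location track applies to a given item, which is where the entity mapping comes in.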

It's really quite immersive and magical and I haven't seen anything quite like it.

And everything is stored on your own computer. It's a GUI app, and you have to have enough space to store your stuff. The data is just organized as files within a folder on disk, with a SQLite DB holding the index and the small textual items.

What is the correct tool to properly merge a large set of tar.gz files which may have an enormous overlap of similar files, and some that have been altered just slightly?

Git plus some parsing seems close in that space, as analyzing the files to create a dendrogram-like tree of potential alterations to files over time by Levenshtein distance may be useful to approximate commit history. However, this doesn't seem to exist or be popular as a tool. There's vimdiff or meld, but they are extremely manual and tedious, to the extent of being pointless to try for something like a large history of Takeout tar.gz files.

Throwing in the towel completely, borgfs can be helpful to reduce the amount of space they take by de-duplication on the block level, but this is a terrible solution as it doesn't really track file changes in a reasonable way, etc. It is useful to extract the files into a directory without the tar or gz, but this can also cause issues with how to appropriately organize the directory structure over the history.

Any thoughts or projects that do a better job of this?

> What is the correct tool to properly merge a large set of tar.gz files which may have an enormous overlap of similar files, and some that have been altered just slightly?

Can you elaborate on this? My understanding is that they should all extract into the same target folder without issues because each archive's set of files is distinct. But maybe I'm just assuming wrongly?

What exactly is your goal, too? It sounds like you are trying to find and de-duplicate visually similar images? Like what do you mean by "enormous overlap" or "altered just slightly"?

The problem isn't one takeout overlapping (multiple zips from one date); it's many takeouts over the years (full history).

So for example in 2001, you make a takeout with 30 zips, and then delete half of your photos off of Google. Then in 2007, you have another 20 zips, and delete 25% of your emails and photos to make more room, 2008 again, on up to now.

So now you have a big folder with many zips, and maybe some extracted folders, because things happen over the years, etc.

What's the best tool to merge all of this into one directory?

Git can help for the notes from Google Keep that may have had things appended or removed. Photos can overlap a bit, so really a set union is all that's required for many files, but some will be slightly different, like the Google Keep notes.

My best thought is to make some git repo and add things in, but to do a Levenshtein distance on the bits of each file to check if there is overlap in content and to estimate the 'lineage' of a file if there is significant overlap with another. Effectively you reconstruct the git commit tree with the set of all files over all histories. Then you build the git repo history from all of the files.
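A rough Python sketch of that lineage guess, for text files only. Note the swap: it uses the similarity ratio from the standard library's `difflib.SequenceMatcher` as a stand-in for true Levenshtein distance, and the name `guess_lineage` and the 0.6 threshold are arbitrary choices for illustration, not an existing tool.

```python
from difflib import SequenceMatcher


def guess_lineage(versions, threshold=0.6):
    """Guess a parent for each file from content similarity.

    `versions` is a list of (name, text) pairs in rough chronological
    order (e.g. ordered by the takeout they came from). Each file is
    linked to the most similar earlier file whose similarity ratio is
    at least `threshold`; files with no such match get parent None.
    """
    parents = {}
    for i, (name, text) in enumerate(versions):
        best_name, best_score = None, threshold
        for prev_name, prev_text in versions[:i]:
            score = SequenceMatcher(None, prev_text, text).ratio()
            if score >= best_score:  # on ties, prefer the later candidate
                best_name, best_score = prev_name, score
        parents[name] = best_name
    return parents


versions = [
    ("note.2019", "buy milk\nbuy eggs\n"),
    ("other.2019", "0123456789"),
    ("note.2021", "buy milk\nbuy eggs\nbuy bread\n"),
]
print(guess_lineage(versions))
```

The resulting parent links could then be replayed as commits, oldest first. The pairwise comparison is quadratic in the number of files, so for terabytes of data you would want to hash files first, treat exact matches as a plain set union, and only run the similarity pass on the small text files that are left.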

This would likely just be a local git repo since it would likely be several terabytes of info, but that would be the general idea I guess.

I just haven't found a good tool to actually do this easily, unfortunately, but it seems like it would be a very basic or commonly used scenario (especially for those 'should-be-a-git-repo' directories that everyone made before knowing about git. You know the ones: 'myfile.v1.doc', 'myfile.v2.doc', 'myfile.final.doc', 'myfile.reallyfinal.doc', 'myfile.finalfinal.doc')

> So now you have a big folder with many zips, and maybe some extracted folders, because things happen over the years, etc.

Oh, right.

Timelinize can do that. Takeout all your data, then import it into Timelinize. Then delete your Takeout (after Timelinize is finished and stable, of course, heh). Then next time you Takeout, just import it all into Timelinize again. (It de-dupes!) Then delete the Takeout, etc. (Maybe Timelinize can do the cleanup for you someday.)

The de-duping depends on the item being recognizable. Best if the data source provides an ID. Otherwise, things like certain metadata and content can be used to determine duplicates.
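As a rough illustration of that kind of de-duping (not Timelinize's actual logic), here is a Python sketch that prefers a source-provided ID and falls back to a content hash. The item layout (`source`, `id`, `content` keys) and the function names are made up for the example.

```python
import hashlib


def dedupe_key(item):
    """Build a key that identifies an item across repeated imports.

    `item` is a dict with optional 'source' and 'id' fields and a
    'content' field holding bytes (a hypothetical layout).
    """
    if item.get("id"):
        # Best case: the data source gives us a stable ID.
        return ("id", item.get("source"), item["id"])
    # Fallback: identical bytes are treated as the same item.
    digest = hashlib.sha256(item["content"]).hexdigest()
    return ("sha256", digest)


def import_items(index, items):
    """Add unseen items to `index` (a dict); return how many were added."""
    added = 0
    for item in items:
        key = dedupe_key(item)
        if key not in index:
            index[key] = item
            added += 1
    return added


index = {}
first = [{"source": "photos", "id": "abc", "content": b"x"}, {"content": b"hello"}]
second = first + [{"content": b"new"}]
print(import_items(index, first), import_items(index, second))
```

A content hash only catches byte-identical copies. Items that were re-encoded or had their metadata rewritten would need the softer metadata comparison mentioned above.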

I'm not chaxor, but as far as I remember I think you're right:

If I had unzipped all the takeout directories into one giant folder, there'd be no conflicts.

Since I didn't do that, I had to do weird multi-pass parsing, since an album could be split across multiple ZIPs. I get a bit neurotic around backups like this, so I'd have loved some sort of virtualized filesystem that non-destructively represented all of those zips "merged together." But in retrospect, I should have just merged the directories into one folder---would have made parsing easier :)
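For what it's worth, a read-only "merged" view of the ZIPs can be approximated without extracting anything. This is only a sketch using Python's standard `zipfile` module, not what my script does; `build_merged_index` and `read_member` are hypothetical helper names.

```python
import zipfile


def build_merged_index(zip_paths):
    """Map each member path to the list of archives that contain it.

    Nothing is extracted or modified; only the ZIP directories are read.
    """
    merged = {}
    for zip_path in zip_paths:
        with zipfile.ZipFile(zip_path) as zf:
            for info in zf.infolist():
                if info.is_dir():
                    continue
                merged.setdefault(info.filename, []).append(zip_path)
    return merged


def read_member(merged, member):
    """Read a file's bytes from the first archive that contains it."""
    with zipfile.ZipFile(merged[member][0]) as zf:
        return zf.read(member)
```

With an index like this, an album split across several ZIPs shows up as one set of paths under the same folder prefix, so a single parsing pass is enough. Any path listed under more than one archive is a potential conflict worth checking.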

I don't recall substantial problems with duplicates. Just weird renames and EXIF data mismatches. And since I was trying to archive my data, I definitely didn't want similar photos to be deduplicated.

My problems are probably different than chaxor's, though.