| HN Mirror


by chaxor 1098 days ago

What is the correct tool to properly merge a large set of tar.gz files which may have an enormous overlap of similar files, some of which have been altered just slightly?

Git plus some parsing seems close in that space: analyzing the files to create a dendrogram-like tree of potential alterations to files over time by Levenshtein distance may be useful to approximate commit history. However, this doesn't seem to exist or be popular as a tool. There's vimdiff or meld, but they are extremely manual and tedious, to the extent of being pointless to try for something like a large history of Takeout tar.gz's.

Throwing in the towel completely, borgfs can help reduce the amount of space they take through block-level de-duplication, but this is a terrible solution as it doesn't really track file changes in a reasonable way, etc. It is useful to extract the files into a directory without the tar or gz, but this also raises the question of how to appropriately organize the directory structure over the history.

Any thoughts or projects that do a better job of this?

1 comments

mholt 1098 days ago

> What is the correct tool to properly merge a large set of tar.gz files which may have an enormous overlap of similar files, some of which have been altered just slightly?

Can you elaborate on this? My understanding is that they should all extract into the same target folder without issues because each archive's set of files is distinct. But maybe I'm just assuming wrongly?

What exactly is your goal, too? It sounds like you are trying to find and de-duplicate visually similar images? Like what do you mean by "enormous overlap" or "altered just slightly"?


chaxor 1092 days ago

The problem isn't one Takeout overlapping (multiple zips from one date); it's many Takeouts over the years (full history).

So for example in 2001, you make a takeout with 30 zips, and then delete half of your photos off of Google. Then in 2007, you have another 20 zips, and delete 25% of your emails and photos to make more room, 2008 again, on up to now.

So now you have a big folder with many zips, and maybe some extracted folders, because things happen over the years, etc.

What's the best tool to merge all of this into one directory?

Git can help for the notes from Google Keep that may have had things appended or removed. Photos can overlap a bit, so really a set union is all that's required for many files, but some will be slightly different, like the Google Keep notes.

My best thought is to make some git repo and add things in, but to compute a Levenshtein distance on the bits of each file to check if there is overlap in content and to estimate the 'lineage' of a file if there is significant overlap with another. Effectively you reconstruct the git commit tree with the set of all files over all histories. Then you build the git repo history from all of the files.

This would likely just be a local git repo since it would likely be several terabytes of info, but that would be the general idea I guess.
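A rough sketch of that idea, assuming each Takeout has already been read into a `path -> bytes` mapping, oldest first. The function names and the 0.6 threshold are made up for illustration, and the similarity score comes from Python's `difflib.SequenceMatcher` rather than a true Levenshtein distance, since the standard library has no Levenshtein function. Exact duplicates collapse by hash (the set union), and anything left over gets a guessed "parent" among the earlier files.

```python
import hashlib
from difflib import SequenceMatcher


def content_hash(data: bytes) -> str:
    """Identify a file purely by its bytes, ignoring its path."""
    return hashlib.sha256(data).hexdigest()


def union_by_content(snapshots):
    """Set union over all snapshots, keyed by content hash.

    snapshots: list of {path: bytes} dicts, oldest first.
    Returns {hash: (snapshot_index, path, data)} for the first time
    each distinct content was seen.
    """
    unique = {}
    for index, snapshot in enumerate(snapshots):
        for path, data in sorted(snapshot.items()):
            unique.setdefault(content_hash(data), (index, path, data))
    return unique


def guess_parent(data: bytes, earlier, threshold=0.6):
    """Return the hash of the most similar earlier file, or None.

    The threshold is an arbitrary illustrative cutoff, not a tuned value.
    """
    best_key, best_ratio = None, threshold
    for key, (_, _, old_data) in earlier.items():
        ratio = SequenceMatcher(None, old_data, data).ratio()
        if ratio >= best_ratio:
            best_key, best_ratio = key, ratio
    return best_key
```

Comparing every file against every earlier file is quadratic, so for terabytes of data this would need some pre-filtering (for example by file type or size) before it is practical. The parent links it produces could then be replayed as commits in snapshot order.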

I just haven't found a good tool to actually do this easily, unfortunately, but it seems like it would be a very basic or commonly used scenario (especially for those 'should-be-a-git-repo' directories that everyone made before knowing about git). You know the ones: 'myfile.v1.doc', 'myfile.v2.doc', 'myfile.final.doc', 'myfile.reallyfinal.doc', 'myfile.finalfinal.doc'.


mholt 1092 days ago

_> So now you have a big folder with many zips, and maybe some extracted folders, because things happen over the years, etc._

Oh, right.

Timelinize can do that. Takeout all your data, then import it into Timelinize. Then delete your Takeout (after Timelinize is finished and stable, of course, heh). Then next time you Takeout, just import it all into Timelinize again. (It de-dupes!) Then delete the Takeout, etc. (Maybe Timelinize can do the cleanup for you someday.)

The de-duping depends on the item being recognizable. Best if the data source provides an ID. Otherwise, things like certain metadata and content can be used to determine duplicates.
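To make that distinction concrete, here is a minimal sketch of ID-first de-duplication with a metadata-and-content fallback. It is not Timelinize's actual implementation; the item fields (`source`, `id`, `timestamp`, `content`) and both function names are assumptions chosen for illustration.

```python
import hashlib


def dedupe_key(item: dict) -> tuple:
    """Prefer the data source's own ID; otherwise fall back to
    metadata plus a hash of the content."""
    source_id = item.get("id")
    if source_id:
        return ("id", item["source"], source_id)
    digest = hashlib.sha256(item.get("content", b"")).hexdigest()
    return ("content", item["source"], item.get("timestamp"), digest)


def import_items(store: dict, items) -> int:
    """Add only items whose key is not already in the store.

    Returns how many items were newly added.
    """
    added = 0
    for item in items:
        key = dedupe_key(item)
        if key not in store:
            store[key] = item
            added += 1
    return added
```

Importing the same export twice adds nothing the second time. An item without an ID whose content has been edited gets a different key, so it is stored as a new item rather than recognized as a changed version of the old one.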


citelao 1098 days ago

I'm not chaxor, but as far as I remember I think you're right:

If I had unzipped all the takeout directories into one giant folder, there'd be no conflicts.

Since I didn't do that, I had to do weird multi-pass parsing, since an album could be split across multiple ZIPs. I get a bit neurotic around backups like this, so I'd have loved some sort of virtualized filesystem that non-destructively represented all of those zips "merged together." But in retrospect, I should have just merged the directories into one folder---would have made parsing easier :)
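For what it's worth, a non-destructive "merged view" can be approximated without a real virtual filesystem. The sketch below uses Python's standard `zipfile` module to index members across several archives without extracting anything; the function names are hypothetical, and treating a matching CRC-32 as "same file" is an assumption that is good enough for spotting conflicts but is not a cryptographic guarantee.

```python
import zipfile


def build_merged_index(zip_paths):
    """Map each member path to the archive that holds it.

    Returns (index, conflicts): index is {member: (zip_path, crc)},
    conflicts lists member paths that appear in more than one archive
    with different CRCs. The first archive seen wins.
    """
    index = {}
    conflicts = []
    for zip_path in zip_paths:
        with zipfile.ZipFile(zip_path) as archive:
            for info in archive.infolist():
                if info.is_dir():
                    continue
                previous = index.get(info.filename)
                if previous is not None:
                    if previous[1] != info.CRC:
                        conflicts.append(info.filename)
                    continue
                index[info.filename] = (str(zip_path), info.CRC)
    return index, conflicts


def read_member(index, name):
    """Read one file from whichever archive holds it, on demand."""
    zip_path, _ = index[name]
    with zipfile.ZipFile(zip_path) as archive:
        return archive.read(name)
```

With this, an album split across several ZIPs shows up as one set of paths in `index`, and the original archives stay untouched.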

I don't recall substantial problems with duplicates. Just weird renames and EXIF data mismatches. And since I was trying to archive my data, I definitely didn't want similar photos to be deduplicated.

My problems are probably different than chaxor's, though.
