by mholt
1098 days ago
> What is the correct tool to properly merge a large set of tar.gz files for which may have an enormous overlap of similar files, and some that have been altered just slightly?

Can you elaborate on this? My understanding is that they should all extract into the same target folder without issues because each archive's set of files is distinct. But maybe I'm just assuming wrongly? What exactly is your goal, too? It sounds like you are trying to find and de-duplicate visually similar images? Like what do you mean by "enormous overlap" or "altered just slightly"?
So for example, in 2001 you make a Takeout with 30 zips, and then delete half of your photos off of Google. Then in 2007 you make another with 20 zips, and delete 25% of your emails and photos to make more room. You do it again in 2008, and so on up to now.
So now you have a big folder with many zips, and maybe some extracted folders, because things happen over the years, etc.
What's the best tool to merge all of this into one directory?
Git can help with the notes from Google Keep, which may have had things appended or removed. Photos can overlap a bit, so really a set union is all that's required for many files, but some will be slightly different, like the Google Keep notes.
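To make the set-union part concrete, here is a rough sketch in Python. It assumes the zips have already been extracted into one folder per Takeout, and the name `merge_union` and the hash-suffix naming scheme are just things I made up for illustration. Exact duplicates are detected by SHA-256; a file that reappears at the same relative path with different content is kept next to the first copy instead of overwriting it.

```python
import hashlib
import shutil
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def merge_union(sources: list[Path], target: Path) -> dict[str, list[str]]:
    """Copy every distinct file from the source trees into target.

    Byte-identical files at the same relative path are copied once.
    A file whose relative path was already seen with different content
    is written alongside it with a short hash suffix (hypothetical
    naming scheme), so nothing is overwritten.
    Returns a map of relative path -> list of digests seen for it.
    """
    seen: dict[str, list[str]] = {}
    for src in sources:
        for path in sorted(p for p in src.rglob("*") if p.is_file()):
            rel = path.relative_to(src)
            digest = file_digest(path)
            versions = seen.setdefault(rel.as_posix(), [])
            if digest in versions:
                continue  # exact duplicate of a copy we already have
            dest = target / rel
            if versions:
                # Same path, different content: keep both versions.
                dest = dest.with_name(f"{dest.stem}.{digest[:8]}{dest.suffix}")
            dest.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(path, dest)
            versions.append(digest)
    return seen
```

Any path that ends up with more than one digest in the returned map is a candidate for the "slightly different" handling, which is where something like git would come in.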
My best thought is to make a git repo and add things in, but to compute a Levenshtein distance on the bytes of each file to check whether there is overlap in content, and to estimate the 'lineage' of a file if it overlaps significantly with another. Effectively you reconstruct the git commit tree from the set of all files over all histories, and then you build the git repo history from that.
This would probably have to be just a local git repo, since it would likely be several terabytes of data, but that would be the general idea, I guess.
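A minimal sketch of the lineage guess, under some loud assumptions: the versions are small text files (like Keep notes), they are already sorted oldest first, and I'm using the standard library's `difflib.SequenceMatcher` ratio as a stand-in for a true Levenshtein distance. The function name `guess_lineage` and the 0.6 threshold are arbitrary choices of mine, not anything from an existing tool.

```python
from difflib import SequenceMatcher
from typing import Optional


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; a difflib ratio, not a true Levenshtein distance."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def guess_lineage(
    versions: list[tuple[str, str]], threshold: float = 0.6
) -> dict[str, Optional[str]]:
    """Map each version name to its most likely parent, or None.

    `versions` is a list of (name, text) pairs, oldest first. The parent
    of a version is the most similar earlier version whose similarity is
    at least `threshold` (an assumed cutoff); ties go to the newer one.
    """
    parents: dict[str, Optional[str]] = {}
    for i, (name, text) in enumerate(versions):
        best_name, best_score = None, threshold
        for prev_name, prev_text in versions[:i]:
            score = similarity(prev_text, text)
            if score >= best_score:
                best_name, best_score = prev_name, score
        parents[name] = best_name
    return parents
```

The parent map gives an order in which the versions could be committed. Comparing every pair is quadratic, so at terabyte scale this would only be practical after exact duplicates are removed and only within groups of files that already look related, for example by name.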
I just haven't found a good tool to actually do this easily, unfortunately, but it seems like it would be a very basic, or at least commonly encountered, scenario (especially for those 'should-be-a-git-repo' directories that everyone made before knowing about git). You know the ones: 'myfile.v1.doc', 'myfile.v2.doc', 'myfile.final.doc', 'myfile.reallyfinal.doc', 'myfile.finalfinal.doc'.