| HN Mirror



	by ikeboy 4167 days ago
	To calculate the hash, it needs to read the whole file, which this post claims it isn't doing.

3 comments

sarciszewski 4167 days ago

I appear to have overlooked this detail. Good catch! :)


aselzer 4167 days ago

Did the author actually verify this with strace (or the mac/windows equivalent)?

It sounds like he guessed this based on I/O activity of the process. It could be enough to hash the beginning of the files, and compare the rest if a match is found in the database.
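A minimal sketch of that two-stage idea, assuming a local in-memory index rather than anything Dropbox is known to do: hash only the beginning of each file, and fall back to a full byte-for-byte comparison only when two prefixes collide. The names `prefix_digest` and `find_duplicates` and the 64 KiB prefix size are hypothetical choices for illustration.

```python
import filecmp
import hashlib
from collections import defaultdict

PREFIX_BYTES = 64 * 1024  # assumed prefix size; not a value from the thread


def prefix_digest(path, n=PREFIX_BYTES):
    """Hash only the first n bytes of the file."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read(n)).hexdigest()


def find_duplicates(paths):
    """Return (earlier, later) pairs of files whose full contents match."""
    seen = defaultdict(list)  # prefix digest -> distinct files with that prefix
    dupes = []
    for path in paths:
        key = prefix_digest(path)
        for candidate in seen[key]:
            # Prefixes match, so pay for a full comparison only now.
            if filecmp.cmp(candidate, path, shallow=False):
                dupes.append((candidate, path))
                break
        else:
            # No identical file seen yet: remember this one.
            seen[key].append(path)
    return dupes
```

Files that share a prefix but differ later cost one extra full read each; files with unique prefixes are never read past the first 64 KiB.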


cremno 4167 days ago

Dropbox doesn't read the file content. There is also no proof that Dropbox directly accesses those files.


yc1010 4167 days ago

Not really. One could get a unique-enough hash by reading, let's say, the first 10,000 bytes of each file, and that would be faster than hashing the whole file.

Edit: I was bored enough to write it: http://pastebin.com/NJEvnG1d


x1798DE 4167 days ago

I wrote something to hash audiobook files, and it was taking forever, so I tried hashing only the first N bytes (likely much more than 10 kB). I soon found that for any given audiobook, each chapter's MP3 had a large identical header at the front; I imagine it was a cover image embedded in the metadata.

I think in the end I just started taking the data from the end of the file, but if you're going with subsets, it's probably better to use a pseudo-randomly selected subset rather than a sequential subset. It doesn't have to be a different pseudo-random subset for each file, but I imagine there's an ideal noise profile in the sampling (maybe white noise is best).
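One way the pseudo-random sampling could look, as a sketch under stated assumptions: a fixed seed gives every file of the same size the same set of offsets, and the file size is mixed into the hash. The function name `sampled_hash` and the defaults (16 samples of 4096 bytes, seed 0) are hypothetical; nothing here tunes the "noise profile" of the sampling.

```python
import hashlib
import os
import random


def sampled_hash(path, samples=16, chunk=4096, seed=0):
    """Hash the file size plus `samples` chunks at pseudo-random offsets."""
    size = os.path.getsize(path)
    h = hashlib.sha256()
    h.update(size.to_bytes(8, "big"))  # files of different sizes never match

    if size <= samples * chunk:
        # Small file: sampling saves nothing, so hash everything.
        with open(path, "rb") as f:
            h.update(f.read())
        return h.hexdigest()

    # Fixed seed: the same offsets are chosen for every file of this size.
    rng = random.Random(seed)
    offsets = sorted(rng.randrange(0, size - chunk + 1) for _ in range(samples))
    with open(path, "rb") as f:
        for off in offsets:
            f.seek(off)
            h.update(f.read(chunk))
    return h.hexdigest()
```

Because the offsets are spread across the whole file, a shared header such as embedded cover art only affects the samples that happen to land inside it. A difference that falls between samples still goes undetected.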


yc1010 4166 days ago

Of course you are correct (not sure why my comment was downvoted), but if the goal is a hash that is TAKEN QUICKLY and is unique enough 99.999% of the time across a set of millions of files, it's good enough. Anyone who needs better hashing can hash the whole file, but that is quite heavy on large files and pointless if the application doesn't need it.


sarciszewski 4167 days ago

Maybe:

    File > 12 KB: First 4 KB, last 4 KB, middle 4 KB
    File <= 12 KB: Just hash the damn file
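The rule above can be sketched roughly as follows. The name `sparse_hash`, the use of SHA-256, and centering the middle block on the file's midpoint are assumptions for illustration; the comment does not specify any of them.

```python
import hashlib
import os

BLOCK = 4 * 1024  # 4 KB, as in the rule above


def sparse_hash(path):
    """Hash the first, middle and last 4 KB; hash small files in full."""
    size = os.path.getsize(path)
    h = hashlib.sha256()
    with open(path, "rb") as f:
        if size <= 3 * BLOCK:
            # File <= 12 KB: just hash the whole thing.
            h.update(f.read())
        else:
            # File > 12 KB: first block, centered middle block, last block.
            for offset in (0, (size - BLOCK) // 2, size - BLOCK):
                f.seek(offset)
                h.update(f.read(BLOCK))
    return h.hexdigest()
```

This reads at most 12 KB per file regardless of size, at the cost of missing any change that falls outside the three sampled blocks.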
