by smilebot 582 days ago
>Due to the prevalence of forking and copy-pasting within the codebase, nearly 75% of files are completely duplicated.

This is surprisingly high. Does this include imported libraries and packages? Since you are hashing at the file level, I am not fully convinced that this is due to people copying entire files over without modification.
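To make the question concrete, here is a rough sketch of what file-level hashing could look like. Everything in it is an assumption for illustration, not the original authors' pipeline: the function names, the choice of SHA-256, the toy file paths, and the definition of "duplicated" (every file whose exact bytes appear more than once is counted, including the first copy).

```python
import hashlib
from collections import defaultdict


def group_by_content_hash(files):
    """Group file paths by the SHA-256 digest of their raw bytes.

    `files` maps a path to its contents as bytes (hypothetical input shape).
    """
    groups = defaultdict(list)
    for path, content in files.items():
        groups[hashlib.sha256(content).hexdigest()].append(path)
    return groups


def duplicate_fraction(files):
    """Fraction of files whose exact contents also appear in another file."""
    if not files:
        return 0.0
    groups = group_by_content_hash(files)
    duplicated = sum(len(paths) for paths in groups.values() if len(paths) > 1)
    return duplicated / len(files)


# Toy data: one vendored library copied into a second repo and a fork.
example = {
    "repo_a/vendor/lib.js": b"function f() {}\n",
    "repo_b/vendor/lib.js": b"function f() {}\n",
    "repo_b_fork/vendor/lib.js": b"function f() {}\n",
    "repo_a/main.py": b"print('hello')\n",
}

print(duplicate_fraction(example))  # 3 of the 4 files share contents
```

Under this kind of scheme a vendored library or a forked repo inflates the number just as much as hand copy-pasting does, since the hash cannot tell the two apart.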

1 comments

Probably forks/duplicates of repos in the dataset.
Also commits. I imagine that there is a lot of information to gather from the history of repos in addition to the "static view" of a codebase.

However, it doesn't seem trivial to do deduplication in that case without removing relevant/necessary context.