by heyitsguay
1751 days ago
So there's some arguing over terminology, and then the main technical point seems to be that you can reverse-engineer a training dataset from the "virtual machine" built by training a neural network. The decompilation process doesn't just use the neural network, though: if I understand correctly, it also uses logs from the final training epoch that include error and weight update data. Does this somehow smuggle the training dataset back into the VM? To me, if you're making a statement about the nature of existing ML systems, the statement "reconstruct the source dataset from the cryptic matrices that constitute the software executed by them" would imply that this is possible from trained networks alone.
|
During the compilation phase, the training dataset is projected onto a complex vector space that is made up of both the "model" of the "neural network" and these logs.
It's just like casting a shadow onto a two-dimensional surface: if you discard the data for one dimension, you have no hope of guessing what cast the shadow. You need both dimensions.
The logs that are preserved in the compilation process are the part of the vector space that is usually discarded during the "training".
But discarding the "model" would have exactly the same effect: you cannot get back the source dataset from those logs alone. That's why this does not "smuggle the training dataset back".
Indeed, the fact that the source dataset is obtainable from the pair "these logs" + "final model", but neither from "these logs" alone nor from the final model alone, proves that a substantial portion of the source dataset is always embedded in the "model", which thereby becomes a derivative work of the sources.
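
To make the shadow analogy concrete, here is a minimal Python sketch. It is only an illustration of the geometric picture, not the decompilation process itself: the 2-D point, the two axis projections and all names (`sample`, `shadow_a`, `shadow_b`) are hypothetical, and which shadow stands for the "model" and which for the logs is arbitrary.

```python
# Toy illustration of the shadow analogy (all names are hypothetical).
# A single "sample" is a point in a 2-D space.
sample = (3.0, 4.0)

def project_first(point):
    """Shadow that keeps only the first coordinate."""
    return (point[0], 0.0)

def project_second(point):
    """Shadow that keeps only the second coordinate."""
    return (0.0, point[1])

shadow_a = project_first(sample)   # stands in for one half of the data
shadow_b = project_second(sample)  # stands in for the other half

# One shadow alone is ambiguous: a different point casts the same shadow_a.
other = (3.0, -7.0)
same_shadow = project_first(other) == shadow_a

# Both shadows together pin the point down exactly.
recovered = (shadow_a[0] + shadow_b[0], shadow_a[1] + shadow_b[1])

print(same_shadow)
print(recovered == sample)
```

In this toy each shadow discards exactly one coordinate, so neither is enough on its own, while the two together determine the point. Whether real models and training logs split the information this cleanly is the assumption the argument above rests on.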