by r-zip
1750 days ago
Right. While I appreciate the author's skepticism and diction (there is a lot of misleading terminology thrown around by the ML community), his points don't land. In particular, he argues that there's no learning going on, but then says that there is "absorption" of statistical patterns going on. That's just nitpicking over semantics; to people in the field, the two terms mean the same thing. The only difference is whether you anthropomorphize a piece of software. The second place the author stumbles is the (quite grave) mistake you pointed out. The title insinuates that the network contains the "source dataset" itself. He has shown nothing of the sort, because his "decompilation" relies on the training logs. That's like claiming you have a Swift decompiler that can recover the exact source code from an optimized binary when it actually requires access to the pre-optimized LLVM IR.
|
IMHO it's a better metaphor than "learning", because learning is a _subjective_ experience that everyone has, and using that term inevitably leads to anthropomorphisation.
"Absorb" matches the intuition of filters and pipelines, which any CS student, any "ML expert", any lawyer and any other citizen can easily understand.
____
As for the network, my argument is simple: if I get back the source dataset from the executable, I think we can agree that the dataset is projected onto the numerical matrices that the executable records.
Now where is the dataset?
You might argue that it is recorded _only_ in the gradients logged there (the gradients applied to one single "neuron" for each "layer"), but if that were so, you could reconstruct the source dataset from the logs alone, and in fact you cannot. You need both the "model" and those gradients in the correct order (and the encodings of inputs and outputs, obviously).
You might ask: "fine, but how much of the source dataset is projected into the gradients and how much is projected into the model?"
To answer, we need to consider that
- the vector space that constitutes the executable is non-linear (the "model" part) and hierarchical (the gradient vectors are independent neither across layers nor across samples)
- (initialization apart) all the values (and the operative value) that the "model" contains come from the source dataset
Thus I argue that a substantial portion of the source dataset is contained in the "model".
This does not exclude that another substantial portion of the source dataset is also contained in the few logged gradients!
And in fact I've never stated that the "model" contained the whole source dataset.
But if the portion contained in the "model" were negligible, you would be able to get back the sources from those logged gradients alone with negligible error.
AFAIK, it is not possible, but if you can, please teach me how! I'm always more than happy to be proven wrong if I can learn how to do something that I previously thought impossible!
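____
To make the second bullet above concrete, here is a minimal sketch. It assumes plain SGD (no momentum, no weight decay) on a toy one-input linear model with squared error. The data and every name in it (`sgd_train`, `replay`, `samples`) are hypothetical and are not taken from the original experiment. It shows only that, under those assumptions, the final parameters equal the initialization minus the learning-rate-scaled sum of gradients computed from the samples. It says nothing about how much of a dataset can be recovered from a real network.

```python
# Hypothetical toy setup: plain SGD on y = w*x + b with squared error.
def sgd_train(samples, w0, b0, lr):
    """Train with one SGD step per sample; return final (w, b) and the gradient log."""
    w, b = w0, b0
    grads = []
    for x, y in samples:
        err = (w * x + b) - y             # prediction error on this sample
        gw, gb = 2 * err * x, 2 * err     # gradient of err**2 w.r.t. w and b
        grads.append((gw, gb))
        w -= lr * gw
        b -= lr * gb
    return w, b, grads


def replay(w0, b0, lr, grads):
    """Rebuild the final parameters from the initialization and the full gradient log."""
    w = w0 - lr * sum(gw for gw, _ in grads)
    b = b0 - lr * sum(gb for _, gb in grads)
    return w, b


# Made-up samples drawn from y = 2x + 1.
samples = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
w, b, grads = sgd_train(samples, w0=0.0, b0=0.0, lr=0.01)
w_replayed, b_replayed = replay(0.0, 0.0, 0.01, grads)

# Apart from floating-point rounding, the two results coincide.
print(abs(w - w_replayed) < 1e-12, abs(b - b_replayed) < 1e-12)
```

Note that each gradient here was computed with the parameters as they stood at that step, so the log only makes sense in order and together with the model that produced it.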