|
|
|
|
|
by D-Machine
101 days ago
|
|
Did you read the post you are responding to? It says: > What's the benefit? Is it speed? Where are the benchmarks? Is it that you can backprop through this computation? Do you do so? The correct parsing of this is: "What's the benefit? [...] Is it [the benefit] that you can backprop through this computation? Do you do so?" There are no details about training nor the (almost-certainly necessarily novel) loss function that would be needed to handle partial / imperfect outputs here, so it is extremely hard to believe any kind of gradient-based training procedure was used to determine / set weight values here. |
|
my understanding was that they are not training at all, which would explain that. they are compiling an interpreter down to a VM that has the shape of a transformer.
ie they are calculating the transformer weights needed to execute the operations of the machine they are generating code for.