|
|
|
|
|
by microtonal
1052 days ago
|
|
Yeah. In curated transformers [1] we are seeing completely deterministic output across multiple popular transformer architectures on a single GPU (there can be variance between GPUs due to different kernels). Of course, it completely depends on what ops and implementations you are using. But most transformers do not use ops that are typically non-deterministic to be fast (like scatter-add). One non-determinism we see with a temperature of 0 is that once you have quantized weights, many predicted pieces will have the same probability, including multiple pieces with the highest probability. And then the sampler (if you are not using a greedy decoder) will sample from those pieces. So, generation is non-deterministic with a temperature of 0. In other words, a temperature of 0 is a poor man’s greedy decoding. (It is totally possible that OpenAI’s implementation switches to a greedy decoder with a temperature of 0). [1] https://github.com/explosion/curated-transformers |
|