by markonen
638 days ago
In the performance tests, they said they used "consensus among 64 samples" and "re-ranking 1000 samples with a learned scoring function" for the best results. If they did something similar for these human evaluations, rather than just using a single sample, you can see how that would be horrible for personal writing.
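
For anyone unfamiliar with those two selection strategies, here is a rough sketch of the general idea, not what they actually ran. The function names, the toy samples, and the `score` stand-in are all made up for illustration; in the real setup the samples come from the model and the scoring function is learned.

```python
from collections import Counter
from typing import Callable, Sequence


def consensus(samples: Sequence[str]) -> str:
    """Majority vote: return the answer that appears most often."""
    return Counter(samples).most_common(1)[0][0]


def rerank(samples: Sequence[str], score: Callable[[str], float]) -> str:
    """Re-ranking: return the sample the scoring function rates highest."""
    return max(samples, key=score)


if __name__ == "__main__":
    # Hypothetical model outputs for one prompt (assumption, not real data).
    answers = ["42", "41", "42", "42", "17"]
    print(consensus(answers))

    # Toy scorer that just prefers longer text (assumption; the real one is learned).
    drafts = ["ok", "a longer draft", "short"]
    print(rerank(drafts, score=len))
```

Both approaches return whichever candidate agrees most with the others or scores highest. That suits a math answer with one right result, but for prose it would plausibly favor the most generic candidate.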