


	by markonen 684 days ago
	In the performance tests they said they used "consensus among 64 samples" and "re-ranking 1000 samples with a learned scoring function" for the best results. If they did something similar for these human evaluations, rather than just using a single sample, you can see how that would be horrible for personal writing.
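
For anyone unfamiliar with the two quoted strategies, here is a rough sketch of how they might look. This is not the evaluated system's actual code: the function names `consensus` and `rerank` are made up for illustration, and the `score` argument stands in for the "learned scoring function", which is not described here.

```python
from collections import Counter
from typing import Callable, Sequence


def consensus(samples: Sequence[str]) -> str:
    """Return the answer that appears most often among the samples (majority vote)."""
    if not samples:
        raise ValueError("need at least one sample")
    # most_common(1) yields [(answer, count)]; ties go to the answer seen first.
    return Counter(samples).most_common(1)[0][0]


def rerank(samples: Sequence[str], score: Callable[[str], float]) -> str:
    """Return the sample that the scoring function rates highest."""
    if not samples:
        raise ValueError("need at least one sample")
    # `score` is a hypothetical stand-in for a learned scoring model.
    return max(samples, key=score)
```

Both collapse many samples into one output. A majority vote only makes sense when answers can be compared for equality, and re-ranking is only as good as the scorer you hand it.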

1 comment

janalsncm 684 days ago

I don’t understand how that is generalizable. I’m not going to be able to train a scoring function for any arbitrary task I need to do. In many cases the problem of ranking is at least as hard as generating a response in the first place.
