by scoresmoke 1055 days ago

Discussions about LLM alignment often overlook data quality and quantity. It turns out that current models like Llama 2 use 10K+ prompts and responses for supervised fine-tuning (SFT) and 100K+ human preference pairs. While the preferences are fairly easy to annotate, producing a good SFT dataset is hard.

https://evalovernite.substack.com/p/rlhf-math-aint-enough

https://doi.org/10.5281/zenodo.8186168