Hacker News
by scoresmoke 1008 days ago
Discussions about LLM alignment often overlook data quality and quantity. It turns out that current models like Llama 2 use 10K+ prompts and responses for supervised fine-tuning (SFT) and 100K+ human preference pairs. While preferences are fairly easy to annotate, producing a good SFT dataset is hard.

https://evalovernite.substack.com/p/rlhf-math-aint-enough

https://doi.org/10.5281/zenodo.8186168