To me, synthetic data generation makes no sense. Mathematically, your LLM is learning a distribution (let’s say of human knowledge). Let’s assume your LLM models human knowledge perfectly. In that case, what do you gain? You are just sampling the same data that your model already mapped perfectly.

However, if your model’s distribution is wrong, models trained on its synthetic data are going to end up with an even more skewed distribution.
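A toy version of this argument, as a rough sketch rather than anything resembling an actual LLM pipeline: fit a Gaussian to some data, sample from the fit, refit on those samples, and repeat. The function names, the sample size and the number of generations are all made up for illustration. With the maximum-likelihood estimate the fitted variance shrinks by a factor of (n-1)/n in expectation each generation, so the fit tends to drift away from the original distribution.

```python
import random
import statistics


def fit_and_resample(samples, n, rng):
    """Fit a Gaussian by mean and population stdev, then draw n points from the fit."""
    mu = statistics.fmean(samples)
    sigma = statistics.pstdev(samples)  # maximum-likelihood estimate, biased low
    return [rng.gauss(mu, sigma) for _ in range(n)]


def simulate(generations=200, n=50, seed=0):
    """Return the (mean, stdev) fitted at each generation of self-training.

    Generation 0 is sampled from the "true" distribution N(0, 1); every later
    generation only ever sees samples from the previous generation's fit.
    All parameter values here are illustrative assumptions.
    """
    rng = random.Random(seed)
    data = [rng.gauss(0.0, 1.0) for _ in range(n)]
    history = [(statistics.fmean(data), statistics.pstdev(data))]
    for _ in range(generations):
        data = fit_and_resample(data, n, rng)
        history.append((statistics.fmean(data), statistics.pstdev(data)))
    return history


if __name__ == "__main__":
    history = simulate()
    for gen in (0, 50, 100, 200):
        mean, stdev = history[gen]
        print(f"generation {gen:3d}: mean={mean:+.3f} stdev={stdev:.3f}")
```

This only covers the case where each generation trains purely on the previous generation’s output, with no fresh real data mixed in.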

To me, it seems like the architecture is the next place for improvements. If you can’t synthesise the entirety of human knowledge using transformers, there’s an issue there.

The smell that points me in that direction is that, up until recently, you could quantise models heavily with little drop in performance, but recent Llama3 research shows that’s no longer the case.
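For anyone unsure what “quantising heavily” means in practice, here is a minimal sketch of symmetric uniform quantisation applied to random Gaussian “weights”, assuming a single per-tensor scale. The helper names are hypothetical and the weights are synthetic; it only measures round-trip error on the numbers themselves and says nothing about Llama3 or about downstream model quality.

```python
import random


def quantise(weights, bits):
    """Round weights to a signed integer grid with `bits` bits, then map back to floats."""
    levels = 2 ** (bits - 1) - 1  # e.g. 127 for 8 bits, 7 for 4 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / levels if max_abs > 0 else 1.0  # one scale for the whole tensor
    return [round(w / scale) * scale for w in weights]


def rms_error(original, approx):
    """Root-mean-square difference between two equal-length lists."""
    return (sum((x - y) ** 2 for x, y in zip(original, approx)) / len(original)) ** 0.5


rng = random.Random(0)
weights = [rng.gauss(0.0, 1.0) for _ in range(10_000)]  # synthetic stand-in for a weight tensor

# Round-trip error at progressively more aggressive bit widths.
errors = {bits: rms_error(weights, quantise(weights, bits)) for bits in (8, 4, 2)}

if __name__ == "__main__":
    for bits, err in errors.items():
        print(f"{bits}-bit: RMS round-trip error = {err:.4f}")
```

Real quantisation schemes typically use per-channel or per-group scales and handle outliers separately, so treat this as the simplest possible baseline.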