| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by jonmoore 47 days ago
	I really liked the evaluation method here - testing fidelity by round-tripping through chains of invertible steps. It was striking how even frontier models accumulated errors on seemingly computer-friendly tasks. It would be interesting to know if the stronger results on Python are not just an artefact of the Python-specific evaluation, if they carry over to other common general-purpose languages, and if they are driven by something specific in the training processes.