HN Mirror



	by kirchoni 225 days ago
	Interesting overview, though I still wonder how stable G-Eval really is across different model families. Auto-CoT helps with consistency, but I’ve seen drift even between API versions of the same model.

1 comment

zlatkov 225 days ago

That's true. Even small API or model version updates can shift evaluation behavior. G-Eval helps reduce that variance, but it doesn’t eliminate it completely. I think long-term stability will probably require some combination of fixed reference models and calibration datasets.
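To make the calibration idea concrete, here is a minimal sketch, assuming you keep a fixed calibration set that was scored once by a pinned reference evaluator and is re-scored whenever the evaluator model or API version changes. The function names, the linear correction, and the scores are all hypothetical illustrations; none of this is part of G-Eval itself.

```python
from statistics import mean


def mean_abs_drift(new_scores, ref_scores):
    """Average absolute gap between new and reference scores (hypothetical drift metric)."""
    return mean(abs(x - y) for x, y in zip(new_scores, ref_scores))


def fit_linear_calibration(new_scores, ref_scores):
    """Least-squares fit of ref ~ a * new + b on the fixed calibration set."""
    if len(new_scores) != len(ref_scores) or len(new_scores) < 2:
        raise ValueError("need two equal-length score lists with at least 2 items")
    mx, my = mean(new_scores), mean(ref_scores)
    var = sum((x - mx) ** 2 for x in new_scores)
    if var == 0:
        raise ValueError("new scores are constant; cannot fit a slope")
    a = sum((x - mx) * (y - my) for x, y in zip(new_scores, ref_scores)) / var
    b = my - a * mx
    return a, b


# Made-up scores on the same five calibration items (assumption, not real data).
ref_scores = [1.0, 2.0, 3.0, 4.0, 5.0]   # pinned reference evaluator
new_scores = [1.5, 2.5, 3.5, 4.5, 5.5]   # same items after a version update

drift = mean_abs_drift(new_scores, ref_scores)
a, b = fit_linear_calibration(new_scores, ref_scores)

# Map the new evaluator's scores back onto the reference scale.
calibrated = [a * x + b for x in new_scores]
```

A linear map only corrects a uniform shift or rescaling. If a version update reorders the items, no correction of this kind helps, and the drift number is better treated as a signal to re-validate against human labels.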
