|
|
|
|
|
by halyconWays
5 hours ago
|
|
"We did multi-shot prompting to try and get these two games into comparable states using these two different models." "Well obviously you provided better follow-up prompts to the one that came out better." Also nothing about human-provided plan files and guardrails preclude the one-shot benchmark test. Heavens, I almost said "real coding," but in "real agentic program creation" you'd obviously be doing multi-turn interaction with the agent, but how can you provide a fair test when the model's output n determines your n+1 response? |
|