Hacker News
by AtlasBarfed 306 days ago
My personal test question keeps bombing, and I think it's something they should be capable of doing?

Are those math contests? Are their questions and answers in the training set?

Let's say that these things really won a math Olympiad by thinking. OK, then I would like them to write parsers based on a well-defined expression or language spec. Nothing as bad as nearly unparseable C++ or JavaScript.

Despite the prompt, the AIs refuse to write a complete parser. They hallucinate tests, do things like just calling the already-working compiler on the CLI, and force repeated reprompts that still don't complete the task.

To me, this is a good example of a task I would give AI as a service to see if it will reliably do something that's well specified, moderately annoying, and is most definitely in the training set if they are pulling data from "the internet".
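
For concreteness, a toy-scale version of this kind of task might look like the sketch below: a hand-written recursive-descent parser in Python for a small arithmetic grammar. The grammar, the tuple-based AST, and every name here (`tokenize`, `Parser`, `evaluate`) are assumptions made up for illustration, not taken from any real language spec. A real spec would be far larger, but the point is the same: each grammar rule maps to one method, and nothing shells out to an existing compiler.

```python
import operator
import re

# Hypothetical toy grammar (an illustrative assumption, not a real spec):
#   expr   := term (('+' | '-') term)*
#   term   := factor (('*' | '/') factor)*
#   factor := NUMBER | '(' expr ')' | '-' factor

# Either a number or any single non-whitespace character, with leading
# whitespace skipped.
TOKEN_RE = re.compile(r"\s*(?:(\d+(?:\.\d+)?)|(\S))")
OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}


def tokenize(text):
    """Turn the input string into a list of (kind, value) tokens."""
    tokens = []
    for number, op in TOKEN_RE.findall(text):
        if number:
            tokens.append(("NUM", float(number)))
        elif op in "+-*/()":
            tokens.append(("OP", op))
        else:
            raise SyntaxError(f"unexpected character {op!r}")
    return tokens


class Parser:
    """One method per grammar rule; builds a nested-tuple AST."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        if self.pos < len(self.tokens):
            return self.tokens[self.pos]
        return ("EOF", None)

    def advance(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def parse(self):
        node = self.expr()
        if self.peek()[0] != "EOF":
            raise SyntaxError(f"unexpected token {self.peek()[1]!r}")
        return node

    def expr(self):
        node = self.term()
        while self.peek() in (("OP", "+"), ("OP", "-")):
            op = self.advance()[1]
            node = (op, node, self.term())  # left-associative
        return node

    def term(self):
        node = self.factor()
        while self.peek() in (("OP", "*"), ("OP", "/")):
            op = self.advance()[1]
            node = (op, node, self.factor())  # left-associative
        return node

    def factor(self):
        tok = self.advance()
        if tok[0] == "NUM":
            return tok[1]
        if tok == ("OP", "-"):
            return ("neg", self.factor())
        if tok == ("OP", "("):
            node = self.expr()
            if self.advance() != ("OP", ")"):
                raise SyntaxError("expected ')'")
            return node
        raise SyntaxError(f"unexpected token {tok[1]!r}")


def evaluate(node):
    """Walk the AST and compute its numeric value."""
    if isinstance(node, float):
        return node
    if node[0] == "neg":
        return -evaluate(node[1])
    op, left, right = node
    return OPS[op](evaluate(left), evaluate(right))


def parse(text):
    return Parser(tokenize(text)).parse()
```

Calling `evaluate(parse("1 + 2 * 3"))` walks the tree that `parse` builds, and malformed input such as `"(1 + 2"` raises `SyntaxError` rather than being silently accepted. That second property is the "complete parser" part: every token has to be consumed by some rule or rejected.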

1 comment

> My personal test question keeps bombing, and I think it's something they should be capable of doing?

The problem is that "they" isn't a monolith. How much compute went into your tests? GPT-5 Thinking in ChatGPT Plus uses less compute than GPT-5 Thinking in ChatGPT Pro, which uses less compute than the "high" reasoning effort when "gpt-5" is called via the API, which uses less compute than GPT-5 Pro in ChatGPT Pro, which uses less compute than custom scaffolds, which use less compute than what went into the IMO/IOI solutions. This is not just idle speculation on my part; it's publicly available information.