by abossy 316 days ago
At my company (Charlie Labs), we've had a tremendous amount of success with context awareness over long-running tasks with GPT-5 since getting access a few weeks ago. We ran an eval to solve 10 real GitHub issues so that we could measure this against Claude Code, and the differences were surprisingly large. You can see our write-up here:

https://charlielabs.ai/research/gpt-5

Often, our tasks take 30-45 minutes, and the model can handle massive context threads in Linear or GitHub without getting tripped up by things like changes in direction partway through the thread.

While 10 issues isn't crazy comprehensive, we found it to be directionally very impressive and we'll likely build upon it to better understand performance going forward.


I am not (usually) photosensitive, but the animated static noise on your website causes noticeable flickering on various screens I use and made it impossible for me to read your article.

For better accessibility and a safer experience[1] I would recommend not animating the background, or at least making it easily togglable.

[1] https://developer.mozilla.org/en-US/docs/Web/Accessibility/G...

Removed. Sorry, and thank you for the feedback.
Love your responsiveness here!

Edited to add: I am, in fact, photosensitive (due to a genetic retinal condition), and for my eyes, your site, as it is, is very easy to read, and the visualizations look great.

Thank you!

Love that you included the judge prompts in your article.

Please let me know what you would like to see more of. Evals are something we take seriously. I think this post was OK enough given our constraints, but I'd like to produce content people find useful and I think we can do a lot better.
Nice,
I concur. Awful UI
Did you sign any kind of agreement with a non-disparagement clause to get early access? I'm asking because if you did, your data point isn't useful. It would mean anyone else who tried it and got worse results wouldn't be able to post here. We would just be seeing the successful data points.
Even if they didn't, overly critical or negative commentary will mean their removal from the list of trusted testers.
They didn't say anything to us, nothing was approved, just eng <> eng discussion about the model. Also nothing was cherry picked etc etc - I don't care what OAI thinks, I care about producing the best product and showing you our findings.
Waiting 30-45 minutes for code that you're still going to have to read from top to bottom to make sure it doesn't have anything dumb in it does not seem like a productivity enhancement. I would quit if I were an engineer and told to do this.
If you're doing nothing in that 30-45 minutes other than stare at a loading screen, you're doing it wrong.

I'm not sold on the efficacy of AI and I share your reservations about having to scrutinise their output, but I see great value in being able to offload a long-running task to someone/something else and only have to check back later. In the meantime, I can be doing something else - like sitting in those planning meetings we all enjoy!

I love sitting in those planning meetings, too. /s

This is exactly right. We've adapted our workflow to kick off a task and then kick off the next one and the next. Then we review the work of each as they come through. It's just CPU pipelining for human workflow.

The process is far from perfect but the throughput is very high. The limiting factor is review. I spend most of my time doing line-by-line review of AI output and asking questions about things I'm unsure of. It's a very different job from the way I historically operated, which involved tight code -> verify loops of manually written code.
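
For anyone who wants the pipelining analogy in code, here's a rough sketch, not our actual tooling. `run_agent_task` and `review` are hypothetical stand-ins I made up for illustration, and the 30-45 minute runs are scaled down to fractions of a second. The idea is just: start every task up front, then review results in whatever order they finish.

```python
import asyncio


async def run_agent_task(name: str, seconds: float) -> str:
    # Hypothetical stand-in for a long-running agent job.
    # Real runs take 30-45 minutes; here we sleep for a fraction of a second.
    await asyncio.sleep(seconds)
    return name


def review(name: str) -> str:
    # Stand-in for the human line-by-line review step.
    # It runs one result at a time, which is why review is the bottleneck.
    return f"reviewed {name}"


async def pipeline(jobs: dict[str, float]) -> list[str]:
    # Kick off every task immediately instead of waiting on each in turn.
    running = [
        asyncio.create_task(run_agent_task(name, seconds))
        for name, seconds in jobs.items()
    ]
    reviewed = []
    # Pick up each result as soon as it lands, in completion order.
    for finished in asyncio.as_completed(running):
        reviewed.append(review(await finished))
    return reviewed


if __name__ == "__main__":
    print(asyncio.run(pipeline({"issue-1": 0.03, "issue-2": 0.01, "issue-3": 0.02})))
```

The tasks overlap while they run, but `review` handles them one by one, so total time ends up bounded by how fast a person can review rather than by how long any single task takes.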