by TechDebtDevin 754 days ago
Well sure, but that wasn't what we were discussing. The original comment says they use that as their benchmark. While their coding task is a bit complex compared to other benchmarking prompts, it's not that crazy. Here is an example of the kind of prompt used for benchmarking with Python, for reference:

https://huggingface.co/datasets/mbpp?row=98

At the end of the day, LLMs in their current iteration aren't intended to do even moderately difficult tasks on their own, but it's fun to query them to see progress when new claims are made.

1 comment

The original comment says nothing about benchmarking; they just say that an AI can’t one-shot their complex task?
When I read

"My favorite thing to ask the models designed for programming is ....... None of them ever get it right"

I read "benchmark".