by sumanyusharma
754 days ago
Hi HN - We built a new benchmark called "Bug In The Code Stack" (BICS) to test how well LLMs can find syntactic bugs in large Python codebases. It is similar to a text-based needle-in-the-haystack test. GPT-3.5-Turbo was less accurate on BICS than on the BABILONG benchmark at the same context length and target depth, which suggests that LLMs struggle more with code-based tasks than with text-based tasks at long context lengths. GPT-4o performed best, closely followed by GPT-4-Turbo. The GPT-4 series performed especially well at long context lengths compared to other models. Gemini-1.0-pro performed worst, surprisingly even worse than Llama3-70B. In general, longer context lengths resulted in lower accuracy, though there were some exceptions.
Models react differently to the placement of the bug within the source code. GPT-3.5-Turbo and Claude 3 Opus were the most sensitive to placement, and the GPT-4 series was the least sensitive. Generally, lower sensitivity indicates a more robust model.

This benchmark has a lot of limitations. I would love your feedback and suggestions on how we can make it more useful!

Link to results: https://hamming.ai/blog/bug-in-the-codestack
Repo: https://github.com/HammingHQ/bug-in-the-code-stack
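
For readers who want a feel for the setup, here is a minimal sketch of how a code "haystack" with one syntactic bug at a chosen depth could be built. This is not the code from the repo: the function names (`make_function`, `build_haystack`, `insert_bug`, `has_syntax_error`), the filler functions, and the choice of a missing colon as the bug are all assumptions made for illustration. Context length is approximated here by the number of filler functions.

```python
from __future__ import annotations

import ast


def make_function(i: int) -> str:
    """Return source for one small, syntactically valid filler function (hypothetical)."""
    return (
        f"def add_{i}(a, b):\n"
        f"    total = a + b + {i}\n"
        f"    return total\n"
    )


def build_haystack(n_functions: int) -> list[str]:
    """Build the code stack; more functions means a longer context."""
    return [make_function(i) for i in range(n_functions)]


def insert_bug(functions: list[str], depth: float) -> tuple[str, int]:
    """Drop the colon from the def line of the function at relative `depth`.

    depth=0.0 targets the first function, depth=1.0 the last one.
    Returns the full source and the 1-based line number of the broken line.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    target = min(int(depth * len(functions)), len(functions) - 1)
    buggy = list(functions)
    buggy[target] = buggy[target].replace("):\n", ")\n", 1)
    # Each earlier function contributes its own lines plus one blank separator line.
    lines_before = sum(f.count("\n") + 1 for f in functions[:target])
    return "\n".join(buggy), lines_before + 1


def has_syntax_error(source: str) -> bool:
    """Ground-truth check using Python's own parser."""
    try:
        ast.parse(source)
    except SyntaxError:
        return True
    return False


functions = build_haystack(10)
clean_source = "\n".join(functions)
buggy_source, bug_line = insert_bug(functions, depth=0.5)
print(has_syntax_error(clean_source), has_syntax_error(buggy_source), bug_line)
```

In a real run, `buggy_source` would go into the prompt and the model's reported line would be compared against `bug_line`; sweeping `n_functions` and `depth` gives the context-length and target-depth grid described above.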