by stego-tech
196 days ago
These models still consistently fail the only benchmark that matters: if I give you a task, can you complete it successfully without making shit up? Thus far they all fail. Code outputs don’t run, or variables aren’t captured correctly, or hallucinations are stated as fact rather than flagged as suspect or answered with “I don’t know.” It’s 2000s PC gaming all over again (“gotta game the benchmark!”).
|
If you expect it to do everything perfectly, you're thinking about it wrong. If you can't get it to do anything perfectly, you're using it wrong.