Hacker News new | ask | show | jobs
by wesleyyue 702 days ago
I'm building a ai coding assistant (https://double.bot) so I've tried pretty much all the frontier models. I added it this morning to play around with it and it's probably the worst model I've ever played with. Less coherent than 8B models. Worst case of benchmark hacking I've ever seen.

example: https://x.com/WesleyYue/status/1816153964934750691

5 comments

to be fair that's quite a weird request (the initial one) – I feel a human would struggle to understand what you mean
definitely not an articulate request, but the point of using these tools is to speed me up. The less the user has to articulate and the more it can infer correctly, the more helpful it is. Other frontier models don't have this problem.

Llama 405B response would be exactly what I expect

https://x.com/WesleyYue/status/1816157147413278811

That response is bad python though, I can't think of why you'd ever want a dict with Literal typed keys.

Either use a TypedDict if you want the keys to be in a specific set, or, in your case since both the keys and the values are static you should really be using an Enum

What was the expected outcome for you? AFAIK, Python doesn't have a const dictionary. Were you wanting it to refactor into a dataclass?
Yes, there's a few things wrong: 1. If it assumes typescript, it should do `as const` in the first msg 2. If it is python, it should be something like https://x.com/WesleyYue/status/1816157147413278811 which is what I wanted but I didn't want to bother with the typing.
Are you sure the chat history is being passed when the second message is sent? That looks like the kind of response you'd expect if it only received the prompt "in python" with no chat history at all.
Yes, I built the extension. I actually also just went to send another message asking what the first msg was just to double check I didn't have a bug and it does know what the first msg was.
Thanks, that's some really bad accuracy/performance
This makes no sense. Benchmarking code is easier than natural language and Mistral has separate benchmarks for prominent languages.
a bit of surprise since codestral is among best open models so far.