| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by wesleyyue 702 days ago
	I'm building a ai coding assistant (https://double.bot) so I've tried pretty much all the frontier models. I added it this morning to play around with it and it's probably the worst model I've ever played with. Less coherent than 8B models. Worst case of benchmark hacking I've ever seen. example: https://x.com/WesleyYue/status/1816153964934750691

5 comments

mpeg 702 days ago

to be fair that's quite a weird request (the initial one) – I feel a human would struggle to understand what you mean

link

wesleyyue 702 days ago

definitely not an articulate request, but the point of using these tools is to speed me up. The less the user has to articulate and the more it can infer correctly, the more helpful it is. Other frontier models don't have this problem.

Llama 405B response would be exactly what I expect

https://x.com/WesleyYue/status/1816157147413278811

link

mpeg 701 days ago

That response is bad python though, I can't think of why you'd ever want a dict with Literal typed keys.

Either use a TypedDict if you want the keys to be in a specific set, or, in your case since both the keys and the values are static you should really be using an Enum

link

ijustlovemath 702 days ago

What was the expected outcome for you? AFAIK, Python doesn't have a const dictionary. Were you wanting it to refactor into a dataclass?

link

wesleyyue 702 days ago

Yes, there's a few things wrong: 1. If it assumes typescript, it should do `as const` in the first msg 2. If it is python, it should be something like https://x.com/WesleyYue/status/1816157147413278811 which is what I wanted but I didn't want to bother with the typing.

link

nabakin 702 days ago

Are you sure the chat history is being passed when the second message is sent? That looks like the kind of response you'd expect if it only received the prompt "in python" with no chat history at all.

link

wesleyyue 702 days ago

Yes, I built the extension. I actually also just went to send another message asking what the first msg was just to double check I didn't have a bug and it does know what the first msg was.

link

nabakin 702 days ago

Thanks, that's some really bad accuracy/performance

link

schleck8 702 days ago

This makes no sense. Benchmarking code is easier than natural language and Mistral has separate benchmarks for prominent languages.

link

treme 701 days ago

a bit of surprise since codestral is among best open models so far.

link