I'm building a ai coding assistant (https://double.bot) so I've tried pretty much all the frontier models. I added it this morning to play around with it and it's probably the worst model I've ever played with. Less coherent than 8B models. Worst case of benchmark hacking I've ever seen.
definitely not an articulate request, but the point of using these tools is to speed me up. The less the user has to articulate and the more it can infer correctly, the more helpful it is. Other frontier models don't have this problem.
Llama 405B response would be exactly what I expect
That response is bad python though, I can't think of why you'd ever want a dict with Literal typed keys.
Either use a TypedDict if you want the keys to be in a specific set, or, in your case since both the keys and the values are static you should really be using an Enum
Yes, there's a few things wrong:
1. If it assumes typescript, it should do `as const` in the first msg
2. If it is python, it should be something like https://x.com/WesleyYue/status/1816157147413278811 which is what I wanted but I didn't want to bother with the typing.
Are you sure the chat history is being passed when the second message is sent? That looks like the kind of response you'd expect if it only received the prompt "in python" with no chat history at all.
Yes, I built the extension. I actually also just went to send another message asking what the first msg was just to double check I didn't have a bug and it does know what the first msg was.