Hacker News new | ask | show | jobs
by spmurrayzzz 754 days ago
I love to ask it to "make me a Node.js library that pings an ipv4 address, but you must use ZERO dependencies, you must only the native Node.js API modules"

The majority of models (both proprietary and open-weight) don't understand:

- by inference, ping means we're talking about ICMP

- ICMP requires raw sockets

- Node.js has no native raw socket API

You can do some CoT trickery to help it reason about the problem and maybe finally get it settled on a variety of solutions (usually some flavor of building a native add-on using C/C++/Rust/Go), or just guide it there step by step yourself, but the back and forth to get there requires a ton of pre-knowledge of the problem space which sorta defeats the purpose. If you just feed it the errors you get verbatim trying to run the code it generates, you end up in painful feedback loops.

(Note: I never expect the models to get this right, it's just a good microcosmic but concrete example of where knowledge & reasoning meets actual programming acumen, so its cool to see how models evolve to get better, if at all, at the task).

1 comments

This is the same level of gotcha that everyone complains about when interviewing. It's mainly just depending on the interviewee having the same assumptions (pings definitely do not have to be icmp) and the same knowledge base, usually bespoke, (node.js peculiarities). I can see that an llm should know whether raw sockets are available, but that's not what you asked.

In fact you deliberately asked for something impossible and hold up undefined behavior as undefined like it's impugning something.

> In fact you deliberately asked for something impossible and hold up undefined behavior as undefined like it's impugning something.

Correct, I did. This is a direct indictment on a given model's ability to plan/reason in this particular context. There are plenty of situations where models will respond with "Sorry, that's not possible". Ask GPT-4 "Tell me how to grow biological wings on a human" and it will respond with something along the lines of "this isn't currently possible, but here's a theoretical exploration of the idea"

GPT-4 gets very close on its own to the node.js question via a similar response breakdown above, provided the prompt is clear and detailed enough. But I test the open weight models in the same way to see if they have the capacity to exhibit similar reasoning or chain of thought process on their own. They usually don't without excessive prompt engineering or few-shot.

I said that I don't expect models to get this right not because I don't _want_ them to, it's because I think its an important milestone when they do. Autoregressive token prediction is unlikely to produce the real outcome im testing for here, but if it ever does thats an interesting finding.