| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by bane 503 days ago

To me, the second biggest problem is that the models aren't really conversational yet. They can maintain some state between prompt and response, but in normal human-human interactions responses can be interrupted by either party with additional detail or context provided.

"Write Python code for the game of Tetris" resulting in working code that resembles Tetris is great. But the back and forth asking for clarification or details (or even post-solution adjustments) isn't there. The models dwell and draw almost entirely self-referentially from their own reasoning through the entire exchange.

"Do you want to keep score?" "How should scoring work?" "Do you want aftertouch?" "What about pushing down, should it be instantaneous or at some multiple of the normal drop speed?" "What should that multiple be?"

as well as questions from the prompter that inquire about capabilities and possibilities. "Can you add one 5-part piece that shows up randomly on average every 100 pieces on average?" or "Is it possible to make the drop speed function as an acceleration rather than a linear drop speed?"....these are somewhat possible, but sometimes require the model to re-reason the entire solution again.

So right now, even the best models may or may not provide working code that generates something that resembles a Tetris game, but have no specifics beyond what some internal self-referential reasoning provides, even if that reasoning happens in stages.

Such a capability would help users of these models troubleshoot or fix specific problems or express specific desires....the Tetris game works but has no left-hand L blocks for example. Or the scoring makes no sense. Everything happens in a sort of highly superficial approach where the reasoning is used to fill in gaps in the top-down understand of the problem the model is working on.

3 comments

hnuser123456 503 days ago

I have my own automated LLM developer tool. I give a project description, and the script repeatedly asks the LLM for a code attempt, runs the code, returns the output to the LLM, asking if it passes/fails the project description, repeating until it judges the output as a pass. Once/if it thinks the code is complete, it asks the human user to provide feedback or press enter to accept the last iteration and exit.

For example, I can ask it to write a python script to get the public IP, geolocation, and weather, trying different known free public APIs until it succeeds. But the first successful try was dumping a ton of weather JSON to the console, so I gave feedback to make it human readable with one line each for IP, location, and weather, with a few details for the location and weather. That worked, but it used the wrong units for the region, so I asked it to also use local units, and then both the LLM and myself judged that the project was complete. Now, if I want to accomplish the same project in fewer prompts, I know to specify human-readable output in region-appropriate units.

This only uses text based LLMs, but the logical next step would be to have a multimodal network review images or video of the running program to continue to self-evaluate and improve.

link

mistrial9 503 days ago

are you familiar with

https://gorilla.cs.berkeley.edu/leaderboard.html

link

hnuser123456 502 days ago

I was not, thanks for showing me!

link

MrLeap 503 days ago

> The models dwell and draw almost entirely self-referentially from their own reasoning through the entire exchange.

Sounds like a deficiency in theory of the mind.

Maybe explains some of the outputs I've seen from deepseek where it conjectures about the reasons why you said whatever you said. Perhaps this is where we're at for mitigations for what you've noticed.

link

timdellinger 503 days ago

This sounds like a prompting issue.

If your prompt instructs the model to ask such questions along the way, the model will, in fact, do so!

But yes, it would be nice if the model were smart enough to realize when it's in a situation where it should ask the user a few questions, and when it should just get on with things.

link