| To me, the second biggest problem is that the models aren't really conversational yet. They can maintain some state between prompt and response, but in normal human-human interactions responses can be interrupted by either party with additional detail or context provided. "Write Python code for the game of Tetris" resulting in working code that resembles Tetris is great. But the back and forth asking for clarification or details (or even post-solution adjustments) isn't there. The models dwell and draw almost entirely self-referentially from their own reasoning through the entire exchange. "Do you want to keep score?" "How should scoring work?" "Do you want aftertouch?" "What about pushing down, should it be instantaneous or at some multiple of the normal drop speed?" "What should that multiple be?" as well as questions from the prompter that inquire about capabilities and possibilities. "Can you add one 5-part piece that shows up randomly on average every 100 pieces on average?" or "Is it possible to make the drop speed function as an acceleration rather than a linear drop speed?"....these are somewhat possible, but sometimes require the model to re-reason the entire solution again. So right now, even the best models may or may not provide working code that generates something that resembles a Tetris game, but have no specifics beyond what some internal self-referential reasoning provides, even if that reasoning happens in stages. Such a capability would help users of these models troubleshoot or fix specific problems or express specific desires....the Tetris game works but has no left-hand L blocks for example. Or the scoring makes no sense. Everything happens in a sort of highly superficial approach where the reasoning is used to fill in gaps in the top-down understand of the problem the model is working on. |
For example, I can ask it to write a python script to get the public IP, geolocation, and weather, trying different known free public APIs until it succeeds. But the first successful try was dumping a ton of weather JSON to the console, so I gave feedback to make it human readable with one line each for IP, location, and weather, with a few details for the location and weather. That worked, but it used the wrong units for the region, so I asked it to also use local units, and then both the LLM and myself judged that the project was complete. Now, if I want to accomplish the same project in fewer prompts, I know to specify human-readable output in region-appropriate units.
This only uses text based LLMs, but the logical next step would be to have a multimodal network review images or video of the running program to continue to self-evaluate and improve.