|
I have my own automated LLM developer tool. I give a project description, and the script repeatedly asks the LLM for a code attempt, runs the code, returns the output to the LLM, asking if it passes/fails the project description, repeating until it judges the output as a pass. Once/if it thinks the code is complete, it asks the human user to provide feedback or press enter to accept the last iteration and exit. For example, I can ask it to write a python script to get the public IP, geolocation, and weather, trying different known free public APIs until it succeeds. But the first successful try was dumping a ton of weather JSON to the console, so I gave feedback to make it human readable with one line each for IP, location, and weather, with a few details for the location and weather. That worked, but it used the wrong units for the region, so I asked it to also use local units, and then both the LLM and myself judged that the project was complete. Now, if I want to accomplish the same project in fewer prompts, I know to specify human-readable output in region-appropriate units. This only uses text based LLMs, but the logical next step would be to have a multimodal network review images or video of the running program to continue to self-evaluate and improve. |
https://gorilla.cs.berkeley.edu/leaderboard.html