by gaflo
8 days ago
Thanks for documenting your personal observations. I do have a few questions. First, could you expand by giving other examples of how you observed this model to be relentlessly proactive?
In my personal experience with prior frontier models, using both Claude Code and Codex, I found them to already be quite proactive depending on the domain (although Codex a bit less so, which I personally prefer).
The main tasks they seemed to struggle with for me are those where the programs the agents wrote naturally have long run times, as they didn't seem to have a good intuition for when or how to change approach to minimise the time spent on the task. Specifically, this shows up if you are trying to scrape sites/services that are heavily guarded against programmatic access, or running automated tasks that call LLMs (such as indexing or document extraction).
I'm not surprised that for web dev the proactiveness is the most obvious improvement, as I would expect the most common use case with the most training data to be the biggest priority. I have previously built a workflow similar to the one you described with Fable 5 to auto-test changes to the website, and while it worked somewhat well, it often couldn't identify flaws that are obvious to the human eye, such as overlapping text, inconsistent font choices, or bad layout decisions. I do like it for quick prototyping, but the testing and design decisions were not ones I would hand off at this moment.
Did you notice improvements in these areas? Can you share how it does for long-running programs? If you want, I can give you some more specific instructions to test, but I would also be happy to hear about your own use cases.