The models that you have tried... are garbage. Hmm, maybe you are not among the many, many, many inside professionals and uniformed services who have different access than you do? Money talks?
It is remarkable that folks who tried a garbage LLM like Copilot, GPT-3.5, or Gemini, or who made Meta's LLMs say naughty words, seem to think these are still state of the art. Sometimes I stumble on them and I am shocked at the degradation in quality, then realize my settings are wrong. People are vastly underestimating the rate of change here.
People have tried GPT-4. It makes the same kinds of errors as GPT-3; it just has a bigger set of known things where it does OK, so it is immensely more useful.
It is like a calculator that only worked on one digit, and now it works on two. The improvement is immense, but it's still nowhere close to replacing mathematicians, since it isn't even working on the same kind of problems.
Edit: In several years we might have a perfect calculator that is better than any human at such tasks, but it still won't beat humans at stuff unrelated to calculations. In the case of LLMs, the task is pattern matching on text, and humans don't pattern match on text to plan or mentally simulate scenarios, so that part isn't covered by LLMs. Human-level planning combined with today's LLM-level pattern matching on text would be really useful, and we already see a lot of humans working that way by using the LLM as a pattern matcher. But there has been no progress on automating human-level planning so far; LLMs aren't it.
Not yet, because the reliability isn't there. You still need to validate everything it does.
E.g. I had it autocompleting a set of 20 variables today. Something like output.blah=tostring(input[blah]). The kind of work you give to a regex.
In the middle of the list, it decided to go output.blah=some long weird piece of code, completely unexpected and syntactically invalid.
I am still in my AI evaluation phase, and sometimes I am impressed with what it does. But an unexpected total failure is just as possible. As long as it does that, I can't trust it.
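For what it's worth, here is a rough sketch of the "give it to a regex" approach, in Python rather than whatever language the original snippet was in. The function names (generate_assignments, find_bad_lines) and the exact line shape are my assumptions based on the output.blah=tostring(input[blah]) example above, not anything from a real tool. It generates the repetitive lines deterministically and can also flag a completion that drifts from the pattern, which is the "validate everything" step.

```python
import re

# Assumed line shape, taken from the example in the comment above:
#   output.NAME=tostring(input[NAME])
# The backreference \1 requires the same name on both sides.
LINE_RE = re.compile(r"output\.(\w+)=tostring\(input\[\1\]\)")


def generate_assignments(names):
    """Build the repetitive assignment lines deterministically."""
    return [f"output.{name}=tostring(input[{name}])" for name in names]


def find_bad_lines(lines):
    """Return the indexes of lines that do not match the expected shape."""
    return [
        index
        for index, line in enumerate(lines)
        if LINE_RE.fullmatch(line.strip()) is None
    ]


if __name__ == "__main__":
    lines = generate_assignments(["foo", "bar", "baz"])
    # Simulate an autocomplete going off the rails in the middle of the list.
    lines[1] = "output.bar=some long weird piece of code("
    print(find_bad_lines(lines))
```

A check like this only catches lines that break the shape. It says nothing about whether a well-formed line is the right one, so it narrows what you need to review by hand rather than replacing the review.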