If the utility of all computers is based on saying the right sequence of magic words, why do we call them software engineers instead of something that better captures the nature of it, like "code wizards"?
I guess the difference is that code is (almost always) totally deterministic. Or at the very least, it's designed so that determinism is a mostly safe assumption.
It doesn't seem likely an LLM will ever do that. Maybe at a certain point of sophistication? But if the model is regularly changing (which almost all of them will be, if they're expected to stay up to date), there is a strong chance they'll behave differently every time they're used.
(I've been getting different behaviour in even relatively narrow ML-based systems for years. Google Assistant is my prime example - I regularly use the phrase "add to my calendar on the 20th of September at 5pm, go to the park". Almost all the time, it works perfectly. But a couple times a year at least, it won't process this as an action - it just does a Google web search for this string.)
Code can be deterministic if you're only doing trivial things or have very simple systems. Beyond that, it only has enough determinism to cross the threshold into being considered useful. Dead letter queues, uncaught errors, kernel panics, race conditions, deadlocks, cosmic-ray bit flips, dumb avoidable choices forced by unexpressive languages, malicious actors, resource constraints, and so on: the real world of software we live in is a duct-taped-together mess of half-measures that, at its best, only mostly does what we want. So much of the work in product programming is handling all the things that can go wrong.
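For anyone who hasn't met a dead letter queue: it's the "park it and look later" bucket for messages that keep failing. Below is a minimal sketch of that retry-then-park pattern, assuming an in-memory list of messages and a hypothetical `handler` function. Real message brokers provide this through their own configuration; this only illustrates the idea.

```python
from collections import deque


def process_with_dlq(messages, handler, max_attempts=3):
    """Try each message up to max_attempts times; park persistent failures.

    `handler` is a hypothetical callable that either returns a result or raises.
    Returns (results of successful messages, dead letter queue of (msg, error)).
    """
    processed = []
    dead_letters = deque()
    for msg in messages:
        for attempt in range(1, max_attempts + 1):
            try:
                processed.append(handler(msg))
                break  # success: move on to the next message
            except Exception as exc:  # catch-all on purpose: any failure counts
                if attempt == max_attempts:
                    # Out of retries: record the message and its last error.
                    dead_letters.append((msg, repr(exc)))
    return processed, dead_letters
```

The point is that a failing message neither blocks the rest of the batch nor silently disappears. Someone still has to decide what to do with whatever ends up in `dead_letters`.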
So yeah, prompt "engineering" is indeed a silly term, but software "engineering" kicked off the dilution of that word ages ago. And GPT models can be inspected and measured for input and output, prompts can be analyzed for their effects and usefulness, and temperature settings even directly control some degree of determinism. It's not like models change on a whim unless you're just using end-user products. Anthropic, Hugging Face, AWS, and OpenAI all let you pick a specific model release in your API calls and stick with it for a long time. If you're self-hosting a fine-tuned Llama 70B, nobody will ever force you to update it once you get it doing a task to your expectations. The degree of deterministic behavior in AI is currently lower than that of Excel or C code, but it also serves a wholly different purpose: people want it to be creative and produce novel, nondeterministic outputs, so comparing the two is a bit silly.
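To make the temperature point concrete, here is a toy sketch of how temperature rescales a model's next-token scores before sampling. The function names and logits are hypothetical, and this is not any provider's API. At temperature 0 it falls back to greedy argmax, which is repeatable. Above 0, output varies unless the random generator is seeded. Hosted services may still not return bit-identical outputs at temperature 0, so treat this as an illustration of the principle only.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature, rng):
    """Pick a token index from toy logits using a caller-supplied RNG."""
    if temperature == 0:
        # Greedy decoding: always the highest-scoring token, so it's repeatable.
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]
```

Lower temperatures push probability toward the top-scoring token. Higher ones flatten the distribution, which is where the "creative" variety comes from.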