The AI community is nowhere close to ditching Python. Most model development and training still use python based toolchains (torch, tf...). The new trends is for popular and useful models to be ported to more efficient stack like C++/GGML for easier usage and inference speed on consumer hardware.
Another popular optimisation is to port models to WASM + GPU because it makes them easy to support a variety of platforms (desktop, mobile...) with a single API and it can still offer great performance (see Google's mediapipe as an exemple of that).
In general if you don't know what you are doing, it's much faster to first figure out a good strategy for a solution in a language that does not suffer from all of the encumbrance C++ brings in.
Python is really great for fast prototyping. It can be argued most AI products so far are result of fast prototyping. So not sure if there is anything wrong with that.
As practical models emerge, at that point it indeed makes sense to port them to C++. But I would not in my wildest dreams suggest prototyping a data model in C++ unless absolutely necessary.
How has Python held it back? Most of the heavy computation lifting is done by C extensions/bindings and the models are compiled to run on CUDA, etc. What am I missing?
Presumably what you're missing is that it's that IshKebab probably doesn't work in AI/ML at all (no links in his profile, but you can judge his post history yourself). Anyone can have voice opinion, but that doesn't mean it's particularly well informed.
You can compile the models to something that runs on edge though, right? For example, Tensorflow is a C++ framework that has Python bindings and a Python library, but when the models are served they are running on C++.
Maybe the act of compilation is an extra step, but I'd much rather have my development be in a high level language that is very suited to experimentation, probing, and testing, and then compile the final result down to something performant.
EDIT: I don't know much about the IOT world, and Tensorflow is likely a bad example as it's not designed to run on edge. So, I could understand that things like llama.cpp, GGML and GGUF are making strides towards easier runtimes. But I still think for dev-time, Python makes sense!
llm-cli looks like it loads model files and it doesn't help with model development. It runs GGML model files. The models aren't written in Rust. Besides the point, GGUF is successor to GGML. There's a variety of ways to convert Pytorch, Keras, etc models to GGML or GGUF.
I dunno, maybe we're talking about different things. I'm saying it's better to do model development in a high level language and then export the training or runtime to a lower level framework, multiple of which exist and have existed. It's becoming simpler to use low-level runtimes (llama.cpp vs Tensorflow). Is that the point you're making?
This is exactly the wrong way around. We've seen the progress we've seen because of the adoption of Python. Even now there are relatively few people that can write code like this and have the ML and math experience to push forward the research.
Another popular optimisation is to port models to WASM + GPU because it makes them easy to support a variety of platforms (desktop, mobile...) with a single API and it can still offer great performance (see Google's mediapipe as an exemple of that).