| It is not LLM-specific. A large swathe of it isn’t particularly Microsoft-specific either. And it is a developer feature hidden from end users.
E.g., in your Ollama example, does the developer ask end users to install Ollama? Does the dev redistribute Ollama and keep it updated?
The ONNX format is pretty much a boring de facto standard for ML model exchange. It is under the Linux Foundation. The ONNX Runtime (ORT) is a Microsoft thing, but it is an MIT-licensed runtime for cross-language use and cross-OS/hardware deployment of ML models in the ONNX format. That bit needs to support everything because Microsoft itself ships software on everything (Mac/Linux/iOS/Android/Windows).
ORT: https://onnxruntime.ai
Here is the Windows ML part of this: https://learn.microsoft.com/en-us/windows/ai/new-windows-ml/...
The primary value claims for Windows ML (for a developer using it):
This eliminates the need to:
• Bundle execution providers for specific hardware vendors
• Create separate app builds for different execution providers
• Handle execution provider updates manually
Since ‘EP’ is ultra-super-techno-jargon, here is what GPT-5 provides:
Intensional (what an EP is)
In ONNX Runtime, an Execution Provider (EP) is a pluggable backend that advertises which ops/kernels it can run and supplies the optimized implementations, memory allocators, and (optionally) graph rewrites for a specific target (CPU, CUDA/TensorRT, Core ML, OpenVINO, etc.). ONNX Runtime then partitions your model graph and assigns each partition to the highest-priority EP that claims it; anything unsupported falls back (by default) to the CPU EP.
Extensional (how you use them)
• You pick/priority-order EPs per session; ORT maps graph pieces accordingly and falls back as needed.
• Each EP has its own options (e.g., TensorRT workspace size, OpenVINO device string, QNN context cache).
• Common EPs: CPU, CUDA, TensorRT (NVIDIA), DirectML (Windows), Core ML (Apple), NNAPI (Android), OpenVINO (Intel), ROCm (AMD), QNN (Qualcomm). |
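To make the priority-and-fallback idea concrete, here is a toy sketch in plain Python. It is not ORT's actual partitioning code: real ORT assigns subgraphs rather than single ops, and the `supported` table below is invented for illustration. The helper names `order_providers` and `assign_ops` are hypothetical; only the provider name strings match the ones ORT uses.

```python
from typing import Dict, List, Set

CPU = "CPUExecutionProvider"  # ORT's name for the default fallback EP


def order_providers(preferred: List[str], available: List[str]) -> List[str]:
    """Keep the preferred EPs that are actually available, in priority
    order, and always put the CPU EP last as the fallback."""
    chosen = [ep for ep in preferred if ep in available and ep != CPU]
    chosen.append(CPU)
    return chosen


def assign_ops(
    ops: List[str],
    providers: List[str],
    supported: Dict[str, Set[str]],
) -> Dict[str, str]:
    """Toy per-op stand-in for graph partitioning: each op goes to the
    first EP (in priority order) that claims it, otherwise to CPU."""
    assignment = {}
    for op in ops:
        assignment[op] = next(
            (ep for ep in providers if op in supported.get(ep, set())),
            CPU,
        )
    return assignment


if __name__ == "__main__":
    # Hypothetical machine: a Windows box with DirectML but no NVIDIA stack.
    preferred = ["TensorrtExecutionProvider", "CUDAExecutionProvider", "DmlExecutionProvider"]
    available = ["DmlExecutionProvider", "CPUExecutionProvider"]
    providers = order_providers(preferred, available)

    # Invented capability table, purely for illustration.
    supported = {"DmlExecutionProvider": {"Conv", "Relu"}}
    print(providers)
    print(assign_ops(["Conv", "Relu", "NonMaxSuppression"], providers, supported))
```

With the real library, the `available` list would come from `onnxruntime.get_available_providers()`, and the ordered list would be passed as `providers=` to `onnxruntime.InferenceSession` (the model path is whatever your app ships). The point of Windows ML, as claimed above, is that the app no longer has to bundle and update the vendor EPs behind that list itself.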