| -Infra: Kubernetes + NVIDIA device plugin; 20× Jetson Nano Dev Kits -Inference: Ollama / llama.cpp (quantized models ≤7B); token streaming (SSE/WebSocket); autoscaler driven by request queue depth + Jetson tegrastats GPU load -Cold starts: warm-pool containers with pre-loaded quantized models (to avoid long load times on 4 GB memory) -Networking: Envoy ingress; TLS. -Observability: Prometheus/Grafana; structured logs; tracing. -Billing: pay-as-you-go; mobile money (M-Pesa/Airtel); per-token + per-GPU-hour meters. -Data: in-region storage; encrypted; no training on customer data. -Limitations: fewer regions vs AWS; GPU supply fluctuates; docs still early. -Roadmap: more African regions, spot-GPU pricing, fine-tune jobs. -Ask: brutal feedback on scaling strategy + CLI UX. |