> SANA-WM uses only ~213K public video clips with metric-scale pose supervision, completes training in 15 days on 64 H100s, and generates each 60-second clip on a single GPU. Its distilled variant, quantized to NVFP4, denoises a 60-second 720p clip in 34 seconds on a single RTX 5090.
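
To put the quoted figures in perspective, the short sketch below converts them into two derived numbers: total training GPU-hours and the ratio of clip length to denoising time. The helper names `gpu_hours` and `realtime_factor` are illustrative and not taken from the source. The second number covers only the denoising time quoted above, so it should not be read as an end-to-end latency figure.

```python
# Back-of-the-envelope arithmetic on the figures quoted above.
# Helper names are hypothetical; only the input numbers come from the text.

def gpu_hours(days: float, gpus: int) -> float:
    """Total GPU-hours for a run of `days` days on `gpus` GPUs."""
    return days * 24 * gpus


def realtime_factor(clip_seconds: float, wall_seconds: float) -> float:
    """Seconds of video produced per second of wall-clock time."""
    return clip_seconds / wall_seconds


# 15 days on 64 H100s
train_gpu_hours = gpu_hours(15, 64)

# 60-second 720p clip denoised in 34 seconds (distilled variant, RTX 5090)
speedup = realtime_factor(60, 34)

print(f"Training budget: {train_gpu_hours:,.0f} H100 GPU-hours")  # 23,040
print(f"Denoising speed: {speedup:.2f}x real time")               # 1.76x
```

In other words, the stated training budget comes to roughly 23K H100 GPU-hours, and the distilled variant denoises about 1.76 seconds of 720p video per second of wall-clock time.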