by sanchitmonga22
94 days ago
Fair criticism. Our benchmarks are on small models because MetalRT
was built for the voice pipeline use case, where decode latency
on 0.6B-4B models is the bottleneck.

You're right that the bigger opportunity on Apple Silicon is large
models that don't fit on consumer GPUs. Expanding MetalRT to 7B,
14B, and 32B+ is on the roadmap. MetalRT's architectural advantages
should matter even more at that scale, where everything becomes
memory-bandwidth-bound.

We'll publish benchmarks on larger models as we add support. If you
have a specific model or size you'd like to see first, let us know;
that helps us prioritize.
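
To make the memory-bandwidth-bound point concrete, here is a rough back-of-envelope sketch, not a MetalRT benchmark. It assumes single-stream decode has to read every weight once per token, so the ceiling on tokens/s is roughly memory bandwidth divided by model size in bytes. The 400 GB/s bandwidth and 4-bit (0.5 bytes per parameter) quantization are hypothetical placeholder values, and the function name is made up for illustration.

```python
def decode_tokens_per_second_ceiling(params_billions, bytes_per_param, bandwidth_gb_s):
    """Rough upper bound on single-stream decode speed.

    Assumes each generated token requires reading all weights once,
    and ignores KV cache traffic, activations, and compute time.
    """
    # 1e9 params * bytes per param = model size in GB
    model_size_gb = params_billions * bytes_per_param
    return bandwidth_gb_s / model_size_gb


# Hypothetical values for illustration only, not measured MetalRT numbers.
ASSUMED_BANDWIDTH_GB_S = 400.0
ASSUMED_BYTES_PER_PARAM = 0.5  # 4-bit quantized weights

for size_b in (0.6, 4, 7, 14, 32):
    ceiling = decode_tokens_per_second_ceiling(
        size_b, ASSUMED_BYTES_PER_PARAM, ASSUMED_BANDWIDTH_GB_S
    )
    print(f"{size_b:>4}B model: at most ~{ceiling:.0f} tok/s")
```

Under these assumptions the ceiling falls in proportion to model size, which is why the fraction of peak bandwidth a runtime actually sustains matters more as models get larger.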