Hacker News new | ask | show | jobs
by fatihturker 103 days ago
Author here.

The architecture page explains how ternary quantization, dynamic sparsity, and mmap layer streaming work together to push models far beyond normal RAM limits.

Happy to answer questions about the implementation or benchmarks.