by ZYZ64738
104 days ago
> NTransformer
High-efficiency C++/CUDA LLM inference engine. Runs Llama 70B on a single RTX 3090 (24 GB VRAM) by streaming model layers through GPU memory via PCIe, with optional NVMe direct I/O that bypasses the CPU entirely. Untested: https://github.com/xaskasdf/ntransformer
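
For anyone wondering what "streaming model layers through GPU memory" might look like, here is a rough host-only sketch of the general idea. It is not taken from the NTransformer repo: `Layer`, `upload`, `apply`, and `forward_streamed` are hypothetical names, plain `std::vector` copies stand in for PCIe transfers, and a toy multiply stands in for a real layer. The point is only that two reusable slots are enough to run any number of layers, as long as the next layer is loaded while the current one is in use.

```cpp
#include <cstddef>
#include <vector>

// Stand-in for one layer's weights (hypothetical; a real engine holds tensors).
using Layer = std::vector<float>;

// Stand-in for a host-to-device copy. A real engine would issue an
// asynchronous transfer here so it overlaps with compute.
void upload(const Layer& host, Layer& slot) { slot = host; }

// Toy "forward pass": multiplies the activation by every weight in the slot.
float apply(const Layer& slot, float x) {
    for (float w : slot) x *= w;
    return x;
}

// Runs every layer in order while keeping at most two layers resident
// in the "device" slots at any time.
float forward_streamed(const std::vector<Layer>& host_layers, float x) {
    if (host_layers.empty()) return x;

    Layer slots[2];
    upload(host_layers[0], slots[0]);  // prefetch the first layer

    for (std::size_t i = 0; i < host_layers.size(); ++i) {
        // Load layer i+1 into the slot that layer i-1 no longer needs.
        if (i + 1 < host_layers.size()) {
            upload(host_layers[i + 1], slots[(i + 1) % 2]);
        }
        // Compute with layer i from its own slot.
        x = apply(slots[i % 2], x);
    }
    return x;
}
```

Calling `forward_streamed({{2.0f}, {3.0f}, {0.5f}}, 1.0f)` pushes the input through three one-weight layers using only the two slots. In this sketch the upload and the compute run one after the other; overlapping them (separate streams or threads) is where the actual speed would come from.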