by mobicham 804 days ago
Hey Daniel! The VRAM usage is still the same as what a pure n-bit model would take. Because we only need the meta-data for a single nn.Linear at a time, you only need an additional (3GB-1.7GB)/224 ≈ 5.8MB. If we compress the meta-data as well, that would become much lower.
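
To spell out the arithmetic, here is a small sketch. It assumes that 3GB is the size with all meta-data included, that 1.7GB is the pure n-bit weights alone, that 224 is the number of nn.Linear layers, and that units are decimal (1 GB = 1000 MB). The comment does not state any of these explicitly, and the names are illustrative.

```python
# Assumed reading of the figures in the comment (decimal units).
TOTAL_WITH_METADATA_GB = 3.0   # assumed: weights plus all meta-data
WEIGHTS_ONLY_GB = 1.7          # assumed: pure n-bit weights only
NUM_LINEAR_LAYERS = 224        # assumed: number of nn.Linear layers


def per_layer_metadata_mb(total_gb, weights_gb, num_layers):
    """Average meta-data size per layer in MB (1 GB = 1000 MB)."""
    return (total_gb - weights_gb) * 1000 / num_layers


extra_mb = per_layer_metadata_mb(
    TOTAL_WITH_METADATA_GB, WEIGHTS_ONLY_GB, NUM_LINEAR_LAYERS
)
print(f"{extra_mb:.1f} MB")  # prints "5.8 MB"
```

This is an average, so individual layers could need more or less than 5.8MB depending on their shape.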

Hey :)) Oh I love the idea of async movement from CPU to GPU ahead of time - ingenious! Prefetching a small amount of metadata seems reasonable and very smart!
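
A rough sketch of the prefetching pattern, using only the Python standard library so it runs anywhere. `load_metadata` and `run_layer` are hypothetical stand-ins: a real version would copy the meta-data from CPU to GPU (in PyTorch, probably from pinned memory with `tensor.to("cuda", non_blocking=True)` on a separate CUDA stream) and run the actual nn.Linear. This is not the project's implementation, just the overlap idea.

```python
from concurrent.futures import ThreadPoolExecutor


def load_metadata(layer_idx):
    """Hypothetical stand-in for copying one layer's meta-data from CPU to GPU."""
    return {"layer": layer_idx, "scale": 1.0 + layer_idx}


def run_layer(layer_idx, metadata, x):
    """Hypothetical stand-in for the layer's forward pass."""
    return x * metadata["scale"]


def forward_with_prefetch(num_layers, x):
    """Run layers in order, fetching layer i+1's meta-data while layer i computes."""
    if num_layers == 0:
        return x
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(load_metadata, 0)
        for i in range(num_layers):
            metadata = pending.result()  # wait for this layer's meta-data
            if i + 1 < num_layers:
                # Start the next fetch before computing, so the two overlap.
                pending = pool.submit(load_metadata, i + 1)
            x = run_layer(i, metadata, x)
    return x


result = forward_with_prefetch(3, 1.0)
print(result)  # prints 6.0 (scales 1.0, 2.0 and 3.0 applied in turn)
```

Only one layer's meta-data is in flight beyond the one in use, which is why the extra memory stays at roughly one or two layers' worth.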