

by sireat 1201 days ago

Thank you for this!

I have an oldish (circa 2014) dual-CPU Xeon v3 machine (24 cores/48 threads) with 128GB of RAM gathering dust.

I've been curious how fast that old heap would run inference on the 65B model.

Time to find out now.

Has anyone else tried LLaMA on older CPUs with plenty of RAM?


MacsHeadroom 1201 days ago

You only need about 40GB of RAM for the largest (65B) model. Inference latency mostly depends on single-core performance and memory bus speed, because it has to crunch the whole 40GB for every token it produces.
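A rough sketch of that reasoning: if every weight has to be read from RAM once per generated token, memory bandwidth divided by model size gives an upper bound on tokens per second. The four-channel DDR4-2133 configuration below is an assumption for illustration, not a measured figure for the machine in the parent post, and real throughput will land below this theoretical peak.

```python
GB = 1e9


def ddr_bandwidth(mega_transfers_per_s: float, channels: int, bus_bytes: int = 8) -> float:
    """Theoretical peak DRAM bandwidth in bytes/s (each 64-bit channel moves 8 bytes per transfer)."""
    return mega_transfers_per_s * 1e6 * bus_bytes * channels


def max_tokens_per_second(model_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Upper bound on generation speed if all weights are streamed from RAM once per token."""
    return bandwidth_bytes_per_s / model_bytes


if __name__ == "__main__":
    model_bytes = 40 * GB  # size quoted in the comment above

    # Assumption: a single socket with 4 channels of DDR4-2133.
    bw = ddr_bandwidth(2133, channels=4)
    print(f"peak bandwidth: {bw / GB:.1f} GB/s")
    print(f"upper bound: {max_tokens_per_second(model_bytes, bw):.2f} tokens/s")
```

With these assumed numbers the ceiling works out to roughly 1.7 tokens/s, which is why memory speed matters so much more than core count here.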

If it's slower than you want, figure out which of the two is your bottleneck. Even 64GB of faster, cheap RAM could give a 50% speedup if your CPU isn't the problem.
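One way to reason about "which one is your bottleneck" is a toy model in which each token takes the longer of the memory time and the compute time. The ~2 FLOPs per parameter per token figure is a common rule of thumb, and the bandwidth and FLOP/s values below are made up for illustration, not benchmarks.

```python
def time_per_token(model_bytes, bandwidth, flops_per_token, flops_per_s):
    """Seconds per token, assuming memory traffic and compute don't overlap any better than max()."""
    memory_time = model_bytes / bandwidth
    compute_time = flops_per_token / flops_per_s
    return max(memory_time, compute_time)


def speedup_from_faster_ram(model_bytes, old_bw, new_bw, flops_per_token, flops_per_s):
    """Ratio of old to new time per token when only the RAM bandwidth changes."""
    before = time_per_token(model_bytes, old_bw, flops_per_token, flops_per_s)
    after = time_per_token(model_bytes, new_bw, flops_per_token, flops_per_s)
    return before / after


if __name__ == "__main__":
    model_bytes = 40e9
    flops_per_token = 2 * 65e9  # rule of thumb: ~2 FLOPs per parameter per token

    # Hypothetical fast CPU: memory-bound, so 50% more bandwidth gives ~1.5x.
    print(speedup_from_faster_ram(model_bytes, 40e9, 60e9, flops_per_token, 1e12))
    # Hypothetical slow CPU: compute-bound, so faster RAM gives no speedup.
    print(speedup_from_faster_ram(model_bytes, 40e9, 60e9, flops_per_token, 100e9))
```

The `max()` is a deliberate simplification: real runtimes overlap loads and arithmetic imperfectly, but it captures the point that faster RAM only pays off when the CPU isn't the limit.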
