Hacker News
by sireat 1201 days ago
Thank you for this!

I have an oldish (circa 2014) dual CPU Xeon v3 (24 cores/48 threads) with 128GB RAM gathering dust.

Have been curious how fast that old heap would run inference on the 65B model.

Time to find out now.

Anyone else try LLaMA on older CPUs with plenty of RAM?

1 comment

You only need 40GB of RAM for the largest model. Inference latency mostly depends on single-core performance and memory bus speed, because it has to crunch the whole 40GB for every token it produces.

If it's slower than you want, figure out which one is your bottleneck, because even 64GB of faster cheap RAM could be a 50% speedup if your CPU isn't the problem.
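
A rough way to put numbers on this: if every token has to stream the full 40GB of weights from RAM, then memory bandwidth divided by model size gives a ceiling on tokens per second. The Python sketch below is back-of-envelope only. The two bandwidth figures are made-up placeholders, not measurements of any real machine, and the function name is hypothetical. It ignores CPU limits, caches, and NUMA effects on dual-socket boards.

```python
def max_tokens_per_second(model_gb: float, bandwidth_gb_s: float) -> float:
    """Ceiling on tokens/s, assuming each token reads all weights from RAM once."""
    return bandwidth_gb_s / model_gb


MODEL_GB = 40.0  # model size quoted in the comment above

# Placeholder bandwidth values (GB/s), chosen only to illustrate the arithmetic.
SLOW_RAM_GB_S = 40.0
FAST_RAM_GB_S = 60.0

slow = max_tokens_per_second(MODEL_GB, SLOW_RAM_GB_S)
fast = max_tokens_per_second(MODEL_GB, FAST_RAM_GB_S)
speedup = fast / slow - 1.0  # fractional gain from the faster RAM

print(f"slow RAM ceiling: {slow:.2f} tokens/s")
print(f"fast RAM ceiling: {fast:.2f} tokens/s")
print(f"speedup if memory-bound: {speedup:.0%}")
```

If your measured tokens/s sits well below the ceiling for your actual RAM bandwidth, the CPU is more likely the bottleneck and faster RAM won't help much.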