by throwaway1249 922 days ago
This comparison is not fair, since the RTX 4090 does not have enough VRAM to hold the whole model.

I have tested llama.cpp on both an M2 and a 4090:

- The prompt ingestion time on the M2 is pretty slow.
- The extra memory of the M2 allows one to try more interesting models (Mixtral) and run multiple models at the same time.
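
For anyone who wants to sanity-check the "does it fit" part, here is a rough back-of-the-envelope sketch. It only estimates the size of the weights from parameter count and bits per weight. The numbers are my assumptions, not measurements: roughly 46.7B total parameters for Mixtral 8x7B, 24 GiB for the 4090, about 4.5 bits per weight for a typical 4-bit quantization, and a made-up 2 GiB allowance for KV cache and buffers. The function names are hypothetical, not part of llama.cpp.

```python
def weights_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GiB."""
    return n_params * bits_per_weight / 8 / 2**30


def fits(n_params: float, bits_per_weight: float,
         capacity_gib: float, overhead_gib: float = 2.0) -> bool:
    """True if weights plus an assumed fixed overhead fit in capacity_gib.

    overhead_gib is a placeholder for KV cache and scratch buffers;
    the real figure depends on context length and backend.
    """
    return weights_gib(n_params, bits_per_weight) + overhead_gib <= capacity_gib


if __name__ == "__main__":
    MIXTRAL_PARAMS = 46.7e9   # assumed approximate total parameter count
    RTX_4090_GIB = 24.0       # assumed usable VRAM

    for bits in (16, 8, 4.5):
        size = weights_gib(MIXTRAL_PARAMS, bits)
        ok = fits(MIXTRAL_PARAMS, bits, RTX_4090_GIB)
        print(f"{bits:>4} bits/weight: ~{size:.1f} GiB, fits in 24 GiB: {ok}")
```

Once the weights spill out of VRAM, some layers have to run from system RAM, which is why the comparison stops being apples to apples. The same function can be pointed at whatever unified memory size your Mac has.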