by zackangelo 668 days ago
The L40S has 48GB of memory, so I'm curious how they're able to run Llama 3.1 70B on it. The weights alone would exceed this. Maybe they mean quantized/fp8?

I just had to implement GPU clustering in my inference stack to support Llama 3.1 70B, and even then I needed 2x A100 80GB SXMs.
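
For anyone who wants to check the arithmetic, here's a rough back-of-the-envelope sketch. It assumes a nominal 70e9 parameters and counts weights only (no KV cache, activations, or runtime overhead), so real requirements will be higher. The helper names are made up for illustration.

```python
# Rough weight-only memory estimate for a "70B" model.
# Assumptions: 70e9 parameters (nominal), 1 GB = 1e9 bytes,
# and no allowance for KV cache, activations, or framework overhead.
PARAMS = 70e9

# Bytes stored per parameter at common precisions.
BYTES_PER_PARAM = {"fp16/bf16": 2.0, "fp8/int8": 1.0, "int4": 0.5}


def weight_gb(params: float, bytes_per_param: float) -> float:
    """Size of the weights alone, in GB."""
    return params * bytes_per_param / 1e9


def weights_fit(params: float, bytes_per_param: float,
                gpu_gb: float, num_gpus: int = 1) -> bool:
    """True if the weights alone fit in the combined GPU memory."""
    return weight_gb(params, bytes_per_param) < gpu_gb * num_gpus


if __name__ == "__main__":
    for name, nbytes in BYTES_PER_PARAM.items():
        size = weight_gb(PARAMS, nbytes)
        print(f"{name:10s} {size:6.1f} GB | "
              f"1x 48GB: {weights_fit(PARAMS, nbytes, 48)} | "
              f"2x 80GB: {weights_fit(PARAMS, nbytes, 80, 2)}")
```

By this estimate the weights alone don't fit in 48GB at fp16 or at fp8, only at around 4 bits per weight, while fp16 weights do fit across 2x 80GB with some room left for everything else.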

I was initially running my inference servers on fly.io because they were so easy to get started with, but I eventually moved elsewhere because the prices were so high. When someone there e-mailed me, I pointed out that it was really expensive compared to others, and they basically just waved me away.

For reference, you can get an A100 SXM 80GB spot instance on Google Cloud right now for $2.04/hr ($5.07 regular).

1 comment

Our standard A100 SXM 80GB price is $3.50/hr, for what it's worth.
For reference, that's at least 40% more than what an H100 SXM would cost if you are willing to reserve for a month (so not apples to apples).

An H100 will also be much faster, especially if you are willing to use fp8, maybe 3-4x.
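
To put the hourly rates quoted in this thread side by side, here's a small sketch. The 730 hours/month figure and the 2-GPU setup (matching the 2x A100 mentioned above) are assumptions, and the H100 number is only the ceiling implied by "at least 40% more", not a quoted price.

```python
# Compare the A100 SXM 80GB hourly rates quoted in the thread.
# Assumptions: 730 hours in an average month, 2 GPUs running continuously.
HOURS_PER_MONTH = 730
NUM_GPUS = 2

# USD per GPU-hour, as quoted in the comments above.
RATES = {
    "Google Cloud spot": 2.04,
    "Google Cloud regular": 5.07,
    "standard price from the reply": 3.50,
}


def monthly_cost(rate_per_hour: float, num_gpus: int = NUM_GPUS,
                 hours: float = HOURS_PER_MONTH) -> float:
    """Cost in USD of running num_gpus GPUs for the given number of hours."""
    return rate_per_hour * num_gpus * hours


def implied_ceiling(price: float, premium: float) -> float:
    """If price is at least `premium` (e.g. 0.40) above some other price,
    that other price is at most price / (1 + premium)."""
    return price / (1 + premium)


if __name__ == "__main__":
    for name, rate in RATES.items():
        print(f"{name:30s} ${rate:.2f}/hr  ~${monthly_cost(rate):,.0f}/month")
    # Inferred, not quoted: upper bound on the reserved H100 SXM rate.
    print(f"implied H100 SXM ceiling: ${implied_ceiling(3.50, 0.40):.2f}/hr")
```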