Hacker News new | ask | show | jobs
by scosman 639 days ago
They use GPUs under the hood for inference/fine-tuning and charge by token. Fireworks will even let you deploy a Lora serverless at the same pricing as base model.

But not aware of any “lambda”-like serverless for any old CUDA workload. Given loading times, it wouldn’t really make sense. Something like CloudRun or KNative for GPUs would be cool.