| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by reaslonik 218 days ago
	I'm running the huggingface's .safetensors with vLLM with as little starting parameters as possible. I thought it must not be sending temp right, but after setting temp to something else I got chinese so it should be sending it. Overall if you're memory constrained it's probably still worth to try and fiddle around with it if you can get it to work. Speedwise if you got the memory a 5090 can get ~50-100tok/s for a single query with 32B-AWQ and way more if you have something parallel like open-webui