	by reissbaker 214 days ago
	Depends on how well the speculator predicts the model's output for your prompts, assuming you're using speculative decoding: unusual prompts are slower, but e.g. TypeScript code diffs should be very fast. For SGLang, IME you also want a larger chunked prefill size and a larger max batch size for CUDA graphs than the defaults.
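
	To make the "depends on the speculator" point concrete, here's a rough sketch of the standard back-of-envelope estimate for speculative decoding: if each drafted token is accepted independently with probability alpha and the speculator drafts k tokens per step, the expected number of tokens produced per target-model forward pass is (1 - alpha^(k+1)) / (1 - alpha). The function name and the acceptance rates below are hypothetical, picked only for illustration; real acceptance rates aren't independent per token and vary by model and workload.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per target-model forward pass.

    Assumes each of the k drafted tokens is accepted independently with
    probability alpha (a simplification), and that the target model always
    contributes one token of its own per verification step.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        # Every draft token is accepted, plus the target model's own token.
        return float(k + 1)
    # Geometric series: 1 + alpha + alpha^2 + ... + alpha^k
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


if __name__ == "__main__":
    # Hypothetical acceptance rates, for illustration only.
    scenarios = [("unusual prompt", 0.3), ("predictable code diff", 0.9)]
    for label, alpha in scenarios:
        tokens = expected_tokens_per_step(alpha, k=4)
        print(f"{label}: alpha={alpha}, ~{tokens:.2f} tokens per step")
```

	With a low acceptance rate you get barely more than one token per step, so you pay the drafting overhead for almost nothing; with a high one you approach k + 1 tokens per step. For the SGLang side, the knobs I mean are, as far as I know, the `--chunked-prefill-size` and `--cuda-graph-max-bs` server arguments; check the server arguments docs for your version before relying on those names.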