| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by singhrac 383 days ago
	Production inference is not deterministic because of sharding (i.e. parameter weights on several GPUs on the same machine or MoE), timing-based kernel choices (e.g. torch.backends.cudnn.benchmark), or batched routing in MoEs. Probably best to host a small model yourself.