	by pclmulqdq 906 days ago
	They are putting the whole LLM into SRAM across multiple compute chips, IIRC. That is a very expensive way to serve a model, but it should give great speed at low batch sizes.
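
A rough back-of-envelope sketch of the tradeoff described above, under a simple bandwidth-bound model: each decode step reads every weight once, so the step rate is capped at memory bandwidth divided by model size. All numbers here (model size, SRAM per chip, both bandwidth figures) are made-up illustrative assumptions, not specs of any real chip, and the function names are hypothetical.

```python
def chips_needed(model_bytes: int, sram_bytes_per_chip: int) -> int:
    """Chips required to hold all weights in on-chip SRAM (ceiling division)."""
    return -(-model_bytes // sram_bytes_per_chip)


def max_decode_tokens_per_s(model_bytes: int,
                            bandwidth_bytes_per_s: float,
                            batch_size: int = 1) -> float:
    """Upper bound on decode throughput when weight reads are the bottleneck.

    One decode step reads all weights once and yields one token per
    sequence in the batch. Compute limits and inter-chip communication
    are ignored, so this is only a ceiling.
    """
    steps_per_s = bandwidth_bytes_per_s / model_bytes
    return steps_per_s * batch_size


if __name__ == "__main__":
    # Illustrative assumptions only.
    MODEL_BYTES = 70 * 10**9 * 2      # 70B parameters at 2 bytes each
    SRAM_PER_CHIP = 200 * 10**6       # assumed 200 MB of SRAM per chip
    DRAM_BANDWIDTH = 3e12             # assumed 3 TB/s off-chip memory
    SRAM_BANDWIDTH = 100e12           # assumed 100 TB/s aggregate SRAM

    print("chips needed:", chips_needed(MODEL_BYTES, SRAM_PER_CHIP))
    for name, bw in (("DRAM", DRAM_BANDWIDTH), ("SRAM", SRAM_BANDWIDTH)):
        rate = max_decode_tokens_per_s(MODEL_BYTES, bw, batch_size=1)
        print(f"{name}: up to {rate:.0f} tokens/s at batch size 1")
```

In this model the cost shows up as the chip count, since SRAM capacity per chip is small, while the benefit shows up as the per-step rate at batch size 1. Larger batches amortize each weight read over more sequences, which is why the bandwidth advantage matters most when the batch is small.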