| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by killerstorm 636 days ago
	Speech is inherently easier to represent as a sequence of tokens than a high-resolution image. Best speech to text is already NN transformer based anyway, so in theory it's only better to use a combined model