| HN Mirror



	by odo1242 598 days ago
	What made it so much faster based on just a software update?

3 comments

anon291 598 days ago

Ex-Cerebras engineer here. The chip is very powerful and there is no 'one way' to do things. Rearchitecting data flow, changing up data layout, etc., can lead to significant performance improvements. That's just my informed speculation. There's likely more perf somewhere.


campers 598 days ago

  The first implementation of inference on the Wafer Scale Engine utilized only a fraction of its peak bandwidth, compute, and IO capacity. Today’s release is the culmination of numerous software, hardware, and ML improvements we made to our stack to greatly improve the utilization and real-world performance of Cerebras Inference.
 
  We’ve re-written or optimized the most critical kernels such as MatMul, reduce/broadcast, element wise ops, and activations. Wafer IO has been streamlined to run asynchronously from compute. This release also implements speculative decoding, a widely used technique that uses a small model and large model in tandem to generate answers faster.


germanjoey 598 days ago

They said in the announcement that they've implemented speculative decoding, so that might have a lot to do with it.

A big question is what they're using as their draft model; there are ways to do it losslessly, but they could also choose to trade off accuracy for a bigger increase in speed.

It seems they also support only a very short sequence length. (1k tokens)


bubblethink 598 days ago

Speculative decoding does not trade off accuracy. You reject the speculated tokens if the original model does not accept them, kind of like branch prediction. All these providers and third parties benchmark each other's solutions, so if there is a drop in accuracy, someone will report it. Their sequence length is 8k.
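
To make the accept/reject step concrete, here is a minimal sketch of the greedy (argmax) case. Everything in it is an assumption for illustration: `toy_target` and `toy_draft` are hypothetical stand-ins for the large and small models, and nothing here reflects how Cerebras actually implements it. The point is only that the output is identical to decoding with the large model alone, whatever the draft model guesses.

```python
from typing import Callable, List

Token = int
# Hypothetical model interface: given the tokens so far, return the next token (greedy).
Model = Callable[[List[Token]], Token]


def greedy_decode(target: Model, prompt: List[Token], n_new: int) -> List[Token]:
    """Baseline: the large model generates one token at a time."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(target(seq))
    return seq


def speculative_decode(target: Model, draft: Model, prompt: List[Token],
                       n_new: int, k: int = 4) -> List[Token]:
    """Greedy speculative decoding: draft proposes, target verifies."""
    seq = list(prompt)
    goal = len(prompt) + n_new
    while len(seq) < goal:
        # 1. The small model cheaply proposes up to k tokens.
        proposal: List[Token] = []
        for _ in range(min(k, goal - len(seq))):
            proposal.append(draft(seq + proposal))
        # 2. The large model checks each proposed position. A real system
        #    scores all k positions in one batched forward pass; this loop
        #    is sequential only to keep the sketch short.
        for guess in proposal:
            expected = target(seq)
            seq.append(expected)   # always keep the target's own token
            if expected != guess:
                break              # mismatch: discard the rest of the proposal
    return seq


def toy_target(seq: List[Token]) -> Token:
    # Hypothetical deterministic "large model".
    return (sum(seq) * 31 + len(seq)) % 50


def toy_draft(seq: List[Token]) -> Token:
    # Hypothetical "small model": agrees with the target except at every third position.
    tok = toy_target(seq)
    return (tok + 1) % 50 if len(seq) % 3 == 0 else tok


if __name__ == "__main__":
    prompt = [1, 2, 3]
    print(greedy_decode(toy_target, prompt, 12) ==
          speculative_decode(toy_target, toy_draft, prompt, 12))
```

A bad draft model only costs speed here, since more proposals get thrown away. With sampling instead of argmax, the usual scheme accepts a drafted token with probability min(1, p_target / p_draft) and resamples from the residual distribution on rejection, which preserves the large model's output distribution.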
