HN Mirror



	by grandmczeb 3038 days ago
	Lots of modern models have very late-binding variables that are hard to precompile for (sentence length in NMT, for example). That means you're going to need to do some form of specialization at runtime, so a JIT makes sense.


deepnotderp 3037 days ago

Just treat it as an infinite loop; there's no need to JIT in an optimized version that late.

grandmczeb 3037 days ago

One of the core operations of the transformer network[1] is a (LxL) x (LxE) matrix multiply (where L is the sentence length and E is the network width). Can you be more specific about how you would get good performance without specializing on L?

[1] https://arxiv.org/abs/1706.03762

deepnotderp 3037 days ago

You use the loop-based GEMM kernel and pass the input sizes in as the loop bounds.
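
A minimal sketch of that idea in Python, assuming plain nested lists for the matrices. The names `gemm` and `attention_weighted_sum` are made up for illustration and do not come from any library. The point is only that L and E arrive as runtime values and become loop bounds, so one kernel serves every sentence length.

```python
def gemm(a, b, m, k, n):
    """Naive loop-based GEMM: C (m x n) = A (m x k) times B (k x n).

    m, k and n are ordinary runtime arguments used as loop bounds,
    so nothing here is specialized to a particular size.
    """
    c = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for p in range(k):
            a_ip = a[i][p]
            for j in range(n):
                c[i][j] += a_ip * b[p][j]
    return c


def attention_weighted_sum(weights, values):
    """Hypothetical wrapper for the (LxL) x (LxE) product discussed above."""
    L = len(weights)       # sentence length, known only at runtime
    E = len(values[0])     # network width
    return gemm(weights, values, L, L, E)
```

This works for any L, but the loop structure is the same whether L is 1 or 1000, which is what the reply below objects to.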

grandmczeb 3037 days ago

L can be as small as 1 or larger than 512. Small L calls for different optimizations than large L, and a loop-based GEMM doesn't help with that.
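
To illustrate the objection, here is a hedged sketch of choosing a kernel per sentence length at runtime. The cutoffs `SMALL_L` and `TILE` and all function names are hypothetical. A real system would tune them per hardware, and a JIT would generate the specialized kernel instead of picking from a fixed set.

```python
SMALL_L = 16   # hypothetical cutoff between "small" and "large" L
TILE = 32      # hypothetical tile size for the blocked kernel


def scale_row(w, v):
    """L == 1: the (1x1) x (1xE) product is a scalar times a row."""
    return [[w[0][0] * x for x in v[0]]]


def gemm_naive(w, v):
    """Small L: plain triple loop with no tiling overhead."""
    L, E = len(w), len(v[0])
    out = [[0.0] * E for _ in range(L)]
    for i in range(L):
        for p in range(L):
            w_ip = w[i][p]
            for j in range(E):
                out[i][j] += w_ip * v[p][j]
    return out


def gemm_blocked(w, v, tile=TILE):
    """Large L: same arithmetic, iterated tile by tile for locality."""
    L, E = len(w), len(v[0])
    out = [[0.0] * E for _ in range(L)]
    for i0 in range(0, L, tile):
        for p0 in range(0, L, tile):
            for j0 in range(0, E, tile):
                for i in range(i0, min(i0 + tile, L)):
                    for p in range(p0, min(p0 + tile, L)):
                        w_ip = w[i][p]
                        for j in range(j0, min(j0 + tile, E)):
                            out[i][j] += w_ip * v[p][j]
    return out


def select_kernel(L):
    """Pick a kernel once L is known, i.e. specialize at runtime."""
    if L == 1:
        return scale_row
    if L <= SMALL_L:
        return gemm_naive
    return gemm_blocked


def attention_matmul(w, v):
    return select_kernel(len(w))(w, v)
```

All three kernels compute the same product; only the loop structure changes with L, which a single runtime-bounded loop nest cannot express.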
