by grandmczeb 3038 days ago
Lots of modern models have very late-binding variables which are hard to precompile for (sentence length in NMT, for example). That means you're going to need to do some form of specialization at runtime, so a JIT makes sense.

Just treat it as an infinite loop; there's no need to JIT in an optimized version that late.
One of the core operations of the transformer network[1] is a (LxL) x (LxE) matrix multiply (where L is the sentence length and E is the network width). Can you be more specific about how you would get good performance without specializing on L?

[1] https://arxiv.org/abs/1706.03762

You use the loop-based GEMM kernel and pass the input sizes in as the loop bounds.
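
To make that suggestion concrete, here is a rough sketch of a GEMM whose loop bounds are ordinary runtime values, so a single compiled kernel handles any L. The names `gemm_loops` and `attention_matmul` are made up for illustration, and pure Python stands in for what would really be a compiled kernel.

```python
def gemm_loops(a, b, m, k, n):
    """Compute C = A @ B with plain loops.

    m, k and n are runtime arguments used as loop bounds, so nothing
    about the kernel is specialized on the problem size.
    """
    c = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for p in range(k):
            a_ip = a[i][p]
            for j in range(n):
                c[i][j] += a_ip * b[p][j]
    return c


def attention_matmul(scores, values):
    """(L x L) x (L x E) multiply, with L and E read from the inputs at call time."""
    L = len(scores)
    E = len(values[0])
    return gemm_loops(scores, values, L, L, E)
```

The same function body serves L = 1 and L = 512; only the trip counts change.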
L can be as small as 1 or larger than 512. For small L it makes sense to do different optimizations than for large L. A loop-based GEMM doesn’t help with that.
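
As a rough illustration of what specializing on L could look like, the sketch below picks a different code path once L is known at runtime. Everything here is an assumption made for the example: the function names, the cutoff of 16, and the tile size of 64 are hypothetical, and a real JIT would emit different machine code rather than branch in Python.

```python
def matmul_naive(a, b):
    """Plain triple loop; low overhead, reasonable when L is small."""
    rows, inner, cols = len(a), len(b), len(b[0])
    c = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for p in range(inner):
            for j in range(cols):
                c[i][j] += a[i][p] * b[p][j]
    return c


def matmul_tiled(a, b, tile=64):
    """Blocked loops; the extra bookkeeping only pays off for large L."""
    rows, inner, cols = len(a), len(b), len(b[0])
    c = [[0.0] * cols for _ in range(rows)]
    for i0 in range(0, rows, tile):
        for p0 in range(0, inner, tile):
            for j0 in range(0, cols, tile):
                for i in range(i0, min(i0 + tile, rows)):
                    for p in range(p0, min(p0 + tile, inner)):
                        a_ip = a[i][p]
                        for j in range(j0, min(j0 + tile, cols)):
                            c[i][j] += a_ip * b[p][j]
    return c


def pick_kernel(L, small_l=16):
    """Choose a strategy from the runtime sentence length (hypothetical cutoff)."""
    if L == 1:
        return "scale"
    return "naive" if L <= small_l else "tiled"


def attention_matmul(scores, values, small_l=16):
    """(L x L) x (L x E) multiply, dispatched on L."""
    kind = pick_kernel(len(scores), small_l)
    if kind == "scale":
        # L == 1: the product collapses to scaling a single row.
        s = scores[0][0]
        return [[s * v for v in values[0]]]
    if kind == "naive":
        return matmul_naive(scores, values)
    return matmul_tiled(scores, values)
```

The L == 1 branch shows the point most directly: the whole multiply degenerates into one row scaling, which a generic loop nest would not exploit.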