by parthsareen 268 days ago
Hey! I'm the author of the post. We haven't optimized sampling yet, so it's running linearly on the CPU. A lot of SOTA work either computes the mask while the model is running the forward pass or does the masking on the GPU.

The greedy accept is there so that the mask doesn't need to be computed. I'm planning to make this more efficient from both ends.
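
For anyone who hasn't seen the pattern, here's a rough sketch of what a greedy accept can look like in constrained sampling. This is not the code from the post, just an illustration under assumptions: `is_valid` is a hypothetical stand-in for the grammar check, the logits are a plain Python list, and the names `constrained_sample`, `softmax`, and `sample` are made up for the example. The idea is to sample from the unmasked distribution first and only fall back to the linear pass over the vocabulary when that token is rejected.

```python
import math
import random


def softmax(logits):
    # Numerically stable softmax; entries of -inf end up with probability 0.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample(probs, rng):
    # Draw one token id according to the given probabilities.
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def constrained_sample(logits, is_valid, rng):
    """Return (token_id, mask_computed).

    is_valid is a hypothetical callback standing in for the grammar check.
    """
    # Fast path: sample without a mask and check only that one token.
    candidate = sample(softmax(logits), rng)
    if is_valid(candidate):
        return candidate, False

    # Slow path: build the full mask with a linear pass over the vocabulary.
    masked = [x if is_valid(i) else -math.inf for i, x in enumerate(logits)]
    if all(x == -math.inf for x in masked):
        raise ValueError("no valid token under the constraint")
    return sample(softmax(masked), rng), True
```

When the model already puts most of its probability on valid tokens, the fast path is hit most of the time and the full mask is rarely built. Accepting on the first draw and otherwise resampling from the masked distribution gives each valid token the same probability as sampling from the masked distribution directly.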