Hacker News new | ask | show | jobs
by sheepscreek 181 days ago
What’s with GPU engineers using such unreadable variable names (to anyone outside the immediate domain)?

It’s the equivalent of doing this for compound interest rate calculation:

# A = P * (1 + r/n)^(nt) P = 10000 r = 0.06 n = 12 t = 5 A = P (1 + r / n) * (n * t)

Compared to this:

principal = 10_000 annual_interest_rate = 0.06 compounds_per_year = 12 years = 5

future_value = principal * (1 + annual_interest_rate / compounds_per_year) * (compounds_per_year * years)

My question is partly rhetorical - I know the answer lies with the tight research and mathematical origins. But that makes it research code IMO, not what I would consider high quality software code.

4 comments

I think it's a combination of multiple factors. I worked with GPU kernel codes before and the code that you write has a tendency of never being updated or modified. once it works it works perfectly and you do not change it. if you get new hardware you're going to fully rewrite it. so, typically readability is just not useful. also, you're never working with variables that make sense to humans. it's never something tangible. it's always tiles, offsets, indices. i do not think, at least when I was writing the code for GPUS to waste space visual space on better variable naming was worthwhile.
PhD dropout here: When you’re implementing a math algorithm you can’t really self document. So you have the pdf of the paper and a clear formula, then best to link to that and just implement the formula exactly with same variables.
I'm a former Ruby guy who ended up in stats/ML for a time. I think it's all about information density.

Let's use your example of `A = P (1 + r / n) * (n * t)` -- I can immediately see the shape of the function and how all the variables interrelated. If I'm comfortable in the domain, I also know what the variables mean. Finally, this maps perfectly to how the math is written.

If you look at everything in the post, all of the above apply. Every one in the domain has seen Q = query, K = key, V = value a billion times, and some variation of (B, N_h, T, D_h). Frankly, I've had enough exposure that after I see (B, N_h, T, D_h) once, I can parse (32, 8, 16, 16) without thinking.

I like you found this insane when I started studying stats, but overtime I realized there a lot to be gained once you've trained yourself to speak the language.

This brought up memory of Hungarian notation. I think now I will try to use it in my PyTorch code to solve the common problem I have with NN code: keeping track of tensor shapes and their meanings.

  B, T, E = x.size() # batch size, sequence length, embedding dimensionality
  
  q, k, v = self.qkv(x).split(self.embedding, dim=-1)
  q, k, v = map(lambda y: y.view(B, T, self.heads, E // self.heads).transpose(1, 2))

  attention = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))
  ...
vs

  B, T, E = bteX.size()

  iHeadSize = E // self.heads
  bteQ, bteK, bteV = self.qkv_E_3E(bteX).split(E, dim=-1)
  bhtiQ, bhtiK, bhtiV = map(lambda y: y.view(B, T, self.heads, iHeadSize).transpose(1, 2))

  bhttAttention = (bhtiQ @ bthiK.transpose(-2, -1)) * (1.0 / iHeadSize)
Looks uglier but might be easier to reason about.
Bad programmers. Researchers usually (though sometimes not) are bad at programming. Hence why I don’t do projects for academia.