| (I work at OpenAI. Before that, I worked at Stripe. I've spent most of my software career thinking about how to build effective engineering cultures.) I think this code is actually well-written and maintainable. This is proven in practice because we've adopted it many places in OpenAI, and I've personally found it very easy to adapt to other use-cases (certainly much more so than the from-scratch Transformer implementations I've written!). As https://news.ycombinator.com/item?id=21456605 points out, the complexity of the code arises from the complexity of the underlying algorithm. Complexity due to software engineering concerns, like Tensorflow scopes, are elegantly handled. [edited for clarity:] Writing a Transformer in 174 lines of code requires a lot of deep thinking about the right underlying abstractions. > but essentially as soon as you enter the 'ml engineer/research engineer/research scientist' layer, it's doomed. We actually don't do this! Our only official technical title is "member of technical staff". (People sometimes choose to self-identify as an engineer or researcher, so you might see that on LinkedIn, but we don't have a distinction internally.) Everyone is responsible for their own code, and people care quite a bit about writing code that others can build on. |
So I don't have a big problem with some of the function definitions which can be compact, as the other comment points out.
The reason I don't like this code is that it does not comment anything on the critical bits. I don't necessarily care about whether you call the input to your matmul 'x' or 'tensor' or 'input' (although consistency is nice).
The thing that would stop be from absorbing and modifying this code is that it does not comment on all the bits that are non obvious to me if I haven't written a Transformer before. For example:
'Same as tf.matrix_band_part(tf.ones([nd, ns]), -1, ns-nd), but doesn't produce garbage on TPUs.' - I will have to ask the colleague what that means. Why not write out what the actual issue is instead of mysteriously hinting at some potential problem?
Code like this "q, k, v = map(split_heads, tf.split(c, 3, axis=2))" will require me re-reading the paper section, then printing out all the tensors to think about what tensor would have which shape at which point. Instead of writing relatively useless linecomments like '#Transformer', I would comment all non-trivial shape modifications with the current layout, and what we are trying to achieve while modifying the layout.
The other issue of my original comment was not specifically on that codebase, but I am sure you would admit that the baselines code was pretty much exactly what I was writing about re: ml scripts. That's not to denigrate its incredible usefulness to the community.
Since you mentioned spinning up, I thought I would add a few comments on that as well:
I think the spinning up code base is good at making the code compact, and terrible at making sense of data flow for beginners. There are a lot of line comments, but they do not actually explain what is conceptually going on but often just repeat short-hands.
For example, look at the PPO implementation: https://github.com/openai/spinningup/blob/master/spinup/algo...
Here, the function is returning pi, logp, and log_p_pi (and v). Do you know how incredibly confusing the distinction between these is for beginners? In particular, there is no explanation why logp_pi even needs to be stored in the buffer.
We could recompute it from the states and stop the gradient when computing the likelihood ratio. A sensible tutorial-level comment here may be something along the lines of computing the likelihood in the same forward pass as computing the action, so we can later use it to compute the likelihood ratio. We could also later re-compute this from the buffered states.
I will stop here but I hope my point comes across, whenever I read code from your repos, there are some good parts (conciseness, cute numerical tricks) but there is as general missing sense of thoughtfulness on what the code is really trying to convey to a reader. It shows in the comments and it shows in the code organisation.
As a final note, I have seen this in many organisations and I do not mean to call you out. There is just this quality degradation that inevitably happens when nobody is incentivised (read: promoted, rewarded) to think about these things for an organisation.
Managers at all levels typically don't because they don't get close enough to the subtle issues on a day to day level. If you are lucky, you get senior individual contributors who still look at code and and raise the bar for the entire org. My genuine recommendation to you is to look for that, because a manager won't do that, and more fresh grads can't do it.