Hacker News new | ask | show | jobs
by gdb 2410 days ago
(I work at OpenAI. Before that, I worked at Stripe. I've spent most of my software career thinking about how to build effective engineering cultures.)

I think this code is actually well-written and maintainable. This is proven in practice because we've adopted it many places in OpenAI, and I've personally found it very easy to adapt to other use-cases (certainly much more so than the from-scratch Transformer implementations I've written!).

As https://news.ycombinator.com/item?id=21456605 points out, the complexity of the code arises from the complexity of the underlying algorithm. Complexity due to software engineering concerns, like Tensorflow scopes, are elegantly handled. [edited for clarity:] Writing a Transformer in 174 lines of code requires a lot of deep thinking about the right underlying abstractions.

> but essentially as soon as you enter the 'ml engineer/research engineer/research scientist' layer, it's doomed.

We actually don't do this! Our only official technical title is "member of technical staff". (People sometimes choose to self-identify as an engineer or researcher, so you might see that on LinkedIn, but we don't have a distinction internally.) Everyone is responsible for their own code, and people care quite a bit about writing code that others can build on.

2 comments

Ok, since you took the time to respond, I just want to be constructive as well:

So I don't have a big problem with some of the function definitions which can be compact, as the other comment points out.

The reason I don't like this code is that it does not comment anything on the critical bits. I don't necessarily care about whether you call the input to your matmul 'x' or 'tensor' or 'input' (although consistency is nice).

The thing that would stop be from absorbing and modifying this code is that it does not comment on all the bits that are non obvious to me if I haven't written a Transformer before. For example:

'Same as tf.matrix_band_part(tf.ones([nd, ns]), -1, ns-nd), but doesn't produce garbage on TPUs.' - I will have to ask the colleague what that means. Why not write out what the actual issue is instead of mysteriously hinting at some potential problem?

Code like this "q, k, v = map(split_heads, tf.split(c, 3, axis=2))" will require me re-reading the paper section, then printing out all the tensors to think about what tensor would have which shape at which point. Instead of writing relatively useless linecomments like '#Transformer', I would comment all non-trivial shape modifications with the current layout, and what we are trying to achieve while modifying the layout.

The other issue of my original comment was not specifically on that codebase, but I am sure you would admit that the baselines code was pretty much exactly what I was writing about re: ml scripts. That's not to denigrate its incredible usefulness to the community.

Since you mentioned spinning up, I thought I would add a few comments on that as well:

I think the spinning up code base is good at making the code compact, and terrible at making sense of data flow for beginners. There are a lot of line comments, but they do not actually explain what is conceptually going on but often just repeat short-hands.

For example, look at the PPO implementation: https://github.com/openai/spinningup/blob/master/spinup/algo...

Here, the function is returning pi, logp, and log_p_pi (and v). Do you know how incredibly confusing the distinction between these is for beginners? In particular, there is no explanation why logp_pi even needs to be stored in the buffer.

We could recompute it from the states and stop the gradient when computing the likelihood ratio. A sensible tutorial-level comment here may be something along the lines of computing the likelihood in the same forward pass as computing the action, so we can later use it to compute the likelihood ratio. We could also later re-compute this from the buffered states.

I will stop here but I hope my point comes across, whenever I read code from your repos, there are some good parts (conciseness, cute numerical tricks) but there is as general missing sense of thoughtfulness on what the code is really trying to convey to a reader. It shows in the comments and it shows in the code organisation.

As a final note, I have seen this in many organisations and I do not mean to call you out. There is just this quality degradation that inevitably happens when nobody is incentivised (read: promoted, rewarded) to think about these things for an organisation.

Managers at all levels typically don't because they don't get close enough to the subtle issues on a day to day level. If you are lucky, you get senior individual contributors who still look at code and and raise the bar for the entire org. My genuine recommendation to you is to look for that, because a manager won't do that, and more fresh grads can't do it.

Hello! Spinning Up author here.

Very reasonable point that it is not clearly explained why you need to store logp_pi in the buffer. But the reason is that it would require additional code complexity to calculate it on the fly later. The likelihood ratio requires the denominator to be on the _old_ policy, so if you wanted to compute it on the fly, you would need to have a second policy in the computation graph to preserve the old policy while you change the current policy. You could not simply do a stop_gradient on the current policy and get the same results.

My personal feeling is that tutorial-style explanations like this don't fit nicely into code comment flow. As a result, most tutorial-style descriptions went into material on the Spinning Up website rather than into the code. It isn't 100% comprehensive, certainly, but RL has an enormous surface area (there are tons and tons of little details that teaching material could dive into) and I feel pretty good about what we were able to cover. :)

Thank you for responding. Well, my point is that in particular the gradient on the likelihood ratio is what trips people up. They ask questions like 'why is this ratio not always 1' or similar. This is why I would say explaining what is going where here is critical, i.e. that we save the prior logp_pi (even though we could recompute it) to treat it as a constant value when computing the ratio/the gradient. That would be, from my perspective, the key pedagogical moment of a PPO tutorial. However his is purely subjective and I agree that one can feel differently about where to put explanations.
I’m very sorry to see someone who obviously cares so much to be defending this code. This does not follow best practices, and using complexity of the underlying algorithm is just an excuse. Complex code can be beautiful and well documented.

Writing a complex method in 174 lines is not elegant nor beautiful. Writing a well documented file that can take an engineer in a different specialty and bring them up to speed in 1,000 lines is.

No matter how much you comment your code, you are not going to bring people up to speed on an algorithm that requires background knowledge on dozens of scientific papers, hundreds of pages, with a few code comments. This code is aimed at researchers who are familiar with the techniques and have the necessary background knowledge. For such people, the code is very readable.

Think about it like this: If you write a game engine, are you going to document each function with proofs that explain how the underlying Physics works for people who don't have such knowledge? No, you assume that people who read game engine code have read some basic physics books.

We also have code like that. For example, that's the explicit goal of the Spinning Up repo: https://github.com/openai/spinningup/blob/master/spinup/algo...

In practice, it's much harder to use that code, and we tend not to consume code like that internally. There's a real tradeoff!

ddpg() takes 17 parameters and is over 200 lines long. I'm very far from being a domain expert, but having worked in other complex domains, I'm pretty confident this can be redesigned such that it's both more maintainable and more pleasant to use.
Hello! Spinning Up author here. I would love to hear your thoughts on this! So far I have had a lot of success teaching people about DDPG using this code example, but I'm grateful for every fresh perspective. :)

Feel free to reach out by email, jachiam at openai.

There is no function in the world that should ever take 17 parameters. If the algorithm permits such configuration, as I am sure it does, then it should take a configuration object which has all these values. The object could then be constructed using special purpose factories that take fewer parameters, and then customized from there as needed.

It may be an indication that the whole thing needs refactoring though.

You refactor that way but then you make it unnecessarily more complicated.
Well then I’m sorry for that. It’s a good indicator of a broken culture.