by tdfirth 1799 days ago

This isn't a criticism - I'm just curious to hear people's thoughts on this. When I look at this code, one of my initial reactions is that it does not seem to be very thoroughly tested. Sure, certain modules have been tested (e.g. `model.quat_affine`) but it's not clear how completely. Meanwhile, other modules, for example `model.folding`, have not been tested at all, despite containing large amounts of complex logic. That kind of code that works with arrays is very easy to get wrong and bugs are difficult to spot.
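To make the point about array code concrete, here is a minimal sketch of what property-style unit tests for geometric code can look like. It is not taken from the repository under discussion: `quat_to_rot` and `apply` are hypothetical helpers written for this illustration, using the standard quaternion-to-rotation-matrix formula in pure Python.

```python
import math

def quat_to_rot(w, x, y, z):
    """Rotation matrix (nested lists) for a quaternion; input is normalized first.

    Hypothetical helper for illustration, not the project's own code.
    """
    n = math.sqrt(w * w + x * x + y * y + z * z)
    w, x, y, z = w / n, x / n, y / n, z / n
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]

def apply(rot, v):
    """Multiply a 3x3 matrix by a 3-vector."""
    return [sum(rot[i][j] * v[j] for j in range(3)) for i in range(3)]

def close(a, b):
    return math.isclose(a, b, abs_tol=1e-12)

def test_identity_quaternion_leaves_vectors_unchanged():
    v = [0.3, -1.5, 2.0]
    out = apply(quat_to_rot(1.0, 0.0, 0.0, 0.0), v)
    assert all(close(a, b) for a, b in zip(out, v))

def test_quarter_turn_about_z_maps_x_to_y():
    half = math.pi / 4  # half of a 90 degree rotation angle
    out = apply(quat_to_rot(math.cos(half), 0.0, 0.0, math.sin(half)), [1.0, 0.0, 0.0])
    assert all(close(a, b) for a, b in zip(out, [0.0, 1.0, 0.0]))

def test_rows_are_orthonormal():
    # Holds for any nonzero quaternion, so an arbitrary one is used here.
    rot = quat_to_rot(0.3, -1.2, 0.5, 2.0)
    for i in range(3):
        for j in range(3):
            dot = sum(rot[i][k] * rot[j][k] for k in range(3))
            assert close(dot, 1.0 if i == j else 0.0)

if __name__ == "__main__":
    test_identity_quaternion_leaves_vectors_unchanged()
    test_quarter_turn_about_z_maps_x_to_y()
    test_rows_are_orthonormal()
    print("all checks passed")
```

A sign error in any off-diagonal term, the kind of bug that is hard to spot by eye, would fail at least one of these checks within milliseconds.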

My experience working with code written by researchers is that it frequently contains a large number of bugs, which brings the whole project into question. I've also found that encouraging them to write tests greatly improves the situation. Additionally, when they get the hang of testing they often come to enjoy it, because it gives them a way to work on the code without running the entire pipeline (which is a very slow feedback loop). It also gives them confidence that a change hasn't led to a subtle bug somewhere.

Again, I'm not criticising. I am aware that there are many ways to produce high quality software and Google/DeepMind have a good reputation for their standards around code review, testing etc. I am, however, interested to understand how the team that wrote this think about and ensure accuracy.

In general, I hope that testing and code review become a central part of the peer review process for this kind of work. Without it, I don't think we can trust results. We wouldn't accept mathematical proofs that contained errors, so why would we accept programs that are full of bugs?

edit: grammar

4 comments

benschulz 1799 days ago

My understanding is that it has been manually tested, i.e. it has produced correct results for previously intractable problems. I'm not sure how much automated testing would add at that point.

dmos62 1798 days ago

Unit testing usually isn't easily replaced by manual testing. If you have, for example, 3 units that can be in 2 different modes each, that's 2^3 different combinations, but only 2*3 unit modes. Testing the end result is more work than testing the units.
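The arithmetic can be sketched in a few lines of Python. `count_cases` is a hypothetical helper written for this illustration; it assumes every unit has the same number of modes and that the units are independent.

```python
from itertools import product

def count_cases(n_units: int, n_modes: int) -> tuple[int, int]:
    """Return (end-to-end combinations, per-unit cases)."""
    # End-to-end: each unit's mode varies independently, so enumerate the product.
    end_to_end = sum(1 for _ in product(range(n_modes), repeat=n_units))
    # Unit-level: each unit is exercised once in each of its modes, in isolation.
    per_unit = n_units * n_modes
    return end_to_end, per_unit

for n in (3, 5, 10):
    combos, units = count_cases(n, 2)
    print(f"{n} units: {combos} end-to-end combinations vs {units} unit cases")
```

With 3 units the gap is small (8 vs 6), but the end-to-end count grows exponentially while the unit count grows linearly: at 10 units it is 1024 vs 20.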

dekhn 1798 days ago

Discovery science is different from web software engineering. Most discovery scientists use manual testing, not unit testing. Very few actually do integration tests or system tests (this is something I'm trying to change).

And, given the external results of the application, it's unclear to me how much additional value would come from a rigorous testing system.

dmos62 1798 days ago

> Very few actually do integration tests or system tests (this is something I'm trying to change).

Care to expand on what you're trying to do?

dekhn 1798 days ago

Sure, I'm trying to merge continuous integration with workflows/pipelines. It's all stuff that I learned at Google and is non-proprietary. The idea is to have presubmit checks that invoke a full instance of a complex pipeline, but on canned data (synthetic, pseudo-anonymized, or otherwise not directly connected to the prod system), as an integration test. This catches many errors that would be hard to debug later in a prod workflow.

In a sense, I see software testing/big web data and modern large scale data processing in science as a continuum and I want to bring the practices from the big web data and testing fields to bear on science pipelines.
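As a rough illustration of the presubmit idea, the sketch below runs a complete (toy) pipeline on a small canned input and asserts on the final output. Everything here is hypothetical: the three stages, the `sample_id,value` format, and `CANNED_INPUT` are stand-ins invented for the example, not part of any real system.

```python
import math

# Hypothetical three-stage pipeline standing in for a real scientific workflow.
def parse(lines):
    """Turn 'sample_id,value' lines into (id, float) pairs, skipping blank lines."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        sample_id, value = line.split(",")
        rows.append((sample_id, float(value)))
    return rows

def normalize(rows):
    """Rescale values so that they sum to 1."""
    total = sum(v for _, v in rows)
    if total == 0:
        raise ValueError("cannot normalize: values sum to zero")
    return [(sid, v / total) for sid, v in rows]

def summarize(rows):
    """Report the sample count and the sample with the largest share."""
    top_id, top_share = max(rows, key=lambda r: r[1])
    return {"n_samples": len(rows), "top_sample": top_id, "top_share": top_share}

def run_pipeline(lines):
    return summarize(normalize(parse(lines)))

# Canned synthetic input: small, checked in with the code, unrelated to prod data.
CANNED_INPUT = ["s1,1.0", "s2,3.0", "", "s3,4.0"]

def test_pipeline_on_canned_data():
    # Exercises every stage together, the way a presubmit check would.
    result = run_pipeline(CANNED_INPUT)
    assert result["n_samples"] == 3
    assert result["top_sample"] == "s3"
    assert math.isclose(result["top_share"], 0.5)

if __name__ == "__main__":
    test_pipeline_on_canned_data()
    print("integration test passed")
```

In a real setup the test would be wired into whatever CI system gates submissions; the point is only that the whole pipeline runs, quickly, on data whose expected output is known in advance.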

dmos62 1798 days ago

Apart from a shift in mental attitude, is it primarily about getting a dataset for the integration test?

allyourhorses 1798 days ago

Prior to this model, protein folding hadn't seen significant advancements in a decade or more. Worrying about the lack of tests in a first-of-its-kind model is very much akin to complaining about the choice of font in a user manual for the world's first warp drive. I understand you're attempting to frame the problem in terms of things you know, but trying to weigh down pioneering research with professional development ceremony is very much counterproductive. The 'missing' ceremony would not have contributed to the strength of AlphaFold's result; the model's only purpose was to compete within the context of an existing validation framework.

plutonorm 1799 days ago

Because it passes the huge number of integration tests.

miltondts 1798 days ago

Research code is highly volatile: the details and architecture change a lot. It is much more important to invest the time in writing more experimental code and validating it with e2e functional tests that don't need to change than to constantly rewrite both the code and the tests.
