|
While I agree with the bulk of what you said, I do think it's important to understand that the difference with the Open Science Collaboration isn't really what you said. In my opinion it may be worse. Being a (somewhat disillusioned) economist, I've read some of these papers but certainly not all of them. What I can tell you is that there isn't an "experiment" to replicate. These papers seem to be mostly or entirely computational macroeconomics and econometrics. In work like this, they design a model (a simple example is the real business cycle model), pump random data into it, and see how different changes affect the model's output (e.g. the relationship between the volatility of unemployment and the volatility of output). The question is then whether those relationships match what we see in the data. Replication as you define it (and I agree with the definition) should mean pumping new random data into the model and still getting the same results. However, that still leaves a few big issues, such as: are these relationships really in the data? Some of them change depending on the time frame. So the results may actually explain what occurred, but they shouldn't be used to explain what will occur later. For reproduction and re-analysis, for this research, they probably need to go together. If we've defined a mathematical model, then it should be possible to program that model across platforms and software and still get consistent results with different sources of random input data. And verification, I think, is really important, because I know I've messed up my programs before and gotten completely reasonable output that turned out to be incorrect. The authors didn't exactly describe how much they verified that the programs were doing what they were supposed to do.
Honestly, I don't know that I can make myself care too much about the output results from the model until we can agree on what things are important for the model to show in the past, present, and future. And in academic economics, these important characteristics are almost canon and untouchable. |
This hits too close to home for me not to comment on. I do basically exactly this - redevelop models into production-quality code for broader deployment. I do this for 'closed' models as well (code that researchers have not made available for download from a website, for whatever reasons - mostly because they don't care, which is fine). Models being 'closed' this way does not make them 'black boxes' or 'not reproducible' - whatever the code does needs to be described in the paper(s) anyway (the concepts, not the implementation details).
The way to do a baseline verification of the implementation of models is by having minimal synthetic data sets and doing unit tests on them. Usually people develop their model based on their full 20000-observation or whatever data set, with numbers with 15 digits etc. - the only way to spot mistakes in such an implementation is if they are several orders of magnitude off.
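To make that concrete, here is a minimal sketch in Python of what I mean by a unit test on a tiny synthetic data set. The function `ols_slope` is only a hypothetical stand-in for one component of a model (it is not taken from any of the papers or code discussed here); the point is that with three hand-picked points the correct answer is known exactly, so even a small implementation error shows up.

```python
def ols_slope(x, y):
    """Least-squares slope of y on x (hypothetical stand-in for a model component)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx


def test_slope_on_exact_line():
    # Three points lying exactly on y = 2x + 1, so the slope must be 2.
    x = [0.0, 1.0, 2.0]
    y = [1.0, 3.0, 5.0]
    assert abs(ols_slope(x, y) - 2.0) < 1e-12


def test_slope_is_zero_for_constant_y():
    # A flat series has no relationship with x, so the slope must be 0.
    assert abs(ols_slope([0.0, 1.0, 2.0], [4.0, 4.0, 4.0])) < 1e-12


if __name__ == "__main__":
    test_slope_on_exact_line()
    test_slope_is_zero_for_constant_y()
    print("all checks passed")
```

Nothing here depends on the full 20000-observation data set: the inputs are small enough to verify by hand, which is exactly what makes a wrong answer visible.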
I once found a calculation in some Fortran code that mistook kilometers for meters (or the other way around, can't remember; either way, the result was that one component of the model was off by a factor of 1000). This hadn't been discovered in 10+ years, by many users, some of whom (much to my horror) actually used this model to advise on subsidies for certain sectors. Now, the results weren't completely unreasonable, otherwise someone would have noticed; it's the small mistakes that are the worst, especially when they are non-linear. Despite that and many examples like it that I have encountered, it proves to be nigh impossible to change the software development hygiene of most researchers.
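A rough illustration of why such a mistake can hide for years, again in Python. Everything here is hypothetical: `transport_cost`, its `base` and `rate` parameters, and the logarithmic form are assumptions made up for the example, not the Fortran model in question. The input is wrong by a factor of 1000, but because the component is non-linear the output moves by less than a factor of two, which is easy to mistake for a plausible result. A hand-checkable test pins it down immediately.

```python
import math


def transport_cost(distance_km, base=10.0, rate=2.0):
    """Hypothetical non-linear model component; expects distance in kilometers."""
    return base + rate * math.log1p(distance_km)


distance_km = 50.0
correct = transport_cost(distance_km)
# The bug: meters passed where kilometers are expected.
buggy = transport_cost(distance_km * 1000.0)
ratio = buggy / correct
print(f"correct={correct:.2f} buggy={buggy:.2f} ratio={ratio:.2f}")


def test_known_values():
    # At zero distance log1p(0) == 0, so the cost is exactly the base.
    assert transport_cost(0.0) == 10.0
    # At distance e - 1 km the log term is 1, so the cost is base + rate.
    assert abs(transport_cost(math.e - 1.0) - 12.0) < 1e-9


test_known_values()
```

A test like `test_known_values` would fail right away if the function were called with meters internally, whereas eyeballing the output of a full run would not obviously flag it.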