From reading this blog, you may have inferred that I am increasingly becoming a reproducible-research freak.
Science is a collective enterprise. We may all be convinced of the naive positivist idea that there is an immaculate link between facts and theory, and that this is what makes science something special. But this link is hardly ever verifiable. We trust other people (referees, peers) to monitor each of the steps, and this is, in fact, one of the roles of publication: to enforce quality standards. However, the chain of manipulations that links the original file to the final output is full of small choices, many of them far from obvious. These choices should be discussed, all of them. This is where reproducible research comes in. Only if research can be reproduced is there a real hope of mutual monitoring. Even more, research should be reproducible by yourself and, when you work in a team, by your co-workers. It is not just about other people being able to look at your output: if you come back two weeks later to a piece of code you (or a co-worker) wrote, you may simply not remember what motivated it or where an output came from.
So far, so good: no one seems to be against this, at least in principle. However, standard practice, at least in economics and political science, is still very far from it. There are many tools for reproducible research out there, but something I would like to emphasize is a principle I will call: everything should be in your script. As much as possible, there should be a series of commands that links the data-collection stage to the final output.
When I say "everything" I do not mean only the flow of commands but also the choices made, which should be documented. It also implies making your code as readable as possible. But above all, it is at odds with the popular practice of manipulating data "by hand". People seem to think it is acceptable among honorable people to manipulate data in Microsoft Excel before loading it into your data-analysis software, or to play with it in the Stata command line and then save the file. If it is not obvious to you why this is problematic, consider that it breaks the link between the original data and the output: if manipulations are not done by and from the script, you cannot run the script on the original data and reproduce the output.
Many people seem to think that all this is an ideal, but that adopting it would be a big pain in the ass. Certainly, documenting everything, or using a regular expression to remove something you could handle in Excel with "find & replace", is perhaps time-consuming. This reminds me of debates about why "R is less user friendly than Stata". There is a grain of truth in it, but it mostly comes from established habits and community effects. I learned to do data analysis doing everything in .Rmd files, which allow for easy documentation of everything; I was taught early to document my functions, name my variables descriptively, and comment my code heavily. Finally, I usually treat doing something with a command, rather than clicking here and there with my mouse more quickly, as a small learning opportunity. All these are just habits I have managed to acquire, just as people get used to eating healthy, brushing their teeth, or giving up smoking, and they do not require effort or a substantial time investment: they come as second nature.
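To make the "regular expression instead of find & replace" point concrete, here is a minimal sketch in Python (I work in R, but the idea is the same in any scripting language; the function name and the currency-formatted column are hypothetical examples, not from any real dataset). Because the cleaning lives in the script, rerunning it on the raw file reproduces the output exactly:

```python
import re

def clean_income(raw):
    """Strip currency symbols and thousands separators from a raw string
    like "$1,200.50" -- the kind of fix often done by hand in Excel --
    and return a float."""
    return float(re.sub(r"[^0-9.]", "", raw))

# The cleaning step is part of the script, so the link from raw data
# to output is never broken:
raw_column = ["$1,200.50", "$987", "$15,000"]
clean_column = [clean_income(v) for v in raw_column]
print(clean_column)  # → [1200.5, 987.0, 15000.0]
```

The equivalent one-off edit in a spreadsheet would produce the same numbers once, but leave no record of what was changed or why.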
In political science, people seem to care increasingly about identification and research design. Research is not believed to be credible if causality is merely inferred from correlation. This is good, although I have mixed feelings about the excessive emphasis on Rubin-Imbens identification. But a more important revolution would be the "treat your data transparently" revolution.