The joy of data cleaning

I was having a conversation with my girlfriend, who was struggling to plot the relationship between two variables in Excel. It’s easy to lose perspective on how isolated from the real world one is. I told her that some things are basically infeasible, or extremely cumbersome, in Excel if you are trying to do any serious data work. At the time, I did not know how to explain something that was so evident to me. To start with, a lot of data does not come in Excel format. She answered that she did not think she would ever do anything esoteric enough that the data would not be available in an Excel format.

During the last several months, I’ve come to enjoy everything that has to do with data manipulation. Just like my girlfriend, only three years ago I would never have appreciated the importance of this task. I assumed you could just download the data from the internet, load it into R and then apply the relevant analysis. The intermediate steps sounded boring and trivial. Well, they are not.

I was discussing this with a friend working in the data industry, who told me that he had the same impression: data wrangling is fun because, unlike formal modeling, it poses a set of small challenges with instantaneous reward. On top of that, it’s something I feel one can do with a moderate level of concentration: while listening to metal or the radio, or re-watching a movie.

Nor is it a trivial activity. It is creative work, perhaps more on the coding than the theoretical side, but creative all the same. It involves choices and tradeoffs that can, potentially, affect the output of your analysis: how you treat missing data, how you group variables, which information you choose to keep, and which to discard. There are all sorts of tiny details that may in the end make a difference.
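
To make the missing-data point concrete, here is a minimal sketch in Python. The numbers are made up and the function names are purely illustrative; it only shows that two defensible-looking ways of handling missing answers give different averages for the same column.

```python
from statistics import mean

# Hypothetical survey answers (made-up numbers); None marks a missing response.
incomes = [30, None, 50, 40, None]


def mean_dropping_missing(values):
    """Average only the observed values, ignoring missing ones."""
    observed = [v for v in values if v is not None]
    return mean(observed)


def mean_missing_as_zero(values):
    """Average after recoding every missing value as zero."""
    recoded = [0 if v is None else v for v in values]
    return mean(recoded)


dropped = mean_dropping_missing(incomes)
zero_filled = mean_missing_as_zero(incomes)

# Same column, two cleaning choices, two different answers.
print("dropping missing:", dropped)
print("missing as zero:", zero_filled)
```

Neither choice is right in general. The point is only that the choice happens before any model is fitted, and it already changes the result.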

An important implication of the above has to do with reproducible research. If all data were like national accounts, perhaps a single line of code with a link to the website of the relevant government agency would do the job of importing and loading the data, and this would be a minor nuisance. Most data are not like that, particularly the kind we use in the social sciences. The fact that preprocessing data involves choices implies that these choices should be part of the discussion as much as any other part of the research. Pre-treatment mistakes can have large consequences (a single line of code in which you forgot to specify the right kind of regression, for example), and since so many tiny things can go wrong, this process should be monitored by peers.

In practice, this is hardly the case. Even political science journals that ask for replication files do not have proper standards for documenting them. The pre-analysis part is often omitted (data files are just released), undocumented or unjustified. Debugging someone else’s code, especially if it is undocumented or badly written, is not the most fun activity for a Saturday night. The upshot is that the lack of transparency is potentially enormous.
