Friday, August 7, 2015

Data Cleaning

I previously wrote about the importance of hacking skills for economics researchers and said that it may be one of the most important econometrics lessons you never learned.

In one of his regular 'metrics Monday posts, Marc Bellemare recently wrote about data cleaning. He made a lot of great points, and this one in particular stands out to me:

"So what I suggest–and what I try to do myself–is to write a .do file that begins by loading raw data files (i.e., Excel or ASCII files) in memory, merges and appends them with one another, and which documents every data-cleaning decision via embedded comments (in Stata, those comments are lines that begin with an asterisk) so as to allow others to see what assumptions have been made and when. This is like writing a chemistry lab report which another chemist could use to replicate your work.

"Lastly, another thing I did when I first cleaned data was to “replicate” my own data cleaning: When I had received all the files for my dissertation data in 2004, the data were spread across a dozen spreadsheets. I first merged and cleaned them and did about a month’s worth of empirical work with the data. I then decided to re-merge and re-clean everything from scratch just to make sure I had done everything right the first time."
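Concretely, a cleaning script along the lines Bellemare describes might look something like the sketch below. The file names, variable names, and specific decisions are all hypothetical; the point is the self-documenting style, where every choice is recorded as a comment next to the command that implements it:

* clean_data.do : hypothetical sketch of a self-documenting cleaning script
version 13
clear all
set more off

* Load the raw household file exported from Excel
import excel using "raw/households.xlsx", firstrow clear
* Decision: drop records with no household identifier (they cannot be merged)
drop if missing(hhid)
save "work/households.dta", replace

* Load the raw plot-level file (comma-delimited ASCII)
import delimited using "raw/plots.csv", clear
* Decision: plot area was reported in acres; convert to hectares
replace area = area * 0.4047
save "work/plots.dta", replace

* Merge plots onto households and document why unmatched records are dropped
use "work/households.dta", clear
merge 1:m hhid using "work/plots.dta"
* Decision: unmatched plots come from a pilot village excluded from the survey
keep if _merge == 3
drop _merge
save "work/analysis_sample.dta", replace

The particular commands matter less than the fact that every assumption lives in the script itself, so anyone (including your future self) can rerun it from the raw files and land on the same analysis sample.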


The ability to document everything you do with the data and to ensure repeatability and accuracy is priceless! I get the feeling this is not practiced religiously in a lot of places, and it makes me cringe every time I hear about results from a ‘study.’ In the corporate space where I work, collaboration with other groups and teams goes a lot smoother if everyone can get back to the same starting point when it comes to processing the data. This is especially helpful when pulling data together from the myriad of servers and non-traditional sources and formats that characterize today's big data challenges. And in the academic space, just think about the Quarterly Journal of Political Science's requirement that a replication package accompany article submissions. Sometimes this sort of 'janitor' work can constitute up to 80% of a data scientist's workload, but it's well worth the effort if done appropriately. Good documentation throughout the process adds to the workload, but it pays dividends.

See also:
In God we trust, all others show me your code.
Data Science, 10% inspiration, 90% perspiration
