Introduction, by R. Michael Alvarez
Today’s data scientist must know how to write good code. Whether they work with a commercial off-the-shelf statistical software package or with R, Python, or Perl, all of these tools require good coding practices. Large and complex datasets need lots of manipulation to wrangle them into shape for analytics, statistical estimation is often complex, and presenting complicated results sometimes requires writing lots of code. To make sure that code is understandable to its author and to others, good coding practices are essential.
Many who teach methodology, statistics, and data science are increasingly teaching their students how to write good computer code. As a practical matter, if a professor requires that students turn in their code for a problem set, that code needs to be well-crafted enough to be legible to the instructor. But as increasing numbers of our students write and distribute their code and software tools to the public, we need to do more professionally to train students to write good code. Finally, good code is critical for research replication and transparency: if you can’t understand someone’s code, it may be difficult or impossible to reproduce their analysis.
When I first started teaching methods to graduate students, there was little in the methodological literature that I found useful for teaching graduate students good coding practices. But in 1995, my colleague Jonathan Nagler wrote out some great guidance on good methodological practices, in particular guidelines for good coding style. His piece is available online (“Coding Style and Good Computing Practices”), and his advice from 1995 is as relevant today as it was then. I use Jonathan’s guidelines in my graduate teaching.
Over the past few years, as Political Analysis has focused resources on research replication and transparency, it’s become clear that we need to develop better guidance for researchers and authors regarding how to write good code. One of the biggest issues we run into when we review replication materials submitted to the journal is poor documentation and unclear code. If we can’t figure out how the code works, I’m sure that our readers will have the same problem.
We’ve been thinking of developing guidelines for the documentation of replication materials, along with standards for coding practices. As part of that effort, I asked Jonathan to write an update of his 1995 essay and to reflect on how good computing practices have evolved since then. His thoughts are below, and I encourage readers to also read Jonathan’s original 1995 essay.
* * * * *
Coding style and good computing practices: it is easy to get the style right, harder to get good practice, by Jonathan Nagler, NYU
Many years ago I was prompted to write Coding Style and Good Computing Practices, an article laying out guidelines for coding style for political scientists. The article was reprinted in a symposium on replication in PS (September 1995, Vol. 28, No. 3, 488-492). According to Google Scholar it has rarely been cited, but I’m convinced it has been read quite often, because I’ve seen some of its idiosyncratic suggestions show up in the code of other political scientists. Then again, re-reading the article reminds me how many people have not read it, or have simply ignored it.
Here is a list of basic points reproduced from that article:
- Labbooks: essential.
- Command files: they should be kept.
- Data-manipulation vs. data-analysis: these should be in distinct files.
- Keep tasks compartmentalized (‘modularity’).
- Know what the code is supposed to do before you start.
- Don’t be too clever.
- Variable names should mean something.
- Use parentheses and white-space to make code readable.
- Documentation: all code should include comments meaningful to others.
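Several of these points can be shown in a few lines. Here is a minimal Python sketch; the data, coding scheme, and variable names are invented for illustration, not taken from the article:

```python
# Recode a hypothetical 7-point party-identification scale so that the new
# variable's name and direction match the substantive idea it measures:
# higher values = more Republican.

def recode_republican_lean(party_id_7pt):
    """Center the 7-point scale at zero so the sign indicates direction."""
    if party_id_7pt is None:       # propagate missing data explicitly
        return None
    return party_id_7pt - 4        # -3 = Strong Democrat ... +3 = Strong Republican

raw_party_id = [1, 4, 7, 2, None, 6]   # 1 = Strong Dem ... 7 = Strong Rep
republican_lean = [recode_republican_lean(p) for p in raw_party_id]
print(republican_lean)   # [-3, 0, 3, -2, None, 2]
```

A name like `republican_lean` carries the substantive meaning and direction that a name like `v23rec` would not, and the comments tell a reader what the numeric codes stand for.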
And I concluded with a list of rules:
- Maintain a labbook from the beginning of a project to the end.
- Code each variable so that it corresponds as closely as possible to a verbal description of the substantive hypothesis the variable will be used to test.
- Errors in code should be corrected where they occur and the code re-run.
- Separate tasks related to data-manipulation vs data-analysis into separate files.
- Each program should perform only one task.
- Do not try to be as clever as possible when coding. Try to write code that is as simple as possible.
- Each section of a program should perform only one task.
- Use a consistent style regarding lower and upper case letters.
- Use variable names that have substantive meaning.
- Use variable names that indicate direction where possible.
- Use appropriate white-space in your programs, and do so in a consistent fashion to make them easy to read.
- Include comments before each block of code describing the purpose of the code.
- Include comments for any line of code if the meaning of the line will not be unambiguous to someone other than yourself.
- Rewrite any code that is not clear.
- Verify that missing data is handled correctly on any recode or creation of a new variable.
- After creating each new variable or recoding any variable, produce frequencies or descriptive statistics of the new variable and examine them to be sure that you achieved what you intended.
- When possible, automate things and avoid placing hard-wired values (those computed ‘by-hand’) in code.
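The rules on verifying missing-data handling, examining frequencies after a recode, and avoiding hard-wired values can be sketched as follows (a hypothetical Python example; the survey codes are invented):

```python
from collections import Counter

MISSING_CODE = 99   # named constant, not a hard-wired 99 scattered through the code

raw_vote = [1, 2, 1, 99, 2, 1, 99]   # 1 = incumbent, 2 = challenger, 99 = missing

# Recode, mapping the missing-data code explicitly rather than letting it
# masquerade as a real value.
voted_incumbent = [
    None if v == MISSING_CODE else int(v == 1)
    for v in raw_vote
]

# Produce frequencies of the new variable and inspect them before moving on:
# the missing cases should appear as missing, not as zeros or ones.
print(Counter(voted_incumbent))
```

Here the tabulation would show three 1s, two 0s, and two missing cases; if the 99s had silently become real values, the frequencies would reveal the error immediately.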
Those are still very good rules; I would not change any of them. I would add one: put comments in any paper identifying the piece of code that produced each figure or table in the paper. In 20 years a lot has changed about how we do computing, and it has gotten much easier to follow good computing practices. GitHub has made it easy to share code, maintain revision history, and publish code. And the set of people who seamlessly collaborate by sharing files over Dropbox or one of its competitors probably dwarfs the number of political scientists using GitHub. But to paraphrase a common computing aphorism (GIGO), sharing or publishing badly written code won’t make it easy for people to replicate or build on your work.
I was motivated to write that article because, as I stated then, most political scientists aren’t trained as computer programmers. Nor were most political scientists trained to work in a laboratory. So the article covered both coding style and computing practice, to make sure that an entire research project could be reproduced by someone else. That means keeping track of where you got your data, how it was processed, and so on.
Any computer code is a set of instructions that produces results when read by a machine, and we can evaluate the code based on the results it produces. But when we share code we expect it to be read by humans. Two pieces of code can be functionally equivalent (they produce identical results when read by a machine) even though one is easy for a human to read and understand while the other is pretty much unintelligible. If you expect people to use your code, you need to make it easy to read. I try to ask every graduate student I am going to work with to read several chapters from Brian W. Kernighan and Rob Pike’s The Practice of Programming (1999), especially the Preface, Chapters 1, 3, 5, 6, and the Epilogue.
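As a small illustration of that point, here are two functionally equivalent Python fragments (a made-up example): the machine cannot tell them apart, but a human reader can.

```python
responses = [1, None, 2, 2, None, 1]   # hypothetical survey responses

# Clever and compressed: correct, but the intent is buried.
n_valid_clever = sum(map(lambda r: r is not None, responses))

# Simple and direct: states its intent to the human reader.
n_valid = 0
for response in responses:
    if response is not None:
        n_valid += 1

assert n_valid == n_valid_clever   # identical results when read by a machine
print(n_valid)   # 4
```

Both count the non-missing responses; only the second one says so.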
It has turned out to be easier to write clean code than to maintain the good computing practices that make an entire research project easy to reproduce. It is fairly easy to post a ‘replication’ dataset and the code used to produce the figures and tables in a paper. But that doesn’t tell someone everything they need to know to reproduce your work, or to extend it to other data. They need to know how your data was generated, and those steps occur in the production of the replication dataset, not in its use.
Most research projects in political science pull in data from many sources, and many, many coding decisions are made along the way to a finished product. All of those decisions may be visible in the code, but keeping coherent labbooks is essential for sifting through all the lines of code of any large project. And ‘projects’ rarely stand alone anymore: work on one dataset is linked to many projects, often with overlapping sets of co-authors.
At the beginning of a research project it’s important for everyone to agree where the code is, where the data is, and what the overall structure of the documentation is. That means deciding whether documentation is grouped by project (which could mean by individual paper) or by dataset. And it means reaching agreement on whether a master document points to many smaller documents describing individual tasks, or whether the whole project description sits in a single document. None of this is exciting to work out, certainly not as exciting as doing the research. But it is essential. A good goal is to make it as easy as possible to release the whole bundle of documentation and code to the public when the time comes. That both saves time at release and imposes good habits and structure along the way.
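One hypothetical layout consistent with those decisions (a sketch under invented names, not a prescription):

```
project/
├── README.md            # master document pointing to the task-level docs
├── labbook.md           # dated record of data sources, decisions, and steps
├── data/
│   ├── raw/             # original files, never edited by hand
│   └── derived/         # produced only by the scripts in code/
├── code/
│   ├── 01_build_dataset.py   # data manipulation
│   └── 02_analysis.py        # data analysis: writes the figures and tables
└── docs/                # one short document per task
```

The particular names do not matter; what matters is that everyone on the project agrees on them up front.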
Heading image: Typing computer screen reflection by Almonroth. CC BY-SA 3.0 via Wikimedia Commons.