Replication redux and Facebook data

Political Analysis

Political Analysis chronicles the exciting developments in the field of political methodology, with contributions to empirical and methodological scholarship outside the diffuse borders of political science. It is published on behalf of The Society for Political Methodology and the Political Methodology Section of the American Political Science Association. Political Analysis is ranked #5 out of 157 journals in Political Science by 5-year impact factor, according to the 2012 ISI Journal Citation Reports. Like Political Analysis on Facebook and follow @PolAnalysis on Twitter.

By Nathaniel Beck
January 19^th 2015

Introduction, from Michael Alvarez, co-editor of Political Analysis

Recently I asked Nathaniel Beck to write about his experiences with research replication. His essay, published on 24 August 2014 on the OUPblog, concluded with a brief discussion of a recent experience of his when he tried to obtain replication data from the authors of a recent study published in PNAS, on an experiment run on Facebook regarding social contagion. Since then the story of Neal’s efforts to obtain this replication material have taken a few interesting twists and turns, so I asked Neal to provide an update — because the lessons from his efforts to get the replication data from this PNAS study are useful for the continued discussion of research transparency in the social sciences.

Replication redux, by Nathaniel Beck

When I last wrote about replication for the OUPblog in August (“Research Replication in Social Science”), there was one smallish open question (about my own work) and one biggish question (on whether I would ever see the Kramer et al., “Experimental evidence of massive-scale emotional contagion through social networks”, replication file, which was “in the mail”). The Facebook story is interesting, so I start with that.

After not hearing from Adam Kramer of Facebook, even after contacting PNAS, I persisted with both the editor of PNAS (Inder Verma, who was most kind) and with the NAS through “well connected” friends. (Getting replication data should not depend on knowing NAS members!). I was finally contacted by Adam Kramer, who offered that I could come out to Palo Alto to look at the replication data. Since Facebook did not offer to fly me out, I said no. I was then offered a chance to look at the replication files in the Facebook office 4 blocks from NYU, so I accepted. Let me stress that all dealings with Adam Kramer were highly cordial, and I assume that delays were due to Facebook higher ups who were dealing with the human subjects firestorm related to the Kramer piece.

When I got to the Facebook office I was asked to sign a standard non-disclosure agreement, which I did. To my surprise this was not a problem, with the only consequence being that a security officer would have had to escort me to the bathroom. I then was put in a room with a Facebook secure notebook with the data and R-studio loaded; Adam Kramer was there to answer questions, and I was also joined by a security person and an external relations person. All were quite pleasant, and the security person and I could even discuss the disastrous season being suffered by Liverpool.

I was given a replication file which was a data frame which had approximately 700,000 rows (one for each respondent) and 7 columns containing the number of positive and negative words used by each respondent as well as the total word count of each respondent, percentages based on these numbers, experimental condition. and a variable which omitted some respondents for producing the tables. This is exactly the data frame that would have been put in an archive since it contained all the data needed to replicate the article. I also was given the R-code that produced every item in the article. I was allowed to do anything I wanted with that data, and I could copy the results into a file. That file was then checked by Facebook people and about two weeks later I received the entire file I created. All good, or at least as good as it is going to get.

Intel team inside Facebook data center. Intel Free Press. CC BY 2.0 via Wikimedia Commons.

The data frame I played with was based on aggregating user posts so each user had one row of data, regardless of the number of posts (and the data frame did not contain anything more than the total number of words posted). I can understand why Facebook did not want to give me the data frame, innocuous as it seemed; those who specialize in de-de-identifying private data and reverse engineering code are quite good these days, and I can surely understand Facebook’s reluctance to have this raw data out there. And I understand why they could not give me all the actual raw data, which included how feeds were changed and so forth; this is the secret sauce that they would not like reverse engineered.

I got what I wanted. I could see their code, could play with density plots to get a sense of words used, I could change the number of extreme points dropped, and I could have moved to a negative binomial instead of a Poisson. Satisfied, I left after about an hour; there are only so many things one can do with one experiment on two outcomes. I felt bad that Adam Kramer had to fly to New York, but I guess this is not so horrible. Had the data been more complicated I might have felt that I could not do everything I wanted, and running a replication with 3 other people in a room is not ideal (especially given my typing!).

My belief is that that PNAS and the authors could simply have had a different replication footnote. This would have said that the code used (about 5 lines of R, basically a call to a Poisson regression using GLM) is available at a dataverse. In addition, they could have noted that the GLM called used the data frame I described, with the summary statistics for that data frame. Readers could then see what was done, and I can see no reason for such a procedure to bother Facebook (though I do not speak for them). I also note a clear statement on a dataverse would have obviated the need for some discussion. Since bytes are cheap, the dataverse could also contain whatever policy statement Facebook has on replication data. This (IMHO) is much better than the “contact the authors for replication data” footnote that was published. It is obviously up to individual editors as to whether this is enough to satisfy replication standards, but at least it is better than the status quo.

What if I didn’t work four blocks from Astor Place? Fortunately I did not have to confront this horror. How many other offices does Facebook have? Would Adam Kramer have flown to Peoria? I batted this around, but I did most of the batting and the Facebook people mostly did no comment. So someone else will have to test this issue. But for me, the procedure worked. Obviously I am analyzing lots more proprietary data, and (IMHO) this is a good thing. So Facebook, et al., and journal editors and societies have many details to work out. But, based on this one experience, this can be done. So I close this with thanks to Adam Kramer (but do remind him that I have had auto-responders to email for quite while now).

On the more trivial issue of my own dataverse, I am happy to report that almost everything that was once on an a private ftp site is now on my Harvard dataverse. Some of this was already up because of various co-authors who always cared about replication. And on stuff that was not up, I was lucky to have a co-author like Jonathan Katz, who has many skills I do not possess (and is a bug on RCS and the like, which beats my “I have a few TB and the stuff is probably hidden there somewhere”). So everything is now on the dataverse, except for one data set that we were given for our 1995 APSR piece (and which Katz never had). Interestingly, I checked the original authors’ web sites (one no longer exists, one did not go back nearly that far) and failed to make contact with either author. Twenty years is a long time! So everyone should do both themselves and all of us a favor, and build the appropriate dataverse files contemporaneously with the work. Editors will demand this, but even with this coercion, this is just good practice. I was shocked (shocked) at how bad my own practice was.

Heading image: Wikimedia Foundation Servers-8055 24 by Victorgrigas. CC BY-SA 3.0 via Wikimedia Commons.

Nathaniel Beck is a Professor of Politics at New York University. His interests remain in political methodology, both narrowly and broadly defined, and in statistical issues as they deal with political science problems.

R. Michael Alvarez is a professor of Political Science at Caltech. His research and teaching focuses on elections, voting behavior, and election technologies. He is the co-editor of Political Analysis.

Political Analysis

Replication redux, by Nathaniel Beck

Related posts:

Recent Comments