Oxford University Press's
Academic Insights for the Thinking World

Text analysis for comparative politics

Introduction

By R. Michael Alvarez


Text has long been an important but difficult-to-use source of data for social scientists. When I wrote my Ph.D. thesis, for example, I sat for weeks with abstracts from the New York Times, finding newspaper articles relating to past presidential campaigns and content-analyzing them to determine whether they contained substantive or “horse-race” information. At that time, most scholars analyzed text manually in this way, and although vast amounts of text were available for study, very little of it became data amenable to sophisticated quantitative analysis.

How the world has changed! Because of vast improvements in computational capabilities (in data accessibility, storage, and analytic power), tools and methods for the automated analysis of text have proliferated. Some of the most innovative new tools and methods are being developed by social scientists, and in recent years we have seen many important papers on the analysis of text published in Political Analysis.

One of the important developments in this area is the Structural Topic Model. This methodology for analyzing text has recently seen rapid development, and I asked the authors of “Computer-Assisted Text Analysis for Comparative Politics” (Christopher Lucas, Richard A. Nielsen, Margaret E. Roberts, Brandon M. Stewart, Alex Storer, and Dustin Tingley) to discuss in a bit more detail their paper and its contribution to the field. Their essay is below.

* * * * *

Text Analysis for Comparative Politics

By Christopher Lucas, Richard A. Nielsen, Margaret E. Roberts, Brandon M. Stewart, Alex Storer, and Dustin Tingley


Every two days, humans produce more textual information than the combined output of humanity from the dawn of recorded history up through the year 2003. Much of this text is directly relevant to questions in political science. Governments, politicians, and average citizens regularly communicate their thoughts and opinions in writing, providing new data from which to understand the political world and suggesting new avenues of study in areas that were previously thought intractable. To extract the value in this textual data, however, we need methods for principled, systematic analysis.

Preparing textual data for analysis presents unique challenges, particularly for comparativists working with non-English text. Though statistical methods for text analysis are often language-agnostic, tools for preprocessing the texts are not. We provide software packages to help users preprocess and translate text in multiple languages, accompanied by an overview of the steps necessary to prepare textual data for analysis. Because the Structural Topic Model (STM) allows users to incorporate document metadata into the analysis, investigators can treat the language in which a document was written as a variable and can model systematic differences in topical content across languages.
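To make the preprocessing point concrete, here is a minimal sketch in Python. It is not the authors' software, and every name in it (Document, STOPWORDS, preprocess, build_corpus) is hypothetical, as are the tiny stopword lists and the two example sentences. It shows two things only: stopword removal depends on the language of the document, and the language label can be kept alongside the word counts so that a model such as the STM could later use it as a covariate. The whitespace tokenizer is itself an assumption that holds for English and Spanish but fails for Chinese, which requires a word segmenter; that is one reason preprocessing tools cannot be language-agnostic.

```python
from collections import Counter
from dataclasses import dataclass

# Tiny illustrative stopword lists (assumption: real lists are far longer).
STOPWORDS = {
    "en": {"the", "is", "a", "of", "in"},
    "es": {"el", "la", "es", "de", "en", "un"},
}

PUNCTUATION = ".,;:!?¿¡\"'()"


@dataclass
class Document:
    text: str
    language: str  # metadata kept alongside the text


def preprocess(doc):
    """Lowercase, split on whitespace, strip punctuation, drop stopwords."""
    stop = STOPWORDS[doc.language]  # the language-specific step
    tokens = (tok.strip(PUNCTUATION) for tok in doc.text.lower().split())
    return [tok for tok in tokens if tok and tok not in stop]


def build_corpus(docs):
    """Return per-document term counts plus the language covariate."""
    counts = [Counter(preprocess(d)) for d in docs]
    languages = [d.language for d in docs]
    return counts, languages


# Hypothetical two-document corpus.
docs = [
    Document("The court is in session.", "en"),
    Document("El tribunal es un órgano de la justicia.", "es"),
]
counts, languages = build_corpus(docs)
```

Here counts holds one bag of words per document and languages holds the matching metadata, in the same order. A topic model that accepts document-level covariates would take both as input.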

As a proof of concept, we examined thousands of social media posts in Arabic and Chinese in June 2013 about Edward Snowden. Our analysis reveals that Chinese posts about Snowden during this time period are more likely to address issues of hypocrisy in US foreign policy, suggesting that the United States violates the human rights of its citizens while simultaneously advocating for better human rights protection abroad. By contrast, Arabic posts were more likely to deal with the issue of asylum, addressing the question: where will Snowden go next?

Work on Structural Topic Models continues to evolve. A forthcoming book chapter, “Navigating the Local Modes of Big Data: The Case of Topic Models,” addresses model stability in the STM and provides support for the use of a deterministic initialization strategy based on spectral methods. A recent working paper, “Matching Methods for High-Dimensional Data with Applications to Text,” demonstrates how the STM machinery can be used to facilitate causal inference from observational data where the pre-treatment confounders are documents. Two new software packages have been released on CRAN: stmBrowser and stmCorrViz, both of which provide interactive visualizations of STM models. The core software package, stm, has also been updated to increase speed and to introduce numerous new features described in the papers above. The papers, software, and a vignette detailing how to get started are available at structuraltopicmodel.com.

These methods open the world of text analysis to scholars of international relations and comparative politics. The possibilities are many, and we demonstrate but a few.

