October 2002
Gary King answers a few questions about this month's Emerging Research Front in the field of Psychiatry/Psychology: Psychiatry/Psychology, general
Article: "Analyzing incomplete political science data: An alternative algorithm for multiple imputation"
Authors: King, G; Honaker, J; Joseph, A; Scheve, K
Journal: AMER POLIT SCI REV, 95: (1) 49-69 MAR 2001
Addresses:
Harvard Univ, Ctr Basic Res Social Sci, World Hlth Org, Global Programme Evidence Hlth Policy, Cambridge, MA 02138 USA.
Harvard Univ, Ctr Basic Res Social Sci, Dept Govt, Cambridge, MA 02138 USA.
Yale Univ, Inst Social & Policy Studies, Dept Polit Sci, New Haven, CT 06520 USA.

Why do you think your paper is highly cited?
Our paper provides a way around a discrepancy between how almost
all social scientists analyze data with missing values (such as
public opinion surveys where respondents refuse to answer some
questions) and the recommendations of the statistics community. With
few exceptions, methodologists and statisticians agree that a
technique called "multiple imputation" is superior to the
way social scientists commonly treat missing data. The technique has
been known for two decades, but it had rarely been used in real
research settings (i.e., by few other than statisticians and their
students and consulting clients). The discrepancy occurred because
the only algorithms available to implement the technique were slow,
extremely difficult to implement, impossible to run in existing
statistical packages, and usable only by researchers with expertise
in arcane techniques they would otherwise have little need for and
did not know. We adapted an algorithm in a new way to implement a
general-purpose, multiple imputation model for missing data (known
as EMis) that is considerably easier to use and much faster. We also
showed that the risks of existing missing data practices were
substantial (i.e., on par with the much better known bias that can
occur when omitting appropriate controls). Our article also gave
examples where our approach led to more informative and less biased
substantive conclusions. As a companion to the paper, we also
offered easy-to-use, open source software that implements our
methods (see "Amelia: A Program for Missing Data,"
available at http://GKing.Harvard.edu).
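As a rough illustration of what using multiply imputed data involves, the sketch below pools one estimate across several completed data sets using the standard combining rules for multiple imputation (average the estimates; add the average within-data-set variance to the between-data-set variance inflated by 1 + 1/m). It is not the EMis algorithm and not Amelia's code; the function name and the five example values are hypothetical.

```python
from statistics import mean, variance

def combine_imputations(estimates, variances):
    """Pool one quantity estimated separately on each of m imputed data sets."""
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need matching estimates and variances from at least 2 data sets")
    pooled = mean(estimates)                # point estimate: average across data sets
    within = mean(variances)                # average sampling variance inside each data set
    between = variance(estimates)           # sample variance across data sets (m - 1 denominator)
    total = within + (1 + 1 / m) * between  # total variance, inflated for the missing data
    return pooled, total

# Hypothetical coefficient estimates and squared standard errors from m = 5 imputed data sets.
estimates = [1.0, 1.2, 0.8, 1.1, 0.9]
variances = [0.04, 0.05, 0.04, 0.05, 0.04]
pooled, total = combine_imputations(estimates, variances)
```

The between-data-set term is what distinguishes this from filling in a single guess: disagreement among the imputations widens the reported uncertainty.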
Does it describe a new discovery or new methodology that's useful to others?
The idea seems to have proven useful to others, and indeed many
thousands of copies of our software have been downloaded. Our survey
of the literature indicated that about half of the respondents who
participate in sample surveys refuse to give answers to one or more
questions researchers need in the average article. Almost all
analysts contaminate their data at least partially by filling in
educated guesses for some of these nonresponses (such as by coding
"don't know" on issue positions as the middle category of
Likert scales), and approximately 94% of researchers use "listwise
deletion" to eliminate entire observations (losing about
one-third of their data on average) when any one variable remains
missing after the first procedure. At best (when respondents choose
randomly which questions they will answer), these procedures cause
scholarly analyses of survey and other data to discard a substantial
quantity of information. At worst (when respondents choose not to
answer survey questions for a reason related to the research
questions), these procedures induce massive bias. Our algorithm
reduces the likelihood of both problems.
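The best case described above, in which questions are skipped at random, can be sketched in a few lines. This is only a simulation under assumed numbers (8 questions, each skipped independently 5% of the time, 20,000 hypothetical respondents); the function names are illustrative and are not taken from the article or from Amelia.

```python
import random

def listwise_delete(rows):
    """Keep only rows in which every answer is present (None marks a nonresponse)."""
    return [row for row in rows if all(value is not None for value in row)]

def simulate_survey(n_respondents, n_questions, p_missing, seed=0):
    """Simulate answers where each item is skipped independently with probability p_missing."""
    rng = random.Random(seed)
    return [
        [None if rng.random() < p_missing else rng.randint(1, 5)
         for _ in range(n_questions)]
        for _ in range(n_respondents)
    ]

# Assumed, illustrative settings: 8 questions, 5% nonresponse on each.
rows = simulate_survey(n_respondents=20000, n_questions=8, p_missing=0.05)
complete = listwise_delete(rows)
retained = len(complete) / len(rows)   # share of respondents that survives deletion
expected = (1 - 0.05) ** 8             # chance that one respondent answers all 8 items
```

Under these assumptions `expected` is about 0.66, so a modest 5% nonresponse rate per question is already enough to discard roughly a third of the respondents once every question must be answered.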
Could you summarize the significance of your paper in layman's terms?
When asked by survey researchers for their income, political
opinions, health status, or other sensitive information, some
citizens understandably refuse to answer. This is of course their
right, but if researchers need this information to understand the
world (and perhaps to design policies to reduce unemployment,
improve democracy, or advance health), researchers have to do
something to fill in the missing information. Before our article,
most political scientists and many others dropped all information
from any respondent who did not answer every question of interest.
For the average research article, our approach amounts to a way of
using about 50% more information from the data than had previously
been used, making research funds and investigator effort go farther.
For example, consider a graduate student writing a dissertation and
needing to collect about eight months' worth of complete data in
uncomfortable circumstances far from home. Ideally every datum
collected would be complete, but even the best researchers lose
approximately one-third of their observations to item nonresponse
and listwise deletion. So nonresponse must be anticipated as
part of any realistic research plan. However, instead of booking a
trip for 12 months and planning to lose a third of the data—and
four months of his or her life—it probably makes more sense to
collect data for 8 months and take a few days to learn and implement
our methodology.
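The figures in this answer follow from the one-third loss rate. A small worked calculation, assuming for illustration that imputation lets the analyst use every collected observation:

```python
from fractions import Fraction

loss = Fraction(1, 3)   # share of observations lost to listwise deletion, as stated above
kept = 1 - loss         # share that survives deletion

months_needed = 8                        # months of complete data the dissertation requires
months_to_book = months_needed / kept    # collection time needed if a third is discarded
extra_information = 1 / kept - 1         # relative gain from using every observation
```

Here `months_to_book` works out to 12 and `extra_information` to one half, which matches the 12-month trip and the "about 50% more information" mentioned above.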
How did you become involved in this research?
My coauthors-to-be and then graduate students—James Honaker
(now Assistant Professor at UCLA), Anne Joseph (now law clerk to
U.S. Supreme Court Justice Ruth Bader Ginsburg), and Kenneth Scheve
(now Assistant Professor at Yale University)—and I set out to
study missing data. The problem of missing data arises in almost
every quantitative social science study, and we had all confronted
the problem in our research and frequently been asked for
methodological advice on the subject by other researchers. We also
knew of the discrepancy between the way missing data methods are
recommended and used, and we set out to find a way to address the
problem.
Gary King
David Florence Professor of Government
Center for Basic Research in the Social Sciences
34 Kirkland Street
Harvard University
Cambridge, MA 02138