Judgment misguided: Replication reservations

Replication of previously reported studies is sometimes useful or even necessary. Drug companies often try to replicate published research before investing a great deal of money in drug development based on that research. Ordinary academic researchers often want to examine some published result more closely, so they include a replication of that result in a larger design, or simply check whether they can get the effect before they proceed to make modifications. Failures to replicate are often publishable (e.g., Gong, M., & Baron, J. The generality of the emotion effect on magnitude sensitivity. Journal of Economic Psychology, 32, 17–24, 2011), especially when several failures are included in a meta-analysis (e.g., http://journal.sjdm.org/14/14321/jdm14321.html). Finally, people may try to replicate a study when they disagree with its conclusions, possibly because of other theoretical or empirical work they have done.

Researchers are now spending time trying to replicate research studies in the absence of such purposes. In one project, some students are in the process of trying to replicate most of the papers published in Judgment and Decision Making, the journal I have edited since 2006 (https://osf.io/d7za8/). Let me explain why this bothers me.

First, these projects take time and money that could be spent elsewhere. The alternatives might be more worthwhile, but of course this depends on what they are.

Second, if you want to question a study's conclusions, it is often easier to find a problem with the data analysis or method of the original study. A large proportion of papers published in psychology (varying from field to field) have flaws that can be discovered this way. Many of these flaws are listed in http://journal.sjdm.org/stat.htm. It is possible to publish papers that do nothing but "take down" another published paper, especially if a correct re-analysis of the data yields a conclusion contradicting the original one.

Third, complete replication of a flawed study often succeeds quite well, because it replicates the flaws. A recent paper in the Journal of Personality and Social Psychology (Gawronski et al., 2017. Consequences, norms, and generalized inaction in moral dilemmas: The CNI model of moral decision-making, 113: 343-376) replicated every study within the paper itself. The replications used new subjects but not new stimuli, and the data analysis ignored variations among the stimuli in the size and direction of the effects of interest (among other methodological problems).

Fourth, what do we conclude when a study does not replicate? Fraud? Dishonesty in reporting? Selective reporting? Luck? Sometimes these explanations can be detected by looking at the data (e.g., http://retractionwatch.com/2013/09/10/real-problems-with-retracted-shame-and-money-paper-revealed/#more-15597). But none of them can be inferred from a failure to replicate. So what is the point? Is it to scare journal editors into accepting papers only when they have very clear results that do not challenge existing theories or claims?

Blanket replication of every study is a costly way to provide incentives for editors. Perhaps these "replication factors" for journals are an antidote to the poison of "impact factors". Impact factors encourage publication of surprising results that will get news coverage, and will need to be cited, just because they are surprising. But the very fact that they are surprising increases the probability that something is wrong with them. A "replication index" will discourage publication of such papers. But it will also encourage publication of papers that go to excess to replicate studies within the paper, use large samples of subjects, and, in general, cost a lot of money. This will thus tend to drive out of the field those who are not on the big-grant gravy train (or who are not in schools that provide them with generous research funding). It is better for editors to ignore both concerns.

Fifth, I think that some good studies are unlikely to replicate. I try to publish them anyway. One general category consists of studies that pit two effects against each other, only one of which is interesting. An example is the "polarization effect" of Lord, Ross and Lepper (1979): subjects who opposed or favored capital punishment were presented with two studies, one showing that it deterred serious crimes and the other showing that it did not deter; both groups became more convinced of their original position, because they found ways to dismiss the study that disagreed with it. This result has in fact been replicated, but other attempts to find polarization have failed. The opposite effect is that presenting people with conflicting evidence moves them toward a more moderate position. In order for the polarization effect to "win", it must be strong enough to overcome this rational tendency toward moderation. The conditions for this to happen are surely idiosyncratic. The interesting thing is that it happens at all. If the original study is honestly reported and shows a clear effect, then it does happen.

Another example is a study recently published in Judgment and Decision Making (Bruni and Tufano. The value of vulnerability: The transformative capacity of risky trust, 12, 408-414, 2017). The finding of interest was that people who made themselves "vulnerable", by showing that they had trusted someone who had previously been untrustworthy, evoked more trustworthy behavior in trustees who knew of their vulnerability. Again, this result must be strong enough to counter an opposite effect: these vulnerable people could also be seen as suckers, ripe for exploitation. I suspect that this result will not replicate, but I also think it is real. (I examined the data quite carefully.) It may well depend on details of the sample of subjects, the language, and so on. This is not going to help the "replicability index" of the journal (or the impact factor, for that matter, as it is quite a complex study), but I don't care, and I shouldn't care.

Of course, other important studies simply cannot be replicated, because they involve attitudes or events at a given time and place, e.g., studies of the determinants of political attitudes, the spread of an epidemic, or the structure of an earthquake. What often can be done instead is to look at the data.

In my view, the problem is not so much "replicability" as "credibility". Replications will be done when they are worth doing for other reasons. But for general credibility checking, it is probably more efficient to look at the data and the methods. To smooth the path for both replication and examination of data, journals should welcome replications (with either result when the original result is in doubt) and they should require publication of data whenever possible.

Monday, December 25, 2017
