Masal ve Hikayeler: replicability


The state of play in science

I've just read a new book that MT readers would benefit from reading as well. It's Rigor Mortis, by Richard Harris (2017: Basic Books). His subtitle is How sloppy science creates worthless cures, crushes hope, and wastes billions. One might suspect that this title is stridently overstated, but while it is quite forthright--and its argument well-supported--I think the case is actually understated, for reasons I'll explain below.

Harris, science reporter for National Public Radio, goes over many different problems that plague biomedical research. At the core is the reproducibility problem, that is, the number of claims in research papers that cannot be reproduced by subsequent studies. This particular problem made the news within the last couple of years in regard to the use of statistical criteria like p-values (significance cutoffs), and because of the major effort in psychology to replicate published studies, many of which failed to replicate. But there are other issues.

The typical scientific method assumes that there is a truth out there, and a good study should detect its features. But if it's a truth, then some other study should get similar results. Yet many, many times in biomedical research, despite huge media ballyhoo with cheerleading by the investigators as well as the media, studies' 'breakthrough!!' findings can't be supported by further examination.

As Harris extensively documents, this phenomenon is seen in claims of treatments or cures, or use of animal models (e.g., lab mice), or antibodies, or cell lines, or statistical 'significance' values. It isn't a long book, so you can quickly see the examples for yourself. Harris also accounts for the problems, quite properly I think, by documenting sloppy science but also the careerist pressures on investigators to find things they can publish in 'major' journals, so they can get jobs, promotions, high 'impact factor' pubs, and grants. In our obviously over-crowded market, it can be no surprise to anyone that there is shading of the truth, a tad of downright dishonesty, conveniently imprecise work, and so on.

Since scientists feed at the public trough (or depend on profits from sales of biomedical products to grant-funded investigators), they naturally have to compete and don't want to be shown up, and they have to work fast to keep the funds flowing in. Rigor Mortis properly homes in on an important fact: if our jobs depend on 'productivity' and bringing in grants, we will do what it takes, shading the truth or whatever else (even the occasional outright cheating) to stay in the game.

Why share data with your potential competitors who might, after all, find fault with your work or use it to get the jump on you for the next stage? For that matter, why describe what you did in enough actual detail that someone (a rival or enemy!) might attempt to replicate your work.....or fail to do so? Why wait to publish until you've got a really adequate explanation of what you suggest is going on, with all the i's dotted and t's crossed? Haste makes credit! Harris very clearly shows these issues in the all-too human arena of our science research establishment today. He calls what we have now, appropriately enough, a "broken culture" of science.

Part of that I think is a 'Malthusian' problem. We are credited, in score-counting ways, by chairs and deans, for how many graduate students we turn (or churn) out. Is our lab 'productive' in that way? Of course, we need that army of what often are treated as drones because real faculty members are too busy writing grants or traveling to present their (students') latest research to waste--er, spend--much time in their labs themselves. The result is the cruel excess of PhDs who can't find good jobs, wandering from post-doc to post-doc (another form of labor pool), or to instructorships rather than tenure-track jobs, or who simply drop out of the system after their PhD and post-docs. We know of many who are in that boat; don't you? A recent report showed that the mean age of first grant from NIH was about 45: enough said.

A reproducibility mirage
If there were one central technical problem that Harris stresses, it is the number of results that fail to be reproducible in other studies. Irreproducible results leave us in limbo-land: how are we to interpret them? What are we supposed to believe? Which study--if any of them--is correct? Why are so many studies proudly claiming dramatic findings that can't be reproduced, and/or why are the news media and university PR offices so loudly proclaiming these reported results? What's wrong with our practices and standards?

Rigor Mortis goes through many of these issues, forthrightly and convincingly--showing that there is a problem. But a solution is not so easy to come by, because it would require a major shift in, and reform of, research funding. Naturally, that would be greatly resisted by hungry universities and those whom they employ to set up a shopping-mall on their campus (i.e., faculty).

One purpose of this post is to draw attention to the wealth of reasons Harris presents for why we should be concerned about the state of play in biomedical research (and, indeed, in science more generally). I do have some caveats, that I'll discuss below, but that is in no way intended to diminish the points Harris makes in his book. What I want to add is a reason why I think that, if anything, Harris' presentation, strong and clear as it is, understates the problem. I say this because to me, there is a deeper issue, beyond the many Harris enumerates: a deeper scientific problem.

Reproducibility is only the tip of the iceberg!
Harris stresses or even focuses on the problem of irreproducible results. He suggests that if we were to hold far higher evidentiary standards, our work would be reproducible, and the next study down the line wouldn't routinely disagree with its predecessors. From the point of view of careful science and proper inferential methods and the like, this is clearly true. Many kinds of studies in biomedical and psychological sciences should have a standard of reporting that leads to at least some level of reproducibility.

However, I think that the situation is far more problematic than sloppy and hasty standards, or questionable statistics, even if those are clearly prominent problems. My view is that no matter how high our methodological standards are, the expectation of reproducibility flies in the face of what we know about life. That is because life is not a reproducible phenomenon in the way physics and chemistry are!

Life is the product of evolution. Nobody with open eyes can fail to understand that, and this applies to biological, biomedical, psychological and social scientists. Evolution is at its very core a phenomenon that rests essentially on variation--on not being reproducible. Each organism, indeed each cell, is different. Not even 'identical' twins are identical.

One reason for this is that genetic mutations are always occurring, even among the cells within our bodies. Another reason is that no two organisms are experiencing the same environment, and environmental factors affect and interact with the genomes of each individual organism of any species. Organisms affect their environments in turn. These are dynamic phenomena and are not replicable!

This means that, in general, we should not be expecting reproducibility of results. But one shouldn't overstate this, because the fact that two humans are different obviously doesn't mean they are entirely different. Similarity is correlated with kinship, from first-degree relatives to members of populations, species, and different species. The problem is not that there is similarity, it is that we have no formal theory about how much similarity to expect. We know two samples of people will differ, both among those in each sample and between samples. And even the same people sampled at separate times will be different, due to aging, exposure to different environments and so on. Proper statistical criteria can answer questions about whether differences seem due only to sampling variation or to causal differences. But that rests on a traditional assumption from the origin of statistics and probability, and it isn't entirely apt for biology: since we cannot assume identity of individuals, much less of samples or populations (or species, as in using mouse models for human disease), our work requires some understanding of how much difference, or what sort of difference, we should expect--and build into our models and tests.
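The traditional logic referred to above can be made concrete with a permutation test, sketched here under stated assumptions. The function name, the sample sizes, and the blood-pressure-like numbers are all hypothetical and chosen only for illustration. Note what the test takes for granted: under the null hypothesis, individuals are treated as interchangeable draws from one pool, which is exactly the assumption questioned in this post.

```python
import random


def permutation_p_value(sample_a, sample_b, n_permutations=5000, seed=0):
    """Two-sided permutation test for a difference in means.

    Assumes that, if there is no causal difference, group labels are
    exchangeable, i.e. every individual is a draw from the same pool.
    """
    rng = random.Random(seed)
    n_a = len(sample_a)
    n_b = len(sample_b)
    observed = abs(sum(sample_a) / n_a - sum(sample_b) / n_b)
    pooled = list(sample_a) + list(sample_b)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # reassign group labels at random
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        if diff >= observed:
            at_least_as_extreme += 1
    # Add one to numerator and denominator so the estimate is never zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


# Hypothetical data: two samples from one population, and one shifted sample.
rng = random.Random(1)
same_a = [rng.gauss(120.0, 15.0) for _ in range(40)]
same_b = [rng.gauss(120.0, 15.0) for _ in range(40)]
shifted = [rng.gauss(135.0, 15.0) for _ in range(40)]

print("same population:   p =", permutation_p_value(same_a, same_b))
print("shifted population: p =", permutation_p_value(same_a, shifted))
```

The test can say whether an observed difference is larger than label shuffling alone tends to produce. It cannot say how much difference between samples biology itself should lead us to expect, which is the missing theory.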

Evolution is by its very nature an ad hoc phenomenon in both time and place, meaning that there are no fixed rules about this, as there are laws of gravity or of chemical reactions. That means that reproducibility is not, in itself, even a valid criterion for judging scientific results. Some reproducibility should be expected, but we have no rule for how much and, indeed, evolution tells us that there is no real rule for that.

One obvious and not speculative exemplar of the problem is the redundancy in our systems. Genomewide mapping has documented this exquisitely well: if variation at tens, hundreds, or sometimes even thousands of genome sites affects a trait, like blood pressure, stature, or 'intelligence', and no two people have the same genotype, then no two people, even with the same trait measure, have that measure for the same reason. And as is very well known, mapping only accounts for a fraction of the estimated heritability of the studied traits, meaning that much or usually most of the contributing genetic variation is unidentified. And then there's the environment. . . . .
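A toy simulation can illustrate the redundancy point. It is only a sketch: the number of sites, the allele frequency, and the equal additive effect per site are simplifying assumptions, not estimates from any mapping study, and environment is left out entirely.

```python
import random


def simulate_genotypes(n_people=200, n_sites=100, freq=0.5, seed=1):
    """Return one genotype per person: a tuple of 0/1/2 allele counts per site."""
    rng = random.Random(seed)
    people = []
    for _ in range(n_people):
        # Two independent allele draws per site give a count of 0, 1, or 2.
        genotype = tuple(
            int(rng.random() < freq) + int(rng.random() < freq)
            for _ in range(n_sites)
        )
        people.append(genotype)
    return people


def trait_value(genotype):
    """Purely additive toy trait: every site contributes equally (an assumption)."""
    return sum(genotype)


def count_pairs(people):
    """Count pairs sharing a trait value, and how many of those share a genotype."""
    by_trait = {}
    for genotype in people:
        by_trait.setdefault(trait_value(genotype), []).append(genotype)
    same_value_pairs = 0
    identical_genotype_pairs = 0
    for group in by_trait.values():
        for i in range(len(group)):
            for j in range(i + 1, len(group)):
                same_value_pairs += 1
                if group[i] == group[j]:
                    identical_genotype_pairs += 1
    return same_value_pairs, identical_genotype_pairs


people = simulate_genotypes()
pairs, identical = count_pairs(people)
print(f"pairs with the same trait value: {pairs}")
print(f"of those, pairs with identical genotypes: {identical}")
```

Even in this deliberately simple model, people who share a trait value get there by different combinations of sites, so "the cause" of a given value differs from person to person.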

It's a major problem. It's an inconvenient truth. The sausage-grinder system of science 'productivity' cannot deal with it. We need reform. Where can that come from?

It's not just about psych studies; it's about core aspects of inference.

A paper just published in Science by the "Open Science Collaboration" reports the results of a multi-year multi-institution effort to replicate 100 psychology studies published in three top psychology journals in 2008. This effort has often been discussed since it began in 2011, in large part because the importance of replicability in confirming scientific results is integral to the 'scientific method,' but replicability studies aren't a terribly creative use of a researcher's time, and they're difficult to publish so they aren't often on researchers' To-Do lists. So, this was unusual.

There are many reasons a study can't be replicated. Sometimes the study was poorly conceived or carried out (assumptions and biases not taken into account), sometimes the results pertain only to the particular sample reported (a single family or population), sometimes the methods in an original study aren't described well enough to be replicated, sometimes random or even systematic error (instrument behaving badly) skews the results.

Because there's no such thing as a perfect study, replication studies can be victims of any of the same issues, so interpreting lack of replication isn't necessarily straightforward, and certainly doesn't always mean that the original study was flawed.

The Open Science Collaboration was scrupulous in its efforts to replicate original studies as carefully and faithfully as possible. Still, the results weren't pretty. The authors write:

Replication effects were half the magnitude of original effects, representing a substantial decline. Ninety-seven percent of original studies had statistically significant results. Thirty-six percent of replications had statistically significant results; 47% of original effect sizes were in the 95% confidence interval of the replication effect size; 39% of effects were subjectively rated to have replicated the original result; and if no bias in original results is assumed, combining original and replication results left 68% with statistically significant effects.

Interestingly enough, the authors aren't quite sure what any of this means. First, as they point out, direct replication doesn't verify the theoretical interpretation of the result, which could have been flawed originally, and remain flawed. And when a study is not replicated, it's impossible to know why: whether the original was flawed, or the replication effort was flawed, or both.

This effort has been the subject of much discussion, naturally enough. In a piece published last week in The Atlantic, Ed Yong quotes several psychologists, including the project's lead author, saying that this project has been a welcome learning experience for the field. There are plans afoot to change how things are done, including pre-registration of hypotheses so that the reported results can't be cherry-picked, or increasing the size of studies to increase their power, as has been done in the field of genetics.
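The link between cherry-picking, low power, and shrinking replication effects can be sketched with a small simulation. This is an illustration only, not the Open Science Collaboration's analysis: the true effect size, the sample size, and the rule that only significant positive results get "published" are all assumptions made for the example.

```python
import random
import statistics


def simulate(true_effect=0.2, n_per_group=30, n_studies=20000, seed=7):
    """Simulate original studies filtered by significance, then replicate them.

    Each study's estimate is drawn from a normal distribution around the true
    standardized effect, with the approximate standard error sqrt(2 / n).
    """
    rng = random.Random(seed)
    se = (2.0 / n_per_group) ** 0.5
    z_crit = 1.96  # conventional two-sided 5% cutoff
    published = []
    replications = []
    for _ in range(n_studies):
        original = rng.gauss(true_effect, se)
        if original / se > z_crit:  # only significant positive results are kept
            published.append(original)
            # A replication with the same design and the same true effect.
            replications.append(rng.gauss(true_effect, se))
    rep_significant = sum(r / se > z_crit for r in replications) / len(replications)
    return statistics.mean(published), statistics.mean(replications), rep_significant


pub_mean, rep_mean, rep_rate = simulate()
print(f"mean published effect:       {pub_mean:.2f}")
print(f"mean replication effect:     {rep_mean:.2f}")
print(f"replications significant:    {rep_rate:.0%}")
```

In this toy setup every original finding reflects a real effect, yet the published estimates are inflated and most replications miss the significance cutoff. Larger samples shrink the standard error and so reduce the inflation, which is the rationale for increasing study size.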

We'll see whether this is just a predictable wagon-circling welcome, or really means something. One has every reason to be skeptical, and wonder if these fields really are sciences in the proper sense of the term. Indeed, it's quite interesting to see genetics held up as an exemplar of good and reliable study design. After billions of dollars being spent on studies large and small of the genetics of asthma, heart disease, type 2 diabetes, obesity, hypertension, stroke, and so on, we've got not only a lot of contradictory findings, but most of what has been found are genes with small effects. And epidemiology, many of the 'omics fields, evolutionary biology, and others haven't done any better.

Why? The vagueness of the social and behavioral sciences is only part of the problem (unlike, say, force, outcome variables such as stress, aggression, crime, or intelligence are hard to consistently define, and can vary according to the instrument with which they are measured). Biomedical outcomes can be vague and hard to define as well (autism, schizophrenia, high blood pressure). We don't understand enough about how genes interact with each other or with the environment to understand complex causality.

Statistics and science
The problem may be much deeper than any of this discussion of non-replicable results suggests. From an evolutionary point of view, we expect organisms to be different, not replicates. This is because, first, mutational changes (and recombination) are always making each individual organism's genotype unique and, second, the need to adapt--Darwin's central claim or observation--means that organisms have to be different so that their 'struggle for life' can occur.

We have only a general theory for this, since life is an ad hoc adaptive/evolutionary phenomenon. Far more broadly than just the behavioral or social sciences, our investigative methods are based on 'internal' comparisons (e.g., cases vs controls, various levels of blood pressure and stroke, fitness relative to different trait values) to evaluate samples against each other, rather than as representations of an externally derived, a priori theory. When we rely on statistics and p-value significance tests and probabilities and so on, we are implicitly confessing that we don't in fact really know what's going on. All we can get is a kind of shadow of the underlying process that is cast by the differences we detect, and we detect them with generic (not to mention subjective) rather than specific criteria. We've written about these things several times in the past here.

The issue is not just weakly defined terms and study designs. As Freeman Dyson (in "A meeting with Enrico Fermi") wrote in 2004:

In desperation I asked Fermi whether he was not impressed by the agreement between our calculated numbers and his measured numbers. He replied, "How many arbitrary parameters did you use for your calculations?" I thought for a moment about our cut-off procedures and said, "Four." He said, "I remember my friend Johnny von Neumann used to say, with four parameters I can fit an elephant, and with five I can make him wiggle his trunk." With that, the conversation was over. . . .

Here, parameters refers to what are more properly called 'free parameters', that is, ones not fixed in advance, but that are estimated from data. By contrast, for example, in physics the speed of light and gravitational constant are known, fixed values a priori, not estimated from data (though data were used to establish those values). We just lack such understanding in many areas of science, not just behavioral sciences.
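The force of Fermi's objection can be shown with a toy example, which is a sketch and not anything from Dyson's article. It fits a polynomial with five free parameters to five data points that are pure noise; the data, the x values, and the function name are hypothetical.

```python
import random


def lagrange_fit(xs, ys):
    """Return the unique polynomial of degree len(xs) - 1 through the points.

    The polynomial has one free parameter per data point, so it can match
    any data exactly, whether or not there is anything real to match.
    """
    def poly(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return poly


rng = random.Random(4)
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [rng.gauss(0.0, 1.0) for _ in xs]  # pure noise: there is no signal to find

model = lagrange_fit(xs, ys)  # five free parameters for five points
train_error = max(abs(model(x) - y) for x, y in zip(xs, ys))

new_xs = [0.5, 1.5, 2.5, 3.5]
new_ys = [rng.gauss(0.0, 1.0) for _ in new_xs]  # fresh noise, same process
test_error = max(abs(model(x) - y) for x, y in zip(new_xs, new_ys))

print(f"worst error on the fitted points: {train_error:.2e}")
print(f"worst error on new points:        {test_error:.2f}")
```

Agreement with the fitted points is guaranteed by construction, so it says nothing about whether the model captures the process. Only parameters fixed in advance, or predictions checked on data not used for fitting, can carry that weight.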

In a sense we are using Ptolemaic tinkering to fit a theory that doesn't really fit, in the absence of a better (e.g., Copernican or Newtonian) theoretical understanding. Social and behavioral sciences are far behind the at least considerably more rigorous genetic and evolutionary sciences when the latter are done at their best (which isn't always). Like the shadows of reality seen in Plato's cave, statistical inference reflects the shadows of the reality we want to understand, but for many cultural and practical reasons we don't recognize that, or don't want to acknowledge it. The weaknesses and frangibility of our predictive 'powers' could, if properly understood by the general public, be a threat to our business, and our culture doesn't reward candor when it comes to that business. The pressures (including from the public media, with their own agenda and interests) necessarily lead to reducing complexity to simpler models and claims far beyond what has legitimately been understood.

The problem is not just with weakly measured variables or poorly defined terms for, say, outcomes. Nor is the problem whether, when, or that people use methods wrongly. The problem is that statistical inference is based on a sample and is often retrospective, or mainly empirical and based on only rather generic theory. No matter how well chosen and rigorously defined the samples and variables are, in these various areas (unlike much of physics and chemistry) the estimates of parameters and the like are fitted to data that are necessarily about the subjects' past, such as their culture or upbringing or lifestyles. In the absence of adequate formal theory, these findings cannot be used to predict the future with knowable accuracy. That is because the same conditions can't be repeated, say, decades from now, and we don't know what future conditions will be, and so on.

Rather alarmingly, we were recently discussing this with a colleague who works in materials science, a field with the rigor of physics and chemistry. She immediately told us that they, too, face problems in data evaluation with the number of variables they have to deal with, even under what the rest of us would enviously say were very well-controlled conditions where the assumptions of statistics--basically amounting to replicability of some underlying mathematical process--should really apply well.

So the social and related sciences may be far weaker than other fields, and should acknowledge that. But the rest of us, in various purportedly 'harder' biological, biomedical, and epidemiological sciences, are often not so much better off. Statistical methods and theory work wonderfully well when their assumptions are closely met. But there is too much out-of-the-box analytic toolware that lures us into thinking that quick and definitive answers are possible. Those methods never promise that, because what statistics does is account for repeated phenomena following the same rules, and the rule in many sciences is that, in their essence, their phenomena do not follow such rules.

But the lure of easy-answer statistics, and the understandable lack of deeply better ideas, perpetuates the expensive and misleading games that we are playing in many areas of science.
