Can We Still Trust Scientific Studies?

Every day, hundreds of scientific studies are published worldwide. Yet many of them contradict each other, even when they address the same topic and use similar methodologies. This cacophony reveals a deeper issue: the current scientific system suffers from structural dysfunctions that undermine the reliability of research.

When an Umbrella Review Exposes the Flaws
Like any other consumer product, e-cigarettes are widely studied by researchers. Yet hardly a day goes by without a study being published that contradicts the conclusions of a previous one. While the use of a different methodology can sometimes explain why two similar studies arrive at opposing conclusions, this is not always the case. The truth is, the scientific world faces numerous problems.

A few days ago, a new British study was published. It was an umbrella review, i.e., a systematic review of systematic reviews. Here’s what that means:

When a researcher publishes a study on a specific subject, that’s a simple study. When a researcher systematically analyzes the results of all the studies on the same subject, that’s a systematic review. An umbrella review, then, analyzes the results of several systematic reviews that themselves analyzed the results of many individual studies. In short, an umbrella review is a synthesis of syntheses.

The umbrella review in question looked at youth vaping. In its conclusions, it reported finding “consistent evidence that higher risks of smoking initiation, substance use (marijuana, alcohol, and stimulants), asthma, coughing, injuries, and mental health problems are associated with e-cigarette use among young people.”

For once, we won’t pick apart this specific umbrella review, even though, given its numerous limitations, the authors clearly failed to temper their conclusions. What we’ll focus on this time is something called AMSTAR 2.

AMSTAR 2, the Measurement Tool That Distorts Reality
AMSTAR 2 is a critical appraisal tool. This term refers to a family of instruments used by scientists to assess the quality of a study. In this case, AMSTAR 2—short for A MeaSurement Tool to Assess systematic Reviews—is the standard tool for evaluating systematic reviews.

There are many such tools. AMSTAR 2 was designed to assess the quality of systematic reviews; GRADE evaluates the certainty of evidence; the Cochrane Risk of Bias tool is used for individual studies; NOS is geared to cohorts and case–control studies; QUADAS-2 to diagnostic studies, and so on. There are hundreds of tools.

All share the same goal: to quickly and consistently assess the quality of a scientific study. With tens of thousands of new studies published each year, scientists needed a way to separate the wheat from the chaff.

Among these tools, AMSTAR 2 stands out. It is essentially the expected standard for evaluating the quality of systematic reviews. Broadly speaking, if a researcher works with data from one or more systematic reviews, most medical journals will refuse their work unless they’ve used this tool. The umbrella review mentioned earlier therefore relied on AMSTAR 2.

“Even when authors state in their manuscripts that the systematic review was conducted/prepared/designed in accordance with AMSTAR 2, this does not necessarily mean it achieves high or even moderate confidence under AMSTAR 2.”
Source: “Most systematic reviews reporting adherence to AMSTAR 2 had critically low methodological quality: a cross-sectional meta-research study”

Result? The authors report: “Most systematic reviews we included were rated as low or critically low quality using AMSTAR 2.”

Does this mean the umbrella review was almost entirely based on poor-quality systematic reviews? Not exactly. AMSTAR 2, on average, classifies over 90% of systematic reviews as “critically low quality”. Why?

Because AMSTAR 2 relies on sixteen items, seven of which are deemed critical and heavily skew the final rating. In reality, fewer than half of these items are truly applicable to all systematic reviews[3]. Add to this vague criteria, often misunderstood by researchers[4], and you end up with a tool many consider fundamentally flawed.
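
To see why ratings pile up at the bottom, consider how the overall confidence score is derived. The sketch below is a minimal Python rendering of the rating rule described in the AMSTAR 2 guidance (the example inputs are invented): a single critical flaw caps a review at “low,” and two or more critical flaws make it “critically low,” no matter how well everything else was done.

```python
# Illustrative sketch of the AMSTAR 2 overall-confidence rule:
# critical flaws dominate the rating, which is why so many reviews
# end up classed as "critically low".

CRITICAL_ITEMS = {2, 4, 7, 9, 11, 13, 15}  # the seven critical domains

def amstar2_confidence(flawed_items: set[int]) -> str:
    """Overall confidence rating given the set of flawed items (1-16)."""
    critical = len(flawed_items & CRITICAL_ITEMS)
    non_critical = len(flawed_items - CRITICAL_ITEMS)

    if critical > 1:
        return "critically low"  # more than one critical flaw
    if critical == 1:
        return "low"             # one critical flaw, regardless of the rest
    if non_critical > 1:
        return "moderate"        # only non-critical weaknesses
    return "high"                # at most one non-critical weakness

# Two critical flaws sink a review, whatever else it does well:
print(amstar2_confidence({2, 7}))        # -> critically low
print(amstar2_confidence({3, 5, 6, 8}))  # -> moderate
```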

So why use AMSTAR 2 at all, if so many researchers know it’s not really suited for this task?
Simply put: because AMSTAR 2 is the expected standard in academic practice. Despite its shortcomings, convention demands it. And it is not an isolated case—other tools or practices meant to guarantee scientific rigor are also problematic.

The Impact Factor
The Impact Factor is another tool that has drifted from its original purpose. It was created to help libraries decide which journals to subscribe to. Today—despite repeated warnings from its creator, Eugene Garfield—it has become a primary criterion for evaluating researchers and their work.

“Using journal Impact Factors rather than actual article citation counts to evaluate researchers is a highly controversial issue.”
Eugene Garfield, creator of the Impact Factor

What does the Impact Factor actually measure? The journal in which the researcher’s work is published. A scientist’s reputation and the perceived quality of their work are thus tied to where it is published, not to the qualities of the study itself.

It’s like rating a movie—and the actors in it—not by the script or their performances, but by the theater where the film is shown. It makes no sense, yet that’s precisely what often happens in science today.
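
For reference, the standard two-year Impact Factor is a purely journal-level ratio; nothing in it refers to any individual article or author. For a generic year Y:

\[
\text{Impact Factor}_{Y} \;=\;
\frac{\text{citations received in year } Y \text{ to items the journal published in years } Y-1 \text{ and } Y-2}
     {\text{number of citable items the journal published in years } Y-1 \text{ and } Y-2}
\]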

Predatory Journals
Another problem: predatory journals. These claim to be legitimate scientific outlets but in reality accept almost any study—for a fee. No peer review, sometimes not even a cursory read. If the author pays, they get published.

These journals pollute the scientific literature. They allow “anyone” to publish a study whose data hasn’t been verified. In 2014, around 420,000 articles were published across more than 8,000 such journals.

Worse still, some of these papers have been cited in genuine scientific work. Bad science seeps into good science—a problem known as citation contamination.

“A negative consequence of the rapid growth of open-access scholarly publishing funded by article processing charges is the emergence of publishers and journals with highly questionable promotional and peer-review practices.”
Source: “‘Predatory’ open access: a longitudinal study of article volumes and market characteristics”

As further proof, Polish researchers ran an experiment: they created a fictitious scholar, Anna Szust, with a fabricated CV, and applied on her behalf for an editor position at hundreds of journals, both reputable and suspected predatory ones. Forty predatory journals accepted her, sometimes within days or even hours of applying. More worryingly, eight journals listed in the Directory of Open Access Journals (considered quality open-access venues) also accepted her. Thankfully, none of those indexed in Journal Citation Reports fell for it.

Predatory journals also fuel other frauds, such as paper mills—businesses that fabricate entire studies (fake data, fake charts) on demand, publish them in predatory outlets, and sell authorship to researchers who want to pad their CVs to secure funding. And the list of cracks in the system goes on.

Peer Review
Peer review is widely considered the gold standard of scientific validation—by scientists and journalists alike. But in reality, it’s far from perfect.

The process, in brief:

An author submits a paper to a journal;
The editor sends it to experts in the field;
They review it anonymously and recommend acceptance, revision, or rejection;
The journal makes the final decision.

Problem: while treated as objective, peer review is inherently subjective, because a study’s quality is judged arbitrarily by a few people—who often disagree. For the same paper, one expert may recommend acceptance, another rejection. Evidence of a flawed system.

“When evaluating faculty, most people don’t have—or don’t take—the time to read the papers! And even if they did, their judgment would likely be influenced by the comments of those who cited the work.”
Eugene Garfield, creator of the Impact Factor

Add to this the many biases that can affect the process: author nationality vs. reviewer nationality, institutional prestige, gender, discipline, confirmation bias, etc.

For the record, some papers rejected during peer review later underpinned Nobel Prize-winning work.
Yet, as with AMSTAR 2, peer review is deeply entrenched in scientific practice. In its defense, no fully viable replacement exists, at least for now.

Citation Manipulation
Citations are another problem. They are the currency of science. The more a researcher’s work is cited, the more influential it is deemed. Citation counts weigh on hiring, promotions, and funding.

But citations can have perverse effects: they turn collaboration into competition. Researchers may choose topics that are more citable rather than more important.

“The ability to purchase bulk citations is a new and worrying development.”
Jennifer Byrne, cancer researcher

An even bigger issue: some actors sell citations. A scientist pays and gets cited. A minor—or poor-quality—study can be perceived as strong simply because it is frequently cited. A black market has developed on this basis.

P-Hacking
Finally, p-hacking. The letter p in studies refers to the p-value: the probability of obtaining a result at least as extreme as the one observed if chance alone were at work. In scientific research, results are commonly considered statistically significant when p < 0.05, i.e., when such a result would occur by chance less than 5% of the time.

“Regarding different p-hacking strategies, we found that even with a single strategy, false-positive rates can typically be increased to at least 30% above the typical 5% threshold with a ‘reasonable effort’—that is, without assuming researchers automate data-mining procedures.”
Source: “Big little lies: a compendium and simulation of p-hacking strategies”

This 5% threshold, chosen by Ronald Fisher in the 1920s without any special scientific justification, has become a nightmare. Journals can refuse to publish studies whose results are not “significant,” creating an incentive to game the stats.

Some researchers then cheat to get p < 0.05: stop data collection as soon as the threshold is hit; drop participants after seeing their inclusion pushes p above 0.05; test a laundry list of variables and report only those below 0.05; or slice the data into absurd subgroups until a “significant” effect appears.
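
To make the “laundry list of variables” trick concrete, here is a minimal simulation sketch (illustrative Python on purely synthetic data, not taken from any real study). With no true effect anywhere, a single comparison comes out “significant” about 5% of the time; give the researcher twenty outcomes to choose from and the chance of finding at least one p < 0.05 climbs to roughly 1 - 0.95^20, or about 64%.

```python
# Sketch: testing many outcomes and reporting only the "significant" one
# inflates the false-positive rate far beyond the nominal 5%.
# Purely simulated data; numbers are illustrative, not from any real study.
import random
import statistics
from math import erf, sqrt

def approx_p_value(a, b):
    """Rough two-sided p-value for a difference in means (normal approximation)."""
    se = sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    z = abs(statistics.mean(a) - statistics.mean(b)) / se
    return 2 * (1 - 0.5 * (1 + erf(z / sqrt(2))))

def false_positive_rate(n_outcomes, n_trials=2000, n_per_group=30):
    """Fraction of 'studies' in which at least one of n_outcomes hits p < 0.05
    even though no real effect exists anywhere."""
    hits = 0
    for _ in range(n_trials):
        for _ in range(n_outcomes):
            group_a = [random.gauss(0, 1) for _ in range(n_per_group)]
            group_b = [random.gauss(0, 1) for _ in range(n_per_group)]
            if approx_p_value(group_a, group_b) < 0.05:
                hits += 1
                break  # the p-hacker stops at the first "significant" outcome
    return hits / n_trials

print(f"1 outcome tested  : ~{false_positive_rate(1):.0%} false positives")
print(f"20 outcomes tested: ~{false_positive_rate(20):.0%} false positives")
```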

Multiple studies have documented p-hacking and its consequences. For example, a large collaboration of researchers replicated 100 psychology studies from prestigious journals. Of the 97 that originally reported p < 0.05, only about a third produced statistically significant results when repeated. (Not all fields are as affected as psychology.)

Reform Rather Than Rejection
“In many research fields, the widespread use of questionable research practices has jeopardized the credibility of scientific findings.”
Source: “Big little lies: a compendium and simulation of p-hacking strategies”

The examples in this article are not exhaustive. Others could be cited. The goal is not to discredit researchers.

Despite these dysfunctions, science remains our best tool to understand the world. Encouragingly, some initiatives are emerging: preregistration of study protocols, mandatory sharing of raw data, and efforts to develop better-suited appraisal tools (than AMSTAR 2, for instance).

Today, the problem is not ignorance of the flaws but how to address them—and, frankly, resistance to change.

Should we reject science? No. But these revelations call for a more critical reading of studies, especially in controversial fields such as vaping. Between dogmatic conclusions and blind skepticism lies a middle path: science aware of its own limits.