Judith: Welcome to Berry's In the
Interim podcast, where we explore the
cutting edge of innovative clinical
trial design for the pharmaceutical and
medical industries, and so much more.
Let's dive in.
Scott Berry: Well, welcome back to In the Interim. I'm Scott Berry, I'm your host. I have an interesting topic for today. I used to write a column for CHANCE Magazine, a quarterly column called "A Statistician Reads the Sports Pages," consuming and talking about things that show up in sports and the statistical analysis of them. I did that for about 10 years, actually, and it was very rewarding. So I'm gonna do that today, and this is "A Statistician Reads JAMA."
So I'm gonna tell you about an experience of opening JAMA and giving a read to a relatively random clinical trial whose results I read. It sparked a number of things I thought very much worth discussing.
By the way, for those of you out there, whether you're driving to work in the morning or the afternoon, you're out for your daily run, or however you consume In the Interim, I'd love to hear from you. Let me know what kind of topics you would like to talk about, or particular people you'd like on In the Interim. I'd love to hear what you'd like to hear about. Here at Berry Consultants, we have a company of about 35 scientists, mostly statistical scientists. We work on clinical trial design. We work on implementing adaptive trials. We have software for simulating trials. We work in a wide range of therapeutic areas, and yes, we focus on innovative trial designs, adaptive trials, Bayesian statistics. So I'd love to hear what kind of topics you'd like to hear about on In the Interim.
So here we go.
So this really happened.
I got an email from JAMA; they list different articles and results of different trials. Unfortunately, I don't get to read these as much as I would like, but one showed up and I thought, you know what, I'm just gonna read one of these and see what these trials look like. So this trial is the FAIR-HF2 trial. The primary author is Anker, so Anker et al., and it was published online March 30th, 2025, in JAMA as an original investigation.
So if at all what I talk about is interesting, yes, please go check out the trial. I'm not involved in the trial, I had nothing to do with it, and I only vaguely know a couple of the authors, so I don't have any involvement in this trial.
So what is the question in the trial? It says this right in JAMA: what are the efficacy and safety of intravenous ferric carboxymaltose? I'm going to refer to this as intravenous iron supplement; the paper refers to it that way. This is for patients with heart failure and iron deficiency. So is this intravenous iron supplement, for patients with heart failure who have an iron deficiency, effective and safe? That's the question in the trial.
Okay, so very interesting.
So the design, the setting, and the participants: it's a multicenter trial, randomized one-to-one to the iron supplement or placebo. Patients have heart failure, defined as having an LVEF, left ventricular ejection fraction, less than or equal to 45%, and having an iron deficiency. If you want to know what levels define iron deficiency, I'll let you go to the paper to see that. By the way, they define a particular highly deficient group, and I'll say something about that as well. So everybody's iron deficient, and there's a particular subgroup that is highly deficient.
It enrolled at 70 clinical sites in six European countries, from March 2017 to November of 2023. Median follow-up of patients was 16 months in the trial. Okay, so again, for any of the details, please read the article. It sounds very interesting. The trial enrolled 1,105 patients, so a rather large trial from that perspective, a large and long trial.
The primary endpoint in the trial is a bit interesting, and I'll try to lay this out a little bit. The primary endpoint is looking at cardiovascular death and heart failure hospitalization, very common endpoints in heart failure trials, and it's going to analyze three different primary endpoints, where each is an endpoint tied to an analysis in the trial.
So the first one is the time to
first cardiovascular death or
heart failure hospitalization.
Again, a very common way to analyze
in heart failure trials is a time to
event, so they're doing standard time
to event analyses for that endpoint.
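Just to make that concrete, here's a minimal sketch of a standard time-to-event analysis, a Cox proportional hazards model, fit on simulated stand-in data. This is my illustration of the general technique, not the trial's actual analysis code, and the column names and numbers are invented.

```python
# Minimal sketch of a standard time-to-event analysis (Cox proportional hazards)
# on simulated, illustrative data -- not the trial's data or code.
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(1)
n = 400
treatment = rng.integers(0, 2, n)
# exponential event times with a lower hazard on treatment (purely illustrative)
event_time = rng.exponential(scale=np.where(treatment == 1, 60.0, 48.0))
censor_time = np.full(n, 16.0)                 # administrative censoring at 16 months
df = pd.DataFrame({
    "months": np.minimum(event_time, censor_time),
    "event": (event_time <= censor_time).astype(int),   # 1 = CV death or HF hospitalization
    "treatment": treatment,                              # 1 = IV iron, 0 = placebo
})

cph = CoxPHFitter()
cph.fit(df, duration_col="months", event_col="event")
print(cph.hazard_ratios_["treatment"])   # estimated hazard ratio, treatment vs placebo
cph.print_summary()                       # coefficient, confidence interval, p-value
```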
Additionally, they're doing an analysis of the rate of total hospitalizations. A patient could have multiple heart failure hospitalizations, and they're using rather standard analysis techniques for count data, a negative binomial analysis, accounting for the exposure the patients have.
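Again as a sketch of the general technique rather than the trial's analysis: a negative binomial regression for hospitalization counts, with follow-up time entering as an exposure offset. The data and the dispersion parameter here are invented for illustration.

```python
# Sketch of a negative binomial model for recurrent-event counts with a
# follow-up (exposure) offset; data and dispersion are illustrative only.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 400
treatment = rng.integers(0, 2, n)
followup_years = rng.uniform(0.5, 2.0, n)            # person-years of exposure
rate = np.where(treatment == 1, 0.25, 0.31)           # hospitalizations per patient-year
counts = rng.poisson(rate * followup_years)           # stand-in for the observed counts

X = sm.add_constant(pd.DataFrame({"treatment": treatment}))
fit = sm.GLM(counts, X,
             family=sm.families.NegativeBinomial(alpha=0.5),  # dispersion fixed by assumption
             offset=np.log(followup_years)).fit()
print(np.exp(fit.params["treatment"]))   # estimated rate ratio, treatment vs placebo
```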
The third one is analyzing the first endpoint that I described, time to first cardiovascular death or hospitalization, but restricted to that subgroup I described that is highly deficient. So those are the three analyses that are gonna define the primary set of analyses in the trial.
Yes, they're gonna control the overall family-wise error rate across all three of those analyses. So sort of co-primary, if you will, in the sense that any one of those could potentially be successful, and they adjust for that.
Okay, so they use a Hochberg procedure for analyzing those three, I'll call them endpoints. I struggle with that a little bit, because I like to think of the endpoint as time to heart failure hospitalization or cardiovascular death, and how you analyze it or the subgroups aren't really the endpoint, but I'll describe it that way. I think it'll be easier to describe it.
So they are analyzing these three analyses, and they've set up a procedure where they will refer to this as statistically significant if any one of the following three things happens. I was going to do this in the one-sided sense, since this is all about superiority, but they describe it in the paper as two-sided, so I'll do it two-sided. First, if all three are less than 0.05, two-sided, then the trial's statistically significant and they've demonstrated superiority. The second opportunity is that if two of them are significant at 0.025, so that's half of the original, if two of them meet 0.025, then the trial demonstrates statistical significance. And if any one of them is significant at 0.0167, 1.67% two-sided, then it demonstrates statistical significance.
So this is sort of three shots
on goal where they're looking
at three different analyses.
Again: time to cardiovascular death or hospitalization; the number of heart failure hospitalizations; and the subgroup, where they analyze time to first cardiovascular death or heart failure hospitalization.
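For anyone who wants to see the mechanics, here's a small sketch of that Hochberg step-up rule for three p-values at an overall two-sided alpha of 0.05; the thresholds 0.05, 0.025, and roughly 0.0167 fall out of it directly. This is my reconstruction of the rule as described, not code from the SAP, and the example p-values are invented.

```python
# Hochberg step-up procedure for three hypotheses at overall alpha = 0.05.
# Returns the indices of rejected hypotheses; the trial is declared
# "statistically significant" if at least one hypothesis is rejected.
def hochberg(p_values, alpha=0.05):
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i], reverse=True)  # largest p first
    for step, i in enumerate(order):
        threshold = alpha / (step + 1)               # alpha, alpha/2, alpha/3, ...
        if p_values[i] <= threshold:
            # reject this hypothesis and every one with a smaller (or equal) p-value
            return {j for j in range(m) if p_values[j] <= p_values[i]}
    return set()

# Invented p-values showing the three routes to "significance":
print(hochberg([0.030, 0.040, 0.045]))   # all three <= 0.05  -> all rejected
print(hochberg([0.020, 0.024, 0.200]))   # two <= 0.025       -> two rejected
print(hochberg([0.010, 0.200, 0.300]))   # one <= ~0.0167     -> one rejected
```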
Okay, so the SAP is published as part of it. It's a very well written SAP. The design is reasonably standard. I found no evidence of any adaptations in the trial; they enrolled 1,105 patients and carried out the primary analysis. I'm sure there was a DSMB and safety was reviewed, things like that, but a very traditional trial. And a good trial, and I'm going to talk about the publication of this and the results of this.
The authors should be commended on this trial, and the patients involved in the trial deserve praise. So I hope in no way does this come across as negative for the people who ran, conducted, and published this trial. But I want to dive into the science of it. I wanna push a little bit on the science of it, and I want to give my reading as a statistician who randomly picked up this article and read it.
Okay.
So that's the structure that was set up. What happened in the trial? The trial enrolled 1,105 patients; again, it was randomized and double blind. In this setting, the first analysis is time to cardiovascular death or heart failure hospitalization. They report this as the number of events per hundred patient-years, just as a way to summarize it. In the paper, in the treatment group it is 16.7, and in the placebo group it's 21.9 per hundred patient-years, so 16.7 to 21.9. 141 of the 558 patients on the treatment had an event, and 166 of 547 on placebo. Reasonably similar sample sizes, so 141 and 166.
The hazard ratio in the time-to-event analysis is 0.79. The two-sided P value is 0.04; one-sided it would be 0.02 for superiority. The hazard ratio of 0.79 is showing the treatment did better. And notice that doesn't meet significance by itself in the setting of the Hochberg procedure. If that had been the lone primary analysis, it would be statistically significant.
Now what happened to the other endpoints? It met 0.05, so if all three meet 0.05, the trial will be considered significant. The total heart failure hospitalizations: 264 in the treatment group and 320 in the placebo group. Now, that's adding across patients; it matters how many patients have 0, 1, 2, 3 hospitalizations, and so on. The relative risk in that analysis is 0.80. Again, a benefit for the treatment, a 20% relative risk reduction in heart failure hospitalizations. The two-sided P value is 0.12.
The third, which analyzed the subgroup of patients that met this high need, and this was the endpoint of cardiovascular death or heart failure hospitalization, time to first event, showed a hazard ratio of 0.79 also, exactly the same as the primary analysis. The confidence interval's a little wider and the P value's 0.07.
So let's think back to the Hochberg procedure. Do all of them meet 0.05? They don't; the first one did, the other two didn't. Do two of them meet 0.025? No, actually none of them meet 0.025, and none of them met 0.0167. So according to the primary analysis methodology, the controlling of the experiment-wide type one error rate, this trial is not significant. Statistical significance was not shown.
Wow. Again, a very interesting result, where time to cardiovascular death or heart failure hospitalization showed a nominally significant P value. For example, the confidence interval shown goes from 0.63 to 0.99; that's the 95% confidence interval.
So what is the conclusion in the trial? I'm reading this, I'm looking at the data, I'm looking at the results. The conclusions and relevance in the paper: in patients with heart failure and iron deficiency, iron supplement did not significantly reduce the time to first heart failure hospitalization or cardiovascular death in the overall cohort or in patients with transferrin saturation less than 20%, or reduce the total number of heart failure hospitalizations, versus placebo. And in the little figure they show that talks about the population, it's really very nice, the cartoon of the article, the conclusion says that iron supplement was well tolerated but did not significantly improve outcomes compared with placebo in patients with heart failure and iron deficiency.
Okay. So what does a statistician make of this? And by the way, this is just me as the statistician; I'm sure other statisticians have a very different reaction than me. So I read this and I really, really struggled with it from several points, so let me dive into the points.
The first part of this is the scientific struggle I have that we report trials as black and white. They're significant or they're not. And in this trial, the only reason it's not significant is because that overall cohort analysis of cardiovascular death and heart failure hospitalization was part of a multiple testing procedure. So it was incredibly close to being significant. But it wasn't, and the conclusion of this trial is that iron supplement doesn't help patients, doesn't change the clinical outcome of patients.
A Bayesian analysis, say, of that primary endpoint, assuming a non-informative prior, would say there's a 98% probability that iron supplement benefits time to cardiovascular death or heart failure hospitalization.
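As a sketch of where a number like that comes from, assuming a normal approximation to the log hazard ratio and a flat prior (my assumption, not an analysis from the paper), you can back the posterior probability of benefit out of the reported hazard ratio of 0.79 and its 95% confidence interval of 0.63 to 0.99.

```python
# Rough posterior probability that the true hazard ratio is below 1, using a
# normal approximation on the log hazard ratio and a flat (non-informative)
# prior. Inputs are the published point estimate and 95% CI; the approximation
# is a back-of-the-envelope sketch, not the paper's analysis.
import math
from scipy.stats import norm

hr, lo, hi = 0.79, 0.63, 0.99                      # reported HR and 95% CI
log_hr = math.log(hr)
se = (math.log(hi) - math.log(lo)) / (2 * 1.96)    # SE implied by the CI width

# With a flat prior, the posterior for log(HR) is approximately Normal(log_hr, se^2).
p_benefit = norm.cdf(0.0, loc=log_hr, scale=se)    # P(log HR < 0), i.e. P(HR < 1)
print(round(p_benefit, 3))                         # comes out around 0.98
```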
So any way you want to read
this, there's gray area to this.
Every trial typically has gray
area to this, but yet when we
publish it, it's all or nothing.
In this scenario, the conclusion is the same as if the data were identical in the two groups and the hazard ratio were one. Even if it showed harm, the conclusion would be the same, that the treatment doesn't benefit. Really, conclusions say one thing: the treatment benefited or it didn't. I think it's a gross simplification of a six-year trial of 1,100 patients, but I understand how we got there.
And by the way, statisticians share some blame in how we got there. We reinforce hypothesis testing and a type one error of 5%, and you can't say anything if you don't reject a null, and that's the way we analyze trials. This trial: I bet if you flip three deaths in this trial, it meets the Hochberg procedure. And our conclusion to clinicians reading this article is binary: either it doesn't benefit or it benefits. Now, we hope anybody reading this dives into the data, looks at the results, thinks about other trials, thinks about the treatment, and makes a decision based on it. But I have to believe the conclusion from JAMA makes a huge dent in any clinician reading this article.
So the black-and-whiteness of trials, I just feel like, scientifically, I really struggle with it, and I accept that as a member of the statistical community, we're probably partly to blame for this dogmatic approach to hypothesis testing. It's black and white, and we're gonna come down hard on you if you interpret it any other way than that.
Okay. Now, for those statisticians out there, I have nothing against the Hochberg procedure, and I know that we design trials, we do FDA trials, where a 5%, or 2.5% one-sided, test is the standard. And we live by that, and we do Hochberg procedures. So I don't have anything against that, but I really struggle with the likelihood principle aspect of this.
The data in this scenario are exactly the same as in another scenario, one where, say, that first analysis had been the lone primary endpoint tested at 0.05, in which the paper would say it's statistically significant and this iron supplement benefits patients with heart failure and iron deficiency. Exact same data set. And the likelihood principle says that if we have exactly the same data in two different scenarios, our conclusions should be the same. If you're a Bayesian, the posterior probabilities are identical for those two trials. The Bayesian machinery satisfies the likelihood principle by the nature of Bayes' theorem.
So I really struggle with this part of it, and I get it from a type one error standpoint. Being a Bayesian, I don't think type one error is the be-all and end-all, and it flips this result for clinicians reading it, and I really struggle with that.
Okay. Now, given the rules we play by, and they knew the rules they played by, they wrote them in this article, the SAP lays out superiority. They knew what this situation would mean. If somebody had simulated this trial and shown them this result, and by the way, that's a huge value of simulation, and they had said, good, that's the result we want, great. But I'd be really surprised if that's the result they wanted. In this trial, there are no adaptations.
Could this trial have been adaptive? Could we have seen that result coming? Could it have been bigger? Now, I recognize this was a six-year trial, and maybe with the funding of this it couldn't have been bigger and this is just the way it is, and then the investigators would say, yes, that's the result we want. But I think it's one of those scenarios where, had this trial been six months longer, had it enrolled 200 more patients, would it have changed clinical practice? Does this paper change clinical practice? Should it change clinical practice? Would it have changed clinical practice if the trial were six months longer, 200 patients bigger? That's a little bit of the struggle here.
And so I just bring up, I hate to say it, but 1,100 patients and six years, is it wasted, in a way, because we do black and white? Now, we shouldn't do black and white, and I'll talk more about what shades of gray would look like, but these are the rules that everybody plays by at this point. Journals play by this. We kind of know the rules going in. Could it have been adaptive?
The other struggle I have is that this is about science. This is about recommending treatments to patients who could potentially benefit. And I would think that if the truth of this is a hazard ratio of 0.8 on heart failure hospitalization and cardiovascular death, this is a clinically important treatment. But we only analyze data in the trial. We are stuck on that, by the way; I think that's reflective of frequentist approaches. We analyze the data in the trial. We calculate the probability that the data are as extreme or more extreme than what we saw, assuming the null. That's the P value, 0.04. But there's science here; we typically know more.
Part of my struggle is what's next. This trial, if you're into giving adjectives, is borderline significant. It's very close to being statistically significant. Do we need another trial that's six years and 1,100 patients, and get the P value of that next trial below 0.05, or below the Hochberg thresholds? Or would 200 patients do? Potentially. I said, suppose this trial were bigger by 200 patients or 300 patients, would it change clinical practice? But no, we can't do that. The next trial designed would only analyze that trial. There's something incredibly frustrating about that.
Now I want to give you a different potential scenario. Suppose this was a novel treatment. I assume that nobody owns the rights to this; nobody has patent life on an iron supplement, and this is all about treating patients. But suppose this was a novel treatment in heart failure, and a company ran this trial exactly like this, and they don't get significance.
And they go to a regulatory agency, they go to the EMA, they go to the PMDA, they go to the US FDA, and the agency says, you know, we just can't approve it; it's not enough based on that trial. Does that company need to run another trial of 1,500 patients? Can we say, look, there's information there, the next trial doesn't need to be as big, we're really close to approving this, but we just can't do it yet? And there are examples of this. Do we start over? It seems like bad science.
Now, the FDA is absolutely doing this. There are scenarios where they use the results of one trial combined with the results of another trial. You can look up Rebyota, approved in 2023, Ferring Pharmaceuticals. There are multiple devices that have been approved this way. There are multiple scenarios I know of where we're working with the agency, or have designed trials, where the design uses the previous results, recognizing that, boy, it's really close; we shouldn't need 1,500 patients after this for approval.
So I want you to think about the medical community in that scenario, where we're so focused on single trials: what about combining the results together? With all of this, my biggest issue as a statistician reading it is that I don't believe the conclusions. I think they're wrong. Now, maybe it's just me, but I think this treatment works, and let me give you a little bit of why.
So when I read the article, first of all, the statistics are compelling to me. That the treatment works is highly likely just based on the trial. If we're stuck on only this trial, a 98% probability for a really clinically important outcome is really valuable. In a scenario like this, this isn't about an FDA approval, which has its own regulatory standards, and I know we've gotta go by those; this is about the next patient that walks in the door. If it were me, I'd want the iron supplement. I think it works. It's highly likely to work.
But the other thing I thought is, okay, you know, we do get type one errors. We do get scenarios where we get a hazard ratio and a confidence interval like this and the treatment doesn't work. So, as a statistician, I wanna know what other information is out there and what we know about this.
Well, there have been previous large trials run, and actually you can't find much about them in the trial report. There's a little bit in the JAMA article, so I don't want to say there isn't any; it talks about the rationale and the remaining uncertainty about this. And so there is a trial, the HEART-FID trial, that looked at time to cardiovascular death or heart failure hospitalization.
All three of the trials I'm gonna tell you about that have already been published use that same endpoint, time to first cardiovascular death or heart failure hospitalization. And that trial showed a hazard ratio of 0.93 for that endpoint, with a confidence interval that went up to 1.06. So that probably had a P value something like, one-sided, 0.1; two-sided, you know, maybe 0.15, and maybe a one-sided 0.075. I didn't go find the article, but I found the summary of it.
So a high hazard ratio of 0.93. The IRONMAN trial, a great name for a trial of an iron supplement, had a hazard ratio of 0.84 for that same endpoint, where the upper bound of the confidence interval is 1.02. Borderline significant, but not; that trial did not demonstrate clinical benefit, but a hazard ratio of 0.84. So 0.93 and 0.84. And the AFFIRM-AHF trial had a hazard ratio of 0.80, with a confidence interval going up to 0.98. They also report heart failure hospitalizations for three of them, I'm sorry, for two of them: 0.80 and 0.74, the other primary endpoint in the FAIR-HF2 trial.
So walking into this trial, we've got three other trials that demonstrate 0.93, 0.84, and 0.80 for the hazard ratio on that primary endpoint. All positive: one of them significant, one of them borderline significant, and another with 1.06 for the upper bound. And now this trial: the hazard ratio for that endpoint is 0.79 with a confidence interval upper bound of 0.99.
Putting that information altogether, as the read of a statistician, this treatment works. We are sitting there with four trials, all analyzed separately.
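Just to show what combining the results could look like mechanically, here's a rough fixed-effect meta-analysis sketch on the log hazard ratio scale. It uses the hazard ratios and upper confidence bounds quoted in this episode to back out approximate standard errors; the normal approximation and the arithmetic are mine, not a reproduction of the published meta-analysis.

```python
# Back-of-the-envelope fixed-effect (inverse-variance) meta-analysis of the
# time-to-first-event hazard ratios quoted in the episode. Standard errors are
# approximated from each trial's reported HR and upper 95% bound; this is a
# sketch of the technique, not the published meta-analysis.
import math
from scipy.stats import norm

trials = {                      # name: (hazard ratio, upper 95% bound)
    "HEART-FID":  (0.93, 1.06),
    "IRONMAN":    (0.84, 1.02),
    "AFFIRM-AHF": (0.80, 0.98),
    "FAIR-HF2":   (0.79, 0.99),
}

weights, weighted_logs = [], []
for hr, hi in trials.values():
    log_hr = math.log(hr)
    se = (math.log(hi) - log_hr) / 1.96        # SE implied by the upper bound
    w = 1.0 / se ** 2
    weights.append(w)
    weighted_logs.append(w * log_hr)

pooled_log_hr = sum(weighted_logs) / sum(weights)
pooled_se = 1.0 / math.sqrt(sum(weights))
print(round(math.exp(pooled_log_hr), 2))                            # pooled hazard ratio
print(round(norm.cdf(0.0, loc=pooled_log_hr, scale=pooled_se), 4))  # rough P(HR < 1)
```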
There is a meta-analysis published.
Does it move clinicians?
So I don't know the answer to that.
Other people can tell me that.
But if I randomly pick up this article and I read it, it says this treatment doesn't work. And boy, JAMA is an incredible journal. I know multiple editors for JAMA; they do an incredible job with it. And I know exactly how we got here, and this is not unique; we see this commonly. I just struggle with it. I don't think the conclusion is right. I think it's actually highly likely the conclusion is wrong.
Do people read the conclusions? That's my struggle. So how could the world be different? You know, okay, Scott, so what would be different here? What can we propose as different ways to do this?
Well, let's suppose we didn't think of the trial as being this black and white, where we do a significance test and, if it's significant, we all wave flags and we all celebrate and we publish it and it changes clinical practice, and if it's not significant, it doesn't change clinical practice. What would be different?
What if the trial reported the posterior probability that the treatment is superior to the control, and it didn't put an adjective on it? It doesn't say significant, borderline significant, highly significant, three asterisks on it. We don't have to do that. 98% is the adjective, and it allows somebody to consume the data if they want to only look at that trial.
Okay? A 98% probability of superiority for the primary endpoint. By the way, that satisfies the likelihood principle in a scenario where we're otherwise overly stuck on type one error and whether it's significant. What if that's the report, and the little cartoon in the front of JAMA says this trial demonstrated a 98% probability that it benefits time to first cardiovascular death or heart failure hospitalization?
Now, the first thing a frequentist is gonna say is, well, yeah, but now you've got a prior for that. An important part of this, the important part, is where this sits in the science, and I think the article in the Journal of the American Medical Association shouldn't just report on that single trial.
The trial should prospectively define a relatively non-informative prior, so that the data speak for themselves; we know how to do that, and it's common that we do. Here that would give a 98% posterior probability. The trial prospectively defines a pessimistic prior. It prospectively defines an optimistic prior. This would be relatively easy to do.
It also specifies a prior based on the current summary of scientific information: it uses the meta-analysis that was published and says, based on the previous data, here's what you get. It might even have several of these, based only on trial one, or only on trial two, so if you don't like trial three and you like trial one, here are several priors, and somebody can use those products.
Here's the probability, and my guess in this scenario is that it's a 99.9% probability that this treatment is beneficial if you use that summary of the information, updated, which is what Bayesians do, based on this trial. It allows the reader to judge it on that, and it never says it statistically significantly affects the clinical outcome; it says there's a 99.3% probability this benefits, or that this trial demonstrated a 98% probability that it benefits. A pessimistic prior would be 94%. An optimistic prior is 99.3%. I'm making these numbers up, just guessing what they might be. And the prior based on a summary of the other trials together with this one, that's where I think this would be a 99.97% probability of benefit, something like that.
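To make the several-priors idea concrete, here's a small sketch that updates the FAIR-HF2 result under a flat prior, a pessimistic prior, an optimistic prior, and a previous-trials prior, all as normal distributions on the log hazard ratio. The specific prior means and standard deviations are invented for illustration; a real SAP would prespecify them.

```python
# Posterior probability of benefit under several priors, via conjugate
# normal-normal updating on the log hazard ratio. The likelihood comes from
# the FAIR-HF2 numbers quoted in the episode; every prior below is an invented
# example of the kind of prior a trial could prespecify, not from the paper.
import math
from scipy.stats import norm

like_mean = math.log(0.79)                                  # observed log HR
like_se = (math.log(0.99) - math.log(0.63)) / (2 * 1.96)    # SE implied by the 95% CI

priors = {                     # name: (prior mean on log HR, prior SD) -- assumptions
    "flat (non-informative)": (0.0, 10.0),
    "pessimistic":            (math.log(1.00), 0.15),
    "optimistic":             (math.log(0.85), 0.15),
    "previous trials":        (math.log(0.87), 0.06),   # stand-in for a meta-analytic prior
}

for name, (m0, s0) in priors.items():
    # precision-weighted combination of prior and likelihood
    post_prec = 1 / s0 ** 2 + 1 / like_se ** 2
    post_mean = (m0 / s0 ** 2 + like_mean / like_se ** 2) / post_prec
    post_sd = math.sqrt(1 / post_prec)
    print(f"{name}: P(HR < 1) = {norm.cdf(0.0, loc=post_mean, scale=post_sd):.3f}")
```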
As a statistician reading it, I'd be much more comfortable if, when I read this article, that's the advice it's providing to clinicians. Where right now, when I read that article, I really struggle, and boy, I hope readers look at the data and I hope that they consume all of this information. It's hard in this setting.
So this is my read of a random article that sort of stuck with me, the results of it, and it stuck with me largely as somebody who has spent 25 years in clinical trials doing publicly funded trials, privately funded sponsor trials, NIH-funded trials, patient-organization-funded trials, comparative effectiveness trials. You know, I struggle with the scientific outcome of this.
So we are not in the interim here; we are at the end of the trial. Thinking of it, maybe this could have been in the interim, and the trial could have been adaptive in the world we live in. But also think about the world we live in: could we do things differently as we move forward in this and we get more and more results? Could this look different?
So I am Scott Berry in the interim,
and until the next interim, thanks.