Judith: Welcome to Berry's In the
Interim podcast, where we explore the
cutting edge of innovative clinical
trial design for the pharmaceutical and
medical industries, and so much more.
Let's dive in.
Scott Berry: Well, welcome back to In the Interim. I'm Scott Berry, I'm your host. I have an interesting topic for today. I used to write a column for CHANCE Magazine, a quarterly column called "A Statistician Reads the Sports Pages," consuming and talking about things that show up in sports and the statistical analysis of them. I did that for about 10 years, actually, and it was very rewarding. So I'm gonna do that today, and this is "A Statistician Reads JAMA."
So I'm gonna tell you about an experience of opening JAMA and giving a read to a relatively random clinical trial whose results I read. It sparked a number of things I thought very much worth discussing.
By the way, for those of you out there, whether you're driving to work in the morning or the afternoon, you're out for your daily run, or however you consume In the Interim, I'd love to hear from you. Let me know what kind of topics you would like to talk about, or particular people you'd like on In the Interim. I'd love to hear what you'd like to hear about. Here at Berry Consultants, we have a company of about 35 scientists, mostly statistical scientists. We work on clinical trial design. We work on implementing adaptive trials. We have software for simulating trials. We work in a wide range of therapeutic areas, and yes, we focus on innovative trial designs, adaptive trials, Bayesian statistics. So I'd love to hear what kind of topics you'd like to hear about on In the Interim.
So here we go.
So this really happened.
I got an email from JAMA; they list different articles and results of different trials. Unfortunately, I don't get to read these as much as I would like, but one showed up and I thought, you know what, I'm just gonna read one of these and see what these trials look like. So this trial is the FAIR-HF2 trial. The primary author is Anker, so Anker et al., and it was published online March 30th, 2025, in JAMA as an original investigation.
So if at all what I talk about is interesting, yes, please go check out the trial. I'm not involved in the trial, I had nothing to do with it, and I only vaguely know a couple of the authors, so I don't have any involvement in this trial.
So what is the question in the trial? It says this right in JAMA: what are the efficacy and safety of intravenous ferric carboxymaltose? I'm going to refer to this as intravenous iron supplement; the paper refers to it that way. This is for patients with heart failure and iron deficiency. So is this intravenous iron supplement, for patients with heart failure who have an iron deficiency, effective and safe? That's the question in the trial.
Okay, so very interesting.
So the design, the setting, and the participants: it's a multicenter trial, randomized one-to-one to the iron supplement or placebo. Patients have heart failure, defined as having an LVEF, left ventricular ejection fraction, less than or equal to 45%, and having an iron deficiency. If you want to know what levels define iron deficiency, I'll let you go to the paper to see that. By the way, they define a particular highly deficient group, and I'll say something about that as well. So everybody's iron deficient, and there's a particular subgroup that is highly deficient.
It enrolled at 70 clinical sites in six European countries, from March 2017 to November of 2023. Median follow-up of patients was 16 months in the trial. Okay, so again, for any of the details, please read the article. It sounds very interesting. The trial enrolled 1,105 patients, so a rather large trial from that perspective, a large and long trial.
The primary endpoint in the trial is a bit interesting, and I'll try to lay this out a little bit. The primary endpoint is looking at cardiovascular death and heart failure hospitalization, very common endpoints in heart failure trials, and it's going to analyze three different primary endpoints, where each is an endpoint tied to an analysis in the trial.
So the first one is the time to
first cardiovascular death or
heart failure hospitalization.
Again, a very common way to analyze
in heart failure trials is a time to
event, so they're doing standard time
to event analyses for that endpoint.
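Just to make that concrete, here's a minimal sketch of a standard time-to-event analysis, a Cox proportional hazards model, fit on simulated stand-in data. This is my illustration of the general technique, not the trial's actual analysis code, and the column names and numbers are invented.

```python
# Minimal sketch of a standard time-to-event analysis (Cox proportional hazards)
# on simulated, illustrative data -- not the trial's data or code.
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(1)
n = 400
treatment = rng.integers(0, 2, n)
# exponential event times with a lower hazard on treatment (purely illustrative)
event_time = rng.exponential(scale=np.where(treatment == 1, 60.0, 48.0))
censor_time = np.full(n, 16.0)                 # administrative censoring at 16 months
df = pd.DataFrame({
    "months": np.minimum(event_time, censor_time),
    "event": (event_time <= censor_time).astype(int),   # 1 = CV death or HF hospitalization
    "treatment": treatment,                              # 1 = IV iron, 0 = placebo
})

cph = CoxPHFitter()
cph.fit(df, duration_col="months", event_col="event")
print(cph.hazard_ratios_["treatment"])   # estimated hazard ratio, treatment vs placebo
cph.print_summary()                       # coefficient, confidence interval, p-value
```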
Additionally, they're doing an analysis of the rate of total hospitalizations. A patient could have multiple heart failure hospitalizations, and they're using rather standard analysis techniques for count data, a negative binomial analysis, accounting for the exposure the patients have.
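Again as a sketch of the general technique rather than the trial's analysis: a negative binomial regression for hospitalization counts, with follow-up time entering as an exposure offset. The data and the dispersion parameter here are invented for illustration.

```python
# Sketch of a negative binomial model for recurrent-event counts with a
# follow-up (exposure) offset; data and dispersion are illustrative only.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 400
treatment = rng.integers(0, 2, n)
followup_years = rng.uniform(0.5, 2.0, n)            # person-years of exposure
rate = np.where(treatment == 1, 0.25, 0.31)           # hospitalizations per patient-year
counts = rng.poisson(rate * followup_years)           # stand-in for the observed counts

X = sm.add_constant(pd.DataFrame({"treatment": treatment}))
fit = sm.GLM(counts, X,
             family=sm.families.NegativeBinomial(alpha=0.5),  # dispersion fixed by assumption
             offset=np.log(followup_years)).fit()
print(np.exp(fit.params["treatment"]))   # estimated rate ratio, treatment vs placebo
```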
The third one is analyzing the first endpoint that I described, time to first cardiovascular death or hospitalization, but restricted to that subgroup I described that is highly deficient. So those are the three analyses that are gonna define the primary set of analyses in the trial.
Yes, they're gonna control the overall family-wise error rate across all three of those analyses. So sort of co-primary, if you will, in the sense that any one of those could potentially be successful, and they adjust for that.
Okay, so they use a Hochberg procedure for analyzing those three, I'll call them endpoints. I struggle with that a little bit, because I like to think of the endpoint as time to heart failure hospitalization or cardiovascular death, and how you analyze it or the subgroups aren't really the endpoint, but I'll describe it that way. I think it'll be easier to describe it.
So they are analyzing these three analyses, and they've set up a procedure where they will refer to this as statistically significant if any one of the following three things happens. I was going to do this in the one-sided sense, since this is all about superiority, but they describe it in the paper as two-sided, so I'll do it two-sided. First, if all three are less than 0.05, two-sided, then the trial's statistically significant and they've demonstrated superiority. The second opportunity is that if two of them are significant at 0.025, so that's half of the original, if two of them meet 0.025, then the trial demonstrates statistical significance. And if any one of them is significant at 0.0167, 1.67% two-sided, then it demonstrates statistical significance.
So this is sort of three shots
on goal where they're looking
at three different analyses.
Again: time to cardiovascular death or hospitalization; the number of heart failure hospitalizations; and the subgroup, where they analyze time to first cardiovascular death or heart failure hospitalization.
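For anyone who wants to see the mechanics, here's a small sketch of that Hochberg step-up rule for three p-values at an overall two-sided alpha of 0.05; the thresholds 0.05, 0.025, and roughly 0.0167 fall out of it directly. This is my reconstruction of the rule as described, not code from the SAP, and the example p-values are invented.

```python
# Hochberg step-up procedure for three hypotheses at overall alpha = 0.05.
# Returns the indices of rejected hypotheses; the trial is declared
# "statistically significant" if at least one hypothesis is rejected.
def hochberg(p_values, alpha=0.05):
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i], reverse=True)  # largest p first
    for step, i in enumerate(order):
        threshold = alpha / (step + 1)               # alpha, alpha/2, alpha/3, ...
        if p_values[i] <= threshold:
            # reject this hypothesis and every one with a smaller (or equal) p-value
            return {j for j in range(m) if p_values[j] <= p_values[i]}
    return set()

# Invented p-values showing the three routes to "significance":
print(hochberg([0.030, 0.040, 0.045]))   # all three <= 0.05  -> all rejected
print(hochberg([0.020, 0.024, 0.200]))   # two <= 0.025       -> two rejected
print(hochberg([0.010, 0.200, 0.300]))   # one <= ~0.0167     -> one rejected
```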
Okay, so the SAP is published as part of it. It's a very well written SAP. The design is reasonably standard. I found no evidence of any adaptations in the trial; they enrolled 1,105 patients and carried out the primary analysis. I'm sure there was a DSMB and safety was reviewed, things like that, but a very traditional trial. And a good trial, and I'm going to talk about the publication of this and the results of this.
The authors should be commended on this trial, and the patients involved in the trial deserve praise. So I hope in no way does this come across as negative for the people who ran, conducted, and published this trial. But I want to dive into the science of it. I wanna push a little bit on the science of it, and I want to give my reading as a statistician who randomly picked up this article and read it.
Okay.
So that's the structure that was set up. What happened in the trial? The trial enrolled 1,105 patients; again, it was randomized and double blind. In this setting, the first analysis is time to cardiovascular death or heart failure hospitalization. They report this as the number of events per hundred patient-years, just as a way to summarize it. In the paper, in the treatment group it is 16.7, and in the placebo group it's 21.9 per hundred patient-years, so 16.7 to 21.9. 141 of the 558 patients on the treatment had an event, and 166 of 547 on placebo. Reasonably similar sample sizes, so 141 and 166.
The hazard ratio in the time-to-event analysis is 0.79. The two-sided P value is 0.04; one-sided it would be 0.02 for superiority. The hazard ratio of 0.79 is showing the treatment did better. And notice that doesn't meet significance by itself in the setting of the Hochberg procedure. If that had been the lone primary analysis, it would be statistically significant.
Now what happened to the other endpoints? It met 0.05, so if all three meet 0.05, the trial will be considered significant. The total heart failure hospitalizations: 264 in the treatment group and 320 in the placebo group. Now, that's adding across patients; it matters how many patients have 0, 1, 2, 3 hospitalizations, and so on. The relative risk in that analysis is 0.80. Again, a benefit for the treatment, a 20% relative risk reduction in heart failure hospitalizations. The two-sided P value is 0.12.
The third, which analyzed the subgroup of patients that met this high need, and this was the endpoint of cardiovascular death or heart failure hospitalization, time to first event, showed a hazard ratio of 0.79 also, exactly the same as the primary analysis. The confidence interval's a little wider and the P value's 0.07.
So let's think back to the Hochberg procedure. Do all of them meet 0.05? They don't; the first one did, the other two didn't. Do two of them meet 0.025? No, actually none of them meet 0.025, and none of them met 0.0167. So according to the primary analysis methodology, the controlling of the experiment-wide type one error rate, this trial is not significant. Statistical significance was not shown.
Wow. Again, a very interesting result, where time to cardiovascular death or heart failure hospitalization showed a nominally significant P value. For example, the confidence interval shown goes from 0.63 to 0.99; that's the 95% confidence interval.
So what is the conclusion in the trial? I'm reading this, I'm looking at the data, I'm looking at the results. The conclusions and relevance in the paper: in patients with heart failure and iron deficiency, iron supplement did not significantly reduce the time to first heart failure hospitalization or cardiovascular death in the overall cohort or in patients with transferrin saturation less than 20%, or reduce the total number of heart failure hospitalizations, versus placebo. And in the little figure they show that talks about the population, it's really very nice, the cartoon of the article, the conclusion says that iron supplement was well tolerated but did not significantly improve outcomes compared with placebo in patients with heart failure and iron deficiency.
Okay. So what does a statistician make of this? And by the way, this is just me as the statistician; I'm sure other statisticians have a very different reaction than me. So I read this and I really, really struggled with it from several points, so let me dive into the points.
The first part of this is the scientific struggle I have that we report trials as black and white. They're significant or they're not. And in this trial, the only reason it's not significant is because that overall cohort analysis of cardiovascular death and heart failure hospitalization was part of a multiple testing procedure. So it was incredibly close to being significant. But it wasn't, and the conclusion of this trial is that iron supplement doesn't help patients, doesn't change the clinical outcome of patients.
A Bayesian analysis, say, of that primary endpoint, assuming a non-informative prior, would say there's a 98% probability that iron supplement benefits time to cardiovascular death or heart failure hospitalization.
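As a sketch of where a number like that comes from, assuming a normal approximation to the log hazard ratio and a flat prior (my assumption, not an analysis from the paper), you can back the posterior probability of benefit out of the reported hazard ratio of 0.79 and its 95% confidence interval of 0.63 to 0.99.

```python
# Rough posterior probability that the true hazard ratio is below 1, using a
# normal approximation on the log hazard ratio and a flat (non-informative)
# prior. Inputs are the published point estimate and 95% CI; the approximation
# is a back-of-the-envelope sketch, not the paper's analysis.
import math
from scipy.stats import norm

hr, lo, hi = 0.79, 0.63, 0.99                      # reported HR and 95% CI
log_hr = math.log(hr)
se = (math.log(hi) - math.log(lo)) / (2 * 1.96)    # SE implied by the CI width

# With a flat prior, the posterior for log(HR) is approximately Normal(log_hr, se^2).
p_benefit = norm.cdf(0.0, loc=log_hr, scale=se)    # P(log HR < 0), i.e. P(HR < 1)
print(round(p_benefit, 3))                         # comes out around 0.98
```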
So any way you want to read
this, there's gray area to this.
Every trial typically has gray
area to this, but yet when we
publish it, it's all or nothing.
In this scenario, the conclusion is the same as if the data were identical in the two groups and the hazard ratio were one. Even if it showed harm, the conclusion would be the same, that the treatment doesn't benefit. Really, conclusions say one thing: the treatment benefited or it didn't. I think it's a gross simplification of a six-year trial of 1,100 patients, but I understand how we got there.
And by the way, statisticians share some blame in how we got there. We reinforce hypothesis testing and a type one error of 5%, and you can't say anything if you don't reject a null, and that's the way we analyze trials. This trial: I bet if you flip three deaths in this trial, it meets the Hochberg procedure. And our conclusion to clinicians reading this article is binary: either it doesn't benefit or it benefits. Now, we hope anybody reading this dives into the data, looks at the results, thinks about other trials, thinks about the treatment, and makes a decision based on it. But I have to believe the conclusion from JAMA makes a huge dent in any clinician reading this article.
So the black-and-whiteness of trials, I just feel like, scientifically, I really struggle with it, and I accept that as a member of the statistical community, we're probably partly to blame for this dogmatic approach to hypothesis testing. It's black and white, and we're gonna come down hard on you if you interpret it any other way than that.
Okay. Now, for those statisticians out there, I have nothing against the Hochberg procedure, and I know that we design trials, we do FDA trials, where a 5%, or 2.5% one-sided, test is the standard. And we live by that, and we do Hochberg procedures. So I don't have anything against that, but I really struggle with the likelihood principle aspect of this.
The data in this scenario are exactly the same as in another scenario, one where, say, that first analysis had been the lone primary endpoint tested at 0.05, in which the paper would say it's statistically significant and this iron supplement benefits patients with heart failure and iron deficiency. Exact same data set. And the likelihood principle says that if we have exactly the same data in two different scenarios, our conclusions should be the same. If you're a Bayesian, the posterior probabilities are identical for those two trials. The Bayesian machinery satisfies the likelihood principle by the nature of Bayes' theorem.
So I really struggle with this part of it, and I get it from a type one error standpoint. Being a Bayesian, I don't think type one error is the be-all and end-all, and it flips this result for clinicians reading it, and I really struggle with that.
Okay. Now, given the rules we play by, and they knew the rules they played by, they wrote them in this article, the SAP lays out superiority. They knew what this situation would mean. If somebody had simulated this trial and shown them this result, and by the way, that's a huge value of simulation, and they had said, good, that's the result we want, great. But I'd be really surprised if that's the result they wanted. In this trial, there are no adaptations.
Could this trial have been adaptive? Could we have seen that result coming? Could it have been bigger? Now, I recognize this was a six-year trial, and maybe with the funding of this it couldn't have been bigger and this is just the way it is, and then the investigators would say, yes, that's the result we want. But I think it's one of those scenarios where, had this trial been six months longer, had it enrolled 200 more patients, would it have changed clinical practice? Does this paper change clinical practice? Should it change clinical practice? Would it have changed clinical practice if the trial were six months longer, 200 patients bigger? That's a little bit of the struggle here.
And so I just bring up, I hate to say it, but 1,100 patients and six years, is it wasted, in a way, because we do black and white? Now, we shouldn't do black and white, and I'll talk more about what shades of gray would look like, but these are the rules that everybody plays by at this point. Journals play by this. We kind of know the rules going in. Could it have been adaptive?
The other struggle I have is that this is about science. This is about recommending treatments to patients who could potentially benefit. And I would think that if the truth of this is a hazard ratio of 0.8 on heart failure hospitalization and cardiovascular death, this is a clinically important treatment. But we only analyze data in the trial. We are stuck on that, by the way; I think that's reflective of frequentist approaches. We analyze the data in the trial. We calculate the probability that the data are as extreme or more extreme than what we saw, assuming the null. That's the P value, 0.04. But there's science here; we typically know more.
Part of my struggle is what's next. This trial, if you're into giving adjectives, is borderline significant. It's very close to being statistically significant. Do we need another trial that's six years and 1,100 patients, and get the P value of that next trial below 0.05, or below the Hochberg thresholds? Or would 200 patients do? Potentially. I said, suppose this trial were bigger by 200 patients or 300 patients, would it change clinical practice? But no, we can't do that. The next trial designed would only analyze that trial. There's something incredibly frustrating about that.
Now I want to give you a different potential scenario. Suppose this was a novel treatment. I assume that nobody owns the rights to this; nobody has patent life on an iron supplement, and this is all about treating patients. But suppose this was a novel treatment in heart failure, and a company ran this trial exactly like this, and they don't get significance.
And they go to a regulatory agency, they go to the EMA, they go to the PMDA, they go to the US FDA, and the agency says, you know, we just can't approve it; it's not enough based on that trial. Does that company need to run another trial of 1,500 patients? Can we say, look, there's information there, the next trial doesn't need to be as big, we're really close to approving this, but we just can't do it yet? And there are examples of this. Do we start over? It seems like bad science.
Now, the FDA is absolutely doing this. There are scenarios where they use the results of one trial combined with the results of another trial. You can look up Rebyota, approved in 2023, Ferring Pharmaceuticals. There are multiple devices that have been approved this way. There are multiple scenarios I know of where we're working with the agency, or have designed trials, where the design uses the previous results, recognizing that, boy, it's really close; we shouldn't need 1,500 patients after this for approval.
So I want you to think about the medical community in that scenario, where we're so focused on single trials: what about combining the results together? With all of this, my biggest issue as a statistician reading it is that I don't believe the conclusions. I think they're wrong. Now, maybe it's just me, but I think this treatment works, and let me give you a little bit of why.
So when I read the article, first of all, the statistics are compelling to me. That the treatment works is highly likely just based on the trial. If we're stuck on only this trial, a 98% probability for a really clinically important outcome is really valuable. In a scenario like this, this isn't about an FDA approval, which has its own regulatory standards, and I know we've gotta go by those; this is about the next patient that walks in the door. If it were me, I'd want the iron supplement. I think it works. It's highly likely to work.
But the other thing I thought is, okay, you know, we do get type one errors. We do get scenarios where we get a hazard ratio and a confidence interval like this and the treatment doesn't work. So, as a statistician, I wanna know what other information is out there and what we know about this.
Well, there have been previous large trials run, and actually you can't find much about them in the trial report. There's a little bit in the JAMA article, so I don't want to say there isn't any; it talks about the rationale and the remaining uncertainty about this. And so there is a trial, the HEART-FID trial, that looked at time to cardiovascular death or heart failure hospitalization.
All three of the trials I'm gonna tell you about that have already been published use that same endpoint, time to first cardiovascular death or heart failure hospitalization. And that trial showed a hazard ratio of 0.93 for that endpoint, with a confidence interval that went up to 1.06. So that probably had a P value something like, one-sided, 0.1; two-sided, you know, maybe 0.15, and maybe a one-sided 0.075. I didn't go find the article, but I found the summary of it.
So a high hazard ratio of 0.93. The IRONMAN trial, a great name for a trial of an iron supplement, had a hazard ratio of 0.84 for that same endpoint, where the upper bound of the confidence interval is 1.02. Borderline significant, but not; that trial did not demonstrate clinical benefit, but a hazard ratio of 0.84. So 0.93 and 0.84. And the AFFIRM-AHF trial had a hazard ratio of 0.80, with a confidence interval going up to 0.98. They also report heart failure hospitalizations for three of them, I'm sorry, for two of them: 0.80 and 0.74, the other primary endpoint in the FAIR-HF2 trial.
So walking into this trial, we've got three other trials that demonstrate 0.93, 0.84, and 0.80 for the hazard ratio on that primary endpoint. All positive: one of them significant, one of them borderline significant, and another with 1.06 for the upper bound. And now this trial: the hazard ratio for that endpoint is 0.79 with a confidence interval upper bound of 0.99.
Putting that information altogether, as the read of a statistician, this treatment works. We are sitting there with four trials, all analyzed separately.
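Just to show what combining the results could look like mechanically, here's a rough fixed-effect meta-analysis sketch on the log hazard ratio scale. It uses the hazard ratios and upper confidence bounds quoted in this episode to back out approximate standard errors; the normal approximation and the arithmetic are mine, not a reproduction of the published meta-analysis.

```python
# Back-of-the-envelope fixed-effect (inverse-variance) meta-analysis of the
# time-to-first-event hazard ratios quoted in the episode. Standard errors are
# approximated from each trial's reported HR and upper 95% bound; this is a
# sketch of the technique, not the published meta-analysis.
import math
from scipy.stats import norm

trials = {                      # name: (hazard ratio, upper 95% bound)
    "HEART-FID":  (0.93, 1.06),
    "IRONMAN":    (0.84, 1.02),
    "AFFIRM-AHF": (0.80, 0.98),
    "FAIR-HF2":   (0.79, 0.99),
}

weights, weighted_logs = [], []
for hr, hi in trials.values():
    log_hr = math.log(hr)
    se = (math.log(hi) - log_hr) / 1.96        # SE implied by the upper bound
    w = 1.0 / se ** 2
    weights.append(w)
    weighted_logs.append(w * log_hr)

pooled_log_hr = sum(weighted_logs) / sum(weights)
pooled_se = 1.0 / math.sqrt(sum(weights))
print(round(math.exp(pooled_log_hr), 2))                            # pooled hazard ratio
print(round(norm.cdf(0.0, loc=pooled_log_hr, scale=pooled_se), 4))  # rough P(HR < 1)
```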
There is a meta-analysis published.
Does it move clinicians?
So I don't know the answer to that.
Other people can tell me that.
But if I randomly pick up this article and I read it, it says this treatment doesn't work. And boy, JAMA is an incredible journal. I know multiple editors for JAMA; they do an incredible job with it. And I know exactly how we got here, and this is not unique; we see this commonly. I just struggle with it. I don't think the conclusion is right. I think it's actually highly likely the conclusion is wrong.
Do people read the conclusions? That's my struggle. So how could the world be different? You know, okay, Scott, so what would be different here? What can we propose as different ways to do this?
Well, let's suppose we didn't think of the trial as being this black and white, where we do a significance test and, if it's significant, we all wave flags and we all celebrate and we publish it and it changes clinical practice, and if it's not significant, it doesn't change clinical practice. What would be different?
What if the trial reported the posterior probability that the treatment is superior to the control, and it didn't put an adjective on it? It doesn't say significant, borderline significant, highly significant, three asterisks on it. We don't have to do that. 98% is the adjective, and it allows somebody to consume the data if they want to only look at that trial.
Okay? A 98% probability of superiority for the primary endpoint. By the way, that satisfies the likelihood principle in a scenario where we're otherwise overly stuck on type one error and whether it's significant. What if that's the report, and the little cartoon in the front of JAMA says this trial demonstrated a 98% probability that it benefits time to first cardiovascular death or heart failure hospitalization?
Now, the first thing a frequentist is gonna say is, well, yeah, but now you've got a prior for that. An important part of this, the important part, is where this sits in the science, and I think the article in the Journal of the American Medical Association shouldn't just report on that single trial.
The trial should prospectively define a relatively non-informative prior, so that the data speak for themselves; we know how to do that, and it's common that we do. Here that would give a 98% posterior probability. The trial prospectively defines a pessimistic prior. It prospectively defines an optimistic prior. This would be relatively easy to do.
It also specifies a prior based on the current summary of scientific information: it uses the meta-analysis that was published and says, based on the previous data, here's what you get. It might even have several of these, based only on trial one, or only on trial two, so if you don't like trial three and you like trial one, here are several priors, and somebody can use those products.
Here's the probability, and my guess in this scenario is that it's a 99.9% probability that this treatment is beneficial if you use that summary of the information, updated, which is what Bayesians do, based on this trial. It allows the reader to judge it on that, and it never says it statistically significantly affects the clinical outcome; it says there's a 99.3% probability this benefits, or that this trial demonstrated a 98% probability that it benefits. A pessimistic prior would be 94%. An optimistic prior is 99.3%. I'm making these numbers up, just guessing what they might be. And the prior based on a summary of the other trials together with this one, that's where I think this would be a 99.97% probability of benefit, something like that.
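To make the several-priors idea concrete, here's a small sketch that updates the FAIR-HF2 result under a flat prior, a pessimistic prior, an optimistic prior, and a previous-trials prior, all as normal distributions on the log hazard ratio. The specific prior means and standard deviations are invented for illustration; a real SAP would prespecify them.

```python
# Posterior probability of benefit under several priors, via conjugate
# normal-normal updating on the log hazard ratio. The likelihood comes from
# the FAIR-HF2 numbers quoted in the episode; every prior below is an invented
# example of the kind of prior a trial could prespecify, not from the paper.
import math
from scipy.stats import norm

like_mean = math.log(0.79)                                  # observed log HR
like_se = (math.log(0.99) - math.log(0.63)) / (2 * 1.96)    # SE implied by the 95% CI

priors = {                     # name: (prior mean on log HR, prior SD) -- assumptions
    "flat (non-informative)": (0.0, 10.0),
    "pessimistic":            (math.log(1.00), 0.15),
    "optimistic":             (math.log(0.85), 0.15),
    "previous trials":        (math.log(0.87), 0.06),   # stand-in for a meta-analytic prior
}

for name, (m0, s0) in priors.items():
    # precision-weighted combination of prior and likelihood
    post_prec = 1 / s0 ** 2 + 1 / like_se ** 2
    post_mean = (m0 / s0 ** 2 + like_mean / like_se ** 2) / post_prec
    post_sd = math.sqrt(1 / post_prec)
    print(f"{name}: P(HR < 1) = {norm.cdf(0.0, loc=post_mean, scale=post_sd):.3f}")
```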
As a statistician reading it, I'd be much more comfortable if, when I read this article, that's the advice it's providing to clinicians. Where right now, when I read that article, I really struggle, and boy, I hope readers look at the data and I hope that they consume all of this information. It's hard in this setting.
So this is my read of a random article that sort of stuck with me, the results of it, and it stuck with me largely as somebody who has spent 25 years in clinical trials doing publicly funded trials, privately funded sponsor trials, NIH-funded trials, patient-organization-funded trials, comparative effectiveness trials. You know, I struggle with the scientific outcome of this.
So we are not in the interim here; we are at the end of the trial. Thinking of it, maybe this could have been in the interim, and the trial could have been adaptive in the world we live in. But also think about the world we live in: could we do things differently as we move forward in this and we get more and more results? Could this look different?
So I am Scott Berry in the interim,
and until the next interim, thanks.