I have tried to show the time course of the bioequivalence assessment. Regrettably, its development proved to be a convoluted process. Some approaches were proposed and subsequently abandoned, only to reemerge decades later in a modified form. Others arose from gut feelings devoid of any scientific foundation, yet they remain in use.
I apologize for the dry style. This is due to my professional background, which has left me less skilled at crafting engaging narratives.
I have to confess that »Short« in the title is a euphemism…
‘Bioavailability’ (a portmanteau of ‘biologic availability’) in its current meaning was coined in 19711 and ‘Bioequivalence’ saw the light of day in 1975.2
The MeSH term ‘Biological Availability’ was introduced in 1979.
“The extent to which the active ingredient of a drug dosage form becomes available at the site of drug action or in a biological medium believed to reflect accessibility to a site of action.
The site of action (i.e., a receptor) is practically always inaccessible. There is no room for beliefs in science.
The main assumption in Bioequivalence (BE) was – and still is – that ‘similar’ concentrations in the systemic circulation of healthy volunteers will lead to similar concentrations at the target site (i.e., a receptor) and thus, to similar effects in patients.
The best definition of BE is given by the International Council for Harmonisation of Technical Requirements for Pharmaceuticals for Human Use (ICH).3
“Two drug products containing the same drug substance(s) are considered bioequivalent if their relative bioavailability (BA) (rate and extent of drug absorption) after administration in the same molar dose lies within acceptable predefined limits. These limits are set to ensure comparable in vivo performance, i.e., similarity in terms of safety and efficacy.
Throughout the article we will use data of a study in a two-treatment two-sequence two-period (2×2×2) crossover design as an example. \[\small{\begin{array}{cccc} \textsf{Table I}\phantom{0}\\ \text{subject} & \text{sequence} & \text{T} & \text{R}\\\hline \phantom{1}1 & \text{RT} & 71 & 81\\ \phantom{1}2 & \text{TR} & 61 & 65\\ \phantom{1}3 & \text{RT} & 80 & 94\\ \phantom{1}4 & \text{TR} & 66 & 74\\ \phantom{1}5 & \text{TR} & 94 & 54\\ \phantom{1}6 & \text{RT} & 97 & 63\\ \phantom{1}7 & \text{RT} & 70 & 85\\ \phantom{1}8 & \text{TR} & 76 & 90\\ \phantom{1}9 & \text{TR} & 54 & 53\\ 10 & \text{RT} & 99 & 56\\ 11 & \text{RT} & 83 & 90\\ 12 & \text{TR} & 51 & 68\\\hline \end{array}}\]
Problems were reported with formulations of Narrow Therapeutic Index Drugs (NTIDs) like phenytoin,4 5 6 7 digoxin,1 8 9 10 11 12 warfarin,13 theophylline,14 primidone.15 Some show nonlinear pharmacokinetics (phenytoin) or are auto-inducers (warfarin).
Generic drugs in the current sense did not yet exist at that time; only the content had to meet the USP requirements.
“Although in 1969 Professor John Wagner demonstrated to the Bureau of Medicine, methods for comparing areas under the serum versus time curve (AUC) to estimate bioequivalence, his approach was ignored inasmuch as the FDA hierarchy did not believe a problem existed, and therefore such studies would not be necessary. For their part the Offices of Pharmaceutical Research and Compliance in the Bureau of Medicine and the Commissioner’s Office believed that the “Bioavailability Problem” as some called it was a “Content Uniformity Problem”.16 In 1971 for example, when notified of a “Bioavailability Problem” with a generic digoxin product, FDA investigated and ascertained that one manufacturer first added all the excipients into a 55-gal drum, then added digoxin, closed the lid, and mixed it by rolling the drum across the floor a few times. The content uniformity of those tablets varied from 10% to 156%.
Following a ‘Conference on Bioavailability of Drugs’ held at the National Academy of Sciences of the United States in 1971, a guideline was published the following year.18
“[…] the mean of AUC of the generic had to be within 20% of the mean AUC of the approved product. At first this was determined by using serum versus time plots on specially weighted paper, cutting the plot out and then weighing each separately.
Methods and procedures for in vivo testing to determine bioavailability (BA) for new drugs were proposed by the FDA on June 20, 1975. Several terms were defined:19
The “site of drug action” was questioned but kept in the regulation of 1977 and has been used by the FDA ever since.20
“[A] comment also recommended that the phrase “becomes available to the site of drug action” be deleted since it is overly optimistic to presume that bioavailability data consisting of estimates of parent drug […] concentration in body fluids […] provides, as a general rule, an estimate of the availability of the therapeutic moiety at the site of drug action.
The Commissioner agrees that bioavailability data alone do not estimate the availability of the therapeutic moiety at the site of drug action. It is scientifically valid to assume, however, that if an active drug ingredient or therapeutic moiety reaches a reasonable extent of systemic circulation at a reasonable rate, the therapeutic moiety will also become available at the site of drug action […]. For this reason, the Commissioner concludes that reference to availability at site of drug action should not be deleted. He also believes that omission of such a reference would incorrectly focus the definition of bioavailability exclusively on absorption of the active drug ingredient or therapeutic moiety from the drug product. Even where such absorption is total, the product may not be bioavailable because an insufficient amount of the active drug ingredient or therapeutic moiety reaches the systemic circulation. In certain instances, e.g., high first-pass metabolism in the liver or rapid renal clearance, the active drug ingredient or therapeutic moiety must be absorbed at a rate sufficient to overcome the metabolic or elimination mechanism and reach the systemic circulation so that the therapeutic moiety will become available at the site of drug action in sufficient amounts to elicit the intended therapeutic effect.
The FDA’s
80/20 Rule or ‘Power Approach’ (at least 80% power to detect a 20%
difference) of 1972 consisted of testing the hypothesis of no difference
at the \(\small{\alpha=}\) \(\small{0.05}\) level of
significance.17 21 \[H_0:\;\mu_\text{T}-\mu_\text{R}=0\;vs\;H_1:\;\mu_\text{T}-\mu_\text{R}\neq
0,\tag{1}\] where \(\small{H_0}\) is the null
hypothesis of equivalence and \(\small{H_1}\) the alternative
hypothesis of inequivalence. \(\small{\mu_\text{T}}\) and \(\small{\mu_\text{R}}\) are the (true) means
of \(\small{\text{T}}\) and \(\small{\text{R}}\), respectively. In order
to pass the test, the estimated (post hoc, a
posteriori, retrospective) power had to be at least 80%. The power
depends on the true value of \(\small{\sigma}\), which is unknown. There
exists a value of \(\small{\sigma_{\,0.80}}\) such that if \(\small{\sigma\leq\sigma_{\,0.80}}\),
the power of the test of no difference \(\small{H_0}\) is greater than or equal to 0.80.
Since \(\small{\sigma}\) is unknown, it
has to be approximated by the sample standard deviation \(\small{s}\). The Power Approach in a simple
2×2×2 crossover design then consists of rejecting \(\small{H_0}\) and concluding that \({\small{\mu_\text{T}}}\) and \({\small{\mu_\text{R}}}\) are equivalent if
\[-t_{1-\alpha/2,\nu}\leq\frac{\bar{x}_\text{T}-\bar{x}_\text{R}}{s\sqrt{\tfrac{1}{2}\left(\tfrac{1}{n_1}+\tfrac{1}{n_2}\right)}}\leq t_{1-\alpha/2,\nu}\;\text{and}\;s\leq\sigma_{\,0.80},\tag{2}\]
where \(\small{n_1,\,n_2}\) are the
numbers of subjects in sequences 1 and 2, \(\small{\nu=n_1+n_2-2}\) are the degrees of freedom, and \(\small{\bar{x}_\text{T},\bar{x}_\text{R}}\)
are the means of \(\small{\text{T}}\)
and \(\small{\text{R}}\), respectively.
Note that this procedure is based on estimated power \(\small{\widehat{\pi}}\), since the
true power is a function of the unknown \(\small{\sigma}\). It was the only approach
based on post hoc power and
was never implemented in any other jurisdiction.
For the example we estimate a power of only 46.4% to detect a 20% difference and the study would fail.
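The estimated power can be approximated from the residual error of the additive model. The following is only a minimal sketch: it assumes that the 20% difference is taken relative to the observed mean of R and that power is obtained from the noncentral t-distribution. The exact computation behind the value quoted above is not shown in this article, so the result need not agree to the last digit.

```r
# Example data of Table I in long format
n <- 12L
example <- data.frame(subject   = rep(1L:n, each = 2),
                      treatment = c("R", "T", "T", "R", "R", "T", "T", "R",
                                    "T", "R", "R", "T", "R", "T", "T", "R",
                                    "T", "R", "R", "T", "R", "T", "T", "R"),
                      period    = rep(1L:2L, n),
                      Y         = c(81, 71, 61, 65, 94, 80, 66, 74,
                                    94, 54, 63, 97, 85, 70, 76, 90,
                                    54, 53, 56, 99, 90, 83, 51, 68))
facs <- c("subject", "period", "treatment")
example[facs] <- lapply(example[facs], factor)
# additive model on untransformed data
fit   <- lm(Y ~ subject + period + treatment, data = example)
SE    <- summary(fit)$coefficients["treatmentT", "Std. Error"]
nu    <- df.residual(fit)              # residual degrees of freedom
X.R   <- mean(example$Y[example$treatment == "R"])
delta <- 0.20 * X.R                    # assumption: 20% of the mean of R
alpha <- 0.05
tcrit <- qt(1 - alpha / 2, df = nu)
ncp   <- delta / SE                    # noncentrality parameter
# power of the two-sided test of no difference to detect delta
power.hat <- pt(tcrit, df = nu, ncp = ncp, lower.tail = FALSE) +
             pt(-tcrit, df = nu, ncp = ncp)
cat(sprintf("estimated power = %.1f%% (80/20 Rule: %s)\n",
            100 * power.hat, ifelse(power.hat >= 0.80, "pass", "fail")))
```

Whatever the exact convention, the estimate is far below 80%, which is why the study fails under this rule.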
The biostatistical community published alternative proposals:22 23 24 25
The analysis was performed on untransformed (raw) data (i.e., by an additive model assuming normal distributions) and BE was concluded if the 95% confidence interval (CI) of the point estimate (PE) lay entirely within 80 – 120%.22 25
If data are analyzed by an additive model, the results are differences and not ratios.
It is a fundamental error to naïvely transform
results to percentages (i.e., dividing by the reference’s
mean). A correct assessment would require Fieller’s
CI for the ratio of normally
distributed data.26 27 However, Locke’s paper27 was ignored back in the day.
We get for our example in R:
n <- 12L
example <- data.frame(subject = rep(1L:n, each = 2),
treatment = c("R", "T", "T", "R", "R", "T", "T", "R",
"T", "R", "R", "T", "R", "T", "T", "R",
"T", "R", "R", "T", "R", "T", "T", "R"),
period = rep(1L:2L, n),
Y = c(81, 71, 61, 65, 94, 80, 66, 74,
94, 54, 63, 97, 85, 70, 76, 90,
54, 53, 56, 99, 90, 83, 51, 68))
facs <- c("subject", "period", "treatment")
example[facs] <- lapply(example[facs], factor) # factorize the data
# additive model (untransformed data, differences); sequence not in the model!
muddle <- lm(Y ~ subject + period + treatment, data = example)
CI <- as.numeric(confint(muddle, level = 0.95)["treatmentT", ])
PE <- coef(muddle)[["treatmentT"]]
# percentages (flawed!)
X.T <- mean(example$Y[example$treatment == "T"])
X.R <- mean(example$Y[example$treatment == "R"])
PE.pct <- 100 * X.T / X.R
CI.pct <- 100 * (CI + X.R) / X.R
# Fieller’s CI (ratio of normal distributed data)
s2.TT <- var(example$Y[example$treatment == "T"])
s2.RR <- var(example$Y[example$treatment == "R"])
s2.TR <- cov(example$Y[example$treatment == "T"],
example$Y[example$treatment == "R"])
pe <- X.T / X.R # same like in the additive model
t <- qt(p = 0.025, df = n - 1, lower.tail = FALSE)
G <- t^2 * s2.RR / (n * X.R^2)
K <- pe^2 + s2.TT / s2.RR * (1 - G) +
s2.TR / s2.RR * (G * s2.TR / s2.RR - 2 * pe)
ci <- setNames(c(100 * ((pe - G * s2.TR / s2.RR) + c(-1, 1) * t / X.R *
sqrt(s2.RR / n * K)) / (1 - G)),
c("lower", "upper"))
result <- data.frame(method = c("differences", "percentages", "Fieller"),
PE = c(sprintf("%+.3f", PE),
sprintf("%6.2f%%", PE.pct),
sprintf("%6.2f%%", 100 * pe)),
lower = c(sprintf("%+.3f", CI[1]),
sprintf("%.2f%%", CI.pct[1]),
sprintf("%.2f%%", ci[["lower"]])),
upper = c(sprintf("%+.3f", CI[2]),
sprintf("%6.2f%%", CI.pct[2]),
sprintf("%.2f%%", ci[["upper"]])),
BE = c("? ", rep("fail", 2)))
if (CI.pct[1] >= 80 & CI.pct[2] <= 120) result$BE[2] <- "pass"
if (ci[["lower"]] >= 80 & ci[["upper"]] <= 120) result$BE[3] <- "pass"
names(result)[3:4] <- c("lower CL", "upper CL")
print(result, row.names = FALSE)
# method PE lower CL upper CL BE
# differences +2.417 -12.777 +17.611 ?
# percentages 103.32% 82.44% 124.21% fail
# Fieller 103.32% 84.84% 125.65% fail
With the naïve transformation we get a 95% CI of 82.44 – 124.21%, and the study would fail because the upper confidence limit (CL) is > 120%. Nevertheless, with the correct method the study would fail as well.
Westlake23 mused that the shortest CI – which is symmetrical about the PE – would be too difficult for non-statisticians to comprehend. He suggested splitting the t-values in such a way that the probabilities of the two tails sum to \(\small{\alpha}\) and the respective CI is symmetrical around 0 (or 100%). In the example we obtain ±21.80%, and the study would fail as well because the confidence limits lie outside ±20%. As above, calculating a percentage is flawed.
However, such a result is misleading. The information about the location of the difference is lost; one cannot know any more whether the average BA of \(\small{\text{T}}\) is lower or higher than the one of \(\small{\text{R}}\). Therefore, the method was criticized24 and never implemented in practice. It took me years to convince Certara to remove Westlake’s CI from the results in Phoenix WinNonlin. In 2016, I was successful with version 6.4… Since then the differences are given in the additive model.
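For the curious, Westlake's symmetrical interval can be sketched from the summary statistics of the additive model. This is my reading of the method and not a reference implementation: the two t-quantiles are chosen by a root search (`uniroot()`) such that the limits are symmetrical around zero while the coverage stays at 95%. The naïve conversion to percentages (flawed, as said above) is only done for comparison with the ±20% limits.

```r
# Summary statistics of the additive model (output of the example above)
PE    <- 2.417                   # point estimate T - R
CL    <- c(-12.777, 17.611)      # conventional 95% confidence limits
nu    <- 10                      # residual degrees of freedom
X.R   <- 72.75                   # arithmetic mean of R (from Table I)
alpha <- 0.05
SE    <- diff(CL) / (2 * qt(1 - alpha / 2, df = nu))
shift <- 2 * PE / SE             # difference of the two t-quantiles
# limits: PE - (b + shift) * SE and PE + b * SE, i.e., -hw and +hw;
# find b such that the coverage P(-b < t < b + shift) equals 1 - alpha
f <- function(b) pt(b + shift, df = nu) - pt(-b, df = nu) - (1 - alpha)
b <- uniroot(f, interval = c(0, 10), tol = 1e-9)$root
hw <- PE + b * SE                # half-width of the symmetrical CI
cat(sprintf("Westlake's CI: +/-%.2f (+/-%.2f%% of the mean of R)\n",
            hw, 100 * hw / X.R))
```

Note that the sign of the point estimate no longer shows up in the result, which is exactly the loss of information criticized above.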
The ‘Approved Drug Products with Therapeutic Equivalence Evaluations’ was published and is annually updated28 with monthly supplements.29 The nickname “Orange Book” relates to the color commonly associated with Halloween, which is the date of the publication’s finalization – October 31, 1980 – and to the book’s orange-colored cover. It gives information about the originator’s approval (with a ‘New Drug Application’ – NDA), as well as which originator’s product and strength (called Reference Listed Drug – RLD) has to be used in studies of generics in an ‘Abbreviated New Drug Application’ – ANDA. Generic prescription drugs are coded as follows:
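A codes: products the FDA considers therapeutically equivalent, designated AA, AN, AO, AP, or AT, depending on the dosage form; or AB if actual or potential bioequivalence problems have been resolved with adequate evidence.
B codes: products the FDA does not, at this time, consider therapeutically equivalent, designated BC, BD, BE, BN, BP, BR, BS, BT, BX, or B*.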
See also information about the ‘Electronic Orange Book’ below.
The generic boom started in 1984 in the U.S. with the ‘Drug Price Competition and Patent Term Restoration Act’ (informally known as ‘Hatch-Waxman Act’).30
The approval process was different for innovator (originator) and generic companies.
Innovators:
There was an early agreement that pharmaceutical equivalence (content, in vitro) is too permissive and therapeutic equivalence (like in phase III) would require extremely large studies in patients.31 Hence, comparing BA in healthy volunteers seemed to be a reasonable compromise.32
“What is the justification for studying bioequivalence in healthy volunteers?
“Variability is the enemy of therapeutics” and is also the enemy of bioequivalence. We are trying to determine if two dosage forms of the same drug behave similarly. Therefore we want to keep any other variability not due to the dosage forms at a minimum. We choose the least variable “test tube”, that is, a healthy volunteer.
Disease states can definitely change bioavailability, but we are testing for bioequivalence, not bioavailability.
Whereas in PK ‘bioavailability’ exclusively means the Area Under the Curve extrapolated to infinite time \(\small{(AUC_{0-\infty})}\), in 1975 the FDA introduced two new terms, namely the rate of BA (\(\small{C_\text{max}}\)) and the extent of BA (\(\small{AUC}\)).
The former is understood as a surrogate for the absorption rate \(\small{k\,_\text{a}}\) in a PK model. I prefer – like the ICH3 and the FDA since 200333 – rate and extent of absorption, in order not to contaminate the original meaning of BA in PK. Whereas the FDA and China’s CDE require for single dose studies \(\small{AUC_{0-\text{t}}}\) and \(\small{AUC_{0-\infty}}\), in all other jurisdictions only \(\small{AUC_{0-\text{t}}}\) is required.
Let us consider the basic equation of pharmacokinetics \[\eqalign{ \frac{f\cdot D}{CL}&=\frac{f\cdot D}{V\cdot k_\text{el}}=\\ AUC_{0-\infty}&=\int_{0}^{\infty}C(t)\,dt\textsf{,}\tag{3}}\]
where \(\small{f}\) is the fraction
absorbed (we are interested in the comparison of formulations), \(\small{D}\) is the dose, \(\small{CL}\) is the clearance, \(\small{V}\) is the apparent volume of
distribution, \(\small{k\,_\text{el}}\)
is the elimination rate constant, and \(\small{C(t)}\) is the plasma concentration
over time. We see immediately that for identical34 doses and
invariant35 \(\small{CL}\), \(\small{V}\), \(\small{k\,_\text{el}}\) (which are
drug-specific), comparing the \(\small{AUC}\text{s}\) allows us to compare the
fractions absorbed.
Note that the top row of \(\small{(3)}\) holds for a one-compartment
model. Nevertheless, the bottom row is universally valid, i.e.,
for any number of compartments, and absorption (\(\small{k\,_\text{a}}\), a possible lag time)
is irrelevant.
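A quick numerical check of \(\small{(3)}\) may help. The sketch below integrates the concentration-time curve of a one-compartment model for a fast and a slow formulation and compares both with \(\small{f\cdot D/CL}\). All parameter values are hypothetical and chosen only for illustration.

```r
# Hypothetical parameters of a one-compartment model (not from any study)
f   <- 0.8      # fraction absorbed
D   <- 100      # dose
V   <- 50       # apparent volume of distribution
kel <- 0.15     # elimination rate constant (1/h)
ka  <- 0.74     # absorption rate constant (1/h)
CL  <- V * kel  # clearance
# plasma concentration after a single oral dose
C <- function(t, ka) {
  f * D * ka / (V * (ka - kel)) * (exp(-kel * t) - exp(-ka * t))
}
AUC.model <- f * D / CL                                          # top row of (3)
AUC.fast  <- integrate(C, lower = 0, upper = Inf, ka = ka)$value # bottom row
AUC.slow  <- integrate(C, lower = 0, upper = Inf, ka = ka / 4)$value
print(round(c(model = AUC.model, fast = AUC.fast, slow = AUC.slow), 4))
```

Within numerical accuracy the three values agree: a four times slower absorption changes the shape of the profile but not its area.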
“Pharmacokinetics: one of the magic arts of divination whereby needles are stuck into dummies in an attempt to predict profits.
It must be mentioned that \(\small{C_\text{max}}\) is not sensitive to even substantial changes in the rate of absorption \(\small{k\,_\text{a}}\), since it is a composite metric.36 In a one compartment model it depends on \(\small{k\,_\text{a}}\), \(\small{f}\) and both the elimination rate constant \(\small{k\,_\text{el}}\) and \(\small{V}\) (or \(\small{CL}\) if you belong to the other church). Whereas \(\small{k\,_\text{a}}\) and \(\small{f}\) are properties of the formulation – we are interested in – the others are properties of the drug.37 \[\eqalign{ t_\textrm{max}&=\frac{\log_{e}(k\,_\text{a}/k\,_\text{el})}{k\,_\text{a}-k\,_\text{el}}\\ C_\textrm{max}&=\frac{f\cdot D\cdot k\,_\text{a}}{V\cdot (k\,_\text{a}-k\,_\text{el})}\large(\small\exp(-k\,_\text{el}\cdot t_\textrm{max})-\exp(-k\,_\text{a}\cdot t_\textrm{max})\large)\tag{4}}\] Therefore, when using it as a surrogate for the absorption rate one must keep in mind that formulations with different fractions absorbed and absorption rate constants will show a T/R-ratio of \(\small{C_\text{max}}\) which differs from the one of \(\small{AUC}\) (which is independent from \(\small{k\,_\text{a}}\) and thus, unbiased with regard to \(\small{f}\)).
An assessment of \(\small{t_\text{max}}\) would not necessarily help; in this example it is 2.71 h for the formulation with \(\small{k\,_\text{a}=}\) \(\small{0.74}\) / h and 2.78 h for the one with \(\small{k\,_\text{a}=}\) \(\small{0.71}\) / h. A difference of only four minutes cannot be detected with common sampling schedules.
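The values can be checked with \(\small{(4)}\). In the sketch below only the two absorption rate constants are taken from the text. The elimination rate constant of 0.15 / h is my assumption (it gives values of \(\small{t_\text{max}}\) close to the ones quoted), and \(\small{f}\), \(\small{D}\), \(\small{V}\) are arbitrary because they cancel in the ratio of \(\small{C_\text{max}}\).

```r
# Eq. (4) for a one-compartment model
tmax <- function(ka, kel) log(ka / kel) / (ka - kel)
Cmax <- function(ka, kel, f = 1, D = 100, V = 50) { # f, D, V: arbitrary
  tm <- tmax(ka, kel)
  f * D * ka / (V * (ka - kel)) * (exp(-kel * tm) - exp(-ka * tm))
}
kel <- 0.15               # assumed elimination rate constant (1/h)
ka  <- c(0.74, 0.71)      # absorption rate constants from the text (1/h)
res <- data.frame(ka = ka, tmax = tmax(ka, kel), Cmax = Cmax(ka, kel))
print(round(res, 3), row.names = FALSE)
cat(sprintf("difference in tmax: %.1f minutes\n", 60 * abs(diff(res$tmax))))
cat(sprintf("ratio of Cmax     : %.2f%%\n", 100 * res$Cmax[2] / res$Cmax[1]))
```

Under these assumptions a roughly 4% difference in \(\small{k\,_\text{a}}\) translates to a difference of only about 1% in \(\small{C_\text{max}}\), which illustrates how insensitive the metric is.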
It took ten years before the alternative metric \(\small{C_\text{max}/AUC}\) (based on theoretical considerations and simulations) was proposed.38 39 40 Apart from being less biased than \(\small{C_\text{max}}\), it is also substantially less variable. Regrettably, it was never implemented in any guideline.
In the early 1980s originators failed in trying to falsify the concept (i.e., comparing BE in healthy volunteers to large therapeutic equivalence (TE) studies in patients): If BE passed, TE passed as well and vice versa. Had they succeeded (BE passed while TE failed), generic companies would have had to demonstrate TE in order to get products approved. Such studies would have had to be much larger than the originators’ phase III studies, making them economically infeasible.31 Essentially, that would have meant an early end of the young generic industry.
However, comparative BA is also used by originators in scaling up formulations used in phase III to the to-be-marketed formulation, in supporting post-approval changes, in line extensions of approved products, and in testing for drug-drug interactions or food effects. Hence, a substantial part of BE trials is performed by originators. Had they succeeded in refuting the concept, they would have shot themselves in the foot.
In the mid 1980s a consensus was reached, i.e., that generic approval should only be acceptable after suitable in vivo equivalence. It must be mentioned that BE relies on current Good Manufacturing Practices (cGMP). If drugs are not manufactured according to cGMP, the entire concept would collapse.
It was an open issue whether BE
should be interpreted as a surrogate of clinical efficacy / safety or a
measure of pharmaceutical quality. Whereas in the 1980s the former was
prevalent, since the 1990s the latter is mainstream.
A somewhat naïve interpretation of the
PK metrics is that \(\small{AUC}\) directly translates to
efficacy and \(\small{C_\text{max}}\)
to safety. Especially the latter is not correct because any difference
in \(\small{C_\text{max}}\) leads to a
relatively smaller difference in the maximum effect \(\small{E_\text{max}}\).
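This dampening can be illustrated with a simple hyperbolic concentration-effect relationship. The model and all of its parameter values are hypothetical; the sketch only shows that a 25% higher \(\small{C_\text{max}}\) gives less than a 25% higher effect at every exposure level.

```r
# Hyperbolic PD model; E.cap (asymptotic effect) and EC50 are hypothetical
effect <- function(C, E.cap = 100, EC50 = 10) E.cap * C / (EC50 + C)
Cmax.R <- c(5, 10, 20, 50)      # Cmax of R at different exposure levels
Cmax.T <- 1.25 * Cmax.R         # T with a 25% higher Cmax
res <- data.frame(Cmax.R, Cmax.T,
                  ratio.C = 100 * Cmax.T / Cmax.R,                 # always 125%
                  ratio.E = 100 * effect(Cmax.T) / effect(Cmax.R)) # always < 125%
print(round(res, 2), row.names = FALSE)
```

The closer the concentrations are to the plateau of the curve, the smaller the difference in the effect becomes.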
There was no consensus about the definition of ‘similarity’ and the statistical methodology to compare plasma profiles. Two early methods are outlined in the following.
The 75/75 Rule was an approach employed by the FDA. Two drugs were considered bioequivalent if at least 75% of subjects showed \(\small{\text{T}/\text{R}\textsf{-}}\)ratios within 75 – 125%.17 43 44 It is not a statistic and, thus, was immediately criticized because variable formulations or studies with some extreme values may pass the criterion by pure chance.45
We get for our example in R:
example <- data.frame(subject = rep(1:12, each = 2),
treatment = c("R", "T", "T", "R", "R", "T", "T", "R",
"T", "R", "R", "T", "R", "T", "T", "R",
"T", "R", "R", "T", "R", "T", "T", "R"),
Y = c(81, 71, 61, 65, 94, 80, 66, 74,
94, 54, 63, 97, 85, 70, 76, 90,
54, 53, 56, 99, 90, 83, 51, 68))
rule.75.75 <- reshape(example, idvar = "subject", timevar = "treatment",
direction = "wide")
rule.75.75 <- rule.75.75[c("subject", "Y.T", "Y.R")]
names(rule.75.75)[2:3] <- c("T", "R")
rule.75.75$T.R <- 100 * (rule.75.75$T / rule.75.75$R)
for (i in 1:nrow(rule.75.75)) {
  if (rule.75.75$T.R[i] >= 75 & rule.75.75$T.R[i] <= 125) {
    rule.75.75$BE[i] <- TRUE
    rule.75.75$within[i] <- "yes"
  } else {
    rule.75.75$BE[i] <- FALSE
    rule.75.75$within[i] <- "no"
  }
}
names(rule.75.75)[c(4, 6)] <- c("T/R (%)", "±25%")
if (sum(rule.75.75$BE) / nrow(rule.75.75) >= 0.75) {
  BE <- "Passed BE by the"
} else {
  BE <- "Failed BE by the"
}
print(rule.75.75[, c(1:4, 6)], row.names = FALSE); cat(BE, "75/75 Rule.\n")
# subject T R T/R (%) ±25%
# 1 71 81 87.65432 yes
# 2 61 65 93.84615 yes
# 3 80 94 85.10638 yes
# 4 66 74 89.18919 yes
# 5 94 54 174.07407 no
# 6 97 63 153.96825 no
# 7 70 85 82.35294 yes
# 8 76 90 84.44444 yes
# 9 54 53 101.88679 yes
# 10 99 56 176.78571 no
# 11 83 90 92.22222 yes
# 12 51 68 75.00000 yes
# Passed BE by the 75/75 Rule.
Nine of the twelve subjects (75%) have a \(\small{\text{T}/\text{R}\textsf{-}}\)ratio within 75 – 125% and the study would pass, despite the three subjects with high \(\small{\text{T}/\text{R}\textsf{-}}\)ratios.
Another suggestion was testing for a statistically significant difference at level \(\small{\alpha=0.05}\) with a t-test. The null hypothesis was that formulations are equal, i.e., \(\small{\mu_\text{T}-\mu_\text{R}=0}\).
Let’s assess our example in R again:
example <- data.frame(subject = rep(1:12, each = 2),
treatment = c("R", "T", "T", "R", "R", "T", "T", "R",
"T", "R", "R", "T", "R", "T", "T", "R",
"T", "R", "R", "T", "R", "T", "T", "R"),
Y = c(81, 71, 61, 65, 94, 80, 66, 74,
94, 54, 63, 97, 85, 70, 76, 90,
54, 53, 56, 99, 90, 83, 51, 68))
tt <- reshape(example, idvar = "subject", timevar = "treatment",
direction = "wide")
tt <- tt[c("subject", "Y.T", "Y.R")] # change for clarity
tt$T.R <- tt[, 2] - tt[, 3] # difference
names(tt)[2:4] <- c("T", "R", "T–R") # cosmetics
tt[, 4] <- sprintf("%+0.f", tt[, 4])
p <- t.test(x = tt$T, y = tt$R, paired = TRUE)$p.value
if (p >= 0.05) {BE <- "Passed BE" } else {BE<- "Failed BE" }
print(tt, row.names = FALSE); cat(sprintf("%s by a paired t-test (p = %.4f).\n", BE, p))
# subject T R T–R
# 1 71 81 -10
# 2 61 65 -4
# 3 80 94 -14
# 4 66 74 -8
# 5 94 54 +40
# 6 97 63 +34
# 7 70 85 -15
# 8 76 90 -14
# 9 54 53 +1
# 10 99 56 +43
# 11 83 90 -7
# 12 51 68 -17
# Passed BE by a paired t-test (p = 0.7193).
We calculate a \(\small{p}\)-value of 0.7193, which is statistically not significant and the study would pass again.
However, we face a problem similar to that of the 75/75 Rule. If the differences show high variability, the study would pass. On the other hand, if there is low variability in the differences, the study would fail. This is counterintuitive and actually the opposite of what regulators want.
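The paradox is easily demonstrated. The two sets of paired differences below are made up for illustration; they share the same mean of –5 but differ in their variability.

```r
# Hypothetical paired differences (T - R) of twelve subjects each
d.low  <- c(-6, -4, -5, -7, -3, -5, -6, -4, -5, -5, -4, -6)       # low variability
d.high <- c(-45, 35, -25, 30, -40, 20, -35, 25, -30, 15, -20, 10) # high variability
verdict <- function(d, alpha = 0.05) {
  # a one-sample t-test of the differences is the paired t-test
  p <- t.test(d)$p.value
  sprintf("mean = %+.1f, SD = %5.2f, p = %.4f: %s BE", mean(d), sd(d), p,
          ifelse(p >= alpha, "passed", "failed"))
}
cat("low variability :", verdict(d.low), "\n")
cat("high variability:", verdict(d.high), "\n")
```

The well-behaved formulation with consistently small differences fails, whereas the erratic one passes.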
One of my early sins46 – it was not the last…
After phenytoin intoxications in Austria47 we compared three
generics (containing the free acid like the originator, the sodium- or
calcium-salt) to the reference in a crossover study. All formulations
had been approved and were marketed in Austria. Although at that time I
already calculated a 95% CI,
the reviewers of our manuscript insisted on testing for a significant
difference »because it is state of the art«.
The \(\small{AUC}\)s of two generics
were statistically significantly different from the reference (\(\small{\text{T}_1}\) containing the free
acid like the originator and \(\small{\text{T}_3}\) containing the
Ca-salt). \(\small{\text{T}_2}\)
containing the Na-salt was not statistically significantly different and,
thus, considered equivalent – despite its high \(\small{\text{T}/\text{R}\textsf{-}}\)ratio
(Table II). \[\small{\begin{array}{cccc}
\textsf{Table II}\phantom{00000}\\
\text{formulation} & \text{T}/\text{R (%)} & p & \\\hline
\text{T}_1 & 146.65 & 0.0195\phantom{6} & \text{*}\\
\text{T}_2 & 133.67 & 0.151\phantom{96} & \text{n.s.}\\
\text{T}_3 & \phantom{1}27.97 & 0.00596 & \text{**}\\\hline
\end{array}}\] If we evaluate the study according to
current standards (i.e., by the 90%
CI inclusion approach based on \(\small{\log_{e}\textsf{-}}\)transformed
data and acceptance limits of 80.00 – 125.00%), all generics would fail.
\(\small{\text{T}_3}\) would even be
bioinequivalent because its upper
CL is way below 80% (Table III).
\[\small{\begin{array}{ccccc}
\textsf{Table III}\phantom{0000}\\
\text{formulation} & \text{PE (%)} & \text{CL}_\text{lower}\text{ (%)} & \text{CL}_\text{upper}\text{ (%)} & \text{BE}\\\hline
\text{T}_1 & 151.12 & 118.75 & 192.32 & \text{fail (inconclusive)}\\
\text{T}_2 & 139.39 & \phantom{1}95.91 & 202.60 & \text{fail (inconclusive)}\\
\text{T}_3 & \phantom{1}21.67 & \phantom{1}10.25 & \phantom{2}45.81 & \text{fail (inequivalent)}\\\hline
\end{array}}\] Given the nonlinear
PK of phenytoin,48 49 switching a patient
from the originator to the generics with high \(\small{\text{T}/\text{R}\textsf{-}}\)ratios
would be problematic – potentially leading to toxicity after multiple
doses. Even worse would be switching from the generic \(\small{\text{T}_3}\) with its low \(\small{\text{T}/\text{R}\textsf{-}}\)ratio
to any of the other formulations.
An Analysis of Variance (ANOVA) instead of a t-test allows period effects to be taken into account.50 51 52 This decade was also the heyday of Bayesian methods.53 54 55 56 Nomograms for sample size estimation were also Bayesian57 but happily misused by frequentists. New parametric58 59 as well as nonparametric methods entered the stage.58 60 PK metrics to compare controlled release formulations in steady state were proposed.61 62 63 The first software to evaluate 2×2×2 crossover studies was released in the public domain.64
The acceptance range in bioequivalence is based on a ‘clinically relevant difference’ \(\small{\Delta}\), i.e., for data following a lognormal distribution \[\left\{\theta_1,\theta_2\right\}=\left\{100\,(1-\Delta),100\,(1-\Delta)^{-1}\right\}\tag{5}\] It must be mentioned that the commonly applied \(\small{\Delta=20\%}\)65 leading to \(\small{\{80.00\%,}\) \(\small{125.00\%\}}\) is arbitrary (as is any other).
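As a small worked example of \(\small{(5)}\), the helper below (its name is mine) returns the limits for a given \(\small{\Delta}\). It also shows that the limits are symmetrical in the log-domain, i.e., the logarithms of the lower and upper limit add up to zero.

```r
# Acceptance limits according to (5) for lognormal data
limits <- function(Delta) {
  c(lower = 100 * (1 - Delta), upper = 100 / (1 - Delta))
}
for (Delta in c(0.20, 0.10)) {
  L <- limits(Delta)
  cat(sprintf("Delta = %2.0f%%: %6.2f%% to %6.2f%% (log-limits: %+.4f, %+.4f)\n",
              100 * Delta, L[["lower"]], L[["upper"]],
              log(L[["lower"]] / 100), log(L[["upper"]] / 100)))
}
```

With \(\small{\Delta=10\%}\) the limits are 90.00% to 111.11%.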
An important leap forward was the Two One-Sided Tests Procedure (TOST)21 – although it was never implemented in its original form \(\small{(6)}\) in regulatory practice. Instead, the confidence interval inclusion approach \(\small{(7)}\) made it to the guidelines. Although these approaches are operationally identical (i.e., their outcomes [pass | fail] are the same), these are statistically different methods:
The TOST Procedure gives two \(\small{p}\)-values, namely \(\small{p(\theta_0\geq\theta_1)}\) and \(\small{p(\theta_0\leq\theta_2)}\).
\[\begin{matrix}\tag{6} H_\textrm{0L}:\frac{\mu_\textrm{T}}{\mu_\textrm{R}}\leq\theta_1\:vs\:H_\textrm{1L}:\frac{\mu_\textrm{T}}{\mu_\textrm{R}}>\theta_1\\ H_\textrm{0U}:\frac{\mu_\textrm{T}}{\mu_\textrm{R}}\geq\theta_2\:vs\:H_\textrm{1U}:\frac{\mu_\textrm{T}}{\mu_\textrm{R}}<\theta_2 \end{matrix}\]
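The two-sided \(\small{1-2\,\alpha}\) confidence interval is assessed for inclusion in the acceptance range \(\small{\left\{\theta_1,\theta_2\right\}}\).
\[H_0:\frac{\mu_\textrm{T}}{\mu_\textrm{R}}\not\subset\left\{\theta_1,\theta_2\right\}\:vs\:H_1:\theta_1<\frac{\mu_\textrm{T}}{\mu_\textrm{R}}<\theta_2\tag{7}\]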
Evaluating our example for \(\small{\left\{\theta_1,\theta_2\right\}=\left\{80\%,120\%\right\}}\)
by \(\small{(6)}\)
we get \(\small{p(\theta_0\geq\theta_1)=0.0160}\)
and \(\small{p(\theta_0\leq\theta_2)=0.0528}\).
Since one of the \(\small{p\textsf{-}}\)values is \(\small{>\alpha}\), the study would fail.
Assessing it by \(\small{(7)}\) we get a
CI of 82.44 – 124.21%. The
study would fail because the upper
CL is > 120%.
Although the study failed, repeating it with a larger sample size
(with higher power) would likely allow us to demonstrate
BE, since the outcome was inconclusive.
If assessing the example by \(\small{(7)}\) according to current standards (i.e., of \(\small{\log_e\textsf{-}}\)transformed data for \(\small{\left\{\theta_1,\theta_2\right\}=}\) \(\small{\{80\%,}\) \(\small{125\%\}}\)), we would get a 90% CI of 87.40 – 121.73% and the study would pass. The Times They Are a-Changin’.
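Both evaluations can be sketched in a few lines. For the TOST on untransformed data I assume that the limits of 80% and 120% are expressed as ±20% of the observed mean of R (the naïve approach discussed earlier); the 90% CI is based on the multiplicative model, i.e., \(\small{\log_e\textsf{-}}\)transformed data.

```r
n <- 12L
example <- data.frame(subject   = rep(1L:n, each = 2),
                      treatment = c("R", "T", "T", "R", "R", "T", "T", "R",
                                    "T", "R", "R", "T", "R", "T", "T", "R",
                                    "T", "R", "R", "T", "R", "T", "T", "R"),
                      period    = rep(1L:2L, n),
                      Y         = c(81, 71, 61, 65, 94, 80, 66, 74,
                                    94, 54, 63, 97, 85, 70, 76, 90,
                                    54, 53, 56, 99, 90, 83, 51, 68))
facs <- c("subject", "period", "treatment")
example[facs] <- lapply(example[facs], factor)
alpha <- 0.05
# TOST (6), additive model; limits as differences (assumption: +/-20% of X.R)
add   <- lm(Y ~ subject + period + treatment, data = example)
est   <- summary(add)$coefficients["treatmentT", ]
X.R   <- mean(example$Y[example$treatment == "R"])
theta <- c(-0.20, +0.20) * X.R
p.L   <- pt((est[["Estimate"]] - theta[1]) / est[["Std. Error"]],
            df = df.residual(add), lower.tail = FALSE)
p.U   <- pt((est[["Estimate"]] - theta[2]) / est[["Std. Error"]],
            df = df.residual(add), lower.tail = TRUE)
cat(sprintf("TOST  : p (lower) = %.4f, p (upper) = %.4f: %s\n", p.L, p.U,
            ifelse(max(p.L, p.U) <= alpha, "pass", "fail")))
# CI inclusion (7), multiplicative model, limits 80 - 125%
mult  <- lm(log(Y) ~ subject + period + treatment, data = example)
CI    <- 100 * exp(confint(mult, "treatmentT", level = 1 - 2 * alpha))
cat(sprintf("90%% CI: %.2f%% to %.2f%%: %s\n", CI[1], CI[2],
            ifelse(round(CI[1], 2) >= 80 & round(CI[2], 2) <= 125,
                   "pass", "fail")))
```

The \(\small{p}\)-values should agree with the ones given above up to rounding.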
Interlude 2
It is a misconception that a certain CI of a sample (i.e., a particular study) contains the – true (but unknown) – population mean \(\small{\mu}\) with \(\small{1-\alpha}\) probability. Let’s simulate some studies and evaluate them by \(\small{(7)}\):
library(PowerTOST) # provides sampleN.TOST()
set.seed(123) # for reproducibility of simulations
mue <- 1 # true population mean
CV <- 0.25
studies <- 100
x <- sampleN.TOST(CV = CV, theta0 = mue, targetpower = 0.8, print = FALSE)
subjects <- x[["Sample size"]]
power <- x[["Achieved power"]]
# simulate subjects within studies, lognormal distribution
samples <- data.frame(study = rep(1:studies, each = subjects * 2),
subject = rep(rep(1:subjects, studies), each = 2),
period = rep(rep(1:2, studies), 2),
sequence = rep(c(rep(c("TR"), subjects),
rep(c("RT"), subjects)), studies),
treatment = c(rep(c("T", "R"), subjects / 2),
rep(c("R", "T"), subjects / 2)),
Y = rlnorm(n = subjects * studies * 2,
meanlog = log(mue) - 0.5 * log(CV^2 + 1),
sdlog = sqrt(log(CV^2 + 1))))
facs <- c("subject", "period", "treatment")
samples[facs] <- lapply(samples[facs], factor) # factorize the data
result <- data.frame(study = 1:studies, PE = NA_real_,
lower = NA_real_, upper = NA_real_,
BE = FALSE, contain = TRUE)
grand.PE <- numeric(studies)
for (i in 1:studies) {
temp <- samples[samples$study == i, ]
heretic <- lm(log(Y) ~ period + subject + treatment, data = temp)
result$PE[i] <- 100 * exp(coef(heretic)[["treatmentT"]])
result[i, 3:4] <- 100 * exp(confint(heretic, level = 0.90)["treatmentT", ])
if (round(result[i, 3], 2) >= 80 & round(result[i, 4], 2) <= 125)
result$BE[i] <- TRUE
if (result$lower[i] > 100 * mue | result$upper[i] < 100 * mue) result$contain[i] <- FALSE
grand.PE[i] <- mean(result$PE[1:i]) # (cumulative) grand means
}
dev.new(width = 4.5, height = 4.5)
op <- par(no.readonly = TRUE)
par(mar = c(3.05, 2.9, 1.4, 0.75), cex.axis = 0.9, mgp = c(2, 0.5, 0))
xlim <- range(c(min(result$lower), 1e4 / min(result$lower),
max(result$upper), 1e4 / max(result$upper)))
plot(1:2, 100 * rep(mue, 2), type = "n", log = "x", xlab = "PE [90% CI]",
ylab = "study #", axes = FALSE,
xlim = xlim, ylim = range(result$study))
abline(v = 100 * c(0.8, mue, 1.25), lty = c(2, 1, 2))
axis(1, at = c(125, pretty(xlim)),
labels = sprintf("%.0f%%", c(125, pretty(xlim))))
axis(2, at = c(1, pretty(1:studies)[-1]), las = 1)
axis(3, at = 100 * mue, label = expression(mu))
lines(grand.PE, 1:studies, lwd = 2)
for (i in 1:studies) {
  if (result$BE[i]) { # pass
    clr <- "blue"
  } else { # fail
    if (result$contain[i]) { # mue within CI
      clr <- "magenta"
    } else { # mue not in CI
      clr <- "red"
    }
  }
  lines(c(result$lower[i], result$upper[i]), rep(i, 2), col = clr)
  points(result$PE[i], i, pch = 16, cex = 0.6, col = clr)
}
par(op) # restore the original graphics parameters
In 7% of studies the population mean \(\small{\mu}\) is not contained in
the 90% CI (red lines). In
other words, given the result of a single study we can never
know where \(\small{\mu}\) lies. Only
the grand mean (mean of sample means \(\small{\frac{1}{n}\sum_{i=1}^{i=n}\overline{x_i}}\))
approaches \(\small{\mu}\) for a large number
of samples. After the 100th study it is with 99.44%
pretty close to \(\small{\mu}\) (for
geeks: The convergence is poor; if simulating 25,000 studies, it is
100.23%). However, nobody would repeat a – passing – study (blue lines)
for such rather uninteresting information, right?
This also explains why a particular study might fail by pure
chance even if a formulation is equivalent (here 15% of
studies; red or magenta lines). Such cases are related to the producer’s
risk (Type II
Error = 1 – power), which is for the given conditions 16.3%. On the
other hand, it is also possible that a formulation which is not
equivalent might pass. These cases are related to the patient’s
risk (Type I Error).
For details see the articles about hypotheses, treatment effects, post hoc power, and sample size
estimation. Science is a cruel mistress.
At a hearing in 1986 the FDA confirmed that \(\small{(6)}\) or \(\small{(7)}\) of untransformed data should be used with \(\small{\Delta=}\) \(\small{20\%}\). If clinically relevant, tighter limits (\(\small{\Delta=10\%}\)) might be needed.66
The first German guideline was drafted by the International Association for Pharmaceutical Technology (Arbeitsgemeinschaft für Pharmazeutische Verfahrenstechnik) in 1985.67 It was presented and discussed in 1987.68 69 70 In the same year, the first guideline of the Nordic Council on Medicines was published in cooperation with the agencies of Denmark, Finland, Iceland, Norway and Sweden.71
In 1988 wider acceptance limits of 70 – 130% were proposed for \(\small{C_\text{max}}\) due to its inherent high variability72 (as a one-point metric practically always larger than the one of the integrated metric \(\small{AUC}\)).
The Australian draft guideline was published in 1988.73 It was the first covering not only the design and evaluation but also validation of bioanalytical methods. The model with effects period, subject, treatment25 52 was recommended and a test for sequence-effects was not considered necessary. The problematic conversion of differences to percentages was acknowledged and Fieller’s CI26 27 mentioned. Kudos to both!
In 1989 a loose-leaf collection was started.74 It contained raw data of generic drugs marketed in Germany, the evaluation provided by companies, as well as results recalculated by the ZL (Central Laboratory of German Pharmacists). Including the 6th supplement of 1996 it contained more than 2,000 pages… It was an indispensable resource for planning new studies and also showed the ‘journey’ of dossiers (i.e., the same study being used by different companies).
The BioInternational conferences co-organized by Henning Blume and Kamal K. Midha (Toronto 1989, Bad Homburg 1992, Munich 1994, Tokyo 1996, London 1999, 2003, 2005, 2008) made valuable contributions to the advancement of testing for bioequivalence. The first dealt with the \(\small{\log_{e}\textsf{-}}\)transformation of data and the definition of Highly Variable Drugs (HVDs).75 There was a poll among the participants about the \(\small{\log_{e}\textsf{-}}\)transformation of data. Outcome: ⅓ never, ⅓ always, ⅓ case by case (i.e., perform both analyses and report the one with narrower CI ‘because it fits the data better’). Let’s be silent about the last team.76 HVDs were defined as drugs with intra-subject variabilities of more than 30% but problems might be evident already with 25%.
The original acceptance range was symmetrical around 100%. In \(\small{\log_{e}\textsf{-}}\)scale it should be symmetrical around \(\small{0}\) (because \(\small{\log_{e}1=0}\)). What happens to our \(\small{\Delta}\), which should still be 20%? Due to the positive skewness of the lognormal distribution a lively discussion started after early publications proposing 80 – 125%.25 52 Keeping 80 – 120% would have been flawed because the maximum power should be obtained at \(\small{\mu_\text{T}/\mu_\text{R}=1}\) for \[\exp\left((\log_{e}\theta_1+\log_{e}\theta_2)/2\right),\tag{8}\] which works only if \(\small{\theta_2=\theta_1^{-1}}\) or \(\small{\theta_1=\theta_2^{-1}}\). Keeping the original limits, maximum power would be obtained at \(\small{\mu_\text{T}/\mu_\text{R}=}\) \(\small{\exp((\log_{e}0.8+\log_{e}1.2)/2)}\) \(\small{\approx0.979796}\).
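The argument around \(\small{(8)}\) is easy to check numerically. The following is only a minimal sketch in base R; the helper name `max_power_at()` is made up for illustration and is not part of any package.

```r
# Ratio at which power peaks according to (8): the geometric mean
# of the lower and upper acceptance limits (hypothetical helper).
max_power_at <- function(theta1, theta2) {
  exp((log(theta1) + log(theta2)) / 2)
}

max_power_at(0.80, 1.20) # original limits: peak shifted below 1
max_power_at(0.80, 1.25) # limits symmetrical in log-scale: peak at 1
```

Only with reciprocal limits does the peak sit at a \(\small{\text{T}/\text{R}\textsf{-}}\)ratio of one.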
There were three parties (all agreed that the acceptance range should be symmetrical in \(\small{\log_{e}\textsf{-}}\)scale and consequently asymmetrical when back-transformed). These were their arguments and suggestions:
The 90% CI inclusion approach \(\small{(7)}\) based on \(\small{\log_{e}\textsf{-}}\)transformed data with acceptance limits of 80.00 – 125.00% \(\small{(5)}\) was the winner.
either too low, i.e., \(\small{p(\text{BA}<\phantom{1}80\%)>5\%}\)
or too high, i.e., \(\small{p(\text{BA}>125\%)>5\%}\)
but evidently not at the same time. Hence, the 90% CI controls the risk for the population of patients. Therefore, if a study passes, the risk for patients does still not exceed 5%. Note that at the BE limits \(\small{\left\{\theta_1,\theta_2\right\}}\) power, i.e., the chance to pass, is 5%. Therefore, the patient’s risk (type I error) is controlled.
First sample size tables for the multiplicative model with the acceptance range 80 – 125% were published77 and extended for narrower (\(\small{\Delta=10\%}\): 90.00 – 111.11%) and wider (\(\small{\Delta=30\%}\): 70.00 – 142.86%) acceptance ranges.78 The nonparametric method was improved taking period-effects into account.79 80 Drug-drug and food-interaction studies should be assessed for equivalence.81 The general applicability of average BE was challenged and the concept of individual and population bioequivalence outlined.82 83 84 The first textbook dealing exclusively with BA/BE was published.85
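The acceptance ranges quoted above follow from \(\small{\Delta}\) once the upper limit is taken as the reciprocal of the lower one. A small sketch in base R, where the function name `be_limits()` is an assumption of mine and not taken from the cited tables:

```r
# Acceptance limits in percent for a given Delta in the multiplicative
# model: lower = 1 - Delta, upper = 1 / lower (hypothetical helper).
be_limits <- function(delta) {
  theta1 <- 1 - delta
  theta2 <- 1 / theta1
  round(100 * c(lower = theta1, upper = theta2), 2)
}

# one column per Delta: 10%, 20%, and 30%
sapply(c(0.10, 0.20, 0.30), be_limits)
```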
This was also the decade of updated and new guidelines. A European draft guidance was published in 1990;86 the final guideline was published in December 1991 and came into force in June 1992.87 The 90% CI inclusion approach of \(\small{\log_{e}\textsf{-}}\)transformed data with an acceptance range of 80 – 125% was recommended and for NTIDs the acceptance range may need to be tightened. Due to its inherent higher variability a wider acceptance range may be acceptable for \(\small{C_\text{max}}\). If inevitable and clinically acceptable, a wider acceptance range may also be used for \(\small{AUC}\). Only if clinically relevant, a nonparametric analysis of \(\small{t_\text{max}}\) was recommended.
An in vivo study was not required if the new formulation is
Similar statements about solutions were given in all later guidelines. The second led to application of the Biopharmaceutic Classification System (BCS).88 More about that further down.
“The almost classical 1977 FDA notice […] defined bioavailability as the rate and extent to which the active drug ingredient or therapeutic moiety is absorbed from a drug product and becomes available at the site of action.20 However, in the majority of cases substances are intended to exhibit a systemic therapeutic effect, and a more practical definition can be given, taking into account that the substance in the general circulation is in exchange with the substance at the site of action. Therefore, the European 1991 guidance on bioavailability and bioequivalence87 gave the following definition: Bioavailability is understood to be the extent and rate to which a substance or its therapeutic moiety is delivered from the pharmaceutical form into the general circulation.
In July 1992 a guidance of the
FDA was
published.90 An
ANOVA of \(\small{\log_{e}\textsf{-}}\)transformed
data was recommended and the nested subject(sequence)
term in the statistical model entered the scene.
It must be mentioned that in comparative
BA studies subjects are usually
uniquely coded. Hence, the term subject(sequence) is a
bogus one91 and could be replaced by the simple
subject as well (see below for an
example). Alas, this model was implemented in all global guidelines ever
since. If you understand why, let me know.
In the same year the Canadian guidance for Immediate Release (IR) formulations was published.92 At that time it was the most extensive one because it gave not only the method of evaluation, but information about the study design, sample size, ethics, bioanalytics, etc. It differed from the others in the relaxed requirement for \(\small{C_\text{max}}\), where only the \(\small{\text{T}/\text{R}\textsf{-}}\)ratio has to lie within 80 – 125% (instead of its CI). The guidance for MR formulations followed in 1996.93
In 1998 the World Health Organization published its first guideline,94 which was similar to the European one.
Table IV shows the result of the example evaluated by various methods. \[\small{\begin{array}{lcccc} \textsf{Table IV}\phantom{0}\\ \phantom{0}\text{Method} & \text{Model} & \text{PE} & \text{power},p,\text{CI, etc.} & \text{BE?}\\\hline \text{80/20 Rule} & \text{additive} & - & 46.40<80\% & \text{fail}\\ t\text{-test} & \text{additive} & +2.417\;(103.32\%) & 0.7193\geq0.05 & \text{pass}\\ \text{TOST} & \text{additive} & +2.417\;(103.32\%) & 0.0160\leq0.05,\,0.0528>0.05 & \text{fail}\\ \text{95% CI} & \text{additive} & +2.417\;(103.32\%) & -12.777\,,+17.611\;(82.44-124.21\%) & \text{fail}\\ \text{Fieller} & \text{additive} & 103.32\% & 84.84-125.65\% & \text{fail}\\ \text{Westlake} & \text{additive} & \pm0.000\;(100.00\%) & \pm2.944\;(\pm21.80\%) & \text{fail}\\\hline \text{80/20 Rule} & \text{multiplicative} & - & 72.90<80\% & \text{fail}\\ \text{75/75 Rule} & \text{(multiplicative)} & - & 9/12=75\% & \text{pass}\\ t\text{-test} & \text{multiplicative} & 103.14\% & 0.7317\geq0.05 & \text{pass}\\ \text{TOST} & \text{multiplicative} & 103.14\% & 0.0097\leq0.05,\,0.0309\leq0.05 & \text{pass}\\ {\color{Blue} {90\%\,\text{CI}}} & {\color{Blue} {\text{multiplicative}}} & {\color{Blue} {103.14\%}} & {\color{Blue} {87.40-121.73\%}} & {\color{Blue} {\text{pass}}}\\ \text{Westlake} & \text{multiplicative} & 100.00\% & \pm18.09\% & \text{pass}\\ \text{75/75 Rule} & \text{multiplicative} & - & 75\%\subset \pm25\% & \text{pass}\\\hline \end{array}}\] In the additive model the acceptance range was 80 – 120%, whereas in the multiplicative model it is 80 – 125%. Since in the former differences are assessed, the wrong percentages are given in brackets.
At present only the 90% CI inclusion approach is globally accepted. Our example in R again:
example <- data.frame(subject = rep(1:12, each = 2),
sequence = c("RT", "RT", "TR", "TR", "RT",
"RT", "TR", "TR", "TR", "TR",
"RT", "RT", "RT", "RT", "TR",
"TR", "TR", "TR", "RT", "RT",
"RT","RT", "TR", "TR"),
treatment = c("R", "T", "T", "R", "R", "T", "T", "R",
"T", "R", "R", "T", "R", "T", "T", "R",
"T", "R", "R", "T", "R", "T", "T", "R"),
period = rep(1:2, 12),
Y = c(81, 71, 61, 65, 94, 80, 66, 74,
94, 54, 63, 97, 85, 70, 76, 90,
54, 53, 56, 99, 90, 83, 51, 68))
facs <- c("subject", "sequence", "treatment", "period")
example[facs] <- lapply(example[facs], factor) # factorize the data
txt <- paste("nested model : period, subject(sequence), treatment",
"\nsimple model : period, subject, sequence, treatment",
"\nheretic model: period, subject, treatment\n\n")
result <- data.frame(model = c("nested", "simple", "heretic"),
PE = NA, lower = NA, upper = NA, BE = "fail", na = 0)
for (i in 1:3) {
  if (result$model[i] == "nested") { # bogus nested model (guidelines)
    nested <- lm(log(Y) ~ period +
                          subject %in% sequence +
                          treatment, data = example)
    result$PE[i]   <- 100 * exp(coef(nested)[["treatmentT"]])
    result[i, 3:4] <- 100 * exp(confint(nested, level = 0.90)["treatmentT", ])
    # count of not estimable (aliased) effects, assumed to be the NA coefficients
    result[i, 6]   <- sum(is.na(coef(nested)))
  }
  if (result$model[i] == "simple") { # simple model (subjects are uniquely coded)
    simple <- lm(log(Y) ~ period +
                          subject +
                          sequence +
                          treatment, data = example)
    result$PE[i]   <- 100 * exp(coef(simple)[["treatmentT"]])
    result[i, 3:4] <- 100 * exp(confint(simple, level = 0.90)["treatmentT", ])
    result[i, 6]   <- sum(is.na(coef(simple)))
  }
  if (result$model[i] == "heretic") { # heretic model (without sequence)
    heretic <- lm(log(Y) ~ period +
                           subject +
                           treatment, data = example)
    result$PE[i]   <- 100 * exp(coef(heretic)[["treatmentT"]])
    result[i, 3:4] <- 100 * exp(confint(heretic, level = 0.90)["treatmentT", ])
    result[i, 6]   <- sum(is.na(coef(heretic)))
  }
  # rounding acc. to guidelines
  if (round(result[i, 3], 2) >= 80 && round(result[i, 4], 2) <= 125)
    result$BE[i] <- "pass"
}
checks <- data.frame(comparison = c("simple", "heretic"),
                     lower = "different", upper = "different")
for (i in 2:3) {
  if (isTRUE(all.equal(result$lower[i], result$lower[1])))
    checks$lower[i-1] <- "identical"
  if (isTRUE(all.equal(result$upper[i], result$upper[1])))
    checks$upper[i-1] <- "identical"
}
# cosmetics
names(checks) <- c("Comparison vs nested", "lower CL", "upper CL")
result$PE <- sprintf("%6.2f%%", result$PE)
result$lower <- sprintf("%6.2f%%", result$lower)
result$upper <- sprintf("%6.2f%%", result$upper)
names(result)[c(3:4, 6)] <- c("lower CL", "upper CL", "NE")
cat(txt); print(result, row.names = FALSE); print(checks, row.names = FALSE)
# nested model : period, subject(sequence), treatment
# simple model : period, subject, sequence, treatment
# heretic model: period, subject, treatment
#
#    model      PE lower CL upper CL   BE NE
#   nested 103.14%   87.40%  121.73% pass 13
#   simple 103.14%   87.40%  121.73% pass  1
#  heretic 103.14%   87.40%  121.73% pass  0
#  Comparison vs nested  lower CL  upper CL
#                simple identical identical
#               heretic identical identical
As already outlined above, the nested model recommended in all [sic] guidelines is over-specified because subjects are uniquely coded.
In the example we get 13 not estimable (aliased) effects. Correct,
because we are asking for something the data cannot provide.91 In the simple model only one effect cannot be
estimated. However, even sequence can be removed from the
model. I call it heretic because regulators will grill you if
you are using it. It was proposed by Westlake25
52 and I employed it in hundreds (‼)
of studies and some cases are published.95
A ‘Positive List’ was published by the German regulatory authority, i.e., for 90 drugs BE was not required.96 In order to comply with the European Note for Guidance of 200197 it had to be removed by the BfArM.
The FDA guidance for ‘Scale-Up and Postapproval Changes’ (SUPAC)98 99 defined three ‘Levels’ of changes:
Under certain conditions of Level 2, demonstration of in vitro similarity by \(\small{f_2\geq 50\%}\)100 in the application / compendial medium at 15, 30, 45, 60 and 120 minutes (or until an asymptote is reached) of at least 12 units is sufficient.
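The similarity factor is calculated as \[f_2=50\cdot\log_{10}\left(\frac{100}{\sqrt{1+\frac{1}{n}\sum_{i=1}^{i=n}\left(\text{R}_i-\text{T}_i\right)^2}}\right)\small{\textsf{,}}\] where \(\small{\text{R}_i}\) and
\(\small{\text{T}_i}\) are the
cumulative percent dissolved at \(\small{1\ldots\ n}\) time points of \(\small{\text{R}}\) and \(\small{\text{T}}\), respectively.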
For Level 3 changes in vivo testing
(BE) is mandatory.
It must be mentioned that comparing formulations by \(\small{f_2}\) can be problematic, especially if the shapes of dissolution curves are different and/or if they intersect. \(\small{f_2}\) is not a statistic and, therefore, it is impossible to evaluate false positive and negative rates of decisions for approval of drug products based on it.101
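For orientation, here is a minimal sketch of how \(\small{f_2}\) is commonly calculated, namely 50 times the decadic logarithm of 100 divided by the root of one plus the mean squared difference of the two profiles. The function name `f2()` and both dissolution profiles are invented for illustration; they are not data from any of the cited documents.

```r
# Similarity factor f2 of two dissolution profiles given as cumulative
# percent dissolved at the same n time points (hypothetical helper).
f2 <- function(ref, test) {
  stopifnot(length(ref) == length(test), length(ref) > 0)
  msd <- mean((ref - test)^2)            # mean squared difference
  50 * log10(100 / sqrt(1 + msd))
}

# made-up profiles at 15, 30, 45, 60, and 120 minutes
ref  <- c(35, 58, 74, 85, 96)
test <- c(30, 52, 70, 83, 95)
f2(ref, test)      # compare with the cut-off of 50
f2(ref, ref)       # identical profiles give 100
f2(ref, ref - 10)  # a constant 10% difference falls just below 50
```

Note that the sketch returns a single number without any measure of uncertainty, which is exactly the limitation mentioned above.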
Two (of five) sessions of the BioInternational ’92 conference in Bad
Homburg dealt with BE of Highly
Variable Drugs.102 103 Various approaches have been discussed:
Multiple dose instead of single dose studies, metabolite instead of the
parent compound, stable isotope techniques,104 add-on designs,
and – for the first time – replicate designs.
Although the BioInternational 2 in Munich 1994 was with over 600 participants the largest in the series, no substantial progress for HVD(P)s was achieved.105 Following a suggestion106 at a joint AAPS/FDA workshop in 1995 widening the conventional acceptance limits of 80.00 – 125.00% was considered.107
“For some highly variable drugs and drug products, the bioequivalence standard should be modified by changing the BE limits while maintaining the current confidence interval at 90%. […] the bioequivalence limits should be determined based in part upon the intrasubject variability for the reference product.
A hot topic ever since… Why are we discussing it for 36 (‼) years (since the first BioInternational conference)? Is it really that complicated108 or are we too stupid?
Studies in steady-state were proposed as an option for HVD(P)s in a European draft guideline109 in order to reduce variability, but it was removed from the final version of 2001.97
Validation of bioanalytical methods110 111 112 113 was partly covered in Australia and Canada. However, no specific guideline existed. A series of conferences (informally known as ‘Crystal City’) was initiated in 1990.114 Procedures stated in the conference report115 were discussed at the BioInternational 2 in Munich 1994 and quickly adopted by bioanalytical sites.
In 1996 the
WHO initiated the
‘International Comparator Product System for Pharmaceuticals’ to
establish a ‘Global Comparator’, which could be used in countries
where the innovator’s product is not – or no longer – marketed. As the
product may have been changed (not necessarily in all countries), the
innovators were contacted and asked which product is currently closest
to the one that led to the original authorization. These letters were
ignored. There was not even an answer like »We have received your
request but prefer rather not to reply because the information is
confidential.« In light of this, the
WHO requested that the
competent authorities of the countries inquire once more. Unfortunately,
this request has been largely ignored by the majority of innovators.116 The
list of international comparator products was first published in June
1999 and is updated periodically.117
An innovator showed an example of a product which has undergone more
than twenty (‼) changes, with the product closest to the original being
marketed only in Nigeria…118
Poland happily adopted Germany’s ‘Positive List’96 when it wanted to join the European Union, only
to learn that in the meantime Germany had abandoned it to comply with the
2001 guideline.97
A positive list of 19 drugs existed in The Netherlands for »strict
national market authorisation«.119 It must have been
a schizophrenic situation for assessors of the
MEB: In the morning a
dossier for national MA of
IR paracetamol without any in
vivo comparison → accepted. In the afternoon another
dossier of the same product in the course of a European
submission. A comparative BA study
performed, but 90% CI
80.00–125.01% → rejected. Outright bizarre.
Denmark required that the 90%
CI had to include 100%
(i.e., that there is no significant treatment effect).120
Bizarre as well. For details see the examples in this article.
In February 2005 the FDA published the Electronic Orange Book (EOB), which is updated daily. It can be searched by: Proprietary name, active ingredient, applicant (company), application number, dosage form, route of administration, patent number. It gives also a list of newly added or delisted patents.
The series of ‘Crystal City’ meetings continued.121 122 Incurred sample reanalysis (ISR) was proposed122 and details subsequently outlined.123 The first bioanalytical method validation guidance was published by the FDA in 2001 and revised in 2018.124 125 Before the EMA’s draft guideline was published in 2009,126 some European inspectors raised an eyebrow if sites worked according to a ‘foreign’ (i.e., the FDA’s) guidance.
“The validation of bioanalytical methods and the analysis of study samples should be performed in accordance with the principles of Good Laboratory Practice (GLP). However, as human bioanalytical studies fall outside of the scope of GLP […], the sites conducting the human studies are not required to be monitored as part of a national GLP compliance programme.
Well roared, lions! My CRO (in Austria) was GLP-certified since 1991, although we performed only phase I studies. In other countries (e.g., Spain), this was not possible. In Germany GLP is subject to state law. Hence, it was possible to get certified in one federal state but not in another… However, this ‘issue’ was resolved with the final guideline published in 2011127 and the ICH M10 guideline of 2022,128 129 superseding all local guidelines.
In June 2010 the
FDA started to
publish Product-Specific Guidances (PSGs).130 They are available
(as of November 19, 2024 an amazing 2,252) and can be searched by active
ingredient or RLD. Many
PSGs remain drafts for a
long time. For example, of the 138
PSGs starting with the
letter P, only five (‼) are final and some have been
in draft state for 14 years.
The EMA requires for
prolonged and multiphasic release products both a single dose
study and a study in steady state.131 The steady state study can be waived if
there is no ‘risk’ of accumulation (\(\small{AUC_{0-\tau}>90\%AUC_{0-\infty}}\),
where \(\small{\tau}\) is the intended
dosing interval). However, for prolonged release products this option is
rarely – if ever – possible… Different
PK metrics to assess the minimum
concentration in steady-state are required for originators and generic
companies. The former have to assess the minimum concentration within
the dosing interval \(\small{(C_\text{ss,min})}\), whereas the
latter have to assess the minimum concentration at the end of the dosing
interval \(\small{(C_{\text{ss}\,,\tau})}\). If there
is a lag-time, the latter is more difficult due to its higher
variability.132 Why double standards?
For prolonged release products with no ‘risk’ of accumulation and
multiphasic release products the cut-off times for partial \(\small{AUC\textsf{s}}\) have to be
pre-specified based on PK, which
is a rather difficult feat. Furthermore,
BE has not only to be demonstrated
for \(\small{AUC_{0-\text{t}}}\) and
all partial \(\small{AUC\textsf{s}}\)
but also for \(\small{C_\text{max}}\)
in each of the sections. This gives for one cut-off time \(\small{\text{tc}}\) already a whopping six
PK metrics (\(\small{AUC_{0-\text{t}}}\), \(\small{AUC_{0-\infty}}\), \(\small{AUC_{0-{\text{tc}}}}\), \(\small{AUC_{\text{tc}-\text{t}}}\), \(\small{C_{\text{max,t}\leq \text{tc}}}\), \(\small{C_{\text{max,t}>\text{tc}}}\)).
At least reference-scaling (see below) is acceptable
for all PK metrics – except for
\(\small{AUC_{0-\text{t}}}\) and \(\small{AUC_{0-\infty}}\).130 That’s different to the few
PSGs of the FDA
(e.g., methylphenidate),
where the cut-off times are based on
PD and – apart from the partial
\(\small{AUC\textsf{s}}\) and \(\small{AUC_{0-\infty}}\) – only the
global \(\small{C_\text{max}}\) is required.
Whereas Average Bioequivalence (ABE) is bijective (if \(\small{T}\) is equivalent to \(\small{R}\), \(\small{R}\) is also equivalent to \(\small{T}\)), this holds in all variants of Scaled Average Bioequivalence (SABE) if and only if \(\small{CV_\text{wT}=CV_\text{wR}}\). Therefore, switching from \(\small{R}\) to \(\small{T}\) is tolerable if \(\small{CV_\text{wT}<CV_\text{wR}}\) but in such a case switching from \(\small{T}\) to \(\small{R}\) might be problematic.
After a wealth of – controversial – publications in the 1990s,82 83 84 133 134 135 136 137 138 139 140 141 the FDA introduced two new concepts as alternatives to ABE, namely Population Bioequivalence (PBE) and Individual Bioequivalence (IBE).142 ABE focuses only on the comparison of population averages of the PK metrics and not the variances of formulations. It does also not assess a subject-by-formulation interaction variance, that is, the variation in the average \(\small{\text{T}}\) and \(\small{\text{R}}\) difference among individuals. In contrast, PBE and IBE include comparisons of both averages and variances of PK metrics. The PBE approach assesses total variability of the PK metrics in the population. The IBE approach assesses within-subject variability for the \(\small{\text{T}}\) and \(\small{\text{R}}\) formulations, as well as the subject-by-formulation interaction.
Demonstrated PBE would support ‘Prescribability’ (i.e., a drug naïve patient could start treatment), whereas IBE supports ‘Switchability’ (i.e., a patient could switch formulations during treatment).141 Contrary to ABE, both PBE and IBE require studies in a full replicate design, which means that both \(\small{\text{T}}\) and \(\small{\text{R}}\) are administered twice. The acceptance limits for ABE were kept at 80.00 – 125.00% but for the others scaling to the variability of the reference was possible. That would mean an incentive for test formulations with a lower variability than the one of the reference but a penalty for ones with a higher variability.
However, the underlying statistical concepts were not trivial and the
result practically incomprehensible for non-statisticians. Furthermore,
both approaches had a discontinuity (when moving from constant- to
reference-scaling), which led to an inflated type I error (patient’s
risk) of approximately 6.5%.137 139 142 143 144
IBE faced criticism, e.g.,
responses [to the guidance] were still doubt-filled as to whether the new bioequivalence criteria really provided added value compared to average bioequivalence145
and was regarded a ‘theoretical’ solution to a ‘theoretical’ problem146 147
leading to its omission from a subsequent guidance,148 and a return to conventional ABE.149
“Average bioequivalence should suffice based upon grounds of ‘practicality, plausibility, historical adequacy, and purpose’ and ‘because we have better things to do.’ […] ‘Statisticians have a bad track record in bioequivalence, […] the literature is full of ludicrous recommendations from statisticians, […] regulatory recommendations (of dubious validity) have been hastily implemented, and practical realities have been ignored’. “Individual bioequivalence is a promising, clinically relevant method that should theoretically provide further confidence to clinicians and patients that generic drug products are indeed equivalent in an individual patient.
Even today, considering the studies summarized and analyzed by the FDA, the data is inadequate to validate the theoretical approach and provide confidence to the scientific community that the methodology required and the expense entailed are justified.
At this time, individual bioequivalence still remains a theoretical solution to solve a theoretical clinical problem. We have no evidence that we have a clinical problem, either a safety or an efficacy issue, and we have no evidence that if we have the problem that individual bioequivalence will solve the problem.
I remember a Dutch regulator standing up in the BioInternational conference in London 2003, saying:
I’m glad that PBE and IBE are dead. I never understood them.
We don’t see a problem, our database search showed that even HVDPs comply with the usual BE criteria.
However, this observation was based on its requirement that only the PE of \(\small{C_\text{max}}\) has to lie within 80.0–125.0%.93
Benet proposed in a keynote at the BioInternational 1994 that innovators should perform replicate studies as part of the new drug application and provide information on intra- and inter-subject measures of extent and rate of BA in the PK section of the package insert.105 SABE was also discussed at the BioInternational 2005.151
The EMA published a concept paper in 2006, containing valuable points for discussion.152
Application of SABE was not limited to a certain PK metric. Furthermore, a comparison of \(\small{s_{\text{wT}}^{2}}\) with \(\small{s_{\text{wR}}^{2}}\) would require a full replicate design.
“Who controls the past controls the future:
who controls the present controls the past.
SABE was first introduced in 2010 by the EMA,153 shortly after by the FDA,154 155 in 2017 by the WHO,156 and in 2018 by Health Canada.157
The concept of SABE is based on the following considerations:
The conventional model of ABE by \(\small{(7)}\) is modified in SABE to \[H_0:\;\frac{\mu_\text{T}}{\mu_\text{R}}\Big{/}\sigma_\text{wR}\not\subset\left\{\theta_{\text{s}_1},\theta_{\text{s}_2}\right\}\;vs\;H_1:\;\theta_{\text{s}_1}<\frac{\mu_\text{T}}{\mu_\text{R}}\Big{/}\sigma_\text{wR}<\theta_{\text{s}_2},\tag{13}\] where \(\small{\sigma_\text{wR}}\) is the within-subject standard deviation of the reference. The scaled limits \(\small{\left\{\theta_{\text{s}_1},\theta_{\text{s}_2}\right\}}\) of the acceptance range depend on conditions given by the agency.
Reference-Scaled Average Bioequivalence (RSABE)161 is recommended by the FDA and China’s CDE. Average Bioequivalence with Expanding Limits (ABEL)162 is another variant of SABE and recommended in all other jurisdictions. In order to apply the methods the following conditions have to be fulfilled:
In all methods a point estimate-constraint is imposed. Even if a study would pass the scaled limits, the PE has to lie within 80.00 – 125.00% in order to pass. Whilst the PE-constraint is statistically not justified, it was implemented in all jurisdictions ‘for political reasons’.164
- There is no scientific basis or rationale for the point estimate recommendations
- There is no belief that addition of the point estimate criteria will improve the safety of approved generic drugs
- The point estimate recommendations are only “political” to give greater assurance to clinicians and patients who are not familiar (don’t understand) the statistics of highly variable drugs
Compared to ABE, SABE leads to a substantial reduction in sample sizes (see this article). However, both RSABE and ABEL may result in an inflated type I error (the patient’s risk),108 which was already described in 2009162 165 (before [sic] SABE was implemented) and is still an unresolved issue166 167 (see also this article).
The FDA published SAS
code154 161 but it is a mystery why a fixed-effects
model for the partial replicate design and a mixed-effects model
for a full replicate design was recommended. If you understand why,
please let me know.
If \(\small{s_\text{wR}<0.294}\), ABE has to be assessed by \(\small{(7)}\) and \(\small{\Delta=20\%}\) (90% CI entirely within 80.00 – 125.00%).
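The switching value 0.294 is a standard deviation in \(\small{\log_{e}\textsf{-}}\)scale. For lognormal data it translates to a coefficient of variation via \(\small{CV=\sqrt{\exp(s^2)-1}}\), which shows that it is practically the familiar 30%. A minimal sketch in base R; the helper names `cv2s()` and `s2cv()` are mine and purely illustrative.

```r
# Conversion between a CV (as a ratio, not in percent) and the
# standard deviation in log-scale (hypothetical helpers).
cv2s <- function(CV) sqrt(log(CV^2 + 1))
s2cv <- function(s)  sqrt(exp(s^2) - 1)

cv2s(0.30)   # log-scale standard deviation belonging to a CV of 30%
s2cv(0.294)  # CV belonging to the switching value 0.294
```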
It must be mentioned that if the study was performed in a partial
replicate design, the model is over-specified and the optimizer of any
(‼) software might not converge (for details see this article).
If \(\small{s_\text{wR}\geq0.294}\),
RSABE should be applied. The regulatory constant is given by \[\theta_\text{s}=\frac{\log_{e}1.25}{s_0}\approx 0.8925742\ldots\small{\textsf{,}}\tag{14}\] where \(\small{s_0}\) is the regulatory switching
condition \(\small{0.25}\). The point
estimate \(\small{PE}\) is given by \[PE=\overline{Y}_\text{T}-\overline{Y}_\text{R}\small{\textsf{,}}\]
where \(\small{\overline{Y}_\text{T}}\)
and \(\small{\overline{Y}_\text{R}}\)
are the means of \(\small{\log_{e}}\)-transformed
PK-metrics obtained for the test
and reference products, respectively. The standard error \(\small{se}\) of the \(\small{PE}\) is \[se=\sqrt{\frac{\widehat{s}}{{N_\text{s}}^{2}}\sum\frac{1}{n_i}}\small{\textsf{,}}\tag{15}\] where \(\small{\widehat{s}}\) is the model’s
residual mean squared error, \(\small{N_\text{s}}\) is the number of
sequences, and \(\small{n_i}\) the
number of subjects in sequence \(\small{i}\). We start with the
SABE model \(\small{(13)}\) and
work with \(\small{\log_{e}\textsf{-}}\)transformed
values for convenience \[-\theta_\text{s}\leq\frac{\mu_\text{T}-\mu_\text{R}}{\sigma_\text{wR}}\leq\theta_\text{s}\tag{16}\]
and use its squared and linearized form \[\left(\mu_\text{T}-\mu_\text{R}\right)^2-{\theta_{s}}^{2}\cdot{\sigma_{\text{wR}}}^{2}\leq0\small{\text{.}}\tag{17}\]
Upon inspecting part of the SAS
code in the guidance…
pointest=exp(estimate); x=estimate**2-stderr**2; theta=((log(1.25))/0.25)**2; y=-theta*s2wr;
…we see that stderr**2, i.e., \(\small{se^2}\) from \(\small{(15)}\), is inserted in the left-hand side of \(\small{(17)}\) – which is formulated in the true parameters – yielding for the estimates \[PE^2-se^2-{\theta_{s}}^{2}\cdot {s_{\text{wR}}}^{2}\leq0\small{\textsf{.}}\tag{18}\] This is not stated as such in the formulas of the guidance. We are aware of only one reference,168 which is – regrettably – not in the public domain.
“The statistical approach we use is very similar to that proposed by Tothfalusi, Endrenyi, et al. 2001,169 with a minor difference (use of an unbiased estimator for \(\small{\left(\mu_\text{T}-\mu_\text{R}\right)^2})\).
Then \[\eqalign{ E_\text{m}&=PE^2-se^2\\ E_\text{s}&={\theta_{s}}^{2}\cdot{s_{\text{wR}}}^{2} }\tag{19}\] are calculated, where \(\small{E_\text{m}}\) and \(\small{E_\text{s}}\) are the estimates of the true parameters (\(\small{se^2}\) acts again as a bias correction). Since their distributions are known, their upper confidence limits \(\small{C_\text{m}}\) and \(\small{C_\text{s}}\) can be calculated by \[\eqalign{ C_\text{m}&=\left(\left|PE\right|+t_{1-\alpha,\,\nu}\cdot se\right)^2\\ C_\text{s}&=E_\text{s}\cdot \nu\big{/}\chi_{1-\alpha,\,\nu}^{2}\small{\textsf{,}} }\tag{20}\] where \(\small{\nu}\) are the degrees of freedom given by \(\small{\sum n-N_\text{s}}\). A modification170 of Howe’s approximation171 is used in order to get the CI of a sum of random variables from the individual CIs. The squared lengths of the individual CIs are: \[\eqalign{ L_\text{m}&=\left(C_\text{m}-E_\text{m}\right)^2\\ L_\text{s}&=\left(C_\text{s}-E_\text{s}\right)^2\small{\textsf{.}} }\tag{21}\] Finally we calculate the 95% upper confidence bound: \[\small{\textsf{bound}}=E_\text{m}-E_\text{s}+\sqrt{L_\text{m}+L_\text{s}}\tag{22}\]
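The chain from \(\small{(19)}\) to \(\small{(22)}\) can be sketched in a few lines of base R. This is not the agency’s code, just my reading of the steps: \(\small{E_\text{s}}\) is taken as the product \(\small{{\theta_{s}}^{2}\cdot{s_{\text{wR}}}^{2}}\) in line with \(\small{(18)}\), the two squared lengths are summed under the root, and a single \(\small{\nu}\) is used for both limits as in the text. The function name `rsabe_bound()` and all input values are hypothetical.

```r
# 95% upper confidence bound of the linearized RSABE criterion
# (hypothetical helper; pe and se are in log-scale).
rsabe_bound <- function(pe, se, s2wR, df, alpha = 0.05, s0 = 0.25) {
  theta2 <- (log(1.25) / s0)^2                 # squared regulatory constant
  Em <- pe^2 - se^2                            # bias-corrected squared difference
  Es <- theta2 * s2wR                          # scaled variance of the reference
  Cm <- (abs(pe) + qt(1 - alpha, df) * se)^2   # confidence limit for Em
  Cs <- Es * df / qchisq(1 - alpha, df)        # confidence limit for Es
  Lm <- (Cm - Em)^2                            # squared lengths
  Ls <- (Cs - Es)^2
  Em - Es + sqrt(Lm + Ls)
}

# invented example: log-scale PE 0.05, se 0.08, s2wR 0.20, 22 df
rsabe_bound(pe = 0.05, se = 0.08, s2wR = 0.20, df = 22)
```

A bound at or below zero would satisfy the scaled criterion; the additional constraint on the point estimate has to be checked separately.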
In order to pass RSABE, the 95% upper confidence bound has to be ≤ 0 and the PE has to lie within 80.00 – 125.00%.
Although the EMA’s concept paper stated152 that the statistical and computational methods would be given in the guideline, this was not the case.153 SAS code and two example data sets were published later in a Q&A document.172 The evaluation has to be done with a simple ANOVA, i.e., assuming identical within-subject variances of the test and reference products. Methods to identify and handle outliers were not given.
If \(\small{CV_\text{wR}\leq30\%}\), ABE has to be demonstrated by \(\small{(7)}\) and \(\small{\Delta=20\%}\) (90% CI entirely within 80.00 – 125.00%).
Otherwise, ABEL can be applied and the limits expanded to \(\small{\left\{L,U\right\}=100\exp(\mp k\cdot s_\text{wR})}\), with the regulatory constant \(\small{k=0.76}\). The scaling is capped at \(\small{CV_\text{wR}}\) 50% for all agencies (maximum expansion 69.84 – 143.19%), except for Health Canada at ≈57.382% (66.7 – 150.0%).
library(PowerTOST) # provides scABEL()
CVwR <- 100 * sort(c(seq(0.3, 0.6, 0.05), 0.57382))
EL   <- data.frame(CVwR = CVwR,
                   EMA.uc = c(rep("no", 4), rep("yes", 4)),
                   EMA.L = NA_real_, EMA.U = NA_real_,
                   HC.uc = c(rep("no", 6), rep("yes", 2)),
                   HC.L = NA_real_, HC.U = NA_real_)
EMA  <- scABEL(CV = CVwR / 100, regulator = "EMA")
HC   <- scABEL(CV = CVwR / 100, regulator = "HC")
EL[, 1]   <- sprintf("%.3f%%", EL[, 1])
EL[, 3:4] <- sprintf("%.2f%%", 100 * EMA)
EL[, 6:7] <- sprintf("%.1f%%", 100 * HC)
names(EL)[2:7] <- c("capped", "L (EMA)", "U (EMA)",
                    "capped", "L (HC)", "U (HC)")
print(EL, row.names = FALSE)
# CVwR capped L (EMA) U (EMA) capped L (HC) U (HC)
# 30.000% no 80.00% 125.00% no 80.0% 125.0%
# 35.000% no 77.23% 129.48% no 77.2% 129.5%
# 40.000% no 74.62% 134.02% no 74.6% 134.0%
# 45.000% no 72.15% 138.59% no 72.2% 138.6%
# 50.000% yes 69.84% 143.19% no 69.8% 143.2%
# 55.000% yes 69.84% 143.19% no 67.7% 147.8%
# 57.382% yes 69.84% 143.19% yes 66.7% 150.0%
# 60.000% yes 69.84% 143.19% yes 66.7% 150.0%
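The limits can also be reproduced without a package. The following base-R sketch assumes that Health Canada’s cap follows from the maximum upper limit of 150.0% and that the conventional limits apply up to \(\small{CV_\text{wR}=30\%}\). The names `abel_limits()` and `cap.HC` are hypothetical.

```r
# Sketch of the expanded limits 100 * exp(-/+ k * swR); names are hypothetical
k <- 0.76                                    # regulatory constant of ABEL
abel_limits <- function(CVwR, cap = 0.50) {
  swR <- sqrt(log(pmin(CVwR, cap)^2 + 1))    # scaling stops at the cap
  U   <- ifelse(CVwR <= 0.30, 1.25, exp(k * swR))
  cbind(L = 100 / U, U = 100 * U)
}
# CVwR at which the upper limit reaches 150% (assumed basis of Health Canada's cap)
cap.HC <- sqrt(exp((log(1.50) / k)^2) - 1)
CVwR   <- c(0.30, 0.50, 0.55, 0.60)
print(round(abel_limits(CVwR), 2))               # cap at 50%
print(round(abel_limits(CVwR, cap = cap.HC), 1)) # cap at ~57.382%
```

Solving \(\small{150=100\exp(k\cdot s_\text{wR})}\) for \(\small{s_\text{wR}}\) and converting back to a CV gives the odd-looking ≈57.382%.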
It has to be demonstrated that the high \(\small{CV_\text{wR}}\) is not caused by outliers. If outliers are detected, they have to be excluded and \(\small{CV_\text{wR}}\) as well as \(\small{\left\{L,U\right\}}\) recalculated. However, the 90% CI has to be calculated with complete data.
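In order to pass ABEL, the 90% CI has to lie entirely within the expanded limits \(\small{\left\{L,U\right\}}\) and the PE within 80.00 – 125.00%.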
With the regulatory switching condition \(\small{s_0=0.10}\) we get the regulatory constant by \[\theta_\text{s}=\frac{\log_{e}1.11111}{s_0}\approx 1.053595\ldots\tag{23}\] The 95% upper confidence bound is determined with \(\small{\theta_\text{s}}\) by \(\small{(15)-(22)}\).
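A minimal sketch of \(\small{(23)}\) in R. The helper `theta_s()` is a hypothetical name, and the values for HVD(P)s (\(\small{\Delta=1.25}\), \(\small{s_0=0.25}\)) are the ones given in the list of abbreviations.

```r
# Regulatory constant of RSABE; the helper name is hypothetical
theta_s    <- function(Delta, s0) log(Delta) / s0
theta.HVD  <- theta_s(1.25, 0.25)    # HVD(P)s
theta.NTID <- theta_s(1.11111, 0.10) # NTIDs, equation (23)
cat(sprintf("theta.s: HVD(P) %.7f, NTID %.6f\n", theta.HVD, theta.NTID))
# 'implied' limits at swR = s0 reproduce 1/Delta and Delta
implied <- 100 * exp(c(-1, +1) * theta.NTID * 0.10)
print(round(implied, 2))
```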
The upper CL for \(\small{\sigma_\text{wT}/\sigma_\text{wR}}\) is calculated by \[\frac{s_\text{wT}/s_\text{wR}}{\sqrt{F_{{1-\alpha/2},\nu_1,\nu_2}}}\small{\textsf{,}}\tag{24}\] where \(\small{s_\text{wT}}\) is the estimate of \(\small{\sigma_\text{wT}}\) with \(\small{\nu_1}\) degrees of freedom, \(\small{s_\text{wR}}\) is the estimate of \(\small{\sigma_\text{wR}}\) with \(\small{\nu_2}\) degrees of freedom, and \(\small{F}\) is the value of the F-distribution with \(\small{\nu_1}\) (numerator) and \(\small{\nu_2}\) (denominator) degrees of freedom.
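A sketch of \(\small{(24)}\) in R under two assumptions that are not stated above: \(\small{\alpha=0.1}\) (i.e., a 90% CI), and the subscript \(\small{1-\alpha/2}\) denotes the probability to the right of \(\small{F}\), which corresponds to the \(\small{\alpha/2}\) quantile `qf(alpha / 2, ...)`. The function name and the example values are hypothetical.

```r
# Upper confidence limit of sigma.wT / sigma.wR; name and inputs are hypothetical
upper_cl_ratio <- function(swT, swR, nu1, nu2, alpha = 0.10) {
  # qf(alpha / 2, ...) is the F value with probability 1 - alpha / 2 to its right
  (swT / swR) / sqrt(qf(alpha / 2, df1 = nu1, df2 = nu2))
}
ucl <- upper_cl_ratio(swT = 0.12, swR = 0.10, nu1 = 22, nu2 = 22)
cat(sprintf("upper CL = %.4f (%s 2.5)\n", ucl, if (ucl <= 2.5) "<=" else ">"))
```

Dividing by the square root of a small \(\small{F}\) value (< 1) pushes the limit above the observed ratio \(\small{s_\text{wT}/s_\text{wR}}\), as expected for an upper limit.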
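In order to pass, the upper CL of \(\small{\sigma_\text{wT}/\sigma_\text{wR}}\) has to be ≤ 2.5, the 95% upper confidence bound has to be ≤ 0, and the 90% CI has to lie entirely within 80.00 – 125.00%.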
The last condition is operationally equivalent to capping the ‘implied’ limits \(\small{\left\{L,U\right\}}\) of RSABE at \(\small{CV_\text{wR}\approx21.42\%}\). Otherwise, for any larger \(\small{CV_\text{wR}}\) they would be wider than 80.00 – 125.00%. Of course, that is not what we want for an NTID. We can show that numerically.
fun <- function(x, Delta, sigma.0) { # x is CVwR
  theta.s   <- log(Delta) / sigma.0   # regulatory constant
  swR       <- sqrt(log(x^2 + 1))     # within subject standard deviation of R
  U         <- exp(theta.s * swR)     # upper ‘implied’ (scaled) limit
  objective <- U - 1.25               # target zero
  return(objective)
}
Delta   <- 1.11111 # approximate acc. to the guidance (not the exact 1/0.9)
sigma.0 <- 0.10    # regulatory switching condition
# numerically find the CVwR where U ~1.25
CVcap   <- 100 * uniroot(fun, interval = c(0, 0.3), tol = 1e-8,
                         Delta = Delta, sigma.0 = sigma.0)$root
# check the ‘implied’ limits
CVwR    <- sort(c(CVcap / 100, seq(0.05, 0.3, 0.05)))
comp    <- data.frame(CVwR = CVwR, L.implied = NA_real_, U.implied = NA_real_,
                      L.capped = NA_real_, U.capped = NA_real_)
f       <- c(-1, +1)
for (i in seq_along(CVwR)) {
  comp[i, 2:5] <- sprintf("%.2f%%", 100 * exp(f * log(Delta) / sigma.0 *
                                              sqrt(log(CVwR[i]^2 + 1))))
  if (comp$CVwR[i] >= CVcap / 100) { # at or above the cap: keep the capped limits
    comp[i, 4:5] <- sprintf("%.2f%%", 100 * exp(f * log(Delta) / sigma.0 *
                                                sqrt(log((CVcap / 100)^2 + 1))))
  }
}
comp$CVwR <- sprintf("%.2f%%", 100 * comp$CVwR)
txt <- sprintf("The ‘implied’ limits in RSABE are capped at CVwR %.9g%%.\n", CVcap)
cat(txt); print(comp, row.names = FALSE)
# The ‘implied’ limits in RSABE are capped at CVwR 21.4189888%.
# CVwR L.implied U.implied L.capped U.capped
# 5.00% 94.87% 105.41% 94.87% 105.41%
# 10.00% 90.02% 111.08% 90.02% 111.08%
# 15.00% 85.46% 117.02% 85.46% 117.02%
# 20.00% 81.17% 123.20% 81.17% 123.20%
# 21.42% 80.00% 125.00% 80.00% 125.00%
# 25.00% 77.15% 129.62% 80.00% 125.00%
# 30.00% 73.40% 136.25% 80.00% 125.00%
BCS-based biowaivers were introduced by the FDA in 2000,178 148 the EMA in 2010,153 and the ICH in 2019179 as an alternative to in vivo testing of IR products. They are based on the Biopharmaceutic Classification System, where drugs are classified by their solubility and permeability.88
Class | Solubility | Permeability |
I | High | High |
II | Low | High |
III | High | Low |
IV | Low | Low |
The idea behind waiving an in vivo study is based on the fact that such studies are not required for aqueous solutions (see above). Thus, if a drug product dissolves very rapidly, it can be expected to behave similarly to a solution.
A BCS-based biowaiver may be acceptable if the drug substance has been proven to exhibit high solubility and complete absorption (Class I) and either very rapid (> 85% within 15 min) or similarly rapid (85% within 30 min) in vitro dissolution characteristics of the test and reference product has been demonstrated considering specific requirements and excipients that might affect BA are qualitatively and quantitatively the same. In general, the use of the same excipients in similar amounts is preferred.
BCS-based biowaivers may also be acceptable if the drug substance has been proven to exhibit high solubility and limited absorption (Class III) and very rapid (> 85% within 15 min) in vitro dissolution of the test and reference product has been demonstrated considering specific requirements, excipients that might affect BA are qualitatively and quantitatively the same, and other excipients are qualitatively the same and quantitatively very similar.
The following conditions should be employed in the comparative dissolution studies to characterize the dissolution profile of the products:179
When high variability or coning is observed in the paddle apparatus at 50 rpm for both reference and test products, the use of the basket apparatus at 100 rpm is recommended. Additionally, alternative methods (e.g., the use of sinkers or other appropriately justified approaches) may be considered to overcome issues such as coning, if scientifically substantiated.179
The evaluation of the similarity factor \(\small{f_2}\) is based on the following conditions:179
A risk assessment of potential bioinequivalence by application of a biowaiver has to be provided, which has to be stricter for Class III than for Class I drugs.153 Biowaivers for NTIDs are not possible.
Alas, approaches are not harmonized yet166 167 180 181 182 – after eight BioInternational conferences 1989 – 2008, six GBHI workshops 2015 – 2024, and more than five years of involvement of the ICH… At least there is an agreement to use a 90% CI.
Agency | Uncomplicated drug | HVD(P) | NTID |
FDA 161 | ABE (any metric): CI \(\small{\subset}\) 80.00–125.00% | RSABE (any metric): \(\small{\textsf{bound}\leq}\) 0, PE \(\small{\subset}\) 80.00–125.00% | RSABE (any metric): \(\small{\textsf{bound}\leq}\) 0, CI \(\small{\subset}\) 80.00–125.00%, upper CL of \(\small{\sigma_\text{wT}/\sigma_\text{wR} \leq}\) 2.5 |
EMA 153 131 | ABE (any metric): CI \(\small{\subset}\) 80.00–125.00% | ABEL (\(\small{C_\text{max}}\), \(\small{\textsf{p}AUC}\)): \(\small{uc}\) 50%, PE \(\small{\subset}\) 80.00–125.00% | ABE: CI \(\small{\subset}\) 90.00–111.11% (PSGs: only for \(\small{AUC}\)) |
WHO 156 183 | ABE (any metric): CI \(\small{\subset}\) 80.00–125.00% | ABEL (\(\small{C_\text{max}}\), \(\small{AUC}\)): \(\small{uc}\) 50%, PE \(\small{\subset}\) 80.00–125.00% | ABE: CI \(\small{\subset}\) 90.00–111.11% |
HC 157 | ABE: \(\small{AUC}\): CI \(\small{\subset}\) 80.0–125.0%; \(\small{C_\text{max}}\): PE \(\small{\subset}\) 80.0–125.0% | ABEL (\(\small{AUC}\), \(\small{uc}\) 57.382%); ABE (\(\small{C_\text{max}}\): PE \(\small{\subset}\) 80.0–125.0%)157 | ABE: \(\small{AUC}\): CI \(\small{\subset}\) 90.0–112.0%; \(\small{C_\text{max}}\): CI \(\small{\subset}\) 80.0–125.0% |
This lack of harmonization leads to the paradoxical (though hypothetical) situation that the same study will pass in one jurisdiction but fail in another.108 166 167 180 181 182
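To make this tangible, here is a sketch with entirely made-up numbers: the \(\small{AUC}\) of an HVD with \(\small{CV_\text{wR}=55\%}\), evaluated by ABEL with the WHO’s cap (50%) and Health Canada’s cap (≈57.382%). The 90% CI, the PE, and the helper names `limits()` and `passes()` are hypothetical. For simplicity the PE constraint is applied in both cases and Health Canada’s rounding to one decimal is ignored.

```r
# Same hypothetical study, two caps of ABEL; all numbers are made up
k <- 0.76                                  # regulatory constant of ABEL
limits <- function(CVwR, cap) {
  swR <- sqrt(log(min(CVwR, cap)^2 + 1))   # scaling stops at the cap
  U   <- if (CVwR <= 0.30) 1.25 else exp(k * swR)
  c(L = 100 / U, U = 100 * U)
}
passes <- function(ci, pe, lim) {
  ci[1] >= lim[["L"]] && ci[2] <= lim[["U"]] && pe >= 80 && pe <= 125
}
CVwR    <- 0.55
ci      <- c(68.50, 105.00)                # hypothetical 90% CI (%)
pe      <- 84.81                           # hypothetical PE (%)
lim.WHO <- limits(CVwR, cap = 0.50)        # capped: 69.84 - 143.19%
lim.HC  <- limits(CVwR, cap = 0.57382)     # not yet capped at 55%
cat("WHO:", if (passes(ci, pe, lim.WHO)) "pass" else "fail", "\n")
cat("HC :", if (passes(ci, pe, lim.HC)) "pass" else "fail", "\n")
```

The lower confidence limit of 68.50% lies between the two lower expanded limits, so the verdict depends only on which cap applies.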
With only a few exceptions under certain conditions (i.e., in Australia, Canada, New Zealand, Singapore, South Africa, Switzerland, Taiwan, the UK, and in countries following the WHO’s guidelines156 183), the local reference must be used in comparative BA studies. Accepting a foreign reference or – even better – a ‘Global Comparator’ would be desirable in order to reduce the number of studies. Under current legislation, this is not possible in most countries.184
Still unresolved, outlook:
It is beyond me why the EMA’s guideline153 (based on the European legislation222) refers – apart from salts and esters – to different ethers.
No other jurisdiction contains such a ludicrous statement.
“Those people who think they know everything
are a great annoyance to those of us who do.
At the Assembly of the International Council for Harmonisation (Singapore, 19/20 November 2019) the new topic ‘Bioequivalence for Immediate-Release Solid Oral Dosage Forms’ was proposed by the FDA. The Assembly approved the outline of a Concept Paper and agreed to establish a Working Group without delay to initiate work on finalizing the Concept Paper and Business Plan for the M13 topic.223 The Working Group was established in February 2020 and the Concept Paper endorsed in July 2020.224
The following stakeholders (31 members) are represented in the M13A Expert Working Group: ANVISA (1), EC (2), EFPIA (2), FDA (3), Global Self-Care Federation (1), Health Canada (1), HSA (1), IFPMA (1), IGBA (2), JFDA (1), JPMA (2), MFDS (1), MHRA (1), NAFDAC (1), NMPA (2), PhRMA (2), PMDA (2), SAHPRA (1), Swissmedic (1), TFDA (1), TGA (1), WHO (1).
“Step 4: Adoption of an ICH Harmonised Guideline
Step 4 is reached when the Assembly agrees that there is sufficient consensus on the draft Guideline.
The Step 4 Final Document is adopted by the ICH Regulatory Members of the ICH Assembly as an ICH Harmonised Guideline at Step 4 of the ICH process.
“Step 5: Implementation
Having reached Step 4, the harmonised Guideline moves immediately to the final step of the process that is the regulatory implementation. This step is carried out according to the same national/regional procedures that apply to other regional regulatory guidelines and requirements, in the ICH regions.
Although a guideline is in implementation, it is not necessarily implemented in all regions. There is no deadline and no obligation for an agency to implement it at all.
The draft guideline was published in December 2022.3 The ICH’s Assembly (Fukuoka, 4/5 June 2024) approved the final version, which was published together with Q&As in July 2024;194 226 see also a brief summary.227 A comparison of some PK- and statistics-related issues of the draft and final versions is given (together with my personal views) in the BEBA-Forum.228
Given the considerable degree of conformity already achieved by local guidelines in the mid-2010s, the development of a harmonized guideline should have been a relatively straightforward process. Furthermore, in the workshops of the Global Bioequivalence Harmonisation Initiative (Amsterdam 2015, 2018, 2022; Rockville 2016, 2024; Bethesda 2019) controversial issues and potential solutions were discussed in great detail. However, development of the guideline still took more than four years.
The biggest difference to all previous guidelines is the definition of so-called “high-risk products”. For these products there is an increased likelihood that the in vivo performance will be affected differently by varying GI conditions in fasting and fed state. A rationale should be provided for the selection of the type of study(ies) – fasting, fed, or both – and meal type, e.g., fat and calorie content, based on the understanding of the test and comparator products.
Owing to the tiered approach of M13, different implementation dates across regions, and the open question whether Product-Specific Guidances (PSGs) exist and will be updated, contradictions with current guidelines will arise.229
Canada (HPFB) and China (NMPA).
Argentina (ANMAT), Australia (TGA), Brazil (ANVISA), Egypt (EDA), Japan (PMDA), Jordan (JFDA), Korea (MFDS), Mexico (COFEPRIS), Nigeria (NAFDAC), Saudi Arabia (SFDA), Singapore (HSA), South Africa (SAHPRA), Taiwan (TFDA), Turkey (TITCK), and the U.K. (MHRA).
At the ICH’s Assembly (Incheon, 15/16 November 2022) the EWG reported that preliminary work on M13B ‘Additional Strength Biowaiver’ had commenced and Steps 1 and 2a/b were targeted for June 2023.237 Due to delays in the finalization of M13A, the draft of M13B was expected to be published for public consultation (Step 2b) in July 2024.238 However, according to the EWG’s Workplan of August 2024, public consultation of M13B is expected to start in February 2025, together with initial work on M13C, i.e., BE assessment for HVDs, NTIDs, and advanced data analysis considerations (reference-scaling, Two-Stage adaptive designs).239
The textbooks dealing mainly with statistics (marked with ★) are rather tough cookies and not recommended for beginners.
I would like to thank Henning Blume and José A. Guimarães Morais for sharing memories about the BioInternational conferences and the early period of bioequivalence. Special thanks go to my co-organizers of the BioBridges workshops: Jean-Michel Cardot, Vít Perlík, and Ondřej Slanař. I’m indebted to Paulo Paixão for encouraging me to pursue my PhD at the Faculty of Pharmacy, Universidade de Lisboa. I’m also grateful to Susana Almeida for insights into the generic industry.
Dedicated to my friend Dirk Barends (1945 – 2012), who initiated the FIP’s biowaiver monograph series. I miss his wit and laughter. In gratitude to László Endrényi (1933 – 2020), whose work on pharmacokinetic metrics and the bioequivalence of Highly Variable Drugs inspired many scientists in the field.
Helmut Schütz 2025
Abbreviation | Meaning |
AAPS | American Association of Pharmaceutical Scientists |
ABE | Average Bioequivalence |
ABEL | Average Bioequivalence with Expanding Limits |
ANAMED | Agencia Nacional de Medicamentos (competent authority of Chile) |
ANDA | Abbreviated New Drug Application (generics; FDA term) |
ANMAT | Administración Nacional de Medicamentos, Alimentos y Tecnología Médica (competent authority of Argentina) |
ANOVA | Analysis of Variance |
ANVISA | Agência Nacional de Vigilância Sanitária (competent authority of Brazil) |
AOAC | (U.S.) Association of Official Analytical Chemists |
APhA | (U.S.) Academy of Pharmaceutical Sciences |
API | Active Pharmaceutical Ingredient |
APV | Arbeitsgemeinschaft für Pharmazeutische Verfahrenstechnik (International Association for Pharmaceutical Technology) |
\(\small{AUC}\) | Area Under the (concentration-time) Curve |
\(\small{AUC_{0-\text{t}}}\) | \(\small{AUC}\) from the time of administration to the time of the last measurable concentration |
\(\small{AUC_{0-72\text{h}}}\) | \(\small{AUC}\) from the time of administration to 72 hours (IR products) |
\(\small{AUC_{0-\infty}}\) | \(\small{AUC}\) from the time of administration extrapolated to infinite time |
BA | Bioavailability |
BCS | Biopharmaceutic Classification System |
BE | Bioequivalence |
BfArM | Bundesinstitut für Arzneimittel und Medizinprodukte (competent authority of Germany) |
\(\small{\textsf{bound}}\) | 95% upper confidence bound in RSABE |
\(\small{C}\) | Concentration |
CDE | Center for Drug Evaluation (China) |
CDER | Center for Drug Evaluation and Research (FDA) |
CFR | Code of Federal Regulations (U.S.) |
cGMP | current Good Manufacturing Practices |
CHMP | Committee for Medicinal Products for Human Use (of the EMA) |
CI | Confidence Interval |
\(\small{CL}\) | Clearance |
CL | Confidence Limit |
CLlower, CLupper | Lower and upper CL |
\(\small{C_\text{max}}\) | Maximum concentration |
COFEPRIS | Comisión Federal para la Protección contra Riesgos Sanitarios (competent authority of Mexico) |
CPMP | Committee for Proprietary Medicinal Products (of the EMEA) |
CRO | Contract Research Organization |
\(\small{C_\text{ss,min}}\) | Minimum concentration in steady-state within the dosing interval \(\small{\tau}\) |
\(\small{C_{\text{ss}\,,\tau}}\) | Concentration in steady-state at the end of the dosing interval \(\small{\tau}\) |
\(\small{C_{\text{t}_\text{last}}}\) | Last measured concentration |
\(\small{\widehat{C}_{\text{t}_\text{last}}}\) | Estimated concentration at \(\small{t_\text{last}}\) |
CVM | Center for Veterinary Medicine (FDA) |
\(\small{CV_\text{intra}}\) | Within-subject Coefficient of Variation in a crossover design |
\(\small{CV_\text{wR},CV_\text{wT}}\) | Observed within-subject Coefficient of Variation of the Reference and Test product |
\(\small{D}\) | Dose |
EC | European Commission |
EEA | European Economic Area (EU + Liechtenstein, Iceland, Norway) |
EFPIA | European Federation of Pharmaceutical Industries and Associations |
EMA | European Medicines Agency |
\(\small{E_\text{max}}\) | Maximum effect |
EMEA | European Agency for the Evaluation of Medicinal Products (rebranded to EMA in Dec. 2009) |
EOB | Electronic Orange Book (FDA) |
EU | European Union |
EUFEPS | European Federation for Pharmaceutical Sciences |
EWG | Expert Working Group (ICH) |
\(\small{f}\) | Fraction absorbed |
\(\small{f_2}\) | Similarity factor |
FDA | (U.S.) Food and Drug Administration |
FDC | Fixed Dose Combination (product) |
FIP | Fédération Internationale Pharmaceutique (International Pharmaceutical Federation) |
GBHI | Global Bioequivalence Harmonisation Initiative |
GI | Gastrointestinal |
GLP | Good Laboratory Practice |
GSD | Group-Sequential Design |
\(\small{H_0}\) | Null hypothesis |
\(\small{H_1}\) | Alternative hypothesis (also \(\small{H_\text{a}}\)) |
HPFB | Health Products and Food Branch (competent authority of Canada) |
HSA | Health Sciences Authority (competent authority of Singapore) |
HVD(P) | Highly Variable Drug (Product) |
IBE | Individual Bioequivalence |
ICH | International Council for Harmonisation of Technical Requirements for Pharmaceuticals for Human Use |
IFPMA | International Federation of Pharmaceutical Manufacturers and Associations |
IGBA | International Generic and Biosimilar Medicines Association |
IR | Immediate Release (product) |
ISR | Incurred sample reanalysis |
IV | Intravenous |
JFDA | Jordan Food and Drug Administration |
JPMA | Japan Pharmaceutical Manufacturers Association |
\(\small{k}\) | Regulatory constant in ABEL: 0.76 |
\(\small{k\,_\text{a}}\) | Absorption rate constant (also \(\small{k_\text{01}}\)) |
\(\small{k\,_\text{el}}\) | Elimination rate constant (also \(\small{k_\text{10}}\)) |
\(\small{L}\) | Lower expanded limit in ABEL |
LALA | Locally applied, locally acting (product) |
LLOQ | Lower limit of quantification |
\(\small{\log_{e}}\) | Natural logarithm (base e ≈ 2.71828…) |
\(\small{\log_{10}}\) | Decadic logarithm (base 10) |
\(\small{L,U}\) | Lower and upper expanded limits in ABEL |
M | Multidisciplinary guideline of the ICH |
MA | Marketing Authorisation |
MEB | Medicines Evaluation Board (competent authority of The Netherlands) |
MFDS | Ministry of Food and Drug Safety (competent authority of the Republic of Korea) |
MHRA | Medicines and Healthcare products Regulatory Agency (competent authority of the U.K.) |
MR | Modified release (product) |
\(\small{n}\) | Sample size |
\(\small{n_1,n_2}\) | Number of subjects in sequences 1 and 2 of a 2×2×2 crossover design |
NAFDAC | National Agency for Food and Drug Administration and Control (competent authority of Nigeria) |
NMPA | National Medical Products Administration (competent authority of China) |
NDA | New Drug Application (originators; FDA term) |
NTID | Narrow Therapeutic Index Drug (Canada: Critical Dose Drug) |
OGD | Office of Generic Drugs (FDA) |
OGDP | Office of Generic Drug Policy (FDA) |
\(\small{p(x)}\) | Probability of \(\small{x}\) |
\(\small{\text{p}AUC}\) | Partial \(\small{AUC}\) |
PBE | Population Bioequivalence |
PD | Pharmacodynamics |
PE | Point Estimate of \(\small{\mu_\text{T}/\mu_\text{R}}\) |
PhRMA | Pharmaceutical Research and Manufacturers of America |
PMDA | Pharmaceuticals and Medical Devices Agency (competent authority of Japan) |
PQT | Prequalification Team (WHO) |
PK | Pharmacokinetics |
PKWP | Pharmacokinetics Working Party (of the EMA’s CHMP) |
PSG | Product-Specific Guidance |
Q&A | Question and Answer |
\(\small{\text{R}}\) | Reference product |
RLD | Reference Listed Drug (FDA term) |
RMP | Reference Medicinal Product (EU term) |
RSABE | Reference-Scaled Average Bioequivalence |
\(\small{s}\) | Sample standard deviation |
\(\small{s_0}\) | Switching condition in RSABE: for HVD(P)s 0.25 and for NTIDs 0.1 |
\(\small{s^2}\) | Sample variance |
SABE | Scaled Average Bioequivalence |
SAHPRA | South African Health Products Regulatory Authority |
SUPAC | Scale-Up and Postapproval Changes (FDA) |
\(\small{s_\text{wR},s_\text{wT}}\) | Observed within-subject standard deviation of the Reference and Test product |
\(\small{s_{\text{wR}}^{2},s_{\text{wT}}^{2}}\) | Observed within-subject variance of the Reference and Test product |
\(\small{t}\) | Time |
\(\small{\text{T}}\) | Test product |
TFDA | Taiwan Food and Drug Administration |
\(\small{\text{tc}}\) | Cut-off time (multiphasic release products) |
TE | Therapeutic Equivalence |
TGA | Therapeutic Goods Administration (competent authority of Australia) |
TIE | Type I Error |
TITCK | Türkiye İlaç ve Tıbbi Cihaz Kurumu (competent authority of Turkey) |
\(\small{t_\text{last}}\) | Time of the last measured concentration \(\small{C_{\text{t}_\text{last}}}\) |
\(\small{t_\text{max}}\) | Time of \(\small{C_\text{max}}\) |
TOST | Two One-Sided Tests |
TSD | Two-Stage Design |
\(\small{U}\) | Upper expanded limit in ABEL |
\(\small{uc}\) | Upper cap of expansion in ABEL |
URL | Uniform Resource Locator |
\(\small{V}\) | Apparent volume of distribution |
WHO | World Health Organization |
\(\small{\bar{x}_\text{T},\bar{x}_\text{R}}\) | Arithmetic means of \(\small{\text{T}}\) and \(\small{\text{R}}\) |
\(\small{\alpha}\) | Nominal level of the test, probability of Type I Error (patient’s risk) |
\(\small{\beta}\) | Probability of Type II Error (producer’s risk), where \(\small{\beta=1-\pi}\) |
\(\small{\Delta}\) | Clinically relevant difference |
\(\small{\theta_\text{s}}\) | Regulatory constant in RSABE: for HVD(P)s 0.8925742… and for NTIDs 1.053595… |
\(\small{\theta_0}\) | True (in sample size estimation assumed) \(\small{\text{T}/\text{R}}\)-ratio |
\(\small{\theta_1,\theta_2}\) | Fixed lower and upper limits of the BE acceptance range |
\(\small{\theta_{\text{s}_1},\theta_{\text{s}_2}}\) | Scaled lower and upper limits of the BE acceptance range |
\(\small{\widehat{\lambda}_\text{z}}\) | Apparent terminal rate constant (estimated) |
\(\small{\mu_\text{T}/\mu_\text{R}}\) | True \(\small{\text{T}/\text{R}}\)-ratio |
\(\small{\nu}\) | Degrees of freedom |
\(\small{\pi}\) | Prospective (a priori) power, where \(\small{\pi=1-\beta}\) |
\(\small{\widehat{\pi}}\) | Estimated (a posteriori, post hoc, retrospective) power |
\(\small{\sigma}\) | Population standard deviation |
\(\small{\sigma_\text{wR},\sigma_\text{wT}}\) | True within-subject standard deviation of the Reference and Test product |
\(\small{\tau}\) | Dosing interval |
2×2×2 | Two-treatment two-sequence two-period crossover design |
