Foundation of P-values in experimental settings
Clinical investigators have two obligations in clinical trials: patient protection and population protection. Investigators share a responsibility to ensure that the results of their research do not mislead. This responsibility is met by controlling sampling error (p-values and type II error rates) and by ensuring that the experiment is executed as planned.
A p-value is a probability, and to understand it we must understand the event whose relative frequency it represents. P-values are traditionally described as conditional probabilities, e.g. the probability that the test statistic falls in the critical region given that the null hypothesis is true. However, the motivation for this mathematical definition lies in the use of population sampling. Drawing a random sample from the population at large makes the experiment executable, since logistics preclude including every patient in the population in the trial. The investigators’ price for executability is that another sample with a different set of subjects might yield different results. The variability of the samples generated by a population, and the variability of the results produced by these samples, is sampling error. Populations can generate many different samples. Under correct sampling plans, most of the samples will be representative of the population. However, despite the investigators’ best efforts, the population may have “dealt them a bad hand”, i.e. provided a sample whose findings of efficacy do not reflect the findings in the population at large. This sampling error is dangerous to the integrity of the experiment and, if handled inappropriately, critically weakens the investigators’ ability to generalize their findings to the larger population from which the sample was drawn. The p-value handles this sampling error by linking the experiment’s outcome to the types of samples the population may have produced. Assume an experiment is executed and results in a 15% efficacy for the intervention. The p-value is the probability that a population deriving no therapeutic benefit from the intervention would nonetheless mislead us by producing an unrepresentative sample with efficacy of 15% or more. We term this event the type I error event. The likelihood of this event is what we hope to minimize when we produce a small p-value.
Experimental Concordance vs. Experimental Discordance.
The p-value of an experiment is relevant to the scientific community if the p-value is the final repository of that experiment’s sampling error, i.e. it is the result of a well-defined, prospectively fixed experiment whose only random component is the data itself. This is not the case when the investigators allow sampling error to affect the conduct of an experiment. If the data generated by the experiment are allowed to influence the experiment (e.g. by leading the investigators to change the experiment’s endpoint), then the experiment has an unanticipated random component. A different (random) sample would have produced different (random) data, and led to a different (random) analysis and a different (random) finding. Since the random data have been allowed to transmit randomness to the analysis and to the experiment’s result, the p-value is meaningless: it is the result not just of random data, but of a random analysis plan. The p-value is germane only if it represents a population-based sampling event of interest to the scientific community, occurring when the experiment is executed according to protocol, a set of circumstances defined here as experimental concordance. With experimental concordance, the type I error event of the experiment is the same as the type I error event of the protocol, and the p-value of the experiment answers the question raised by the research community (within the protocol). When the data alter the experiment’s execution and create discrepancies between the protocol’s plan of operation and the experiment’s actual execution, experimental discordance exists, and the type I error event for the experiment may not be the type I error event directed by the protocol. If the discordance is mild, the experiment’s type I error event remains pertinent to the medical community and the experiment’s p-value may be adjusted. However, if there is severe or profound experimental discordance, the experimental p-value may be of little value. Severe experimental discordance can be lethal to a clinical trial’s interpretation, since it produces a random experiment with an uninterpretable, corrupt p-value.
Mild experimental discordance - trial p-value adjustment.
Most clinical trials have some experimental discordance. Consider the protocol for a prospectively designed, double-blind, randomized controlled clinical trial assessing an intervention for improving survival in patients with myocardial infarction. The protocol requires 2182 patients to demonstrate a 20% reduction in total mortality from a cumulative placebo event rate of 25%, with 80% power and a two-sided alpha level of 0.05. Suppose, however, that only 2000 patients are randomized, a discrepancy with the experimental plan. Here, the experiment’s type I error event remains close to the protocol’s type I error event, and the trial’s type I error event remains relevant.
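As a rough check on the figures in this example, the sketch below recomputes the sample size with the standard normal-approximation formula for comparing two proportions. The source does not say which formula the protocol used, so the formula, the equal allocation to two arms, and the function names `patients_per_arm` and `power_with` are assumptions made for illustration.

```python
from math import ceil, sqrt
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution


def patients_per_arm(p_control, p_treated, alpha=0.05, power=0.80):
    """Patients per arm for a two-sided test of two proportions
    (unpooled normal approximation, equal allocation; an assumed formula)."""
    z_alpha = Z.inv_cdf(1 - alpha / 2)
    z_beta = Z.inv_cdf(power)
    variance = p_control * (1 - p_control) + p_treated * (1 - p_treated)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p_control - p_treated) ** 2)


def power_with(n_per_arm, p_control, p_treated, alpha=0.05):
    """Approximate power of the same test when n_per_arm patients are enrolled."""
    z_alpha = Z.inv_cdf(1 - alpha / 2)
    variance = p_control * (1 - p_control) + p_treated * (1 - p_treated)
    return Z.cdf(abs(p_control - p_treated) / sqrt(variance / n_per_arm) - z_alpha)


p_placebo = 0.25
p_active = p_placebo * (1 - 0.20)  # 20% relative reduction in mortality

total_planned = 2 * patients_per_arm(p_placebo, p_active)
power_at_2000 = power_with(1000, p_placebo, p_active)
print(total_planned, round(power_at_2000, 3))
```

Under these assumptions the formula reproduces the 2182 patients quoted in the protocol. With 2000 patients the test is still carried out at the same alpha level, and the shortfall appears as power somewhat below the planned 80%.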
A more problematic case would be a two-armed clinical trial that reports a statistically significant p-value for efficacy against a total mortality endpoint but violates the protocol by losing 15% of its randomized patients to follow-up. The implications of this discordance are substantial because the follow-up losses blur our view of the sample’s efficacy. The degree of discordance here depends on the strength of the p-value. If the p-value remains below the threshold of significance when we assume that all lost patients assigned to the placebo group are alive and all lost patients in the active group have died, we conclude that the discordance is mild, because the worst implications of the follow-up losses do not vitiate the results and the type I error event is still relevant to the scientific community. However, if the p-value loses its significance under this assumption, we may say that the discordance is severe and the alpha error corrupted.
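The worst-case check described above can be sketched as follows. All counts are hypothetical (they are not taken from any trial), the pooled two-proportion z-test is one assumed choice of test for a mortality endpoint, and the function names are illustrative.

```python
from math import sqrt
from statistics import NormalDist


def mortality_test(deaths_placebo, n_placebo, deaths_active, n_active):
    """Pooled two-proportion z-test. Returns (z, two-sided p-value);
    z > 0 means lower mortality in the active arm."""
    rate_placebo = deaths_placebo / n_placebo
    rate_active = deaths_active / n_active
    pooled = (deaths_placebo + deaths_active) / (n_placebo + n_active)
    se = sqrt(pooled * (1 - pooled) * (1 / n_placebo + 1 / n_active))
    z = (rate_placebo - rate_active) / se
    return z, 2 * (1 - NormalDist().cdf(abs(z)))


def shows_benefit(z, p_value, alpha=0.05):
    """Significant at the two-sided level alpha, in the direction of benefit."""
    return z > 0 and p_value < alpha


# Hypothetical trial: 1000 randomized per arm, 15% lost to follow-up in each arm.
followed, lost = 850, 150
deaths_placebo, deaths_active = 230, 170

# Analysis restricted to patients with known vital status.
observed = mortality_test(deaths_placebo, followed, deaths_active, followed)

# Worst case: every lost placebo patient is alive, every lost active patient died.
worst = mortality_test(deaths_placebo, followed + lost,
                       deaths_active + lost, followed + lost)

print("observed benefit:", shows_benefit(*observed))
print("worst-case benefit:", shows_benefit(*worst))
discordance = "mild" if shows_benefit(*worst) else "severe"
print("discordance:", discordance)
```

In this hypothetical example the observed benefit does not survive the worst-case assumption, so the discordance would be judged severe.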
As a final example of discordance, consider the findings of the NSABP Protocol B-06^{9-10}, in which ninety-nine ineligible patients were deliberately randomized with falsified data. To assess the experimental discordance produced by this event, first examine the trial results after excluding these 99 patients. If their exclusion moves the p-value across the significance threshold, the inclusion of these patients produces unacceptable discordance. A second question about the interpretation of the trial must also be addressed, since the presence of fraudulent data admits the possibility of dishonest behavior elsewhere in the trial apparatus. The audit of Christian et al^{11} is by its nature an investigation of the degree of discordance. Since the protocol discrepancies identified were small in number, the degree of discordance was mild.
Importance of prospective experimental design
Successful experiments protect experimental concordance. This concordance is easily lost if data collected during the experiment are allowed to affect, in an unplanned manner, decisions concerning the experiment’s outcome, thus allowing sampling error to besmirch protocol-mandated procedures. Since sampling error is a necessary evil in clinical trials, it must be handled with care, ensuring that it does not contaminate the experiment. The only acceptable repositories for sampling error are the type I and type II error probabilities, since they are constructed as sampling error probabilities.
Prospective Allocation of Alpha
The two conflicting forces of endpoint abundance (the desire to measure many different
clinical assessments at the end of the experiment) vs. interpretive parsimony (the alpha
level and therefore the success of the trial rest on the interpretation of one and only
one endpoint) bedevil investigators as they plan their experiments. Guidance for the
selection of endpoints in clinical trials is available^{12}. Motivations for
secondary and tertiary endpoints are both epidemiological (coherency within the trial and
consistency across trials) and cost-efficient. However, how can all of this information on
nonprimary endpoints be interpreted when the medical and scientific communities focus on the
experimental p-value for the primary endpoint of a trial? What is the correct
interpretation of a trial which is negative for its primary endpoint, but has nominally
statistically significant secondary endpoints? If a trial has two active arms and one
placebo arm, must a single comparison take precedence? There are various strategies to
interpret the family of p-values from an experiment and various multiple comparison
procedures available^{12-18}. The concept presented here is an adaptation of
multiple testing, taking advantage of the ability to set the multiple comparison values at
different levels. It should only be used for prospectively determined endpoints (i.e.
formal hypothesis testing), as opposed to hypothesis generation, a more inquisitive
investigation, designed to identify relationships not anticipated before the beginning of
the experiment.
Type I error accumulates with each executed hypothesis test and must be controlled by the investigators. By the prospective selection of the alpha levels, they set the standard by which the trial will be judged. P-values for post hoc analyses are uninterpretable within a clinical trial setting since 1) being data driven (as opposed to protocol driven), they are intertwined with sampling error, and 2) many post hoc analyses can be performed with only the most favorable ones being promulgated, leading to hidden alpha accumulation. Post hoc testing produces severe experimental discordance with corrupt alpha levels. The investigators avoid this serious limitation of analysis interpretation by choosing endpoints prospectively. However, they further strengthen their experimental design by choosing the allocation of the type I error. Alpha should be allocated to protocol-specified hypotheses and protocol-specified subgroup analyses. If a subgroup analysis is carried out based on the findings of other trials, alpha can still be allocated if 1) no data from the current experiment are used in the examination of alpha and 2) the additional subgroup analysis is added, with appropriate changes in the decision path, before the end of the study.
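To illustrate how type I error accumulates, the sketch below computes the probability of at least one false positive finding when k hypothesis tests are each carried out at the 0.05 level. It assumes the tests are independent, which is a simplification, and the function name is illustrative.

```python
def familywise_alpha(alpha_per_test, n_tests):
    """Probability of at least one type I error among n_tests independent
    tests, each carried out at level alpha_per_test."""
    return 1 - (1 - alpha_per_test) ** n_tests


for k in (1, 5, 10, 20):
    print(k, round(familywise_alpha(0.05, k), 4))
```

Under this assumption, ten tests at the 0.05 level already carry roughly a 40% chance of at least one false positive finding.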
Clinical trials are often evaluated as though an alpha of 0.05 is allocated for each hypothesis test. Consider instead an experimental (or trial) alpha, alpha _{E}, representing the type I error in the experiment. The primary endpoint will have alpha associated with it (alpha _{P}), and secondary endpoints will have alpha associated with them (alpha _{S}). Since the goal is to set an upper bound for the experimental type I error, there are limitations placed on alpha _{P} and alpha _{S}. We begin by writing the probability of no type I error in the experiment as the product of the probability of no type I error on the primary endpoint and the probability of no type I error on the secondary endpoints:
$$1 - \alpha_E = (1 - \alpha_P)(1 - \alpha_S) \qquad (1)$$
The probability invoked here is that of the event “at least one success”, described in Snedecor and Cochran, page 116. (This assumes independence between the type I error event for the primary endpoint and the type I error event for the secondary endpoint. Relaxing this assumption requires specific information about the nature of the dependence between primary and secondary endpoints, which will be trial specific. For the purpose of this discussion, we will assume independence.) Thus, alpha _{E} is the probability of making at least one type I error, i.e. an error on either the primary or the secondary endpoint. This formula generalizes to n_{p} primary endpoints and n_{s} secondary endpoints.
This probability has its upper bound approximated by Bonferroni’s inequality, but an exact treatment will be developed here. Several examples are provided for the use of this function. We will assume that all hypothesis testing is two-tailed.
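Written out, and still assuming independent type I error events, the generalization and the Bonferroni bound mentioned above take the following form. The per-endpoint symbols \alpha_{P,i} and \alpha_{S,j} are notation introduced here for illustration and do not appear in the original.

```latex
\[
1 - \alpha_E \;=\; \prod_{i=1}^{n_p} \bigl(1 - \alpha_{P,i}\bigr)\,
                   \prod_{j=1}^{n_s} \bigl(1 - \alpha_{S,j}\bigr),
\qquad
\alpha_E \;\le\; \sum_{i=1}^{n_p} \alpha_{P,i} \;+\; \sum_{j=1}^{n_s} \alpha_{S,j}.
\]
```

With n_{p} = n_{s} = 1 the first expression reduces to equation (1).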
Scenario 1.
An experiment randomized patients to one of two treatment arms for the assessment of an intervention reducing total mortality. There are three secondary endpoints of equal weight (i.e. to be assessed at the same alpha levels).
In this case alpha _{E} is the probability of making at least one type I error on any of the one primary or three secondary endpoints, and is set at a maximum of 0.05. If we choose alpha _{P} = 0.02, then we can find the available alpha for the secondary endpoints from (1) as
$$1 - \alpha_S = \frac{1 - \alpha_E}{1 - \alpha_P} = \frac{0.95}{0.98} = 0.96939$$
So alpha _{S} = 0.03061 is the available type I error for the family of secondary endpoints. Apportioning this equally among the three secondary endpoints, we find
$$(1 - \alpha^{*}_S)^3 = 1 - \alpha_S = 0.96939$$
and alpha ^{*}_{S} = 0.01031. An alpha allocation table assembled by the investigators and supplied prospectively in the experiment’s protocol (Table 1) is an unambiguous statement of the investigators’ plans for assessing the impact of the experimental intervention.
Table 1: Alpha Allocation : alpha _{E}= 0.05 (two sided) | |
Endpoint | Allocated Alpha |
Primary Endpoint | 0.02000 |
Total Mortality | 0.02000 |
Secondary Endpoints | 0.03061 |
Hospitalization for CHF | 0.01031 |
Progression of CHF | 0.01031 |
Max O_{2} consumption | 0.01031 |
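The arithmetic behind Table 1 can be reproduced with a short sketch. It applies equation (1) under the independence assumption stated earlier; the function names are illustrative and not part of the original.

```python
def remaining_alpha(alpha_total, alpha_spent):
    """Alpha left after spending alpha_spent, from
    1 - alpha_total = (1 - alpha_spent) * (1 - alpha_left)."""
    return 1 - (1 - alpha_total) / (1 - alpha_spent)


def equal_split(alpha_family, n_endpoints):
    """Per-endpoint alpha when alpha_family is shared equally by n_endpoints
    independent tests: (1 - alpha_each) ** n_endpoints = 1 - alpha_family."""
    return 1 - (1 - alpha_family) ** (1 / n_endpoints)


alpha_e = 0.05   # experimental alpha (two sided)
alpha_p = 0.02   # allocated to the primary endpoint, total mortality

alpha_s = remaining_alpha(alpha_e, alpha_p)   # family of secondary endpoints
alpha_s_each = equal_split(alpha_s, 3)        # each of three secondary endpoints

print(f"secondary family: {alpha_s:.5f}")
print(f"each secondary:   {alpha_s_each:.5f}")

# Multiplying the four "no error" probabilities recovers the experimental alpha.
recovered = 1 - (1 - alpha_p) * (1 - alpha_s_each) ** 3
print(f"experimental:     {recovered:.5f}")
```

The final line is a consistency check: the four allocated levels together spend the experimental alpha of 0.05 and no more.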
Scenario 2.
An investigator is designing a clinical trial, with a placebo and two treatment arms A_{1}
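and A_{2}. There is equal interest in testing A_{1} against placebo and A_{2}
against placebo. For each of these tests, there is one primary endpoint, total mortality,
and two secondary endpoints (intermittent claudication and unstable angina). Set alpha _{E}
= 0.05, to be divided equally between the two tests (A_{1} vs. placebo and A_{2}
vs. placebo). Writing alpha _{A} for the alpha available to each comparison, we find
$$(1 - \alpha_A)^2 = 1 - \alpha_E = 0.95, \qquad \alpha_A = 1 - \sqrt{0.95} = 0.02532$$
Allowing 0.02 of this for the primary endpoint, and the remainder to be distributed equally across the two secondary endpoints, we find
$$1 - \alpha_S = \frac{1 - \alpha_A}{1 - \alpha_P} = \frac{0.97468}{0.98} = 0.99457, \qquad (1 - \alpha^{*}_S)^2 = 1 - \alpha_S$$
so that alpha _{S} = 0.00543 and alpha ^{*}_{S} = 0.00272.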
Since A_{2} would be handled analogously, the allocation of alpha for the endpoints can be easily completed (Table 2).
Table 2: Alpha Allocation : alpha _{E}= 0.05 (two sided) | |
Endpoint | Allocated Alpha |
A_{1} vs. Placebo Comparison | 0.02532 |
Primary Endpoint | |
Total Mortality | 0.02000 |
Secondary Endpoints | 0.00543 |
Hospitalization for CHF | 0.00272 |
Progression of CHF | 0.00272 |
A_{2} vs. Placebo Comparison | 0.02532 |
Primary Endpoint | |
Total Mortality | 0.02000 |
Secondary Endpoints | 0.00543 |
Hospitalization for CHF | 0.00272 |
Progression of CHF (medication status) | 0.00272 |
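Table 2 can be checked the same way. The sketch below splits the experimental alpha equally between the two comparisons and then, within one comparison, between the primary endpoint and two secondary endpoints. It assumes independence among all type I error events, and the helper names are illustrative.

```python
def equal_split(alpha_family, n_parts):
    """Alpha for each of n_parts independent tests that together spend
    alpha_family: (1 - alpha_each) ** n_parts = 1 - alpha_family."""
    return 1 - (1 - alpha_family) ** (1 / n_parts)


def remaining_alpha(alpha_total, alpha_spent):
    """Alpha left once alpha_spent has been allocated, from equation (1)."""
    return 1 - (1 - alpha_total) / (1 - alpha_spent)


alpha_e = 0.05
alpha_comparison = equal_split(alpha_e, 2)               # one active arm vs. placebo
alpha_primary = 0.02                                     # total mortality
alpha_secondary = remaining_alpha(alpha_comparison, alpha_primary)
alpha_secondary_each = equal_split(alpha_secondary, 2)   # two secondary endpoints

for label, value in [
    ("per comparison", alpha_comparison),
    ("primary endpoint", alpha_primary),
    ("secondary family", alpha_secondary),
    ("each secondary", alpha_secondary_each),
]:
    print(f"{label:18s}{value:.5f}")
```

Because the second comparison is allocated identically, the same four values apply to A_{2} vs. placebo.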
Scenario 3.
Consider a two armed trial testing the impact of an intervention on a primary endpoint of
mortality and each of two secondary endpoints. The investigators give each of these
secondary endpoints equal weight. Applying equation 1, the investigators compute an alpha
allocation table (Table 3). Assume the experiment is executed according to the protocol,
and the significance of the endpoints assessed. The overall alpha expended in the
experiment is 1 - (1-0.00100)(1-0.02000)(1-0.00400) = 0.02490. The findings for hospitalization for CHF did not reach the threshold, and it should be interpreted as negative, a conclusion reaffirming the importance of the investigators’ prospective statement on the level of statistical significance for the trial endpoints.
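The product in this scenario is easy to verify numerically. The sketch below takes the three levels quoted in the text as given and combines them under the independence assumption; the function name is illustrative.

```python
from math import prod


def expended_alpha(levels):
    """Probability of at least one type I error across independent tests
    carried out at the given levels."""
    return 1 - prod(1 - level for level in levels)


levels = [0.00100, 0.02000, 0.00400]   # values quoted in the text
print(f"{expended_alpha(levels):.5f}")
```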
Alpha spending functions for interim analyses are very useful^{19-22}, and any alpha allocation for the end of trial analysis must be reduced if alpha was allocated during the interim analyses.
Discussion
By enforcing experimental concordance the investigators ease the task of interpreting
their research. However, rigor and discipline in experimental execution should not exclude
the prospective determination of acceptable alpha error levels. This is a serious
investigator obligation since both population and patient protection are the
responsibility of clinical scientists. The result of the strategies suggested herein is
that, in an organized fashion, investigators prospectively trace the path of alpha
accumulation through the endpoints, aligning their endpoint choices with the restrictions
of interpretive parsimony.
A consequence of the proposed approach is that, since alpha is to be expended on secondary endpoints, less alpha can be expended on the primary endpoint if the alpha of the experiment is to be held at an acceptable upper bound. Thus, experiments with secondary endpoints must pay a price for these endpoints’ interpretation (an increased sample size, since alpha _{P} < 0.05). Too often, no type I error is allocated for the secondary endpoints or non-endpoints, but much interpretive weight is placed there when the experiment has ended. If the secondary endpoint is to have an objective interpretation, this interpretation must occur in the context of the alpha expended.
Opinions on both the necessity and strategy of alpha allocation are diverse^{12-18}. The use of multiple comparison procedures based on the Bonferroni method^{14} has provoked strong criticism from Rothman^{18}, who states that such tools trivialize the possible hypothesis testing outcomes by reducing the maximum p-value acceptable for a positive finding to a level which seems to preclude proclaiming any effect as positive. The tack taken in this manuscript gives experimenters the freedom to choose a priori the alpha level of each hypothesis, and allows the levels of alpha to reflect the importance of the endpoint. Thus the investigators tailor their alpha selection to their experience with the intervention and target those hypothesis tests of the greatest clinical relevance, akin to the construction of a minimax rule. However, the total type I error of the experiment should be conserved, since this strategy protects the population to whom the results will be extended from an excessive number of false positive findings. This is in contradistinction to work on partial null hypotheses (i.e. minimizing alpha within each of a collection of subsets of hypothesis tests), which may not keep the experiment’s overall alpha below a prespecified level.
The presence of dependent hypothesis tests induced by endpoint set correlation can result in a more generous alpha allocation^{15}. In these circumstances, the adjustment presented in this manuscript is an overadjustment, leading to alpha levels lower than required. Consideration of this dependency is admissible if 1) there is biologic plausibility for the nature of the dependency and 2) the investigators make a reasonable prospective statement on the magnitude of the dependency.
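A small simulation can illustrate why the independence-based allocation is conservative when endpoints are positively correlated. The sketch assumes two test statistics that are standard normal under the null hypothesis with a hypothetical correlation of 0.8, tested two sided at alpha _{P} = 0.02 and alpha _{S} = 0.03061 (the levels from the first scenario, treating the secondary family as a single test). The correlation value, the bivariate normal model, and the function name are all assumptions made for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist


def simulated_experimental_alpha(alpha_p, alpha_s, rho, n_sim=100_000, seed=0):
    """Monte Carlo estimate of P(at least one type I error) for two two-sided
    z-tests whose null test statistics have correlation rho."""
    rng = random.Random(seed)
    crit_p = NormalDist().inv_cdf(1 - alpha_p / 2)
    crit_s = NormalDist().inv_cdf(1 - alpha_s / 2)
    errors = 0
    for _ in range(n_sim):
        z_p = rng.gauss(0, 1)
        # Build a second statistic with correlation rho to the first.
        z_s = rho * z_p + sqrt(1 - rho ** 2) * rng.gauss(0, 1)
        if abs(z_p) > crit_p or abs(z_s) > crit_s:
            errors += 1
    return errors / n_sim


independent = simulated_experimental_alpha(0.02, 0.03061, rho=0.0)
correlated = simulated_experimental_alpha(0.02, 0.03061, rho=0.8)
print(f"rho = 0.0: {independent:.4f}")
print(f"rho = 0.8: {correlated:.4f}")
```

With these assumed inputs the estimate at rho = 0 sits near the nominal 0.05, while the estimate at rho = 0.8 falls below it, so some of the experimental alpha is left unspent.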
There is wisdom in the comments of Friedman et al^{23}, who state “it is more reasonable to calculate sample size based on one primary response variable comparison and be cautious in claiming significant results for other comparisons”. However, the degree to which investigators violate this principle in interpreting trial results suggests that this wisdom is not well appreciated. Other strategies in multiple testing are also admissible, and those clinical investigators who disagree with the findings of Friedman et al. would benefit from a structured approach to the allocation of alpha. However, this approach should be consistent with the investigators’ responsibility to protect the population to whom their research will be generalized from excessive false positive errors.
1. Pfeffer MA, Braunwald E, Moye' LA, Basta L, Brown EJ, Cuddy TE, Davis BR, Geltman EM, Goldman S, Flaker GC, Klein M, Lamas GA, Packer M, Rouleau J, Rouleau JL, Rutherford J, Wertheimer JH, Hawkins CM. Effect of captopril on mortality and morbidity in patients with left ventricular dysfunction after myocardial infarction - results of the Survival and Ventricular Enlargement Trial. N Engl J Med 1992;327(10):669-677.
2. Sacks FM, Pfeffer MA, Moye' LA, Rouleau JL, Rutherford JD, Cole TG, Brown L, Warnica JW, Arnold JMO, Wun CC, Davis BR, Braunwald E. The effect of pravastatin on coronary events after myocardial infarction in patients with average cholesterol levels. N Engl J Med 1996;335:1001-9.
3. The SHEP Cooperative Research Group. Prevention of Stroke by Antihypertensive Drug Therapy in Older Persons with Isolated Systolic Hypertension: Final Results of the Systolic Hypertension in the Elderly Program (SHEP). Journal of the American Medical Association. June 26, 1991: Vol 265, No. 24.
5. Packer M, Bristow MR, Cohn JN, Colucci WS, Fowler MB, Gilbert EM, Shusterman NH. The effect of carvedilol on morbidity and mortality in patients with chronic heart failure. N Engl J Med 1996;334:1349-55.
6. Pfeffer MA, Stevenson LW. Beta-adrenergic blockers and survival in heart failure. Editorial. N Engl J Med 1996;334:1396-1397.
7. Moye' LA, Abernethy D. Carvedilol in Patients with Chronic Heart Failure. N Engl J Med 1997;335:1318-1319.
8. Hilsenbeck SG, Clark GM. Practical P-Values: Adjustment for Optimally Selected Cutpoints. Stat in Med 1996;15:103-112.
9. Fisher B, Bauer M, Margolese R, et al. Five year results of a randomized clinical trial comparing total mastectomy and segmental mastectomy with or without radiation in the treatment of breast cancer. N Engl J Med 1985;312:665-73.
10. Fisher B, Bauer M, Margolese R, et al. Eight year results of a randomized clinical trial comparing total mastectomy and segmental mastectomy with or without radiation in the treatment of breast cancer. N Engl J Med 1989;320:822-8.
11. Christian MC, McCabe MS, Korn EL, Abrams JS, Kaplan RS, Friedman MA. The National Cancer Institute audit of the National Surgical Adjuvant Breast and Bowel Project Protocol B-06. N Engl J Med 1995;333:1469-1474.
12. Meinert CL. Clinical Trials Design, Conduct, and Analysis. New York. Oxford University Press 1986.
13. Dowdy S, Wearden S. Statistics for Research. Second Edition. New York. John Wiley and Sons 1991.
14. Snedecor GW, Cochran WG. Statistical Methods (7^{th} Edition). Iowa. Iowa State University Press (1980).
15. Dubey SD. "Adjustment of p-values for multiplicities of interconnecting symptoms" in Statistics in the Pharmaceutical Industry 2^{nd} Edition. Buncher RC and Tsay JY (eds). Marcel Dekker Inc., New York, pp 513-527 (1994).
16. Ghosh BK, Sen PK. Handbook of Sequential Analysis. Marcel Dekker Inc., New York.
17. Miller RG. Simultaneous Statistical Inference 2^{nd} Edition. Springer-Verlag. New York (1981).
18. Rothman KJ. "No adjustments are needed for multiple comparisons". Epidemiology 1:43-46 (1990).
19. Davis, BR. Hardy RJ. "Upper bounds for type I and type II error rates in conditional power calculations". Comm. Statis., 19(10), 3571-3584 (1990).
20. Lan KKG, DeMets DL. "Discrete sequential boundaries for clinical trials". Biometrika, 70(3), pp 659-663 (1983).
21. Lan KKG, Simon R, Halperin M. "Stochastically curtailed tests in long-term clinical trials". Comm. Statis. (1982).
22. Lan KKG, Wittes J. "The b-value: a tool for monitoring data". Biometrics,44,579-585.(1988).
23. Friedman L, Furberg C, DeMets D. Fundamentals of Clinical Trials 3^{rd} edition; Mosby 1996, p 308.