OBJECTIVE. Randomized controlled trial (RCT) interventions often rely on p values, where statistical significance is assumed to provide evidence of an intervention effect. This study provides a secondary data analysis of the Well Elderly II RCT using multiple approaches that examine clinical meaningfulness.
METHOD. We reanalyzed the Well Elderly II RCT using effect size, standard deviation, standard error of measurement, minimal difference, a fragility index, an assessment of poor scores at baseline, and an analysis with a small subgroup of participants removed.
RESULTS. Although some participants improved on several scales, most stayed the same, and a small subset declined. Omitting a small subgroup of participants led to nonsignificant p values.
CONCLUSION. There is evidence that disparities in baseline scores and regression to the mean may have created the appearance of an intervention effect. Our methods of considering clinical meaningfulness suggest improved approaches to analyzing RCT data.
Randomized controlled trials (RCTs) are consistently regarded as the gold standard of study design (e.g., Portney & Watkins, 2015). This reputation is not without merit: Compared with observational studies, in which selection effects are a concern, RCTs involve random assignment into groups so that group differences in unmeasured, lurking variables are eliminated. If groups are relatively equal at baseline, postintervention differences (including those from placebo effects) tend to be attributed to the intervention. Assuming random assignment was conducted effectively, the only remaining consideration is whether improvements are clinically meaningful.
The primary way of assessing the effectiveness of interventions is to rely on p values, in which p values are calculated for postintervention differences between groups, and statistical significance is assumed to provide evidence of an intervention effect. However, this approach has major problems. Vocal critics of this method (e.g., Cohen, 1992; Goodman, 1999) have expressed exasperation that p values continue to receive undue respect after decades of debate and virtually no remaining arguments in favor of their exclusive use. It seems that the respect for p values is a product of habit rather than appropriateness (Freeman, 1993).
The criticism relates to the essence of p values. In an experimental design comparing groups, p values quantify the probability of obtaining a group difference at least as extreme as that observed, given that the null hypothesis of no difference is true. A statistically significant difference between a treatment and a control group does not provide sufficient information about an intervention’s effectiveness, clinical meaningfulness, or intervention effect size (Kraemer et al., 2003). It is noteworthy that small effects can become statistically significant as the sample size increases (see Carver, 1978; Lin et al., 2013); in addition, if statistical assumptions are not met, clinically meaningless results can achieve significance, and p values alone fail to consider this possibility.
With these concerns in mind, authors have long suggested alternatives to significance testing. Cohen (1962, 1990) spent his career arguing for the universal reporting of effect size, and he expressed dismay that his simple formula failed to catch on (Cohen, 1992). With regard to clinical work, a range of authors have discussed the concept of clinical meaningfulness. Jacobson et al. (1984) were not the first to argue that the focus should not be on group means but on people who may benefit from a treatment. The difficult task is to develop a method of defining and quantifying benefit (Jacobson & Truax, 1991).
Jaeschke et al. (1989) discussed the minimal clinically important difference, which attempts to quantify “the smallest difference in score in the domain of interest which patients perceive as beneficial and which would mandate, in the absence of troublesome side effects and excessive cost, a change in the patient’s management” (p. 408). This difference can be measured either cross-sectionally between patients, often referred to as the minimally important difference, or longitudinally within patients, referred to as the minimally important change (MIC; Beaton et al., 2001; de Vet et al., 2006; de Vet & Terwee, 2010). Ongoing discussions that distinguish between these concepts, although important, are beyond the scope of this article; we refer rather generically to the broader intervention aim of creating change that is clinically meaningful.
To measure or approximate clinical meaningfulness, one can use anchor-based or distribution-based methods (Guyatt et al., 2002; Wyrwich et al., 2005). Anchor-based methods rely on an independently assessed “anchor” that is used to determine cutoff points on a scale with otherwise unknown parameters. For example, in patients with cancer experiencing breakthrough pain, Farrar et al. (2000) first measured pain scores with several subjective numeric scales. Next, they defined clinically meaningful pain relief as the point at which patients decide not to take additional opioid medication. Finally, using medication use as an anchor, they determined clinically meaningful changes in the subjective scales.
Distribution-based methods, which rely on the sample distribution and inferential statistics, are commonly used when anchors are unavailable. One well-defended method is to use standard error of measurement (SEM; Rejas et al., 2008; Wyrwich et al., 1999). Unlike the standard error of the mean, SEM relies not on the sample size but on the reliability of the scale in question, thereby accounting for measurement error that could pass for clinically meaningful change. Others have argued that clinical meaningfulness can be defined simply as a change of 0.5 standard deviation (SD), which for a reliability of 0.75 is precisely 1 SEM (Norman et al., 2003). Finally, others have continued to promote the reporting of effect size, such as Cohen’s d (Samsa et al., 1999). However, these distribution methods do not directly measure anything clinical (for a review, see Crosby et al., 2003), so many authors have used the term minimum detectable change (MDC) to indicate the minimum change required to exceed measurement error (de Vet & Terwee, 2010; Wyrwich et al., 1999).
Even in the presence of anchors, determining meaningful change is difficult, especially with subjective patient-report measures. For example, Terwee et al. (2010) showed that MIC values vary by method, and values determined by a given method vary across studies. Anchor-based measures, they pointed out, suffer from validity problems because they tend to be more strongly correlated with follow-up scores than baseline scores. Yet, anchor-based methods are perhaps universally preferred over distribution-based methods, and distribution-based methods are universally preferred over p values for determining clinical meaningfulness. Finally, there is broad agreement that distribution-based methods approximate clinical meaningfulness (e.g., Rejas et al., 2008), so they should at least be used until anchors are established (Turner et al., 2010).
Purpose and Significance
The purpose of this article is to provide an analysis of clinical meaningfulness by demonstrating the use of distribution-based methods on a well-known occupational therapy intervention: the Well Elderly II RCT, also known as Lifestyle Redesign (Jackson et al., 1998). The Well Elderly II RCT consisted of group and individual sessions to counsel older adults about their daily activities and to develop preventive lifestyle changes that are thought to trickle down to improvements in physical health, mental health, and life satisfaction (see Jackson et al., 1998).
The Well Elderly intervention is well known outside of occupational therapy and has arguably been shown to be cost effective (Clark et al., 2012). It is regularly cited, with more than 250 citations on Google Scholar. Moreover, the intervention is being replicated elsewhere (e.g., Johansson & Björklund, 2016; Mountain et al., 2017) and has been included as a guideline for health promotion in older adults by the United Kingdom’s National Institute for Health and Care Excellence. We randomly sampled 50 of the articles that cited Clark et al. (2012), and 32 (64%) cited the study as having been effective or showing that occupational therapy community interventions are effective.
The Well Elderly II RCT relied on a treatment-control crossover design, in which analysis of covariance (ANCOVA) was used to control for various baseline covariates between the treatment and control groups. Researchers reported statistically significant differences for pain, vitality, social functioning, mental health, and the mental health composite score on the Medical Outcomes Study Short-Form Health Survey (SF–36v2®; Ware, 2000); life satisfaction using the Life Satisfaction Index–Z (LSI–Z; Wood et al., 1969); and symptoms of depression using the Center for Epidemiologic Studies Depression (CES–D; Radloff, 1977) scale. The change scores and p values were mostly analogous using paired t tests, which we used to focus on improvements in the treatment group in the first stage of the study. We considered these same measures with the broader question of clinical meaningfulness, using several distribution-based approaches to reanalyze the data for each of the measures that obtained statistical significance in the original report.
Four hundred sixty older adults participated in the Well Elderly II RCT. After randomization and attrition, data analysis for the treatment group included 187 participants who were primarily female (71.7%), ranged in age from 60 to 90 yr (mean [M] = 74.1, SD = 7.9), and were ethnically diverse. Approximately half of the participants had at least some college education (51.3%) and earned less than $11,999 annually (50.8%). A control group of 173 participants had similar characteristics. Participants were recruited from a graduated care retirement community (7.0%), 11 senior housing residences (46.0%), and 9 senior activity centers (47.0%).
Data and Analysis
Focusing on the seven scales that showed statistical significance in Clark et al. (2012), we used several approaches to examine clinical meaningfulness: effect size, 0.5 SD, 2 SEMs, and minimal difference (MD); a fragility index (F index); number of participants with poor scores at baseline; and an analysis with “extreme improvers” omitted. For 0.5 SD, 2 SEMs, and MD, we generated threshold values for each scale. Then, change scores were calculated as the difference between the baseline and posttest values for each participant in the treatment and control groups, and the threshold values were used to group participants as having declined, stayed the same, or improved in their score for each scale. For effect size, the fragility index, and the analysis with extreme improvers omitted, we conducted paired t tests using Stata (Version 14.2; Stata Corporation, College Station, TX). Each method is discussed in further detail.
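This grouping step can be sketched as follows (the threshold and change scores shown are hypothetical illustrations, not values from the study):

```python
# Classify each participant's change score against a meaningfulness threshold.
# The threshold would be 0.5 SD, 2 SEMs, or MD for a given scale; all values
# here are hypothetical.

def classify_changes(baseline, posttest, threshold):
    """Label each participant as 'declined', 'same', or 'improved'."""
    labels = []
    for pre, post in zip(baseline, posttest):
        change = post - pre  # oriented so that positive change = improvement
        if change >= threshold:
            labels.append("improved")
        elif change <= -threshold:
            labels.append("declined")
        else:
            labels.append("same")
    return labels

print(classify_changes([40, 55, 62], [49, 54, 50], threshold=8.0))
# → ['improved', 'same', 'declined']
```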
Effect size quantifies the strength of the relationship between independent and dependent variables (Kraemer et al., 2003). Numerous measures of effect size exist (e.g., correlation coefficients, Cohen’s d, and measures of risk potency). In this article, we discuss Cohen’s d, which is computed by taking the difference between the means of two groups and dividing that difference by the pooled SD (d = [M1 − M2] / SDpooled). Cohen’s d ranges from minus to plus infinity, but d values much greater than 1 are uncommon. Cohen (1992) provided general guidelines for interpreting effect sizes, with cutoffs for small (d = 0.2), medium (d = 0.5), and large (d = 0.8) effects. A medium effect is intended to be a change visible to the naked eye of the careful observer, whereas a small effect is noticeably smaller but not trivial (Cohen, 1992). Although effect size provides information about the relative strength of statistically significant findings, Kraemer et al. (2003) noted that effect size values are relative and not readily interpretable in terms of how much people are affected by treatment.
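The calculation can be sketched in a few lines (the group scores are hypothetical; the pooled SD is computed from the usual n − 1 sample variances):

```python
import math

def cohens_d(group1, group2):
    """Cohen's d: difference in group means divided by the pooled SD."""
    n1, n2 = len(group1), len(group2)
    m1, m2 = sum(group1) / n1, sum(group2) / n2
    # Sample variances (n - 1 denominator), then the pooled SD.
    v1 = sum((x - m1) ** 2 for x in group1) / (n1 - 1)
    v2 = sum((x - m2) ** 2 for x in group2) / (n2 - 1)
    sd_pooled = math.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    return (m1 - m2) / sd_pooled

print(round(cohens_d([5, 6, 7, 8], [4, 5, 6, 7]), 2))  # → 0.77
```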
SD is a measure of the amount of variation, or spread, in a set of values. The variance is calculated by summing the squared differences between each value and the mean and dividing by the sample size (or by n − 1 for the unbiased sample estimate). SD is the square root of the variance. To determine clinical meaningfulness, a value of 0.5 SD has been found to correspond with patient-reported minimal change across a variety of studies (Norman et al., 2003).
Standard Error of Measurement
SEM is an estimate of the measurement error in an obtained score, reflecting the reliability of the test, and is typically determined during assessment development. SEM is calculated by subtracting the test reliability (r coefficient or intraclass correlation coefficient) from 1, taking the square root of that difference, and multiplying the square root by the SD of the test scores (SEM = SD × √[1 − r]). Clinically, SEM is typically used to create a band or confidence interval (CI) around an obtained score to arrive at a sense of the true score (X ± SEM; Harvill, 1991), in which 1 SEM corresponds to a 68% CI, and 2 SEMs correspond to a 95% CI. Some authors have suggested using 1 SEM as a cutoff for meaningful aggregate changes (e.g., Copay et al., 2007), but many authors (including the SF–36 authors; Ware et al., 1994) deem 1 SEM to be too liberal for individual changes. Thus, we used 2 SEMs for our analysis.
Weir (2005) proposed using SEM to determine the MD (others have referred to this as MDC; see Busija et al., 2008) that approximates a real treatment effect when performing pre- and posttests (see also Beaton et al., 2001). MD is calculated by multiplying the SEM of the measure by 1.96 (the z score associated with a 95% CI) and, to account for error coming from two scores (pre and post) rather than one, the square root of 2 (MD = 1.96 × SEM × √2).
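Both thresholds can be computed in a few lines (the SD and reliability values here are hypothetical):

```python
import math

def sem(sd, reliability):
    """Standard error of measurement: SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - reliability)

def minimal_difference(sd, reliability):
    """MD = 1.96 * SEM * sqrt(2), covering error in both pre and post scores."""
    return 1.96 * sem(sd, reliability) * math.sqrt(2)

# With reliability .75, 1 SEM works out to exactly 0.5 SD (Norman et al., 2003):
print(sem(10.0, 0.75))                           # → 5.0
print(round(minimal_difference(10.0, 0.75), 2))  # → 13.86
```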
Walsh et al. (2014) proposed a fragility index to quantify how “fragile” an RCT’s results are. They calculated the index primarily for dichotomous outcomes, in which p-value calculations involve comparing the proportion of patients experiencing an outcome or event in a treatment group with the proportion experiencing the outcome or event in a comparison group. To calculate the fragility index in this scenario, Walsh et al. iteratively removed participants in the treatment group who experienced the outcome or event, and each time they removed a participant, they recalculated the p value, continuing until it exceeded .05. The required number of participants to lose statistical significance was the fragility index, in which lower numbers (e.g., <5) indicate that the significant result depends on very few outcomes.
Our fragility index (Ohl & Schelly, 2017) was calculated by iteratively omitting the most extreme improving participant from the p-value calculation (i.e., using paired t tests) until the p value exceeded .05. A value of 1, then, indicates that without the most extreme improver, the result would not obtain statistical significance.
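A minimal sketch of this procedure, assuming SciPy's paired t test is available (the baseline and posttest vectors are hypothetical):

```python
from scipy.stats import ttest_rel

def fragility_index(baseline, posttest, alpha=0.05):
    """Drop the most extreme improver, one at a time, until p exceeds alpha."""
    pre, post = list(baseline), list(posttest)
    removed = 0
    while len(pre) > 2:
        if ttest_rel(pre, post).pvalue > alpha:
            return removed
        # Remove the participant with the largest improvement (post - pre).
        i = max(range(len(pre)), key=lambda k: post[k] - pre[k])
        del pre[i], post[i]
        removed += 1
    return removed

pre = [10] * 20
post = [11] * 15 + [10] * 4 + [15]
print(fragility_index(pre, post))  # improvers dropped before significance is lost
```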
Poor Scores at Baseline
Participants with poor scores at baseline are those in the undesirable extremes, such as those reporting severe pain (low scores) or depression (high scores). These people are the most likely to show improvement over time because many have nowhere to go but to improve. One concern is that although aggregate scores may show no statistically significant differences at baseline—which was the case in the Well Elderly II data—disproportionate extreme values can effectively create consequential differences between the treatment and control groups. The 5th percentile of the combined treatment and control groups was used as the cutoff for poor scores for each scale (the 95th percentile was used for depression, in which higher scores indicate depressive symptoms). For example, for the Pain scale, this included people who reported “very severe” pain and either “extreme” or “quite a bit” of interference with daily activities and people who reported “severe” pain and “extreme” interference.
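This cutoff logic can be sketched as follows (the simple index-based percentile helper and the scores are hypothetical simplifications):

```python
def percentile(values, pct):
    """Approximate percentile via an index into the sorted values."""
    ordered = sorted(values)
    i = round(pct / 100 * (len(ordered) - 1))
    return ordered[max(0, min(len(ordered) - 1, i))]

def poor_at_baseline(scores, higher_is_worse=False):
    """Flag scores at the poor extreme: bottom 5% (or top 5% for depression)."""
    cutoff = percentile(scores, 95 if higher_is_worse else 5)
    if higher_is_worse:
        return [s for s in scores if s >= cutoff]
    return [s for s in scores if s <= cutoff]

print(poor_at_baseline(list(range(1, 101))))  # → [1, 2, 3, 4, 5, 6]
```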
Extreme Improvers Omitted
To determine whether improvers on individual scales were the same people across many scales, as opposed to many people improving on only one or two scales, we used Excel (Microsoft Corp., Redmond, WA) to mark 1-SEM improvers on all seven scales. Participants in the treatment group who improved by at least 1 SEM on six or seven of the scales (n = 8) were labeled “extreme improvers” and were omitted for recalculations of paired t tests.
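The marking step can be sketched as follows (change scores are oriented so that positive values indicate improvement; the SEM thresholds and participant data are hypothetical):

```python
def extreme_improvers(changes_by_participant, sems, min_scales=6):
    """Return indices of participants improving >= 1 SEM on min_scales+ scales."""
    flagged = []
    for i, changes in enumerate(changes_by_participant):
        improved = sum(1 for change, sem in zip(changes, sems) if change >= sem)
        if improved >= min_scales:
            flagged.append(i)
    return flagged

sems = (5, 5, 5, 5, 5, 5, 5)  # 1-SEM threshold per scale (hypothetical)
participants = [
    (6, 6, 6, 6, 6, 6, 0),    # improved >= 1 SEM on six scales -> flagged
    (6, 6, 0, 0, 0, 0, 0),    # improved on only two scales -> not flagged
]
print(extreme_improvers(participants, sems))  # → [0]
```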
Medical Outcomes Study Short-Form Health Survey.
The SF–36v2 (Ware, 2000) is a health survey that contains 36 items and yields eight profile scores indicating various aspects of functional health and well-being. Four of the profile scores (physical function, role physical, bodily pain, and general health) yield a physical health composite score, and the remaining four (mental health, role emotional, social functioning, and vitality) yield a mental health composite score. The test–retest reliability coefficients of the profile scores are all equal to or greater than .80, with the exception of social functioning (r = .76; Ware, 2000). Reliability estimates for the physical and mental health composite scores typically exceed .90. The SEM, with a 95% CI, has been reported extensively in the literature and ranges from ±6–7 points for the composite scores to ±13–32 points for the profile scores (Ware, 2000).
Life Satisfaction Index–Z.
The LSI–Z (Wood et al., 1969) is a 13-item self-report measure of subjective well-being in the older adult population. Each item consists of a 3-point scale coded as 0 (disagree), 1 (unsure), or 2 (agree). Total scores range from 0 to 26, with higher scores indicating higher life satisfaction. Test–retest reliability is acceptable (r = .79).
Center for Epidemiologic Studies Depression Scale.
The CES–D scale (Radloff, 1977) is a 20-item self-report scale designed to measure depressive symptomatology in the general population. Respondents rate the frequency with which depressive symptoms have occurred over the past week using a 4-point scale coded as 0 (rarely or none of the time), 1 (some or a little of the time), 2 (occasionally or a moderate amount of time), or 3 (most or all of the time). Total scores range from 0 to 60, with higher scores indicating more depressive symptoms. During its development, the CES–D scale demonstrated high internal consistency in both general (α = .85) and clinical (α = .90) populations (Radloff, 1977). Test–retest reliability was moderate (rs = .45–.70), which the test developer attributed to the variable nature of depressive symptoms and methodological flaws with the consistency and method of data collection between pre- and posttests.
Table 1 reports baseline and posttest means for the seven scales, and p values are included first as they were reported in Clark et al.’s (2012) original report. Clark et al. used ANCOVA to compare regression slopes between the treatment and control groups, in which the posttest score was predicted with several factors, including the baseline score. This method is certainly justifiable, but there is reason to believe it may be misleading under certain conditions of baseline score imbalances (see Lord, 1967). Thus, the second p values reported in Table 1 show the results of independent-samples t tests, which simply compare the differences between baseline and posttest without controlling for any factors. We present these p values to show that a simpler comparison of means produces very similar results. Only mental health and the LSI–Z move slightly beyond the typical .05 cutoff for statistical significance.
Table 1 also reports p values from paired-samples t tests, which are later used to consider changes without the extreme improvers (effect sizes are also calculated from the within-group changes). Note that Clark et al. (2012) reported p values for one-sided tests, but we agree with suggestions to use two-sided tests (see Bland & Altman, 1996). Although the one- versus two-sided debate is far from settled, most researchers seem to agree that two-sided tests should be the norm (Ringwalt et al., 2011), not least because they set a higher bar for significance, thereby protecting slightly against Type I errors. In this case, they also allow for declines between baseline and posttest, which are hypothesized to occur in this population in the absence of intervention (Jackson et al., 1998). Using two-sided tests, social functioning does show a significant decline (p = .050) in the control group. The treatment group shows significant improvements in pain (p = .003), vitality (p = .047), the LSI–Z (p = .012), and the CES–D scale (p = .028) but not in social functioning, mental health, or the mental health composite score.
Effect sizes are shown in the Cohen’s d column in Table 2. In the treatment group, effect sizes ranged from 0.13 to 0.19, all less than the cutoff for small effects suggested by Cohen (i.e., d = 0.20). Pain (d = 0.16) and the CES–D scale (d = 0.19) were the largest effects. In the control group, the effect size for social functioning (d = −0.15), in which there was significant decline, was also below the cutoff for a small effect.
0.5 Standard Deviation
The 0.5-SD method, also shown in Table 2, was the least conservative method for grouping participants as having declined, stayed the same, or improved. With the exception of the mental composite score, which has a low SD and therefore displays high numbers of improvers and decliners in the treatment and control groups, the percentages of improvers and decliners were relatively similar between the treatment and control groups. The percentages of improvers ranged from 6.4% to 15.5% in the treatment group and from 5.8% to 11.0% in the control group; the percentages of decliners ranged from 4.8% to 8.0% in the treatment group and from 9.2% to 15.1% in the control group. For the mental composite score, slightly more control group participants remained the same relative to the treatment group (48.3% vs. 47.1%); however, for the remaining scales, the treatment group had more participants who remained stable compared with the control group, with between 77.0% and 88.8% showing no improvement and no decline. Overall, the control group tended to have slightly more decliners than improvers, whereas the treatment group had between 3 and 15 additional improvers relative to decliners.
Standard Error of Measurement
The 2-SEMs method was more conservative than the 0.5-SD method. Again, with the exception of the mental composite score, the percentages of improvers ranged from 1.6% to 12.8% in the treatment group and from 0% to 11.0% in the control group; the percentages of decliners ranged from 0.5% to 5.9% in the treatment group and from 4.6% to 13.3% in the control group. The treatment group had more participants who remained stable compared with the control group for all scales, with 62.0% showing no improvement and no decline in the mental composite, and between 81.8% and 97.9% showing no improvement and no decline in the remaining scales. Overall, the control group again tended to have slightly more decliners than improvers, and the treatment group had between 2 and 14 additional improvers relative to decliners.
Minimal Difference
The MD method yielded the most conservative cutoff for clinical meaningfulness. The mental composite continued to be an exception because 15.5% of treatment participants improved (12.2% of control participants), compared with between 0% and 7.0% on the other scales; the control group had between 0% and 5.2% showing improvement on the other scales.
Fragility Index
The fragility index (see Table 2) shows that very few extreme improvers need to be omitted—from 1 to 7—for each scale to lose statistical significance. Three scales had fragility scores of 0 because the p values were nonsignificant with two-sided paired t tests instead of Clark et al.’s (2012) use of one-sided ANCOVA tests.
Poor Scores at Baseline
With the exception of the LSI–Z, more participants in the treatment group than in the control group displayed poor scores at baseline (see Table 2). Pain, vitality, social functioning, and the mental composite had substantially more participants with poor scores at baseline, with 8–18 additional participants with poor scores in the treatment group compared with the control group.
Extreme Improvers Omitted
Eight participants stood out as consistent improvers, improving by at least 1 SEM on six or seven of the reported scales. It is noteworthy that these participants also scored more poorly at baseline: As a group, they displayed worse scores by between 0.94 and 5.65 SEMs compared with the remaining group (combined control and treatment); their scores were worse by more than 2 SEMs on mental health, the mental composite score, the LSI–Z, and the CES–D scale. Table 2 shows an analysis of the treatment group in its entirety compared with the treatment group with these 8 participants omitted. The result is that the aggregate improvements from baseline to posttest all but disappear in all of the scales except for pain and the LSI–Z, and only pain remains statistically significant with a two-sided test.
We reexamined the Well Elderly II RCT using distribution-based methods to provide an example of how to analyze and interpret data through the lens of clinical meaningfulness. The effect sizes for all statistically significant results were minimal to small, suggesting minimal aggregate improvements in the treatment group’s perceptions of quality of life, life satisfaction, and depression. Clinically, small effect sizes may or may not be detectable to an occupational therapy practitioner or the participants. To provide a more thorough understanding of intervention effects than p values alone, we conducted subsequent analyses that took into account measurement error (i.e., SD, SEM, MD), a fragility index, an assessment of poor scores at baseline, and an analysis with 8 participants with extreme improvements omitted.
SD, SEM, and MD were used to determine the proportion of participants with scores that stayed the same, improved, and declined. A consistent pattern presented across the three methods. Compared with the control group, a greater proportion of the treatment group improved, a greater proportion stayed the same (with the exception of the mental health composite), and a slightly smaller proportion declined. More important, the clear majority of participants in the treatment and control groups experienced no meaningful change on each scale.
Using the more conservative MD, we found little difference between groups: For example, on the Pain scale, 2.7% of the treatment participants compared with 2.3% of the control group participants improved. This small difference could be explained by baseline disparities between groups. Another way of thinking about the MD results is that the vast majority of both groups either stayed the same or declined in pain (97.3% in the treatment group and 97.7% in the control group). Considering measurement error in this way potentially provides alternative explanations for statistically significant differences between groups. Moreover, it provides a way to single out a small subset of the total sample for additional analyses, with the potential to identify characteristics in the improvers that might explain their response to the intervention.
Given that more participants stayed the same in the treatment group compared with the control group, it is possible that the intervention prevented decline; however, several counterpoints should be noted. First, the assessments with statistically significant improvements in the Well Elderly II RCT were all self-report or subjective measures of health, and they do not indicate any objective cognitive or physical functioning. Therefore, relatively stable scores between pre- and posttests may not translate into a maintenance of function because they may only indicate a maintenance in perception of function, and perceptions may be susceptible to a placebo effect. Second, in a population in which Lifestyle Redesign would be used, it is not clear how much decline to expect in a 6-mo period. To assess age-related declines, researchers conducting medical and epidemiological studies typically collect prospective longitudinal data spanning years to decades, using a combination of objective and subjective assessments (e.g., Inzitari et al., 2007; Paterniti et al., 2002; van Gelder et al., 2004). Considering how declines are typically assessed, the study design used in the Well Elderly II RCT may not be appropriate for examining age-related declines.
The additional analyses point to a small group of participants who are driving the effects, and these effects are at least partially a product of regression to the mean. In brief, regression to the mean is a universal feature of repeated measures in the presence of measurement error. To illustrate, consider the SF–36v2 Pain scale, in which two questions are combined to form a raw score that ranges from 2 to 11. Now take any group with a particular score in the middle, such as all participants who report a 5 at baseline. At posttest, some will go up, and some will go down, such that measurement error will pull at the baseline score from the top and the bottom. However, participants at 11 and those at 2, when measured again at posttest, are affected by measurement error unidirectionally, in which any change at all will pull the group mean down or up, respectively.
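This asymmetry is easy to demonstrate with a small simulation (all parameters hypothetical): with no true change, a group starting at the scale ceiling can only move down on retest, whereas a group in the middle is pulled equally from both sides.

```python
import random

random.seed(1)

def retest(score, low=2, high=11):
    """Retest a 2-11 raw score with symmetric measurement error, clipped to range."""
    return max(low, min(high, score + random.choice([-1, 0, 1])))

ceiling = [11] * 1000  # e.g., worst-pain raw score at baseline
middle = [5] * 1000    # mid-scale raw score at baseline

mean = lambda xs: sum(xs) / len(xs)
print(mean([retest(s) for s in ceiling]))  # drops below 11: error acts one way only
print(mean([retest(s) for s in middle]))   # stays near 5: error cancels out
```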
With regression to the mean in mind, the small group of participants with extreme values are of special concern. Specifically, the fragility index ranged from 1 to 7, suggesting that the statistical significance was vulnerable to a few extreme improvers. Next, 8 participants showed improvement on all or all but one of the seven scales, and these participants had worse scores (e.g., more pain, less satisfaction) by 0.94–5.65 SEMs at baseline compared with the remaining group, possibly because they had experienced a recent injury or negative life event before the intervention. Finally, there were concerning disparities in baseline scores for participants in these poor extremes: The treatment group had substantially more participants in the poor extremes—8 or more for the majority of the scales—compared with the control group. The implication is that because participants at the poor extremes at baseline are more likely to improve than others (regression to the mean alone will lead to more improvement, on average), the improvements in the treatment group—even in the absence of improvement in the control group—may be related to these disparities. It should be noted that the design of the scales may even exaggerate this effect: When baseline distributions are skewed such that the scale midpoints are above the means and few participants report the poor extremes, as is the case for the reported scales, regression to the mean will be asymmetrical, resulting in an aggregate improvement. Interestingly, the developer of the CES–D scale warned about this possibility (Radloff, 1977, p. 391).
In addition, the intervention may have had a placebo effect on participants who had poor scores at baseline. For example, if 2 participants had hip replacement surgery immediately before the intervention, they both would have been primed to improve on various measures, such as pain, after their surgery. If 1 participant was treated and 1 participant was placed in the control group, the treated participant may have reported slightly greater improvements than the participant in the control group simply because he or she was aware of being treated. Likewise, the individual sessions of the intervention may have targeted his or her rehabilitation. If cases such as this example explain the small group of improvers, then the effectiveness of Lifestyle Redesign may stem from fairly traditional occupational therapy rather than from preventive strategies.
Although this article provides a more detailed analysis of the statistically significant findings in the participants and provides a discussion of the intervention’s clinical meaningfulness, it does not include all possible methods of examining clinical meaningfulness, and it does not elucidate the underlying mechanisms driving the improvements in the participants.
Implications for Occupational Therapy Practice
Distribution-based approaches yield a more nuanced picture than p values alone. These approaches suggest that the Well Elderly II intervention was, at best, beneficial for a select group. At worst, the small aggregate improvements could be explained by a combination of regression to the mean and a placebo effect. These findings may be indicative of an endemic pattern in RCTs in which the p-value standard provides somewhat of a blinder to true intervention effects. With this in mind, we offer the following recommendations for occupational therapists and researchers:
For evidence-based practice, RCTs that only report p values should not be regarded as a high level of evidence without the consideration of effect size.
Researchers should assess clinical meaningfulness by considering participants and subgroups rather than only aggregate scores; we recommend simple approaches that rely on descriptive statistics, beginning with comparisons of percentages of participants who declined, stayed the same, and improved between the treatment and control groups.
Researchers should scrutinize baseline values, consider measurement error, and assess the fragility of their results.
What is clear from this analysis is that the effects of the Well Elderly II study are not as straightforward as they appear at first glance. More important, these findings show how misleading statistically significant p values can be for intervention researchers and, perhaps more so, for those investigators citing intervention research that claims effectiveness. After all, one cannot be expected to spend weeks sorting through original data to reassess the effectiveness of published interventions.
The approaches we used only indirectly address clinical meaningfulness, and none of them alone solve the puzzle of whether changes were meaningful or an intervention was effective. However, it is important to note that all of the approaches promote a particular attitude during data analysis: p values based on aggregate changes should be the starting point for additional, descriptive analyses that zero in on participants and subgroups. These additional analyses should consider measurement error, baseline scores, extreme improvers, and regression to the mean (e.g., by considering the distribution of change scores in the control group).
Before the analysis, though, investigators should be clear about intended treatment effects and mechanisms of change. The chosen assessments should, as much as possible, directly measure the predefined treatment effects. Expectations about how these effects operate in the population should be realistic: For some studies, such as pharmacological studies, small effects that reach most participants may indicate intervention success; for others, reaching only a limited number of participants may indicate a need to redesign the intervention or target only extreme values at baseline. Should the investigators choose to cast a wide net with a large number of assessments, it should be noted that under the null hypothesis of no change, statistical significance will be achieved 5% of the time by chance (i.e., at α = .05).
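The arithmetic behind that caution is straightforward: with seven scales each tested at α = .05 and no true effects, the chance of at least one spuriously significant result is substantial (assuming, for simplicity, independent tests).

```python
# Probability of at least one false positive across several independent tests.
alpha, n_tests = 0.05, 7
p_at_least_one = 1 - (1 - alpha) ** n_tests
print(round(p_at_least_one, 3))  # → 0.302
```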
We thank the reviewers for their helpful feedback, as well as several individuals who were kind enough to comment on early drafts. Thank you to Aaron Eakman, Cathy Schelly, Jerry Vaske, and Margaret Kaplan.