Importance: It is crucial to understand the psychometric properties of performance-based assessments of functional cognition to support their use and the interpretation of the outcomes.
Objective: To investigate the psychometric properties of performance-based assessments of functional cognition in adults.
Data Sources: A literature search was conducted in the MEDLINE, Embase, CINAHL, PsycINFO, Web of Science, and Scopus databases from inception to July 2022.
Study Selection and Data Collection: We searched for studies that investigated the psychometric properties of the 25 assessments identified in Part 1 of this review. The included studies examined at least one psychometric property in adults older than age 18 yr. We evaluated psychometric evidence using the COnsensus-based Standards for the selection of health Measurement INstruments (COSMIN) approach.
Findings: The most extensively researched psychometric property was hypothesis testing for construct validity, including convergent and discriminative validity. The original Allen Cognitive Level Screen, the Kitchen Task Assessment, the Menu Task, and the original Observed Tasks of Daily Living have high evidence of sufficient construct validity. The hospital-based Multiple Errands Test has moderate evidence for sufficient reliability. However, the reported evidence for other psychometric properties, including structural validity, is limited. Additionally, no studies investigated measurement error and responsiveness. No studies were deemed to satisfy the standards for establishing cross-cultural validity suggested by the COSMIN approach.
Conclusions and Relevance: Although several assessments demonstrate sufficient construct validity, additional studies are needed to elucidate other psychometric properties, including structural validity, reliability, and responsiveness, to support the use and interpretation of assessments of functional cognition in adults.
Plain-Language Summary: Functional cognition refers to the integration of cognitive skills during the performance of everyday activities. We investigated whether the existing performance-based assessments of functional cognition in adults are valid and reliable. The findings of this review indicate that certain assessments, such as the original Allen Cognitive Level Screen, the Kitchen Task Assessment, the Menu Task, and the original Observed Tasks of Daily Living, may be valid. However, additional studies are needed to elucidate other psychometric properties, including reliability, structural validity, and responsiveness, to support the use and interpretation of the assessments of functional cognition in adults.
Occupational therapy professionals have established the concept of functional cognition to describe the dynamic integration of cognitive skills required to perform everyday activities in the real world (Baum et al., 2023; Giles et al., 2017; Toglia & Foster, 2021; Wesson et al., 2016). The American Occupational Therapy Association (AOTA; 2019) defines functional cognition as “how people use and integrate thinking and processing skills to accomplish everyday activities in clinical and community living settings” (p. 2). Occupational therapy practitioners have used performance-based assessments, which include direct observation of performance of everyday activities, to measure functional cognition in simulated naturalistic and real-world settings (Skidmore, 2017). Performance-based assessments of functional cognition may compensate for the limited ecological validity of traditional neuropsychological tests (Giles et al., 2022, 2024). Indeed, one study reported that a performance-based assessment of functional cognition (the medication management task of the Executive Function Performance Test [EFPT]) better predicted poor participation in daily activities than a neuropsychological test (the Color Trails Test) among older adults (Arieli et al., 2022). In this regard, functional cognition seems to be critical for independent living and, thus, a core priority in the occupational therapy profession (Giles et al., 2017, 2020).
Our team previously identified 25 performance-based assessments of functional cognition (Lee et al., 2025). Most of the assessments were originally developed to measure multiple cognitive constructs, such as executive functioning and high-level cognitive processes (e.g., multitasking). However, because the concept of functional cognition was developed later than these assessments, we retrospectively described the characteristics of these 25 assessments and deemed them as being consistent with the assessment of functional cognition. Occupational therapy and psychology were the disciplines most frequently involved in the development of performance-based assessments of functional cognition. In the assessments identified, the most commonly used instrumental activities of daily living (IADLs) were cooking and meal preparation, managing finances, using the telephone, and managing medication. Performance time was frequently used as a scoring metric across the assessments. Additionally, we identified indicators of functional cognitive abilities (e.g., organization, planning, judgment, strategy use) that have been incorporated into the existing assessments. This review provides fundamental knowledge on how functional cognition is conceptualized in the existing assessments (Lee et al., 2025).
To further facilitate the use of performance-based assessments of functional cognition, it is crucial to examine their psychometric properties, including their validity and reliability. Using standardized assessments with evidence-supported psychometric properties is part of an evidence-based approach in occupational therapy, ensuring that clinical practices are grounded in scientifically supported methods and leading to a better understanding of the outcomes for recipients of occupational therapy (Unsworth, 2011). Using standardized assessments is particularly important because occupational therapy practitioners worldwide have reported that they frequently use nonstandardized clinical observation to measure function among clients with cognitive impairments (Goodchild et al., 2023; Manee et al., 2020; Stigen et al., 2018; Ward et al., 2024). Nonstandardized clinical observation is essential during the process of clinical practice and may provide an opportunity to understand cognitive skills in everyday life. However, nonstandardized clinical observation requires occupational therapy practitioners to rely on subjective inferential skills to evaluate cognitive abilities. In contrast, standardized performance-based assessments of functional cognition provide structured scoring systems that promote more objective evaluation of an individual’s integrated cognitive abilities during the performance of everyday activities. Knowing the psychometric strengths and weaknesses of the various standardized performance-based assessments of functional cognition can help occupational therapy practitioners with assessment selection, interpretation, and synthesis of findings during the evaluation process. Therefore, the objective of this study was to investigate the psychometric properties of performance-based assessments of functional cognition among adults.
Method
This comprehensive systematic review includes two articles: In Part 1, we identified performance-based assessments of functional cognition used with adults and described the characteristics of the identified assessments (Lee et al., 2025). In this article, Part 2, we investigate the psychometric properties of the assessment tools identified in Part 1. This systematic review followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines, and the protocol was registered in PROSPERO (CRD42022321918). In the registered protocol, we included clinical utility as one of the research questions we planned to address in this review. However, because of the extensive content covered in both Part 1 and Part 2, we left clinical utility out of the scope of both reviews. Instead, we suggest that future studies address the clinical utility of the assessments.
Identification of Studies Reporting on Psychometric Properties of Assessments
Search Strategy and Selection of Studies
We conducted a search for each identified assessment tool from Part 1 (for a list of the included assessments, see Appendix A in the Supplemental Material, available online with this article at https://research.aota.org/ajot) from inception to July 2022 in the following databases: MEDLINE (n = 344), Embase (n = 467), CINAHL (n = 294), PsycINFO (n = 540), Web of Science (n = 420), and Scopus (n = 741). In our search strategy, we used the name of each identified assessment from Part 1 (25 in total) as the primary search term. We did not apply search filters specific to psychometric properties. This decision was made to include research articles that might not explicitly mention terms such as validity and reliability but use psychometric properties–related research designs and statistical analyses (e.g., independent t tests comparing healthy controls and target populations). By not limiting our search to psychometric properties, we aimed to capture a broader range of studies that might otherwise be overlooked. In other words, we searched all studies that used the 25 assessments, then included any studies that investigated psychometric properties. Two independent reviewers (Moon Young Kim and Samantha B. Randolph) conducted the screening process on the basis of the studies’ titles and abstracts. Subsequently, full-text screening was performed to identify eligible studies. Any disagreements between the two reviewers (Kim and Randolph) were resolved through discussions involving a third reviewer (Yejin Lee) and other coauthors, ensuring consensus was reached.
Eligibility Criteria
We included studies with adults (those older than age 18 yr) that focused on investigating at least one psychometric property of the assessments identified in Part 1. We did not specify diagnosis in the eligibility criteria, to allow for comprehensive inclusion. Similarly, we did not include setting as an eligibility criterion because the included assessments require varied settings (simulated activities performed in clinical or laboratory settings, or naturalistic activities performed in real-world settings).
Data Extraction
Description of Assessments and Characteristics of the Included Studies
A standardized data extraction form was developed in Microsoft Excel to ensure systematic data extraction. This form was designed to capture characteristics of and relevant information from the included studies (e.g., study design, participants’ characteristics, psychometric properties, statistics, and results), as well as key details of the assessments (e.g., which version of the assessment was used). Two independent reviewers (Kim and Randolph) extracted the relevant data from the included studies using this standardized form. Disagreements or discrepancies were discussed with a third reviewer (Lee) and other coauthors to reach consensus and ensure accuracy and reliability in the data extraction process.
Evaluation of the Quality of Psychometric Properties of the Assessment
We followed the COnsensus-based Standards for the selection of health Measurement INstruments (COSMIN) methodology for systematic reviews (Mokkink, Prinsen, et al., 2018). This methodology includes three steps: (1) evaluating methodological quality using the COSMIN risk-of-bias checklist (Mokkink, De Vet, et al., 2018), (2) applying the quality criteria to evaluate the good measurement properties of each study and the overall rating per psychometric property, and (3) evaluating the quality of evidence using the Grading of Recommendations Assessment, Development, and Evaluation (GRADE) approach. Two reviewers (Kim and Randolph) independently rated each step throughout the process of evaluation, and, in cases of disagreement, a third reviewer (Lee) was consulted to reach consensus.
Evaluating Methodological Quality of Psychometric Properties of the Included Studies
To assess the methodological quality of the psychometric properties of the included studies, we used the COSMIN risk-of-bias checklist (Mokkink, De Vet, et al., 2018). This checklist provides methodological standards for study design and statistical approaches to evaluate the methodological quality of psychometric properties investigated in a research study (Mokkink, De Vet, et al., 2018). The checklist uses a 4-point rating scale (very good, adequate, doubtful, and inadequate) to rate the quality of each property. Multiple items are included to evaluate the quality of each psychometric property. Thus, the overall quality of each psychometric property was determined using the worst-score-counts method, whereby the lowest rating of any item within each property determined its overall quality (Mokkink, De Vet, et al., 2018). For example, if any item within a property was rated as inadequate, the overall quality of that property was graded as inadequate (Mokkink, De Vet, et al., 2018).
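As a minimal illustration (our own sketch, not part of the COSMIN materials; the function name is ours), the worst-score-counts rule can be expressed as follows:

```python
# A minimal sketch (our construction, not part of the COSMIN materials) of the
# worst-score-counts rule: the lowest-rated item determines a property's
# overall methodological quality.
RATING_ORDER = ["inadequate", "doubtful", "adequate", "very good"]  # worst to best

def worst_score_counts(item_ratings):
    """Return the overall quality rating: the worst rating among all items."""
    return min(item_ratings, key=RATING_ORDER.index)

# One doubtful item caps the property's overall quality at doubtful.
print(worst_score_counts(["very good", "adequate", "doubtful"]))  # -> doubtful
```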
We evaluated seven psychometric properties (reliability, measurement error, criterion validity, hypothesis testing for construct validity, structural validity, cross-cultural validity, and responsiveness). We did not evaluate content validity in this review because none of the included studies examined content validity, including the assessments’ relevance, comprehensiveness, and comprehensibility, as defined by the COSMIN guidelines (Mokkink, De Vet, et al., 2018). Additionally, we did not evaluate internal consistency because it may not be the most suitable measure of reliability for performance-based assessments of functional cognition, given that it would assume that functional cognition is a unidimensional construct (Giles et al., 2022). Moreover, according to the COSMIN checklist, internal consistency should be determined only once unidimensionality is confirmed through structural validity using factor analysis, item response theory, or Rasch analysis (Mokkink, De Vet, et al., 2018). However, most of the included assessments did not have evidence for structural validity demonstrating unidimensionality. Therefore, internal consistency was excluded from this review.
Applying Quality Criteria to Evaluate Good Measurement Properties and Evaluating the Overall Ratings of Each Measurement Property
We used the statistical quality evaluation criteria to assess the quality of the statistical findings of each study for each psychometric property (Mokkink, Prinsen, et al., 2018). Each criterion was rated on a 3-point rating scale as sufficient (+), indeterminate (?), or insufficient (−) (Mokkink, Prinsen, et al., 2018). Regarding hypothesis testing for construct validity, the COSMIN methodology suggests that the research team can determine the hypothesis (Mokkink, Prinsen, et al., 2018). We set the hypothesis for convergent validity as correlation coefficients ≤ .7 because a correlation exceeding .7 indicates that two assessments measure the same construct (Abma et al., 2016), and none of the comparators used for convergent validity measured the same construct (i.e., functional cognition).
After evaluating the individual results for each psychometric property, we determined the overall rating on a 4-point rating scale: sufficient (+), insufficient (−), inconsistent (±), or indeterminate (?). For example, if at least 75% of the results were sufficient (or insufficient), the overall rating was determined to be sufficient (or insufficient). Additionally, if the results of individual studies were inconsistent, we further evaluated whether the inconsistency could be explained using subgroup analysis (e.g., whether consistency could be found within a specific diagnosis group or within a specific version of the assessment). If there was no reasonable explanation for the inconsistency, we rated the psychometric property as inconsistent (±). Conversely, if no information was available for the rating, we assigned an overall rating of indeterminate (?).
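To make this two-stage logic concrete, the following sketch (ours; the thresholds come from the criteria above, but the function names are hypothetical) rates individual convergent-validity correlations against the ≤ .7 hypothesis and then aggregates them with the 75% rule:

```python
# A sketch (ours, not an official COSMIN implementation) of the two-stage
# rating logic described above.

def rate_convergent_result(r):
    """Rate one convergent-validity correlation against our hypothesis (|r| <= .7)."""
    if r is None:
        return "?"                    # indeterminate: no usable statistic
    return "+" if abs(r) <= 0.7 else "-"

def overall_rating(results):
    """Aggregate individual ratings: 75% rule, else inconsistent/indeterminate."""
    rated = [x for x in results if x in ("+", "-")]
    if not rated:
        return "?"                    # indeterminate: no information available
    for sign in ("+", "-"):
        if rated.count(sign) / len(rated) >= 0.75:
            return sign               # sufficient or insufficient
    return "±"                        # inconsistent, pending subgroup analysis

ratings = [rate_convergent_result(r) for r in (0.42, -0.55, 0.63, 0.81)]
print(ratings, overall_rating(ratings))  # ['+', '+', '+', '-'] +
```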
Evaluating the Quality of Evidence
We assessed the quality of the evidence to ensure the trustworthiness of the summarized results by using the GRADE approach, as outlined by the COSMIN methodology (Mokkink, Prinsen, et al., 2018). This methodology uses a modified GRADE approach tailored to psychometric properties. According to this modified approach, the quality of evidence is classified as high (high confidence), moderate (moderate confidence), low (limited confidence), or very low (very little confidence) on the basis of four factors: (1) the methodological quality, determined with the risk-of-bias checklist (Mokkink, De Vet, et al., 2018); (2) unexplained inconsistencies in the results across the included studies; (3) imprecision, which refers to the total sample size of the available studies; and (4) indirectness, which refers to evidence derived from populations that are not the focus of the review. This GRADE approach is applied to downgrade the rating of evidence when there are concerns about the quality of evidence based on the four factors.
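As a simplified sketch (ours; it assumes each serious concern triggers a single one-level downgrade, whereas the COSMIN manual permits one- or two-level downgrades depending on severity), the downgrading logic can be expressed as follows:

```python
# A simplified sketch (ours) of the modified GRADE approach: start at high
# confidence and downgrade one level per serious concern among the four
# factors named above.
GRADE_LEVELS = ["high", "moderate", "low", "very low"]

def grade_evidence(risk_of_bias=False, inconsistency=False,
                   imprecision=False, indirectness=False):
    concerns = sum([risk_of_bias, inconsistency, imprecision, indirectness])
    return GRADE_LEVELS[min(concerns, len(GRADE_LEVELS) - 1)]

# One concern (imprecision from a small total sample) yields moderate
# evidence, as with the hospital-based MET reported in the Findings.
print(grade_evidence(imprecision=True))  # -> moderate
```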
Data Visualization
To visually depict the distribution of correlation coefficients and the significance of p values, we used box plots and a bar chart, respectively. These plots were based on the data extracted from studies that examined the convergent validity between performance-based assessments of functional cognition and other various comparison measures, such as neuropsychological tests, global cognition measures, self-perceived cognition measures, and activities of daily living (ADL) and IADL measures. The box plots provide a graphical representation of the spread of correlation coefficients (absolute value in the case of a negative correlation), showcasing the minimum, maximum, and quartile values. The bar chart visualizes the significance of the p values, allowing for a comparison of the statistical strength of the relationships between the performance-based assessments of functional cognition and the other measures. The magnitude of the correlation is defined according to the following criteria: weak (.0 < r < .3), low (.3 ≤ r < .5), moderate (.5 ≤ r < .7), high (.7 ≤ r < .9), and very high (.9 ≤ r < 1.0; Mukaka, 2012). We were unable to create plots for other types of psychometric properties because they were either rarely addressed in the included studies or their outcomes and statistical analyses were too heterogeneous to be displayed as a plot. Similarly, we did not address the populations in this visualization because the populations included in the studies were too heterogeneous. We did not limit our review to specific diagnoses because we believe it generates an opportunity to synthesize and understand the psychometric properties across varied diagnostic populations of adults. Including varied populations will enhance this study’s applicability to occupational therapy practice because it is common in practice to work across diagnostic groups.
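The sketch below (ours, with hypothetical data and column names; the actual extracted correlations appear in Appendix D) illustrates how such box plots and magnitude classifications could be produced:

```python
# A sketch of the visualization approach with hypothetical data and column
# names (the actual extracted correlations appear in Appendix D). Negative
# correlations are plotted as absolute values; magnitude labels follow the
# Mukaka (2012) cutoffs listed above.
import matplotlib.pyplot as plt
import pandas as pd

def magnitude(r):
    """Classify a correlation coefficient using the Mukaka (2012) cutoffs."""
    r = abs(r)
    if r < 0.3:
        return "weak"
    if r < 0.5:
        return "low"
    if r < 0.7:
        return "moderate"
    if r < 0.9:
        return "high"
    return "very high"

# Hypothetical rows: one reported correlation per study-comparator pair.
df = pd.DataFrame({
    "comparator": ["neuropsychological test", "global cognition", "ADL/IADL",
                   "self-perceived cognition", "neuropsychological test",
                   "ADL/IADL"],
    "r": [0.28, -0.45, 0.62, 0.15, 0.35, 0.71],
})
df["abs_r"] = df["r"].abs()               # |r| for plotting
df["magnitude"] = df["r"].map(magnitude)  # label each coefficient

ax = df.boxplot(column="abs_r", by="comparator", rot=20)
ax.set_ylabel("|r|")
ax.set_title("Convergent validity correlations by comparator type")
plt.suptitle("")  # remove pandas' automatic group title
plt.tight_layout()
plt.show()
```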
Results
Identification of Studies With Psychometric Properties
The initial search yielded a total of 2,810 records: 2,806 records from the databases and 4 records obtained through a forward-and-backward manual search of the reference lists of the included studies. All 2,810 records were imported into EndNote, and 1,547 duplicates were removed, leaving 1,263 studies for title and abstract screening (Figure 1). After screening titles and abstracts, we identified 179 studies for full-text screening and ultimately included 117 studies. Of the 25 assessments, 7 have multiple versions: the Allen Cognitive Level Screen (ACLS; original, Version 5 [ACLS–5], expanded, problem-solving, large, large Version 5), the Cognitive Performance Test (CPT; original, revised), the Cooking Task (CT; home-based, original), the EFPT (original, additional tasks, enhanced, Brazilian, Korean, Spanish, Swedish), the Multiple Errands Test (MET; Baycrest, big-store, generic version, home version, contextualized, hospital-based, modified for intellectual disabilities), the Observed Tasks of Daily Living (OTDL; original, revised), and the Weekly Calendar Planning Activity (WCPA; original, 10-item, university student). Counting the versions of each assessment separately, we included a total of 47 assessments. Figure 1 shows a flow diagram of the study selection process. Combining the multiple versions of each assessment, the most frequently used assessment was the EFPT, which was used in 19 studies, followed by the MET (n = 18), the ACLS (n = 16), the OTDL (n = 10), the Hotel Task (HT; n = 8), the WCPA (n = 8), the Actual Reality (AR; n = 5), the CPT (n = 4), and the Complex Task Performance Assessment (CTPA; n = 4). Appendix B in the Supplemental Material shows the publication trend for the included assessments over time.
In the included studies, the number of studies investigating each psychometric property was as follows: n = 3 for structural validity, n = 34 for reliability, n = 4 for criterion validity, and n = 108 for hypothesis testing for construct validity. Among the seven psychometric properties investigated, measurement error and responsiveness were not explored by any study. Additionally, according to the COSMIN approach, no studies were deemed to meet the standards for establishing cross-cultural validity. Thus, this review focuses on reporting findings related to structural validity, reliability, criterion validity, and hypothesis testing for construct validity.
Assessing Psychometric Properties of the Assessment
Appendix C in the Supplemental Material provides details on the characteristics of the included studies and the results of the evaluation of individual studies.
Structural Validity of the Included Assessments
Structural validity refers to the extent to which the underlying structure of an assessment represents the theoretical construct it is intended to assess. Three studies used factor analysis: one with the original OTDL (Diehl et al., 1995), one with the Revised OTDL (OTDL–R; Diehl et al., 2005), and one with the Kitchen Task Assessment (KTA; Baum & Edwards, 1993). In the two studies with the original OTDL and OTDL–R, confirmatory factor analysis (CFA) suggested that a three-factor structure (food preparation, medication intake, and telephone use) was appropriate for assessing older adults’ abilities (Diehl et al., 1995, 2005). The original OTDL was rated as having adequate methodological quality (Diehl et al., 1995), and the OTDL–R was rated as having very good methodological quality (Diehl et al., 2005). The KTA showed a unidimensional structure, with one factor accounting for 84% of the variance in individuals with dementia (Baum & Edwards, 1993). However, the KTA was rated as having inadequate methodological quality because CFA was not performed. We did not evaluate the ratings of the results and quality of evidence for the original OTDL, OTDL–R, and KTA because of ambiguity in building hypotheses given the lack of an agreed-upon structure of functional cognition.
Reliability of the Included Assessments
Table 1 shows details of the reliability results. Reliability is examined in two ways: Test–retest reliability measures the consistency of results when the assessment is repeated over time, and interrater reliability examines the consistency of results across different administrators of the assessment. Ten studies examined test–retest reliability, and 32 examined interrater reliability. Only the hospital-based MET showed moderate evidence for sufficient reliability among people with dementia and acquired brain injury (ABI); it was downgraded one level from high to moderate evidence because of imprecision caused by an insufficient sample size (Table 1). Four studies evaluated the reliability of the hospital-based MET (Cuberos-Urbano et al., 2013; Knight et al., 2002; Lai, Dawson, et al., 2020; Morrison et al., 2013). One study investigated both its test–retest reliability and its interrater reliability with older adults with mild to moderate dementia (Lai, Dawson, et al., 2020), and 3 studies examined its interrater reliability with people with ABI (Cuberos-Urbano et al., 2013; Knight et al., 2002; Morrison et al., 2013). Lai, Dawson, et al. (2020) was rated as having doubtful methodological quality because of the short interim period (1 wk) between test and retest sessions and unclear details about whether participants’ abilities in performing the test task or conditions were stable during the retest period. Thus, the moderate evidence of sufficient reliability was mainly derived from the 3 studies that investigated interrater reliability in people with ABI (Cuberos-Urbano et al., 2013; Knight et al., 2002; Morrison et al., 2013). However, caution should be exercised when interpreting the results of the hospital-based MET because, although the various versions of the hospital-based MET share a certain basic format, they have varied characteristics, such as different errands and environments (Cuberos-Urbano et al., 2013; Knight et al., 2002; Lai, Dawson, et al., 2020; Morrison et al., 2013). Except for the hospital-based MET, a low level of evidence for reliability was noted for 2 assessments, and a very low level of evidence for reliability was reported for 15 assessments, for multiple reasons, such as limited methodological quality of the studies (risk of bias), inconsistency of the ratings of individual results, and insufficient sample size (imprecision). We were unable to evaluate the quality of evidence for 9 assessments because the studies did not calculate intraclass correlation coefficients (ICCs); instead, they used correlation coefficients, which do not capture absolute agreement, or κ scores, which are inappropriate for continuous variables.
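The distinction between ICCs and correlation coefficients matters because Pearson correlations capture association, not agreement. A small numeric sketch (ours, not drawn from any included study) makes this concrete: two raters whose scores are perfectly correlated but systematically offset yield a Pearson r of 1.0 but a much lower ICC(2,1), the two-way random-effects, absolute-agreement, single-rater form:

```python
# A sketch (ours) showing why a Pearson correlation can overstate interrater
# reliability relative to an ICC that captures absolute agreement.
import numpy as np

def icc_2_1(scores):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.
    scores: n_subjects x k_raters array."""
    scores = np.asarray(scores, dtype=float)
    n, k = scores.shape
    grand = scores.mean()
    row_means = scores.mean(axis=1)   # per-subject means
    col_means = scores.mean(axis=0)   # per-rater means
    ss_rows = k * ((row_means - grand) ** 2).sum()
    ss_cols = n * ((col_means - grand) ** 2).sum()
    ss_err = ((scores - grand) ** 2).sum() - ss_rows - ss_cols
    ms_r = ss_rows / (n - 1)
    ms_c = ss_cols / (k - 1)
    ms_e = ss_err / ((n - 1) * (k - 1))
    return (ms_r - ms_e) / (ms_r + (k - 1) * ms_e + k * (ms_c - ms_e) / n)

# Two raters: perfectly correlated scores that differ by a constant offset.
a = np.array([10, 12, 14, 16, 18], dtype=float)
b = a + 5
print(np.corrcoef(a, b)[0, 1])            # 1.0: perfect association
print(icc_2_1(np.column_stack([a, b])))   # ~0.44: poor absolute agreement
```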
Criterion Validity of the Included Assessments
We applied criterion validity, which refers to the degree to which an assessment is related to a gold standard (Mokkink, Prinsen, et al., 2018), when a short or modified version of an assessment tool was compared with the long or original version, because there is no gold standard for performance-based assessments of functional cognition. Four studies compared alternate or revised versions of assessments with their original versions: the Large ACLS (Kehrberg et al., 1993), the alternate form of the CTPA (Saa et al., 2017), the additional tasks of the EFPT (Hahn et al., 2014), and the Home-MET (Lai, Yan, & Yu, 2020). Studies investigating the ACLS (Kehrberg et al., 1993), CTPA (Saa et al., 2017), EFPT (Hahn et al., 2014), and MET (Clark et al., 2017; Lai, Yan, & Yu, 2020) demonstrated very good methodological quality. However, we did not assess the ratings of results and the GRADE level for criterion validity because the hypothesis suggested by the COSMIN approach, a correlation coefficient ≥ .7 (Mokkink, Prinsen, et al., 2018), seems inappropriate for the criterion validity of functional cognition assessments. Except for the Large ACLS, the modified versions of the assessments use different everyday activities or environments than the original assessments, which can strongly influence the correlation coefficients between them.
Hypothesis Testing for Construct Validity of the Included Assessments
A total of 108 studies examined construct validity, including 90 studies on convergent validity and 78 studies on discriminative validity. The results are presented according to high, moderate, and low quality of evidence, and each section includes the results for both convergent validity and discriminative validity. A summary of evidence for each psychometric property is presented in Table 1.
Assessments rated as having high evidence for construct validity.
High evidence for sufficient construct validity was found for the following assessments and populations: the original ACLS with psychiatric inpatients; the ACLS–5 with individuals with ABI, psychiatric inpatients, and those with substance use disorders; the Large ACLS with healthy elderly individuals and patients with stroke; the Large ACLS–5 with people with mild cognitive impairment (MCI); the KTA with people with dementia and schizophrenia; the Menu Task with older adults; the original OTDL with older adults; the Pillbox Test with people with dementia and neurological disorders; and the Sequential Daily Life Multitasking (SDLM) with people with dementia.
Of the versions of the ACLS, the original ACLS was rated as having sufficient construct validity with psychiatric inpatients (Leung & Man, 2007; Mayer, 1988; Secrest et al., 2000; Velligan et al., 1998). In contrast, this assessment was rated as having insufficient construct validity with individuals with ABI because of an unsatisfied hypothesis for convergent validity (Park & Lee, 2020). Similarly, the Large ACLS had high evidence for sufficient construct validity with healthy elderly individuals and patients with stroke. However, it was rated as inconsistent for construct validity with individuals with Alzheimer’s disease (AD) because one study did not meet our hypothesis for convergent validity, although the hypothesis for discriminative validity was satisfied with people with AD (Kehrberg et al., 1993). The Large ACLS–5 was rated as having high evidence for sufficient construct validity, including both convergent and discriminative validity, with very good methodological quality, with people with MCI (Wesson et al., 2017).
The KTA showed high evidence for sufficient construct validity with people with dementia and schizophrenia; two studies investigated its construct validity, mostly examining convergent validity, one with adequate methodological quality (Baum & Edwards, 1993) and one with very good methodological quality (Lipskaya-Velikovsky et al., 2015). The Menu Task had high evidence for sufficient construct validity with older adults but indeterminate construct validity with orthopedic patients; two studies examined the construct validity of the Menu Task with very good methodological quality, yielding sufficient convergent validity (Al-Heizan et al., 2020; Edwards et al., 2019). However, Edwards et al. (2019) showed insufficient discriminative validity because no significant differences were found between orthopedic patients and healthy control participants on Menu Task scores. The original OTDL demonstrated high evidence for sufficient construct validity with older adults but insufficient construct validity with individuals with HIV. All three studies that examined the construct validity of the original OTDL investigated convergent validity, two with adequate methodological quality (Diehl et al., 1995; Vance et al., 2013) and one with very good methodological quality (Sartori et al., 2012). Vance et al. (2013) was the only study that examined discriminative validity, but it showed nonsignificant differences between individuals with HIV and healthy control participants, resulting in insufficient discriminative validity and overall insufficient construct validity with individuals with HIV. The Pillbox Test had high evidence for sufficient construct validity on the basis of two studies: one with adequate methodological quality for sufficient discriminative validity with people with dementia (Logue et al., 2015) and one with very good methodological quality for both convergent and discriminative validity with people with neurological disorders (Zartman et al., 2013). Moreover, the SDLM demonstrated high evidence for sufficient construct validity for people with dementia, based on a study that showed very good methodological quality for both convergent and discriminative validity (Jarry et al., 2021).
Assessments rated as having moderate evidence for construct validity.
Moderate evidence was found for sufficient construct validity for the following assessments and populations: the AR with people with multiple sclerosis (MS) and traumatic brain injury (TBI); the ACLS–Expanded with psychiatric inpatients; the ACLS–Problem Solving with psychosocial patients; the Cognitive Effort Test (CET) with people with Parkinson’s disease (PD) and schizophrenia; the original CPT with people with dementia and older adults; the home-based CT with people with TBI; the original CT with people with ABI; the CTPA with people with PD and stroke; the Day Out Task (DOT) with older adults and people with MCI; the original EFPT with people with MS, ABI, and schizophrenia; the Everyday Multitasking Test (EM) with those with ABI and HIV; the Functional Cognitive Assessment Scale (FUCAS) with people with MCI; the HT with people with bipolar disorder, attention deficit hyperactivity disorder (ADHD), dementia, PD, and schizophrenia; the Kettle Test with stroke and geriatric patients; the Baycrest MET with people with ABI; the home-based MET (MET–Home) with people with stroke and MCI; the hospital-based MET with people with ABI and schizophrenia; the Multiple Object Test (MOT) with people with PD; the OTDL–R with older adults and people with schizophrenia, PD, and stroke; the original WCPA with older adults and people with MS and PD; the 10-item version of the WCPA (WCPA–10) with people with stroke and MCI; and the WCPA for university students with people with ADHD.
Of these assessments, four (the ACLS–Expanded, the ACLS–Problem Solving, the home-based CT, and the original CT) were downgraded from high to moderate evidence because of imprecision (i.e., insufficient sample size). Moderate evidence was found for 15 assessments (the AR, CET, CPT, CTPA, DOT, original EFPT, EM, Kettle Test, Baycrest MET, MET–Home, hospital-based MET, MOT, OTDL–R, original WCPA, and WCPA–10) because of inconsistency in the individual ratings of the results. The HT had mixed findings across populations: It was rated as having moderate evidence for sufficient construct validity with people with bipolar disorder, ADHD, dementia, PD, and schizophrenia but showed insufficient construct validity with those with MS and TBI. The HT was downgraded from high to moderate evidence because of inconsistency in the findings for discriminative validity. Additionally, the FUCAS and the WCPA for university students were downgraded from high to moderate evidence because each was supported by only one study without additional evidence: one of adequate methodological quality (Kounti et al., 2006) and one of very good methodological quality (Lahav et al., 2018), respectively.
Assessments rated as having low evidence for construct validity.
The Charge of Quarters Duty Task (CQ) with people with TBI; the Alternate EFPT (additional tasks) with people with stroke; the Spanish version of the EFPT with people with substance use disorders; the Swedish version of the EFPT with people with stroke; and the original Executive Secretarial Task with people with ABI were rated as having low evidence for sufficient construct validity. The CQ (Radomski et al., 2018), the Alternate EFPT (Hahn et al., 2014), and the Spanish version of the EFPT (Rojo-Mota et al., 2021) were downgraded from high to low evidence because of risk of bias (only one study had adequate methodological quality) and imprecision (insufficient sample size). The Swedish version of the EFPT was downgraded from high to low evidence because of an insufficient sample size (<50; Cederfeldt et al., 2011).
Assessments rated as having very low evidence for construct validity.
Very low evidence for sufficient construct validity was found for the EFPT–Enhanced with people with breast cancer; the EFPT–Brazilian version with people with stroke; the EFPT for Koreans with people with stroke; the Front Desk Duty with people with stroke; the MET–Contextualized Version with people with substance dependence; the Modified MET for Intellectual Disabilities with people with intellectual disabilities; and the Night Out Task with older adults. The original Kitchen Behavioral Scoring Scale with people with schizophrenia and the generic version of the MET with people with ABI were rated as having very low evidence for insufficient construct validity. These assessments were scored as having very low evidence for multiple reasons, such as poor methodological quality, inconsistency of result ratings, and imprecision because of insufficient sample size.
Visualization of convergent validity.
Of the 90 studies that investigated convergent validity, neuropsychological tests (n = 68; 75.6%) were the most commonly used comparators, followed by global cognition (n = 41; 45.6%), self-perceived cognition (n = 12; 13.3%), IADL performance (n = 11; 12.2%), and ADL performance (n = 7; 7.8%). Among the neuropsychological tests, executive function (n = 63; 70.0%), including working memory and executive control, was the most frequently studied domain. Figure 2 displays the relationships between performance-based assessments of functional cognition and the other measures used as comparators in the included studies for convergent validity.
Most studies demonstrated weak to low correlations between performance-based assessments of functional cognition and neuropsychological tests (i.e., of visuospatial functions, attention, working memory, executive control, memory, and language). Most studies noted low to moderate correlations between performance-based assessments of functional cognition and global cognition and ADL and IADL performance, with a few studies reporting high correlations. The box plots illustrate that performance-based assessments of functional cognition have a stronger relationship with ADL and IADL performance than with the isolated cognitive skills measured by neuropsychological tests. Most (70%) of the studies showed statistically nonsignificant relations between performance-based assessments of functional cognition and self-perceived cognition, with the remaining studies reporting weak to low correlations. Note that we did not address the populations in this figure because those included in the studies were heterogeneous. However, this enhances the application of our findings to occupational therapy practice, because it is common in practice to work across varied diagnostic groups. Appendix D in the Supplemental Material provides details on the assessments used for convergent validity and the number of studies included in or excluded from the box plots and bar chart.
Discussion
Overall Findings
The primary objective of this systematic review was to examine the psychometric properties of performance-based assessments of functional cognition in adults. We included 117 studies reporting on the psychometric properties of 47 assessments. Among these, we identified 3 studies of structural validity, 34 studies of reliability, 4 studies of criterion validity, and 108 studies of hypothesis testing for construct validity. None of the included studies investigated measurement error or responsiveness. Moreover, no studies satisfied the standards for establishing cross-cultural validity suggested by the COSMIN approach. The findings indicate that hypothesis testing for construct validity, including convergent and discriminative validity, has received the most attention compared with other types of psychometric properties. Our findings demonstrate that there is more evidence regarding the psychometric properties of performance-based assessments of functional cognition than was available for the previous review by Wesson et al. (2016). Moreover, new assessments, including the Menu Task and the EFPT–Enhanced, have been developed since that review was published (Wesson et al., 2016). Among the included assessments, the EFPT, MET, ACLS, OTDL, HT, and WCPA have gained substantial evidence since their development. However, most assessments still require additional studies to elucidate their psychometric properties. Even though previous reviews have consistently suggested the need for additional studies on the psychometric properties of these assessments (Poulin et al., 2013; Wesson et al., 2016), this gap does not seem to have been meaningfully addressed by more recent research. Limited research examining other psychometric properties, such as structural validity, reliability, and responsiveness, leaves a gap in the understanding of whether the assessments are trustworthy and useful within and across populations.
Content Validity
Among the psychometric properties, content validity is considered the most important property in the COSMIN approach (Mokkink, Prinsen, et al., 2018) because it measures how well the assessment’s content reflects the target construct. This includes the assessment’s relevance, comprehensiveness, and comprehensibility (Mokkink, Prinsen, et al., 2018). In Part 1 of this review (Lee et al., 2025), we identified indicators of functional cognitive abilities in each assessment, supporting that the 25 included assessments aim to measure functional cognition. However, we must note that we were unable to evaluate content validity using the COSMIN approach because none of the included studies investigated the relevance, comprehensiveness, and comprehensibility of the assessments as defined by the COSMIN approach. To address this issue, we recommend that the COSMIN guidelines be used in designing future studies to effectively measure the content validity of an assessment by including both occupational therapy practitioners and the recipients of occupational therapy (Mokkink, Prinsen, et al., 2018). In addition, there are currently no criteria for how comprehensive an assessment needs to be to optimally measure functional cognition. Therefore, we suggest future studies include content experts to develop conceptual frameworks for how to define and evaluate the comprehensiveness of performance-based assessments of functional cognition.
Structural Validity and Internal Consistency
We identified two studies, one of the original OTDL (Diehl et al., 1995) and one of the OTDL–R (Diehl et al., 2005), that investigated structural validity using CFA and identified three factors (food preparation, medication intake, and telephone use). However, a factor structure defined by different everyday activities may not be the proper structure for performance-based assessments of functional cognition (Giles et al., 2022). Because functional cognition involves integrated cognitive skills supported by multiple underlying constructs, the items of a functional cognition assessment by nature do not need to be strongly interrelated. Similarly, Giles et al. (2022) suggested that internal consistency may not be a relevant psychometric property of performance-based assessments of functional cognition because functional cognition does not have a unidimensional structure. Therefore, we did not evaluate internal consistency in this review. In this regard, we suggest building a conceptual framework of the structure of functional cognition to illustrate how the structural validity of these assessments should differ from that of assessments measuring isolated cognitive skills (e.g., neuropsychological tests).
Reliability
Our findings suggest that only one assessment, the hospital-based MET, showed moderate evidence for sufficient reliability, particularly interrater reliability, with people with dementia and ABI. However, considering that the published versions of the hospital-based MET have different characteristics, such as errands and environments, even though they share a basic format (e.g., performed in a hospital), this finding needs to be carefully interpreted. The remaining assessments were rated as having very low or low quality of evidence, or the quality of evidence was not assessed because of indeterminate ratings of the results. The lack of evidence for reliability may stem from methodological challenges caused by the nature of performance-based assessments, such as learning effects of the tasks, which limit test–retest reliability, or relatively long administration times. Learning effects were observed for the AR (Goverover & DeLuca, 2018) but not for the WCPA (Holmqvist et al., 2020) or the CPT (Burns et al., 1994). However, to our knowledge, no other studies have investigated learning effects. Giles et al. (2022) proposed that, in general, many performance-based assessments of functional cognition require the test taker to be naive to the test; therefore, low test–retest reliability may be expected. Giles et al. (2022) suggested that video recording may be a good approach for testing interrater reliability because it can compensate for the learning effects that occur when the test taker becomes familiar with the test, as well as for long administration times. This approach also allows multiple raters to assess the same performance (Giles et al., 2022). Additionally, we recommend developing simplified administration or scoring guidelines as another approach to enhance the reliability of these assessments. With more studies on reliability, these assessments could be used more consistently and reliably.
Hypothesis Testing for Construct Validity
Our findings suggest that nine assessments have high evidence for sufficient construct validity. Compared with the Wesson et al. (2016) review, which included studies up to June 2015, we found more evidence for construct validity, especially for the ACLS and the original OTDL. Additionally, we found other assessments, such as the original EFPT, the WCPA (original, 10-item, and university student versions), and the AR, that have been frequently used and have moderate evidence for sufficient construct validity with various populations. Considering the contents of the assessments described in Part 1, these assessments can be considered valid for use in occupational therapy practice and research for measuring functional cognition.
Convergent Validity With Neuropsychological Tests
Most studies found weak to low correlations of performance-based assessments of functional cognition with neuropsychological measures of attention, working memory, executive control, memory, and language. This finding suggests that although discrete cognitive skills may contribute to functional cognition to some extent, neuropsychological tests and performance-based assessments of functional cognition do not measure the same construct. Performance-based assessments of functional cognition measure the integration of cognitive skills in the performance of everyday activities, whereas neuropsychological tests measure discrete cognitive skills. This interpretation is supported by Baum et al.’s (2023) factor analysis with a large clinical sample showing that the latent construct measured by the EFPT is a cognitive construct (i.e., functional cognition) distinct from the fluid and crystallized cognition measured using neuropsychological tests. To the best of our knowledge, this is the first attempt to visually synthesize correlation coefficients between performance-based assessments of functional cognition and neuropsychological tests. This synthesized evidence may strengthen the argument that performance-based assessments of functional cognition are needed to compensate for the limitations of neuropsychological tests.
Additionally, even though neuropsychological tests may not be the best way to understand the convergent validity of performance-based assessments of functional cognition, scientists have consistently used them as comparators for this purpose. Thus, because the purpose of this review was to summarize and synthesize the existing evidence, we needed to include these comparisons in the section on convergent validity. We would like to highlight that other methods of determining validity, such as discriminative validity, may be better for evaluating the validity of functional cognition assessments. For example, previous studies used the Menu Task to divide participants into groups with impaired and unimpaired functional cognition and then tested whether the WCPA (Al-Heizan et al., 2022) or neuropsychological tests (Edwards et al., 2019) significantly discriminated between the two groups.
Convergent Validity With Self-Perceived Cognition Measured Using Questionnaires
Scientists have attempted to use self-report questionnaires to measure cognitive ability as experienced in real life. For example, Burgess et al. (1998) used the self-report Dysexecutive Questionnaire (DEX) as an assessment of everyday executive functioning to examine whether a performance-based assessment of functional cognition (the MET) predicts everyday executive functioning better than neuropsychological tests. Similarly, our review found that the included studies used self-report questionnaires of executive functioning, such as the DEX or the Behavior Rating Inventory of Executive Function–Adult Version, to explore the associations between performance-based assessments of functional cognition and self-perceived cognition in real life. Our findings indicate that most (70%) of the studies showed statistically nonsignificant relations between performance-based assessments of functional cognition and self-report questionnaires of cognition. This finding suggests that performance-based assessments of functional cognition may compensate for self-perceived cognition by providing information on what clients can actually do, which is distinct from how clients perceive their cognitive abilities. However, caution should be exercised because self-perceived cognition can be influenced by several factors, including emotional status (Hill et al., 2016) and poor self-awareness of cognitive ability (Fragkiadaki et al., 2016). Thus, similar to neuropsychological tests, self-report questionnaires of cognition may not be the best way of assessing the convergent validity of performance-based assessments of functional cognition.
Convergent Validity With ADL and IADL Assessments
This review shows that performance-based assessments of functional cognition may have a stronger relationship with ADLs and IADLs than with neuropsychological tests of discrete cognitive skills. This stronger relationship may be due to a closer alignment in the demands of performance-based assessments of functional cognition and ADL and IADL performance, namely, the need for the integration of cognitive skills during the performance of everyday activities (AOTA, 2019; Giles et al., 2017, 2020, 2024).
Responsiveness
Although all psychometric properties are important, the lack of studies examining responsiveness to change over time in response to intervention or health status (including the establishment of a minimal clinically important difference) and clinical utility is particularly concerning. Without an understanding of responsiveness, occupational therapy practitioners and scientists may face limitations in measuring changes in clinical practice or trials, as well as in determining clinical significance. To better understand responsiveness to intervention, it is important to use appropriate assessments at multiple stages, from screening to monitoring (Fuchs & Vaughn, 2012). It is promising that occupational therapy professionals have developed the Menu Task for screening functional cognition (Al-Heizan et al., 2020; Edwards et al., 2019), opening up the possibility of investigating responsiveness at multiple stages. However, more studies are required to better measure responsiveness using performance-based assessments of functional cognition. In addition, responsiveness is related to our earlier discussion regarding an assessment’s learning effects. Because performance-based assessments of functional cognition involve performing everyday activities, a test taker can learn how to perform the test tasks during the first trial, which is likely to influence performance on the second trial. Thus, more studies are required to evaluate whether performance-based assessments of functional cognition have significant learning effects, potentially complicating the interpretation of their responsiveness to change because of intervention or recovery.
Predictive Validity
Similarly, although we did not examine predictive validity separately in this study, it is critical to investigate the predictive validity of performance-based assessments of functional cognition. The theoretical rationale for using these assessments is to better predict cognitive ability in the real world. In our results, we have identified the use of neuropsychological tests, ADL and IADL assessments, and self-report questionnaires of cognition to determine convergent validity. Among these, relationships with ADL and IADL assessments or self-report questionnaires of cognition experienced in real life may support predictive validity to some extent because they may reflect real-world functioning. However, there is a lack of consensus regarding the assessments and constructs that should be used for comparison in analyses of predictive validity.
We suggest that performance of ADLs and IADLs and participation, defined as involvement in life situations (World Health Organization [WHO], 2001), may be the most appropriate outcomes for measures of functional cognition to predict real-world functioning. The construct measured by performance-based assessments aligns more with the concepts of activity and participation in the International Classification of Functioning, Disability and Health (WHO, 2001) than with impairment level, referring to the disorder the person has (Chan et al., 2008). Investigating predictive validity with ADL and IADL performance and participation in everyday activities as comparators can reveal whether individuals who score higher on performance-based assessments of functional cognition can actively participate more in real-world daily occupations. For example, a study revealed that the medication task of the EFPT better predicts poor participation in daily activities than neuropsychological tests (Arieli et al., 2022). Similarly, accuracy on the WCPA was significantly associated with participation in social roles and activities, but fluid and crystallized cognition were not (Foster et al., 2022). Thus, conducting more studies using activity and participation measures may provide better evidence for the predictive validity of performance-based assessments of functional cognition.
Limitations
The registered protocol for this review included plans to perform meta-analysis to quantitatively synthesize the findings; however, we were unable to do so because of the heterogeneity of the included assessments and the lack of necessary data. Still, we visually displayed the results, which was the first attempt to synthesize the findings for convergent validity. However, we strongly recommend that scientists report correlation coefficients, means and standard deviations, and exact p values in their studies to support meta-analytic efforts for data synthesis. Moreover, because of the inherent limitations of Figure 2, we were unable to distinguish the multiple versions and types of assessments as well as the populations. This limitation can be addressed through meta-regression analysis. Therefore, comprehensive reporting of the data will contribute to future meta-analyses and further strengthen the evidence for performance-based assessments of functional cognition. Last, we did not use the published search filter for psychometric properties so that we could include any studies that might not have used terminology related to psychometric properties but still investigated assessments’ psychometric properties. We note that readers should consider this search process when interpreting the findings of this study.
Implications for Occupational Therapy Practice
The concept of functional cognition and performance-based assessments of functional cognition are crucial for enhancing the expertise of occupational therapy professionals, because they enable these professionals to address integrated cognitive skills in the real world. Our systematic review has provided information on the psychometric quality of performance-based assessments of functional cognition. This information can guide occupational therapy professionals in selecting appropriate assessments with demonstrated construct validity for their research and clinical practice. However, our findings reveal that existing studies have mostly focused on construct validity and have provided limited evidence on the other psychometric properties of these assessments. Future studies are required to further support the valid and reliable use of performance-based assessments of functional cognition in the field of occupational therapy. Moreover, we strongly encourage collaboration among occupational therapy scientists, practitioners, educators, and policymakers to disseminate and facilitate the implementation of performance-based assessments of functional cognition.
Conclusion
Our findings indicate that certain assessments have some supportive evidence regarding their construct validity. However, additional studies are required to ascertain the psychometric properties of performance-based assessments of functional cognition. Furthermore, it is important to have consensus on conceptual frameworks of functional cognition to evaluate content validity, structural validity, and internal consistency.
Acknowledgments
We appreciate the librarian who helped create search terms. We are thankful to Washington University in St. Louis for providing technical support.