We used the Safe Driving Behavior Measure (SDBM) to determine rater reliability and rater effects (erratic responses, severity, leniency) in three rater groups: 80 older drivers (mean age = 73.26, standard deviation = 5.30), 80 family members or caregivers (age range = 20–85 yr), and two driving evaluators. Rater agreement was significant only between the evaluators and the family members or caregivers. Participants rated driving ability without erratic effects. We observed an overall rater effect only between the evaluator and family members or caregivers, with the evaluators being the more severe rater group. Training family members or caregivers to rate driving behaviors more consistently with the evaluator’s ratings may enhance the SDBM’s usability and provide a role for occupational therapists to interpret proxy reports as an entry point for logical and efficient driving safety interventions.
Statistics have shown that older drivers’ motor vehicle crash, injury, and fatality rates continue to be a concern for occupational therapists because of this population’s current and future growth. Accurate and precise measurement of older adults’ unsafe driving behaviors is an essential first step in curtailing crashes and preventing adverse effects such as injuries and fatalities (Classen et al., 2010). The comprehensive driving evaluation (clinical tests and an on-road test conducted by a driving rehabilitation specialist), the gold standard measure for driving evaluation, is highly valid and reliable but has limitations such as being time consuming, providing limited access, and holding an element of threat (mandatory or ethical reporting with driver failure). Self-report can be used, complementary to other forms of assessment, to identify older adults’ unsafe driving behaviors, increase driving safety awareness and knowledge, and promote behavior change and safer driving outcomes (Eby, Molnar, Shope, Vivoda, & Fordyce, 2003; McGee & Tuokko, 2003).
Advances have been made in developing self-report measures for older drivers. Such tools include Drivers 55 Plus: Check Your Own Performance (AAA Foundation for Traffic Safety, 1994) and the computer-based Roadwise Review (American Automobile Association, 2004). These measures have strengths and limitations. For example, the computer-based Roadwise Review has great face validity, but it takes approximately 40 minutes to complete and may be challenging to use with older adults with low computer fluency. Although these two self-report measures provide meaningful descriptions of a driver’s ability level, they emphasize person and environment factors, with a gap in items addressing the vehicle. Understanding a participant’s level of safe driving behaviors is a critical step toward providing an entry point for logical and efficient occupational therapy interventions, identifying optimal training parameters, and predicting future safe driving ability.
In ongoing work, we have developed a self-report Safe Driving Behavior Measure (SDBM; Classen et al., 2010; Winter et al., 2011). We have tested it among 80 older drivers, 80 family members and caregivers (F–C), and two driving evaluators, and we have conducted psychometric analyses (Classen et al., in press). Findings from the pilot work encouraged us to further refine the SDBM as a precise and accurate measure for detecting safe driving behaviors among older adults. As such, the objective of this study was to quantify the rater reliability and rater effects, using Item Response Theory (IRT), among three rater groups (older drivers, F–C, driving evaluators).
Interrater reliability is defined as the extent to which different raters agree on the same people or characteristics. The terms interrater reliability, rater agreement, and rater correlation are often used interchangeably. Two studies investigated the relationship of driving performance rated by evaluators and older drivers (Marottoli & Richardson, 1998; Wild & Cotrell, 2003). Marottoli and Richardson (1998) investigated the relationship between on-road driving performance rated by evaluators and self-reported driving ability rated by older drivers using the Pearson correlation coefficient. They did not find a significant association between the ratings of these two groups. Wild and Cotrell (2003) investigated the differences between evaluators’ ratings and older drivers’ ratings on the Driving Safety Evaluation scale using t tests. They found that only 2 of 10 items showed significant differences between evaluators’ ratings and older drivers’ ratings.
Neither the Pearson correlation coefficient nor the t-test statistic can accurately determine the potential rater effects. Although the Pearson correlation coefficient can indicate the strength of the relationship between two sets of data (the concordance of the data), it cannot detect whether the value of one set of data is consistently greater than the value of the other, which might indicate that one rater is more severe or lenient than the other. The t-test statistic detects the significant difference of the means of two sets of data; however, using the mean may partial out the individual differences that exist within the rater group. Moreover, the Pearson correlation coefficient and t-test statistic cannot provide information regarding the response pattern, that is, whether someone responds to the items erratically (i.e., rating inconsistently). Thus, although the Pearson correlation coefficient and t-test statistic provide necessary methods for assessing rater agreement, they are not sufficient to make an accurate determination of rater effects. Examining rater effects is critical, especially when people will be reporting on safety aspects of driving.
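The limitation of the Pearson correlation coefficient described above can be illustrated with a minimal sketch. The ratings below are hypothetical (they are not study data), and the helper name `pearson_r` is introduced only for this illustration. One rater scores every driver exactly one category lower than the other, yet the correlation remains perfect.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length rating lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical ratings of six drivers on a 1-5 scale (not study data).
lenient_rater = [5, 4, 5, 3, 4, 5]
severe_rater = [rating - 1 for rating in lenient_rater]  # always one category lower

r = pearson_r(lenient_rater, severe_rater)
mean_gap = sum(a - b for a, b in zip(lenient_rater, severe_rater)) / len(lenient_rater)
print(f"Pearson r = {r:.2f}, mean rating gap = {mean_gap:.2f}")
```

In this constructed case the correlation gives no hint of the one-category gap, which is the kind of systematic severity difference that a rater-effects analysis is meant to expose.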
Rater effects are a function of severity or leniency, defined as the tendency for a rater to assign ratings consistently higher or lower than those of other raters (Myford & Wolfe, 2004). In addition to having a tendency to assign higher or lower ratings, raters may also assign ratings erratically (erratic response pattern); that is, the raters inappropriately assign low scores (cannot do) to drivers with a higher ability level or high scores (no difficulty) to drivers with a lower ability level. The Many Facets Rasch Model (MFRM; Linacre, 2004), an extension of the Rasch model, is useful in investigating rater severity and response patterns. The Rasch model, a one-parameter IRT model, converts ordinal scales into interval measures (using logit as its unit) and provides a useful, efficient, and objective framework for developing, evaluating, and revising measures.
Five published driving studies applied Rasch analysis to develop or evaluate driving scales (Kay, Bundy, & Clemson, 2008, 2009; Myers, Paradis, & Blanchard, 2008; Patomella, Kottorp, & Tham, 2008; Patomella, Tham, & Kottorp, 2006). Patomella and colleagues (2006) first applied Rasch analysis to examine the Performance Analysis of Driving Ability (P-Drive) using a driving simulator with 31 people with brain injury; they later (Patomella et al., 2008) used Rasch analysis to evaluate the P-Drive with 101 people with stroke. Kay et al. (2008) applied Rasch analysis to a standard on-road test to transform the on-road test into a linear interval measure with hierarchical ordered tasks. Myers and colleagues (2008) also used a Rasch model to examine the structure of a scale assessing driving confidence. Most recently, Kay and colleagues (2009) applied Rasch analysis to a simulated test rated by trained professionals and an awareness test to investigate the construct validity and internal reliability of the simulated test and awareness test. Although we have seen an increased application of Rasch analysis in developing and evaluating assessments, no driving-related published study has yet applied the Rasch model to assess rater effects.
Beyond estimating item difficulties and person abilities, the MFRM includes an additional parameter, the rater, to detect whether the response differences are caused by systematic rater severity or leniency. Moreover, by fitting data to the Rasch model, the MFRM can detect the erratic raters.
Rater effect is particularly important in our field of study as we develop an older driver and proxy self-report tool, the SDBM. When comparing older drivers’ self-reports with F–C reports or driver evaluator reports, we anticipate a discrepancy. That is, we expect that older drivers may be the least severe in their self-ratings (e.g., do not want to lose their license), and evaluators may be the most severe in their ratings (e.g., are trained to focus on deficits). The F–C may be somewhat in the middle with their ratings of their loved one’s driving safety; some may be overly severe (i.e., really want the driver to stop driving), and some may be less severe (i.e., do not want them to lose their means of transportation).
In this study, we addressed interrater reliability among three groups of raters (older driver, F–C, and driving evaluators) and investigated the rater effects between two of the groups (F–C and driving evaluators) on the SDBM’s 41 items. We expected our findings to provide the evidence to use the self- or proxy-report SDBM as a reliable measure of safe driving among older adults, their F–C, and occupational therapists conducting such evaluations.
This study received approval from the institutional review boards of the University of Florida and Lakehead University.
A convenience sample of older drivers and F–C was recruited from north Florida and Thunder Bay, Ontario. All participants completed an SDBM. The older drivers underwent an on-road driving evaluation conducted by trained driving evaluators, who also completed an SDBM after the on-road test.
We recruited participants in north Florida, United States, and Ontario, Canada, by means of advertisements in newspapers, word-of-mouth referrals, and flyers distributed to local community facilities and through an aging registry. Participants were included if they were older drivers (ages 65–85 yr); had a valid driver’s license; were driving at the time of recruitment; had the cognitive ability to complete the SDBM, as evidenced during the telephone interview; and had the ability to participate in an on-road driving test (a behind-the-wheel test in a dual-brake vehicle, with the driving rehabilitation specialist using a standardized scoring sheet to evaluate driving performance; Justiss, Mann, Stav, & Velozo, 2006), as evidenced by not having missing limbs or a severe psychiatric diagnosis. The participants generally met the inclusion criteria. Participants were excluded if they had been medically advised not to drive, had experienced uncontrolled seizures in the past year, or took medications that caused central nervous system impairment. F–C between ages 18 and 85 yr were included if they were able to report (on the basis of observation) on the older adult’s driving behavior and excluded if they showed the presence of a physical or mental condition that impaired their ability to make an active contribution. At the primary site, the certified driving rehabilitation specialist (Desiree Lanford) with 7 years of clinical practice experience collected the data. At the Canadian site, the driving evaluator, who was a driving instructor accredited by the Province of Ontario with >10 yr of experience, completed the on-road test and evaluator SDBM. Thus, the rater groups were older drivers, F–C, and driving evaluators.
We standardized the SDBM and clinical test administration, as well as the on-road driving evaluation test, across sites by (1) using a set testing protocol for the two sites, (2) having the driving evaluator at the primary site (Florida) conduct a 3-day training session with the evaluator at the Canadian site, and (3) ensuring 100% congruence between the two on-road driving evaluators by using a 4-point scale to rate the driving of three healthy volunteers. All older drivers and their F–C provided consent in a private research office before completing demographic information and the SDBM, which was part of a larger battery of clinical and on-road assessments (Classen et al., 2008; Stav, Justiss, McCarthy, Mann, & Lanford, 2008). The two evaluators, who were blinded to the participants’ SDBM self-ratings or proxy ratings, also completed an SDBM on each driver after the on-road test. All participants received $50 for their study participation.
The SDBM is a 68-item self-report or proxy measure to assess safe driving behaviors (Classen et al., 2010; Winter et al., 2011). The measure score (derived from Rasch analysis) represents the reported level of difficulty for the items given the participant’s ability level. Difficulty with the driving task is rated via a 5-point adjectival scale (ranging from 1 = cannot do to 5 = not difficult). The SDBM items are displayed in the Appendix.
Data Collection and Management
All the data (SDBM, demographic information, scores on the clinical tests and the on-road tests) of the older drivers and F–C were collected and recorded by research assistants in a central, secure, and password-protected data repository, which was located at the primary site, the University of Florida. Data entry was monitored by Sherrilene Classen, the principal investigator, and quality control spot checks and corrections were made intermittently to ensure data completion and accuracy.
Item Inclusion and Exclusion.
We excluded 27 items from the analysis, 22 items that were not observable by the driving evaluator at the time of testing (e.g., driving in snow) and 5 items that added little or no variance to the responses. For example, >95% of respondents used the same rating category (i.e., not difficult) for 5 items.
We conducted an intraclass correlation (ICC) to examine rater reliability on the 41 remaining items. We used SPSS version 17.0 (SPSS, Inc., Chicago) for the analyses and considered p ≤ .05 significant for the correlations.
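To make the ICC computation concrete, the following is a minimal sketch of one common ICC form: the single-measure, two-way random-effects coefficient for absolute agreement. The text does not state which ICC form was selected in SPSS, so the choice of this form is an assumption, and the function name and ratings are hypothetical.

```python
def icc_absolute_agreement(ratings):
    """Single-measure, two-way random-effects ICC for absolute agreement.

    `ratings` has one row per rated driver and one column per rater.
    The choice of this ICC form is an assumption made for illustration.
    """
    n = len(ratings)       # number of rated drivers
    k = len(ratings[0])    # number of raters
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares: drivers (rows), raters (columns), error.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )

# Hypothetical ratings (not study data): four drivers, two raters.
identical = [[5, 5], [4, 4], [3, 3], [2, 2]]
shifted = [[5, 4], [4, 3], [3, 2], [2, 1]]  # second rater one category lower

print(icc_absolute_agreement(identical))
print(icc_absolute_agreement(shifted))
```

Under this form, a rater who is consistently one category lower reduces the coefficient even though the rank ordering of drivers is identical, so the ICC is sensitive to a systematic offset but does not by itself say which rater is more severe.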
We used the MFRM to analyze rater effects using Facets software version 3.57 (Linacre, 2004). The MFRM extends the rating-scale Rasch model by adding one component or facet (Cj) to calibrate rater severity:
$$\log\left(\frac{P_{nik}}{P_{ni(k-1)}}\right) = B_n - D_{gi} - F_{gk} - C_j$$
where Pnik is the probability of observing category k for person n who answers item i; Pni(k − 1) is the probability of observing category k − 1; Bn is person ability; Dgi is item difficulty for item i in group g; Fgk is the difficulty of being observed in category k relative to category k − 1 for an item in group g; and Cj is the severity of judge j, who gives rating k to person n on item i.
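The following minimal sketch shows how the terms defined above combine into category probabilities. It assumes the standard adjacent-category form of the model, log(Pnik / Pni(k − 1)) = Bn − Dgi − Fgk − Cj. All logit values and function names are hypothetical and are not estimates from this study.

```python
from math import exp, log

def mfrm_category_probabilities(ability, difficulty, thresholds, severity):
    """Category probabilities under log(P_k / P_(k-1)) = B - D - F_k - C.

    `thresholds` holds one step difficulty F_k for each step above the
    lowest category, so a 5-point scale needs four thresholds.
    """
    cumulative = [0.0]  # log-numerator of the lowest category
    for step in thresholds:
        cumulative.append(cumulative[-1] + ability - difficulty - step - severity)
    numerators = [exp(value) for value in cumulative]
    total = sum(numerators)
    return [value / total for value in numerators]

def expected_rating(probabilities, lowest_category=1):
    """Model-expected rating given the category probabilities."""
    return sum((lowest_category + i) * p for i, p in enumerate(probabilities))

# Hypothetical logit values, chosen only for illustration.
thresholds = [-1.5, -0.5, 0.5, 1.5]
lenient = mfrm_category_probabilities(2.0, 0.0, thresholds, severity=-0.3)
severe = mfrm_category_probabilities(2.0, 0.0, thresholds, severity=0.3)

print(round(expected_rating(lenient), 2), round(expected_rating(severe), 2))
```

For the same driver and item, raising the severity term lowers the expected rating, which is how the model separates a rater's severity from the driver's ability and the item's difficulty.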
We used Facet ruler, fit statistics, fixed χ2, and paired comparisons to investigate the rater effects. Facet ruler, displaying three facets (rater, item difficulty, person ability) in the same linear continuum, provides a visual map to compare the relative hierarchy within and between facets. To illustrate the relative distribution of the drivers’ abilities and item difficulties simultaneously, we anchored the mean of the rater severity to 0.
We used fit statistics (infit mean squares [MnSqs] and outfit MnSqs) to detect erratic raters, that is, raters who assign high scores to drivers with a low ability level and low scores to drivers with a high ability level. Infit statistics are more responsive to the variance of those well-targeted observations, and outfit statistics are sensitive to the variance of outliers or extreme observations. Ideal fit occurs when the observed response patterns exactly match the predicted pattern (MnSq = 1) of the model. We considered infit MnSq and outfit MnSq ranging from 0.6 to 1.4 an adequate fit for survey data (Bond & Fox, 2001). The measure represents the average ratings of the rater in logits, with higher scores indicating greater severity in rating.
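As a rough sketch of how the two fit statistics described above differ, the code below computes infit and outfit mean squares from observed ratings together with the model-expected rating and model variance of each observation. In practice these expectations come from the Rasch software. The numbers and function names here are hypothetical.

```python
def fit_mean_squares(observed, expected, variances):
    """Return (infit MnSq, outfit MnSq) for one rater's observations.

    `expected` and `variances` are the model-implied mean and variance
    of each rating, as a Rasch program would supply them.
    """
    squared_residuals = [(x - e) ** 2 for x, e in zip(observed, expected)]
    # Outfit: unweighted mean of squared standardized residuals,
    # so a few extreme surprises can dominate.
    outfit = sum(r / v for r, v in zip(squared_residuals, variances)) / len(observed)
    # Infit: information-weighted, so well-targeted ratings dominate.
    infit = sum(squared_residuals) / sum(variances)
    return infit, outfit

def is_adequate_fit(mnsq, lower=0.6, upper=1.4):
    """Apply the 0.6-1.4 range used in the text for survey data."""
    return lower <= mnsq <= upper

# Hypothetical values: residuals as large as the model predicts.
consistent = fit_mean_squares([3, 5], [4.0, 4.0], [1.0, 1.0])
# Hypothetical values: low ratings where high ones are expected, and vice versa.
erratic = fit_mean_squares([1, 5], [4.5, 1.5], [0.5, 0.5])

print(consistent, erratic)
```

A rater whose residuals match the model's predicted variance lands at a mean square of 1, whereas a rater who gives low scores to able drivers and high scores to less able drivers produces mean squares far above the upper bound.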
We used fixed χ2 to examine whether at least one rater group, on the overall scale level, consistently used the ratings differently from other rater groups. For example, if three rater groups are tested, a significant fixed χ2 statistic means that at least one of these three rater groups is more severe or lenient in their ratings on the overall scale.
If the fixed χ2 test was significant, we then performed paired comparisons to identify which rater group was significantly more severe or lenient in its ratings and to show which items the raters rated significantly more severely or leniently. We used an α level of .05 to determine a significant rater effect.
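The logic of the fixed χ2 test can be sketched as a homogeneity test on the rater severity measures, each weighted by the inverse of its squared standard error. This is a common form of the statistic. Whether Facets 3.57 computes it in exactly this way is an assumption, and the measures, standard errors, and function name below are hypothetical.

```python
def fixed_chi_square(measures, standard_errors):
    """Fixed (all-same) chi-square for a set of rater severity measures.

    Tests whether all rater groups could share one severity once
    measurement error is allowed for; df = number of groups - 1.
    """
    weights = [1.0 / se ** 2 for se in standard_errors]
    weighted_sum = sum(w * m for w, m in zip(weights, measures))
    weighted_squares = sum(w * m ** 2 for w, m in zip(weights, measures))
    chi_square = weighted_squares - weighted_sum ** 2 / sum(weights)
    return chi_square, len(measures) - 1

# Hypothetical severity measures (logits) and standard errors for two groups.
chi_square, df = fixed_chi_square([0.40, -0.20], [0.10, 0.10])

# 3.841 is the chi-square critical value for df = 1 at alpha = .05.
print(chi_square, df, chi_square > 3.841)
```

With two rater groups the statistic reduces to the squared difference between the two measures divided by the sum of their squared standard errors, so a significant value indicates only that the groups differ in severity; the paired comparisons then show where.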
Table 1 presents the driver and F–C demographics. We tested 80 licensed drivers with a mean age of 73.26 yr (standard deviation [SD] = 5.30); gender was evenly distributed, and most drivers were White. They were an educated and healthy older community-dwelling group. We also tested 80 F–C (age range = 20–85 yr), all of whom were community dwelling; they were mainly female and White, and most lived with a partner or spouse.
The ICC among the ratings of the three rater groups was significant but weak (ICC = .256, p < .001, 95% confidence interval [CI] = .118–.403). In pairwise comparisons across the 41 items, we found a significant correlation only between the ratings of the evaluator and F–C groups (ICC = .462, p < .001, 95% CI = .271–.618). We observed no significant correlations between the ratings of the older driver and F–C groups (ICC = .127, p = .129) or between the older driver and evaluator groups (ICC = .088, p = .217).
Facet Ruler of the SDBM.
Figure 1 depicts three facets (raters, drivers, items) on the linear interval scale for the SDBM. The first column, Measure, is the interval scale expressed as a logit unit. The second column displays the severity of raters, representing, from bottom to top, lenient to severe raters. The third column shows the distribution of the drivers’ safe driving ability, from bottom to top, representing the drivers with safe driving abilities ranging from poor to good. The fourth column displays item difficulties representing, from bottom to top, the essentially easy items and then progressing to levels of increasing difficulty. The fifth column shows the likelihood of applying the rating scale in relation to the drivers’ abilities; that is, when a driver’s estimated ability is between 1 and 2 logits, he or she will likely receive a rating of 4 on this measure. In the second column, the driving evaluator is located above the caregiver, indicating that the driving evaluator is the more severe rater. The distribution of the drivers’ abilities was on the upper part of the ruler, as displayed in the third column, and the distribution of the item difficulties was on the lower part of the ruler, as displayed in the fourth column. This distribution indicated that the drivers had, generally speaking, high safe driving abilities as measured on this 41-item scale.
Fit Statistics of the Rater Groups.
The fixed χ2 value, 166.9 with one degree of freedom, was statistically significant (p < .001). The ratings between the F–C and evaluator groups showed significant rater effects: The evaluator group was more severe, considering that the measure of the evaluator group is higher (mean = −3.32, SD = 0.03) than that of the F–C group (mean = −3.98, SD = 0.04).
In this study, we addressed interrater reliability among three groups of raters (older drivers, F–C, and driving evaluators) and investigated the rater effects between the evaluators and the F–C to identify erratic responses and to determine the severity or leniency of the groups’ ratings on the 41 items of the SDBM.
We found no statistically significant correlation between the ratings of the driver and evaluator groups or between the ratings of the driver and F–C groups. However, we found significant, moderate agreement (ICC = .462) between the evaluator and F–C groups. Two studies have previously investigated the relationship of driving performance rated by evaluators and older drivers (Marottoli & Richardson, 1998; Wild & Cotrell, 2003) and found no significant correlation between the evaluators’ rating and the drivers’ rating (Marottoli & Richardson, 1998) and no significant difference for 8 of 10 items rating the drivers’ driving performance (Wild & Cotrell, 2003). Our study’s findings are therefore somewhat consistent with these two studies in that the evaluators’ ratings were not associated with the drivers’ ratings, but they were correlated with the F–C’s ratings.
Facet Ruler of the SDBM.
The distribution of the drivers’ ability relative to the distribution of the items’ difficulty indicates that the participants in this study performed well on the instrument. As can be seen in Figure 1, many of our items are on the same logit level. Taking into account that only the means of the items are represented, we have more overlap among the items because each item consists of five difficulty levels corresponding to a 5-point adjectival scale. Having different items at the same difficulty level in the item pool may be redundant for paper-and-pencil tests; however, it will increase the item pool, which will in turn provide more choices for future applications, such as using computer adaptive testing (the next step in the development of our instrument).
Fit Statistics of the Rater Groups.
The fit statistics across the rater groups (evaluators and F–C) showed that no rater group was erratic and that overall the evaluators were more severe raters (Facets) than the F–C.
Fixed χ2 and Paired Comparisons.
Although the evaluator group rated more severely than the F–C group on the overall scale, the item-level results were mixed: the F–C group rated 10 items more severely than did the evaluator group, whereas the evaluator group rated 7 items more severely than did the F–C group. Evaluators have the formal training to rate driving behaviors according to the standards of regulatory bodies such as the Department of Motor Vehicle and Highway Safety licensure guidelines, and one can therefore expect that they will be more technical and more stringent in their ratings. The F–C group did not have such formal training and were rating the drivers on their perceptions of their loved one’s driving safety. The tendency for evaluators to rate more severely (than the F–C) may be influenced by their training to focus on identifying deficits. The F–C, however, may be influenced by concern for their loved one’s safety, thus rating more stringently, or may be concerned with maintaining their own independence in transportation and rating leniently, especially given that 31.3% of F–C stated that their independence would be affected if the older driver stopped driving. In future studies, we may want to control for this variable by stratifying F–C on the basis of whether their independence will or will not be affected if the older driver stops driving.
The generalizability of our findings is limited because we used only two evaluators, had a convenience sample, and had a sample size of 80 F–C and 80 older drivers. Our driver sample was skewed to include mainly White (97.5%) and educated participants (63.8% had some college education or a university degree). The F–C sample was mainly female (77.5%) and White (98.8%), and 48.8% had completed a college or university degree.
Despite the study’s limitations, our findings suggest that because of the significant relationship between F–C and evaluator findings, we may train caregivers to more accurately recognize older adults’ unsafe driving behaviors. As such, after a short training program for F–C, we expect that the paired comparisons of the identified items (displayed in Table 2) may show improved congruence between F–C and evaluators. We are developing a caregiver training protocol to test this hypothesis.
Implications for Occupational Therapy Practice
Because occupational therapists play an important role in driving rehabilitation, understanding the differences in ratings among raters can help occupational therapists accurately identify the problems and efficiently develop driving safety interventions. This article presents several implications for occupational therapy practice:
The SDBM can be used as a reliable tool to assess older drivers’ safe driving behavior in occupational therapy practice.
Driving evaluators may tend to rate more severely than F–C.
When reviewing the ratings of the SDBM, occupational therapists must take rater effects into account.
With adequate training, F–C may be the most available, accurate, and reliable resource for following up on the safe driving behaviors of older people.
Our findings address an understudied area in the older driver safety literature: the reliability, leniency, and severity of F–C and evaluator ratings of older drivers through the SDBM. This study shows that a correlation exists between the evaluator and the F–C ratings, that neither of these groups is erratic in their rating responses, that the driving evaluator is the more severe rater of the two, and that the F–C show potential to be trained to increase the accuracy of their ratings. A future implication is to devise, implement, and test an F–C training protocol to enhance the accuracy and reliability of their ratings. If this proves successful, then the SDBM will have the potential to be used by F–C as a proxy self-report tool for identifying safe and unsafe driving behaviors. Occupational therapists may play a critical role in interpreting the findings of such proxy reports and identifying entry points for logical and efficient driver safety interventions.
This project was funded by National Institute on Aging Grant (R21) PAR-06-247 (Principal Investigator, Sherrilene Classen) and the University of Florida’s Center for Multimodal Studies on Congestion Mitigation (CMS) 00063055 (Principal Investigator, Sherrilene Classen). We acknowledge the support of the Institute for Mobility, Activity, and Participation at the University of Florida and the Centre for Research on Safe Driving at Lakehead University.
Appendix. Items on the Safe Driving Behavior Measure
Response options: cannot do, very difficult, somewhat difficult, a little difficult, not difficult.
How difficult is it for him or her to …
Open the car door?
Get in his or her car?
Turn the steering wheel?
Adjust the car mirrors?
Stay awake while driving?
Adjust the driver’s seat so he or she can see above the steering wheel?
Stop for pedestrians crossing the roadway?
Drive in good weather?
Stay in the proper lane?
Drive during daylight hours?
Remember to turn on the headlights before driving in the dark?
Check for a clear path when backing out from a driveway or parking space?
Reach the gas pedal (accelerator) and brake pedal?
Press the gas or the brake when intended?
Use the car controls (such as the turn signals, windshield wipers, or headlights)?
Place the car in the correct gear (such as drive or reverse)?
Operate the emergency brake?
Check car mirrors when changing lanes?
Read road signs far enough in advance to react (such as make a turn)?
Obey varied forms of traffic lights (such as green arrow for turn lane or flashing lights)?
Drive and hold a conversation with one or more passengers?
Drive with a passenger who is providing driving directions or assistance?
Drive in light rain?
Drive on a highway with two or more lanes in each direction?
Keep up with the flow of traffic?
Keep distance from other vehicles when changing lanes?
Change lanes in moderate traffic?
Drive cautiously (to avoid collisions) in situations when others are driving erratically (such as speeding, road rage, crossing lane lines, or driving distracted)?
Brake at a stop sign so car stops completely before the marked line?
Maintain lane when turning (not cut corner or go wide)?
Back out of parking spots?
Enter the flow of traffic when turning right?
Share the road with vulnerable road users such as bicyclists, scooter drivers, motorcyclists?
Drive on graded (unpaved) road?
Check blind spots before changing lanes?
Drive with surrounding tractor trailers (transport trucks)?
Merge onto a highway?
Use a map while driving?
Make a left-hand turn crossing multiple lanes and entering traffic (with no lights or stop signs)?
Stay within the lane markings unless making a lane change?
Stay within the proper lane in the absence of road features such as clearly marked lane lines, reflectors, or rumble strips?
Keep distance between his or her car and others (allow time to react to hazards)?
Look left and right before crossing an intersection?
Drive in a construction zone?
Drive in dense traffic (such as rush hour)?
Pass (overtake) a car in the absence of a passing lane?
Pass (overtake) a larger vehicle such as an RV, tractor trailer (transport truck), or dump truck in the absence of a passing lane?
Drive in an unfamiliar urban area?
Control his or her car when going down a steep hill?
Exit an expressway or interstate from a left-hand lane?
Drive in a highly complex situation (such as a large city with high-speed traffic, multiple highway interchanges, and several signs)?
Control the car (brake hard or swerve) to avoid collisions?
Drive a different car (such as another person’s car or a rental car)?
Alter his or her driving in response to changes in health (such as vision, reaction time, fatigue, thinking, joint stiffness, medications)?
Drive when upset (anxious, worried, sad, or angry)?
Stay focused on driving when there are distractions (such as radio, eating, drinking, pet in the car)?
Drive in an unfamiliar area?
Drive at night?
Avoid dangerous situations (such as car door opening, car pulling out, road debris, or animal darting in front of car)?
Drive when there is fog?
Drive at night on a dark road with faded or absent lane lines?
Drive when there is glare or the sun is in his or her eyes?
Turn left across multiple lanes when there is no traffic light?
Drive in a thunderstorm with heavy rains and wind?
Control his or her car on a wet road?
Drive on a snow-covered road?
Drive on an icy road?
Note. The complete Safe Driving Behavior Measure is available on request from Sherrilene Classen.