Test-retest reliability of the eating disorder examination-questionnaire (EDE-Q) in a college sample

Background: The Eating Disorder Examination-Questionnaire (EDE-Q), a widely used self-report instrument, is often used to measure change in eating disorder symptoms over the course of treatment. However, limited data exist about its test re-test reliability, particularly for men. The current study evaluated 7-day test re-test reliability of the EDE-Q in male (n = 47) and female (n = 44) undergraduate students, both together and separately by gender.
Results: Internal consistency was higher for women than for men and higher at Time 2 than at Time 1, but remained acceptable for both men and women at both time points. Cronbach’s α ranged from .75 (Restraint at Time 1) to .93 (Shape Concern at Time 2) for women and from .73 (Eating Concern at Time 2) to .89 (Shape Concern at Time 2) for men. With the exception of some of the eating disorder behaviors, test re-test reliability was fairly strong for both men and women. Test re-test reliability was highest for Shape Concern and the global EDE-Q score for both men and women (Spearman’s rho > 0.89, with the exception of Shape Concern for women, for which Spearman’s rho = .86). Test re-test reliability was lower for the eating disorder behavior measures, particularly for men, for whom Kendall’s tau-b for frequency and phi for occurrence were less than 0.70 for all but objective bulimic episodes.
Conclusions: Results were consistent with past research for women, indicating strong test re-test reliability in attitudinal features of eating disorders, but lower test re-test reliability in behavioral features. Internal consistency and test re-test reliability were good for the attitudinal features of eating disorders in men, but tended to be lower for men than for women. The EDE-Q appears to be a reliable instrument for assessing eating disorder attitudes in both male and female undergraduate students, but is less reliable for assessing ED behaviors, particularly in men.


Background
The EDE-Q [1] is a widely used measure to assess eating disorder (ED) attitudes and behaviors in both community and clinical populations. Eating disorders are especially prevalent among college women and are becoming more prevalent among young men [2]. Consequently, identifying students with eating disorders is important so that treatment can be made available to these students. The EDE-Q is a particularly useful measure to assess eating disorder attitudes and behavior in the broader population of college students as it is easy and inexpensive to administer and can quickly measure eating disorders and compensatory behaviors in large samples. However, since assessment for detection of eating disorders in college students is likely to occur infrequently in non-research university settings, temporal stability is a critical component of any ED measure used for this purpose.

Participants
The EDE-Q was administered to 91 undergraduate students (47 men and 44 women) recruited for research participation credit in a large introductory psychology course at a university in the northeastern United States. The mean age was 19 years (SD = 1.16; range = 18-23). Participants were able to identify with more than one ethnicity. The majority (59%) of participants identified as White, 33% identified as Asian, 11% identified as Black, and 8% identified as Hispanic. For both assessments, students completed a paper-and-pencil self-report EDE-Q in person during individually scheduled appointments. The average test re-test interval was 6.88 days (SD = 1.36 days; range = 5-14 days), and the test re-test interval was between 6 and 8 days for 94.5% of the participants. No questionnaires other than the EDE-Q were completed at either time point. All but 3 participants completed both assessments. The study was approved by the university's institutional review board.
EDE-Q V6.0 procedures (Fairburn, 2008) were used to score the EDE-Q at both time points. Subscale scores were created by averaging the corresponding items, provided that participants responded to more than half of those items. Subscales included Restraint (5 items), Eating Concern (5 items), Shape Concern (8 items), and Weight Concern (5 items). A global EDE-Q score was created by averaging the 4 subscales. Both frequency (number of times) and occurrence (a binary yes/no variable indicating whether the behavior occurred at least once) of ED behavioral features were examined. These features were objective bulimic episodes (OBE), OBE days, objective overeating (OO) episodes, and exercise to control weight or shape. A composite behavior score, computed as the average of the OBE, OBE days, and OO episode frequency variables, was also examined. Subjective binge eating episodes (SBE), which could be determined in earlier versions of the EDE-Q, cannot be determined in Version 6.0 of the EDE-Q. An SBE is defined as an occasion when there is a perceived loss of control, but the amount of food eaten is not large. The EDE-Q 6.0 assesses loss of control, but only with regard to occasions when a large amount of food is consumed. Because vomiting (N = 2) and laxative use (N = 1) were rare, test re-test reliability statistics were not computed for these variables.
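The scoring rules described above can be illustrated with a short Python sketch. It is not the official EDE-Q scoring key: the function names are hypothetical, the item-to-subscale assignment is omitted (only the item counts given in the text are used), and returning no global score when any subscale is missing is an assumption the text does not address.

```python
from statistics import mean

# Item counts per subscale as stated in the text; which questionnaire items
# belong to which subscale is NOT encoded here (see the EDE-Q 6.0 key).
SUBSCALE_ITEM_COUNTS = {
    "restraint": 5,
    "eating_concern": 5,
    "shape_concern": 8,
    "weight_concern": 5,
}


def subscale_score(responses):
    """Average the answered items; None marks a skipped item.

    A score is produced only if MORE than half of the items were answered.
    """
    answered = [r for r in responses if r is not None]
    if 2 * len(answered) <= len(responses):
        return None
    return mean(answered)


def global_score(subscale_scores):
    """Average of the four subscale scores.

    Assumption: if any subscale is missing, no global score is returned.
    """
    values = list(subscale_scores.values())
    if any(v is None for v in values):
        return None
    return mean(values)


def occurrence(frequency):
    """Binary occurrence: True if the behavior happened at least once."""
    return frequency >= 1


def composite_behavior_score(obe, obe_days, oo):
    """Average of the OBE, OBE days, and OO episode frequencies."""
    return mean([obe, obe_days, oo])
```

For example, a respondent who skips one of the five Restraint items still receives a Restraint score, whereas a respondent who answers only four of the eight Shape Concern items (exactly half) does not.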

Analysis
Internal consistency was calculated using Cronbach's coefficient alpha (α) for the four continuous EDE-Q subscales (Restraint, Eating Concern, Shape Concern, and Weight Concern) and the global EDE-Q score. To facilitate comparison to previous studies, 7-day test re-test reliability of each continuous subscale, the global EDE-Q score, the frequency of OO, OBE, and OBE days, and the binge behaviors composite score was estimated using Pearson r and Spearman's rho statistics. It has been suggested that test re-test reliability coefficients of .80 or higher for these statistics are indicative of acceptable test re-test reliability [14].
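As an illustration of the statistics named above, the following Python sketch computes Cronbach's α from an item-response matrix and Spearman's rho between two administrations. It uses only the standard library and textbook formulas; the function names are hypothetical and this is not the software the authors used. The .80 threshold is the criterion cited in the text [14].

```python
from statistics import mean, variance

ACCEPTABLE_RETEST = 0.80  # criterion cited in the text [14]


def cronbach_alpha(rows):
    """rows: one list of item scores per participant (no missing values).

    alpha = k / (k - 1) * (1 - sum(item variances) / variance(total score))
    """
    k = len(rows[0])
    item_vars = [variance([row[i] for row in rows]) for i in range(k)]
    total_var = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for pos in range(i, j + 1):
            ranks[order[pos]] = avg
        i = j + 1
    return ranks


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman_rho(time1, time2):
    """Spearman's rho: Pearson correlation of the (tie-averaged) ranks."""
    return pearson_r(average_ranks(time1), average_ranks(time2))
```

A test re-test coefficient would then be compared against the criterion, for example `spearman_rho(time1_scores, time2_scores) >= ACCEPTABLE_RETEST`, where `time1_scores` and `time2_scores` are hypothetical lists of the same participants' scores at each time point.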
Kendall's tau-b was also calculated for the ED behavior frequency variables because these measures departed more severely from normality than the global EDE-Q score and the four subscales. In cases of extreme nonnormality, Kendall's tau-b has been found to be superior to Spearman's rho [15]. Kendall's tau-b is a nonparametric measure of rank association. Like the Pearson correlation coefficient and Spearman's rho, Kendall's tau-b can range from −1 (perfect disagreement) to +1 (perfect agreement). Although there is no well-established criterion for acceptable test re-test reliability for Kendall's tau-b, its magnitude is generally lower than that of Spearman's rho, with a ratio of Spearman's rho to Kendall's tau-b of approximately 3/2, owing to differences in computation [16]. Finally, phi coefficients were calculated for the binary binge behavior occurrence variables. All statistics were calculated for the entire sample, as well as separately by gender.

Descriptive statistics
Table 2 shows means and standard deviations for the continuous measures, along with the number and percentage of students who reported engaging in each behavior at least once. Means for women on the global EDE-Q score, Shape Concern, and Weight Concern were consistent with established EDE-Q norms for college women, but women in this study had slightly lower means than the norm on Restraint and Eating Concern [17]. Compared to the norm for college women, the rate of reported OBE and Excessive Exercise was higher, but the rate of Vomiting and Laxative Use was lower [17]. Men had slightly lower means on Eating Concern, Shape Concern, and Weight Concern compared to the norm for college men, but were consistent with the norm for Restraint [4]. As for women, the rate of Excessive Exercise was higher for men compared to the norm and the rate of Vomiting and Laxative Use was lower, whereas the rate of reported OBE episodes for men was consistent with the norm [4].
Men scored significantly lower than women at both time points on all EDE-Q subscales and the global EDE-Q score, with the exception of Restraint. Men reported fewer OBEs (mean = 1.07 and 0.90 at Times 1 and 2, respectively) and OBE days (mean = 1.11 and 2.08 at Times 1 and 2, respectively) compared to women. However, these differences were statistically significant only for OBE days at Time 2. Conversely, men reported significantly more OO episodes (mean = 5.83 and 5.93 at Times 1 and 2, respectively) compared to women (mean = 2.71 and 1.77 at Times 1 and 2, respectively). Men had higher scores on the binge behaviors composite score due to their higher rates of OO. Vomiting and laxative use were rare. None of the participants reported using laxatives at Time 1, and only one male participant reported laxative use at Time 2. Two women reported vomiting to control shape or weight at Time 1 and Time 2. None of the men reported vomiting to control shape or weight. However, 45% of participants at Time 1 and 35% at Time 2 reported exercising to control shape or weight. There were no significant gender differences in frequency of excessive exercise, although women reported a significantly higher level of excessive exercise occurrence at Time 2.

Internal consistency
Table 3 shows Cronbach's α internal consistency for the four EDE-Q subscales. Internal consistency was acceptable for all four subscales. Overall, internal consistency was lower at Time 1 than at Time 2 and lowest for Restraint at Time 1, yet remained acceptable at both time points for both men and women. Internal consistency was consistently higher for women, with the exception of Restraint at Time 2 (α = .86 for men and .81 for women). Cronbach's α ranged from .74 (Restraint) to .89 (Shape Concern) for men and from .75 (Restraint) to .93 (Shape Concern) for women.

Test re-test reliability
Tables 4 and 5 show the test re-test reliability coefficients for the EDE-Q measures. With the exception of some of the ED behaviors, test re-test reliability was fairly strong for both men and women. Test re-test reliability was highest for Shape Concern and the global EDE-Q score for both men and women (Spearman's rho = 0.89 or greater, with the exception of Shape Concern for women, for which Spearman's rho = .86). Test re-test reliability was lower for the ED behavior measures, particularly for men, for whom Kendall's tau-b for frequency and phi for occurrence were less than 0.70 for all but OBE. Among women, Kendall's tau-b was less than .70 for all but Excessive Exercise frequency, although test re-test reliability for ED behavior occurrence was more reasonable.
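For readers who want to see how the Kendall's tau-b and phi coefficients reported here are defined, the following Python sketch implements the textbook formulas with the standard library only. The function names are hypothetical, and this is an illustration under those assumptions rather than the authors' actual analysis code.

```python
def kendall_tau_b(x, y):
    """Kendall's tau-b for paired observations, with correction for ties.

    tau_b = (C - D) / sqrt((C + D + Tx) * (C + D + Ty)), where C and D are
    concordant and discordant pairs, Tx counts pairs tied only on x, and
    Ty counts pairs tied only on y. Pairs tied on both are ignored.
    Raises ZeroDivisionError if either variable has no variation.
    """
    concordant = discordant = ties_x = ties_y = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0 and dy == 0:
                continue
            if dx == 0:
                ties_x += 1
            elif dy == 0:
                ties_y += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    untied = concordant + discordant
    denom = ((untied + ties_x) * (untied + ties_y)) ** 0.5
    return (concordant - discordant) / denom


def phi_coefficient(time1, time2):
    """Phi for two binary (truthy/falsy) occurrence variables.

    phi = (a*d - b*c) / sqrt((a+b) * (c+d) * (a+c) * (b+d)) for the 2x2
    table with a = yes/yes, b = yes/no, c = no/yes, d = no/no.
    Raises ZeroDivisionError if any row or column total is zero.
    """
    pairs = list(zip(time1, time2))
    a = sum(1 for u, v in pairs if u and v)
    b = sum(1 for u, v in pairs if u and not v)
    c = sum(1 for u, v in pairs if not u and v)
    d = sum(1 for u, v in pairs if not u and not v)
    denom = ((a + b) * (c + d) * (a + c) * (b + d)) ** 0.5
    return (a * d - b * c) / denom
```

Episode counts such as OBE frequency typically contain many zeros, so many pairs are tied; the tau-b denominator adjusts for those ties. With rare behaviors, one or more cells of the 2x2 table can be empty, which is one reason coefficients were not computed for vomiting and laxative use.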

Discussion
The current study examined internal consistency and 7-day test re-test reliability of the EDE-Q among college men and women. Consistent with past research, internal consistency was reasonable for all four subscales and higher for the global EDE-Q measure [6,11]. Internal consistency was lowest for the Restraint subscale. Internal consistency was slightly lower for men compared to women, but still acceptable. Interestingly, internal consistency was higher for both men and women at Time 2 compared to Time 1. Given the relatively short 7-day interval between assessments, this might reflect greater familiarity with the EDE-Q at Time 2, thus producing a higher correlation among the attitudinal items. Test re-test reliability was generally high for the four attitudinal subscales and the global attitudinal EDE-Q score, but lower for ED behavior frequency and occurrence. This is consistent with past research indicating greater temporal stability in ED attitudes compared to ED behaviors [5,9,10,11,12]. Men had lower test re-test reliability for ED attitudes and behaviors compared to women. This may be because, for many men, eating attitudes and behaviors are more likely to be driven by a desire for muscularity [18]. Consequently, men may have different ED concerns and behaviors that are not measured by the EDE-Q and that may influence the reliability of the EDE-Q constructs in men. For example, rather than overeating or binge eating to be thinner, some men may engage in these behaviors to build larger bodies with more muscle mass. The higher rate of overeating without perceived loss of control in men may be due in part to a conscious decision to eat more in order to build muscle. Further, research has indicated that men experience fewer shape and weight concerns than women [19], and this is supported by the lower scores on ED attitudes for men.
Men may engage in more intermittent dieting behaviors related to muscle building, which might affect the temporal stability of eating behaviors. To our knowledge, this is the first study to examine temporal stability of the EDE-Q in men. However, this study could not assess the validity of the measure in men. Consequently, more research examining both reliability and validity of the EDE-Q in men is warranted in order to replicate and understand the findings in this study, and to determine more clearly the extent to which the EDE-Q is a valid measure for men. Despite lower test re-test reliability for ED behaviors compared to ED attitudes in this study, temporal stability of ED behaviors was higher compared to previous studies. This may be due to the short interval between assessments: because participants are asked to recall their behavior over the past 28 days, two assessments about 7 days apart cover largely overlapping recall periods (roughly 21 of the 28 days). Test re-test reliability for ED behaviors has been found to decrease as the interval between assessments increases [7], and is often unacceptably low for test re-test intervals that extend over several months [6]. Establishing good temporal stability for a short interval is important, as it can be considered an upper limit on the stability of the EDE-Q because attitudes and behaviors are less likely to change over such a short period of time. If short-term test re-test reliability is poor, then observed changes in EDE-Q scores resulting from true changes in attitudes and behaviors that might occur over a longer period of time will be confounded with unreliability in the measure.
There are some limitations to this study that should be noted. First, the sample was too small to examine laxative use and vomiting to control shape or weight. This problem has plagued most past research as well [6,8,9,11]. Only a few studies have examined temporal stability in laxative use and vomiting, and these have shown low to moderate temporal stability for these behaviors [5,10,11]. However, most of these studies were conducted on populations from countries other than the United States, and tended to have considerably larger sample sizes. Second, the test re-test reliability coefficients were calculated based on the originally proposed four-factor structure for the EDE-Q subscales [1]. Although other studies examining the factor structure of the EDE-Q subscales have found a varying number of factors [20,21], we chose to examine test re-test reliability of the four original subscales in order to be comparable to other studies examining the psychometric properties of the EDE-Q. Third, we did not collect body mass index (BMI) data in this study. It is reasonable to assume that there would be little to no change in BMI within individuals from the first to the second assessment only 7 days later. Consequently, BMI is not likely to have influenced test re-test reliability in this study because it is likely to have remained stable between assessments. However, a lack of BMI data makes it more difficult to compare overall EDE-Q attitude and behavior scores in this study to scores in other studies. Finally, the current study relied on self-reports of ED attitudes and behavior, so it is possible that observed gender differences may be a function of differences in retrospective or other recall bias.

Conclusions
This study examined test re-test reliability of the EDE-Q in college women and men, and is the first study to report test re-test reliability in men specifically. Results were consistent with past research for women, indicating good stability in attitudinal features of ED and lower stability in behavioral features for a relatively short 7-day test re-test interval. Internal consistency and test re-test reliability were good for the attitudinal features in men, but tended to be lower compared to women, particularly for the behavioral features of ED. This suggests that men are less consistent in their ED behaviors, possibly due in part to having different goals for ED behaviors. However, more research is necessary to determine whether this is a reliable finding and whether it extends to longer test re-test intervals. This study indicates that the EDE-Q is a reliable instrument for assessing eating disorder attitudes in both male and female undergraduate students, but is less reliable for assessing ED behaviors, particularly in men, for whom only OBEs appeared to have acceptable test re-test reliability.

Competing interests
The authors declare that they have no competing interests.