Abstract
Title: The reliability of holistic and analytic evaluations of the EFL essays by Turkish University preparatory students
Author: Şehnaz Şahinkarakaş
Thesis Chairperson: Dr. Dan J. Tannacito, Bilkent University, MA TEFL Program
Thesis Committee Members: Ms. Patricia Brenner, Dr. Linda Laube, Bilkent University, MA TEFL Program
This study sought to identify a reliable method of scoring essays. Two hypotheses were tested, and observations were made of the scoring system used at the preparatory school of Çukurova University. A total of 150 EFL preparatory students participated in the study. These students wrote two essays: one for the first hypothesis and one for the second. The first essays were rated analytically by the teachers at Çukurova University. The second essays were rated holistically and analytically by four raters, each with at least five years of EFL teaching experience. Correlations were computed to determine the relationships between the scores the raters gave under each scoring method.
The first hypothesis was that the scoring system used at Çukurova University did not have a high level of reliability. The correlational analysis of the data rejected this hypothesis (r = .97). However, descriptive analysis showed that the correlation of the scores alone would not be sufficient to claim that this system was reliable. In fact, observations indicated that the raters who scored the essays a second time had seen the first scores, which created a self-fulfilling bias.
The second hypothesis was that holistically scored essays have significantly greater reliability than analytically scored ones in this educational context. The analysis of the data was twofold: interrater reliability and intrarater reliability. The correlations for interrater reliability indicated that both scoring systems had high reliabilities. The interrater reliability of the holistic scoring method was .85, and that of the analytic scoring method was .84. The difference is negligible.
Since the analytic scoring method has five categories, the study investigated the reliability of each category individually as well as that of the total score. The analysis of the categories revealed that their reliability was not as high as that of the total scores for analytic rating. The interrater reliability was .75 for content, .69 for organization, .80 for vocabulary, .82 for language use, and .71 for mechanics. The correlations for intrarater reliability showed that there was not a significant difference between the two scoring methods (p < .01 for both scoring methods). The intrarater reliability of holistic scoring ranged from .70 to .85, and that of analytic scoring from .65 to .86. However, the categories scored on the analytic rubric had low intrarater reliabilities. The intrarater reliability ranged from .34 to .83 for content, from .23 to .81 for organization, from .46 to .80 for vocabulary, from .63 to .77 for language use, and from .55 to .80 for mechanics. We may conclude that holistic scoring is more reliable than analytic scoring. Although the total scores of analytic scoring might have high reliability, the categories of this scoring method might have very low reliability, which may raise a question about the reliability of analytic scoring.
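To illustrate the kind of correlational analysis described above, the following is a minimal sketch of one common way to estimate interrater reliability: computing Pearson's r for every pair of raters and averaging the results. The abstract does not state which correlation coefficient or aggregation was actually used, so both choices are assumptions here. The function names (`pearson_r`, `mean_pairwise_r`) and the scores are hypothetical and are not data from the study.

```python
from itertools import combinations
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


def mean_pairwise_r(scores_by_rater):
    """Average Pearson's r over all pairs of raters (an assumed aggregation)."""
    pairs = list(combinations(scores_by_rater.values(), 2))
    if not pairs:
        raise ValueError("need at least two raters")
    return sum(pearson_r(a, b) for a, b in pairs) / len(pairs)


# Hypothetical scores: four raters, the same five essays, for illustration only.
hypothetical_scores = {
    "rater_a": [70, 85, 60, 90, 75],
    "rater_b": [72, 80, 65, 88, 70],
    "rater_c": [68, 83, 58, 92, 78],
    "rater_d": [75, 82, 63, 85, 72],
}

if __name__ == "__main__":
    print(f"mean pairwise r = {mean_pairwise_r(hypothetical_scores):.2f}")
```

Under the same assumptions, intrarater reliability could be estimated by passing one rater's first and second scorings of the same essays to `pearson_r`, and each analytic category could be analysed by repeating the computation on that category's scores alone.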