Rater Reliability in Speaking Assessment in a Japanese Senior High School: Case of Classroom Group Discussion and Debate


Abstract

Securing rater reliability for classroom speaking tests can be difficult because teacher-raters typically have little time for rater training to understand and discuss rubrics and scores. Furthermore, a teacher often has difficulty asking colleagues to help double-mark each student’s performance. Intensive rater training and double scoring are standard procedures for maintaining high reliability (Knoch et al., 2021) but are not well practiced in the classroom. However, in some cases, extensive training or double scoring is unnecessary when teachers use a rubric with only a few criteria and levels, which is simpler than conventional detailed rubrics (Koizumi & Watanabe, 2021). Thus, we used a group discussion and a debate to explore rater reliability when Japanese senior high school teachers used simple analytic rubrics without detailed rater training. We posed the following research questions (RQs): RQ1: To what degree are raters similar in terms of interrater consensus and consistency? RQ2: To what degree do raters score students’ responses consistently? RQ3: How many raters are required to maintain reliability? We analyzed ratings for two speaking tests administered in September or November to 227 third-year students at a public senior high school. Each test, taken in groups of four students, included either a five-minute group discussion or a 21-minute group debate; test administration and marking were conducted during lesson time. An analytic rubric was developed for each task, consisting of three or four criteria with three levels (e.g., content, expression, and technique). Two of the three raters scored each student’s response during the test. Teachers had no time to discuss the rubrics in detail and engaged in only a 10-minute discussion about the rubrics before the tests.
The ratings were analyzed separately for each test using weighted kappa statistics, Spearman’s rank-order correlations, many-facet Rasch measurement (MFRM), and multivariate generalizability theory (mG theory). The results indicated that overall rater reliability was adequate, although some cases would require careful training. For RQ1, the kappa statistics for the two raters’ scores on each criterion ranged from poor to substantial agreement (-.06 to .84). Correlations between the two raters’ scores ranged from negligible to strong (-.07 to .91), and there were no large differences in rater severity (i.e., differences in fair mean-based average values of 0.07 to 0.16, with full marks of 3). In addition, the observed overall agreement percentages were higher than those predicted by MFRM (e.g., 72.9% > 71.6%). The intrarater consistency examined for RQ2 using Infit and Outfit mean squares from MFRM was also adequate (e.g., 0.86 to 1.35). For RQ3, the number of raters needed to maintain sufficient reliability (Φ = .70) was one at the overall test level and one to three at the criterion level. Using simple rubrics, a group discussion task, and a debate task, the results showed that rater reliability can be maintained without extensive rater training. Although the current results may have been affected by study contexts, such as procedures and student and rater characteristics, they offer pedagogical and methodological implications for developing speaking assessment tasks and procedures and for reporting rater reliability statistics from multiple perspectives.
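The agreement indices named above can be illustrated with a short sketch. This is not the study’s data or code: the ratings, the variance components, and the `raters_needed` helper are hypothetical. Only the formulas follow standard definitions — quadratic weighted kappa, Spearman’s ρ, and the G-theory dependability index Φ = σ²_p / (σ²_p + σ²_Δ/n′) used in a decision study to project reliability for n′ raters.

```python
# Hypothetical example (not data from the study): two raters' 1-3
# scores on one rubric criterion for ten students.
from sklearn.metrics import cohen_kappa_score
from scipy.stats import spearmanr

rater_a = [3, 2, 2, 1, 3, 2, 1, 3, 2, 2]
rater_b = [3, 2, 1, 1, 3, 3, 1, 3, 2, 2]

# Interrater consensus: weighted kappa (quadratic weights penalize
# larger score disagreements more heavily).
kappa = cohen_kappa_score(rater_a, rater_b, weights="quadratic")

# Interrater consistency: Spearman's rank-order correlation.
rho, _ = spearmanr(rater_a, rater_b)

# Decision study (G theory): smallest number of raters n for which the
# dependability index Phi = var_p / (var_p + var_err / n) reaches a target.
def raters_needed(var_person, var_error_single, target=0.70, max_n=10):
    for n in range(1, max_n + 1):
        phi = var_person / (var_person + var_error_single / n)
        if phi >= target:
            return n
    return None

# Hypothetical variance components for one criterion.
n = raters_needed(var_person=0.5, var_error_single=0.3)

print(f"kappa={kappa:.2f}, rho={rho:.2f}, raters needed={n}")
```

With these made-up variance components, a single rater falls short of Φ = .70 (0.5 / 0.8 = .625) while two raters exceed it, which mirrors the kind of criterion-level decision-study result the abstract reports.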

Citation (APA)

Koizumi, R., Hatsuzawa, S., Isobe, R., & Matsuoka, K. (2022). Rater Reliability in Speaking Assessment in a Japanese Senior High School: Case of Classroom Group Discussion and Debate. JALT Journal, 44(2), 281–322. https://doi.org/10.37546/JALTJJ44.2-5
