EvalAssist: Insights on Task-Specific Evaluations and AI-Assisted Judgment Strategy Preferences

Zahra Ashktorab; Michael Desmond; Qian Pan; James M. Johnson; Martin Santillán Cooper; Elizabeth M. Daly; Rahul Nair; Tejaswini Pedapati; Hyo Jin Do; Werner Geyer

Conference Proceedings, Open Access

UIST 2025 - Proceedings of the 38th Annual ACM Symposium on User Interface Software and Technology (2025)

DOI: 10.1145/3746059.3747740

Abstract

With the broad availability of large language models and their ability to generate vast outputs using varied prompts and configurations, determining the best output for a given task requires an intensive evaluation process, one where machine learning practitioners must decide how to assess the outputs and then carefully carry out the evaluation. This process is both time-consuming and costly. As practitioners work with an increasing number of models, they must now evaluate outputs to determine which model performs best for a given task. LLMs are increasingly used as evaluators to filter training data, evaluate model performance or assist human evaluators with detailed assessments. Our application, EvalAssist, supports this process by aiding users in interactively refining evaluation criteria. In our study with machine learning practitioners (n=15), each completing 6 tasks yielding 131 evaluations, we explore how task-related factors and judgment strategies influence criteria refinement and user perceptions. Findings show that users performed more evaluations with direct assessment by making criteria task-specific, modifying judgments, and changing the AI evaluator model. We conclude with recommendations for how systems can better support practitioners with AI-assisted evaluations.

Citation (APA)

Ashktorab, Z., Desmond, M., Pan, Q., Johnson, J. M., Santillán Cooper, M., Daly, E. M., … Geyer, W. (2025). EvalAssist: Insights on Task-Specific Evaluations and AI-Assisted Judgment Strategy Preferences. In UIST 2025 - Proceedings of the 38th Annual ACM Symposium on User Interface Software and Technology. Association for Computing Machinery, Inc. https://doi.org/10.1145/3746059.3747740
