EvalAssist: Insights on Task-Specific Evaluations and AI-Assisted Judgment Strategy Preferences

Abstract

With the broad availability of large language models and their ability to generate vast outputs using varied prompts and configurations, determining the best output for a given task requires an intensive evaluation process, one where machine learning practitioners must decide how to assess the outputs and then carefully carry out the evaluation. This process is both time-consuming and costly. As practitioners work with an increasing number of models, they must now evaluate outputs to determine which model performs best for a given task. LLMs are increasingly used as evaluators to filter training data, evaluate model performance or assist human evaluators with detailed assessments. Our application, EvalAssist, supports this process by aiding users in interactively refining evaluation criteria. In our study with machine learning practitioners (n=15), each completing 6 tasks yielding 131 evaluations, we explore how task-related factors and judgment strategies influence criteria refinement and user perceptions. Findings show that users performed more evaluations with direct assessment by making criteria task-specific, modifying judgments, and changing the AI evaluator model. We conclude with recommendations for how systems can better support practitioners with AI-assisted evaluations.

Cite

Ashktorab, Z., Desmond, M., Pan, Q., Johnson, J. M., Santillán Cooper, M., Daly, E. M., … Geyer, W. (2025). EvalAssist: Insights on Task-Specific Evaluations and AI-Assisted Judgment Strategy Preferences. In UIST 2025 - Proceedings of the 38th Annual ACM Symposium on User Interface Software and Technology. Association for Computing Machinery, Inc. https://doi.org/10.1145/3746059.3747740
