Desired model behavior often differs across contexts (e.g., different geographies, communities, or institutions), but there is little infrastructure to support the context-specific evaluations that are key to deployment decisions and to building trust. Here, we present Kaleidoscope, a system for evaluating models in terms of user-driven, domain-relevant concepts. Kaleidoscope's iterative workflow enables users to generalize from a few examples to a larger, diverse set representing an important concept. These example sets can be used to test model outputs or shifts in model behavior in semantically meaningful ways. For instance, we might construct a "xenophobic comments" set and test that its examples are more likely to be flagged by a content moderation model than those in a "civil discussion" set. To evaluate Kaleidoscope, we compare it against template- and DSL-based grouping methods, and conduct a usability study with 13 Reddit users testing a content moderation model. We find that Kaleidoscope facilitates iterative, exploratory hypothesis testing across diverse, conceptually meaningful example sets.
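The kind of concept-set test described above can be sketched concretely. The snippet below is illustrative only: toy_moderation_model, the example strings, and mean_flag_rate are hypothetical stand-ins, not Kaleidoscope's interface.

# Minimal sketch of testing one concept set against another,
# assuming a stand-in keyword classifier rather than a real moderation model.

def toy_moderation_model(comment: str) -> float:
    """Return a fake probability that a comment should be flagged."""
    flagged_terms = {"go back", "don't belong", "invaders"}
    return 0.9 if any(term in comment.lower() for term in flagged_terms) else 0.1

# Two small, user-curated example sets, each representing a concept.
xenophobic_comments = [
    "Immigrants don't belong here.",
    "Tell them to go back where they came from.",
]
civil_discussion = [
    "I think immigration policy should balance several concerns.",
    "Reasonable people can disagree about visa quotas.",
]

def mean_flag_rate(examples: list[str]) -> float:
    """Average flag probability the model assigns to an example set."""
    return sum(toy_moderation_model(c) for c in examples) / len(examples)

# The hypothesis: the "xenophobic comments" set is flagged more often
# than the "civil discussion" set.
assert mean_flag_rate(xenophobic_comments) > mean_flag_rate(civil_discussion)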
CITATION
Suresh, H., Shanmugam, D., Chen, T., Bryan, A. G., D'Amour, A., Guttag, J., & Satyanarayan, A. (2023). Kaleidoscope: Semantically-grounded, context-specific ML model evaluation. In Proceedings of the CHI Conference on Human Factors in Computing Systems. Association for Computing Machinery. https://doi.org/10.1145/3544548.3581482