Desired model behavior often differs across contexts (e.g., different geographies, communities, or institutions), but there is little infrastructure to support the context-specific evaluations that are key to deployment decisions and to building trust. Here, we present Kaleidoscope, a system for evaluating models in terms of user-driven, domain-relevant concepts. Kaleidoscope's iterative workflow enables users to generalize from a few examples to a larger, diverse set representing an important concept. These example sets can be used to test model outputs or shifts in model behavior in semantically meaningful ways. For instance, we might construct a "xenophobic comments" set and test that its examples are more likely to be flagged by a content moderation model than those in a "civil discussion" set. To evaluate Kaleidoscope, we compare it against template- and DSL-based grouping methods, and conduct a usability study with 13 Reddit users testing a content moderation model. We find that Kaleidoscope facilitates iterative, exploratory hypothesis testing across diverse, conceptually meaningful example sets.
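The kind of concept-set test described above can be sketched concretely. The snippet below is illustrative only: toy_moderation_model, the example strings, and mean_flag_rate are hypothetical stand-ins, not Kaleidoscope's interface.

# Minimal sketch of testing one concept set against another,
# assuming a stand-in keyword classifier rather than a real moderation model.

def toy_moderation_model(comment: str) -> float:
    """Return a fake probability that a comment should be flagged."""
    flagged_terms = {"go back", "don't belong", "invaders"}
    return 0.9 if any(term in comment.lower() for term in flagged_terms) else 0.1

# Two small, user-curated example sets, each representing a concept.
xenophobic_comments = [
    "Immigrants don't belong here.",
    "Tell them to go back where they came from.",
]
civil_discussion = [
    "I think immigration policy should balance several concerns.",
    "Reasonable people can disagree about visa quotas.",
]

def mean_flag_rate(examples: list[str]) -> float:
    """Average flag probability the model assigns to an example set."""
    return sum(toy_moderation_model(c) for c in examples) / len(examples)

# The hypothesis: the "xenophobic comments" set is flagged more often
# than the "civil discussion" set.
assert mean_flag_rate(xenophobic_comments) > mean_flag_rate(civil_discussion)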
CITATION
Suresh, H., Shanmugam, D., Chen, T., Bryan, A. G., D'Amour, A., Guttag, J., & Satyanarayan, A. (2023). Kaleidoscope: Semantically-grounded, context-specific ML model evaluation. In Proceedings of the CHI Conference on Human Factors in Computing Systems. Association for Computing Machinery. https://doi.org/10.1145/3544548.3581482