An experimental study measuring the generalization of fine-tuned language representation models across commonsense reasoning benchmarks

Citations: 10
Mendeley readers: 17

This article is free to access.

Abstract

In the last five years, language representation models based on transformer neural networks, such as BERT and GPT-3, have led to enormous progress in natural language processing (NLP). One such NLP task is commonsense reasoning, where performance is usually evaluated through multiple-choice question answering benchmarks. To date, many such benchmarks have been proposed, and ‘leaderboards’ tracking state-of-the-art performance on them suggest that transformer-based models are approaching human-like performance. Because these benchmarks test commonsense, however, such a model should be expected to generalize: at least in aggregate, it should not exhibit excessive performance loss across independent commonsense benchmarks, regardless of the specific benchmark on (the training set of) which it has been fine-tuned. In this article, we evaluate this expectation by proposing a methodology and experimental study to measure the generalization ability of language representation models using a rigorous and intuitive metric. Using five established commonsense reasoning benchmarks, our experimental study shows that the models do not generalize well and may be susceptible to issues such as dataset bias. The results therefore suggest that current benchmark performance may be an overestimate, especially if such models are to be used on novel commonsense problems for which no ‘training’ dataset is available for fine-tuning.
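The abstract does not spell out the metric, but the setup it describes (fine-tune on the training set of one benchmark, then evaluate on all benchmarks) naturally yields a cross-evaluation accuracy matrix. As a minimal sketch, assuming accuracy as the per-benchmark score, one plausible formalization of "does not exhibit excessive performance loss" is the ratio of mean out-of-domain accuracy to in-domain accuracy per fine-tuning source; the benchmark names and matrix values below are illustrative placeholders, not the paper's metric or results.

```python
import numpy as np

# Hypothetical benchmark labels; the paper uses five established commonsense
# benchmarks, but the abstract does not name them.
BENCHMARKS = ["A", "B", "C", "D", "E"]

def generalization_scores(acc: np.ndarray) -> np.ndarray:
    """Given acc[i, j] = accuracy of the model fine-tuned on benchmark i
    and evaluated on benchmark j, return one score per fine-tuning source:
    mean off-diagonal accuracy divided by in-domain (diagonal) accuracy.
    A score near 1.0 means little performance loss on benchmarks the
    model was not fine-tuned on; a low score indicates poor generalization.
    """
    n = acc.shape[0]
    scores = np.empty(n)
    for i in range(n):
        off_diag = np.delete(acc[i], i)          # accuracies on the other benchmarks
        scores[i] = off_diag.mean() / acc[i, i]  # retention relative to in-domain
    return scores

# Illustrative (fabricated) cross-evaluation matrix, for demonstration only.
acc = np.array([
    [0.80, 0.55, 0.50, 0.48, 0.52],
    [0.58, 0.78, 0.54, 0.51, 0.50],
    [0.49, 0.53, 0.82, 0.47, 0.55],
    [0.52, 0.50, 0.51, 0.79, 0.49],
    [0.54, 0.52, 0.53, 0.50, 0.81],
])

for name, s in zip(BENCHMARKS, generalization_scores(acc)):
    print(f"fine-tuned on {name}: retention = {s:.2f}")
```

With the illustrative matrix above, every fine-tuning source retains only about two-thirds of its in-domain accuracy out of domain, which is the qualitative pattern the abstract reports: high leaderboard scores on the fine-tuned benchmark that do not carry over to independent commonsense benchmarks.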

Citation (APA)

Shen, K., & Kejriwal, M. (2023). An experimental study measuring the generalization of fine-tuned language representation models across commonsense reasoning benchmarks. Expert Systems, 40(5). https://doi.org/10.1111/exsy.13243
