This paper explores how best to employ deep contextual models for practical natural language processing tasks. It addresses the diversity of the problem space with a range of techniques built on the deep contextual BERT (Bidirectional Encoder Representations from Transformers) model. A collection of datasets on COVID-19 social media misinformation is used to capture the challenges of the misinformation detection task: small labeled data, noisy labels, out-of-distribution (OOD) data, fine-grained and nuanced categories, and heavily skewed class distributions. To address this diversity, both domain-agnostic (DA) and domain-specific (DS) BERT pretrained models (PTMs) are examined for transfer learning via two methods: fine-tuning (FT) and feature-based (FB) learning on extracted representations. FB learning is implemented in two variants: non-hierarchical (features extracted from a single hidden layer) and hierarchical (features extracted from a subset of hidden layers are first aggregated, then passed to a neural network for further extraction). Results from an extensive set of experiments show that FB is more effective than FT and that hierarchical FB generalizes better; on OOD data, however, the deep contextual models are less generalizable. The paper also identifies the conditions under which a DS PTM is beneficial. Finally, larger models may add only incremental benefit and can sometimes degrade performance.
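To make the hierarchical feature-based (FB) approach concrete, the following is a minimal sketch (not the authors' implementation) of how features from a subset of BERT hidden layers can be aggregated and passed to a small downstream network. The model name ("bert-base-uncased"), the choice of the last four layers, mean aggregation, and the feed-forward head are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

class HierarchicalFBClassifier(nn.Module):
    """Hierarchical FB sketch: aggregate [CLS] features from several
    hidden layers of a frozen BERT encoder, then learn a small head."""

    def __init__(self, encoder_name="bert-base-uncased",
                 layers=(-4, -3, -2, -1), num_labels=2):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(
            encoder_name, output_hidden_states=True)
        self.layers = layers
        hidden = self.encoder.config.hidden_size
        # Small neural network for further feature extraction, then a classifier.
        self.ffn = nn.Sequential(nn.Linear(hidden, 256), nn.ReLU(), nn.Dropout(0.1))
        self.classifier = nn.Linear(256, num_labels)

    def forward(self, input_ids, attention_mask):
        # FB learning keeps the pretrained model frozen (no fine-tuning).
        with torch.no_grad():
            out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        # Collect the [CLS] vector from each selected layer and aggregate (mean here).
        cls_per_layer = torch.stack(
            [out.hidden_states[i][:, 0, :] for i in self.layers], dim=0)
        features = cls_per_layer.mean(dim=0)
        return self.classifier(self.ffn(features))

# Hypothetical usage on a toy example.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = HierarchicalFBClassifier()
batch = tokenizer(["example social media post"], return_tensors="pt", truncation=True)
logits = model(batch["input_ids"], batch["attention_mask"])
```

The non-hierarchical variant described in the abstract would instead take features from a single hidden layer and feed them directly to the classifier, without the multi-layer aggregation step.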
Hasan, M. R. (2021). How to Churn Deep Contextual Models? In ACM International Conference Proceeding Series (pp. 226–233). Association for Computing Machinery. https://doi.org/10.1145/3486622.3493962