ACRF: Aggregated Conditional Random Field for out of Vocab (OOV) Token Representation for Hindi NER

Abstract

Named entities are unpredictable: new entities keep emerging, and many are complex. The tokenizers of most large language models have a fixed vocabulary, so they split out-of-vocabulary (OOV) words into multiple sub-words during tokenization. During fine-tuning for a downstream task, these sub-word tokens make named entity classification more complex, because an extra entity type is assigned to each sub-word in order to use its embedding. This work attempts to reduce that complexity by aggregating the token embeddings of each word. We apply Aggregated-CRF (ACRF), in which a conditional random field (CRF) is placed on top of the aggregated token embeddings to predict named entities. The aggregation is performed over the embeddings of all tokens that the tokenizer generates for a word. Experiments were conducted on two Hindi datasets (HiNER and Hindi Multiconer2). ACRF performed better than a vanilla CRF (where token embeddings are not aggregated). Our results also outperformed the best existing result on the HiNER data, which was obtained with a cross-entropy classification layer. Further, we analyze the impact of tokenization, both overall and by entity type, for each word in the test data; the results show that ACRF performed better than the vanilla CRF on words tokenized into more than one sub-word (OOV). In addition, this work compares two transformer-based models, MuRIL-large and XLM-roberta-large, and investigates how these models adopt the aggregation strategy based on OOV.
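
The aggregation step described in the abstract can be illustrated with a minimal sketch. The abstract does not state which aggregation function is used, so mean pooling is assumed here; the function name `aggregate_word_embeddings`, the `word_ids` input (one word index per token, `None` for special tokens), and the toy vectors are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Optional


def aggregate_word_embeddings(
    token_embeddings: List[List[float]],
    word_ids: List[Optional[int]],
) -> List[List[float]]:
    """Mean-pool sub-word embeddings so that each word gets one vector.

    word_ids[i] is the index of the word that token i belongs to, or None
    for special tokens, which are skipped. Mean pooling is an assumption;
    the abstract does not name the aggregation function.
    """
    if len(token_embeddings) != len(word_ids):
        raise ValueError("token_embeddings and word_ids must have equal length")

    sums: Dict[int, List[float]] = {}
    counts: Dict[int, int] = {}
    for vector, word_id in zip(token_embeddings, word_ids):
        if word_id is None:
            continue  # special token: contributes to no word
        if word_id not in sums:
            sums[word_id] = list(vector)
            counts[word_id] = 1
        else:
            sums[word_id] = [a + b for a, b in zip(sums[word_id], vector)]
            counts[word_id] += 1

    # One averaged vector per word, in word order.
    return [[x / counts[w] for x in sums[w]] for w in sorted(sums)]


# Toy input: word 0 is in-vocabulary (one token); word 1 is OOV (two sub-words).
token_embeddings = [
    [0.0, 0.0],  # special token
    [1.0, 3.0],  # word 0
    [2.0, 0.0],  # word 1, first sub-word
    [4.0, 2.0],  # word 1, second sub-word
    [0.0, 0.0],  # special token
]
word_ids = [None, 0, 1, 1, None]

word_embeddings = aggregate_word_embeddings(token_embeddings, word_ids)
print(word_embeddings)  # [[1.0, 3.0], [3.0, 1.0]]
```

With one vector per word, a CRF layer placed on top would predict exactly one entity label per word, so no extra label is needed for the trailing sub-words of an OOV word.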

Cite

Singh, S., & Tiwary, U. S. (2024). ACRF: Aggregated Conditional Random Field for out of Vocab (OOV) Token Representation for Hindi NER. IEEE Access, 12, 22707–22717. https://doi.org/10.1109/ACCESS.2024.3362645
