Natural Language Processing (NLP) is a research field in which natural language text is processed to understand its syntactic, semantic, and sentiment-related aspects. Advances in NLP have helped solve problems in domains such as Neural Machine Translation, Named Entity Recognition, Sentiment Analysis, and Chatbots, to name a few. NLP broadly consists of two main parts: the representation of the input text (raw data) in a numerical format (vectors or matrices) and the design of models for processing that numerical data. This paper focuses on the former and surveys how the NLP field has evolved from rule-based and statistical methods to more context-sensitive learned representations. For each embedding type, we describe the representation itself, the issues it addressed, its limitations, and its applications. The survey covers the history of text representations from the 1970s onward, from regular expressions to the latest vector representations used to encode raw text, and it demonstrates how the NLP field progressed over time from comprehending only bits and pieces of the text to capturing all of its significant aspects.
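To make the distinction between raw text and its numerical representation concrete, here is a minimal sketch of one of the earliest embedding ideas this kind of survey covers, a bag-of-words count vector; the toy corpus and vocabulary below are illustrative, not drawn from the paper.

```python
from collections import Counter

# Toy corpus (illustrative only): each document becomes one vector.
corpus = ["the cat sat", "the dog sat on the mat"]

# Fixed vocabulary shared by all documents; each word is one dimension.
vocab = sorted({word for doc in corpus for word in doc.split()})

def to_vector(text: str) -> list[int]:
    """Map raw text to a vector of word counts over the shared vocabulary."""
    counts = Counter(text.split())
    return [counts.get(word, 0) for word in vocab]

for doc in corpus:
    print(doc, "->", to_vector(doc))
```

Representations of this kind capture word occurrence but not word order or context, which is precisely the limitation that the later, context-sensitive embeddings discussed in the survey address.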