Introduction
NLP is short for Natural Language Processing, the research area I'm currently working on during my PhD.
We humans speak natural languages like Hindi, English, French, and Tamil. Humans can write, speak, and understand these languages quite easily, but the same cannot be said for computers. It is hard for a computer to understand and generate natural language.
NLP Research Areas
The research area of NLP is divided into several major categories. Regardless of the category, most NLP tasks follow a common pipeline, described next.
NLP Pipeline
The NLP pipeline is the sequence of steps followed to perform any natural language processing task.
Data Acquisition
We need data (text) to do any NLP task. Data can be acquired from various sources. Some of these sources are:
Web scraping (see the sketch after this list)
Publicly available datasets and corpora (e.g., from Kaggle, or the Penn Treebank)
Hackathon datasets
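For example, here is a minimal web-scraping sketch using requests and BeautifulSoup; the URL is just a placeholder:

```python
import requests
from bs4 import BeautifulSoup

# Fetch a page (placeholder URL) and parse the HTML
html = requests.get("https://example.com/article", timeout=10).text
soup = BeautifulSoup(html, "html.parser")

# Keep only visible paragraph text; drop scripts, styles, and markup
paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]
text = "\n".join(paragraphs)
print(text[:200])
```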
If the acquired data is not sufficient, we can use techniques like data augmentation to generate more data, as sketched below.
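A minimal sketch of one common augmentation technique, synonym replacement via WordNet; the synonym_replace function and its parameters are illustrative, not a standard API:

```python
import random
from nltk.corpus import wordnet  # requires a one-time nltk.download("wordnet")

def synonym_replace(tokens, n=1):
    """Replace up to n tokens with a random WordNet synonym."""
    out = list(tokens)
    indices = [i for i, tok in enumerate(out) if wordnet.synsets(tok)]
    random.shuffle(indices)
    for i in indices[:n]:
        synonyms = {lemma.name().replace("_", " ")
                    for syn in wordnet.synsets(out[i])
                    for lemma in syn.lemmas()}
        synonyms.discard(out[i])
        if synonyms:
            out[i] = random.choice(sorted(synonyms))
    return out

print(synonym_replace("the movie was good".split(), n=2))
```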
Sometimes the data combines multiple formats, such as text, images, and video. In that case, the textual content needs to be extracted from each of these sources.
Text Cleaning
We need to clean the data into an appropriate form to be able to process it further. Unclean data may contain spelling mistakes, special characters, HTML tags, etc.
Some common text cleaning steps are listed below; a small sketch combining several of them follows the list.
Unicode normalization: Convert machine-unreadable data such as symbols, graphic characters, special characters, and emojis into a consistent, machine-readable encoding.
Spelling correction
Removing extra whitespace
Handling emojis and emoticons
Removing HTML tags
Removing URLs
Removing words from undesired languages
Removing redundant text
Handling accented characters
Removing digits
Fixing broken sentences
Expanding contractions
Removing punctuation
Lowercase conversion
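Here is a small sketch combining several of these steps with plain regular expressions; the tiny contraction map is illustrative, and a real project would use a fuller resource (plus a spell checker for spelling correction):

```python
import re
import unicodedata

# A tiny contraction map for illustration only
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "it's": "it is"}

def clean_text(text: str) -> str:
    text = unicodedata.normalize("NFKC", text)           # Unicode normalization
    text = re.sub(r"<[^>]+>", " ", text)                 # remove HTML tags
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # remove URLs
    text = text.lower()                                  # lowercase conversion
    for contraction, expanded in CONTRACTIONS.items():
        text = text.replace(contraction, expanded)       # expand contractions
    text = re.sub(r"\d+", " ", text)                     # remove digits
    text = re.sub(r"[^\w\s]", " ", text)                 # remove punctuation
    text = re.sub(r"\s+", " ", text).strip()             # remove extra whitespace
    return text

print(clean_text("<p>Visit https://example.com ... it's GREAT!!! 123</p>"))
# -> "visit it is great"
```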
Text Preprocessing
Text pre-processing includes the following:
Tokenization:
Word Tokenization: Splitting text into individual words.
Sentence Tokenization: Splitting text into individual sentences.
Removing Stop Words: Eliminating common words that do not contribute significant meaning (e.g., "and," "the," "is").
Stemming/Lemmatization:
Stemming: Crudely stripping suffixes to reach a root form, which may not be a real word (e.g., "studies" to "studi").
Lemmatization: Reducing words to their base or dictionary form (e.g., "studies" to "study"), considering the context and part of speech.
Part Of Speech (POS) Tagging: It involves labeling each word in a sentence with its corresponding part of speech, such as noun, verb, adjective, adverb, etc. This helps in understanding the grammatical structure of a sentence.
Named Entity Recognition (NER): It involves identifying and classifying named entities in text into predefined categories such as the names of persons, organizations, locations, dates, quantities, and more.
There could be more steps depending on the dataset. A short sketch of several of these steps follows.
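A minimal sketch of tokenization, stop-word removal, lemmatization, POS tagging, and NER using spaCy; it assumes the small English model has been installed (python -m spacy download en_core_web_sm):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is opening a new office in Bangalore in March.")

# Tokenization, stop-word removal, lemmatization, and POS tagging
for token in doc:
    if not token.is_stop and not token.is_punct:
        print(token.text, token.lemma_, token.pos_)

# Named entity recognition
for ent in doc.ents:
    print(ent.text, ent.label_)
```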
Feature Engineering
Feature engineering is the process of transforming raw data into meaningful features that enhance the performance of machine learning models. It involves selecting, creating, and optimizing features to better capture the underlying patterns and relationships in the data. Sometimes, it might also involve dimension reduction.
Feature Extraction: Text vectorization techniques that convert text into numerical vector representations (a sketch of two of them follows this list).
Bag of Words (BoW): Converting text into a fixed-length vector by counting word occurrences.
TF-IDF (Term Frequency-Inverse Document Frequency): Weighing terms based on their frequency in a document relative to their frequency in the corpus.
Word Embeddings: Representing words as dense vectors (e.g., Word2Vec, GloVe).
Contextual Embeddings: Using models like BERT and GPT that provide contextual word representations.
One-Hot Encoding: Representing words as binary vectors.
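A minimal sketch of Bag of Words and TF-IDF using scikit-learn on a toy two-document corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ["the cat sat on the mat", "the dog chased the cat"]

# Bag of Words: raw term counts per document
bow = CountVectorizer()
print(bow.fit_transform(corpus).toarray())
print(bow.get_feature_names_out())

# TF-IDF: counts reweighted by how rare each term is across the corpus
tfidf = TfidfVectorizer()
print(tfidf.fit_transform(corpus).toarray().round(2))
```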
Feature Selection
Filter methods: Using statistical tests (e.g., chi-squared) to select the most relevant features; see the sketch after this list.
Wrapper methods: Using ML algorithms like Recursive Feature Elimination (RFE) to select features.
Embedded methods: Using algorithms that perform feature selection during model training (e.g., Lasso regression).
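As one example, a filter-method sketch using scikit-learn's SelectKBest with the chi-squared test; the toy texts and labels are illustrative:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

texts = ["great movie", "terrible movie", "great plot", "terrible acting"]
labels = [1, 0, 1, 0]

X = TfidfVectorizer().fit_transform(texts)

# Filter method: keep the k features with the highest chi-squared score
selector = SelectKBest(chi2, k=2)
X_selected = selector.fit_transform(X, labels)
print(X_selected.shape)  # (4, 2)
```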
Feature Optimization
Particle Swarm Optimization (PSO)
Genetic Algorithms (GA)
Firefly Algorithm (FA)
Cuckoo Search (CS)
Grey Wolf Optimizer (GWO)
Whale Optimization Algorithm (WOA)
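These are metaheuristic search methods often applied to choosing a subset of features. As a rough illustration, here is a toy genetic-algorithm sketch that evolves a binary feature mask; the fitness function is a stand-in for what would normally be a model's validation score:

```python
import random

N_FEATURES = 6
INFORMATIVE = {0, 2}  # toy ground truth: which features are actually useful

def fitness(mask):
    # Stand-in objective: reward selecting informative features,
    # lightly penalize mask size (prefer fewer features)
    hits = sum(1 for i, bit in enumerate(mask) if bit and i in INFORMATIVE)
    return hits - 0.1 * sum(mask)

def crossover(a, b):
    point = random.randrange(1, len(a))   # single-point crossover
    return a[:point] + b[point:]

def mutate(mask, rate=0.1):
    return [bit ^ (random.random() < rate) for bit in mask]  # random bit flips

population = [[random.randint(0, 1) for _ in range(N_FEATURES)] for _ in range(20)]
for _ in range(50):
    population.sort(key=fitness, reverse=True)
    parents = population[:10]              # keep the fittest half
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(10)]
    population = parents + children

print(max(population, key=fitness))        # best feature mask found
```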
Dimension Reduction
Principal Component Analysis (PCA): Transforms features into a lower-dimensional space while retaining most of the variance.
t-SNE (t-Distributed Stochastic Neighbor Embedding): Reduces dimensionality for visualization, capturing local structure in high-dimensional data.
UMAP (Uniform Manifold Approximation and Projection): Another dimensionality reduction technique that preserves more of the global structure.
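A minimal PCA sketch with scikit-learn, projecting TF-IDF vectors down to two components:

```python
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["the cat sat", "the dog ran", "cats and dogs play", "a quick brown fox"]
X = TfidfVectorizer().fit_transform(texts).toarray()  # PCA needs a dense array

pca = PCA(n_components=2)    # keep the 2 directions with the most variance
X_2d = pca.fit_transform(X)
print(X_2d.shape)                        # (4, 2)
print(pca.explained_variance_ratio_)     # variance retained per component
```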
Model Building
Train machine learning or deep learning models on the processed data.
Choose a Model: Select algorithms or architectures suited to the NLP task (e.g., classification, named entity recognition).
Train the Model: Fit the model on the training data and adjust parameters.
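A minimal sketch of both steps with scikit-learn: a TF-IDF vectorizer chained to a logistic-regression classifier on toy sentiment data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["loved it", "hated it", "brilliant film", "awful film"]
labels = [1, 0, 1, 0]

# Vectorize, then fit a linear classifier in one pipeline
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)
print(model.predict(["what a brilliant movie"]))
```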
Model Evaluation
Test the model and assess its performance using metrics such as:
Accuracy
Precision
Recall
F1 Score
Area Under Curve (AUC)
Use the trained model to make predictions or extract insights from new, unseen text data.
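These metrics can all be computed with scikit-learn; the toy labels and scores below are illustrative:

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]
y_score = [0.9, 0.2, 0.4, 0.8, 0.3, 0.7]  # predicted probabilities, for AUC

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1       :", f1_score(y_true, y_pred))
print("AUC      :", roc_auc_score(y_true, y_score))
```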
Model Deployment
Deploying the model on a platform like AWS, GCP, or others, usually behind a web API.
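As one option among many, a minimal serving sketch using FastAPI; the endpoint path and the pickled model file (model.pkl) are hypothetical:

```python
import pickle
from fastapi import FastAPI

app = FastAPI()

# Hypothetical pickled scikit-learn pipeline saved during model building
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

@app.get("/predict")
def predict(text: str):
    return {"label": int(model.predict([text])[0])}

# Run locally with: uvicorn main:app --reload
```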