NLP Overview

Introduction

NLP is short for Natural Language Processing, the research area I am currently working on in my PhD.

We humans speak natural languages like Hindi, English, French, and Tamil. We can write, speak, and understand these languages quite easily, but the same cannot be said for computers: it is hard for a computer to understand and generate natural language.

NLP Research Areas

NLP research spans several major areas, such as machine translation, sentiment analysis, text summarization, and question answering. Whatever the task, most work follows a common processing pipeline, described next.

NLP Pipeline

The NLP pipeline is the sequence of steps followed to perform any natural language processing task.

  1. Data Acquisition

We need text data to do any NLP task. Data can be acquired from various sources, including:

  1. Web scraping (see the sketch after this subsection)

  2. Publicly available datasets (e.g., from Kaggle, or the Penn Treebank corpus)

  3. Hackathon datasets

If the acquired data is not sufficient, we can use techniques like data augmentation to generate more.

Sometimes the data combines multiple formats, such as text, images, and video; in that case the textual content must be extracted from each of them.
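For the web-scraping source above, here is a minimal sketch using requests and BeautifulSoup. The URL is a hypothetical placeholder, and a real scraper should respect the site's robots.txt and terms of use.

```python
# Minimal web-scraping sketch: fetch a page and keep only its visible text.
import requests
from bs4 import BeautifulSoup

url = "https://example.com/article"  # hypothetical placeholder URL
response = requests.get(url, timeout=10)
response.raise_for_status()  # fail loudly on HTTP errors

soup = BeautifulSoup(response.text, "html.parser")
for tag in soup(["script", "style"]):  # drop non-text elements
    tag.decompose()

text = soup.get_text(separator=" ", strip=True)
print(text[:200])  # first 200 characters of the extracted text
```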

  2. Text Cleaning

We need to clean the data into an appropriate form before processing it further. Raw data may contain spelling mistakes, special characters, HTML tags, etc.

Some common text cleaning steps are listed below; a minimal code sketch follows the list.

  • Unicode Normalization: Convert symbols, graphic characters, special characters, and emojis into a consistent, machine-readable representation (or remove them).

  • Spelling correction

  • Removing extra whitespace

  • Handling emojis and emoticons

  • Removing HTML tags

  • Removing URLs

  • Removing non-desired language words

  • Removing redundant text

  • Handling accented characters

  • Removing digits

  • Fixing broken sentences

  • Expanding contractions

  • Removing punctuation

  • Lowercase conversion
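A minimal cleaning sketch covering a few of these steps with plain regular expressions; the sample string is made up for illustration.

```python
# Minimal text-cleaning sketch: HTML tags, URLs, digits, punctuation,
# lowercasing, and extra whitespace.
import re
import string

def clean_text(text: str) -> str:
    text = re.sub(r"<[^>]+>", " ", text)                 # remove HTML tags
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # remove URLs
    text = re.sub(r"\d+", " ", text)                     # remove digits
    text = text.translate(str.maketrans("", "", string.punctuation))  # remove punctuation
    text = text.lower()                                  # lowercase conversion
    return re.sub(r"\s+", " ", text).strip()             # collapse extra whitespace

sample = "<p>Check out https://example.com it has 100s of NLP resources!</p>"
print(clean_text(sample))  # -> "check out it has s of nlp resources"
```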

  3. Text Preprocessing

Text pre-processing includes the following:

  1. Tokenization:

    • Word Tokenization: Splitting text into individual words.

    • Sentence Tokenization: Splitting text into individual sentences.

  2. Removing Stop Words: Eliminating common words that do not contribute significant meaning (e.g., "and," "the," "is").

  3. Stemming/Lemmatization:

    • Stemming: Reducing words to their root form (e.g., "running" to "run").

    • Lemmatization: Reducing words to their base or dictionary form (e.g., "running" to "run"), considering the context.

  4. Part of Speech (POS) Tagging: labeling each word in a sentence with its part of speech, such as noun, verb, adjective, or adverb. This helps in understanding the grammatical structure of a sentence.

  5. Named Entity Recognition (NER): identifying and classifying named entities in text into predefined categories such as persons, organizations, locations, dates, and quantities.

Additional steps may be needed depending on the dataset.
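As a sketch, here is what the core pre-processing steps might look like with NLTK; the sample sentence is made up, and resource names can vary slightly across NLTK versions.

```python
# Minimal pre-processing sketch with NLTK: tokenization, stop-word removal,
# stemming, lemmatization, POS tagging, and NER.
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# One-time resource downloads (names may differ in newer NLTK releases).
for pkg in ["punkt", "stopwords", "wordnet",
            "averaged_perceptron_tagger", "maxent_ne_chunker", "words"]:
    nltk.download(pkg, quiet=True)

text = "Barack Obama was running in Washington. The dogs were barking."

print(nltk.sent_tokenize(text))               # sentence tokenization
tokens = nltk.word_tokenize(text)             # word tokenization

stop_words = set(stopwords.words("english"))
filtered = [t for t in tokens if t.lower() not in stop_words]  # stop-word removal

stemmer, lemmatizer = PorterStemmer(), WordNetLemmatizer()
print([stemmer.stem(t) for t in filtered])                   # stemming
print([lemmatizer.lemmatize(t, pos="v") for t in filtered])  # lemmatization (as verbs)

tagged = nltk.pos_tag(tokens)                 # POS tagging
print(nltk.ne_chunk(tagged))                  # named entity recognition
```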

  4. Feature Engineering

Feature engineering is the process of transforming raw data into meaningful features that enhance the performance of machine learning models. It involves selecting, creating, and optimizing features to better capture the underlying patterns and relationships in the data. Sometimes, it might also involve dimension reduction.

  1. Feature Extraction: text vectorization techniques that convert text into numerical vector representations.

    • Bag of Words (BoW): Converting text into a fixed-length vector by counting word occurrences.

    • TF-IDF (Term Frequency-Inverse Document Frequency): Weighing terms based on their frequency in a document relative to their frequency in the corpus.

    • Word Embeddings: Representing words as dense vectors (e.g., Word2Vec, GloVe).

    • Contextual Embeddings: Using models like BERT and GPT that provide contextual word representations.

    • One-Hot Encoding: Representing words as binary vectors.
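A minimal sketch of Bag of Words and TF-IDF with scikit-learn, on a made-up toy corpus:

```python
# Minimal feature-extraction sketch: Bag of Words vs. TF-IDF.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    "the movie was great",
    "the movie was terrible",
    "a great great film",
]

bow = CountVectorizer()                  # Bag of Words: raw term counts
X_bow = bow.fit_transform(corpus)
print(bow.get_feature_names_out())
print(X_bow.toarray())

tfidf = TfidfVectorizer()                # TF-IDF: counts reweighted by rarity
X_tfidf = tfidf.fit_transform(corpus)
print(X_tfidf.toarray().round(2))
```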

  2. Feature Selection

    1. Filter Method: Use statistical tests to select the most relevant features.

    2. Wrapper Method: Using ML algorithms like Recursive Feature Elimination (RFE) to select features.

    3. Embedded Method: Using algorithms that perform feature selection during model training (e.g., Lasso regression)
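A minimal sketch of a filter method (chi-squared scores via SelectKBest) and a wrapper method (RFE) in scikit-learn; the corpus and labels are made up:

```python
# Minimal feature-selection sketch on TF-IDF features.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import RFE, SelectKBest, chi2
from sklearn.linear_model import LogisticRegression

corpus = ["good movie", "bad movie", "good film", "awful plot"]
labels = [1, 0, 1, 0]  # toy sentiment labels

X = TfidfVectorizer().fit_transform(corpus)

# Filter method: keep the k features with the highest chi-squared score.
X_filtered = SelectKBest(chi2, k=3).fit_transform(X, labels)

# Wrapper method: recursively eliminate features using model coefficients.
rfe = RFE(LogisticRegression(), n_features_to_select=3)
X_wrapped = rfe.fit_transform(X, labels)

print(X_filtered.shape, X_wrapped.shape)  # (4, 3) (4, 3)
```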

  3. Feature Optimization

    1. Particle Swarm Optimization (PSO)

    2. Genetic Algorithms (GA)

    3. Firefly Algorithm (FA)

    4. Cuckoo Search (CS)

    5. Grey Wolf Optimizer (GWO)

    6. Whale Optimization Algorithm (WOA)
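These are all population-based metaheuristics. As an illustration of the general idea rather than any particular library's API, here is a toy genetic-algorithm sketch for feature-subset selection: individuals are binary masks over the feature columns, and fitness is cross-validated accuracy on the selected subset.

```python
# Toy genetic algorithm for feature-subset selection (illustrative only).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

def fitness(mask):
    if mask.sum() == 0:                    # an empty subset scores zero
        return 0.0
    model = LogisticRegression(max_iter=1000)
    return cross_val_score(model, X[:, mask.astype(bool)], y, cv=3).mean()

pop = rng.integers(0, 2, size=(10, X.shape[1]))    # random initial masks
for generation in range(15):
    scores = np.array([fitness(m) for m in pop])
    parents = pop[np.argsort(scores)[-5:]]         # keep the 5 fittest masks
    children = []
    for _ in range(5):
        a, b = parents[rng.integers(5)], parents[rng.integers(5)]
        cut = rng.integers(1, X.shape[1])          # one-point crossover
        child = np.concatenate([a[:cut], b[cut:]])
        flip = rng.random(X.shape[1]) < 0.05       # mutation: flip ~5% of bits
        children.append(np.where(flip, 1 - child, child))
    pop = np.vstack([parents, children])

best = pop[np.argmax([fitness(m) for m in pop])]
print("selected feature indices:", np.flatnonzero(best))
```

The other algorithms in the list follow the same loop, differing mainly in how candidate solutions are generated and updated at each iteration.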

  4. Dimension Reduction

    1. Principal Component Analysis (PCA): Transforms features into a lower-dimensional space while retaining most of the variance.

    2. t-SNE (t-Distributed Stochastic Neighbor Embedding): Reduces dimensionality for visualization, capturing local structure in high-dimensional data.

    3. UMAP (Uniform Manifold Approximation and Projection): Another dimensionality reduction technique that preserves more of the global structure.
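A minimal PCA sketch on dense TF-IDF vectors (for large sparse matrices, TruncatedSVD is the usual substitute, since PCA requires dense input):

```python
# Minimal dimension-reduction sketch with PCA.
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["the movie was great", "the movie was terrible", "a great film"]
X = TfidfVectorizer().fit_transform(corpus).toarray()  # densify for PCA

pca = PCA(n_components=2)            # keep the top-2 variance directions
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)               # (3, 2)
print(pca.explained_variance_ratio_) # variance retained per component
```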

  5. Model Building

Train machine learning or deep learning models on the processed data.

  • Choose a Model: Select algorithms or architectures suited to the NLP task (e.g., classification, named entity recognition).

  • Train the Model: Fit the model on the training data and adjust parameters.
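A minimal model-building sketch: a scikit-learn pipeline feeding TF-IDF features into a Naive Bayes classifier, trained on made-up toy data:

```python
# Minimal model-building sketch for text classification.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["great movie", "awful movie", "loved the film", "hated the plot"]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative (toy labels)

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(texts, labels)                     # train the model
print(model.predict(["what a great film"]))  # -> [1]
```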

  6. Model Evaluation

Test the model and assess its performance using metrics such as:

  • Accuracy

  • Precision

  • Recall

  • F1 Score

  • Area Under Curve (AUC)

Then use the trained model to make predictions or extract insights from new, unseen text data.
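A minimal sketch computing these metrics with scikit-learn, on made-up labels and scores:

```python
# Minimal evaluation sketch for a binary classifier.
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0]                    # ground-truth labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]                    # predicted labels
y_score = [0.9, 0.2, 0.8, 0.4, 0.1, 0.7, 0.6, 0.3]   # predicted probabilities

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1 Score :", f1_score(y_true, y_pred))
print("AUC      :", roc_auc_score(y_true, y_score))  # AUC uses scores, not labels
```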

  7. Model Deployment

Deploy the model on a platform such as AWS, GCP, or another cloud provider, typically behind an API.
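As one hedged illustration, the trained pipeline can be wrapped in a small Flask API and hosted on any of these platforms; model.pkl is a hypothetical file saved earlier with joblib.

```python
# Minimal deployment sketch: serve predictions over HTTP with Flask.
import joblib
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load("model.pkl")  # hypothetical pre-trained pipeline

@app.route("/predict", methods=["POST"])
def predict():
    text = request.get_json()["text"]            # e.g. {"text": "great movie"}
    prediction = model.predict([text])[0]
    return jsonify({"prediction": int(prediction)})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)  # behind a WSGI server in production
```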

Some Famous NLP Datasets

  • Penn Treebank Corpus

  • NLTK Corpus

  • IMDB Movie dataset

  • Sentiment 140

  • Stanford Sentiment TreeBank

  • Jeopardy Dataset

  • ArXiv

  • Yelp Review Dataset

  • UCI's Spambase

  • Enron Dataset
