Introduction
NLP is short for Natural Language Processing, the research area I'm currently working on during my PhD.
We humans speak natural languages like Hindi, English, French, and Tamil. Humans can write, speak, and understand these languages quite easily, but the same cannot be said for computers. It is hard for a computer to understand and generate natural language.
NLP Research Areas
The research area of NLP is divided into several major categories. Regardless of the category, most NLP tasks follow a common pipeline, described next.
NLP Pipeline
The NLP pipeline is the sequence of steps followed to perform any natural language processing task.
Data Acquisition
We need data (text) to do any NLP task. Data can be acquired from various sources. Some of these sources are:
Web scraping (see the sketch after this list)
Publicly available datasets and corpora (e.g., from Kaggle, or the Penn Treebank)
Hackathon datasets
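For example, here is a minimal web-scraping sketch using requests and BeautifulSoup; the URL is just a placeholder:

```python
import requests
from bs4 import BeautifulSoup

# Fetch a page (placeholder URL) and parse the HTML
html = requests.get("https://example.com/article", timeout=10).text
soup = BeautifulSoup(html, "html.parser")

# Keep only visible paragraph text; drop scripts, styles, and markup
paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]
text = "\n".join(paragraphs)
print(text[:200])
```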
If the acquired data is not sufficient, we can use techniques like data augmentation to generate more data, as sketched below.
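A minimal sketch of one common augmentation technique, synonym replacement via WordNet; the synonym_replace function and its parameters are illustrative, not a standard API:

```python
import random
from nltk.corpus import wordnet  # requires a one-time nltk.download("wordnet")

def synonym_replace(tokens, n=1):
    """Replace up to n tokens with a random WordNet synonym."""
    out = list(tokens)
    indices = [i for i, tok in enumerate(out) if wordnet.synsets(tok)]
    random.shuffle(indices)
    for i in indices[:n]:
        synonyms = {lemma.name().replace("_", " ")
                    for syn in wordnet.synsets(out[i])
                    for lemma in syn.lemmas()}
        synonyms.discard(out[i])
        if synonyms:
            out[i] = random.choice(sorted(synonyms))
    return out

print(synonym_replace("the movie was good".split(), n=2))
```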
Sometimes the data combines multiple formats, such as text, images, and video. In that case, the textual content needs to be extracted from each of these sources.
Text Cleaning
We need to clean the data into an appropriate form to be able to process it further. Unclean data may contain spelling mistakes, special characters, HTML tags, etc.
Some common text cleaning steps are listed below; a small sketch combining several of them follows the list.
Unicode normalization: Convert machine-unreadable data such as symbols, graphic characters, special characters, and emojis into a consistent, machine-readable encoding.
Spelling correction
Removing extra whitespace
Handling emojis and emoticons
Removing HTML tags
Removing URLs
Removing words from undesired languages
Removing redundant text
Handling accented characters
Removing digits
Fixing broken sentences
Expanding contractions
Removing punctuation
Lowercase conversion
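Here is a small sketch combining several of these steps with plain regular expressions; the tiny contraction map is illustrative, and a real project would use a fuller resource (plus a spell checker for spelling correction):

```python
import re
import unicodedata

# A tiny contraction map for illustration only
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "it's": "it is"}

def clean_text(text: str) -> str:
    text = unicodedata.normalize("NFKC", text)           # Unicode normalization
    text = re.sub(r"<[^>]+>", " ", text)                 # remove HTML tags
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # remove URLs
    text = text.lower()                                  # lowercase conversion
    for contraction, expanded in CONTRACTIONS.items():
        text = text.replace(contraction, expanded)       # expand contractions
    text = re.sub(r"\d+", " ", text)                     # remove digits
    text = re.sub(r"[^\w\s]", " ", text)                 # remove punctuation
    text = re.sub(r"\s+", " ", text).strip()             # remove extra whitespace
    return text

print(clean_text("<p>Visit https://example.com ... it's GREAT!!! 123</p>"))
# -> "visit it is great"
```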
Text Preprocessing
Text pre-processing includes the following:
Tokenization:
Word Tokenization: Splitting text into individual words.
Sentence Tokenization: Splitting text into individual sentences.
Removing Stop Words: Eliminating common words that do not contribute significant meaning (e.g., "and," "the," "is").
Stemming/Lemmatization:
Stemming: Crudely stripping suffixes to reach a root form, which may not be a real word (e.g., "studies" to "studi").
Lemmatization: Reducing words to their base or dictionary form (e.g., "studies" to "study"), considering the context and part of speech.
Part Of Speech (POS) Tagging: It involves labeling each word in a sentence with its corresponding part of speech, such as noun, verb, adjective, adverb, etc. This helps in understanding the grammatical structure of a sentence.
Named Entity Recognition (NER): It involves identifying and classifying named entities in text into predefined categories such as the names of persons, organizations, locations, dates, quantities, and more.
There could be more steps depending on the dataset. A short sketch of several of these steps follows.
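A minimal sketch of tokenization, stop-word removal, lemmatization, POS tagging, and NER using spaCy; it assumes the small English model has been installed (python -m spacy download en_core_web_sm):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is opening a new office in Bangalore in March.")

# Tokenization, stop-word removal, lemmatization, and POS tagging
for token in doc:
    if not token.is_stop and not token.is_punct:
        print(token.text, token.lemma_, token.pos_)

# Named entity recognition
for ent in doc.ents:
    print(ent.text, ent.label_)
```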
Feature Engineering
Feature engineering is the process of transforming raw data into meaningful features that enhance the performance of machine learning models. It involves selecting, creating, and optimizing features to better capture the underlying patterns and relationships in the data. Sometimes, it might also involve dimension reduction.
Feature Extraction: Text vectorization techniques that convert text into numerical vector representations (a sketch of two of them follows this list).
Bag of Words (BoW): Converting text into a fixed-length vector by counting word occurrences.
TF-IDF (Term Frequency-Inverse Document Frequency): Weighing terms based on their frequency in a document relative to their frequency in the corpus.
Word Embeddings: Representing words as dense vectors (e.g., Word2Vec, GloVe).
Contextual Embeddings: Using models like BERT and GPT that provide contextual word representations.
One-Hot Encoding: Representing words as binary vectors.
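A minimal sketch of Bag of Words and TF-IDF using scikit-learn on a toy two-document corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ["the cat sat on the mat", "the dog chased the cat"]

# Bag of Words: raw term counts per document
bow = CountVectorizer()
print(bow.fit_transform(corpus).toarray())
print(bow.get_feature_names_out())

# TF-IDF: counts reweighted by how rare each term is across the corpus
tfidf = TfidfVectorizer()
print(tfidf.fit_transform(corpus).toarray().round(2))
```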
Feature Selection
Filter methods: Using statistical tests (e.g., chi-squared) to select the most relevant features; see the sketch after this list.
Wrapper methods: Using ML algorithms like Recursive Feature Elimination (RFE) to select features.
Embedded methods: Using algorithms that perform feature selection during model training (e.g., Lasso regression).
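As one example, a filter-method sketch using scikit-learn's SelectKBest with the chi-squared test; the toy texts and labels are illustrative:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

texts = ["great movie", "terrible movie", "great plot", "terrible acting"]
labels = [1, 0, 1, 0]

X = TfidfVectorizer().fit_transform(texts)

# Filter method: keep the k features with the highest chi-squared score
selector = SelectKBest(chi2, k=2)
X_selected = selector.fit_transform(X, labels)
print(X_selected.shape)  # (4, 2)
```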
Feature Optimization
Particle Swarm Optimization (PSO)
Genetic Algorithms (GA)
Firefly Algorithm (FA)
Cuckoo Search (CS)
Grey Wolf Optimizer (GWO)
Whale Optimization Algorithm (WOA)
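These are metaheuristic search methods often applied to choosing a subset of features. As a rough illustration, here is a toy genetic-algorithm sketch that evolves a binary feature mask; the fitness function is a stand-in for what would normally be a model's validation score:

```python
import random

N_FEATURES = 6
INFORMATIVE = {0, 2}  # toy ground truth: which features are actually useful

def fitness(mask):
    # Stand-in objective: reward selecting informative features,
    # lightly penalize mask size (prefer fewer features)
    hits = sum(1 for i, bit in enumerate(mask) if bit and i in INFORMATIVE)
    return hits - 0.1 * sum(mask)

def crossover(a, b):
    point = random.randrange(1, len(a))   # single-point crossover
    return a[:point] + b[point:]

def mutate(mask, rate=0.1):
    return [bit ^ (random.random() < rate) for bit in mask]  # random bit flips

population = [[random.randint(0, 1) for _ in range(N_FEATURES)] for _ in range(20)]
for _ in range(50):
    population.sort(key=fitness, reverse=True)
    parents = population[:10]              # keep the fittest half
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(10)]
    population = parents + children

print(max(population, key=fitness))        # best feature mask found
```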
Dimension Reduction
Principal Component Analysis (PCA): Transforms features into a lower-dimensional space while retaining most of the variance.
t-SNE (t-Distributed Stochastic Neighbor Embedding): Reduces dimensionality for visualization, capturing local structure in high-dimensional data.
UMAP (Uniform Manifold Approximation and Projection): Another dimensionality reduction technique that preserves more of the global structure.
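A minimal PCA sketch with scikit-learn, projecting TF-IDF vectors down to two components:

```python
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["the cat sat", "the dog ran", "cats and dogs play", "a quick brown fox"]
X = TfidfVectorizer().fit_transform(texts).toarray()  # PCA needs a dense array

pca = PCA(n_components=2)    # keep the 2 directions with the most variance
X_2d = pca.fit_transform(X)
print(X_2d.shape)                        # (4, 2)
print(pca.explained_variance_ratio_)     # variance retained per component
```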
Model Building
Train machine learning or deep learning models on the processed data.
Choose a Model: Select algorithms or architectures suited to the NLP task (e.g., classification, named entity recognition).
Train the Model: Fit the model on the training data and adjust parameters.
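A minimal sketch of both steps with scikit-learn: a TF-IDF vectorizer chained to a logistic-regression classifier on toy sentiment data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["loved it", "hated it", "brilliant film", "awful film"]
labels = [1, 0, 1, 0]

# Vectorize, then fit a linear classifier in one pipeline
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)
print(model.predict(["what a brilliant movie"]))
```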
Model Evaluation
Test the model and assess its performance using metrics such as:
Accuracy
Precision
Recall
F1 Score
Area Under Curve (AUC)
Use the trained model to make predictions or extract insights from new, unseen text data.
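These metrics can all be computed with scikit-learn; the toy labels and scores below are illustrative:

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]
y_score = [0.9, 0.2, 0.4, 0.8, 0.3, 0.7]  # predicted probabilities, for AUC

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1       :", f1_score(y_true, y_pred))
print("AUC      :", roc_auc_score(y_true, y_score))
```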
Model Deployment
Deploying the model on a platform like AWS, GCP, or others, usually behind a web API.
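As one option among many, a minimal serving sketch using FastAPI; the endpoint path and the pickled model file (model.pkl) are hypothetical:

```python
import pickle
from fastapi import FastAPI

app = FastAPI()

# Hypothetical pickled scikit-learn pipeline saved during model building
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

@app.get("/predict")
def predict(text: str):
    return {"label": int(model.predict([text])[0])}

# Run locally with: uvicorn main:app --reload
```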