Computer vision, a field of artificial intelligence, involves the development of algorithms and models that allow machines to understand and process images and videos similarly to how humans do. Typical tasks include object detection, image classification, image segmentation, and facial recognition.

Prerequisite: Basic machine learning concepts

Computer Vision Pipeline

Step 1: Image/Video Acquisition

This is the first step in the pipeline, where images or videos are captured using cameras or other sensors. The quality of the images is crucial for the accuracy of the subsequent steps.

You can also use existing datasets. Some of the most common datasets for computer vision tasks are listed in the "Some Famous CV Datasets" section at the end of this article.

Step 2: Preprocessing

In this stage, the raw image/video data is prepared for further analysis. This may involve tasks such as:

Noise reduction: Removing unwanted noise or artifacts from the images.
Geometric transformations: Correcting distortions or perspective shifts.
Normalization: Ensuring that all images have a consistent format and scale.

These preprocessing steps make the images easier to work with and analyze.
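As a rough illustration of two of these steps, here is a minimal sketch of noise reduction (a 3x3 mean filter) and min-max normalization on a tiny grayscale image stored as nested lists. The function names `mean_filter` and `normalize` are made up for this example, and plain Python is used only so the snippet runs without dependencies. In practice you would most likely reach for a library such as OpenCV or NumPy.

```python
def mean_filter(image):
    """Reduce noise with a 3x3 mean (box) filter.

    Border pixels are averaged over the neighbours that actually exist.
    """
    height, width = len(image), len(image[0])
    smoothed = []
    for y in range(height):
        row = []
        for x in range(width):
            neighbours = [
                image[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
            ]
            row.append(sum(neighbours) / len(neighbours))
        smoothed.append(row)
    return smoothed


def normalize(image):
    """Rescale pixel values linearly to the range [0.0, 1.0]."""
    flat = [pixel for row in image for pixel in row]
    low, high = min(flat), max(flat)
    if high == low:  # flat image: avoid division by zero
        return [[0.0 for _ in row] for row in image]
    return [[(pixel - low) / (high - low) for pixel in row] for row in image]


# A flat image with one noisy spike in the centre.
noisy = [
    [10, 10, 10],
    [10, 100, 10],
    [10, 10, 10],
]
smoothed = mean_filter(noisy)   # the spike of 100 is averaged down to 20.0
scaled = normalize(smoothed)    # all values now lie in [0.0, 1.0]
```

A mean filter is the simplest possible choice here. Gaussian or median filters usually preserve edges better, at the cost of a little more code.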

Step 3: Feature Extraction

This stage involves identifying and extracting relevant features from the image that can be used for further analysis or classification. Some common features include:

Edges: Boundaries between regions with different intensities.
Corners: Points where edges intersect.
Textures: Patterns or structures within an image.
Descriptors: Numerical representations of image features.
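To make the idea of an edge feature concrete, the sketch below computes the Sobel gradient magnitude of a small grayscale image. Sobel is just one classic edge operator, chosen here as an example. The helper name `sobel_magnitude` is an assumption of this snippet, and the image is a nested list so the code stays dependency-free.

```python
import math

# Standard 3x3 Sobel kernels for horizontal and vertical gradients.
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def sobel_magnitude(image):
    """Return the gradient magnitude per pixel (border pixels stay 0.0)."""
    height, width = len(image), len(image[0])
    magnitude = [[0.0] * width for _ in range(height)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = gy = 0
            for ky in range(3):
                for kx in range(3):
                    pixel = image[y + ky - 1][x + kx - 1]
                    gx += SOBEL_X[ky][kx] * pixel
                    gy += SOBEL_Y[ky][kx] * pixel
            magnitude[y][x] = math.hypot(gx, gy)
    return magnitude


# Dark left half, bright right half: a vertical edge down the middle.
image = [
    [0, 0, 255, 255],
    [0, 0, 255, 255],
    [0, 0, 255, 255],
    [0, 0, 255, 255],
]
edges = sobel_magnitude(image)
```

Interior pixels next to the dark-to-bright boundary get a large magnitude, while a uniformly coloured image would produce zeros everywhere.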

Step 4: Prediction/Recognition Model

The machine learning model you build depends on the task you are trying to solve.

For object detection, the model must identify the location and boundaries of objects within an image. This can be achieved using techniques such as:

Sliding window: A window is moved across the image, and a classifier is applied at each location to determine if an object is present.
Region-based methods: Regions of interest are identified and classified.
Deep learning-based methods: Convolutional neural networks (CNNs) are trained to directly detect objects in images.
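The sliding window approach is simple enough to sketch in a few lines. In the example below, `toy_classifier` is a hypothetical stand-in for a trained model (it just scores a patch by its mean brightness), and the other function names are also made up for illustration.

```python
def sliding_windows(image_width, image_height, window_size, stride):
    """Yield (x, y, width, height) for every window that fits inside the image."""
    win_w, win_h = window_size
    for y in range(0, image_height - win_h + 1, stride):
        for x in range(0, image_width - win_w + 1, stride):
            yield (x, y, win_w, win_h)


def crop(image, x, y, w, h):
    """Cut a w-by-h patch whose top-left corner is at (x, y)."""
    return [row[x:x + w] for row in image[y:y + h]]


def toy_classifier(patch):
    """Hypothetical stand-in for a real model: mean brightness scaled to [0, 1]."""
    pixels = [pixel for row in patch for pixel in row]
    return sum(pixels) / (255 * len(pixels))


def detect(image, window_size, stride, classifier, threshold=0.5):
    """Run the classifier on every window and keep those scoring >= threshold."""
    height, width = len(image), len(image[0])
    detections = []
    for x, y, w, h in sliding_windows(width, height, window_size, stride):
        score = classifier(crop(image, x, y, w, h))
        if score >= threshold:
            detections.append((x, y, w, h, score))
    return detections


# A bright 2x2 "object" in the bottom-right corner of a dark image.
image = [
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 255, 255],
    [0, 0, 255, 255],
]
detections = detect(image, window_size=(2, 2), stride=1,
                    classifier=toy_classifier, threshold=0.9)
# detections == [(2, 2, 2, 2, 1.0)]
```

The cost of this approach grows quickly with image size, the number of window scales, and a smaller stride, which is one reason region-based and CNN-based detectors are usually preferred.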

For object recognition and classification, each detected object is assigned a label or category. This can be done using:

Traditional machine learning algorithms: Such as support vector machines (SVMs) or random forests.
Deep learning models: Such as CNNs or recurrent neural networks (RNNs).

Some post-processing may be needed to refine the results and prepare them for further use. This may involve tasks such as:

Non-maximum suppression: Removing duplicate detections.
Tracking: Following objects over time in videos.
Visualization: Displaying the results in a human-understandable format.
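Non-maximum suppression can be sketched as a greedy loop: keep the highest-scoring box, drop every remaining box that overlaps it too much, and repeat. The box format `(x1, y1, x2, y2)`, the function names, and the 0.5 overlap threshold below are assumptions made for this example.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(detections, iou_threshold=0.5):
    """Greedy NMS over a list of (box, score) pairs.

    Returns the kept detections, highest score first.
    """
    remaining = sorted(detections, key=lambda d: d[1], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)
        kept.append(best)
        # Discard boxes that overlap the one we just kept too strongly.
        remaining = [d for d in remaining if iou(best[0], d[0]) <= iou_threshold]
    return kept


detections = [
    ((10, 10, 50, 50), 0.9),
    ((12, 12, 52, 52), 0.8),      # near-duplicate of the first box
    ((100, 100, 140, 140), 0.7),  # a separate object
]
kept = non_max_suppression(detections)
# kept contains the first and third detections; the near-duplicate is removed
```

For multi-class detectors, this is typically applied per class so that overlapping objects of different categories do not suppress each other.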

Step 5: Model Evaluation

Evaluating a computer vision model is crucial to ensure its performance and reliability in real-world applications. Common metrics and tools include:

Accuracy: The proportion of correct predictions.
Precision: The proportion of true positive predictions among all predicted positive instances.
Recall: The proportion of true positive predictions among all actual positive instances.
F1-score: The harmonic mean of precision and recall.
Mean Average Precision (mAP): A common metric for object detection tasks.
Intersection over Union (IoU): Measures the overlap between predicted and ground truth bounding boxes.
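Most of these metrics reduce to a few lines of arithmetic. The sketch below computes accuracy, precision, recall, and F1 from raw counts of true/false positives and negatives, plus IoU for two bounding boxes. The function names and the `(x1, y1, x2, y2)` box format are assumptions for this example. mAP is left out because it needs ranked detections across many images.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute basic metrics from counts of true/false positives/negatives."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # correct among predicted positives
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # found among actual positives
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


metrics = classification_metrics(tp=8, fp=2, fn=4, tn=6)
# accuracy = 14/20 = 0.7, precision = 8/10 = 0.8, recall = 8/12, f1 = 16/22

overlap = iou((0, 0, 10, 10), (5, 5, 15, 15))
# intersection = 25, union = 175, so IoU = 1/7
```

In object detection, a predicted box is commonly counted as a true positive only if its IoU with a ground truth box exceeds a chosen threshold, which ties the two halves of this example together.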

Confusion Matrix: A table that summarizes the performance of a classification model.
ROC Curve: A graphical plot that illustrates the performance of a binary classifier.
Precision-Recall Curve: A graphical plot that shows the trade-off between precision and recall.
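A confusion matrix is easy to build by hand, which helps when reading one. Here is a minimal sketch with rows as actual classes and columns as predicted classes. The labels and predictions are invented for illustration.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred, labels):
    """Return a matrix where entry [i][j] counts samples of actual class
    labels[i] that were predicted as labels[j]."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(actual, predicted)] for predicted in labels]
            for actual in labels]


labels = ["cat", "dog"]
y_true = ["cat", "cat", "dog", "dog", "dog"]
y_pred = ["cat", "dog", "dog", "dog", "cat"]

matrix = confusion_matrix(y_true, y_pred, labels)
# matrix == [[1, 1],
#            [1, 2]]
```

The diagonal holds the correct predictions. Everything off the diagonal is an error, and its position tells you which classes the model confuses with each other.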

Step 6: Model Deployment

Once a model has been evaluated and deemed satisfactory, it can be deployed into a production environment.

Models can be deployed on cloud platforms such as AWS, GCP, and Azure, which offer a range of services for deploying machine learning models.

Computer Vision Research Areas

Research Areas based on Task

Image Processing

Image enhancement (noise reduction, sharpening, contrast adjustment)
Image restoration (deblurring, super-resolution)
Image segmentation (object/region identification)
Image registration (aligning images)

Object Detection and Recognition

Object detection (locating and classifying objects in images)
Object tracking (following objects in videos)
Facial recognition (identifying and verifying individuals)
Action recognition (understanding human or animal actions)
Gesture recognition (interpreting hand and body movements)

Image and Video Understanding

Image captioning (generating text descriptions for images)
Visual question answering (answering questions about images)
Video summarization (creating concise representations of videos)
Video action recognition (understanding actions in videos)

Image and Video Generation

Image generation (creating new images)
Image editing (modifying existing images)
Video generation (creating new videos)
Style transfer (transferring style from one image to another)

Research Areas based on Application

Medical Image Analysis

Medical image segmentation
Disease detection (cancer, Alzheimer's, etc.)
Medical image registration
Computer-aided diagnosis

Autonomous Vehicles

Object detection and tracking
Lane detection
Pedestrian detection
Traffic sign recognition

Robotics

Visual servoing
Object manipulation
Simultaneous Localization and Mapping (SLAM)
Human-robot interaction

Surveillance

Person re-identification
Anomaly detection
Crowd analysis

Augmented Reality

Object tracking and recognition
3D reconstruction
Image registration

Research Areas based on Techniques

Deep Learning

Convolutional Neural Networks (CNNs)
Recurrent Neural Networks (RNNs)
Generative Adversarial Networks (GANs)
Transformers
Self-supervised learning
Reinforcement learning

Traditional Computer Vision

Feature extraction (SIFT, SURF, HOG)
Image matching
Stereo vision
Structure from motion

Emerging Research Areas

Explainable AI
Adversarial Attacks and Defenses
Few-shot Learning
Unsupervised and Self-supervised Learning
Video Understanding
3D Vision and Augmented Reality
Large Scale Multi-modal ML Models

Some Famous CV Datasets

| Category | Dataset | Details | Link |
|---|---|---|---|
| Image Classification | CIFAR-10/100 | Contains 32x32 color images in 10 (CIFAR-10) or 100 (CIFAR-100) categories. | CIFAR-10 and CIFAR-100 datasets (toronto.edu) |
| | ImageNet | A large-scale dataset with over 14 million images across 20,000 categories. | [ImageNet Object Localization Challenge, Kaggle](https://www.kaggle.com/c/imagenet-object-localization-challenge/overview/description) |
| | MNIST | A handwritten digits dataset commonly used for testing machine learning algorithms. | MNIST Dataset (kaggle.com) |
| Object Detection | PASCAL VOC | A dataset with annotated images for object detection, segmentation, and person layout. | PASCAL VOC 2012 DATASET (kaggle.com) |
| | COCO (Common Objects in Context) | A large-scale dataset with object instances, person keypoints, stuff segmentation, and captions. | COCO - Common Objects in Context (cocodataset.org) |
| | Open Images | A dataset with over 9 million images and bounding box annotations for object detection. | Open Images Dataset V7 (storage.googleapis.com) |
| Semantic Segmentation | Cityscapes | A dataset for urban scene understanding with pixel-level semantic labels. | Cityscapes Dataset – Semantic Understanding of Urban Street Scenes (cityscapes-dataset.com) |
| | ADE20K | A large-scale dataset with dense pixel-level annotations for a variety of scenes. | ADE20K dataset (mit.edu) |
| | CamVid | A dataset with video sequences annotated with pixel-level semantic labels. | CamVid (Cambridge-Driving Labeled Video Database) (kaggle.com) |
| Facial Recognition | Labeled Faces in the Wild (LFW) | A dataset of unconstrained face images used for face verification and recognition. | Labelled Faces in the Wild (LFW) Dataset (kaggle.com) |
| | MegaFace | A large-scale face recognition benchmark with over 1 million images of about 690K identities. | MegaFace (washington.edu) |
| | MS-Celeb-1M | A large-scale dataset of celebrity faces with about 10 million images of roughly 100K identities. | |
| Action Recognition | UCF101 | A dataset with 101 human action categories and over 13K videos. | UCF101 - Action Recognition (kaggle.com) |
| | HMDB-51 | A dataset with 51 human action categories and about 7K video clips. | HMDB51 (kaggle.com) |
| | Kinetics-400 | A large-scale dataset with 400 human action categories and over 240K videos. | kinetics Dataset (kaggle.com) |
| Medical Imaging | ImageNet-Medical | A dataset of medical images for classification and detection tasks. | |
| | Montgomery County Breast Images | A dataset of mammograms for breast cancer detection. | |
| | BraTS (Brain Tumor Segmentation Challenge) | A dataset for brain tumor segmentation. | |
| Remote Sensing | UC Merced Land Use | A dataset of aerial images for land use classification. | |
| | Sentinel-2 | A satellite imagery dataset with high spatial and spectral resolution. | |
| | Aerial Imagery Dataset (AID) | A dataset of aerial images for object detection and scene classification. | |
| Other Categories | Flickr30K | Sentence-based image descriptions. | Flickr Image dataset (kaggle.com) |
| | WikiArt | Collection of WikiArt images. | WikiArt (kaggle.com) |
| | KITTI | Object detection dataset. | kitti dataset (kaggle.com) |
| | Waymo Open Dataset | | Download – Waymo Open Dataset |

Afterword

Check out more articles in the Computer Vision Series

Computer Vision: An Overview

Computer Vision Pipeline