Computer Vision: An Overview

I create cross platform mobile apps with AI functionalities. Currently a PhD Scholar at Indira Gandhi Delhi Technical University for Women, Delhi. M.Tech in Artificial Intelligence (AI).
Computer vision, a field of artificial intelligence, involves the development of algorithms and models that allow machines to understand and process images and videos similarly to how humans do. This includes tasks such as object detection, image classification, image segmentation, and facial recognition.
Prerequisite: Basic machine learning concepts
Computer Vision Pipeline

Step 1: Image/Video Acquisition
This is the first step in the pipeline, where images or videos are captured using cameras or other sensors. The quality of the images is crucial for the accuracy of the subsequent steps.
Alternatively, you can use existing datasets. Some of the most common datasets for computer vision tasks are listed in the datasets section near the end of this article.
Step 2: Preprocessing
In this stage, the raw image/video data is prepared for further analysis. This may involve tasks such as:
Noise reduction: Removing unwanted noise or artifacts from the images.
Geometric transformations: Correcting distortions or perspective shifts.
Normalization: Ensuring that all images have a consistent format and scale.
These preprocessing steps make the images easier to work with and analyze.
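As a rough illustration of two of the steps above, the sketch below applies a 3x3 mean filter for noise reduction and then scales pixel values to the range [0, 1]. It assumes a grayscale 8-bit image stored as a 2-D NumPy array; the function names `mean_filter_3x3`, `normalize`, and `preprocess` are made up for this example, and a real pipeline would more likely rely on a library such as OpenCV.

```python
import numpy as np


def mean_filter_3x3(image: np.ndarray) -> np.ndarray:
    """Reduce noise by replacing each pixel with the mean of its 3x3 neighbourhood."""
    # Repeat the border pixels so the output keeps the input size.
    padded = np.pad(image.astype(np.float64), 1, mode="edge")
    height, width = image.shape
    total = np.zeros((height, width), dtype=np.float64)
    # Sum the nine shifted copies of the image, one per neighbourhood offset.
    for dy in range(3):
        for dx in range(3):
            total += padded[dy:dy + height, dx:dx + width]
    return total / 9.0


def normalize(image: np.ndarray) -> np.ndarray:
    """Scale 8-bit pixel values (0-255) to floats in [0, 1]."""
    return image.astype(np.float64) / 255.0


def preprocess(image: np.ndarray) -> np.ndarray:
    """Hypothetical preprocessing chain: denoise first, then normalize."""
    return normalize(mean_filter_3x3(image))
```

Denoising before normalizing is only one possible order; the right choice depends on the model and the data.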
Step 3: Feature Extraction
This stage involves identifying and extracting relevant features from the image that can be used for further analysis or classification. Some common features include:
Edges: Boundaries between regions with different intensities.
Corners: Points where edges intersect.
Textures: Patterns or structures within an image.
Descriptors: Numerical representations of image features
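To make the idea of an edge feature concrete, here is a minimal sketch that computes the gradient magnitude of a grayscale image with the Sobel operator, one common way to find edges. The helper names `filter_3x3` and `edge_magnitude` are assumptions for this example, and the image is assumed to be a 2-D NumPy array.

```python
import numpy as np

# Sobel kernels: horizontal and vertical intensity gradients.
SOBEL_X = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=np.float64)
SOBEL_Y = SOBEL_X.T


def filter_3x3(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Cross-correlate a 2-D image with a 3x3 kernel, keeping the input size."""
    padded = np.pad(image.astype(np.float64), 1, mode="edge")
    height, width = image.shape
    out = np.zeros((height, width), dtype=np.float64)
    for dy in range(3):
        for dx in range(3):
            out += kernel[dy, dx] * padded[dy:dy + height, dx:dx + width]
    return out


def edge_magnitude(image: np.ndarray) -> np.ndarray:
    """Return the gradient magnitude; large values mark intensity boundaries."""
    gx = filter_3x3(image, SOBEL_X)
    gy = filter_3x3(image, SOBEL_Y)
    return np.hypot(gx, gy)
```

Thresholding the result gives a simple edge map, and descriptors such as HOG are built from gradients of this kind.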
Step 4: Prediction/Recognition Model
The machine learning model you build depends on the task you are trying to solve.
For object detection, this involves identifying the location and boundaries of objects within an image. This can be achieved using techniques such as:
Sliding window: A window is moved across the image, and a classifier is applied at each location to determine if an object is present.
Region-based methods: Regions of interest are identified and classified.
Deep learning-based methods: Convolutional neural networks (CNNs) are trained to directly detect objects in images.
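The sliding window approach can be sketched in a few lines. The example below assumes the image is a list of pixel rows and that `classifier` is any function returning a score between 0 and 1 for a crop. The names `sliding_windows`, `detect`, and the toy `mean_brightness` classifier are hypothetical and stand in for a trained model.

```python
def sliding_windows(image_width, image_height, window_size, stride):
    """Yield (x, y, w, h) boxes, left to right and top to bottom."""
    win_w, win_h = window_size
    for y in range(0, image_height - win_h + 1, stride):
        for x in range(0, image_width - win_w + 1, stride):
            yield (x, y, win_w, win_h)


def detect(image, classifier, window_size, stride, threshold=0.5):
    """Score every window and keep those at or above the threshold."""
    height = len(image)
    width = len(image[0]) if image else 0
    detections = []
    for (x, y, w, h) in sliding_windows(width, height, window_size, stride):
        crop = [row[x:x + w] for row in image[y:y + h]]
        score = classifier(crop)
        if score >= threshold:
            detections.append(((x, y, w, h), score))
    return detections


def mean_brightness(crop):
    """Toy stand-in for a trained classifier: average pixel value in [0, 1]."""
    pixels = [value for row in crop for value in row]
    return sum(pixels) / (255.0 * len(pixels))
```

In practice the window is usually applied at several scales, which makes this approach expensive; that cost is one reason region-based and CNN-based detectors took over.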
For object recognition and classification, each detected object is assigned a label or category. This can be done using:
Traditional machine learning algorithms: Such as support vector machines (SVMs) or random forests.
Deep learning models: Such as CNNs or recurrent neural networks (RNNs).
Some post-processing may be needed to refine the results and prepare them for further use. This may involve tasks such as:
Non-maximum suppression: Removing duplicate detections.
Tracking: Following objects over time in videos.
Visualization: Displaying the results in a human-understandable format.
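Non-maximum suppression is the most algorithmic of these steps, so a minimal greedy version is sketched here. It assumes boxes are given as `(x1, y1, x2, y2)` corner coordinates with one confidence score per box, and it uses Intersection over Union (IoU) to decide when two detections are duplicates. The function names and the default threshold of 0.5 are choices made for this example.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    intersection = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return indices of boxes to keep, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with a kept one.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep
```

For example, with boxes `[(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]` and scores `[0.9, 0.8, 0.7]`, the second box overlaps the first with an IoU of about 0.68 and is dropped, so the function returns `[0, 2]`.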
Step 5: Model Evaluation
Evaluating a computer vision model is crucial to ensure its performance and reliability in real-world applications. Common metrics and tools include:
Accuracy: The proportion of correct predictions.
Precision: The proportion of true positive predictions among all positive predictions.
Recall: The proportion of true positive predictions among all actual positive instances.
F1-score: The harmonic mean of precision and recall.
Mean Average Precision (mAP): A common metric for object detection tasks.
Intersection over Union (IoU): Measures the overlap between predicted and ground truth bounding boxes.
Confusion Matrix: A table that summarizes the performance of a classification model.
ROC Curve: A graphical plot that illustrates the performance of a binary classifier.
Precision-Recall Curve: A graphical plot that shows the trade-off between precision and recall.
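The classification metrics above can be computed directly from the four cells of a binary confusion matrix. The sketch below uses the standard definitions, precision = TP / (TP + FP) and recall = TP / (TP + FN), and assumes labels are given as two equal-length lists. The function names and the convention of returning 0.0 when a denominator is zero are choices made for this example.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true positives, false positives, false negatives, true negatives."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall and F1 for a binary task."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```

Libraries such as scikit-learn provide tested versions of these metrics; the point here is only to show how each number follows from the confusion matrix.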
Step 6: Model Deployment
Once a model has been evaluated and deemed satisfactory, it can be deployed into a production environment.
Models can be deployed on cloud platforms such as AWS, GCP, and Azure, which offer a range of services for deploying machine learning models.
Computer Vision Research Areas

Research Areas based on Task
Image Processing
Image enhancement (noise reduction, sharpening, contrast adjustment)
Image restoration (deblurring, super-resolution)
Image segmentation (object/region identification)
Image registration (aligning images)
Object Detection and Recognition
Object detection (locating and classifying objects in images)
Object tracking (following objects in videos)
Facial recognition (identifying and verifying individuals)
Action recognition (understanding human or animal actions)
Gesture recognition (interpreting hand and body movements)
Image and Video Understanding
Image captioning (generating text descriptions for images)
Visual question answering (answering questions about images)
Video summarization (creating concise representations of videos)
Video action recognition (understanding actions in videos)
Image and Video Generation
Image generation (creating new images)
Image editing (modifying existing images)
Video generation (creating new videos)
Style transfer (transferring style from one image to another)
Research Areas based on Application
Medical Image Analysis
Medical image segmentation
Disease detection (cancer, Alzheimer's, etc.)
Medical image registration
Computer-aided diagnosis
Autonomous Vehicles
Object detection and tracking
Lane detection
Pedestrian detection
Traffic sign recognition
Robotics
Visual servoing
Object manipulation
Simultaneous Localization and Mapping (SLAM)
Human-robot interaction
Surveillance
Person re-identification
Anomaly detection
Crowd analysis
Augmented Reality
Object tracking and recognition
3D reconstruction
Image registration
Research Areas based on Techniques
Deep Learning
Convolutional Neural Networks (CNNs)
Recurrent Neural Networks (RNNs)
Generative Adversarial Networks (GANs)
Transformers
Self-supervised learning
Reinforcement learning
Traditional Computer Vision
Feature extraction (SIFT, SURF, HOG)
Image matching
Stereo vision
Structure from motion
Emerging Research Areas
Explainable AI
Adversarial Attacks and Defenses
Few-shot Learning
Unsupervised and self-supervised learning
Video Understanding
3D Vision and Augmented Reality
Large Scale Multi-modal ML Models
Some Famous CV Datasets
| Category | Dataset | Details | Link |
| --- | --- | --- | --- |
| Image Classification | CIFAR-10/100 | Contains 32x32 color images in 10 (CIFAR-10) or 100 (CIFAR-100) categories. | CIFAR-10 and CIFAR-100 datasets (toronto.edu) |
| | ImageNet | A large-scale dataset with over 14 million images across roughly 20,000 categories. | [ImageNet Object Localization Challenge (Kaggle)](https://www.kaggle.com/c/imagenet-object-localization-challenge/overview/description) |
| | MNIST | A handwritten digits dataset commonly used for testing machine learning algorithms. | MNIST Dataset (kaggle.com) |
| Object Detection | PASCAL VOC | A dataset with annotated images for object detection, segmentation, and person layout. | PASCAL VOC 2012 DATASET (kaggle.com) |
| | COCO (Common Objects in Context) | A large-scale dataset with object instances, person keypoints, stuff segmentation, and captions. | COCO - Common Objects in Context (cocodataset.org) |
| | Open Images | A dataset with over 9 million images and bounding box annotations for object detection. | Open Images Dataset V7 (storage.googleapis.com) |
| Semantic Segmentation | Cityscapes | A dataset for urban scene understanding with pixel-level semantic labels. | Cityscapes Dataset – Semantic Understanding of Urban Street Scenes (cityscapes-dataset.com) |
| | ADE20K | A large-scale dataset with dense pixel-level annotations for a variety of scenes. | ADE20K dataset (mit.edu) |
| | CamVid | A dataset with video sequences annotated with pixel-level semantic labels. | CamVid (Cambridge-Driving Labeled Video Database) (kaggle.com) |
| Facial Recognition | Labeled Faces in the Wild (LFW) | A dataset of unconstrained face images used for face verification and recognition. | Labelled Faces in the Wild (LFW) Dataset (kaggle.com) |
| | MegaFace | A large-scale face recognition benchmark with about 1 million images of 690K identities. | MegaFace (washington.edu) |
| | MS-Celeb-1M | A large-scale dataset of celebrity faces with about 10M images of roughly 100K identities. | |
| Action Recognition | UCF101 | A dataset with 101 human action categories and over 13K videos. | UCF101 - Action Recognition (kaggle.com) |
| | HMDB-51 | A dataset with 51 human action categories and about 7K videos. | HMDB51 (kaggle.com) |
| | Kinetics-400 | A large-scale dataset with 400 human action categories and over 240K videos. | kinetics Dataset (kaggle.com) |
| Medical Imaging | ImageNet-Medical | A dataset of medical images for classification and detection tasks. | |
| | Montgomery County Breast Images | A dataset of mammograms for breast cancer detection. | |
| | BraTS (Brain Tumor Segmentation Challenge) | A dataset for brain tumor segmentation. | |
| Remote Sensing | UC Merced Land Use | A dataset of aerial images for land use classification. | |
| | Sentinel-2 | A satellite imagery dataset with high spatial and spectral resolution. | |
| | Aerial Imagery Dataset (AID) | A dataset of aerial images for object detection and scene classification. | |
| Other Categories | Flickr30K | Sentence-based image descriptions. | Flickr Image dataset (kaggle.com) |
| | WikiArt | A collection of WikiArt images. | WikiArt (kaggle.com) |
| | KITTI | An object detection dataset. | kitti dataset (kaggle.com) |
| | Waymo Open Dataset | | Download – Waymo Open Dataset |
Afterword
Check out more articles in the Computer Vision Series




