Computer Vision, Simplified!

Computer Vision (CV) is a field of study that develops tools and techniques to help computers “see” and understand the content of digital images: to classify, categorize and recognize what appears in camera feeds and in media files such as photographs and videos, often in real time.

As an established technology, computer vision is all about pattern recognition. A computer model is trained to understand visual data (supervised learning, a subset of machine learning) by feeding it a labeled image dataset, often called the training dataset: thousands, or preferably millions, of labeled images. Various AI/ML techniques, or algorithms, then hunt down the patterns in the elements that relate to those labels, so that the system can establish context for the new images it ‘sees’ in real-time inputs from camera feeds or media files. Context here means recognition of objects, their features and their activities.
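To make the idea of “learning patterns from a labeled dataset” concrete, here is a deliberately toy sketch: a 1-nearest-neighbour classifier over tiny feature vectors standing in for images. Every name and number below is a hypothetical illustration, not how a production vision system is built.

```python
# Toy supervised pattern recognition: classify a new sample by finding the
# closest example in a labeled "training dataset".

def distance(a, b):
    # Euclidean distance between two feature vectors
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def classify(sample, training_set):
    # Return the label of the nearest training example
    best = min(training_set, key=lambda item: distance(sample, item[0]))
    return best[1]

# "Training dataset": (feature vector, label) pairs; think of each vector
# as a crude color histogram extracted from a labeled image
training_set = [
    ([0.9, 0.1, 0.1], "cat"),
    ([0.8, 0.2, 0.1], "cat"),
    ([0.1, 0.1, 0.9], "dog"),
    ([0.2, 0.2, 0.8], "dog"),
]

print(classify([0.85, 0.15, 0.1], training_set))  # prints "cat"
```

Real systems replace the hand-made vectors with features learned from millions of images, but the principle is the same: labeled examples define the patterns, and new inputs are matched against them.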

The Internet has always contained both text and images, and while it was simple for anyone to search the textual content, it was difficult to search images unless the users who created and shared them had labeled them. Computer vision, applied to these images, lets users see inside an image and identify and recognize its contents: animals, inanimate objects, colors, and even text embedded in the image (via OCR, Optical Character Recognition). It can then build a logical connection between the content (what was shot or clicked) and the metadata (where and when it was created).

This is where computer vision separates itself from legacy image processing techniques. Image processing is limited to digital signal processing that enhances or transforms content to create new images; it does not understand the content or its context, because it has no ability to recognize patterns through machine learning. Computer vision, however, may use image processing as one of its methods to manipulate datasets when the use case demands it.

To break down the complexities of computer vision, the workflow is usually decomposed into several associated tasks. These are:

1. Object Classification: What is the broad category of objects in the image / frame?

2. Object Identification: Which type of a given object is it in the image / frame?

3. Object Verification: Is the object present in the given image / frame?

4. Object Detection: Where are the objects in the image / frame?

5. Object Activity: What is the activity of object/s happening within the image / frame?

6. Object Landmark Detection: What are the key points for the object in the image / frame?

7. Object Segmentation: What pixels belong to the object in the image / frame?

8. Object Recognition: Which objects are in this photograph and where are they?
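One way to see how these tasks differ is by the shape of their answers. The sketch below shows hypothetical result structures for several of the tasks above; the names and values are illustrative placeholders, not the API of any real library.

```python
# Hypothetical outputs for the tasks above, for one image containing a dog.

classification = {"label": "dog", "score": 0.97}              # 1/2: what is it?
verification = {"query": "dog", "present": True}              # 3: is it there?
detection = {"label": "dog", "box": (40, 60, 180, 220)}       # 4: where? (x1, y1, x2, y2)
landmarks = {"nose": (110, 140), "left_eye": (90, 100)}       # 6: key points
segmentation = {(x, y) for x in range(40, 181)                # 7: the object's pixels
                for y in range(60, 221)}

# 8: recognition combines "which objects" with "where they are"
recognition = [{"label": detection["label"], "box": detection["box"]}]

# Sanity check: the landmarks should fall inside the detected bounding box
x1, y1, x2, y2 = detection["box"]
assert all(x1 <= px <= x2 and y1 <= py <= y2 for px, py in landmarks.values())
```

The progression is from coarse to fine: classification gives one label per image, detection adds a box, landmarks add key points, and segmentation pins down every pixel.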

How does Computer Vision work?

  1. Image acquisition — the system acquires images from various sources and devices.
  2. Pre-processing — digitization encodes any analog content to digital so that further processing can proceed (I have detailed the concept for video processing here). Pre-processing for images includes smoothing and applying digital, linear and Gaussian filters; this stage prepares the content (noise reduction, contrast adjustment, image structure enhancements etc.) as a prerequisite for establishing contextual references in subsequent stages.
  3. Feature extraction — a crucial stage, since it handles the first pass of content discovery: detecting lines, ridges and edges (outlines of an object in 2D and 3D), along with shapes, textures and colors.
  4. Segmentation & detection — mark regions of interest in the image, find objects inside specified regions, detect multiple regions and multiple objects, group objects, build a spatial-taxon scene hierarchy, correlate objects, turn multiple feeds / frames into a series of per-frame foreground masks, and maintain temporal semantics for continuity.
  5. High-level processing — the next level of processing is applied to the acquired images / frames: model-based and application-specific assumptions, estimating object direction, size and posture within each frame / image, and detecting objects and correlating different views of them (image registration).
  6. Decision point — the final stage, where decisions are made based on the criteria applied by the computer vision application: pass/fail parameters, selection and identification (flag / fail), inspection of the identified objects, and recognition of objects based on checksums, tags and classifications. The results are returned, marked and sent for summarization.
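The stages above can be sketched end to end on a tiny synthetic grayscale “image” (a nested list of pixel intensities). Real pipelines use libraries such as OpenCV with far more sophisticated filters; the box blur, gradient operator and fixed threshold here are simplified stand-ins, chosen only to show how each stage feeds the next.

```python
# 1. Image acquisition: a 6x6 frame with a bright 3x3 "object" in the middle
frame = [[20] * 6 for _ in range(6)]
for y in range(2, 5):
    for x in range(2, 5):
        frame[y][x] = 200

# 2. Pre-processing: 3x3 box blur for noise reduction (stand-in for Gaussian)
def blur(img):
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            out[y][x] = sum(img[y + dy][x + dx]
                            for dy in (-1, 0, 1) for dx in (-1, 0, 1)) // 9
    return out

# 3. Feature extraction: simple gradient magnitude for edge detection
def edges(img):
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = img[y][x + 1] - img[y][x - 1]
            gy = img[y + 1][x] - img[y - 1][x]
            out[y][x] = abs(gx) + abs(gy)
    return out

# 4. Segmentation & detection: threshold the edge map into a foreground mask
def segment(img, threshold=40):
    return [(x, y) for y, row in enumerate(img)
            for x, v in enumerate(row) if v > threshold]

# 5-6. High-level processing and decision: bounding box plus a pass/fail check
mask = segment(edges(blur(frame)))
xs, ys = [p[0] for p in mask], [p[1] for p in mask]
box = (min(xs), min(ys), max(xs), max(ys))
decision = "object found" if mask else "no object"
print(box, decision)  # (1, 1, 4, 4) object found
```

Each function consumes the previous stage's output, which is exactly the shape of the real workflow: acquire, clean, extract, segment, then decide.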

The specific implementation of a computer vision ecosystem also depends on whether its functionality is pre-specified or dynamic, since some of its modules can be learned or modified at runtime. Deep learning and neural network systems typically run the criteria and stages above inside their hidden layers, arriving at results based on flags for whether an object is present, or the stage at which it is identified and cascaded for further evaluation or discovery.

Behind the Curtains…

Deep learning allows computational models of multiple processing layers to learn and represent data with multiple levels of abstraction mimicking how the brain perceives and understands multimodal information, thus implicitly capturing intricate structures of large‐scale data. Deep learning is a rich family of methods, encompassing neural networks, hierarchical probabilistic models, and a variety of unsupervised and supervised feature learning algorithms. The recent surge of interest in deep learning methods is due to the fact that they have been shown to outperform previous state-of-the-art techniques in several tasks, as well as the abundance of complex data from different sources (e.g., visual, audio, medical, social, and sensor).

A Convolutional Neural Network, also called a ConvNet or CNN, is an algorithm that can take a single frame or image, assign importance (learnable weights and biases) to various aspects or objects in the image, and differentiate one from another. The architecture of a ConvNet is analogous to the neuron connectivity in our brain and was inspired by the organization of the visual cortex. In the visual cortex, each individual neuron responds to stimuli only in an identified region of the visual field known as its receptive field. Large numbers of such fields overlap to cover the entire visual area, and our brain processes the spatial and temporal dependencies across the receptive fields as a single frame, abstracting what we see so that we identify or recognize objects as a unified process.


The ConvNet is designed on a similar principle: it applies multiple layers of filters to extract high-level features of objects, such as edges, shapes, outlines and orientation, from the input frame, deploying multiple convolutional layers. With each added layer the architecture builds higher-level feature maps and applies flattening techniques, gaining a holistic understanding of the images in the dataset, similar to how we humans would.
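The core operation of a convolutional layer can be sketched in a few lines: a small filter (kernel) slides over the image, and each output value is the weighted sum of the pixels under it. The kernel below is a hand-picked vertical-edge detector for illustration; in a real CNN the kernel weights are learned from data rather than chosen by hand.

```python
# Minimal 2D convolution (valid mode, no padding), as used in a conv layer.

def convolve2d(image, kernel):
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for y in range(out_h):
        row = []
        for x in range(out_w):
            # Weighted sum of the pixels under the kernel window
            row.append(sum(image[y + i][x + j] * kernel[i][j]
                           for i in range(kh) for j in range(kw)))
        out.append(row)
    return out

# A 4x6 image with a sharp vertical boundary between dark (0) and bright (9)
image = [[0, 0, 0, 9, 9, 9] for _ in range(4)]

# Sobel-style vertical-edge kernel (hand-picked; a CNN would learn this)
kernel = [[-1, 0, 1],
          [-2, 0, 2],
          [-1, 0, 1]]

feature_map = convolve2d(image, kernel)
print(feature_map[0])  # [0, 36, 36, 0] -> strongest response at the boundary
```

Stacking many such filters, each producing its own feature map, and feeding those maps into further layers is what lets a ConvNet progress from edges to shapes to whole objects.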

A few use cases of Computer Vision…

Surveillance and Safety


Facial recognition technology has seen broad adoption, from retail stores, building access management systems and mobile devices to airport immigration, where it identifies known and unknown faces and locates or alerts on persons of interest (POI), helping track down notorious anti-social elements and establishing real-time monitoring for security agencies and bodies. Facial recognition also helps validate identities at ATMs, identify missing persons, track school attendance, and recognize frequent customers at stores or friends and family members arriving at homes. Millions of mobile users now literally use face recognition in the palms of their hands, protecting the data and personal information on their devices through secure face authentication algorithms built into them.


The application of computer vision is becoming a reality for the future of the automotive industry, propelling the Industry 4.0 agenda of smart factories. The industry increasingly relies on collaborative robots and autonomous forklifts, now an integral part of the production process, through quick, automated analysis with 3D machine vision methods that detect the direction of robot and human movements for seamless, flexible and secure interaction. Further, generating accurate and diverse annotations on datasets to train, validate and test algorithms for autonomous vehicles and self-driving cars (semantic segmentation, object and motion detection etc.) will take computer vision deployments onto the road and expand its canvas in the automotive sector.



Computer vision is becoming an integral part of agriculture: farming aids, crop coverage, yield mapping and estimation, disease detection, and harvesting using multitemporal remote-sensing imagery. It spans outdoor ground and weather conditions, food processing, and granular quality assessment, down to the growth of individual plants and fruits, visible infestations, and weed detection.

The effective use of computer vision in farming helps prevent and control crop diseases, insect pests and weeds, which are key steps in producing high-quality, pollution-free agricultural products and achieving high yields.


AI-based Diagnostics

Applying these algorithms to an image can detect the most subtle cancerous or precancerous patterns within seconds, offering a great supplementary resource for doctors. In surgical interventions, for instance, a specialized image-processing model can calibrate, orient and navigate input images to improve visualization and guide surgical movements during orthopedic procedures. Such vision systems can also process, correct and calibrate images of the operating room, the patient's body and the surgical tools; create a magnified 3D image of all three; and overlay these layers into a single view that lets the robot track its own position and those of the surgical tools, so it can make accurate movements. This enables next-generation robotic surgery with precision and real-time cognitive context.


Shelf Analytics via Computer Vision

A kiosk can use computer vision, 3D reconstruction and deep learning to scan several items at once without barcodes, while also supporting people counting, facial recognition and sentiment analytics. Computer vision can also increase sales by digitizing the shelf: using CCTV footage and computer vision algorithms, a retail store can be transposed into a digital version of its merchandising efforts, providing accurate reports of shoppers' interactions with products, including how many times an item is picked up and put back, and determining how much shelf space competitors occupy, all in service of more effective merchandise displays and sales enablement.

The list can go on and on, since the inclusion of computer vision in our day-to-day lives has created an enormous value proposition for its beneficiaries.

In summary: although computer vision has the potential to change business workflows and outcomes, the technology still has a hard time with small visual variations, e.g. misclassifying something like an ostrich when just a little noise is added to the original image. Better and more varied test datasets, tuned algorithms, cognitive layers of criteria: each element of the domain needs further improvement. Studying biological vision requires an understanding of the perception organs, such as the eyes, as well as the interpretation of that perception within the brain. Much progress has been made, both in charting the process and in discovering the tricks and shortcuts the system uses, although, as with any study involving the brain, there is a long way to go. As the technology gradually improves, though, it will penetrate more domains and help professionals make informed, aided decisions faster and more safely, while giving customers and users more efficient and powerful experiences. What say?


Mar 2020. Compilation from various publicly available internet sources, authors views are personal.

Suggested reading / references

Originally published at