Computer Vision, Simplified!

11 min read · Dec 11, 2020

Computer Vision (CV) is a field of study that seeks to develop tools and techniques that help computers “see”: to understand, classify and categorize (recognize could be the right word here) the content of digital images, such as camera feeds and media files like photographs or videos, including in real time.

As an established technology, Computer Vision is all about pattern recognition. A computer model is trained to understand visual data (supervised learning, a subset of machine learning) by feeding it an image dataset: lots of labeled images, thousands or millions if possible, often called the training dataset. This dataset is then subjected to various AI/ML techniques, or algorithms, that allow the system to hunt down patterns in all the elements that relate to those labels and to establish a context for the new images it ‘sees’ in real-time inputs from camera feeds or media files. Context here means the recognition of objects, features and activities.
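To make the idea of learning from labeled images concrete, here is a minimal sketch in plain Python. It assumes a deliberately simple technique, a nearest-centroid classifier, rather than any specific algorithm named in this article, and the 2x2 "images" and the labels "dark" and "bright" are made-up toy data.

```python
# Minimal supervised-learning sketch: a nearest-centroid classifier.
# The 2x2 "images" and the labels "dark"/"bright" are made-up toy data.

def flatten(image):
    """Turn a 2D grid of pixel intensities into a flat feature vector."""
    return [pixel for row in image for pixel in row]

def train(labeled_images):
    """Average the feature vectors of each label into one centroid per label."""
    sums, counts = {}, {}
    for image, label in labeled_images:
        features = flatten(image)
        if label not in sums:
            sums[label] = [0.0] * len(features)
            counts[label] = 0
        sums[label] = [s + f for s, f in zip(sums[label], features)]
        counts[label] += 1
    return {label: [s / counts[label] for s in sums[label]] for label in sums}

def predict(centroids, image):
    """Return the label whose centroid is closest to the new image."""
    features = flatten(image)

    def squared_distance(centroid):
        return sum((f - c) ** 2 for f, c in zip(features, centroid))

    return min(centroids, key=lambda label: squared_distance(centroids[label]))

# The labeled training dataset: (image, label) pairs.
training_set = [
    ([[10, 20], [15, 5]], "dark"),
    ([[30, 25], [20, 35]], "dark"),
    ([[220, 240], [230, 250]], "bright"),
    ([[200, 210], [245, 235]], "bright"),
]
model = train(training_set)

# New, unseen images are classified by comparing them with the learned patterns.
print(predict(model, [[12, 18], [25, 30]]))
print(predict(model, [[215, 225], [240, 205]]))
```

Real systems replace the averaged pixel values with far richer learned features, but the workflow is the same: labeled examples in, a model out, and predictions on images the model has never seen.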

While the Internet has both text and images, it has always been simple to search the textual content but difficult to search the images, unless they were labeled by the users who created and shared them. Computer vision, applied to these images, lets users look inside an image and identify and recognize what it contains, such as animals, objects (non-living things), colors and even text embedded in the image (OCR, optical character recognition). It also helps create a logical connection, or context, between the content (what was shot) and the metadata (where and when it was created).

This is where computer vision separates itself from legacy image processing techniques. Image processing is limited to digital signal processing that enhances or transforms content to create new images. It does not understand the content or its context, since it has no ability to recognize patterns through artificial intelligence and machine learning. Computer vision, however, may use image processing as one of its methods to manipulate datasets as and when the use case demands.

To untangle the complexity of computer vision, it helps to look at the individual tasks that are integrated into the workflow. These are…

1. Object Classification: What is the broad category of objects in the image / frame?

2. Object Identification: Which specific type of object is in the image / frame?

3. Object Verification: Is the object present in the given image / frame?

4. Object Detection: Where are the objects in the image / frame?

5. Object Activity: What activity are the objects in the image / frame engaged in?

6. Object Landmark Detection: What are the key points for the object in the image / frame?

7. Object Segmentation: What pixels belong to the object in the image / frame?

8. Object Recognition: Which objects are in this photograph and where are they?
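A few of the tasks in this list can be told apart by what they return for the same frame. The sketch below is a toy illustration only: the 5x6 frame, the brightness threshold of 128 and the "wide" / "tall" categories are invented for the example, and real systems use learned models rather than simple thresholds.

```python
# Toy illustration of several tasks on one 5x6 grayscale frame.
# The frame, the threshold of 128 and the "wide"/"tall" labels are invented.

FRAME = [
    [0,   0,   0,   0,   0, 0],
    [0, 200, 210, 220, 205, 0],
    [0, 215, 230, 225, 210, 0],
    [0,   0,   0,   0,   0, 0],
    [0,   0,   0,   0,   0, 0],
]

def segment(frame, threshold=128):
    """Segmentation: which pixels belong to the object? Returns a 0/1 mask."""
    return [[1 if pixel > threshold else 0 for pixel in row] for row in frame]

def object_pixels(mask):
    """Helper: list the (row, col) coordinates of all object pixels."""
    return [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]

def verify(mask):
    """Verification: is an object present in the frame at all?"""
    return len(object_pixels(mask)) > 0

def detect(mask):
    """Detection: where is the object? Returns (top, left, bottom, right) or None."""
    coords = object_pixels(mask)
    if not coords:
        return None
    rows = [r for r, _ in coords]
    cols = [c for _, c in coords]
    return (min(rows), min(cols), max(rows), max(cols))

def landmark(mask):
    """Landmark detection: one key point, here the centroid (row, col), or None."""
    coords = object_pixels(mask)
    if not coords:
        return None
    return (sum(r for r, _ in coords) / len(coords),
            sum(c for _, c in coords) / len(coords))

def classify(box):
    """Classification: a crude category based on the bounding-box shape."""
    top, left, bottom, right = box
    height, width = bottom - top + 1, right - left + 1
    return "wide" if width > height else "tall"

mask = segment(FRAME)
box = detect(mask)
print("present:", verify(mask))
print("bounding box:", box)
print("key point:", landmark(mask))
print("category:", classify(box))
```

The same pixels answer different questions: a mask for segmentation, a box for detection, a point for landmarks and a label for classification.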

How does Computer Vision work?

To achieve a contextual understanding of the content, computer vision deploys multiple tools and techniques in specific operational stages, each with associated tasks, to deliver the outcome. These stages are broadly categorized as:

Image acquisition: the stage in which the system acquires images from various sources and devices.
Pre-processing: digitization encodes any analog content to digital so that further processing is possible. I have detailed the concept for video processing here. For images, pre-processing applies smoothing and digital, linear and Gaussian filters, etc. This stage prepares the content (noise reduction, contrast, image structure enhancement, etc.) as a prerequisite for establishing contextual references in subsequent stages.
Feature extraction: a crucial stage, since it is the first stage of content discovery, covering lines, ridges and edges (outlines of an object in 2D and 3D, etc.), shapes, textures and colors.
Segmentation & detection: mark regions of interest in the image, find objects inside specified regions, detect multiple regions and multiple objects, group objects, build a spatial-taxon scene hierarchy, correlate objects, turn multiple feeds / frames into a series of per-frame foreground masks, maintain temporal semantics for continuity, etc.
High-level processing: the next level of processing applied to the acquired images / frames, using model-based and application-specific assumptions. It covers estimating object direction, size and posture within each frame / image, detecting objects, and correlating different views of the same object (image registration), etc.
Decision point: the final stage, where decisions are made based on the criteria applied by the computer vision application, such as pass/fail parameters, selection and identification (flag / fail), inspection of the objects of interest, and recognition of objects based on checksums, tags and classifications. The results are returned, marked and sent for summarization.
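The stages above can be sketched end to end on a tiny synthetic frame. This is a toy walk-through under stated assumptions, not a production pipeline: the 8x8 frame, the 3x3 mean filter (standing in for the smoothing filters mentioned above), the threshold of 100, the minimum area of 8 pixels and all function names are choices made for this example.

```python
# Toy walk-through of the stages on a synthetic 8x8 grayscale frame.
# All values, thresholds and function names are assumptions made for this sketch.

def acquire():
    """Image acquisition: a synthetic frame stands in for a camera or media file."""
    frame = [[10] * 8 for _ in range(8)]
    for r in range(2, 6):
        for c in range(2, 6):
            frame[r][c] = 200      # a bright 4x4 "object" on a dark background
    frame[0][7] = 255              # one isolated noisy pixel
    return frame

def preprocess(frame):
    """Pre-processing: 3x3 mean filter, a simple linear smoothing filter."""
    height, width = len(frame), len(frame[0])
    smoothed = []
    for r in range(height):
        row = []
        for c in range(width):
            window = [frame[rr][cc]
                      for rr in range(max(0, r - 1), min(height, r + 2))
                      for cc in range(max(0, c - 1), min(width, c + 2))]
            row.append(sum(window) / len(window))
        smoothed.append(row)
    return smoothed

def extract_edges(frame):
    """Feature extraction: horizontal intensity differences as a crude edge map."""
    return [[abs(row[c + 1] - row[c]) for c in range(len(row) - 1)] for row in frame]

def segment(frame, threshold=100):
    """Segmentation: mark pixels brighter than the threshold as foreground."""
    return [[1 if pixel > threshold else 0 for pixel in row] for row in frame]

def measure(mask):
    """High-level processing: estimate object area and bounding box from the mask."""
    coords = [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]
    if not coords:
        return 0, None
    rows = [r for r, _ in coords]
    cols = [c for _, c in coords]
    return len(coords), (min(rows), min(cols), max(rows), max(cols))

def decide(area, min_area=8):
    """Decision point: pass if the detected object is large enough."""
    return "pass" if area >= min_area else "fail"

raw = acquire()
smoothed = preprocess(raw)
edges = extract_edges(smoothed)
mask = segment(smoothed)
area, box = measure(mask)
print("strongest edge response:", round(max(max(row) for row in edges), 1))
print("object area:", area, "bounding box:", box)
print("decision:", decide(area))
```

Segmenting the raw frame directly would also pick up the isolated noisy pixel and stretch the bounding box to the corner of the frame. After smoothing, that pixel falls below the threshold, which is exactly why pre-processing sits before the later stages.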

The specific implementation of a computer vision ecosystem also depends on whether its functionality is pre-specified or whether some of its modules can be learned or modified dynamically at runtime. Typically, deep learning and neural networks run the criteria and the stages above in their hidden layers, arriving at results based on flags that indicate whether an object is present or the stage at which it was identified, and cascading them for further evaluation or discovery.

Behind the Curtains…

Computer vision is driven by multiple technologies. Recent developments in machine learning have made it the preferred choice for the runtime iterations and multiple layers of criteria that need to be applied to real-time images / feeds. The key technologies in the AI/ML domain revolve around deep learning, which provides a fundamentally different approach to machine learning. A subset of machine learning, deep learning relies on neural networks: general-purpose function approximators that, given enough examples, can extract the common patterns among those examples and turn them into a mathematical function that helps classify future pieces of information.
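The idea of turning examples into a mathematical function can be shown with the smallest possible network, a single artificial neuron trained with the classic perceptron rule. This is a sketch under assumptions: the two input features (a hypothetical brightness and edge density per frame) and all the numbers are toy values, and real deep learning stacks many such units and trains them with gradient-based methods.

```python
# A single artificial neuron (perceptron) learning a decision rule from examples.
# The two features and all numbers are hypothetical toy values.

def predict(weights, bias, features):
    """Fire (1) if the weighted sum of the inputs is positive, else 0."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if score > 0 else 0

def train(examples, epochs=20, learning_rate=1.0):
    """Perceptron rule: nudge the weights toward each misclassified example."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for features, label in examples:
            error = label - predict(weights, bias, features)
            weights = [w + learning_rate * error * x
                       for w, x in zip(weights, features)]
            bias += learning_rate * error
    return weights, bias

# (brightness, edge_density) -> 1 if the frame contains the object, else 0
examples = [
    ((0.1, 0.2), 0),
    ((0.2, 0.1), 0),
    ((0.8, 0.9), 1),
    ((0.9, 0.7), 1),
]
weights, bias = train(examples)

# The learned weights and bias are the "equation" used on future inputs.
print(predict(weights, bias, (0.15, 0.15)))
print(predict(weights, bias, (0.85, 0.80)))
```

After training, nothing from the dataset is needed any more. The learned weights and bias alone decide how a new input is classified.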

Deep learning allows computational models of multiple processing layers to learn and represent data with multiple levels of abstraction mimicking how the brain perceives and understands multimodal information, thus implicitly capturing intricate structures of large‐scale data. Deep learning is a rich family of methods, encompassing neural networks, hierarchical probabilistic models, and a variety of unsupervised and supervised feature learning algorithms. The recent surge of interest in deep learning methods is due to the fact that they have been shown to outperform previous state-of-the-art techniques in several tasks, as well as the abundance of complex data from different sources (e.g., visual, audio, medical, social, and sensor).

A convolutional neural network, also called a ConvNet or CNN, is an algorithm that can read a single frame or image, assign importance (learnable weights and biases) to various aspects / objects in the image and differentiate one from another. The architecture of a ConvNet is analogous to the connectivity pattern of neurons in our brain and was inspired by the organization of the visual cortex. In the visual cortex, each individual neuron responds to stimuli only in a restricted region of the visual field known as the receptive field. A large collection of such fields overlaps to cover the entire visual area, and our brain processes the spatial and temporal dependencies in the receptive fields as a single frame, abstracting the learning from what we see so that we identify or recognize objects as a unified process.

The ConvNet is designed on a similar principle: it stacks multiple convolutional layers, each applying relevant filters to the input frame to extract features of objects such as edges, shapes, outlines and orientation. Early layers tend to capture low-level features such as edges, and with each added layer the architecture builds higher-level feature maps and, with flattening techniques, gains a wholesome understanding of the images in the dataset, similar to how we humans would.
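What a single convolutional layer does to a frame can be written out by hand. The sketch below assumes a fixed 3x3 vertical-edge filter (in a real ConvNet the filter values are learned during training) and adds a ReLU activation and a 2x2 max-pooling step, which are common companions of convolution layers but are my additions here, before flattening the result.

```python
# One convolution -> ReLU -> max-pooling -> flatten pass, written out by hand.
# The kernel is fixed for illustration; in a real ConvNet its values are learned.

# 6x6 input: dark on the left, bright on the right, so it has one vertical edge.
IMAGE = [[0, 0, 0, 0, 1, 1] for _ in range(6)]

VERTICAL_EDGE = [[-1, 0, 1],
                 [-1, 0, 1],
                 [-1, 0, 1]]

def convolve(image, kernel):
    """Slide the kernel over the image (no padding, stride 1) and sum the products."""
    k = len(kernel)
    out_h, out_w = len(image) - k + 1, len(image[0]) - k + 1
    return [[sum(kernel[i][j] * image[r + i][c + j]
                 for i in range(k) for j in range(k))
             for c in range(out_w)]
            for r in range(out_h)]

def relu(feature_map):
    """Activation: keep positive responses, zero out negative ones."""
    return [[max(0, v) for v in row] for row in feature_map]

def max_pool(feature_map, size=2):
    """Pooling: keep the strongest response in each size x size block."""
    return [[max(feature_map[r + i][c + j]
                 for i in range(size) for j in range(size))
             for c in range(0, len(feature_map[0]) - size + 1, size)]
            for r in range(0, len(feature_map) - size + 1, size)]

def flatten(feature_map):
    """Flattening: turn the 2D map into a vector for a later classifier layer."""
    return [v for row in feature_map for v in row]

feature_map = relu(convolve(IMAGE, VERTICAL_EDGE))
pooled = max_pool(feature_map)
vector = flatten(pooled)
print(feature_map[0])
print(pooled)
print(vector)
```

The feature map responds only where the brightness changes from left to right, so the filter acts as an edge detector. Stacking more layers on top of maps like this one is how the network moves from edges toward shapes and whole objects.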

A few use cases of Computer Vision …

Computer vision has made remarkable progress in multiple use cases across verticals, and the applications we can witness today have shown tremendous advantages. Although many domains leverage computer vision under the umbrella of artificial intelligence, let us look at a select few for the time being.

Surveillance and Safety

Public safety is a main agenda for any government, enterprise or utility / service provider managing large, crowded public places. Surveillance refers to the process of focusing systematic and routine attention on certain human behaviours for the purpose of influencing, managing, protecting or directing. With computer vision, systems for public surveillance have developed significantly in terms of object detection, tracking, classification and behavior analysis. Further, optical character recognition for ANPR (automated number plate recognition) and facial recognition algorithms are able to detect and recognize a particular person or vehicle. This improves the accuracy and reliability of monitoring real-time situations and the vehicles / persons involved in incidents, supports traffic monitoring for vehicle count and flow, facilitates better mobility management, and helps gather evidence of traffic violations, crowd behavior, etc. for law enforcement officials.

Facial recognition technology, however, has seen much wider adoption, from retail stores, building access management systems and mobile devices to airport immigration, where it identifies and recognizes known and unknown faces and locates / alerts on a POI (person of interest). These use cases help locate notorious anti-social elements and establish real-time monitoring for security agencies and bodies. Facial recognition also helps validate identities at ATMs, identify missing persons, track school attendance, recognize frequent customers at stores, or announce friends and family members on their arrival at home. Millions of mobile users now literally hold face recognition technology in the palms of their hands, protecting the data and personal information on their devices thanks to the secure face authentication algorithms built into them.

Automotive

Applications start with identifying manufacturing defects through surface validation, defect tracking and visual quality assurance of components in large-scale production, since digital image processing is the basis for seamless monitoring and control of industrial production processes. They extend to virtual reality layers on top of computer vision (earlier also referred to as machine vision) for shop floor / assembly line training, and to built-in cameras that read road semantics for the self-driving / driver-assist features of vehicles, along with simulations for road safety.

Computer vision is becoming a reality for the future of the automotive industry, propelling the Industry 4.0 agenda of smart factories. The industry increasingly relies on collaborative robots and autonomous forklifts, which are becoming an integral part of the production process through quick, automated analysis using 3D-based machine vision methods that detect the direction of robot and human movements for seamless, flexible and secure interactions. Further, generating accurate and diverse annotations on datasets to train, validate and test algorithms for autonomous vehicles and self-driving cars, via semantic segmentation, object and motion detection, etc., will take computer vision deployments into the field and expand the canvas of computer vision in the automotive sector.

Agriculture

Computer Vision is becoming an integral part of agriculture for farming aids, crop coverage, yield mapping, yield estimation, disease detection and harvesting, using multitemporal remote sensing imagery. It spans outdoor ground conditions, weather conditions and food processing, down to granular quality assessment of the growth of individual plants / fruits and the detection of visible infestations or weeds.

Effective use of computer vision in farming helps prevent and control crop diseases, insects and weeds, which are the key steps in producing high-quality, pollution-free agricultural products and achieving high yields.

Healthcare

Computer vision is expanding the frontiers of healthcare, augmenting diagnostic and treatment tools and helping healthcare professionals predict the course of diseases more effectively. This is expected to improve patient care outcomes and reduce avoidable delays in the patient care continuum. Using machine learning technologies and algorithms, scientists are collaborating with doctors to teach computers to recognize tumors / abnormalities in retina scans and to support the diagnosis of skin or breast cancers from images, so that diseases can be detected and responded to more precisely. As an example of early diagnosis, computer vision can be applied in breast cancer screening, with algorithms trained to recognize and classify cancerous changes from millions of mammogram images showing both healthy and diagnosed samples.

Applying these algorithms to an image can surface subtle cancerous or precancerous patterns within seconds, offering a great supplementary resource for doctors. In surgical interventions, for instance, a specialized image-processing model can calibrate, orient and navigate input images to improve visualization and guide surgical movements during orthopedic procedures. These vision systems can also process, correct and calibrate images of the operating room, the patient’s body and the surgical tools to create a magnified 3D image of all three, overlaying these three layers into a single view. That view allows the robot to track its own position and the positions of the surgical tools so that it makes accurate movements, bringing precision and real-time cognitive context to next-generation robotic surgery.

Retail

Kiosks use computer vision, 3D reconstruction and deep learning to scan several items at the same time without the need for barcodes. Other retail applications include people counting, facial recognition and sentiment analytics. Another way computer vision can increase sales is by digitizing the shelf: using CCTV footage, computer vision algorithms can transpose the retail store into a digital version of its merchandising efforts. This provides accurate reports of shoppers’ interactions with products, including how many times a product gets picked up and put back, and shows how much space competitors take on the shelves, which helps assess the effectiveness of merchandise displays and supports sales enablement.

The list can go on and on, since the inclusion of computer vision in our day-to-day life has created an enormous value proposition for its beneficiaries.

In summary, although computer vision has the potential to change business workflows and outcomes, the technology still has a hard time tackling small visual variations, e.g. classifying something like an ostrich when just a little bit of noise is added to the original image. Better and more varied test datasets, tuned algorithms, cognitive layers of criteria, etc. are needed, and each element of the domain needs to be improved further. Studying biological vision requires an understanding of the perception organs, like the eyes, as well as of the interpretation of perception within the brain. Much progress has been made, both in charting the process and in discovering the tricks and shortcuts used by the system, although, like any study that involves the brain, there is a long way to go. As the technology gradually improves, though, it will penetrate more domains and help professionals make informed / aided decisions faster and more safely, while giving customers and users more efficient, powerful experiences. What say?

***

Mar 2020. Compilation from various publicly available internet sources; the author’s views are personal.

Originally published at https://www.linkedin.com.