With the rise of AI, many new algorithms and software tools were developed to support the field and advance its models. One task that gained particular importance is object detection. It is not to be confused with image classification, which assigns a single label to a whole image: object detection takes an image and identifies the objects of interest in it along with their locations.
SSD, short for Single Shot MultiBox Detector and often called Single Shot Detection, is one such algorithm designed for object detection in the field of computer vision. Other object detection models exist as well, including the R-CNN family and YOLO.
An important point to note at this stage is the difference between single-stage and two-stage detection. Two-stage detectors first break the image down into Regions of Interest (RoIs) and then classify each region. Single-stage detectors skip the RoI step and predict object classes and locations directly from the image in one pass.
SSD is a single-stage detector, and it was a major development at a time when computer vision was dominated by two-stage detectors such as the R-CNN algorithms.
Another algorithm that shares features with SSD is YOLO (You Only Look Once). It is also a single-stage object detection algorithm that employs a convolutional neural network to detect objects in a single pass. It is highly valued in computer vision for its speed and accuracy and is used in various real-life applications.
YOLO and SSD are similar in several aspects of their detection process. Both lay a grid over the image (or over its feature maps) and predict bounding boxes for each grid cell.
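Both also rely on the Intersection over Union (IoU) measure, the area of overlap between two boxes divided by the area of their union, to judge how well a bounding box matches an object. One main difference is that SSD makes its predictions with convolutional layers applied to feature maps at several scales, whereas the original YOLO made its predictions with fully connected layers on a single feature map.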
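Since IoU comes up repeatedly in both algorithms, a small example may help. The following is a minimal sketch, assuming boxes are given as (x_min, y_min, x_max, y_max) tuples; the function name and box format are illustrative choices, not taken from either algorithm's code.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x_min, y_min, x_max, y_max)."""
    # Corners of the overlapping rectangle, if there is one.
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])

    # Width and height are clamped at zero when the boxes do not overlap.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Two 2x2 boxes sharing a 1x1 corner: overlap 1, union 7.
print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # about 0.143 (1 / 7)
```

An IoU of 1 means the boxes coincide, and 0 means they do not overlap at all.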
Functioning of SSD
Coming back to Single Shot Detection, SSD uses a VGG-16 network followed by additional convolutional layers to detect and locate objects in an image. The main job of VGG-16 is to extract features from the image. The convolutional layers that follow are then used to detect the objects and their locations.
Broadly, SSD consists of two parts:
- Backbone: the feature extractor, for which the VGG-16 network is used
- SSD head: additional convolutional layers whose feature maps progressively decrease in size
In SSD, the feature maps are treated as grids of various resolutions, down to coarse ones such as 3✕3. Anchor boxes (called default boxes in the SSD paper) are then placed at every cell of each grid, so that some of them cover the objects of interest.
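To illustrate how anchor boxes are laid over a grid, here is a minimal sketch for a single 3✕3 grid. The function name, the scale of 0.5, and the three aspect ratios are assumptions chosen for illustration; the actual SSD configuration uses a different scale for each feature map and an extra box for aspect ratio 1, which are omitted here.

```python
import math


def grid_anchors(grid_size, scale, aspect_ratios):
    """Return anchors as (cx, cy, w, h) in coordinates normalized to [0, 1]."""
    anchors = []
    for row in range(grid_size):
        for col in range(grid_size):
            # Each anchor is centered on its grid cell.
            cx = (col + 0.5) / grid_size
            cy = (row + 0.5) / grid_size
            for ratio in aspect_ratios:
                # Same area for every ratio; only the shape changes.
                w = scale * math.sqrt(ratio)
                h = scale / math.sqrt(ratio)
                anchors.append((cx, cy, w, h))
    return anchors


# Hypothetical settings: one 3x3 grid, one scale, three aspect ratios.
anchors = grid_anchors(grid_size=3, scale=0.5, aspect_ratios=(1.0, 2.0, 0.5))
print(len(anchors))  # 27 = 3 * 3 cells * 3 ratios
```

Repeating this over several grids of different resolutions is what lets one network cover both large and small objects.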
In total, SSD applies 8732 anchor boxes of different scales and aspect ratios to a 300✕300 input image, and these are matched with the ground truth boxes of that image.
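The figure of 8732 can be checked with simple arithmetic. The sketch below assumes the SSD300 configuration described in the SSD paper (six feature maps with 4 or 6 boxes per cell); those grid sizes are not stated in this article and are included only to show where the total comes from.

```python
# (grid size, anchor boxes per cell) for each feature map, assuming the
# SSD300 configuration from the SSD paper.
feature_maps = [(38, 4), (19, 6), (10, 6), (5, 6), (3, 4), (1, 4)]

# Each map contributes grid_size * grid_size cells, each with its own boxes.
per_map = [size * size * boxes for size, boxes in feature_maps]
total = sum(per_map)

print(per_map)  # [5776, 2166, 600, 150, 36, 4]
print(total)    # 8732
```

Most of the boxes come from the finest 38✕38 grid, which is the one responsible for the smallest objects.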
The overlap of each anchor box with the ground truth boxes, measured as IoU, is then checked. The anchor boxes with the highest overlap are used to learn the presence and location of the object. It is important to note that, apart from the best-matching box for each ground truth box, a matching box must have an overlap of at least 50%.
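The matching step can be sketched as follows. This is a simplified illustration, assuming corner-format boxes (x_min, y_min, x_max, y_max) and hypothetical function names; it follows the idea of keeping matches with at least 50% overlap and always giving each ground truth box its best anchor, and is not the reference implementation.

```python
def iou(a, b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match_anchors(anchors, gt_boxes, threshold=0.5):
    """For each anchor, return the index of its matched ground truth box, or -1."""
    if not anchors:
        return []
    matches = [-1] * len(anchors)

    # Step 1: an anchor matches the ground truth box it overlaps most,
    # provided that overlap reaches the threshold.
    for i, anchor in enumerate(anchors):
        best_gt, best_iou = -1, threshold
        for j, gt in enumerate(gt_boxes):
            overlap = iou(anchor, gt)
            if overlap >= best_iou:
                best_gt, best_iou = j, overlap
        matches[i] = best_gt

    # Step 2: every ground truth box keeps its single best anchor,
    # even if that overlap is below the threshold.
    for j, gt in enumerate(gt_boxes):
        best_anchor = max(range(len(anchors)), key=lambda i: iou(anchors[i], gt))
        matches[best_anchor] = j
    return matches


anchors = [(0, 0, 2, 2), (1, 1, 3, 3), (10, 10, 12, 12)]
gt_boxes = [(0, 0, 2, 2), (9, 9, 12, 12)]
print(match_anchors(anchors, gt_boxes))  # [0, -1, 1]
```

In this example the third anchor overlaps the second ground truth box by only about 44%, but it is still matched because it is that box's best candidate.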
At prediction time, boxes that heavily overlap a higher-scoring box are discarded, and only the top 200 unique boxes are kept. This process is known as non-maximum suppression (NMS). During training, data augmentation is also used to show the algorithm objects at varying sizes and orientations in order to improve its detection capability.
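Non-maximum suppression can be illustrated with a short greedy sketch. It assumes corner-format boxes, a single object class, and an IoU threshold of 0.45 (the value used in the SSD paper, not stated in this article); the limit of 200 kept boxes follows the text above. The function names are illustrative.

```python
def iou(a, b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.45, top_k=200):
    """Return indices of the boxes to keep, highest score first."""
    # Visit boxes from the most to the least confident.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap an already kept box too much.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
            if len(keep) == top_k:
                break
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]
```

Here the second box is dropped because it overlaps the higher-scoring first box by about 68%, while the third box is far away and survives.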
Matching anchor boxes to the ground truth and augmenting the data are part of training the SSD; the trained SSD is then used for object detection in a single shot.
Why is SSD a Big Deal?
Single Shot Detectors were introduced after several other CNN-based detection algorithms. One of the main reasons SSD became so popular was that it detects objects in a single shot while remaining highly accurate.
A measure of its accuracy comes from the research paper titled SSD: Single Shot MultiBox Detector, which reported about 74% mAP (mean Average Precision) on the PASCAL VOC2007 test set at 59 frames per second.
As mentioned above, SSD does not employ RoIs but processes the whole image in one pass, while older models would first divide the image into RoIs and then proceed to object detection.
Initially, in the field of computer vision, image classification was a task that software came to perform as well as, or even better than, humans. However, when it came to object detection, machines were found lacking.
One of the first deep neural network algorithms for object detection was OverFeat. Although it was a decent starting point, its sliding window approach created many problems because the fixed-shape windows could not adapt to objects of different shapes.
These issues were addressed by later object detection algorithms such as R-CNN, Fast R-CNN, and Faster R-CNN, but these were also found to be slow in their training phase. Training had to be done in two phases due to the two-stage nature of the algorithms, and the networks struggled when faced with objects they had not been trained on.
To address the slowness of such algorithms, YOLO and SSD were developed, combining higher speed with high accuracy. Due to its speed and accuracy, SSD has found applications in fields such as autonomous driving, agriculture, retail, security, and medicine.
Downsides of SSD
Even as we speak of the advantages of the SSD algorithm, we also have to take its disadvantages into account. It has been observed that SSD does not perform well on smaller objects, because they are detected from the shallow, high-resolution feature maps, which do not carry high-level features.
Secondly, the algorithm can get confused when detecting multiple objects that belong to the same class. Some of these problems are shared by newer models, such as the later YOLO versions, even though these improved on speed and accuracy. For example, YOLOR is regarded as one of the fastest, while YOLOv3 has better loss function convergence.
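RetinaNet, in turn, was introduced to improve upon the weaknesses of the YOLO and SSD algorithms. Its focal loss function down-weights easy examples to counter the imbalance between background and object boxes during training, which helped it surpass earlier single-stage detectors in accuracy. Finally, although the SSD authors did publish their original Caffe implementation as open source, SSD has not been developed through a long-running line of open-source releases in the way YOLO has.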
To conclude, SSD was, along with YOLO, one of the main developments that ushered in the era of single-stage detectors, and it served as an important blueprint for more advanced models.
The speed and accuracy of SSD set the bar for object detection algorithms. It was also an advancement in the use of convolutional networks for object detection, improving on training time and class learning.
Although algorithms developed in later years, such as YOLOv3 and RetinaNet, have improved upon SSD in individual areas such as speed and training, SSD is still a decent choice when all-round performance in speed, accuracy, and real-time identification is needed.