Introduction to YOLOv2

Computer vision, a fast-growing field of artificial intelligence (AI), continues to gain popularity. Put simply, “computer vision allows the detection of high-level, meaningful data through visuals (images or videos)”.

From simple face detection on smartphones to complex cancer detection, the applications of computer vision expand by the day. With advancements this rapid, it is hard to separate the useful information from the noise.

So, we’ve done the research for you, covering:

  1. What is YOLO?
  2. Two-Stage (multistage) Detectors
  3. One-Stage Detectors
  4. What is YOLOv2?
  5. The architecture of YOLO Models
  6. Dense and Sparse Prediction
  7. Significance of YOLOv2
  8. Applications
  9. Problems with YOLOv2

Let’s dive right in.


What is YOLO?

YOLO (You Only Look Once) is an algorithm that performs real-time object detection with high accuracy and speed by passing an image through its network only once. Such an algorithm is known as an ‘object detector’.

Object detection is the technique of locating and identifying objects in images and videos. Object detectors are commonly used in areas like security and traffic monitoring to detect activity or recognize faces. However, their applications are not limited to those.

Two-Stage Detectors

Object detection took off with the help of multi-stage detectors. These detectors process images and videos in multiple stages, typically proposing candidate regions first and then classifying each one.

Their high accuracy and precise localization were a breakthrough, but they eventually fell short. Why? They were slow. The latency in processing each image is what compelled developers to look for a faster algorithm.

One-Stage Detectors

It wasn’t until 2015 that Joseph Redmon and his co-authors introduced YOLO, one of the first one-stage object detectors. They titled the paper ‘You Only Look Once: Unified, Real-Time Object Detection’. The main goal was to save time by doing away with multi-stage detection and replacing it with a faster, one-step process.

One-stage detectors use a single neural network that analyzes the entire image in one pass. YOLO divides the image into a grid of cells, and each cell predicts bounding boxes and class probabilities for objects whose center falls inside it. In other words, YOLO examines the image as a whole instead of examining several regions one after another.

We find applications of this algorithm in many areas, such as traffic light detection, spotting poker cheats, autonomous vehicles, and parking meters.
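
To make the grid idea concrete, here is a minimal Python sketch of how an object’s center could be mapped to a grid cell. The 7 × 7 grid follows the original YOLO paper; the function name `responsible_cell` and the sample coordinates are hypothetical and only serve the illustration.

```python
def responsible_cell(cx, cy, img_w, img_h, grid=7):
    """Return (row, col) of the grid cell that contains the point (cx, cy).

    cx, cy are pixel coordinates of an object's center (illustrative values).
    """
    # Scale the pixel position to grid units, then clamp to the last cell
    # so that a point on the right or bottom edge stays inside the grid.
    col = min(int(cx / img_w * grid), grid - 1)
    row = min(int(cy / img_h * grid), grid - 1)
    return row, col


print(responsible_cell(224, 100, 448, 448))  # (1, 3)
```

In this simplified picture, only the cell returned here would be asked to predict the object; every other cell ignores it.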

What is YOLOv2?

A year after YOLO was introduced, YOLOv2 followed. In 2016, Joseph Redmon and Ali Farhadi released the paper ‘YOLO9000: Better, Faster, Stronger’, so named because the model could detect over 9000 categories of objects. It was a much-improved version that made use of Batch Normalization (BN) and Data Augmentation.

YOLOv2 also introduced the concept of anchor boxes: predefined box shapes that match the objects most likely to be detected. The intersection over union (IoU), i.e. the area of overlap divided by the area of union, between the predicted bounding box and the anchor box is compared against a threshold to decide whether the detection is confident enough to make a prediction.
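
IoU itself is simple to compute. The sketch below assumes boxes are given as (x_min, y_min, x_max, y_max) tuples; that layout and the function name `iou` are choices made for this example, not something prescribed by YOLOv2.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    # Corners of the overlapping rectangle (if there is one).
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])

    # Width and height are clamped at zero when the boxes do not overlap.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # overlap 1, union 7
```

A value of 1.0 means the boxes coincide and 0.0 means they do not overlap at all, so a threshold such as 0.5 (an illustrative value) separates good matches from poor ones.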

Rather than relying on hand-picked anchor boxes, YOLOv2 examines the training data and performs clustering on its bounding boxes (dimension clusters). This ensures that the box shapes found in the training data are well represented by the anchor boxes.
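
A rough sketch of dimension clustering is shown below. It runs k-means over (width, height) pairs and, following the YOLOv2 paper, treats boxes with a high IoU as close to each other (distance = 1 − IoU). The function names, the toy box list, and the fixed random seed are assumptions made for this example.

```python
import random


def wh_iou(a, b):
    """IoU of two (width, height) boxes that share the same center."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)


def dimension_clusters(boxes, k, iters=50, seed=0):
    """k-means on box shapes; the closest centroid is the one with highest IoU."""
    rng = random.Random(seed)
    centroids = rng.sample(boxes, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for box in boxes:
            best = max(range(k), key=lambda i: wh_iou(box, centroids[i]))
            clusters[best].append(box)
        new_centroids = []
        for i, members in enumerate(clusters):
            if members:
                mean_w = sum(w for w, _ in members) / len(members)
                mean_h = sum(h for _, h in members) / len(members)
                new_centroids.append((mean_w, mean_h))
            else:
                # Keep the old centroid if no box was assigned to it.
                new_centroids.append(centroids[i])
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return sorted(centroids)


# Toy data: three small boxes and three large, wide boxes.
boxes = [(10, 10), (12, 11), (11, 12), (100, 50), (110, 55), (105, 52)]
print(dimension_clusters(boxes, k=2))
```

The returned (width, height) pairs can then be used as anchor box shapes. The YOLOv2 paper settles on k = 5 as a trade-off between recall and model complexity.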

The Architecture of YOLO Models

All YOLO models share a similar set of components in their architectures:

  1. The backbone is a convolutional neural network (CNN) that extracts visual features at different shapes and sizes. Models such as ResNet and VGG can also serve as feature extractors.
  2. The neck is a set of layers that combines and blends features before they are passed on to the prediction layer.
  3. The head takes the features from the neck and produces the bounding box predictions. Here, classification and bounding box regression are performed on the features to complete the detection process; the (x, y) coordinates along with the width and height are output.
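
As an illustration of the last step, the sketch below turns raw head outputs into a box using the parameterization described in the YOLOv2 paper: the center is a sigmoid offset from the grid cell’s top-left corner, and the size scales an anchor box exponentially. The function and argument names are hypothetical, and the result is expressed in grid-cell units.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def decode_box(t, cell, anchor):
    """Convert raw outputs (tx, ty, tw, th) into (bx, by, bw, bh).

    cell   -- (cx, cy), top-left corner of the grid cell, in grid units
    anchor -- (pw, ph), width and height of the anchor box, in grid units
    """
    tx, ty, tw, th = t
    cx, cy = cell
    pw, ph = anchor
    bx = sigmoid(tx) + cx   # center x stays inside the cell
    by = sigmoid(ty) + cy   # center y stays inside the cell
    bw = pw * math.exp(tw)  # width scales the anchor width
    bh = ph * math.exp(th)  # height scales the anchor height
    return bx, by, bw, bh


print(decode_box((0.0, 0.0, 0.0, 0.0), cell=(3, 4), anchor=(2.0, 1.5)))
# (3.5, 4.5, 2.0, 1.5)
```

With all raw outputs at zero, the box sits in the middle of its cell and has exactly the anchor’s shape, which is why good anchors make the network’s job easier.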

Dense Prediction and Sparse Prediction

Dense prediction is the final prediction, made up of a vector that contains the coordinates of the predicted bounding box, the confidence score of the prediction, and the label.
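
A small sketch of how such a vector might be read is given below. The layout [x, y, w, h, confidence, class scores...] and the three class names are assumptions for this example; real implementations differ in ordering and scaling.

```python
def parse_prediction(vec, class_names):
    """Split one prediction vector into box, confidence, and label.

    Assumed layout: [x, y, w, h, confidence, score_class_0, score_class_1, ...]
    """
    x, y, w, h, confidence = vec[:5]
    class_scores = vec[5:]
    # The label is the class with the highest score.
    best = max(range(len(class_scores)), key=lambda i: class_scores[i])
    return {
        "box": (x, y, w, h),
        "confidence": confidence,
        "label": class_names[best],
    }


pred = parse_prediction(
    [0.5, 0.4, 0.2, 0.3, 0.9, 0.1, 0.7, 0.2],
    ["cat", "dog", "car"],
)
print(pred["label"], pred["confidence"])  # dog 0.9
```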

Sparse prediction, by contrast, is made only for a limited set of candidate regions rather than for every location in the image, which is how two-stage detectors operate.

Significance of YOLOv2

Easy-to-use and Accurate

YOLOv2 is based on the Darknet-19 architecture, which has 19 convolutional layers and 5 max-pooling layers. It mostly uses 3 × 3 filters and doubles the number of channels after every pooling step. This increases the accuracy of the results.
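
The effect of the pooling layers on the feature map can be worked out with a little arithmetic. The sketch below assumes each max-pooling layer has stride 2, the first convolution has 32 filters, and the input is 416 × 416, as in the published Darknet-19 and YOLOv2 setup; the function names are made up for this example.

```python
def feature_map_size(input_size, num_pools=5, pool_stride=2):
    """Spatial size of the final feature map after all pooling layers."""
    return input_size // (pool_stride ** num_pools)


def channel_schedule(first_channels=32, num_pools=5):
    """Channel count before the first pool and after each of the pools."""
    return [first_channels * 2 ** i for i in range(num_pools + 1)]


print(feature_map_size(416))  # 13
print(channel_schedule())     # [32, 64, 128, 256, 512, 1024]
```

Five stride-2 poolings shrink the image by a factor of 32, so a 416 × 416 input ends up as a 13 × 13 grid on which the detections are predicted.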

Faster and Stronger

The computational complexity of the network is reduced by placing 1 × 1 convolution layers between the 3 × 3 convolution layers to compress the feature representation. This decrease in complexity reduces the inference time and optimizes the performance of the algorithm.
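
The effect of a 1 × 1 layer on complexity can be checked by counting weights. The comparison below is a simplified sketch: it ignores biases and uses illustrative channel widths (1024 and 512), and `conv_weights` is a helper written for this example.

```python
def conv_weights(kernel, in_channels, out_channels):
    """Number of weights in a kernel x kernel convolution (bias ignored)."""
    return kernel * kernel * in_channels * out_channels


# One 3x3 convolution applied directly to 1024 input channels.
direct = conv_weights(3, 1024, 1024)

# Compress to 512 channels with a 1x1 convolution first, then apply the 3x3.
compressed = conv_weights(1, 1024, 512) + conv_weights(3, 512, 1024)

print(direct)      # 9437184
print(compressed)  # 5242880
```

Since the number of multiplications per output position grows with the number of weights, the compressed variant needs roughly 44% fewer operations in this example.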

High Resolution

YOLOv2 works with a higher input resolution than its predecessor. This increases the number of pixels and the amount of information available to the detector, which helps improve the detection accuracy.

Fine-Grained Features

Data Augmentation adds variety to what the model sees by expanding the input dataset. It works by randomly cropping and rotating the input images, which exposes the model to more variations of each object.
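
Cropping and rotating can be sketched without any imaging library by treating an image as a nested list of pixel values. The helper names, the 4 × 4 toy image, and the fixed seed below are assumptions for this example; a real pipeline would also have to transform the bounding boxes in the same way.

```python
import random


def random_crop(image, crop_h, crop_w, rng):
    """Cut a random crop_h x crop_w patch out of a 2D image (list of rows)."""
    top = rng.randint(0, len(image) - crop_h)
    left = rng.randint(0, len(image[0]) - crop_w)
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]


def rotate_90(image):
    """Rotate a 2D image by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


rng = random.Random(0)
image = [[r * 4 + c for c in range(4)] for r in range(4)]  # toy 4x4 image
patch = random_crop(image, 2, 2, rng)
print(patch)
print(rotate_90(patch))
```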

Detects Multiple Objects

As stated by the authors of the algorithm, “YOLO9000 predicts detections for more than 9000 different object categories, all in real-time.”

Batch Normalization

YOLOv2 applies Batch Normalization (BN) after every convolutional layer. It normalizes each layer’s activations to zero mean and unit variance over a batch, which stabilizes training and improves the detection accuracy.
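
The core of the normalization step fits in a few lines. The sketch below works on a plain list of activations for a single channel; `gamma` and `beta` stand in for the learnable scale and shift, and `eps` is a small constant for numerical stability. The default values are illustrative.

```python
import math


def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a batch of activations to zero mean and unit variance,
    then apply a scale (gamma) and shift (beta)."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]


out = batch_norm([1.0, 2.0, 3.0, 4.0, 5.0])
print([round(v, 3) for v in out])
```

After this step the values are centered on zero with a spread of about one, regardless of how large or small the raw activations were.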

It must be noted that:

  • The use of Batch Normalization and Data Augmentation improves the accuracy and mAP over the previous version of YOLO, which suffered from localization errors.
  • Both R-CNN and Faster R-CNN have a lower detection speed than YOLOv2, which processes more frames per second (FPS). The original YOLO alone has a detection speed of 45 FPS on a Titan X GPU, and Fast YOLO can reach a speed of 155 FPS on the same type of GPU.
  • There is a 50% reduction in training a YOLOv2 network.
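
To put the FPS figures above in perspective, they can be converted into the time available per frame. The helper `frame_time_ms` is written for this example.

```python
def frame_time_ms(fps):
    """Average processing time per frame, in milliseconds."""
    return 1000.0 / fps


for name, fps in [("YOLO", 45), ("Fast YOLO", 155)]:
    print(f"{name}: {frame_time_ms(fps):.1f} ms per frame")
# YOLO: 22.2 ms per frame
# Fast YOLO: 6.5 ms per frame
```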

Application of YOLOv2

We have the following applications of YOLOv2:

  • Tiny vehicles
  • Multiple objects
  • Maximized visuals

Issues With YOLOv2

While YOLOv2 may sound like the perfect go-to algorithm for computer vision, later versions improved on it even further. Today, we have the 6th version of the series, YOLOv6!

Here are the features that the new versions have:


  • More intuitive understanding
  • Use of independent logistic classification
  • Prediction of boxes at each scale
  • New architecture
  • A large number of the backbone network
  • High hardware performance 


  • Faster and efficient
  • Contains high average precision (AP) and frames per second (FPS)
  • Trained to do classification and bounding box regression at the same time
  • Simpler architecture
  • Contains 80 built-in object classes


  • Smaller in size (27MB)
  • Lightweight – Extremely fast (140FPS)
  • More accurate (0.895mAP)

Final Thoughts

Development in IT continues to accelerate because algorithms can be modified and there is a constant need to grow. YOLOv2 was a major advancement in computer vision, and it provided the foundation for later versions to perform more effectively and efficiently. Newer models use it as a base and continue to add variations and improvements to perform better.

Ashwin Joy

I'm the face behind Pythonista Planet. I learned my first programming language back in 2015. Ever since then, I've been learning programming and immersing myself in technology. On this site, I share everything that I've learned about computer programming.
