Introduction to YOLOv1

Computer vision is a fascinating and heavily explored branch of artificial intelligence. By automating tasks once exclusive to human vision, computer vision enables computers to extract high-level information from images and videos and to understand the objects within them. You Only Look Once (YOLO) was a breakthrough object detection concept that significantly advanced the field.

This article will explore the concept of the original version of YOLO, YOLOv1, its history, mode of operation, significance, and real-life applications. Going further, we will take an insightful look into the advantages YOLOv1 had over its predecessors and the drawbacks that newer models sought to address.

What is YOLO?

You Only Look Once broke onto the computer vision scene in 2015 via a paper written by Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. In the paper, Redmon and his co-authors proposed a different approach to object detection. Earlier research had repurposed classifiers to detect objects in images.

YOLO would instead frame object detection as a regression problem, mapping straight from image pixels to spatially separated bounding boxes and their associated class probabilities. The authors proposed a model that was not only fast but could achieve double the mean average precision (mAP) of other real-time detectors.

YOLOv1 was the base model; YOLOv2, YOLOv3, and YOLOv4 were improved versions of it. While it simplified and sped up object detection on a large scale, YOLO had its fair share of shortcomings.


How Does YOLO Work?

Object detection in computer vision is akin to how the eyes help the mind recognize objects in a scene. Object detection involves identifying and localizing instances of an object (belonging to a specific class) within an image or video.

Object detection evolved from the traditional methods to the deep learning methods, moving from Viola-Jones detectors, HOG detectors, and Deformable Part-Based Models to convolutional neural networks and deep convolutional networks.

In the era of deep learning, which we are in today, object detection takes one of two broad approaches: two-stage detection and one-stage detection. Generally, one-stage object detection models prioritize inference speed, while their two-stage counterparts concentrate more on achieving detection accuracy.

One-stage detection models include RetinaNet, YOLO, and Single Shot Multibox Detector, while two-stage networks include Cascade R-CNN, Fast R-CNN, and Mask R-CNN.

In the paper, the authors frame object detection as a regression problem. YOLO uses a single convolutional neural network to predict bounding boxes and class probabilities, considering the entire image in a single evaluation.

In one pass over one image, YOLO simultaneously predicts multiple bounding boxes, the class probabilities for each box, and all bounding boxes across all classes – making it a one-stage detection model.

Unlike earlier object detection models, which localized objects in images by using regions of the image with high probabilities of containing the object, YOLO considers the full image.


The basic structure of the YOLOv1 convolutional neural network involves two separate processes. The network’s initial convolutional layers extract features from the image, while the fully-connected layers predict the output probabilities and coordinates. The network architecture is modeled after the GoogLeNet framework for image classification.

YOLOv1 has twenty-four convolutional layers followed by two fully-connected layers. Where GoogLeNet uses inception modules, YOLOv1 instead adopts 1×1 reduction layers followed by 3×3 convolutional layers.

The YOLOv1 framework splits the input image into an S×S grid. Each grid cell is responsible for detecting any object whose center falls within it, and predicts B bounding boxes along with a confidence score for each box.
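The grid-cell assignment can be sketched in a few lines. This is an illustrative helper (the function name and normalization by image size are our assumptions, not code from the paper), assuming S = 7 as in the original PASCAL VOC setup:

```python
# Hypothetical sketch: which S x S grid cell "owns" an object, i.e.
# which cell contains the object's center point. Assumes S = 7.
def owning_cell(center_x, center_y, img_w, img_h, S=7):
    """Return the (row, col) of the grid cell containing the object's center."""
    col = int(center_x / img_w * S)   # which column the center falls in
    row = int(center_y / img_h * S)   # which row the center falls in
    # Clamp so a center exactly on the right/bottom edge stays inside the grid.
    return min(row, S - 1), min(col, S - 1)

# An object centered at (320, 240) in a 448 x 448 image lands in cell (3, 5).
print(owning_cell(320, 240, 448, 448))  # (3, 5)
```

This cell, and only this cell, is then responsible for predicting that object's boxes and class.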

Five numbers (pc, bx, by, bh, bw) represent each bounding box: ‘pc’ is the confidence that an object is in the bounding box, while ‘bx, by, bh, bw’ encode the box’s center coordinates and dimensions. In addition, each grid cell predicts a vector ‘c’ of conditional class probabilities, which is shared by all of that cell’s boxes.
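Putting the pieces together, the network's output for the PASCAL VOC setup in the paper (S = 7, B = 2, C = 20 classes) is an S×S×(B·5 + C) = 7×7×30 tensor. The sketch below illustrates that layout; the per-cell ordering shown in the comment is one common convention, an implementation choice rather than something fixed by the paper:

```python
import numpy as np

S, B, C = 7, 2, 20                     # grid size, boxes per cell, classes (PASCAL VOC)
output = np.zeros((S, S, B * 5 + C))   # the 7 x 7 x 30 prediction tensor

# One common per-cell layout (an assumption for illustration):
#   [x1, y1, w1, h1, conf1,  x2, y2, w2, h2, conf2,  p(class_0) ... p(class_19)]

# At test time, the class-specific score for a box is the box confidence
# multiplied by the cell's conditional class probabilities.
cell = output[3, 5]
conf_box1 = cell[4]
class_probs = cell[B * 5:]
class_scores = conf_box1 * class_probs  # one score per class for this box

print(output.shape)  # (7, 7, 30)
```

With 7×7 cells and 2 boxes per cell, the whole image yields only 98 candidate boxes, which is part of why the pipeline is so fast.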

YOLO is fast and uses a simple pipeline: at test time, it simply runs the neural network on a new image to predict detections. The base network runs at 45 frames per second with no batch processing, and the fast version (Fast YOLO) exceeds 150 fps.

YOLOv1’s speed means it can process live video streams in real time with less than 25 milliseconds of latency. YOLOv1 also generalizes well to person detection in artwork and real-time detection in the wild.
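The latency figure follows directly from the frame rates: a model that processes 45 frames per second spends at most 1000/45 milliseconds per frame. A quick sanity check of the arithmetic:

```python
# Per-frame latency implied by the reported frame rates.
base_latency_ms = 1000 / 45    # base YOLO at 45 fps
fast_latency_ms = 1000 / 150   # Fast YOLO at 150+ fps

print(round(base_latency_ms, 1))  # 22.2 ms per frame, under the 25 ms bound
print(round(fast_latency_ms, 1))  # 6.7 ms per frame
```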

Significance of YOLOv1

YOLOv1 was a breakthrough for object detection technology. The network’s simplicity, its speed, and its advantage over older object detection models in global reasoning and learning generalizable object representations placed it ahead of the pack. When trained on several datasets, YOLOv1 outperformed several of its predecessors in both detection speed and mean average precision.

YOLOv1’s significance is bolstered by a learning model that not only considers the entire image during training but is also less likely to break down when new inputs and new domains are introduced. The network learns generalizable object representations and reasons globally about images when making predictions.

YOLOv1 Compared to Older Object Detection Models

YOLOv1 vs Deformable Parts-Based Models (DPM)

DPM uses a sliding-window approach, extracting static features, classifying regions, and predicting bounding boxes through a disjointed pipeline. YOLOv1 replaces these separate parts with a single neural network that trains and optimizes its features for detection, resulting in faster and more accurate object detection than DPM.


YOLOv1 vs R-CNN

R-CNN employs region proposals, considering only the image segments with a high probability of containing an object. In comparison, YOLOv1 looks at the whole picture. R-CNN is also a slow two-stage model that generates a very large number of proposed bounding boxes per image, while YOLOv1 predicts far fewer boxes, and its unified approach leads to swift object detection.

YOLOv1 vs MultiGrasp

YOLOv1 shares several design similarities with MultiGrasp, such as its grid-based bounding box prediction, but its task is more complex: unlike MultiGrasp, which only needs to find a single graspable region, YOLOv1 must determine the size, location, and boundaries of multiple objects and also predict each object’s class.

Real-life Applications of YOLOv1

  • Self-driving cars: YOLOv1 has proven that it can deliver object detection in real time, with less than 25 milliseconds of latency. Fast and accurate real-time object detection models like YOLOv1 allow computers to drive cars without specialized sensors.
  • Tracking systems: When connected to a webcam, YOLOv1 detects objects in real-time like a live tracking system. Although the network detects objects individually, it performs well for moving objects due to its generalization abilities.

Shortcomings of YOLOv1

Compared to newer models, YOLOv1 falls short in certain areas. First, the model imposes strong spatial constraints on bounding box predictions: each grid cell predicts only B boxes and a single class, which limits the number of nearby objects it can detect. As a result, YOLOv1 struggles with small objects that appear in groups, e.g., a flock of birds. One example of a YOLOv1 error is found in the original paper, where the model identifies a person in midair as an airplane.

Since YOLOv1 learns to predict bounding boxes from data, it struggles to generalize to objects with new or unusual aspect ratios. Finally, YOLOv1 also records errors due to incorrect localization. These errors stem from its loss function, which treats deviations in small and large bounding boxes the same way, even though a small deviation matters far more for a small box.
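The paper partially mitigates this by having the loss predict the square roots of box width and height rather than the raw values. A small illustrative sketch (the numbers below are ours, chosen only to show the effect):

```python
import math

def wh_error(pred, true):
    """Plain squared error on a box dimension."""
    return (pred - true) ** 2

def wh_error_sqrt(pred, true):
    """YOLOv1-style squared error on the square root of a box dimension."""
    return (math.sqrt(pred) - math.sqrt(true)) ** 2

# The same 5-pixel miss on a small box vs a large one:
print(wh_error(10, 15) == wh_error(100, 105))  # True: plain squared error can't tell them apart

small = wh_error_sqrt(10, 15)    # off by 5 on a 15-wide box
large = wh_error_sqrt(100, 105)  # off by 5 on a 105-wide box
print(small > large)             # True: the small-box miss is penalized more heavily
```

Taking square roots compresses large dimensions more than small ones, so the same absolute error costs more on a small box, which is the desired behavior.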

Newer models of YOLO (YOLOv2, YOLOv3, and YOLOv4) address these issues, increasing object detection accuracy without sacrificing speed. YOLOv4, in particular, packs a bag of freebies (methods that enhance model performance without increasing inference cost) and a bag of specials (methods that slightly increase inference cost but significantly improve accuracy).

Final Thoughts

Successive models of YOLO and other new object detection models are now ahead of YOLOv1 in speed and accuracy. Yet, YOLOv1 formed the framework for these successes, emerging as a breakthrough that eliminated multiple pipelines and completed object detection in record time.

YOLOv1’s emergence has led to several advancements in computer vision and its real-life applications.
