For humans, sight is arguably our greatest sense. Our eyes are the gateways through which we experience the world. The field of computer vision is an attempt to replicate, and potentially surpass, the magic of human sight. Yet as remarkable as our eyes are, we typically can’t “see everything” at once.
Computers, on the other hand, are a different story. With the help of computer vision, we can delegate mundane visual tasks to machines and extract details from the world that are otherwise imperceptible to human vision.
This article aims to introduce you to YOLOX, a cutting-edge object detection algorithm. We will go over its history, structure, operation, significance, flaws, and future prospects. In order for you to get a thorough understanding of what YOLOX is all about, let’s start with its older brother YOLO.
What is YOLO?
YOLO is an object detection algorithm whose name stands for You Only Look Once. The name references YOLO’s ability to detect objects in real time via a single forward pass through a neural network. YOLO was first introduced in 2015 by Joseph Redmon and has since become an industry standard for object detection, largely thanks to its speed and accuracy.
As the name implies, object detection involves detecting objects in videos and images. Object detection aims to answer two questions: what is the object? (classification) and where is it? (localization).
Before the introduction of YOLO, object detection was dominated by two-stage detection algorithms, most notably R-CNN (Region-based Convolutional Neural Network). A two-stage algorithm performs object detection in two stages: in the first stage, the image is divided into regions likely to contain an object; in the second stage, the algorithm classifies the object in each of those regions.
On the other hand, one-stage detectors like YOLO carry out object detection without the first stage (known as region proposal) that two-stage detectors rely on.
What is YOLOX?
YOLOX is the latest addition to the YOLO series. Since the original, there have been YOLOv2, YOLOv3, YOLOv4, YOLOv5, and YOLOR. Each is a capable object detection algorithm in its own right, improving on the previous iteration, and YOLOX continues that tradition.
YOLOX uses a modified version of YOLOv3 (YOLOv3-SPP) as its baseline, with Darknet-53 (a convolutional neural network with 53 convolutional layers) as the backbone. This architecture helps YOLOX deliver strong object detection performance compared to the alternatives, as sketched below.
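To make the overall structure concrete, here is a minimal, purely illustrative PyTorch sketch of such a pipeline (backbone → neck → per-level detection heads). The class and module names are my own placeholders, not the official YOLOX code.

```python
# Illustrative sketch of a one-stage detector pipeline: a backbone extracts
# features, a neck fuses them across scales, and a head predicts per level.
import torch.nn as nn

class Detector(nn.Module):
    def __init__(self, backbone, neck, heads):
        super().__init__()
        self.backbone = backbone            # e.g. a Darknet-53-style feature extractor
        self.neck = neck                    # e.g. SPP + FPN-style feature fusion
        self.heads = nn.ModuleList(heads)   # one detection head per feature level

    def forward(self, images):
        features = self.neck(self.backbone(images))
        # each feature level produces its own set of predictions
        return [head(f) for head, f in zip(self.heads, features)]
```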
Three key pillars make YOLOX stand out: a decoupled head, an advanced label assignment strategy, and an anchor-free design.
So, what makes YOLOX so special? Why is the machine learning community stoked about its introduction? Here are some of the core tenets that make YOLOX a breakthrough.
Decoupled Head
Previous YOLO versions used a coupled head, meaning classification and localization were produced by a single shared pipeline. Coupled heads have been shown to hurt a model’s performance and slow its convergence.
YOLOX instead uses a decoupled head: the classification and localization branches are separated. This improves performance and speeds up convergence compared to algorithms with a coupled-head setup.
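To illustrate the idea, here is a minimal PyTorch sketch of a decoupled head, with one branch for class scores and a separate branch for box regression and objectness. The layer sizes and activations are illustrative assumptions, not the official YOLOX implementation.

```python
# A minimal sketch of a decoupled detection head: a shared stem, then
# separate branches for classification and regression/objectness.
import torch.nn as nn

class DecoupledHead(nn.Module):
    def __init__(self, in_channels=256, num_classes=80):
        super().__init__()
        # 1x1 conv to reduce channels before splitting into two branches
        self.stem = nn.Conv2d(in_channels, 256, kernel_size=1)
        # classification branch: class scores per spatial location
        self.cls_branch = nn.Sequential(
            nn.Conv2d(256, 256, 3, padding=1), nn.SiLU(),
            nn.Conv2d(256, num_classes, 1),
        )
        # regression branch: box coordinates (4) and objectness (1)
        self.reg_branch = nn.Sequential(
            nn.Conv2d(256, 256, 3, padding=1), nn.SiLU(),
        )
        self.box_pred = nn.Conv2d(256, 4, 1)
        self.obj_pred = nn.Conv2d(256, 1, 1)

    def forward(self, x):
        x = self.stem(x)
        cls_out = self.cls_branch(x)        # (B, num_classes, H, W)
        reg_feat = self.reg_branch(x)
        box_out = self.box_pred(reg_feat)   # (B, 4, H, W)
        obj_out = self.obj_pred(reg_feat)   # (B, 1, H, W)
        return cls_out, box_out, obj_out
```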
Strong Data Augmentation
As part of its improvements, YOLOX adds two strong augmentation strategies: Mosaic and MixUp. Data augmentation involves modifying the data you already have in order to provide more training data for your model, reducing the need to collect new data.
Mosaic combines 4 training images into one according to certain ratios, which helps the model recognize objects at much smaller scales. MixUp, on the other hand, generates a weighted combination of random pairs of training images.
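As a rough illustration, here is a minimal MixUp sketch in Python, assuming the two images are same-shaped NumPy arrays and the labels are lists of boxes. The actual YOLOX augmentation pipeline has more moving parts; this only shows the core weighted-combination idea.

```python
# Minimal MixUp sketch: blend two images with a random weight and keep
# the boxes from both (a common choice for detection-style MixUp).
import numpy as np

def mixup(img_a, img_b, boxes_a, boxes_b, alpha=1.5):
    # sample a mixing weight from a Beta distribution
    lam = np.random.beta(alpha, alpha)
    mixed = (lam * img_a.astype(np.float32)
             + (1.0 - lam) * img_b.astype(np.float32))
    mixed_boxes = boxes_a + boxes_b   # keep annotations from both images
    return mixed.astype(np.uint8), mixed_boxes
```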
Anchor-Free
Anchor-based systems have their pitfalls: they add time and potential bottlenecks to the detection pipeline. They require clustering analysis to choose anchor sizes as part of the training process, and they increase the number of predictions made per location, which increases inference time.
Given these problems with anchor-based systems, it was clear that moving away from them could yield a performance gain. YOLOX’s anchor-free architecture reduces the number of predictions per grid cell from 3 to 1: each cell directly predicts the box’s centre offset and its width and height, as in the sketch below.
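To see what “one prediction per cell” means in practice, here is a simplified decoding sketch: each grid cell predicts a centre offset and log-scaled width/height, which are mapped back to image coordinates using the feature map’s stride. The exact parameterization in YOLOX may differ in detail; treat this as an illustration of anchor-free decoding, not the official code.

```python
# Simplified anchor-free decoding for one feature level.
import torch

def decode_boxes(preds, stride):
    # preds: (H, W, 4) raw outputs = (dx, dy, log_w, log_h) per grid cell
    h, w, _ = preds.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    # one prediction per cell: centre offset relative to the cell,
    # width/height predicted in log space
    cx = (xs + preds[..., 0]) * stride
    cy = (ys + preds[..., 1]) * stride
    bw = torch.exp(preds[..., 2]) * stride
    bh = torch.exp(preds[..., 3]) * stride
    return torch.stack([cx, cy, bw, bh], dim=-1)  # (H, W, 4) in image coords
```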
SimOTA
YOLOX also introduces an advanced label assignment technique called SimOTA. In choosing a label assignment strategy, the team behind YOLOX wanted a system that meets four objectives: loss/quality awareness, a centre prior, a dynamic number of positive cells for each ground truth, and a global view of the image. In essence, SimOTA computes a cost for every ground-truth/prediction pair from the classification and regression losses and assigns each ground truth its lowest-cost predictions.
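As a rough illustration of the idea (not the official implementation), the sketch below builds a cost matrix from classification and regression losses, applies the centre prior, and selects a dynamic number of lowest-cost predictions per ground truth. The inputs and the dynamic-k heuristic here are simplified assumptions.

```python
# Simplified SimOTA-style assignment: lowest-cost cells per ground truth,
# with a centre prior and a dynamic k derived from the top IoUs.
import torch

def simota_assign(cls_cost, iou, in_center, lambda_reg=3.0, topk_ious=10):
    # cls_cost, iou: (num_gt, num_preds) float tensors
    # in_center: (num_gt, num_preds) bool tensor (centre prior mask)
    reg_cost = -torch.log(iou + 1e-8)
    cost = cls_cost + lambda_reg * reg_cost
    cost = cost + 1e5 * (~in_center)   # exclude cells outside the centre region
    assignments = []
    for g in range(cost.shape[0]):
        # dynamic k: roughly the sum of the top IoUs for this ground truth
        k = max(int(iou[g].topk(min(topk_ious, iou.shape[1])).values.sum()), 1)
        assignments.append(cost[g].topk(k, largest=False).indices)
    return assignments  # list of assigned prediction indices per ground truth
```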
Significance of YOLOX
According to its creators, YOLOX surpasses NanoDet by 1.8% AP, YOLOv3 by 3% AP, and YOLOv5-L by 1.8% AP on the COCO dataset. This makes YOLOX one of the strongest choices for object detection when both speed and accuracy matter.
Final Thoughts
YOLOX is a welcome improvement to the YOLO series. Its combination of speed and accuracy makes it a go-to choice in the world of computer vision. I hope this article was helpful. Happy coding!