
Fast R-CNN is an object detection algorithm, introduced by Ross Girshick in 2015, that is widely used in the field of computer vision. It detects objects in images or video frames both faster and more accurately than earlier techniques such as sliding-window detection and R-CNN.

Its main advantage is speed: the expensive convolutional computation is performed once per image and shared across all candidate regions. It also achieves high accuracy in object detection tasks, which has made it a popular choice for a wide range of applications.

Let’s dive deep into Fast R-CNN.

What is Fast R-CNN?

Fast R-CNN is a powerful object detection model that utilizes a deep convolutional neural network to accurately and efficiently predict the locations of objects in an image. Unlike previous object detection models, which typically consisted of multiple stages, Fast R-CNN consists of a single unified network that can be trained end-to-end.

One of the key innovations of Fast R-CNN is that it significantly improves training and testing speed compared to previous models. For example, Fast R-CNN trains the deep VGG16 network nine times faster than R-CNN and is 213 times faster at test time, while achieving higher mean average precision (mAP) on the PASCAL VOC 2012 dataset. Compared to SPPnet, Fast R-CNN trains VGG16 three times faster, tests ten times faster, and is more accurate.

The Working of Fast R-CNN

The Fast R-CNN algorithm is similar to the R-CNN algorithm in that it uses a deep convolutional neural network (CNN) to detect objects in images. However, rather than feeding individual region proposals to the CNN, Fast R-CNN feeds the entire input image to the CNN once to generate a convolutional feature map. Region proposals, produced by an external method such as selective search, are then projected onto this feature map as regions of interest (RoIs). An RoI pooling layer converts each RoI into a feature map of fixed size that can be fed into the fully connected layers.

From this RoI feature vector, Fast R-CNN uses a softmax layer to predict the class of the proposed region and a sibling regression layer to predict the offset values for the bounding box. Fast R-CNN is faster than R-CNN because it runs the convolutional layers only once per image, rather than once for each of roughly 2000 region proposals. This shared computation is what lets Fast R-CNN process images efficiently without giving up accuracy.

Fast R-CNN architecture and training

Fast R-CNN takes an entire image as input, along with a set of object proposals, which are regions of the image that may contain objects. The network processes the entire image with multiple convolutional (CONV) and max pooling layers to produce a feature map.

For each object proposal, a region of interest (RoI) pooling layer extracts a fixed-length feature vector from the feature map. These feature vectors are then fed into a series of fully connected layers, which branch into two output layers: one that produces softmax probability estimates for K object classes, plus a catch-all “background” class, and another layer that outputs four real-valued numbers for each of the K object classes. These numbers encode refined bounding-box positions for each of the K classes.
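
To make the "four real-valued numbers" concrete, here is a minimal sketch of how such offsets can be turned back into a box. It assumes the center-based parameterization inherited from R-CNN (a translation scaled by the proposal size, plus log-space width and height shifts). The function name `apply_offsets` and the corner format (x1, y1, x2, y2) are choices made for this example, not something the article defines.

```python
import math

def apply_offsets(proposal, offsets):
    """Refine a proposal box with predicted offsets (tx, ty, tw, th).

    proposal: (x1, y1, x2, y2) corner coordinates (format assumed here).
    offsets:  translation of the box center in units of the proposal's
              width/height, and log-space scaling of width/height.
    """
    x1, y1, x2, y2 = proposal
    pw, ph = x2 - x1, y2 - y1                   # proposal width and height
    px, py = x1 + 0.5 * pw, y1 + 0.5 * ph       # proposal center
    tx, ty, tw, th = offsets
    cx, cy = px + pw * tx, py + ph * ty         # shifted center
    w, h = pw * math.exp(tw), ph * math.exp(th)  # rescaled size
    return (cx - 0.5 * w, cy - 0.5 * h, cx + 0.5 * w, cy + 0.5 * h)

# Zero offsets leave the proposal unchanged; tx = 0.25 shifts it right
# by a quarter of its width.
print(apply_offsets((10, 20, 50, 60), (0, 0, 0, 0)))
print(apply_offsets((10, 20, 50, 60), (0.25, 0, 0, 0)))
```

At test time, the network would pick the four offsets belonging to the predicted class and apply them to the corresponding proposal in this way.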

1. The RoI pooling layer

The RoI (region of interest) pooling layer is used to convert the features inside a given RoI into a small feature map with fixed spatial dimensions H × W (e.g., 7 × 7). H and W are hyperparameters of the layer that are independent of the specific RoI being processed. An RoI is defined as a rectangular window within a convolutional feature map, specified by a four-tuple (r, c, h, w) that indicates its top left corner (r, c) and its height and width (h, w).

The RoI pooling layer is similar to the spatial pyramid pooling (SPP) layer used in SPPnets, but with a single pyramid level. It divides the h × w RoI window into an H × W grid of sub-windows of approximate size h/H × w/W and max-pools the values in each sub-window into the corresponding output cell. Pooling is applied independently to each feature map channel. The exact sub-window calculations are given in the original Fast R-CNN paper.
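
The following is a minimal single-channel sketch of RoI max pooling in plain Python. The floor/ceil rule used to place the sub-window boundaries is an assumption borrowed from common implementations, and `roi_max_pool` is a name made up for this example. A small 2 × 2 output is used instead of 7 × 7 so the result is easy to check by hand.

```python
import math

def roi_max_pool(feature_map, roi, out_h=2, out_w=2):
    """Max-pool one RoI of a 2-D feature map into an out_h x out_w grid.

    feature_map: list of rows (one channel only, for simplicity).
    roi: (r, c, h, w), the top-left corner and height/width in cells.
    """
    r, c, h, w = roi
    pooled = []
    for i in range(out_h):
        # Row range of the i-th sub-window. Using floor for the start and
        # ceil for the end keeps every sub-window non-empty when h >= 1.
        r0 = r + math.floor(i * h / out_h)
        r1 = r + math.ceil((i + 1) * h / out_h)
        row = []
        for j in range(out_w):
            c0 = c + math.floor(j * w / out_w)
            c1 = c + math.ceil((j + 1) * w / out_w)
            row.append(max(feature_map[y][x]
                           for y in range(r0, r1)
                           for x in range(c0, c1)))
        pooled.append(row)
    return pooled

fmap = [[0, 1, 2, 3],
        [4, 5, 6, 7],
        [8, 9, 10, 11],
        [12, 13, 14, 15]]
print(roi_max_pool(fmap, (0, 0, 4, 4)))  # [[5, 7], [13, 15]]
print(roi_max_pool(fmap, (1, 1, 3, 2)))  # [[9, 10], [13, 14]]
```

Note that the two RoIs have different sizes but produce outputs of the same shape, which is exactly what the fully connected layers need.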

2. Initializing from pre-trained networks

When a pre-trained network is used to initialize a Fast R-CNN network, it undergoes three transformations.

First, the last max pooling layer is replaced by an RoI pooling layer, which is configured by setting H and W to be compatible with the first fully connected layer of the network (e.g., H = W = 7 for VGG16).
Second, the last fully connected layer and softmax of the pre-trained network (which were trained for 1000-way ImageNet classification) are replaced with the two sibling layers of the Fast R-CNN network: a fully connected layer and softmax over K + 1 categories, and category-specific bounding-box regressors.
Third, the network is modified to take two input data sources: a list of images and a list of RoIs in those images.

These modifications allow the Fast R-CNN network to be trained for object detection tasks.

3. Fine-tuning for detection

In Fast R-CNN, all network weights can be trained with backpropagation. During training, stochastic gradient descent (SGD) mini-batches are sampled hierarchically: first N images are sampled, and then R/N RoIs are sampled from each image. RoIs from the same image share computation and memory in the forward and backward passes, so making N small reduces mini-batch computation time. For example, with N = 2 and R = 128, the training scheme is roughly 64 times faster than sampling one RoI from each of 128 different images.
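
The 64x figure can be checked with a line of arithmetic. The sketch below counts only full-image convolutional passes, which is the shared part of the computation. The per-RoI layers are not shared, which is why the speedup is only "roughly" 64x. Both function names are invented for this example.

```python
def rois_per_image(R, N):
    """RoIs drawn from each image when R RoIs are spread over N images."""
    if R % N != 0:
        raise ValueError("R must be divisible by N")
    return R // N

def conv_pass_ratio(R, N):
    """How many times more full-image conv passes are needed when each of
    the R RoIs comes from its own image, compared with using N images."""
    return R / N

print(rois_per_image(128, 2))   # 64
print(conv_pass_ratio(128, 2))  # 64.0
```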

This strategy might be expected to slow convergence, because RoIs from the same image are correlated. In practice this does not appear to be a problem: with N = 2 and R = 128, Fast R-CNN achieves good results using fewer SGD iterations than R-CNN. In addition to hierarchical sampling, Fast R-CNN uses a streamlined training process with a single fine-tuning stage that jointly optimizes a softmax classifier and bounding-box regressors, rather than training these components in separate stages as in R-CNN.

The components of this procedure are described below:

Multi-task loss: A Fast R-CNN network has two output layers that are siblings. The first layer outputs a discrete probability distribution (for each region of interest, or RoI) over K+1 categories, where p = (p0, …, pK) and p is calculated using a softmax function applied to the K+1 outputs of a fully connected layer. The second sibling output layer produces bounding box regression offsets for each of the K object classes, indexed by k. These offsets are specified using a scale-invariant translation and log-space height/width shift relative to an object proposal. During training, each RoI is labeled with a ground-truth class (u) and a ground-truth bounding box regression target (v). A multi-task loss (L) is applied to each labeled RoI to jointly train for classification and bounding box regression.
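
Here is a minimal sketch of the multi-task loss for a single RoI. It assumes the form used in the original paper as I understand it: a log loss for classification plus a smooth L1 loss on the box offsets of the true class, with the box term switched off for background RoIs (u = 0) and a balancing weight of 1. The function names and the dictionary layout for the offsets are choices made for this example.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def smooth_l1(x):
    """Quadratic near zero, linear elsewhere (less sensitive to outliers)."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5

def multitask_loss(class_scores, box_offsets, u, v, lam=1.0):
    """Loss for one RoI.

    class_scores: K+1 raw scores (index 0 is background).
    box_offsets:  dict mapping class k (1..K) to its predicted 4-tuple.
    u: ground-truth class; v: ground-truth regression target (4-tuple).
    """
    p = softmax(class_scores)
    loss_cls = -math.log(p[u])
    if u == 0:
        # Background RoIs have no ground-truth box, so no box loss.
        return loss_cls
    loss_loc = sum(smooth_l1(t - tv) for t, tv in zip(box_offsets[u], v))
    return loss_cls + lam * loss_loc

scores = [0.0, 0.0, 0.0]             # K = 2 classes plus background
offsets = {1: (0.5, 0.0, 0.0, 2.0), 2: (0.0, 0.0, 0.0, 0.0)}
print(multitask_loss(scores, offsets, u=1, v=(0.0, 0.0, 0.0, 0.0)))
```

Only the offsets of the true class u contribute to the box term, so the regressors for the other classes receive no gradient from this RoI.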

Mini-batch sampling: During fine-tuning, each SGD mini-batch consists of 2 images randomly chosen from the dataset. The mini-batch size is 128, with 64 RoIs sampled from each image. 25% of the RoIs are taken from object proposals that have at least 0.5 intersection over union (IoU) overlap with a ground-truth bounding box. These RoIs are labeled with a foreground object class. The remaining RoIs are sampled from object proposals whose maximum IoU with ground truth is in the range [0.1, 0.5). These RoIs are labeled with u = 0 and represent background examples. The lower threshold of 0.1 serves as a heuristic for hard example mining. During training, images are horizontally flipped with a probability of 0.5. No other data augmentation is used.
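
The sampling rule for one image can be sketched as follows. The IoU thresholds and the 25% foreground share come from the description above. The box format (x1, y1, x2, y2), the function names, and what happens when an image has too few foreground or background proposals are assumptions made for this example. Horizontal flipping is left out.

```python
import random

def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def sample_rois(proposals, gt_boxes, gt_labels,
                rois_per_image=64, fg_fraction=0.25, rng=random):
    """Return a list of (box, label) pairs for one image.

    Assumes the image has at least one ground-truth box.
    """
    fg, bg = [], []
    for box in proposals:
        overlaps = [iou(box, g) for g in gt_boxes]
        best = max(range(len(gt_boxes)), key=overlaps.__getitem__)
        if overlaps[best] >= 0.5:
            fg.append((box, gt_labels[best]))   # foreground: class of best match
        elif overlaps[best] >= 0.1:
            bg.append((box, 0))                 # background: u = 0
        # Proposals with IoU below 0.1 are never sampled.
    # Assumption: if there are too few candidates, simply take fewer RoIs.
    n_fg = min(int(rois_per_image * fg_fraction), len(fg))
    n_bg = min(rois_per_image - n_fg, len(bg))
    return rng.sample(fg, n_fg) + rng.sample(bg, n_bg)

gt = [(0, 0, 10, 10)]
props = [(0, 0, 10, 10), (0, 0, 10, 5), (0, 0, 10, 2), (20, 20, 30, 30)]
batch = sample_rois(props, gt, gt_labels=[3], rois_per_image=4,
                    rng=random.Random(0))
print(batch)
```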

Backpropagation through RoI pooling layers: Backpropagation routes derivatives through the RoI pooling layer. For simplicity, consider only one image per mini-batch (N = 1); the extension to N > 1 is straightforward because the forward pass treats all images independently. During the forward pass, each pooled output remembers which input it selected as the maximum (its "argmax switch"). The backward function then computes the partial derivative of the loss with respect to each input x_i by summing the gradients of all pooled outputs, across all RoIs, that selected x_i.
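
A minimal sketch of this routing is shown below. To keep it short, the feature map is flattened into a list, and each pooled output is described only by the list of input indices in its sub-window. Both simplifications and both function names are assumptions made for this example.

```python
def pool_forward(x, bins):
    """Max-pool each bin and remember which input won.

    x:    flat list of input activations.
    bins: one list of input indices per pooled output. Bins from
          different RoIs may overlap.
    Returns (outputs, switches), where switches[j] is the index of the
    input selected by output j.
    """
    switches = [max(idx, key=lambda i: x[i]) for idx in bins]
    outputs = [x[i] for i in switches]
    return outputs, switches

def pool_backward(num_inputs, switches, grad_outputs):
    """Route each output gradient back to the input that produced it."""
    grad_x = [0.0] * num_inputs
    for i_star, g in zip(switches, grad_outputs):
        grad_x[i_star] += g   # gradients accumulate when bins share a winner
    return grad_x

x = [1.0, 5.0, 2.0, 4.0]
bins = [[0, 1], [1, 2], [2, 3]]            # overlapping sub-windows
y, switches = pool_forward(x, bins)
print(y, switches)                          # [5.0, 5.0, 4.0] [1, 1, 3]
print(pool_backward(len(x), switches, [0.1, 0.2, 0.3]))
```

Input 1 wins in two bins, so it receives the sum of both gradients, while inputs 0 and 2 never win and receive zero.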

SGD hyper-parameters: The fully connected layers used for softmax classification and bounding-box regression are initialized from zero-mean Gaussian distributions with standard deviations of 0.01 and 0.001, respectively. Biases are initialized to 0. All layers use a per-layer learning rate multiplier of 1 for weights and 2 for biases, with a global learning rate of 0.001. When training on VOC07 or VOC12, SGD is run for 30k mini-batch iterations, then the learning rate is lowered to 0.0001 and training continues for another 10k iterations. Larger datasets are trained for more iterations. A momentum of 0.9 and a parameter decay of 0.0005 (on weights and biases) are used.
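
These settings can be written down as a small step schedule and update rule. The sketch below handles a single scalar parameter. The exact form of the momentum update, and the choice to scale the decay term by the same per-layer multiplier as the gradient, are assumptions made for this example rather than details confirmed by the article.

```python
import random

def learning_rate(iteration, base_lr=0.001, step=30_000, gamma=0.1):
    """Global learning rate: 0.001 for 30k iterations, then 0.0001."""
    return base_lr if iteration < step else base_lr * gamma

def sgd_step(w, grad, velocity, lr, lr_mult=1.0, momentum=0.9, decay=0.0005):
    """One SGD update with momentum and parameter decay (assumed form).

    lr_mult is the per-layer multiplier: 1 for weights, 2 for biases.
    """
    local_lr = lr * lr_mult
    velocity = momentum * velocity - local_lr * (grad + decay * w)
    return w + velocity, velocity

# Initialization as described: small Gaussian weights, zero biases.
rng = random.Random(0)
cls_weight = rng.gauss(0.0, 0.01)     # softmax classification layer
box_weight = rng.gauss(0.0, 0.001)    # bounding-box regression layer
bias = 0.0

w, v = sgd_step(1.0, grad=0.5, velocity=0.0, lr=learning_rate(0))
print(w, v)
```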

4. Scale invariance

The original paper explores two approaches for achieving scale-invariant object detection:

“Brute force” learning: In this approach, each image is processed at a pre-defined pixel size during training and testing. The network must directly learn scale-invariant object detection from the training data.
Use of image pyramids: At test time, the image pyramid is used to approximately scale-normalize each object proposal.
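
For the brute-force approach, the "pre-defined pixel size" usually means resizing each image so that its shorter side has a fixed length, with a cap on the longer side. The sketch below computes that resize factor. The defaults of 600 and 1000 pixels are the single-scale values I recall from the original paper's experiments, so treat them as illustrative, and `resize_scale` is a name invented for this example.

```python
def resize_scale(height, width, target_short=600, max_long=1000):
    """Scale factor that brings the shorter side to target_short pixels,
    reduced if necessary so the longer side stays within max_long."""
    short_side, long_side = min(height, width), max(height, width)
    scale = target_short / short_side
    if scale * long_side > max_long:
        scale = max_long / long_side
    return scale

print(resize_scale(375, 500))   # 1.6 (shorter side becomes 600)
print(resize_scale(300, 900))   # about 1.11 (longer side capped at 1000)
```

An image pyramid would simply call this with several values of `target_short` and run the network on each resized copy.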

Fast R-CNN detection

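Once a Fast R-CNN network is fine-tuned, object detection amounts to little more than running a forward pass, assuming the object proposals have been computed in advance. The network takes an image and a list of R object proposals as input. For each proposal, it outputs a class probability distribution and a set of per-class bounding-box offsets. Non-maximum suppression is then applied independently for each class to remove duplicate detections.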
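
In the original paper, scoring is followed by non-maximum suppression (NMS), applied separately to the detections of each class. Here is a minimal sketch of greedy NMS for one class. The box format (x1, y1, x2, y2) and the IoU threshold of 0.3 are assumptions chosen for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.3):
    """Greedy NMS: keep the highest-scoring box, drop boxes that overlap
    an already kept box by more than iou_threshold, and repeat.
    Returns the indices of the kept boxes, best first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))   # [0, 2]
```

The second box overlaps the first heavily and has a lower score, so it is suppressed. The third box is far away and survives.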

Truncated SVD for faster detection

For whole-image classification, the time spent computing the fully connected layers is relatively small compared to the convolutional layers. However, in object detection, the number of RoIs to process is large, and nearly half of the forward pass time is spent computing the fully connected layers. Large fully connected layers can be accelerated by compressing them using truncated SVD.
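
The saving is easy to quantify. Truncated SVD approximates a u × v weight matrix using only its top t singular values, which replaces one fully connected layer with two smaller ones and cuts the parameter count from u·v to t·(u + v). The sketch below only does this arithmetic and does not compute an actual SVD. The layer sizes are those of VGG16's fc6 and fc7, and the values of t are the ones I recall from the original paper, so treat them as illustrative.

```python
def full_params(u, v):
    """Weights in a single u x v fully connected layer."""
    return u * v

def truncated_params(u, v, t):
    """Weights after factoring the layer into a t x v layer followed by
    a u x t layer (rank-t approximation)."""
    return t * (u + v)

layers = {
    "fc6": (4096, 25088, 1024),   # (outputs, inputs, singular values kept)
    "fc7": (4096, 4096, 256),
}
for name, (u, v, t) in layers.items():
    before, after = full_params(u, v), truncated_params(u, v, t)
    print(f"{name}: {before:,} -> {after:,} weights "
          f"({before / after:.1f}x fewer)")
```

Because the fully connected layers run once per RoI, shrinking them pays off far more in detection than it would in whole-image classification.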

Advantages of Fast R-CNN

The Fast R-CNN method has several significant advantages:

It achieves higher detection quality (measured in mean average precision, or mAP) than R-CNN and SPPnet. Fast R-CNN can train a very deep detection network (VGG16) 9x faster than R-CNN and 3x faster than SPPnet. At runtime, the detection network processes images in 0.3s (excluding object proposal time) while achieving top accuracy on PASCAL VOC 2012 with an mAP of 66% (compared to 62% for R-CNN).
Training is single-stage, using a multi-task loss. The network accepts an image and its region proposals as input and returns the class probabilities and bounding boxes of detected objects, so classification and box regression are optimized jointly rather than in the separate stages used by R-CNN. The feature map from the last convolutional layer is fed to an RoI pooling layer to extract a fixed-length feature vector from each region proposal.
Training can update all network layers. This not only simplifies the training process, but also improves performance as the tasks influence each other during training, resulting in a network with improved shared representative power.
No disk storage is required for feature caching. Fast R-CNN does not cache the extracted features, so it does not require as much disk storage as R-CNN, which can need hundreds of gigabytes.

Issues with Fast R-CNN

There are some issues with the Fast R-CNN method:

At test time, most of the running time is spent generating region proposals rather than running the detection network itself, which makes proposal generation the bottleneck.
Fast R-CNN relies on selective search to find regions of interest, which is a slow process. Once proposal time is included, detection takes around 2 seconds per image. This is an improvement over R-CNN, but on large real-life datasets Fast R-CNN may still not seem fast.

Applications of Fast R-CNN

Fast R-CNN object detection has various applications, including:

Autonomous driving: Autonomous driving systems use object detection to avoid accidents on roads.
Smart surveillance systems: Object detection can be used in security systems to restrict access to certain areas.
Wildlife monitoring: Object detection can be used in wildlife conservation to detect different species of animals and track their migration.

Final Thoughts

Fast R-CNN is a clean and fast update to R-CNN and SPPnet. Beyond its strong detection results, the original paper reports detailed experiments that may offer new insights. Of particular interest is the finding that sparse object proposals appear to improve detector quality. This question was previously too time-consuming to explore, but it becomes practical with Fast R-CNN.

It is possible that undiscovered techniques may allow dense boxes to perform as well as sparse proposals, potentially further accelerating object detection.

Hope you liked the article. Happy coding!
