Small Logo Detection Using 2D Object Detection Methods
This blog post is derived from work I did as part of a deep learning class at UJM.
Abstract
Object detection, one of the most fundamental and challenging problems in computer vision, seeks to locate object instances from many predefined categories in natural images. Deep learning techniques have emerged as a powerful strategy for learning feature representations directly from data and have led to remarkable breakthroughs in object detection. Logo detection has received growing attention in recent years because of its wide range of applications. However, robust and accurate logo detection remains challenging in real-world scenarios, because logo images have distinctive properties that pose significant obstacles to current methods. The goal of this paper is to compare two popular deep-learning object detection methods on images containing logos of different brands, with a specific focus on tiny logo detection.
Our paper begins with an introduction to object detection backbones and baselines; we then review the metrics commonly used for comparison. We describe our experiments and results on a dataset of logos and draw conclusions from them. Finally, several promising directions and tasks are provided as guidelines for future work.
Introduction
Object detection deals with detecting instances of semantic objects of a certain class (such as humans, buildings, or cars) in digital images and videos. It has been attracting increasing attention in recent years due to its wide range of applications, such as security monitoring, autonomous driving, transportation surveillance, drone scene analysis, and robotic vision. The main task of logo detection is to determine the location of a specific logo in images or videos and to identify it. [1] reviews object detection methods. Although it may be regarded as a particular object detection task, logo detection in real-world images can be quite challenging, since numerous brands appear in highly diverse contexts, with varied scales, changes in illumination, size, and resolution, and even nonrigid deformation. [2] reviews advances in applying deep learning techniques to logo detection. Unlike generic objects, logos tend to be small, making them difficult to distinguish from their context, especially against complex backgrounds. Early-layer feature maps have small receptive fields but lack the high-level semantic information critical for logo detection; in contrast, high-level feature maps help identify large logos but may miss smaller ones. Regarding the size problem, tiny logos contain insufficient information, making it more challenging to extract discriminative features. Meanwhile, tiny logo classes have relatively few samples and are thus easily disturbed by environmental factors. Numerous real-world tiny logos appear in highly diverse contexts, which can cause the same logo to look very different across real scenes. According to [2], the performance of existing detection methods on tiny logos is unsatisfactory in real scenarios.
This paper focuses on describing and analyzing deep-learning-based object detection on a small logo detection task. In addition, we introduce our annotated small-scale dataset, FlickrLogoSAN-15.
This paper is organized as follows. Section 1.1 describes the backbone networks, Section 1.2 describes typical baseline object detectors and further explains two popular methods. Section 1.3 introduces the available small-scale logo datasets and our new FlickrLogoSAN-15 annotated dataset. Section 1.4 describes commonly used metrics. Section 2 explains our experiment and the analysis of the object detection methods we chose. Section 3 details our conclusions and reflections on future work.
Backbone Networks
The backbone network acts as the basic feature extractor for the object detection task: it takes an image as input and outputs feature maps of the corresponding input image. Most backbone networks for detection are classification networks with the last fully connected layers removed. Depending on the trade-off between accuracy and efficiency, one can choose deeper and more densely connected backbones, such as ResNet [3]. ResNet, or Residual Network, provides a way to add more convolutional layers to a CNN without running into the vanishing gradient problem, using the concept of shortcut connections. A shortcut connection "skips over" some layers, converting a regular network into a residual network. ResNet has many variants built on the same concept but with different numbers of layers; ResNet50 denotes the variant with 50 layers. New high-performance classification networks can improve the precision and reduce the complexity of object detection tasks. This is an effective way to further improve network performance, because the backbone acts as the feature extractor, and the quality of its features bounds the performance of the whole network from above.
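As a minimal sketch of the shortcut idea, the following Keras block (assuming TensorFlow 2, which we use in our experiments) wraps a shortcut connection around two convolutions. The layer choices are illustrative, not the exact ResNet50 bottleneck design:

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters, stride=1):
    """A basic residual block: two 3x3 convolutions plus a shortcut
    connection that "skips over" them."""
    shortcut = x
    y = layers.Conv2D(filters, 3, strides=stride, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    # Project the shortcut with a 1x1 convolution when shapes differ.
    if stride != 1 or shortcut.shape[-1] != filters:
        shortcut = layers.Conv2D(filters, 1, strides=stride)(shortcut)
    # The addition is the shortcut connection: gradients can flow
    # through it unchanged, which mitigates vanishing gradients.
    return layers.ReLU()(layers.Add()([y, shortcut]))
```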
Typical baselines
With the development of deep learning and the continuous improvement of computing power, great progress has been made in the field of general object detection.
Two kinds of object detectors
Existing domain-specific image object detectors can usually be divided into two categories: one-stage detectors and two-stage detectors. In a two-stage detector, the first stage, a region proposal network, proposes candidate object bounding boxes; in the second stage, features are extracted by RoI pooling. Two-stage detectors achieve high localization and object recognition accuracy, whereas one-stage detectors achieve high inference speed.
We introduce the representative object detection architectures used in our case study: SSD as a representative one-stage detector and Faster R-CNN as a two-stage detector.
1) SSD: SSD [4] is a single-shot detector for multiple categories that, in a single stage, directly predicts category scores and box offsets for a fixed set of default bounding boxes of different scales at each location of several feature maps with different resolutions. SSD discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location. At prediction time, the network generates scores for the presence of each object category in each default box and produces adjustments to the box to better match the object shape. Additionally, the network combines predictions from multiple feature maps with different resolutions to naturally handle objects of various sizes. SSD is simple relative to methods that require object proposals because it completely eliminates proposal generation and the subsequent pixel or feature resampling stages, encapsulating all computation in a single network. An illustration of the SSD model can be found in [4].
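The scale rule behind these default boxes can be sketched as follows. The values s_min = 0.2 and s_max = 0.9 are the defaults from the SSD paper; the paper's extra aspect-ratio-1 box with scale sqrt(s_k * s_{k+1}) is omitted here for brevity:

```python
import math

def ssd_default_box_sizes(num_feature_maps=6, s_min=0.2, s_max=0.9,
                          aspect_ratios=(1.0, 2.0, 0.5)):
    """Default box sizes per feature map, following the linear rule
    s_k = s_min + (s_max - s_min) * (k - 1) / (m - 1) from the SSD paper."""
    m = num_feature_maps
    boxes = []
    for k in range(1, m + 1):
        s_k = s_min + (s_max - s_min) * (k - 1) / (m - 1)
        # Each scale is combined with several aspect ratios a_r:
        # width = s_k * sqrt(a_r), height = s_k / sqrt(a_r).
        boxes.append([(s_k * math.sqrt(ar), s_k / math.sqrt(ar))
                      for ar in aspect_ratios])
    return boxes  # one list of (width, height) pairs per feature map
```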
2) Faster R-CNN: Three months after Fast R-CNN was proposed, Faster R-CNN [5] further improved the region-based CNN baseline. Fast R-CNN uses selective search to propose RoIs, which is slow and takes as long to run as the detection network itself. Faster R-CNN replaces it with a novel region proposal network (RPN). The RPN shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals. An RPN is a fully convolutional network that simultaneously predicts object bounds and objectness scores at each position; it is trained end-to-end to generate high-quality region proposals, which are then used by Fast R-CNN for detection. Faster R-CNN merges the RPN and Fast R-CNN into a single network by sharing their convolutional features: in the terminology of neural networks with "attention" mechanisms, the RPN component tells the unified network where to look. The RPN is thus a fully convolutional network that efficiently predicts region proposals with a wide range of scales and aspect ratios, and it accelerates proposal generation because it shares full-image convolutional features and a common set of convolutional layers with the detection network.
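As an illustrative sketch (not the authors' implementation), the anchor enumeration underlying the RPN can be written as follows, using the 3 scales and 3 aspect ratios from the paper; the aspect ratio is taken here as width/height:

```python
import numpy as np

def rpn_anchors(feat_h, feat_w, stride=16,
                scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Enumerate RPN anchors: at every feature-map position, one box
    per (scale, ratio) pair, centered on the corresponding image
    location. Each anchor has area scale**2 and aspect ratio w/h."""
    anchors = []
    for y in range(feat_h):
        for x in range(feat_w):
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride
            for s in scales:
                for r in ratios:
                    w, h = s * np.sqrt(r), s / np.sqrt(r)
                    anchors.append([cx - w / 2, cy - h / 2,
                                    cx + w / 2, cy + h / 2])
    return np.array(anchors)  # shape: (feat_h * feat_w * 9, 4)
```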
Datasets
The small-scale datasets include BelgaLogos [6] and FlickrLogos-32 [7], among others. BelgaLogos [6] consists of 2,695 logo instances labeled with bounding boxes. FlickrLogos-32 [7], one of the most popular small-scale datasets for logo detection, comprises 32 classes with 70 images each. Its images are mainly captured in the real world, and many contain occlusions, appearance changes, and lighting changes, making detection on this dataset very challenging. We introduce FlickrLogoSAN-15, a dataset we created from FlickrLogos-32's [7] un-annotated images. The images were resized to 800x600 pixels and then annotated manually using the labelImg tool. The dataset includes 15 of the 32 classes of FlickrLogos-32; each class has 40 training images and 10 test images. Overall, FlickrLogoSAN-15 contains 750 annotated images.
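Since labelImg saves annotations in Pascal VOC XML by default (an assumption about our export settings), a FlickrLogoSAN-15 annotation file can be read back with a few lines of standard-library Python:

```python
import xml.etree.ElementTree as ET

def read_labelimg_voc(xml_path):
    """Parse one labelImg annotation file (Pascal VOC XML) into a list
    of (class_name, xmin, ymin, xmax, ymax) tuples."""
    root = ET.parse(xml_path).getroot()
    boxes = []
    for obj in root.iter("object"):
        name = obj.find("name").text
        bb = obj.find("bndbox")
        boxes.append((name,
                      int(bb.find("xmin").text), int(bb.find("ymin").text),
                      int(bb.find("xmax").text), int(bb.find("ymax").text)))
    return boxes
```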
Metrics
IoU (Intersection over Union) is an evaluation metric used to measure the accuracy of an object detector. We divide the area of overlap between the predicted bounding box and the ground-truth bounding box by the area of their union, i.e., the total area covered by both boxes. Predicted bounding boxes that heavily overlap with the ground-truth bounding boxes score higher than those with less overlap, which makes Intersection over Union an excellent metric for evaluating custom object detectors.
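A direct translation of this definition into Python, for boxes given as (xmin, ymin, xmax, ymax), could look like this:

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes,
    each given as (xmin, ymin, xmax, ymax)."""
    # Corners of the intersection rectangle.
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```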
Precision is defined as the number of true positives divided by the number of true positives plus false positives. Recall is defined as the number of true positives divided by the number of true positives plus false negatives. The relationship between recall and precision can be observed in the stairstep shape of the precision-recall curve: at the edges of these steps, a small change in the confidence threshold considerably reduces precision with only a minor gain in recall. Average precision (AP) summarizes this curve as the weighted mean of the precisions achieved at each threshold, with the increase in recall from the previous threshold used as the weight.
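A minimal sketch of this weighted-mean definition, assuming the recall values are sorted in increasing order and paired with their precisions:

```python
import numpy as np

def average_precision(recalls, precisions):
    """AP = sum over thresholds of (R_n - R_{n-1}) * P_n, i.e. the mean
    of the precisions weighted by the increase in recall."""
    recalls = np.concatenate(([0.0], recalls))  # recall starts at 0
    return float(np.sum((recalls[1:] - recalls[:-1]) * precisions))
```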
mAP (mean average precision) is the average of AP. In some contexts, we compute the AP for each class and average them. Under the COCO protocol, however, AP is additionally averaged over 10 IoU levels on 80 categories (AP@[.50:.05:.95], i.e., from 0.5 to 0.95 with a step size of 0.05).
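Sketched in code, with `ap_fn` standing in as a hypothetical helper that returns the AP of one class at one IoU threshold:

```python
import numpy as np

# COCO-style thresholds: 0.50, 0.55, ..., 0.95 (ten levels).
IOU_THRESHOLDS = np.arange(0.50, 1.00, 0.05)

def coco_map(ap_fn, class_ids):
    """Average per-class AP over all classes and IoU thresholds."""
    aps = [ap_fn(c, t) for c in class_ids for t in IOU_THRESHOLDS]
    return float(np.mean(aps))
```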
Case Study
Experimental Setup
Our experiments are conducted on our logo dataset as described in the Datasets section above. We used the TensorFlow framework and followed the TensorFlow 2 Object Detection API tutorial. The backbone network is ResNet, specifically ResNet50, ResNet101, and ResNet152, with input sizes of 640x640 and 1024x1024. The object detectors are SSD and Faster R-CNN. The models were taken from the TensorFlow 2 Detection Model Zoo.
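For reference, inference with a Model Zoo detector follows the pattern below, as in the TensorFlow 2 Object Detection API tutorial; the SavedModel path and the all-zeros image are placeholders for a real exported model and test image:

```python
import tensorflow as tf

# Placeholder path: a detector exported with the Object Detection API
# (or downloaded from the TF2 Detection Model Zoo).
detect_fn = tf.saved_model.load("exported_model/saved_model")

# Model Zoo detectors expect a uint8 batch of shape [1, H, W, 3].
image = tf.zeros([1, 640, 640, 3], dtype=tf.uint8)  # stand-in for a photo
detections = detect_fn(image)

boxes = detections["detection_boxes"][0].numpy()    # normalized [ymin, xmin, ymax, xmax]
scores = detections["detection_scores"][0].numpy()
classes = detections["detection_classes"][0].numpy().astype(int)
keep = scores > 0.5  # simple confidence threshold
print(list(zip(classes[keep], scores[keep])))
```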
Retrospect and Prospect
Concluding Remarks
This research presents a comparison of 2D object detection methods in the context of logo detection. To identify and classify small logos, we propose training the network on an annotated dataset, training with better computational efficiency so that a higher batch size becomes affordable at a lower training time cost, and, finally, fine-tuning and optimizing the network to tackle overfitting.
Reflections on Future Work
Looking at what remains to be done, we identify several avenues for future work. Logo detection is still challenging in real-world scenarios. Our comparison can be extended to other backbone networks, for example MobileNetV2. In [2], several future research directions in logo detection are described: weakly supervised logo detection, video logo detection, tiny logo detection, long-tail logo detection, and incremental logo detection. Building on our dataset and that survey, we propose two further directions: continuing to train the models at larger batch sizes, which should yield better results and a finer analysis of the small 2D logo identification problem, and training the models with a larger supervised annotated training set so that they can extract more features from low-resolution images and achieve a better mAP, which the current models fail to do.
Acknowledgements
This work was done as part of the Deep Learning for Computer Vision class taught by Prof. Damien Muselet (damien.muselet@univ-st-etienne.fr). The class was taken as part of the IMLEX program supported by the EU and Japan. The text was written by me; the experiments were conducted by Hoang Nam Nguyen (hoang.nam.nguyen@etu.univ-st-etienne.fr) and Aryaman Sharma (aryaman.sharma@etu.univ-st-etienne.fr).
References:
- [1] Jiao, L., et al. (2019). A Survey of Deep Learning-Based Object Detection. IEEE Access.
- [2] Hou, S., et al. (2022). Deep Learning for Logo Detection: A Survey. arXiv preprint arXiv:2204.11918.
- [3] He, K., et al. (2016). Deep Residual Learning for Image Recognition. CVPR.
- [4] Liu, W., et al. (2016). SSD: Single Shot MultiBox Detector. ECCV.
- [5] Ren, S., et al. (2015). Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. NIPS.
- [6] Joly, A., & Buisson, O. (2009). Logo Retrieval with A Contrario Visual Query Expansion. ACM MM.
- [7] Romberg, S., et al. (2011). Scalable Logo Recognition in Real-World Images. ACM ICMR.