Towards Generalizable Object Detection

Object detection is a core computer vision problem that facilitates many image understanding tasks. It has witnessed significant progress in the past decade, especially since deep learning was successfully applied to the field. Most detection models focus on generalizing the trained model to unseen samples of seen classes. However, we argue that generalizing the detection model to unseen classes is also important in applications. In many emerging applications, such as robotic exploration and autonomous driving, detection models are usually deployed on embedded systems. When such a system is exposed to a new environment where it is desirable to detect new, unseen classes, the embedded system does not support extra training. Therefore, it is desirable that trained detectors generalize to unseen classes without extra training. This dissertation focuses on the object detector's generalization in two aspects: 1) generalization to unseen classes, and 2) generalization to unseen samples of seen classes. To generalize the detector to unseen classes, we present a generalized framework that models joint visual and semantic relationships between classes, so that a detector trained on seen-class data can detect unseen classes. We then study the challenges that hinder the detection model from generalizing to unseen samples of seen classes and present solutions to them.

In the proposed framework, unlike most existing models that consider each class independently, the relationships between classes are modeled explicitly: each class is represented by a fixed-length vector, called a prototype, and different classes may partially share their prototypes. Different classes can thus help each other by learning the shared parts of the prototypes together. More importantly, a model trained on seen classes can be generalized to unseen classes whenever the prototypes of the unseen classes can be computed. We first conduct a proof of concept on part attribute recognition. The encouraging results show that the trained model not only improves performance on seen classes through partial sharing between classes but also generalizes well to unseen classes. We then apply this framework to object detection and present a (multi-modal) morphable detector, in which the prototypes are learned by joint visual and semantic embedding in an EM-like approach. The proposed morphable detector not only generalizes to unseen classes without extra training but also outperforms state-of-the-art methods on few-shot detection benchmarks.
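To make the prototype-sharing idea concrete, below is a minimal PyTorch-style sketch, not the dissertation's actual model: the class name SharedPrototypeClassifier, the use of a shared component pool with per-class mixture weights, and the cosine-similarity scoring are all illustrative assumptions. It shows how classes that reuse the same pool entries learn those components jointly, and how an unseen class can be scored without retraining.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedPrototypeClassifier(nn.Module):
    # Illustrative sketch only: each class prototype is assembled from a
    # shared pool of component vectors, so classes that reuse the same
    # pool entries learn those components jointly (hypothetical design,
    # not the dissertation's exact architecture).
    def __init__(self, num_components, dim, class_weights):
        super().__init__()
        # Shared pool of component vectors, learned on seen-class data.
        self.pool = nn.Parameter(torch.randn(num_components, dim))
        # Per-class mixture weights over the pool, shape
        # (num_classes, num_components); assumed here to come from
        # attribute or semantic annotations.
        self.register_buffer("class_weights", class_weights)

    def forward(self, features):
        # Assemble one prototype per class from the shared pool: (C, D).
        prototypes = self.class_weights @ self.pool
        # Score features against class prototypes by cosine similarity.
        f = F.normalize(features, dim=-1)      # (N, D) region features
        p = F.normalize(prototypes, dim=-1)    # (C, D) class prototypes
        return f @ p.t()                       # (N, C) class scores

# Example: 10 seen classes sharing a pool of 32 components in a 256-d space.
head = SharedPrototypeClassifier(32, 256, torch.rand(10, 32))
scores = head(torch.randn(4, 256))             # shape (4, 10)

In this sketch, scoring an unseen class only requires a new row of mixture weights (for instance, computed from a semantic embedding of the class name), with no retraining of the shared pool, mirroring the generalization property described above.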
Next, this dissertation studies several challenges in generalizing the model to unseen samples of seen classes. We first study the competition between classes in joint training and present a modulation module that alleviates this competition, so that the model generalizes better to unseen samples of seen classes. Annotating more data for more classes can help train a detector to detect more classes, but annotation is very expensive. We find that training a detector with multiple existing datasets can help detect more classes without expensive extra annotation. However, different datasets annotate different categories, and training with such partial annotations hinders the model from generalizing to the unseen samples in the test data. We therefore explore a method based on a pseudo-labeling framework, which yields a detector with a unified label space and outperforms alternative baselines.

Finally, overfitting is a notorious issue when training data is limited: it prevents the model from generalizing to unseen samples in the test data. We present a contrastive learning-based approach to overcome the overfitting issue. Experiments show that the learned model has a more discriminative embedding and therefore generalizes better to unseen samples.
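As a sketch of the contrastive idea, the snippet below implements a generic supervised contrastive loss over a batch of embeddings, in the spirit of widely used supervised contrastive formulations; the function name, signature, and temperature value are illustrative assumptions, not the dissertation's exact objective. Embeddings with the same class label are pulled together and others pushed apart, which encourages the more discriminative embedding space noted above.

import torch
import torch.nn.functional as F

def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    # Generic supervised contrastive loss over an (N, D) batch of
    # embeddings with integer class labels of shape (N,). Illustrative
    # only; not the dissertation's exact loss.
    z = F.normalize(embeddings, dim=-1)
    sim = z @ z.t() / temperature                    # (N, N) similarities
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float("-inf"))  # exclude self-pairs
    # Positives: other samples in the batch with the same label.
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    # Average log-probability of each anchor's positives.
    pos_count = pos_mask.sum(dim=1).clamp(min=1)
    loss = -(log_prob.masked_fill(~pos_mask, 0.0).sum(dim=1) / pos_count)
    # Only anchors that actually have positives contribute to the loss.
    return loss[pos_mask.any(dim=1)].mean()

# Example: eight 128-d embeddings drawn from three classes.
emb = torch.randn(8, 128, requires_grad=True)
lbl = torch.tensor([0, 0, 1, 1, 2, 2, 0, 1])
supervised_contrastive_loss(emb, lbl).backward()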
