RoI Feature Propagation for Video Object Detection

How to exploit spatio-temporal information in video to improve object detection precision remains an open problem. In this paper, we boost object detection accuracy in video using both short- and long-term information. Our approach combines a two-stage object detector, which matches and aggregates deep spatial features over short periods of time, with a long-term optimization method that propagates detection scores across long tubes. Short-term spatio-temporal information in neighboring frames is exploited by Region-of-Interest (RoI) temporal pooling, which operates on spatial features linked through tubelets initialized from anchor cuboids. On top of the convolutional network, a double head processes both temporal and current-frame information to produce the final classification and bounding box regression. Finally, long-term information is exploited by linking detections over the whole video from single detections and short-term tubelets. Our system achieves competitive results on the ImageNet VID dataset.
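
To make the long-term step more concrete, the following is a minimal sketch of linking per-frame detections into tubes and propagating scores along them. The abstract does not specify the linking criterion or the score update, so this sketch assumes greedy IoU-based linking between consecutive frames and mean-score rescoring within each tube. All names (`iou`, `link_tubes`, `propagate_scores`, `min_iou`) are hypothetical and chosen for illustration only; this is not the authors' implementation.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)
Detection = Tuple[Box, float]            # (box, score)
Tube = List[Tuple[int, int]]             # [(frame_index, detection_index), ...]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def link_tubes(frames: List[List[Detection]], min_iou: float = 0.5) -> List[Tube]:
    """Greedily link detections in consecutive frames (assumed criterion: IoU)."""
    tubes: List[Tube] = []
    active: List[Tube] = []  # tubes whose last detection is in the previous frame
    for t, dets in enumerate(frames):
        used = set()
        next_active: List[Tube] = []
        for tube in active:
            _, j_prev = tube[-1]
            prev_box = frames[t - 1][j_prev][0]
            best_j, best_iou = -1, min_iou
            for j, (box, _) in enumerate(dets):
                if j in used:
                    continue
                overlap = iou(prev_box, box)
                if overlap >= best_iou:
                    best_j, best_iou = j, overlap
            if best_j >= 0:
                tube.append((t, best_j))
                used.add(best_j)
                next_active.append(tube)
        # Every unmatched detection starts a new tube.
        for j in range(len(dets)):
            if j not in used:
                new_tube: Tube = [(t, j)]
                tubes.append(new_tube)
                next_active.append(new_tube)
        active = next_active
    return tubes


def propagate_scores(frames: List[List[Detection]], tubes: List[Tube]) -> List[List[Detection]]:
    """Replace each detection score with the mean score of its tube (assumed rule)."""
    rescored = [list(dets) for dets in frames]
    for tube in tubes:
        mean = sum(frames[t][j][1] for t, j in tube) / len(tube)
        for t, j in tube:
            rescored[t][j] = (frames[t][j][0], mean)
    return rescored


if __name__ == "__main__":
    video = [
        [((0, 0, 10, 10), 0.9)],
        [((1, 0, 11, 10), 0.3), ((50, 50, 60, 60), 0.2)],
        [((2, 0, 12, 10), 0.9)],
    ]
    tubes = link_tubes(video)
    for frame in propagate_scores(video, tubes):
        print(frame)
```

In this toy input, the weak middle-frame detection of the moving object is supported by its confident neighbors in the same tube, while the isolated detection keeps its low score. Other linking and rescoring rules (for example, dynamic programming over the whole video) would fit the same interface.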