Scene Recognition using Visual Attention, Invariant Local Features and Visual Landmarks
In this paper, we study the scene recognition task using a visual attention model of bottom-up saliency, invariant local features, and visual landmarks. The study is carried out in the context of a robot navigation application, where scene recognition is performed using invariant local features (or, alternatively, visual landmarks) to characterize the different scenarios, and the Nearest Neighbor (NN) rule for classification. Experimental work shows that significant reductions in the number of prototypes used by the NN classifier can be achieved with saliency maps from the bottom-up saliency model, thus accelerating the overall process. We compare SIFT and SURF as invariant local features and conclude from the experiments that SIFT features outperform SURF features, achieving better recognition performance and greater distinctiveness within the database of prototypes. We also present a novel approach to extracting visual landmarks that uses the bottom-up saliency model to localize interest points, and color centiles together with local binary pattern histograms to describe them locally. In the experiments, this latter approach outperforms SIFT features: it achieves similar recognition results while reducing the size of the prototype database by a further two orders of magnitude, thus providing greater savings in computational cost.
Keywords: Scene Recognition, Visual Attention, Bottom-up Saliency, Invariant Local Features, Visual Landmarks
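
As an illustrative sketch of the pipeline described above (not the authors' implementation), the following Python fragment combines a bottom-up saliency map with SIFT features and the 1-NN rule. It assumes opencv-contrib-python (for the spectral-residual saliency operator, used here merely as a stand-in for the paper's bottom-up saliency model) and scikit-learn; the function names, the 0.5 saliency threshold, and the majority-vote scheme are our assumptions, not details taken from the paper.

    # Sketch: saliency-filtered SIFT features classified with the 1-NN rule.
    # Assumptions: spectral-residual saliency stands in for the paper's
    # bottom-up model; threshold and voting scheme are illustrative choices.
    import cv2
    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def salient_sift_descriptors(image_bgr, saliency_threshold=0.5):
        """Detect SIFT features and keep only those on salient pixels."""
        gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
        saliency = cv2.saliency.StaticSaliencySpectralResidual_create()
        ok, sal_map = saliency.computeSaliency(gray)  # values in [0, 1]
        sift = cv2.SIFT_create()
        keypoints, descriptors = sift.detectAndCompute(gray, None)
        if descriptors is None:
            return np.empty((0, 128), dtype=np.float32)
        keep = [i for i, kp in enumerate(keypoints)
                if sal_map[int(kp.pt[1]), int(kp.pt[0])] >= saliency_threshold]
        return descriptors[keep]

    def classify_scene(query_img, prototype_descs, prototype_labels):
        """1-NN rule: each query descriptor votes for the label of its
        nearest prototype descriptor; the majority vote decides the scene."""
        nn = NearestNeighbors(n_neighbors=1).fit(prototype_descs)
        q = salient_sift_descriptors(query_img)
        if len(q) == 0:
            return None
        _, idx = nn.kneighbors(q)
        votes = np.asarray(prototype_labels)[idx.ravel()]
        labels, counts = np.unique(votes, return_counts=True)
        return labels[np.argmax(counts)]

In this sketch, the saliency filter is what shrinks the prototype database: only descriptors located at salient points are stored and matched, which is the source of the computational savings the abstract refers to.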