Dynamic Whitening Saliency
General dynamic scenes involve multiple rigid and flexible objects, with relative and common motion, camera induced or not. The complexity of the motion events, together with their strong spatio-temporal correlations, makes the estimation of dynamic visual saliency a demanding computational challenge. In this work, we propose a computational model of saliency based on the assumption that perceptually relevant information is carried by high-order statistical structure. Through whitening, we completely remove the second-order information (correlations and variances) of the data, thereby exposing the high-order structure that carries the relevant information. The proposed approach is an analytically tractable and computationally simple framework that we call Dynamic Adaptive Whitening Saliency (AWS-D). For model assessment, the saliency maps produced by the model were used to predict the fixations of human observers over six public video datasets, and also to reproduce human behavior in certain psychophysical experiments (dynamic pop-out). The results demonstrate that AWS-D outperforms state-of-the-art dynamic saliency models, and suggest that the model may provide a basis for understanding key mechanisms of visual saliency. Experimental evaluation was performed using an extension to video of the well-known methodology for static images, together with a bootstrap permutation test (random label hypothesis), which yields additional information about the temporal evolution of the statistical significance of the metrics.
keywords: Spatio-temporal saliency, visual attention, adaptive whitening, short-term adaptation, eye fixations
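For reference, a generic whitening transform of the kind alluded to above can be sketched in standard notation (the symbols $x_i$, $\mu$, $\Sigma$, $U$, $\Lambda$ are introduced here only for illustration; the specific decomposition employed by AWS-D may differ):

$$
\Sigma \;=\; \frac{1}{n}\sum_{i=1}^{n}(x_i-\mu)(x_i-\mu)^{\mathsf T} \;=\; U\,\Lambda\,U^{\mathsf T},
\qquad
z_i \;=\; \Lambda^{-1/2}\,U^{\mathsf T}\,(x_i-\mu),
$$

so that the whitened samples $z_i$ have identity covariance: all correlations and variances (the second-order statistics) are removed, and any structure remaining in $\{z_i\}$ is of higher order.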