dynamic scene

Understanding dynamic scenes by unraveling the spatio-temporal visual patterns of the phenomena under consideration is, together with object recognition, one of the fundamentals of computer vision. To provide a realistic interpretation of the processed scenes, this activity can require approaches that incorporate domain knowledge, drawn from physics or psychology depending on the scenario. In streaming data captured by one or more cameras, the event is the basic unit of information. The detection and classification of events is useful for generating video descriptions and for video retrieval, as well as for activity recognition in video-surveilled environments, for example to detect anomalies or specific actions. Recent techniques are based on machine learning and deep learning with structured output prediction.

Computer-vision-based systems for monitoring and understanding complex environments through video are used for safety in industrial environments and for outdoor and indoor surveillance, whether for security or for data collection. In assisted-living applications, customer behavior analysis, or sports description, the main goals are to recognize people's actions and their interactions with the surrounding environment. In traffic analysis the goals are to recognize pedestrians and vehicles along with their paths, in order to detect passages and anomalous events. Other application domains include augmented reality, where virtual objects must be embedded properly and dynamically in the real scene.


The problem of action recognition in egocentric video sequences is among the most challenging in video analysis. It requires the detailed recognition of objects and of their interaction with the user. Our approach, LSTA, combines the ability to focus on the most significant spatial features with attention mechanisms operating along the temporal dimension. LSTA, implemented as an end-to-end architecture, has proven effective in the recognition of egocentric activities in benchmark challenges.
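As an illustration of the underlying idea, the following sketch shows spatially attentive pooling of per-frame CNN features followed by a simple recurrent-style temporal aggregation. It is a conceptual toy in NumPy (all names, dimensions, the fixed query and the running-average memory are invented for illustration), not the actual LSTA implementation:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attentive_pool(feature_map, query):
    """Pool a C x H x W feature map into a C-vector, weighting each
    spatial location by its similarity to a query vector."""
    C, H, W = feature_map.shape
    flat = feature_map.reshape(C, H * W)   # C x HW
    scores = query @ flat                  # one score per spatial location
    weights = softmax(scores)              # attention map over HW locations
    return flat @ weights                  # attention-weighted pooling

rng = np.random.default_rng(0)
video = rng.normal(size=(5, 64, 7, 7))     # 5 frames of CNN features
query = rng.normal(size=64)                # stand-in for a learned query

# Temporal aggregation: a running average of per-frame attended features
# stands in for the recurrent (LSTM-like) memory used in LSTA.
memory = np.zeros(64)
for t, frame in enumerate(video, start=1):
    memory += (attentive_pool(frame, query) - memory) / t
```

In LSTA proper, the attention map and the temporal memory are learned jointly within a recurrent unit; here a fixed query and a running mean merely stand in for those learned components.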

EPIC-Kitchens is a large dataset of first-person vision recordings captured in native environments. Given a trimmed action segment, the challenge is to classify the segment into its action class, composed of a verb and a noun class (e.g. wash cup, cut tomato, dry hand...). Swathikiran Sudhakaran, Oswald Lanz (Fondazione Bruno Kessler, Trento) and Sergio Escalera (Uni Barcelona) ranked 3rd in the EPIC-Kitchens 2019 Action Recognition Challenge at CVPR with LSTA: Long Short-Term Attention for Egocentric Action Recognition. They placed third again in the 2020 edition of the challenge.


  • S. Sudhakaran, S. Escalera and O. Lanz. Gate-Shift Networks for Video Action Recognition, IEEE Conference on Computer Vision and Pattern Recognition - CVPR, 2020

  • S. Sudhakaran, S. Escalera and O. Lanz. LSTA: Long Short-Term Attention for Egocentric Action Recognition, IEEE Conference on Computer Vision and Pattern Recognition - CVPR, 2019


Novel-View Human Action Synthesis aims to synthesize the appearance of a dynamic scene from a virtual viewpoint, given a video from a real viewpoint. Our approach uses novel 3D reasoning to synthesize the target viewpoint. We first estimate the 3D mesh of the target object, a human actor, and transfer the rough textures from the 2D images to the mesh. This transfer may produce sparse textures on the mesh because of limited frame resolution or occlusions. To solve this problem, we produce a semi-dense textured mesh by propagating the transferred textures both locally, within local geodesic neighborhoods, and globally, across symmetric semantic parts. Next, we introduce a context-based generator that learns how to correct and complete the residual appearance information. This design allows the network to focus independently on the foreground and background synthesis tasks.
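The local propagation step can be illustrated on a toy mesh graph, where untextured vertices repeatedly inherit the average color of their already textured neighbors. This is a simplified sketch (the adjacency and colors are invented, and true geodesic neighborhoods and symmetric-part matching are omitted), not the method of the paper:

```python
import numpy as np

def propagate_textures(adjacency, colors, known, iters=10):
    """Fill untextured vertices by averaging the colors of already
    textured neighbors, mimicking local propagation on a mesh."""
    colors = colors.copy()
    known = known.copy()
    for _ in range(iters):
        new_known = known.copy()
        for v, neigh in enumerate(adjacency):
            if known[v]:
                continue
            src = [u for u in neigh if known[u]]
            if src:
                colors[v] = np.mean([colors[u] for u in src], axis=0)
                new_known[v] = True
        known = new_known
        if known.all():
            break
    return colors, known

# Toy chain of 4 vertices; only the endpoints carry transferred texture.
adjacency = [[1], [0, 2], [1, 3], [2]]
colors = np.zeros((4, 3))
colors[0] = [1.0, 0.0, 0.0]   # red endpoint
colors[3] = [0.0, 0.0, 1.0]   # blue endpoint
known = np.array([True, False, False, True])
filled, mask = propagate_textures(adjacency, colors, known)
```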

related publication

M.I. Lakhal, D. Boscaini, F. Poiesi, O. Lanz and A. Cavallaro. Novel-View Human Action Synthesis, 15th Asian Conference on Computer Vision - ACCV, 2020


Along with the deployment of video surveillance in public spaces, there is an increasing demand for automatic analysis tools able to extract typical and anomalous patterns in complex scenes. In crowded scenes object tracking is not effective; approaches that exploit low-level features are therefore used to extract typical and anomalous activities.

We study statistical methods for analyzing complex visual scenes involving people and their activities, in order to identify anomalous patterns in video-monitored environments. The approach is based on stacked denoising auto-encoders fusing multiple modalities (motion, appearance).
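The principle can be sketched with a tiny denoising auto-encoder trained on synthetic "normal" features, where the reconstruction error serves as the anomaly score. This NumPy toy (data, dimensions and hyper-parameters are invented) illustrates the idea, not the actual stacked multimodal architecture:

```python
import numpy as np

rng = np.random.default_rng(1)

# "Normal" activity: 8-D feature vectors lying near a 1-D subspace,
# a stand-in for joint motion/appearance descriptors.
direction = rng.normal(size=8)
direction /= np.linalg.norm(direction)
X = rng.normal(size=(500, 1)) * direction + 0.05 * rng.normal(size=(500, 8))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Tiny denoising auto-encoder: corrupt the input, reconstruct the clean one.
W1 = 0.1 * rng.normal(size=(3, 8)); b1 = np.zeros(3)
W2 = 0.1 * rng.normal(size=(8, 3)); b2 = np.zeros(8)
lr = 0.05
for epoch in range(300):
    noisy = X + 0.1 * rng.normal(size=X.shape)   # denoising corruption
    H = sigmoid(noisy @ W1.T + b1)               # encoder
    Xhat = H @ W2.T + b2                         # linear decoder
    D = 2.0 * (Xhat - X) / len(X)                # dLoss/dXhat (MSE)
    dW2 = D.T @ H; db2 = D.sum(0)
    dH = D @ W2
    dZ = dH * H * (1 - H)                        # sigmoid derivative
    dW1 = dZ.T @ noisy; db1 = dZ.sum(0)
    W2 -= lr * dW2; b2 -= lr * db2
    W1 -= lr * dW1; b1 -= lr * db1

def anomaly_score(x):
    """Reconstruction error of a sample; a high error flags an anomaly."""
    h = sigmoid(W1 @ x + b1)
    return np.sum((W2 @ h + b2 - x) ** 2)

normal_err = np.mean([anomaly_score(x) for x in X[:50]])
odd = 3.0 * rng.normal(size=8)                   # off-manifold sample
```

Samples drawn from the learned "normal" manifold reconstruct well, while an off-manifold sample yields a markedly higher score.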

related publications

D. Xu, E. Ricci, Y. Yan, J. Song, N. Sebe. Learning Deep Representations of Appearance and Motion for Anomalous Event Detection. British Machine Vision Conference - BMVC, 2015

E. Ricci, G. Zen, N. Sebe, S. Messelodi. A Prototype Learning Framework using EMD: Application to Complex Scenes Analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 35, N. 3, pp. 513-526, 2013

E. Ricci, F. Tobia, G. Zen. Learning Pedestrian Trajectories with Kernels. International Conference on Pattern Recognition - ICPR, pp. 149-152, 2010


Non-verbal behavior, including but not limited to gaze, facial expression and body language, is extremely significant in human interaction. In the analysis of group behavior, key variables include interpersonal proximity and the focus of attention, i.e. the object or person one is attending to. We apply computer vision monitoring in combination with other sensing modalities (e.g. audio) to explore the relationship between proxemics, visual attention, social signals and personality traits during interaction.


related publications

J. Varadarajan, R. Subramanian, S. Rota Bulò, N. Ahuja, O. Lanz and E. Ricci. Joint Estimation of Human Pose and Conversational Groups from Social Scenes. International Journal of Computer Vision, 2017

X. Alameda-Pineda, J. Staiano, R. Subramanian, L. M. Batrinca, E. Ricci, B. Lepri, O. Lanz, N. Sebe. SALSA: A Novel Dataset for Multimodal Group Behavior Analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(8):1707-1720, 2016

Y. Yan, E. Ricci, G. Liu, O. Lanz and N. Sebe. A Multi-task Learning Framework for Head Pose Estimation under Target Motion, IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(6):1070-1083, 2016

E. Ricci, J. Varadarajan, R. Subramanian, S. Rota Bulò, N. Ahuja and O. Lanz. Uncovering Interactions and Interactors: Joint Estimation of Head, Body Orientation and F-formations from Surveillance Videos, International Conference on Computer Vision - ICCV, 2015

X. Alameda-Pineda, Y. Yan, E. Ricci, O. Lanz and N. Sebe. Analyzing Free-standing Conversational Groups: a Multimodal Approach. ACM International Conference on Multimedia - ACMMM, 2015

A.K. Rajagopal, R. Subramanian, E. Ricci, R.L. Vieriu, O. Lanz, R. Kalpathi and N. Sebe. Exploring Transfer Learning Approaches for Head Pose Classification from Multi-view Surveillance Images. International Journal of Computer Vision, 109(1-2):146-167, 2014

R. Subramanian, Y. Yan, J. Staiano, O. Lanz and N. Sebe. On the relationship between head pose, social attention and personality prediction for unstructured and dynamic group interactions. ACM International Conference on Multimodal Interaction - ICMI, 2013

F. Setti, O. Lanz, R. Ferrario, V. Murino and M. Cristani. Multi-scale f-formation discovery for group detection. IEEE International Conference on Image Processing - ICIP, 2013

G. Zen, B. Lepri, E. Ricci and O. Lanz. Space Speaks - Towards Socially and Personality Aware Visual Surveillance. Multimodal Pervasive Video Analysis ACMMM Workshop - MPVA, pp. 37-42, 2010


Between 2000 and 2005 we developed and field-tested traffic analysis tools for monitoring road intersections and queues, in order to compute traffic statistics and detect anomalous paths or situations. The basic tool, called SCOCA, was a real-time vision system that computes traffic parameters by analyzing monocular image sequences from pole-mounted video cameras at urban crossroads, and is able to detect, track and classify individual vehicles for statistical analysis. The system combined segmentation and motion information to localize and track multiple moving objects on the road plane, using robust background updating and a feature-based tracking method. Its strengths are the capability to operate in a wide range of illumination conditions and its configurability for different intersection geometries and camera positions [1].
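The background-updating idea behind such systems can be sketched with a simple running-average model and a difference threshold; this toy example (frame size, threshold and learning rate are invented) illustrates the principle rather than SCOCA's actual robust updating scheme:

```python
import numpy as np

def update_background(bg, frame, alpha=0.05):
    """Running-average background update: slowly absorbs gradual
    illumination changes while resisting transient moving objects."""
    return (1 - alpha) * bg + alpha * frame

def foreground_mask(bg, frame, thresh=30):
    """Pixels that deviate strongly from the background model."""
    return np.abs(frame.astype(float) - bg) > thresh

# Synthetic 20x20 grayscale road scene with a bright 'vehicle' blob.
bg = np.full((20, 20), 100.0)
frame = np.full((20, 20), 100.0)
frame[5:9, 5:9] = 200.0                  # moving object
mask = foreground_mask(bg, frame)        # detects only the blob
bg = update_background(bg, frame)        # background drifts slightly
```

Connected regions of the foreground mask would then be tracked frame-to-frame and classified.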

A second system measured the presence and severity of vehicular queues over long fields of view [3]. By processing the images acquired by two video cameras overlooking a high-traffic two-way road, the system automatically extracts traffic parameters in both directions: traffic volume and speed, vehicle class (three categories), and queue presence with an associated severity index.

Another tool was developed to measure accident risk at intersections by analyzing the characteristics of local traffic [2]. Building on the identification and automatic extraction of vision-based traffic parameters related to dangerous situations, we proposed a risk index that takes into account volume, speed, vehicle classes (including bicycles), their paths and their relative distances.
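The exact index of [2] is not reproduced here; purely as a hypothetical illustration, a risk index combining such parameters might be structured like this (the function, its parameters and weighting are all invented):

```python
def risk_index(volume, mean_speed, bicycle_share, min_gap_m):
    """Hypothetical risk index (illustration only, not the index of [2]):
    risk grows with traffic volume, speed and the share of vulnerable
    road users, and shrinks as the minimum gap between users increases."""
    exposure = volume * mean_speed          # how much conflict opportunity
    vulnerability = 1.0 + bicycle_share     # penalize vulnerable users
    proximity = 1.0 / max(min_gap_m, 1.0)   # closer passes = higher risk
    return exposure * vulnerability * proximity

quiet = risk_index(volume=50, mean_speed=30, bicycle_share=0.05, min_gap_m=10)
busy = risk_index(volume=400, mean_speed=50, bicycle_share=0.2, min_gap_m=2)
```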

related publications

  1. S. Messelodi, C.M. Modena and M. Zanin. A computer vision system for the detection and classification of vehicles at urban road intersections. Pattern Analysis and Applications, Vol. 8, No. 1-2, pp. 17-31, 2005

  2. S. Messelodi, C.M. Modena. A Computer Vision System for Traffic Accident Risk Measurement. A Case Study. Advances in Transportation Studies, Vol. 7, pp. 51-66, 2005

  3. M. Zanin, S. Messelodi, and C. M. Modena. An Efficient Vehicle Queue Detection System Based on Image Processing. 12th International Conference on Image Analysis and Processing - ICIAP, pp. 232-237, 2003