people tracking

People tracking and head pose estimation are basic tasks for live monitoring in smart interactive spaces such as retail stores, warehouses, and industrial sheds. Our activities focus on real-time tracking of multiple people using multiple cameras that observe a space from different viewpoints. The low-level integration of multiple distant views makes it possible to estimate the ground positions of several persons simultaneously with sub-meter precision, even when they are occluded in one or more camera views, as happens when people move in groups. The main challenges are persistent occlusion, adaptation to illumination changes, scalability in the number of people and the size of the monitored area, and optimal camera placement and self-calibration.

Integrated with audio, visual analysis provides detailed reporting for the analysis of social interactions occurring in a closed space.


SmarTrack is a multi-camera multi-person tracking system. It computes the ground location of people using a coarse shape-plus-color signature, and is designed to work effectively in multi-person scenarios where frequent and persistent occlusions occur among the persons. How it works in brief:

  • For detection, a ground occupancy map is generated using motion features extracted from multiple views. The modes of the map represent those ground locations that most likely explain the image motion under a 3D human shape hypothesis. In a verification step, every mode of the occupancy map is tested for consistency between the shape-model projection and the extracted image motion: if confirmed, a new track is instantiated and a color descriptor is extracted from the shape-model projections to form a 3D shape-plus-appearance model of the new target.

  • For tracking, a particle filter updates ground location hypotheses using these 3D shape-plus-appearance models. Appearance dependencies (notably, occlusions) induce a curse of dimensionality that leads to exponential complexity in multi-target tracking. This is overcome here: predictions are updated with a joint occlusion-aware shape-model projection under a quadratic complexity upper bound, leading to a scalable solution to multi-person tracking.
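
To illustrate the tracking step, the following is a minimal sketch of a generic bootstrap particle filter over ground-plane positions for a single target. It is not the SmarTrack implementation: the Gaussian likelihood around a noisy ground observation is an assumed stand-in for the shape-plus-appearance likelihood, the occlusion-aware joint update is omitted, and the names `particle_filter_step`, `estimate`, `motion_std` and `obs_std` are hypothetical.

```python
import math
import random


def particle_filter_step(particles, observation,
                         motion_std=0.1, obs_std=0.3, rng=random):
    """One predict-weight-resample cycle of a bootstrap particle filter.

    particles:   list of (x, y) ground-plane hypotheses, in metres.
    observation: (x, y) noisy ground position (assumed stand-in for the
                 shape-plus-appearance likelihood of the real system).
    """
    # Predict: diffuse every hypothesis with a random-walk motion model.
    predicted = [(x + rng.gauss(0.0, motion_std),
                  y + rng.gauss(0.0, motion_std)) for x, y in particles]

    # Weight: Gaussian likelihood of the observation under each hypothesis.
    ox, oy = observation
    weights = [math.exp(-((x - ox) ** 2 + (y - oy) ** 2) / (2 * obs_std ** 2))
               for x, y in predicted]
    if sum(weights) == 0.0:
        # All weights underflowed: fall back to uniform resampling.
        weights = [1.0] * len(predicted)

    # Resample: draw a new particle set in proportion to the weights.
    return rng.choices(predicted, weights=weights, k=len(predicted))


def estimate(particles):
    """Mean ground position of the particle set."""
    n = len(particles)
    return (sum(x for x, _ in particles) / n,
            sum(y for _, y in particles) / n)
```

Calling `particle_filter_step` once per frame and reading `estimate(particles)` gives a running ground-position estimate. Tracking several people jointly would require a likelihood that accounts for mutual occlusion, which this sketch does not attempt.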

A more detailed project description is available in smartrack_description.pdf.


US7965867 EP1879149 Method and apparatus for tracking a number of objects or object parts in image sequences

EP2302589 US8436913 Method for efficient target detection from images robust to occlusion

key publications

  1. O. Lanz. Approximate Bayesian Multibody Tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(9):1436-1449, 2006

  2. O. Lanz, S. Messelodi. A Sampling Algorithm for Occlusion Robust Multi Target Detection. IEEE International Conference on Advanced Video and Signal Based Surveillance - AVSS, 2009

  3. T. Hu, S. Messelodi and O. Lanz. Dynamic Task Decomposition for Decentralized Object Tracking in Complex Scenes. Computer Vision and Image Understanding, 134:89-104, 2015


Integrated audio-visual monitoring can provide detailed reporting on who is speaking, when, and towards whom or what. This is especially relevant for realizing multi-modal interfaces that operate from a distance, and for the analysis of social interactions occurring in a closed space. Based on spatial reasoning, detected acoustic events are either associated with one or more speaking individuals who are tracked persistently by the cameras, or ignored as background noise. A more precise estimate of the speaker's head orientation is also obtained through early fusion of audio-visual cues.
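
The spatial reasoning described above can be sketched as a simple distance gate between a localized acoustic source and the tracked ground positions. This is only an illustration under stated assumptions: the function name `associate_acoustic_event`, the dictionary of tracks, and the 0.5 m gate are hypothetical, and the actual system may use a different association rule.

```python
import math


def associate_acoustic_event(source_xy, tracks, gate_m=0.5):
    """Associate an acoustic event with nearby tracked persons.

    source_xy: (x, y) estimated ground position of the sound source, in metres.
    tracks:    dict mapping a track id to the (x, y) position of that person.
    gate_m:    assumed maximum distance for a valid association.

    Returns the ids of all persons within the gate, nearest first.
    An empty list means the event is treated as background noise.
    """
    sx, sy = source_xy
    nearby = []
    for track_id, (x, y) in tracks.items():
        distance = math.hypot(x - sx, y - sy)
        if distance <= gate_m:
            nearby.append((distance, track_id))
    return [track_id for _, track_id in sorted(nearby)]
```

For example, with tracks at (1.0, 1.0) and (3.0, 1.0), a source localized at (1.2, 1.1) is attributed to the first person only, while a source at (5.0, 5.0) falls outside every gate and is ignored.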

In collaboration with the speech technology team of FBK, we combine real-time tracking with acoustic source localization techniques into an integrated solution for audio-visual monitoring in smart spaces.

related publications

X. Qian, A. Brutti, O. Lanz, M. Omologo and A. Cavallaro. Audio-visual tracking of concurrent speakers, IEEE Transactions on Multimedia, 2021

X. Qian, A. Brutti, O. Lanz, M. Omologo and A. Cavallaro. Multi-speaker tracking from an audio–visual sensing device, IEEE Transactions on Multimedia, 21(10):2576-2588, 2019 (along with CAV3D dataset)

X. Qian, A. Xompero, A. Cavallaro, A. Brutti, O. Lanz and M. Omologo. 3D mouth tracking from a compact microphone array co-located with a camera, IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP, 2018

A. Brutti and O. Lanz. A joint particle filter to track the position and head orientation of people using audio visual cues. European Signal Processing Conference - EUSIPCO, 2010

A. Brutti and O. Lanz. An Audio-Visual Particle Filter for Monitoring Interactive People Behaviour, Workshop on Pattern Recognition and Artificial Intelligence for Human Behaviour Analysis, 2009

R. Brunelli, A. Brutti, P. Chippendale, O. Lanz, M. Omologo, P. Svaizer and F. Tobia. A Generative Approach to Audio-Visual Person Tracking, International Evaluation Workshop on Classification of Events, Activities and Relationships - CLEAR, 2006