Many tasks, such as object recognition and action recognition, involve understanding the shapes of objects or human silhouettes. Reconstructing the shape of objects from their appearance in images has a long history: "shape from X" techniques, where X is texture, shading, stereo or motion, have been proposed since the early 1970s. These algorithms work well in certain situations or for certain objects but perform poorly for others; they are sometimes unusable in real-time applications and are often used with uncalibrated acquisition devices. Research in this area has never stopped. Some methods focus on restricted classes of objects of interest, such as faces or human bodies. More recently, deep learning methods have been proposed to estimate the 3D shape of objects from a single flat image. Furthermore, the shape and pose of articulated objects are studied by means of geometric deep learning techniques that take multiple RGB images or 3D scans as input.

We study the problem of reconstructing a 3D object, possibly articulated or deformable, from one or more scans or images. For example, we study algorithms that generate correspondences between images depicting the same object or person, in order to provide a geometric reconstruction of that specific piece of the world. Data of the same object or scene, acquired from different viewpoints or at different times, need to be aligned for the 3D reconstruction. The processing chain includes the following steps, possibly learned end-to-end: (i) primitive detection, i.e. extraction of 2D/3D key points or salient regions, each described by a feature vector; (ii) matching of descriptors by similarity measures; (iii) estimation of the parameters of a transformation model that aligns the entire data, which in turn allows the shape to be described. Applications range from extended reality and graphics to robotics and autonomous navigation.
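Step (ii) of the chain above can be illustrated with a minimal sketch of descriptor matching. It assumes descriptors are plain NumPy arrays compared with the Euclidean distance and filtered with a mutual nearest-neighbour check; the function name `mutual_nearest_neighbours` is hypothetical, and this is only one common similarity criterion, not necessarily the one used in our publications.

```python
import numpy as np

def mutual_nearest_neighbours(desc_a, desc_b):
    """Return index pairs (i, j) such that desc_a[i] and desc_b[j]
    are each other's nearest neighbour in descriptor space."""
    # Pairwise squared Euclidean distances, shape (len(desc_a), len(desc_b)).
    diff = desc_a[:, None, :] - desc_b[None, :, :]
    dist = np.sum(diff ** 2, axis=2)
    nn_ab = np.argmin(dist, axis=1)  # best match in B for each descriptor in A
    nn_ba = np.argmin(dist, axis=0)  # best match in A for each descriptor in B
    idx_a = np.arange(len(desc_a))
    keep = nn_ba[nn_ab] == idx_a     # keep only matches that agree both ways
    return np.stack([idx_a[keep], nn_ab[keep]], axis=1)
```

The mutual check discards one-sided matches, which tends to reduce the number of outliers passed to the transformation estimation in step (iii), at the cost of fewer correspondences.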

We study novel shape descriptors, in particular descriptors for (deformable) shapes learned with deep networks. This topic is related to the indexing and retrieval of 3D objects, which has many and varied applications: hand gesture recognition, retrieval and classification of scanned objects, and classification of proteins, to name a few.


3D point set registration is the problem of finding an optimal Euclidean transformation to align two partially overlapping 3D point sets such that they can be represented in a common reference frame. Our research focuses on the design of algorithms that process 3D point clouds that are captured in the real world.

Point cloud registration approaches can be broadly categorised into correspondence-free and correspondence-based approaches.

Correspondence-free registration approaches aim at minimizing the difference between the global features extracted from the two input point clouds (e.g. OGMM [1]).

Correspondence-based registration approaches rely on point-level correspondences between the two input point clouds, computed for example by matching 3D descriptors (e.g. DIP [3] or GeDi [2]).
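Once point-level correspondences are available, the rigid transformation can be estimated in closed form. The sketch below uses the classical SVD-based (Kabsch) least-squares solution and assumes outlier-free, one-to-one correspondences; the function name `estimate_rigid_transform` is hypothetical. In practice such a solver is usually wrapped in a robust estimator such as RANSAC, which is not shown here.

```python
import numpy as np

def estimate_rigid_transform(src, dst):
    """Least-squares rotation R and translation t such that
    R @ src[i] + t is as close as possible to dst[i] (Kabsch algorithm).
    src and dst are (N, 3) arrays of corresponding points."""
    src_mean = src.mean(axis=0)
    dst_mean = dst.mean(axis=0)
    # 3x3 cross-covariance of the centred correspondences.
    H = (src - src_mean).T @ (dst - dst_mean)
    U, _, Vt = np.linalg.svd(H)
    # Guard against reflections: force det(R) = +1.
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    D = np.diag([1.0, 1.0, d])
    R = Vt.T @ D @ U.T
    t = dst_mean - R @ src_mean
    return R, t
```

A usage example would pass the matched 3D points of the two clouds as `src` and `dst`, then apply `src @ R.T + t` to bring the first cloud into the reference frame of the second.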

related publications 


With the advent of mobile phones capable of capturing 3D information, there has been a tremendous increase in the availability of point cloud data. Point cloud processing and 3D shape understanding are very challenging tasks for which deep learning techniques have demonstrated great potential. These studies are intended for general semantic scene understanding; a practical application is the detection and identification of object parts for robotic manipulation tasks.

To allow artificially intelligent agents to interact with the real world, where the amount of annotated data may be limited, integrating new sources of knowledge becomes crucial to support autonomous learning. We consider several scenarios involving synthetic and real-world point clouds in which supervised learning fails due to data scarcity and large domain gaps. We propose to enrich standard feature representations by leveraging self-supervision through a multi-task model that learns to solve a 3D puzzle while learning the main task of shape classification or part segmentation.
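To make the 3D puzzle idea concrete, the sketch below shows one possible way to build such a pretext task from an unlabeled point cloud: the cloud is scaled into the unit cube, split into k x k x k voxels, and the voxels are randomly displaced; the network would then have to predict, for each point, the voxel it originally came from. The function name `make_3d_puzzle`, the grid size `k` and the exact construction are assumptions made for illustration, not a description of the published implementation.

```python
import numpy as np

def make_3d_puzzle(points, k=3, rng=None):
    """Shuffle the voxels of a point cloud (hypothetical pretext task).
    Returns the shuffled points and, per point, the flat index of its
    original voxel, which serves as the self-supervised label."""
    rng = np.random.default_rng() if rng is None else rng
    # Scale the cloud into the unit cube [0, 1)^3.
    mins = points.min(axis=0)
    extent = (points.max(axis=0) - mins).max()
    unit = (points - mins) / (extent + 1e-9)
    # Integer voxel coordinates and a flat voxel label per point.
    cell = np.minimum((unit * k).astype(int), k - 1)
    labels = cell[:, 0] * k * k + cell[:, 1] * k + cell[:, 2]
    # Random permutation: voxel v is moved to position perm[v].
    perm = rng.permutation(k ** 3)
    new_flat = perm[labels]
    new_cell = np.stack(
        [new_flat // (k * k), (new_flat // k) % k, new_flat % k], axis=1)
    # Keep each point's offset inside its voxel, change only the voxel.
    shuffled = unit + (new_cell - cell) / k
    return shuffled, labels
```

Because the labels come for free from the shuffling itself, this auxiliary task can be trained on unannotated target-domain data alongside the supervised main task.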

In augmented reality, the segmentation of an object into semantic parts can be exploited, for example, to detect the individual parts so that the object can be virtually manipulated and reassembled in a different fashion, as done in the REPLICATE project.

The semantic segmentation of 3D shapes with a high density of vertices can be impractical due to large memory requirements. To make this problem computationally tractable, we propose neural-network-based approaches that produce 3D augmented views of the 3D shape, so that the whole segmentation is solved as a set of sub-segmentation problems.

In publications [2][3] we deal with different domains covering real and synthetic 3D point clouds, as well as several learning settings across domains and with scarce annotations.
Semantic segmentation results of the approach presented in [4]. Colour key for the semantic parts: yellow = head, green = torso, blue = right arm, light blue = right hand, orange = right leg, yellow = right foot, red = left arm, light red = left hand, purple = left leg, light purple = left foot. Each segmentation result (left) is accompanied by a confidence map (right) showing the uncertainty (entropy) of the network prediction over the 3D shape. The darker the colour, the higher the uncertainty.
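An entropy-based confidence map of this kind can be computed from the per-vertex class probabilities. The following is a minimal sketch that assumes the network outputs a softmax distribution per vertex; the function names and the linear mapping from entropy to shade are illustrative assumptions, not the exact procedure of [4].

```python
import numpy as np

def prediction_entropy(probs, eps=1e-12):
    """Shannon entropy (in nats) of each row of class probabilities.
    probs has shape (num_vertices, num_classes), rows summing to 1."""
    p = np.clip(probs, eps, 1.0)  # avoid log(0)
    return -np.sum(p * np.log(p), axis=1)

def confidence_shade(probs):
    """Map entropy to a shade in [0, 1]: 1 is bright (confident),
    0 is dark (maximally uncertain, i.e. uniform prediction)."""
    max_entropy = np.log(probs.shape[1])
    return 1.0 - prediction_entropy(probs) / max_entropy
```

With this mapping, a vertex whose prediction is spread evenly over all parts is rendered darkest, while a vertex assigned to a single part with probability close to 1 is rendered brightest.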

related publications


We propose a system that captures nearly synchronous frame streams from multiple moving handheld mobile devices and is suitable for the 3D reconstruction of dynamic objects. Each mobile device runs Simultaneous Localisation and Mapping on board to estimate its pose, and uses a wireless communication channel to send or receive synchronization triggers. The system can harvest frames and device poses in real time using a decentralized triggering strategy and a data-relay architecture that can be deployed either at the Edge or in the Cloud.

related publications


One of the fundamental tasks in computer vision, for example when analyzing humans, is the accurate estimation of 2D/3D key points from images depicting human bodies. This is the first step in object reconstruction.

We study innovative methods for 2D/3D key-point localization, for example applied to pictures of humans under domain shift, i.e. when the training (source) and the test (target) images differ significantly in visual appearance. One of the proposed methods seamlessly combines three components: feature alignment, adversarial training and self-supervision. Specifically, a deep architecture leverages domain-specific distribution alignment layers to perform target adaptation at the feature level. Furthermore, we propose a loss that combines an adversarial term, which ensures aligned predictions in the output space, with a geometric consistency term, which enforces coherent predictions between a target sample and its perturbed version.
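The geometric consistency term can be sketched as follows, under the assumption that the perturbation is a known 2D affine transform (matrix `A`, offset `b`) applied to the target image: the key points predicted on the original sample, once mapped through the same transform, should coincide with the key points predicted on the perturbed sample. The function name, the squared-distance penalty and the affine perturbation are illustrative assumptions; the published formulation may differ.

```python
import numpy as np

def geometric_consistency_loss(kp, kp_perturbed, A, b):
    """Mean squared distance between the transformed predictions on the
    original sample and the predictions on the perturbed sample.
    kp, kp_perturbed: (num_keypoints, 2) arrays of image coordinates.
    A: (2, 2) linear part of the known perturbation, b: (2,) offset."""
    expected = kp @ A.T + b  # where each key point should land
    return float(np.mean(np.sum((expected - kp_perturbed) ** 2, axis=1)))
```

Since this term needs no key-point annotations, only the known perturbation, it can be evaluated on unlabeled target images.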

related publications

L.O. Vasconcelos, M. Mancini, D. Boscaini, S. Rota Bulò, B. Caputo and E. Ricci. Shape Consistent 2D Keypoint Estimation under Domain Shift, International Conference on Pattern Recognition - ICPR, pp. 8037-8044, 2020

L.O. Vasconcelos, M. Mancini, D. Boscaini, B. Caputo and E. Ricci. Structured Domain Adaptation for 3D Keypoint Estimation, International Conference on 3D Vision - 3DV, pp. 57-66, 2019


We propose novel approaches based on geometric deep learning techniques, for example for 3D hand shape recognition from RGB-D data. In this case the model, trained on synthetic data, retains its performance on real samples at test time.

related publication

J. Svoboda, P. Astolfi, D. Boscaini, J. Masci and M.M. Bronstein. Clustered Dynamic Graph CNN for Biometric 3D Hand Shape Recognition, IEEE International Joint Conference on Biometrics - IJCB, pp. 1-9, 2020