Many tasks, such as object recognition and action recognition, involve understanding the shapes of objects or human silhouettes. Reconstructing the shape of objects from their appearance in images has a long history: techniques that recover shape from texture, shading, stereo or motion have been proposed since the early 1970s. These algorithms work well in certain situations or for certain objects but perform poorly for others, and are sometimes unusable in real-time applications or, often, with uncalibrated acquisition devices. Research in this area has never stopped. Some methods focus on restricted classes of objects of interest, such as faces or human bodies. More recently, deep learning methods have been proposed to estimate the 3D shape of objects from a single flat image. Furthermore, the shape and pose of articulated objects are studied by means of geometric deep learning techniques that take multiple RGB images or 3D scans as input.

We study the problem of reconstructing a 3D object, possibly articulated or deformable, from one or more scans or images. For example, we study algorithms that generate correspondences between images depicting the same object or person, in order to provide a geometric reconstruction of that specific piece of the world. Data of the same object or scene, acquired from different viewpoints or at different times, need to be aligned for the 3D reconstruction. The processing chain includes the following steps, possibly end-to-end: (i) primitive detection, i.e. extraction of 2D/3D key points or salient regions, each to be described with a feature vector; (ii) matching of descriptors by similarity measures; (iii) finally, estimation of the parameters of a transformation model used to align the entire data, which makes it possible to describe the shape. Applications range from extended reality and graphics to robotics and autonomous navigation.
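As a rough illustration of steps (ii) and (iii), the following Python sketch matches descriptors by mutual nearest neighbour and then fits a rigid transformation with the SVD-based Kabsch solution. It assumes that key points and descriptors are already available as NumPy arrays. The function names, the mutual-nearest-neighbour rule and the choice of a rigid model are illustrative assumptions, not the specific methods used in the publications listed on this page.

```python
import numpy as np


def match_descriptors(desc_src, desc_dst):
    """Return index pairs (i, j) that are mutual nearest neighbours in descriptor space."""
    # Pairwise squared Euclidean distances between the two descriptor sets.
    d2 = ((desc_src[:, None, :] - desc_dst[None, :, :]) ** 2).sum(axis=2)
    nn_src = d2.argmin(axis=1)  # best destination index for each source descriptor
    nn_dst = d2.argmin(axis=0)  # best source index for each destination descriptor
    src_idx = np.arange(len(desc_src))
    mutual = nn_dst[nn_src] == src_idx  # keep a pair only if both sides agree
    return src_idx[mutual], nn_src[mutual]


def estimate_rigid_transform(src, dst):
    """Least-squares rotation R and translation t such that R @ src[i] + t ~ dst[i]."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    # Cross-covariance of the centred point sets.
    h = (src - src_c).T @ (dst - dst_c)
    u, _, vt = np.linalg.svd(h)
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = np.sign(np.linalg.det(vt.T @ u.T))
    rotation = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    translation = dst_c - rotation @ src_c
    return rotation, translation
```

Real matches usually contain outliers, so the estimation step is commonly wrapped in a robust scheme such as RANSAC rather than applied to all matches at once.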

We study novel shape descriptors, in particular descriptors for (deformable) shapes learnt with deep networks. This topic is related to the indexing and retrieval of 3D objects, which has many and varied applications: hand gesture recognition, retrieval and classification of scanned objects, and classification of proteins, to name a few.


An effective 3D descriptor should be invariant to different geometric transformations, such as scale and rotation, repeatable in the case of occlusions and clutter, and generalisable in different contexts when data is captured with different sensors. We have proposed a simple but effective method for learning distinctive 3D local deep descriptors (DIPs). They can be used to register point clouds without requiring an initial alignment. Point cloud patches are extracted, canonicalised with respect to their estimated local reference frame and encoded into rotation-invariant compact descriptors by a PointNet-based deep neural network. DIPs can effectively generalise across different sensor modalities because they are learnt end-to-end from locally and randomly sampled points. DIPs are robust to clutter, occlusions and missing regions because they encode only local geometric information.
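The canonicalisation step can be illustrated with a minimal Python sketch. It assumes a PCA-based local reference frame with a third-moment sign rule and centres the patch on its centroid; these are stand-in choices made for illustration, and neither the frame estimator nor the PointNet-based encoder of DIPs is reproduced here. The function names are hypothetical.

```python
import numpy as np


def _fix_sign(axis, centred):
    """Flip the axis so that the third moment of the projections is non-negative."""
    return axis if np.sum((centred @ axis) ** 3) >= 0 else -axis


def local_reference_frame(centred):
    """Return a 3x3 rotation whose columns are the x, y, z axes of a PCA-based frame."""
    # np.linalg.eigh returns eigenvalues in ascending order.
    _, eigvecs = np.linalg.eigh(centred.T @ centred)
    x_axis = _fix_sign(eigvecs[:, 2], centred)  # direction of largest spread
    z_axis = _fix_sign(eigvecs[:, 0], centred)  # direction of smallest spread
    y_axis = np.cross(z_axis, x_axis)           # completes a right-handed frame
    return np.stack([x_axis, y_axis, z_axis], axis=1)


def canonicalise_patch(patch):
    """Express an (N, 3) patch in its own local frame (assumption: centred on the centroid)."""
    centred = patch - patch.mean(axis=0)
    return centred @ local_reference_frame(centred)
```

Because the frame rotates together with the patch, the canonicalised coordinates of a patch and of a rigidly moved copy coincide, which is what allows a downstream encoder to produce rotation-invariant descriptors.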

related publications

F. Poiesi, D. Boscaini. Distinctive 3D local deep descriptors, International Conference on Pattern Recognition - ICPR, pp. 5720-5727, 2021

M. Zanin, F. Remondino and M. Dalla Mura. High-performance computing in image registration, SPIE Remote Sensing, 2012


With the advent of mobile phones capable of capturing 3D information, there has been a tremendous increase in the availability of point cloud data. Point cloud processing and 3D shape understanding are very challenging tasks for which deep learning techniques have demonstrated great potential. These studies are intended for general semantic scene understanding; a practical application is the detection and identification of object parts for robotic manipulation tasks.

To allow artificially intelligent agents to interact with the real world, where the amount of annotated data may be limited, integrating new sources of knowledge becomes crucial to support autonomous learning. We consider several possible scenarios involving synthetic and real-world point clouds where supervised learning fails due to data scarcity and large domain gaps. We propose to enrich standard feature representations by leveraging self-supervision through a multi-task model that can solve a 3D puzzle while learning the main task of shape classification or part segmentation.
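One common way to pose a 3D puzzle as a self-supervised task is to split the point cloud into voxel cells, shuffle the cells, and ask the network to predict each point's original cell. The Python sketch below only generates such puzzle data. Whether it matches the exact formulation in the related publications is an assumption, and the function name and the `cells_per_axis` parameter are hypothetical.

```python
import numpy as np


def make_3d_puzzle(points, cells_per_axis=3, rng=None):
    """Shuffle the voxel cells of an (N, 3) point cloud.

    Returns the displaced points (in the unit cube) and, for every point,
    the label of the cell it originally belonged to.
    """
    rng = np.random.default_rng() if rng is None else rng
    k = cells_per_axis
    # Normalise to the unit cube so the voxel grid is well defined.
    lo, hi = points.min(axis=0), points.max(axis=0)
    unit = (points - lo) / np.maximum(hi - lo, 1e-12)
    # Integer cell coordinates in [0, k - 1] and one flat label per point.
    cell = np.minimum((unit * k).astype(int), k - 1)
    labels = cell[:, 0] * k * k + cell[:, 1] * k + cell[:, 2]
    # Random permutation of cells: cell c is moved to position perm[c].
    perm = rng.permutation(k ** 3)
    new_flat = perm[labels]
    new_cell = np.stack(
        [new_flat // (k * k), (new_flat // k) % k, new_flat % k], axis=1
    )
    # Keep each point's offset inside its cell; change only the cell origin.
    shuffled = unit + (new_cell - cell) / k
    return shuffled, labels
```

In a multi-task setting, the per-point labels would supervise an auxiliary head that is trained alongside the classification or part-segmentation head, so no manual annotation is needed for the puzzle branch.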

In augmented reality, segmenting an object into semantic parts can be exploited, for example, to detect the individual parts so that the object can be virtually manipulated and reconstructed in a different fashion, as in the REPLICATE project.

The semantic segmentation of 3D shapes with a high density of vertices can be impractical due to large memory requirements. To make this problem computationally tractable, we propose a neural-network-based approach that produces 3D augmented views of the 3D shape, so that the whole segmentation is solved as a set of sub-segmentation problems. 3D augmented views are obtained by projecting the vertices and normals of a 3D shape onto 2D regular grids taken from different viewpoints around the shape. These 3D views are then processed by a Convolutional Neural Network to produce a probability distribution function (pdf) over the set of semantic classes for each vertex. The pdfs are then re-projected onto the original 3D shape and post-processed using contextual information through Conditional Random Fields.
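The projection and re-projection steps can be sketched in Python as follows. This is a simplified illustration under stated assumptions: an orthographic camera looking down the negative z axis, one vertex kept per pixel (the nearest one), and hypothetical function names. The network and the Conditional Random Field are not included, and the actual view construction in [3] may differ.

```python
import numpy as np


def render_augmented_view(vertices, normals, rotation, grid_size=64):
    """Project vertices and normals onto a grid_size x grid_size regular grid.

    Returns an image with channels (depth, nx, ny, nz) and an index map that
    stores, per pixel, the index of the visible vertex (-1 for empty pixels).
    """
    v = vertices @ rotation.T
    n = normals @ rotation.T
    # Map x, y to integer pixel coordinates covering the shape's extent.
    xy = v[:, :2]
    lo = xy.min(axis=0)
    span = np.maximum(xy.max(axis=0) - lo, 1e-12)
    pix = np.minimum(((xy - lo) / span * grid_size).astype(int), grid_size - 1)
    image = np.zeros((grid_size, grid_size, 4))
    index_map = np.full((grid_size, grid_size), -1, dtype=int)
    # Z-buffer by ordering: assumption, larger z is closer to the camera,
    # so visiting vertices by ascending z lets the nearest one be written last.
    for i in np.argsort(v[:, 2]):
        row, col = pix[i, 1], pix[i, 0]
        image[row, col, 0] = v[i, 2]
        image[row, col, 1:] = n[i]
        index_map[row, col] = i
    return image, index_map


def reproject_to_vertices(pixel_probs, index_map, num_vertices):
    """Copy per-pixel class probabilities back to the vertices visible in this view."""
    vertex_probs = np.zeros((num_vertices, pixel_probs.shape[-1]))
    visible = index_map >= 0
    vertex_probs[index_map[visible]] = pixel_probs[visible]
    return vertex_probs
```

Vertices hidden in one view receive no probability from it, which is why several viewpoints around the shape are needed; their re-projected pdfs can then be accumulated per vertex before post-processing.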

In publications [1][2] we deal with different domains covering real and synthetic 3D point clouds, as well as several learning settings across domains and with scarce annotations.
Semantic segmentation results of the approach presented in [3]. Colour key for the semantic parts: yellow = head, green = torso, blue = right arm, light blue = right hand, orange = right leg, yellow = right foot, red = left arm, light red = left hand, purple = left leg, light purple = left foot. Each segmentation result (left) is accompanied by a confidence map (right) showing the uncertainty (entropy) of the network prediction over the 3D shape. The darker the colour, the higher the uncertainty.
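For reference, the per-vertex uncertainty shown in such a confidence map can be computed as the Shannon entropy of the predicted class probabilities. The short Python sketch below assumes the predictions are stored as an (N, C) NumPy array with one row per vertex; the function name and the clipping constant are illustrative choices.

```python
import numpy as np


def prediction_entropy(probs, eps=1e-12):
    """Shannon entropy (in nats) of each row of an (N, C) array of class probabilities."""
    # Clip to avoid log(0) for classes with zero predicted probability.
    p = np.clip(probs, eps, 1.0)
    return -(p * np.log(p)).sum(axis=1)
```

The entropy is close to zero when the network is certain about a vertex and reaches its maximum, log C, when all C classes are predicted as equally likely.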

related publications

  1. A. Alliegro, D. Boscaini and T. Tommasi. Joint Supervised and Self-Supervised Learning for 3D Real-World Challenges, International Conference on Pattern Recognition - ICPR, 2020

  2. A. Alliegro, D. Boscaini and T. Tommasi. Self-Supervision for 3D Real-World Challenges, European Conference on Computer Vision Workshops - ECCVW, 2020

  3. D. Boscaini and F. Poiesi. 3D Shape Segmentation with Geometric Deep Learning, 20th International Conference on Image Analysis and Processing - ICIAP, 2019


We propose a system, suitable for dynamic object 3D reconstruction, that captures nearly synchronous frame streams from multiple moving handheld mobiles. Each mobile executes Simultaneous Localisation and Mapping on-board to estimate its pose, and uses a wireless communication channel to send or receive synchronization triggers. The system can harvest frames and mobile poses in real time using a decentralized triggering strategy and a data-relay architecture that can be deployed either at the Edge or in the Cloud.

related publications

  • M. Bortolon and F. Poiesi. An open-source mobile-based system for synchronised multi-view capture and dynamic object reconstruction, Software Impacts, 9, 2021

  • M. Bortolon, L. Bazzanella and F. Poiesi. Multi-view data capture for dynamic object reconstruction using handheld augmented reality mobiles, Journal of Real-Time Image Processing, vol. 18, pp. 345-355, 2021

  • M. Bortolon, P. Chippendale, S. Messelodi and F. Poiesi. Multi-view data capture using edge-synchronised mobiles, 15th International Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - VISAPP, pp. 730-740, 2020


One of the fundamental tasks in computer vision, for example when analysing humans, is to accurately estimate 2D/3D key-points from images depicting human bodies. This is the first step in object reconstruction.

We study innovative methods for locating 2D/3D key-points, for example in pictures of humans under domain shift, i.e. when the training (source) and the test (target) images differ significantly in visual appearance. One of the proposed methods seamlessly combines three different components: feature alignment, adversarial training and self-supervision. Specifically, a deep architecture leverages domain-specific distribution alignment layers to perform target adaptation at the feature level. Furthermore, a loss is proposed which combines an adversarial term that ensures aligned predictions in the output space and a geometric consistency term that guarantees coherent predictions between a target sample and its perturbed version.
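The idea behind the geometric consistency term can be illustrated with a small Python sketch. It assumes that the perturbation is a known 2D affine transform of the image and that predictions are available as key-point coordinates; the function name and this coordinate-based formulation are illustrative assumptions, and the loss in the related publications may be defined differently (for example on network output maps). The adversarial term is not shown.

```python
import numpy as np


def geometric_consistency_loss(kp_original, kp_perturbed, affine):
    """Mean squared distance between transformed and re-predicted key-points.

    kp_original:  (K, 2) key-points predicted on the target image.
    kp_perturbed: (K, 2) key-points predicted on the perturbed image.
    affine:       (2, 3) matrix mapping original to perturbed image coordinates.
    """
    # Where the original predictions should land after the perturbation.
    expected = kp_original @ affine[:, :2].T + affine[:, 2]
    return float(np.mean(np.sum((expected - kp_perturbed) ** 2, axis=1)))
```

The loss is zero when the predictor is equivariant to the perturbation, and it needs no key-point annotations, which is what makes it usable on unlabelled target images.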

related publications

L.O. Vasconcelos, M. Mancini, D. Boscaini, S. Rota Bulò, B. Caputo and E. Ricci. Shape Consistent 2D Keypoint Estimation under Domain Shift, International Conference on Pattern Recognition - ICPR, pp. 8037-8044, 2020

L.O. Vasconcelos, M. Mancini, D. Boscaini, B. Caputo and E. Ricci. Structured domain adaptation for 3D keypoint estimation, International Conference on 3D Vision - 3DV, pp. 57-66, 2019


We propose a novel approach to 3D hand shape recognition from RGB-D data based on geometric deep learning techniques. The model, trained on synthetic data, retains its performance on real samples at test time.

related publication

J. Svoboda, P. Astolfi, D. Boscaini, J. Masci and M.M. Bronstein. Clustered Dynamic Graph CNN for Biometric 3D Hand Shape Recognition, IEEE International Joint Conference on Biometrics - IJCB, pp. 1-9, 2020