Paper authors: Shivam Duggal, Deepak Pathak (Carnegie Mellon University).
CVPR 2022
Recent Trends in 3D Computer Vision - TUM - 2023
This Blog gives a summary and brief explanation of the TARS paper.
Blog author: Alejandra Castelblanco
Context:
There is a lack of scalable ways to obtain large amounts of data that map 2D observations to 3D representations of an object, especially for in-the-wild images from the internet. This type of data, however, is required to train most 2D-to-3D reconstruction models.
This study addresses three questions:
- How can we understand the 3D geometric world from 2D observations?
- How should a 3D structure be represented to obtain the best shape generation model?
- Can we exploit a collection of unpaired 2D images that belong to the same category to train the model?
Aim: Given a single 2D image of an object, the goal is to propose a model that can reconstruct the corresponding 3D shape.
How: The method leverages information from a collection of unpaired images of the same category and builds an 'averaged', or canonical, representation of the category's shape to produce better 3D reconstructions.
Requirements: In this work, the authors propose a model that estimates the 3D shape of an object from three inputs (a minimal sketch of these inputs follows the list):
- (i) A 2D image containing the target object
- (ii) A mask of the object's location in the image
- (iii) The camera viewpoint
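To make the expected inputs concrete, here is a minimal sketch of how one sample could be represented as tensors (shapes and variable names are illustrative assumptions, not taken from the paper's code):

```python
import torch

# Hypothetical tensor layout for a single sample.
H, W = 128, 128
image = torch.rand(3, H, W)                  # (i) RGB image containing the target object
mask = (torch.rand(1, H, W) > 0.5).float()   # (ii) binary foreground mask of the object
intrinsics = torch.eye(3)                    # (iii) camera intrinsics (focal lengths, principal point)
cam_pose = torch.eye(4)                      # (iii) camera extrinsics: world-to-camera transform
```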
Achievements:
- This work outperformed prior works in terms of 3D reconstruction fidelity and generated 3D shapes with better global structure and finer instance-specific details.
- The method learns 3D correspondences across instances of a category.

Related work:
- Neural Rendering: Previous works have synthesized images from real-world observations using various learning-based and classical computer-graphics methods. This paper builds on Scene Representation Networks (SRN) as its learned renderer.
- Implicit Representations: Previous approaches have relied on 3D or multi-view supervision for model training and generally used mesh-based representations; see, for example, Learning Category-Specific Mesh Reconstruction from Image Collections. Most of these approaches do not generalize to highly diverse topological variations within a category and lack fine instance-specific details.
The method performs inference in two steps: (A) Deformation Fields + Point Features, and (B) Canonical Shape Reconstruction.
- First, a pretrained ResNet autoencoder (AE) condenses the image information into a feature vector $I$ (shown in purple).
- Second, object coordinates $X_{object}=[x,y,z]$ (shown in red) are sampled from the 2.5D representation of the image via SRN-style ray marching.
- Both representations are fed to the DeformNet model $g(X_{object}, I)$, which predicts a deformation field $\Delta$; the deformed point $X_{object} + \Delta$ is the corresponding point in the canonical shape representation, $X_{canonical}=[x',y',z']$.

The DeformNet model also outputs per-point image features and an RGB color estimate for each sampled point (a minimal sketch of this forward pass is given below).
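A minimal PyTorch-style sketch of this forward pass, assuming a global image code of size 256 and a plain MLP for DeformNet (module names, layer sizes, and the concatenation scheme are assumptions, not the released implementation):

```python
import torch
import torch.nn as nn

class DeformNet(nn.Module):
    """Maps object-space points plus an image code to a deformation, point features, and RGB."""
    def __init__(self, feat_dim=256, point_feat_dim=64, hidden=256):
        super().__init__()
        self.point_feat_dim = point_feat_dim
        self.mlp = nn.Sequential(
            nn.Linear(3 + feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3 + point_feat_dim + 3),  # delta (3) + point features + RGB (3)
        )

    def forward(self, x_object, image_feat):
        # x_object: (N, 3) sampled object-space points; image_feat: (feat_dim,) global image code I
        feat = image_feat.expand(x_object.shape[0], -1)
        out = self.mlp(torch.cat([x_object, feat], dim=-1))
        delta, point_feat, rgb = out.split([3, self.point_feat_dim, 3], dim=-1)
        x_canonical = x_object + delta      # deform object-space points into the canonical frame
        return x_canonical, point_feat, torch.sigmoid(rgb)
```

In this sketch, `x_canonical` and `point_feat` would then be consumed by the canonical shape generator described next.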
Once the 3D points in object space have been deformed to their corresponding points in canonical space, the next task is to learn the 3D shape in the form of an SDF field.
The canonical shape generator takes the canonical-space points and the point features and predicts the signed distance value of the final object. Marching cubes is then used to extract a 3D mesh from the learned SDF field; a sketch of both steps follows.
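A minimal sketch of step (B), assuming the canonical shape generator is an MLP conditioned on the per-point features and that mesh extraction uses skimage's marching cubes (grid resolution, layer sizes, and the conditioning scheme are assumptions):

```python
import torch
import torch.nn as nn
from skimage import measure

class CanonicalSDF(nn.Module):
    """Predicts a signed distance for a canonical-space point, conditioned on point features."""
    def __init__(self, point_feat_dim=64, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + point_feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x_canonical, point_feat):
        return self.mlp(torch.cat([x_canonical, point_feat], dim=-1)).squeeze(-1)

@torch.no_grad()
def extract_mesh(sdf_fn, resolution=64, bound=1.0):
    """Evaluate an SDF callable on a dense grid and run marching cubes to get a triangle mesh.

    sdf_fn: any callable mapping (N, 3) points to (N,) signed distances
    (e.g. the canonical generator with its point features already attached).
    """
    lin = torch.linspace(-bound, bound, resolution)
    grid = torch.stack(torch.meshgrid(lin, lin, lin, indexing="ij"), dim=-1).reshape(-1, 3)
    sdf = sdf_fn(grid).reshape(resolution, resolution, resolution).cpu().numpy()
    verts, faces, _, _ = measure.marching_cubes(sdf, level=0.0)   # requires a zero crossing in the grid
    verts = verts / (resolution - 1) * 2 * bound - bound          # voxel indices -> world coordinates
    return verts, faces
```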
In addition, there is one extra step at training time: a differentiable renderer generates 2D renderings of the learned 3D shape, which are compared against the input object's ground-truth 2D observations.
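The paper's renderer follows the SRN family of learned ray marchers; as a rough stand-in, the sketch below uses classic sphere tracing to walk camera rays toward the zero level set of the learned SDF (step count and tolerance are assumed values, not the paper's renderer):

```python
import torch

def sphere_trace(sdf_fn, ray_origins, ray_dirs, n_steps=50, eps=1e-3):
    """Walk each ray forward by the signed distance until it (approximately) reaches the surface.

    sdf_fn: callable mapping (N, 3) points to (N,) signed distances.
    ray_origins, ray_dirs: (N, 3) tensors; ray_dirs are assumed to be unit length.
    Returns the final points (N, 3) and a boolean hit mask (N,).
    """
    t = torch.zeros(ray_origins.shape[0], device=ray_origins.device)
    for _ in range(n_steps):
        points = ray_origins + t.unsqueeze(-1) * ray_dirs
        t = t + sdf_fn(points)            # the SDF value is a safe step size toward the surface
    points = ray_origins + t.unsqueeze(-1) * ray_dirs
    hit = sdf_fn(points).abs() < eps      # rays that ended close to the zero level set
    return points, hit
```

Colors predicted at the hit points can then be compared against the masked input image to form the reconstruction loss below.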
The training objective combines a color (RGB) loss, a loss on the SDF representation, and a regularization term:
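Schematically, the objective has the form below; the exact terms and their weights $\lambda$ are defined in the paper, this only shows the overall structure:

$$\mathcal{L} = \mathcal{L}_{RGB} + \lambda_{SDF}\,\mathcal{L}_{SDF} + \lambda_{reg}\,\mathcal{L}_{reg}$$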
Evaluation was performed on the following datasets: ShapeNet, Pascal3D+, CUB-200-2011, and Pix3D chairs.
Baselines considered to compare the method were:
- SoftRas: rasterization-based differentiable mesh renderer.
- SDF-SRN: neural implicit modeling approach for single-view reconstruction.
- CMR: deformation-driven mesh reconstruction approach.
TARS reconstructions (1) are much less noisy, (2) respect the topology of the underlying shapes, (3) better capture the global shape and finer details, and (4) learn cross-instance structural correspondences.
The model can also extrapolate from training on a synthetic dataset to real-world images.
TARS matches the shape fidelity of SDF-SRN reconstructions and additionally learns cross-instance structural correspondences for free. Thanks to the proposed higher-dimensional deformation field, it performs significantly better than a mesh-based approach.
Their approach can learn to reconstruct 3D shapes given a collection of unpaired 2D images from a single category.
- They achieved high fidelity reconstruction of the global structure and fine topological details.
- The deformation field network in combination with point-based-features learns the category-specific structural relations (e.g., base of a sofa as equivalent to legs of a chair) without supervision.
Main shortcomings of the method: it needs ground-truth camera poses for training (difficult to obtain) and object masks (easier to obtain).