GS-Pose: Category-Level Object Pose Estimation via Geometric and Semantic Correspondence
ECCV 2024

Abstract


Category-level pose estimation is a challenging task with many potential applications in computer vision and robotics. Recently, deep-learning-based approaches have made great progress, but they are typically hindered by the need for large datasets of either pose-labelled real images or carefully tuned photorealistic simulators. This can be avoided by using only geometry inputs such as depth images to reduce the domain gap, but these approaches suffer from a lack of semantic information, which can be vital for pose estimation. To resolve this conflict, we propose to utilize both geometric and semantic features obtained from a pre-trained foundation model. Our approach projects 2D semantic features onto object models to obtain 3D semantic point clouds. Based on this novel 3D representation, we further propose a self-supervision pipeline that matches the fused semantic point clouds against partial observations rendered from synthetic object models. The knowledge learned from synthetic data generalizes to observations of unseen objects in real scenes without any fine-tuning. We demonstrate this with a rich evaluation on the NOCS, Wild6D and SUN RGB-D benchmarks, showing superior performance over geometric-only and semantic-only baselines with significantly fewer training objects.

Method


Overview of semantic and geometric feature embedding. Unlike other synthetic-only pose estimation pipelines, our method incorporates both geometric and semantic features to improve performance. (1) First, we sample camera poses around the synthetic object CAD model and render 2D RGB-D images. (2) Afterwards, we fuse the 2D semantic features extracted from the rendered RGB images into the 3D point cloud as 3D semantic features. Specifically, we project each point into the 2D observations in which it is visible and extract the 2D semantic feature at the projected image location. Since an object point can be observed from multiple views, we average the observed features to obtain a smooth representation. We directly use the 3D object point coordinates as geometric features and combine them with the fused semantic features as inputs to the matching network.
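
As a rough illustration of this fusion step, the following Python (NumPy) sketch averages 2D semantic features over the views in which each object point is visible. The function names, the pinhole projection, and the depth-buffer visibility test are illustrative assumptions, not the authors' released code.

import numpy as np

def project_points(points_obj, pose_w2c, K):
    """Project object points (N,3) into one rendered view; return pixel coords and depth."""
    R, t = pose_w2c[:3, :3], pose_w2c[:3, 3]
    pts_cam = points_obj @ R.T + t              # object points in the camera frame
    uvw = pts_cam @ K.T                         # pinhole projection
    return uvw[:, :2] / uvw[:, 2:3], pts_cam[:, 2]

def fuse_semantic_point_cloud(points_obj, views, depth_tol=1e-2, eps=1e-6):
    """Average per-point 2D semantic features over all views where the point is visible.

    views: list of dicts with 'pose' (4x4 world-to-camera), 'K' (3x3 intrinsics),
           'feat' (H,W,C) 2D semantic feature map, 'depth' (H,W) rendered depth.
    Returns an (N,C) array of fused 3D semantic features aligned with points_obj.
    """
    n_pts, n_dim = points_obj.shape[0], views[0]["feat"].shape[-1]
    acc, cnt = np.zeros((n_pts, n_dim)), np.zeros((n_pts, 1))
    for v in views:
        uv, z = project_points(points_obj, v["pose"], v["K"])
        h, w = v["depth"].shape
        px, py = np.round(uv[:, 0]).astype(int), np.round(uv[:, 1]).astype(int)
        inside = (px >= 0) & (px < w) & (py >= 0) & (py < h) & (z > 0)
        vis = inside.copy()
        # Visibility test: the projected depth must match the rendered depth buffer.
        vis[inside] &= np.abs(v["depth"][py[inside], px[inside]] - z[inside]) < depth_tol
        acc[vis] += v["feat"][py[vis], px[vis]]
        cnt[vis] += 1
    return acc / np.maximum(cnt, eps)           # view-averaged, smooth semantic features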



Overview of our matching network. Left: To match a semantic point cloud against the RGB-D input, we first extract 2D semantic features from the RGB image and back-project them with the depth image into a partial input point cloud. We then uniformly sample 3000 points from the semantic point cloud and 1000 points from the partial input point cloud for matching. The normalized point coordinates are embedded as geometric features via positional encoding and added to the semantic features. The embedded features are fused with self- and cross-attention layers over multiple iterations in a transformer network for global perception. The assignment matrix is computed from the cosine similarity of the fused point features. Right: To disambiguate symmetric poses, (1) since multiple ground-truth poses can exist for axis-symmetric objects, (2) the Ground Truth (GT) pose is constrained so that the object xz-plane passes through the origin of the camera coordinate system.
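
As a minimal sketch of the geometric embedding and the final matching step, the Python (PyTorch) snippet below adds a positional encoding of the normalized coordinates to the semantic features and computes a soft assignment matrix from cosine similarity. The self- and cross-attention fusion in between is omitted for brevity, and the feature dimension, linear projection, and softmax temperature are assumptions rather than the network's actual configuration.

import torch
import torch.nn.functional as F

def positional_encoding(xyz, num_freqs=6):
    """Sinusoidal encoding of normalized coordinates (N,3) -> (N, 6*num_freqs)."""
    freqs = 2.0 ** torch.arange(num_freqs, dtype=xyz.dtype)
    angles = xyz.unsqueeze(-1) * freqs                       # (N, 3, num_freqs)
    return torch.cat([angles.sin(), angles.cos()], dim=-1).flatten(1)

def assignment_matrix(feat_model, feat_obs, temperature=0.1):
    """Soft assignment between model points (M,C) and observed points (K,C),
    computed from the cosine similarity of their point features."""
    sim = F.normalize(feat_model, dim=-1) @ F.normalize(feat_obs, dim=-1).T   # (M, K)
    return F.softmax(sim / temperature, dim=-1)

# Toy usage: 3000 model points and 1000 observed points with 384-d semantic features.
xyz_model, sem_model = torch.rand(3000, 3), torch.rand(3000, 384)
xyz_obs, sem_obs = torch.rand(1000, 3), torch.rand(1000, 384)
proj = torch.nn.Linear(6 * 6, 384)                             # lift the encoding to the feature width
feat_model = sem_model + proj(positional_encoding(xyz_model))  # geometric embedding added to semantics
feat_obs = sem_obs + proj(positional_encoding(xyz_obs))
# In the full network these features would pass through self-/cross-attention before matching.
A = assignment_matrix(feat_model, feat_obs)                    # (3000, 1000)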

Citation

@inproceedings{gspose,
  title={GS-Pose: Category-Level Object Pose Estimation via Geometric and Semantic Correspondence},
  author={Wang, Pengyuan and Ikeda, Takuya and Lee, Robert and Nishiwaki, Koichi},
  booktitle={European Conference on Computer Vision (ECCV)},
  year={2024}
}

Acknowledgements

The website template was borrowed from Michaël Gharbi.