
Segment Any 3D-Part from a Sentence

Hongyu Wu1,†, Pengwan Yang1,†, Yuki M. Asano1,2,‡, Cees G. M. Snoek1,‡
1University of Amsterdam     2Technical University of Nuremberg

We propose the first large-scale 3D dataset with dense part annotations, based on an innovative approach for constructing 3D scene data with detailed part annotations (left). Using this new dataset, we introduce the 3D part understanding task and a method that enables flexible part segmentation and identification from any sentence query (right).

Abstract

This paper aims to achieve the segmentation of any 3D part based on natural language descriptions, extending beyond traditional object-level 3D scene understanding and addressing both data and methodological challenges. Existing datasets and methods are predominantly limited to object-level comprehension. To overcome the limitations of data availability, we introduce the first large-scale 3D dataset with dense part annotations, created through an innovative and cost-effective method for constructing synthetic 3D scenes with fine-grained part-level annotations, paving the way for advanced 3D part understanding. On the methodological side, we propose OpenPart3D, a 3D-input-only framework to effectively tackle the challenges of part-level segmentation. Extensive experiments demonstrate the superiority of our approach in open-vocabulary 3D understanding tasks at both the part and object levels, with strong generalization capabilities across various 3D datasets.

3D-PU Data


3D-PU dataset scene examples. Each 3D scene includes a point cloud, mesh, textured mesh, and detailed part annotations for all objects.
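As a rough illustration of this layout, the sketch below shows one way a 3D-PU scene and its part annotations could be represented in Python. All class and field names here (Scene3DPU, PartAnnotation, point_indices, and so on) are hypothetical; the page does not specify the dataset's actual on-disk format.

```python
# A minimal, hypothetical sketch of a 3D-PU scene record.
from dataclasses import dataclass, field
import numpy as np

@dataclass
class PartAnnotation:
    object_id: int             # which object in the scene this part belongs to
    part_label: str            # e.g. "chair leg" (example label, not from the paper)
    point_indices: np.ndarray  # indices into the scene point cloud

@dataclass
class Scene3DPU:
    points: np.ndarray         # (N, 3) point cloud coordinates
    colors: np.ndarray         # (N, 3) per-point RGB
    mesh_path: str             # path to the untextured mesh file
    textured_mesh_path: str    # path to the textured mesh file
    parts: list[PartAnnotation] = field(default_factory=list)

def part_mask(scene: Scene3DPU, part: PartAnnotation) -> np.ndarray:
    """Return a boolean mask over the scene's points for one annotated part."""
    mask = np.zeros(len(scene.points), dtype=bool)
    mask[part.point_indices] = True
    return mask
```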

Method: OpenPart3D


First, the Room-Tour Snap Module captures multi-view images of the 3D scene by strategically positioning cameras at optimized locations and orientations. These images are then processed by a 2D open-vocabulary model to generate 2D part masks corresponding to the given text query. Finally, the View-Weighted 3D-Part Grouping Module integrates the 2D part masks across views, assigning an adaptive weight to each view, to extract geometrically consistent regions from the scene's point cloud and aggregate them into precise 3D parts.
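To make the fusion step concrete, here is a minimal sketch in Python of how weighted multi-view mask aggregation could work. The 2D segmentation model is passed in as a callable, the pixel-to-point maps are assumed to come from the renderer, and the per-view weight and final threshold are simple stand-ins for the paper's (unspecified) adaptive weighting scheme; every name below is hypothetical, not the authors' implementation.

```python
import numpy as np
from typing import Callable

def segment_part(
    points: np.ndarray,     # (N, 3) scene point cloud
    text_query: str,        # e.g. "the armrest of the sofa"
    views: list[dict],      # one dict per snapped view: {"image", "pix_to_point"}
    open_vocab_segment: Callable[[np.ndarray, str], np.ndarray],
) -> np.ndarray:
    """Fuse per-view 2D part masks into a single 3D part mask over `points`."""
    votes = np.zeros(len(points))
    norm = np.zeros(len(points))
    for view in views:  # views produced by the camera-placement stage
        # The 2D open-vocabulary model returns an (H, W) boolean part mask.
        mask_2d = open_vocab_segment(view["image"], text_query)
        # Stand-in for the adaptive view weight: fraction of the image the
        # part covers, so views that see the part clearly count for more.
        w = float(mask_2d.mean()) + 1e-6
        p2p = view["pix_to_point"]   # (H, W) int map to point indices; -1 = no hit
        visible = p2p >= 0
        idx = p2p[visible]
        # np.add.at accumulates correctly when several pixels hit one point.
        np.add.at(votes, idx, w * mask_2d[visible].astype(float))
        np.add.at(norm, idx, w)
    score = np.divide(votes, norm, out=np.zeros_like(votes), where=norm > 0)
    return score > 0.5               # decision threshold is an assumption
```

Accumulating weighted votes per point and then normalizing makes the result robust to views where the part is occluded or missed by the 2D model, which is the intuition behind weighting views rather than treating them equally.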

Visualization (3D-PU)


Visualization (Other datasets)


BibTeX