OmniPart: Part-Aware 3D Generation with Semantic Decoupling and Structural Cohesion

Anonymous Authors

Abstract

The creation of 3D assets with explicit, editable part structures is crucial for advancing interactive applications, yet most generative methods produce only monolithic shapes, limiting their utility. We introduce OmniPart, a novel framework for part-aware 3D object generation designed to achieve high semantic decoupling among components while maintaining robust structural cohesion. OmniPart uniquely decouples this complex task into two synergistic stages: (1) an autoregressive structure planning module generates a controllable, variable-length sequence of 3D part bounding boxes, critically guided by flexible 2D part masks that allow for intuitive control over part decomposition without requiring direct correspondences or semantic labels; and (2) a spatially-conditioned rectified flow model, efficiently adapted from a pre-trained holistic 3D generator, synthesizes all 3D parts simultaneously and consistently within the planned layout. Our approach supports user-defined part granularity, precise localization, and enables diverse downstream applications. Extensive experiments demonstrate that OmniPart achieves state-of-the-art performance, paving the way for more interpretable, editable, and versatile 3D content.
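As a toy illustration of the first stage, the variable-length part layout can be viewed as a sequence of 3D bounding boxes emitted one at a time until a stop token. The sketch below is illustrative only and not the paper's code: `PartBox`, `STOP`, and `next_box` are hypothetical names, and `next_box` is a plug-in stub standing in for the learned autoregressive planner (which, in the real model, also conditions on image features and 2D part masks).

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass(frozen=True)
class PartBox:
    """Axis-aligned 3D bounding box for one part, in normalized [0, 1] coords."""
    lo: tuple  # (x, y, z) min corner
    hi: tuple  # (x, y, z) max corner

STOP = None  # sentinel playing the role of an end-of-sequence token

def plan_parts(next_box: Callable[[List[PartBox]], Optional[PartBox]],
               max_parts: int = 32) -> List[PartBox]:
    """Autoregressively grow a variable-length part layout.

    `next_box` receives the boxes emitted so far and returns either the
    next bounding box or STOP, so the sequence length is decided by the
    planner itself rather than fixed in advance.
    """
    boxes: List[PartBox] = []
    while len(boxes) < max_parts:
        box = next_box(boxes)
        if box is STOP:
            break
        boxes.append(box)
    return boxes
```

A trivial stub that splits an object into two halves would return a fixed box for steps 0 and 1 and STOP afterwards; swapping in a learned model changes only `next_box`, not the loop.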

Interactive Bounding Box and Mesh

[Interactive 3D viewer: per-part bounding boxes and meshes, with combined and exploded 3D Gaussian splatting views.]

Mask-Guided Multi-Granularity Generation

[Interactive 3D viewer: bounding boxes and meshes generated at different mask-specified part granularities.]
Method Overview



OmniPart generates part-aware, controllable, and high-quality 3D content in two key stages: part structure planning and structured part latent generation. Built upon TRELLIS, which provides a spatially structured sparse voxel latent space, OmniPart first predicts part-level bounding boxes with an autoregressive planner; part-specific latent codes are then generated by a rectified flow model fine-tuned from a large-scale shape generator pretrained on whole objects.
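To make the hand-off between the two stages concrete, the sketch below (NumPy, illustrative only; the real system operates in TRELLIS's sparse voxel latent space, and `boxes_to_part_volume` is a hypothetical helper, not the paper's code) rasterizes planned part boxes into a voxel grid of part IDs, the kind of spatial layout the part generator can be conditioned on.

```python
import numpy as np

def boxes_to_part_volume(boxes, resolution=16):
    """Rasterize part bounding boxes into a voxel grid of part IDs.

    boxes: list of (lo, hi) corner pairs in normalized [0, 1]^3 coords.
    Returns an int32 volume where 0 = empty and i+1 = part i; where
    boxes overlap, later parts overwrite earlier ones.
    """
    vol = np.zeros((resolution,) * 3, dtype=np.int32)
    for part_id, (lo, hi) in enumerate(boxes, start=1):
        # Snap continuous corners to voxel indices, clamped to the grid.
        lo_idx = np.clip(np.floor(np.array(lo) * resolution).astype(int),
                         0, resolution - 1)
        hi_idx = np.clip(np.ceil(np.array(hi) * resolution).astype(int),
                         1, resolution)
        vol[lo_idx[0]:hi_idx[0], lo_idx[1]:hi_idx[1], lo_idx[2]:hi_idx[2]] = part_id
    return vol
```

Because every voxel carries a part ID from the planned layout, all parts can be synthesized simultaneously within one shared grid, which is what keeps them mutually consistent.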

Applications



OmniPart generates high-quality part-aware 3D content directly from a single input image and naturally supports a range of downstream applications, including animation, mask-controlled generation, multi-granularity generation, material editing, and geometry processing.