4D generation has made remarkable progress in synthesizing dynamic 3D objects from text, image, or video inputs. However, existing methods typically represent motion as an implicit deformation field, which limits direct control and editability.
To address this, we propose SkeletonGaussian, a novel framework for generating editable, dynamic 3D Gaussians from monocular video input. Our approach introduces a hierarchical, articulated representation that decomposes motion into sparse rigid motion, explicitly driven by a skeleton, and fine-grained non-rigid motion. Concretely, we extract a robust skeleton and drive rigid motion via linear blend skinning, followed by a HexPlane-based refinement for non-rigid deformations, which enhances both interpretability and editability.
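Schematically, the posed center of the i-th Gaussian combines a skeleton-driven LBS term with a learned residual (the notation below is ours, for illustration, not the paper's):

$$
\mu_i(t) \;=\; \underbrace{\sum_{k=1}^{K} w_{ik}\left(\mathbf{R}_k(t)\,\mu_i + \mathbf{t}_k(t)\right)}_{\text{rigid, skeleton-driven}} \;+\; \underbrace{\Delta_\theta(\mu_i, t)}_{\text{non-rigid refinement}},
$$

where $\mu_i$ is the rest-pose center, $w_{ik}$ are skinning weights over $K$ bones, $(\mathbf{R}_k(t), \mathbf{t}_k(t))$ is the rigid transform of bone $k$ at time $t$, and $\Delta_\theta$ is the HexPlane-parameterized deformation.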
Experimental results show that SkeletonGaussian surpasses existing methods in generation quality while enabling intuitive motion editing, establishing a new paradigm for editable 4D generation.
Our method allows users to directly edit the skeleton of the generated 3D Gaussians, enabling real-time pose manipulation.
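As a rough illustration of what such an edit could look like in code, the sketch below rotates one joint and re-poses the Gaussian centers in a single LBS pass. This is a minimal sketch, assuming per-Gaussian skinning weights and world-space 4x4 joint transforms; the function names are hypothetical, and for brevity the edit ignores the joint hierarchy (a real skeleton would rotate about the joint's pivot and propagate the change to child joints).

```python
# Minimal sketch of skeleton-based pose editing. Names are illustrative,
# not the paper's actual API.
import numpy as np

def lbs(centers, weights, joint_transforms):
    """Deform Gaussian centers with linear blend skinning (LBS).

    centers:          (N, 3) rest-pose Gaussian centers
    weights:          (N, K) skinning weights, each row sums to 1
    joint_transforms: (K, 4, 4) rigid transform of each skeleton joint
    """
    homo = np.concatenate([centers, np.ones((len(centers), 1))], axis=1)  # (N, 4)
    blended = np.einsum('nk,kij->nij', weights, joint_transforms)         # (N, 4, 4)
    return np.einsum('nij,nj->ni', blended, homo)[:, :3]

def rotate_joint_z(joint_transforms, k, angle_rad):
    """Rotate joint k about the z-axis. Simplified: the rotation is applied
    about the origin and is not propagated to child joints."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    rot = np.eye(4)
    rot[:2, :2] = [[c, -s], [s, c]]
    edited = joint_transforms.copy()
    edited[k] = rot @ edited[k]
    return edited

# Example edit: bend joint 2 by 30 degrees and re-pose the Gaussians.
centers = np.random.rand(1000, 3)
weights = np.random.dirichlet(np.ones(5), size=1000)  # (1000, 5), rows sum to 1
joints = np.tile(np.eye(4), (5, 1, 1))                # rest pose: identity transforms
posed = lbs(centers, weights, rotate_joint_z(joints, k=2, angle_rad=np.deg2rad(30)))
```

Because the edit only changes a handful of joint transforms, the re-pose is a single batched matrix blend, which is what makes real-time manipulation feasible.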
Our 4D generation pipeline consists of three stages: static 3D Gaussian generation, rigid motion modeling, and non-rigid motion refinement. Any existing image-to-3D method can be used for initialization. We extract a robust skeleton using UniRig and drive rigid motion via Linear Blend Skinning (LBS). Finally, fine-grained details are modeled using a HexPlane-based deformation field. This decomposition allows for explicit control over the object's pose.
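The sketch below shows one plausible form of a HexPlane-style deformation head: six learned 2D feature grids over the coordinate pairs (x,y), (x,z), (y,z), (x,t), (y,t), (z,t) are bilinearly sampled, fused by element-wise product, and decoded into per-Gaussian offsets. The plane resolution, feature width, and fusion rule are assumptions for illustration, not the paper's exact architecture.

```python
# Minimal PyTorch sketch of a HexPlane-style deformation field.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HexPlaneDeform(nn.Module):
    def __init__(self, res=64, feat=32):
        super().__init__()
        # Six learned 2D grids over coordinate pairs:
        # spatial (x,y), (x,z), (y,z) and spatio-temporal (x,t), (y,t), (z,t).
        self.pairs = [(0, 1), (0, 2), (1, 2), (0, 3), (1, 3), (2, 3)]
        self.planes = nn.ParameterList(
            [nn.Parameter(0.1 * torch.randn(1, feat, res, res)) for _ in self.pairs]
        )
        # Small MLP decoding the fused features into a 3D offset per Gaussian.
        self.decoder = nn.Sequential(nn.Linear(feat, 64), nn.ReLU(), nn.Linear(64, 3))

    def forward(self, xyz, t):
        """xyz: (N, 3) LBS-posed centers in [-1, 1]; t: (N, 1) time in [-1, 1]."""
        coords = torch.cat([xyz, t], dim=-1)  # (N, 4)
        fused = 1.0
        for plane, (a, b) in zip(self.planes, self.pairs):
            uv = coords[:, [a, b]].view(1, -1, 1, 2)             # (1, N, 1, 2)
            feat = F.grid_sample(plane, uv, align_corners=True)  # (1, C, N, 1)
            fused = fused * feat[0, :, :, 0].T                   # (N, C), product fusion
        return self.decoder(fused)                               # (N, 3) non-rigid offsets

# Usage: refine the rigidly posed centers with the learned residual.
model = HexPlaneDeform()
posed = torch.rand(1000, 3) * 2 - 1  # LBS output, normalized to [-1, 1]
t = torch.full((1000, 1), 0.25)      # query time, normalized
refined = posed + model(posed, t)
```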
We compare SkeletonGaussian with state-of-the-art 4D generation methods. Our method achieves superior visual quality and motion fidelity.
We discuss limitations, such as incorrect skeleton extraction on objects with complex topology and on non-articulated objects.
@inproceedings{wu2026skeletongaussian,
  title={SkeletonGaussian: Editable 4D Generation through Gaussian Skeletonization},
  author={Wu, Lifan and Zhu, Ruijie and Ai, Yubo and Zhang, Tianzhu},
  booktitle={AAAI},
  year={2026}
}