💃💃InfiniteDance: Scalable 3D Dance Generation Towards In-the-Wild Generalization💃💃

Abstract

Although existing 3D dance generation methods perform well in controlled scenarios, they often struggle to generalize in the wild. When conditioned on unseen music, existing methods often produce unstructured or physically implausible dances, largely due to limited music-to-dance data and restricted model capacity. This work aims to push the frontier of generalizable 3D dance generation by scaling up both data and model design. 1) On the data side, we develop a fully automated pipeline that reconstructs high-fidelity 3D dance motions from monocular videos. To eliminate the physical artifacts prevalent in existing reconstruction methods, we introduce a Foot Restoration Diffusion Model (FRDM) guided by foot-contact and geometric constraints, which enforces physical plausibility while preserving kinematic smoothness and expressiveness; the result is a diverse, high-quality multimodal 3D dance dataset totaling [OurHours] hours. 2) On the model side, we propose Choreographic LLaMA (ChoreoLLaMA), a scalable LLaMA-based architecture. To enhance robustness under unfamiliar music conditions, we integrate a retrieval-augmented generation (RAG) module that injects a retrieved reference dance as a prompt. Additionally, we design a slow/fast-cadence Mixture-of-Experts (MoE) module that enables ChoreoLLaMA to smoothly adapt motion rhythm across varying music tempos. Extensive experiments across diverse dance genres show that our approach surpasses existing methods in both qualitative and quantitative evaluations, marking a step toward scalable, real-world 3D dance generation.

Teaser


Method

Fig. 1. Overview of ChoreoLLaMA. Our approach consists of two main components: (1) a fully automated pipeline for high-fidelity 3D dance reconstruction built around the Foot Restoration Diffusion Model (FRDM), and (2) the Choreographic LLaMA (ChoreoLLaMA) architecture with retrieval-augmented generation (RAG) and slow/fast-cadence Mixture-of-Experts (MoE) modules for robust dance generation.

To achieve scalable and generalizable 3D dance generation, we design a two-pronged approach that addresses both data quality and model architecture.

Data Pipeline with Foot Restoration

We develop a fully automated pipeline that reconstructs high-fidelity 3D dance motions from monocular videos. To eliminate the physical artifacts prevalent in existing reconstruction methods, we introduce a Foot Restoration Diffusion Model (FRDM) guided by foot-contact and geometric constraints. This ensures physical plausibility while preserving kinematic smoothness and expressiveness, resulting in a diverse, high-quality multimodal 3D dance dataset.
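To make the two guidance terms concrete, the PyTorch snippet below is a minimal sketch, not the actual FRDM implementation: the tensor shapes, the y-up axis convention, the soft contact labels, and the guidance weight `w` are all our assumptions. It shows a foot-contact term that penalizes sliding of in-contact feet and a geometric term that penalizes penetration below the floor plane.

```python
# Minimal sketch of FRDM-style constraint guidance (illustrative only).
# Assumptions: y is the up axis, foot_pos holds foot-joint trajectories,
# and `contact` carries per-frame soft contact labels from the pipeline.
import torch

def foot_constraint_loss(foot_pos, contact, floor_height=0.0, tol=1e-2):
    """foot_pos: (T, J, 3) foot-joint positions; contact: (T, J) in [0, 1]."""
    # Foot-contact constraint: feet labeled as in-contact should not slide,
    # so penalize their frame-to-frame velocity (foot skating).
    vel = foot_pos[1:] - foot_pos[:-1]                          # (T-1, J, 3)
    loss_contact = (contact[1:, :, None] * vel).norm(dim=-1).mean()
    # Geometric constraint: feet should stay on or above the floor plane,
    # so penalize any penetration below floor_height (minus a tolerance).
    penetration = (floor_height - tol - foot_pos[..., 1]).clamp(min=0.0)
    loss_geom = penetration.mean()
    return loss_contact + loss_geom

# During diffusion sampling, a classifier-guidance-style step could nudge the
# predicted clean motion x0_hat along the gradient of this loss:
#   g = torch.autograd.grad(foot_constraint_loss(f(x0_hat), c), x0_hat)[0]
#   x0_hat = x0_hat - w * g   # w: guidance weight (hyperparameter)
```

Note that the sliding penalty is gated by the contact labels, so airborne frames are left untouched; this is one way to reconcile physical plausibility with the stated goal of preserving kinematic expressiveness.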

Choreographic LLaMA Architecture

We propose Choreographic LLaMA (ChoreoLLaMA), a scalable LLaMA-based architecture for 3D dance generation. To enhance robustness under unfamiliar music conditions, we integrate a retrieval-augmented generation (RAG) module that injects a retrieved reference dance as a prompt. Additionally, we design a slow/fast-cadence Mixture-of-Experts (MoE) module that enables ChoreoLLaMA to smoothly adapt motion rhythm across varying music tempos.
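The sketch below illustrates how these two modules could fit together; it is our own illustration under assumed interfaces, not the released architecture. The embedding bank, the scalar tempo feature, and all names and dimensions are hypothetical.

```python
# Minimal sketch (assumptions, not the paper's implementation) of the two
# robustness modules: (1) RAG retrieval of a reference dance by music-embedding
# similarity, and (2) a two-expert slow/fast-cadence MoE gated by tempo.
import torch
import torch.nn as nn
import torch.nn.functional as F

def retrieve_reference(music_emb, bank_embs, bank_motions):
    """music_emb: (D,); bank_embs: (N, D); bank_motions: N motion-token seqs.
    Returns the dance whose music is most similar to the query, to be
    prepended to the LLaMA context as a prompt."""
    sims = F.cosine_similarity(music_emb[None], bank_embs, dim=-1)  # (N,)
    return bank_motions[int(sims.argmax())]

class CadenceMoE(nn.Module):
    """Two FFN experts specialized for slow vs. fast rhythms; a tempo-aware
    gate blends them so motion cadence can track the music tempo smoothly."""
    def __init__(self, dim, hidden):
        super().__init__()
        self.slow = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(),
                                  nn.Linear(hidden, dim))
        self.fast = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(),
                                  nn.Linear(hidden, dim))
        self.gate = nn.Linear(dim + 1, 2)  # token feature + scalar tempo

    def forward(self, x, tempo):
        # x: (B, T, dim) hidden states; tempo: (B, 1), e.g. normalized BPM.
        t = tempo[:, None, :].expand(-1, x.size(1), -1)        # (B, T, 1)
        w = self.gate(torch.cat([x, t], dim=-1)).softmax(-1)   # (B, T, 2)
        return w[..., :1] * self.slow(x) + w[..., 1:] * self.fast(x)
```

Using a soft gate rather than a hard top-1 choice is one plausible reading of "smoothly adapt": as the estimated tempo rises, weight shifts gradually from the slow expert to the fast one instead of switching abruptly.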

Results

As shown in the following videos (please unmute for music), ChoreoLLaMA can generate high-quality 3D dances from the given music, with improved generalization to unseen, in-the-wild inputs.