PADS-TAL: Padding-Annealed Diffusion Sampling in Text-Aware Latent Space for Robust and Diverse Text-to-Music Generation

Abstract

Text-to-music diffusion models are increasingly used in real-world applications, yet deployment remains challenging. Generations can collapse to a limited set of patterns despite diverse initial noise and prompts, and inference-time diversity control often harms text alignment and fidelity by distorting key prompt cues established in early denoising. To address this, we propose Padding-Annealed Diffusion Sampling (PADS), which perturbs only a padding-indexed subspace while keeping non-padding conditioning fixed, enabling controlled exploration with reduced semantic drift. However, in a text-unaware VAE latent space, such exploration is less likely to stay within genre-faithful neighborhoods, which limits genre-consistent diversity. We therefore introduce a Text-Aware Latent space (TAL) that aligns local neighborhoods with text-implied genre structure, promoting genre-consistent diversity. Together, the two techniques form a unified pipeline, PADS-TAL, that achieves a better trade-off between text alignment and diversity than prior full-conditioning perturbation: at comparable text alignment, it delivers 15.4% higher diversity with a relatively small fidelity drop, and further improves within-genre diversity by 71.6%.
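To make the idea of perturbing only the padding-indexed subspace concrete, the following is a minimal NumPy sketch, not the implementation used for the samples on this page. It assumes a piecewise-linear annealing schedule of the kind used in condition-annealing methods, a normalized diffusion time `t` (1.0 is pure noise, 0.0 is the clean sample), and a boolean padding mask over the text-encoder sequence. The function names, the thresholds `tau1` and `tau2`, and `noise_scale` are all hypothetical choices for illustration.

```python
import numpy as np


def gamma_schedule(t, tau1=0.6, tau2=0.9):
    """Annealing coefficient in [0, 1] (assumed piecewise-linear form).

    t is normalized diffusion time: 1.0 = pure noise, 0.0 = clean sample.
    gamma = 0 means the perturbed positions are fully noised (early denoising),
    gamma = 1 means no perturbation (late denoising).
    """
    if t <= tau1:
        return 1.0
    if t >= tau2:
        return 0.0
    return (tau2 - t) / (tau2 - tau1)


def perturb_padding_only(cond, pad_mask, t, noise_scale=0.1, rng=None):
    """Perturb conditioning embeddings at padding positions only.

    cond:     (seq_len, dim) text-conditioning embeddings.
    pad_mask: (seq_len,) boolean array, True where the token is padding.
    Non-padding rows are returned unchanged (hypothetical illustration).
    """
    rng = np.random.default_rng() if rng is None else rng
    g = gamma_schedule(t)
    noise = rng.standard_normal(cond.shape)
    # Annealed mix of the original embedding and Gaussian noise.
    noised = np.sqrt(g) * cond + noise_scale * np.sqrt(1.0 - g) * noise
    out = cond.copy()
    out[pad_mask] = noised[pad_mask]  # only padding rows are replaced
    return out
```

In a sampler, a function like `perturb_padding_only` would be called once per denoising step before the conditioning is passed to the denoiser. Full-conditioning perturbation corresponds to setting every entry of `pad_mask` to `True`, which is the case the padding-only variant is designed to avoid.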

Audio Examples

Baseline: Limited diversity despite high audio quality, with occasional genre mismatch.
CADS: Enhanced diversity but degraded audio quality, suffering from semantic drift.
PADS (Ours): Achieves rich diversity and high audio quality while maintaining strict genre consistency.

The audio samples above were produced by a generative model as demonstration outputs. Unauthorized reproduction or use is prohibited.

Task 1: [Samples with Single Prompts & Fixed Initial Noise]

Note: Each sample is generated twice from the same initial seed; the seed differs across prompts.

Prompt	Baseline	PADS (Ours)	CADS
tropical house, drums, 105 bpm, dance
timpani, soundtrack, 125 bpm, soothing, ambient
groove, 125 bpm, dance, upbeat, happy, musical instrument
piano, 100 bpm, double bass, light, calm

Task 2: [Samples with Single Prompts & Random Initial Noise]

Input Prompt 1: lofi, chill, vinyl noise, mellow, soft drums, warm chords, nostalgic

Baseline	PADS-TAL (Ours)	CADS

Input Prompt 2: tropical house, drums, 105 bpm, animal, dance, piano

Baseline	PADS-TAL (Ours)	CADS

Input Prompt 3: jazz, piano trio, swing, live recording feel, warm tone, improvisation

Baseline	PADS-TAL (Ours)	CADS

Input Prompt 4: hopeful, mellow, acoustic

Baseline	PADS-TAL (Ours)	CADS

Task 3: [Samples with Various Prompts & Random Initial Noise]

Pop

Prompt	Baseline	PADS-TAL (Ours)	CADS
110 bpm, bass guitar, hip hop, pop, drums
pop, electric guitar, bass guitar, synthesizer, 110 bpm, funky, plucked string instrument
inspiring, pop, guitar, 110 bpm, musical instrument, acoustic guitar, dance

Electronic

Prompt	Baseline	PADS-TAL (Ours)	CADS
upbeat, drum machine, electric piano, 125 bpm, dubstep
upbeat, drum and bass, bass, musical instrument, aggressive, instrumental, sampler, electronic, cinematic, 125 bpm
synthesizers, inspiring, 125 bpm, summer, spray, waterfall, electronic

Electronic Pop

Prompt	Baseline	PADS-TAL (Ours)	CADS
110 bpm, passionate, boing, emotional, soulful, electronic pop, love
saturday night, boing, electronic pop, double bass, 110 bpm, happiness, groovy, birthday, the synthesizer, piano, inside
funky, guitar, electronic pop, 110 bpm, positive, musical instrument, groovy, outside, boing, upbeat

New Age

Prompt	Baseline	PADS-TAL (Ours)	CADS
emotional, new age, piano
115 bpm, tranquil, new-age music, moving, piano, strings, melancholic, lullaby, emotional
musical instrument, emotional, piano, moving, peaceful, reflective, relaxed, calm, sentimental, electric piano, new age
