Abstract
Text-to-Music diffusion models are increasingly used in real-world applications, yet deployment remains challenging: generations can collapse to limited patterns despite diverse initial noise and prompts, and inference-time diversity control often harms text alignment and fidelity by distorting key prompt cues established in early denoising.
To address this, we propose Padding-Annealed Diffusion Sampling (PADS), which perturbs only a padding-indexed subspace while keeping non-padding conditioning fixed, enabling controlled exploration with reduced semantic drift.
However, in a text-unaware VAE latent space, such exploration is less likely to stay within genre-faithful neighborhoods, which limits genre-consistent diversity. We therefore introduce a Text-Aware Latent (TAL) space that aligns local neighborhoods with text-implied genre structure, promoting genre-consistent diversity.
Together, the two techniques form a unified pipeline that, compared to prior full-conditioning perturbation, achieves a better trade-off between text alignment and diversity: at comparable text alignment, it delivers 15.4% higher diversity with a relatively small fidelity drop, and further improves within-genre diversity by 71.6%.
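The idea of perturbing only the padding-indexed part of the text conditioning can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`anneal_gamma`, `perturb_padding_only`), the piecewise-linear schedule with thresholds `tau1` and `tau2`, and the `noise_scale` value are all assumptions, loosely modeled on condition-annealing schemes in which noise is strongest early in denoising and vanishes later. The conditioning is assumed to be a `(seq_len, dim)` array of token embeddings with a `(seq_len,)` attention mask where 0 marks padding.

```python
import numpy as np


def anneal_gamma(t, tau1=0.6, tau2=0.9):
    """Hypothetical schedule: t in [0, 1], where t = 1 is pure noise.

    Returns 1.0 (no perturbation) for t <= tau1, 0.0 (full perturbation)
    for t >= tau2, and interpolates linearly in between.
    """
    if t <= tau1:
        return 1.0
    if t >= tau2:
        return 0.0
    return (tau2 - t) / (tau2 - tau1)


def perturb_padding_only(cond, attention_mask, t, noise_scale=0.25, rng=None):
    """Add annealed Gaussian noise to padding token embeddings only.

    cond:           (seq_len, dim) text-encoder embeddings
    attention_mask: (seq_len,) with 1 for real tokens and 0 for padding
    Non-padding rows are returned unchanged.
    """
    rng = np.random.default_rng() if rng is None else rng
    gamma = anneal_gamma(t)
    noise = rng.standard_normal(cond.shape)
    # Assumed mixing rule: blend the clean embedding with scaled noise.
    noised = np.sqrt(gamma) * cond + noise_scale * np.sqrt(1.0 - gamma) * noise
    is_pad = (attention_mask == 0)[:, None]  # broadcast over the dim axis
    return np.where(is_pad, noised, cond)


# Example: 3 real tokens followed by 3 padding tokens.
cond = np.ones((6, 4))
mask = np.array([1, 1, 1, 0, 0, 0])
early = perturb_padding_only(cond, mask, t=1.0, rng=np.random.default_rng(0))
late = perturb_padding_only(cond, mask, t=0.0, rng=np.random.default_rng(0))
```

In this sketch, `early` differs from `cond` only in the padding rows, while `late` equals `cond` everywhere, so the prompt tokens that carry the semantic cues are never touched at any step.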
Audio Examples
Baseline: Limited diversity despite high audio quality, with occasional genre mismatch.
CADS: Enhanced diversity but degraded audio quality, suffering from semantic drift.
PADS (Ours): Achieves rich diversity and high audio quality while maintaining strict genre consistency.
The audio samples above were produced by a generative model as demonstration outputs. Unauthorized reproduction or use is prohibited.
Task 1: [Samples with Single Prompts & Fixed Initial Noise]
Note: Every sample is generated twice from the same initial seed; the seed differs across prompts.
| Prompt | Baseline | PADS (Ours) | CADS |
|---|---|---|---|
| tropical house, drums, 105 bpm, dance | |||
| timpani, soundtrack, 125 bpm, soothing, ambient | |||
| groove, 125 bpm, dance, upbeat, happy, musical instrument | |||
| piano, 100 bpm, double bass, light, calm | |||
Task 2: [Samples with Single Prompts & Random Initial Noise]
Input Prompt 1: lofi, chill, vinyl noise, mellow, soft drums, warm chords, nostalgic
| Baseline | PADS-TAL (Ours) | CADS |
|---|---|---|
Input Prompt 2: tropical house, drums, 105 bpm, animal, dance, piano
| Baseline | PADS-TAL (Ours) | CADS |
|---|---|---|
Input Prompt 3: jazz, piano trio, swing, live recording feel, warm tone, improvisation
| Baseline | PADS-TAL (Ours) | CADS |
|---|---|---|
Input Prompt 4: hopeful, mellow, acoustic
| Baseline | PADS-TAL (Ours) | CADS |
|---|---|---|
Task 3: [Samples with Various Prompts & Random Initial Noise]
Pop
| Prompt | Baseline | PADS-TAL (Ours) | CADS |
|---|---|---|---|
| 110 bpm, bass guitar, hip hop, pop, drums | |||
| pop, electric guitar, bass guitar, synthesizer, 110 bpm, funky, plucked string instrument | |||
| inspiring, pop, guitar, 110 bpm, musical instrument, acoustic guitar, dance | |||
Electronic
| Prompt | Baseline | PADS-TAL (Ours) | CADS |
|---|---|---|---|
| upbeat, drum machine, electric piano, 125 bpm, dubstep | |||
| upbeat, drum and bass, bass, musical instrument, aggressive, instrumental, sampler, electronic, cinematic, 125 bpm | |||
| synthesizers, inspiring, 125 bpm, summer, spray, waterfall, electronic | |||
Electronic Pop
| Prompt | Baseline | PADS-TAL (Ours) | CADS |
|---|---|---|---|
| 110 bpm, passionate, boing, emotional, soulful, electronic pop, love | |||
| saturday night, boing, electronic pop, double bass, 110 bpm, happiness, groovy, birthday, the synthesizer, piano, inside | |||
| funky, guitar, electronic pop, 110 bpm, positive, musical instrument, groovy, outside, boing, upbeat | |||
New Age
| Prompt | Baseline | PADS-TAL (Ours) | CADS |
|---|---|---|---|
| emotional, new age, piano | |||
| 115 bpm, tranquil, new-age music, moving, piano, strings, melancholic, lullaby, emotional | |||
| musical instrument, emotional, piano, moving, peaceful, reflective, relaxed, calm, sentimental, electric piano, new age | |||