DDSP Vocal Effects Supplement

Vocoding Approach

Below are a number of audio examples that demonstrate the vocoding approach, i.e., modifying the harmonic distribution generated by a timbre transfer decoder. Two models were used: one trained on brass instruments, and another on a harsh synthesizer. For every example, the interpolation parameter p is indicated.
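As a rough illustration of this operation, assume each model emits a per-frame harmonic distribution of shape (n_frames, n_harmonics). A linear blend controlled by p is one plausible reading of the interpolation; the function name and blending rule below are assumptions, not the supplement's exact implementation:

```python
import numpy as np

def blend_harmonic_distributions(hd_voice, hd_instrument, p):
    """Interpolate two per-frame harmonic distributions.

    hd_voice, hd_instrument: arrays of shape (n_frames, n_harmonics),
        each row summing to 1 (a normalized harmonic distribution).
    p: interpolation parameter in [0, 1]; p=0 keeps the voice model's
        distribution, p=1 uses the instrument model's.
    """
    mixed = (1.0 - p) * hd_voice + p * hd_instrument
    # A convex combination of normalized rows stays normalized;
    # renormalize anyway to guard against numerical drift.
    return mixed / (mixed.sum(axis=-1, keepdims=True) + 1e-8)
```

The blended distribution can then be fed to the harmonic synthesizer in place of the voice model's original output.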

All examples were generated from the same input:

Latent Approach

The following audio examples demonstrate the latent approach. They were generated from the same input sample as the vocoding examples.

First, the output of a model trained on a dataset consisting of a single person singing children’s songs:

Next, the output of a model trained on a combined dataset of the children's songs recordings and brass instruments. Early in training, the model still struggles to reproduce intelligible lyrics:

After 4000 steps
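For context, one plausible way to assemble such a combined dataset is to interleave examples from the two sources at training time. The sketch below uses tf.data (DDSP is TensorFlow-based) with hypothetical file names and an assumed 50/50 sampling weight; the supplement does not specify how the datasets were merged:

```python
import tensorflow as tf

# Hypothetical file names; the supplement does not describe the data layout.
voice_ds = tf.data.TFRecordDataset("children_songs.tfrecord")
brass_ds = tf.data.TFRecordDataset("brass.tfrecord")

# Interleave the two sources so every batch mixes both timbres.
# The 50/50 weighting is an assumption, not a documented choice.
combined = tf.data.Dataset.sample_from_datasets(
    [voice_ds.repeat(), brass_ds.repeat()],
    weights=[0.5, 0.5],
)
combined = combined.shuffle(1000).batch(16)
```

With both sources repeated indefinitely, the length of training is governed by the step count, which matches the step-based checkpoints shown here.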

As the training progresses, the reproduced lyrics become clearer, but the timbre still appears to be affected by the added brass recordings:

After 10000 steps

When the training continues beyond this point, the model learns to reconstruct the singing voice portion of the training dataset more and more accurately. Eventually, the influence of the additional instruments fades and the output becomes quite similar to that of the model trained only on the singing voice:

After 20000 steps