DDSP Vocal Effects Supplement

Vocoding Approach

Below are a number of audio examples that demonstrate the vocoding approach, i.e., modifying the harmonic distribution generated by a timbre transfer decoder. Two models were used: one trained on brass instruments, and another on a harsh synthesizer. For every example, the interpolation parameter p is indicated.
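As a rough illustration of this operation, assume each model emits a per-frame harmonic distribution of shape (n_frames, n_harmonics). A linear blend controlled by p is one plausible reading of the interpolation; the function name and blending rule below are assumptions, not the supplement's exact implementation:

```python
import numpy as np

def blend_harmonic_distributions(hd_voice, hd_instrument, p):
    """Interpolate two per-frame harmonic distributions.

    hd_voice, hd_instrument: arrays of shape (n_frames, n_harmonics),
        each row summing to 1 (a normalized harmonic distribution).
    p: interpolation parameter in [0, 1]; p=0 keeps the voice model's
        distribution, p=1 uses the instrument model's.
    """
    mixed = (1.0 - p) * hd_voice + p * hd_instrument
    # A convex combination of normalized rows stays normalized;
    # renormalize anyway to guard against numerical drift.
    return mixed / (mixed.sum(axis=-1, keepdims=True) + 1e-8)
```

The blended distribution can then be fed to the harmonic synthesizer in place of the voice model's original output.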

All examples were generated from the same input:

Latent Approach

The following audio examples demonstrate the latent approach. They were generated from the same input sample as the vocoding examples.

First, the output of a model trained on a dataset consisting of a single person singing children’s songs:

Next, the output of a model trained on a combined dataset of the children's songs recordings and brass instruments. Early in training, the model still struggles to reproduce intelligible lyrics:

After 4000 steps
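For context, one plausible way to assemble such a combined dataset is to interleave examples from the two sources at training time. The sketch below uses tf.data (DDSP is TensorFlow-based) with hypothetical file names and an assumed 50/50 sampling weight; the supplement does not specify how the datasets were merged:

```python
import tensorflow as tf

# Hypothetical file names; the supplement does not describe the data layout.
voice_ds = tf.data.TFRecordDataset("children_songs.tfrecord")
brass_ds = tf.data.TFRecordDataset("brass.tfrecord")

# Interleave the two sources so every batch mixes both timbres.
# The 50/50 weighting is an assumption, not a documented choice.
combined = tf.data.Dataset.sample_from_datasets(
    [voice_ds.repeat(), brass_ds.repeat()],
    weights=[0.5, 0.5],
)
combined = combined.shuffle(1000).batch(16)
```

With both sources repeated indefinitely, the length of training is governed by the step count, which matches the step-based checkpoints shown here.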

As the training progresses, the reproduced lyrics become clearer, but the timbre still appears to be affected by the added brass recordings:

After 10000 steps

When the training continues beyond this point, the model learns to reconstruct the singing voice portion of the training dataset more and more accurately. Eventually, the influence of the additional instruments fades and the output becomes quite similar to that of the model trained only on the singing voice:

After 20000 steps