Generating Audio Using Recurrent Neural Networks

a PhD Dissertation by Andrew Pfalz

This page is a brief overview of my dissertation work.

Code and musical examples coming soon!



[Figure: input audio x is fed to the model, which produces sampled output ŷ.]




OVERVIEW

Long Short-Term Memory networks (LSTMs) are a type of recurrent neural network (RNN) designed for modeling sequence data. An LSTM computes a prediction and a new cell state at each timestep. The process can be broken down into four conceptual steps, shown in the equations below.


ft = σ(Wf · [ŷt-1, xt] + bf)
it = σ(Wi · [ŷt-1, xt] + bi)
Ĉt = tanh(WC · [ŷt-1, xt] + bC)
Ct = ft * Ct-1 + it * Ĉt
ot = σ(Wo · [ŷt-1, xt] + bo)
ŷt = ot * tanh(Ct)
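
To make these four steps concrete, here is a minimal NumPy sketch of a single LSTM timestep. The function name lstm_step and the dict-of-weights layout are my own illustration, not the dissertation's code:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, y_prev, C_prev, W, b):
    # concatenate the previous prediction and the current input: [ŷt-1, xt]
    v = np.concatenate([y_prev, x_t])
    f_t = sigmoid(W['f'] @ v + b['f'])    # forget gate: what to drop from the cell state
    i_t = sigmoid(W['i'] @ v + b['i'])    # input gate: what to write to the cell state
    C_hat = np.tanh(W['C'] @ v + b['C'])  # candidate cell state
    C_t = f_t * C_prev + i_t * C_hat      # new cell state
    o_t = sigmoid(W['o'] @ v + b['o'])    # output gate: what to expose as the prediction
    y_t = o_t * np.tanh(C_t)              # prediction
    return y_t, C_t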


TRAINING

LSTMs learn by making predictions based on inputs. The pseudocode below shows the process of training a model. At each time step the model sees an array of inputs and predicts an output for each array element. The accuracy of the prediction is evaluated by comparing it to a label via the loss function. The error in the prediction, the loss, is passed to the optimizer, which updates the internal weights of the model via backpropagation. The result of this process is that the next time the model sees the same input, its prediction will be more accurate.

for each time step:
    input, label          = get_new_input_and_label()
    prediction, new_state = model.predict(input, prev_state)
    current_loss          = loss(label, prediction)
    optimizer.update_weights(current_loss)
    prev_state            = new_state
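
For concreteness, here is a minimal PyTorch version of the same loop. The layer sizes, the Adam optimizer, and the data helper are illustrative stand-ins, not the configuration used in the dissertation:

import torch
import torch.nn as nn

input_size = hidden_size = 256   # illustrative sizes
num_steps  = 10_000

model     = nn.LSTM(input_size, hidden_size, batch_first=True)
readout   = nn.Linear(hidden_size, input_size)
loss_fn   = nn.MSELoss()
optimizer = torch.optim.Adam(list(model.parameters()) + list(readout.parameters()))

state = None
for step in range(num_steps):
    inputs, labels = get_new_input_and_label()   # placeholder data function
    out, state = model(inputs, state)
    state = tuple(s.detach() for s in state)     # truncate backpropagation through time
    prediction = readout(out)
    current_loss = loss_fn(prediction, labels)
    optimizer.zero_grad()
    current_loss.backward()
    optimizer.step()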

    



SAMPLING

To generate a sampled output, the model is shown an arbitrary seed and asked to predict what comes next in the sequence. The model is then asked what comes after its first prediction, assuming that prediction was correct. This process is repeated until the desired output length is reached.

To keep the seed the same length at each iteration, the final portion of the prediction is appended to the end of the seed and the first portion of the seed is dequeued.

seed   = get_new_seed()
output = []
for each sampling time step:
    prediction, new_state = model.predict(seed, prev_state)
    output.append(prediction[-1])
    seed.append(prediction[-1])
    seed = seed[1:]
    prev_state = new_state
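
A PyTorch sketch of this sampling loop, continuing from the training sketch above; get_new_seed and sample_steps are placeholders:

import torch

seed   = get_new_seed()          # placeholder: shape [1, seed_len, input_size]
state  = None
output = []
sample_steps = 1_000             # illustrative output length

with torch.no_grad():
    for _ in range(sample_steps):
        out, state = model(seed, state)
        last = readout(out)[:, -1:, :]                   # keep only the newest predicted frame
        output.append(last)
        seed = torch.cat([seed[:, 1:, :], last], dim=1)  # enqueue prediction, dequeue oldest

audio = torch.cat(output, dim=1)                         # assembled output, one frame at a time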

        

SEQUENTIAL

for each epoch:
    for each timestep:
        # train normally
        input        = get_input()
        label        = get_label()

        y_hat, state = model.guess(input, state)
        current_loss = loss(y_hat, label)
        model.update_weights(current_loss)

        # periodically pause training to sample an output
        if timestep % wait == 0:
            seed   = get_seed()
            output = []
            for each dream_timestep:
                pred = model.guess(seed)
                output.append(pred[-1])
                seed.append(pred[-1])
                seed = seed[1:]




        

When sampling from a model, the simplest approach is to periodically pause training to sample an output. This algorithm is shown in the SEQUENTIAL pseudocode above. In practice this method works some of the time but can get stuck predicting the same output over and over. The CONCURRENT pseudocode below shows a slightly different algorithm that performs better in practice: the model trains and samples concurrently.

CONCURRENT

seed   = get_seed()
output = []
for each epoch:
    for each timestep:
        # train normally
        input        = get_input()
        label        = get_label()
        y_hat, state = model.guess(input, state)
        current_loss = loss(y_hat, label)
        model.update_weights(current_loss)

        # make a prediction
        pred = model.guess(seed)
        output.append(pred[-1])
        seed.append(pred[-1])
        seed = seed[1:]
        if len(output) == output_length:
            save_output()
            output = []
            seed   = get_seed()
        



VECTOR APPROACH

The first approach to sampling from an LSTM has the model predict a vector of audio samples for each input vector. The input and output vectors are the same size and can contain overlapping or non-overlapping samples. This approach is the fastest of the four presented here. The predicted vectors can sometimes have discontinuities at their boundaries, which are problematic.
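
As an illustration, here is one way the framing might be done in NumPy; make_vectors and the default sizes are assumptions, not the dissertation's preprocessing code:

import numpy as np

def make_vectors(audio, vector_length=512, hop=512):
    # hop == vector_length gives non-overlapping vectors;
    # hop <  vector_length gives overlapping ones
    n = (len(audio) - vector_length) // hop + 1
    return np.stack([audio[i * hop : i * hop + vector_length] for i in range(n)])

# audio: a 1-D array of samples; inputs and labels are offset by one
# vector so the model learns to predict the next vector in the sequence
vectors = make_vectors(audio)
inputs, labels = vectors[:-1], vectors[1:]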


VECTOR APPROACH AUDIO

Audio Example 1: Vector length 1024

Audio Example 2: Vector length 512

Audio Example 3: Vector length 256

Audio Example 4: Vector length 128








VECTOR APPROACH ANALYSIS

Changing the length of the predicted vectors drastically changes the quality of the sampled outputs. When the vectors are long, the model produces samples that evolve slowly. When the vectors are short, the model produces samples that change more rapidly but resemble the input data less.




MAGNITUDE SPECTRUM APPROACH

This approach formats the data in the same manner as the vector approach, except that each vector is a magnitude spectrum window. The raw audio is broken into vectors as before, then each vector is transformed via the FFT. The phase information is omitted and the magnitudes are normalized to the range 0 to 1. The sampled outputs are resynthesized via a channel vocoder, which works by linearly interpolating between the transpose of the predicted windows. This produces an envelope at the audio sampling rate for each bin of the FFT. Each envelope is applied to an oscillator with its frequency set to the center frequency of the corresponding FFT bin.
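
A minimal NumPy sketch of this data preparation, under assumed window and hop sizes:

import numpy as np

def make_magnitude_windows(audio, window_size=1024, hop=256):
    n = (len(audio) - window_size) // hop + 1
    frames = np.stack([audio[i * hop : i * hop + window_size] for i in range(n)])
    mags = np.abs(np.fft.rfft(frames, axis=1))   # FFT per window, phase discarded
    return mags / mags.max()                     # normalize magnitudes to [0, 1]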




VOCODER DESIGN
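
The vocoder described above can be sketched in NumPy roughly as follows; the function name vocode and all parameter values are illustrative assumptions:

import numpy as np

def vocode(windows, window_size=1024, hop=256, sr=44100):
    num_windows, num_bins = windows.shape        # num_bins == window_size // 2 + 1
    tracks = windows.T                           # transpose: one magnitude track per bin
    frame_times  = np.arange(num_windows) * hop  # frame positions in samples
    sample_times = np.arange(num_windows * hop)
    t = sample_times / sr
    out = np.zeros(len(sample_times))
    for k in range(num_bins):
        # linear interpolation gives an audio-rate envelope for this bin
        env  = np.interp(sample_times, frame_times, tracks[k])
        freq = k * sr / window_size              # bin center frequency
        out += env * np.sin(2 * np.pi * freq * t)
    return out / np.abs(out).max()               # normalize to avoid clipping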


MAGNITUDE SPECTRUM APPROACH AUDIO

Audio Example 5: Hop Size 1024

Audio Example 6: Hop Size 512

Audio Example 7: Hop Size 256

Audio Example 8: Hop Size 128

Audio Example 9: Hop Size 56





MAGNITUDE SPECTRUM ANALYSIS

As with the vector length in the vector approach, changing the hop size of the FFT used to produce the data drastically changes the quality of the sampled outputs. When the hop size is large, the frequencies are overlapped and blurred, creating a subtle, evolving texture. As the hop size is decreased, transitions between different pitches become increasingly clear. When the hop size is very small, sequences of notes from the input data can be observed; however, the model also tends to become somewhat unstable.


TRANSPOSE APPROACH

This approach features the most exotic handling of the model outputs. The output of the LSTM is the same shape as the input, [number_of_unrollings, vector_length]. It is passed through a fully connected layer to change the shape to [number_of_unrollings, 1]. The transpose of this column vector is passed through another fully connected layer to change the shape to [1, 1], yielding a single audio sample per step.
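
A small PyTorch sketch of this shape manipulation, with assumed sizes (the layer names fc1 and fc2 are mine):

import torch
import torch.nn as nn

num_unrollings, vector_length = 64, 256                # assumed sizes
lstm_out = torch.randn(num_unrollings, vector_length)  # stand-in for the LSTM output

fc1 = nn.Linear(vector_length, 1)    # [num_unrollings, vector_length] -> [num_unrollings, 1]
fc2 = nn.Linear(num_unrollings, 1)   # [1, num_unrollings] -> [1, 1]

col    = fc1(lstm_out)               # one value per unrolling
sample = fc2(col.T)                  # transpose the column vector, reduce to a single sample
# sample.shape == torch.Size([1, 1]): one audio sample per step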


TRANSPOSE APPROACH AUDIO

Audio Example 10: Transpose

TRANSPOSE ANALYSIS

This method hardly ever produces noisy outputs; in that respect it is the most successful. The outputs themselves tend to blend all the frequencies in the seed. The biggest drawback of this approach is that it takes significantly more time to produce an output than the previous two approaches, because the model predicts one audio sample at a time rather than a vector of samples at a time.


COLUMN VECTOR APPROACH

This is the most successful approach. The model predicts one sample for each input vector, which is accomplished by passing the output of the LSTM through a fully connected layer. The sampled outputs from this model tend to feature notes from the input data, but often in a different order than they appear in the input. Occasionally the notes are surreally distorted: for instance, piano notes are sustained or grow in volume rather than decaying after their initial attack. Because of the way the input data is offset from the label data, only the final element of the predicted vector is not already present elsewhere in the seed. This final audio sample is the only piece that is kept during sampling, so, like the transpose approach, this approach is very slow to produce outputs.
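
A minimal PyTorch sketch of this readout, with assumed sizes:

import torch
import torch.nn as nn

num_unrollings, vector_length = 64, 256                # assumed sizes
lstm_out = torch.randn(num_unrollings, vector_length)  # stand-in for the LSTM output

fc = nn.Linear(vector_length, 1)
predictions = fc(lstm_out)       # [num_unrollings, 1]: one sample per input vector
new_sample  = predictions[-1]    # only the final sample is new; it alone is kept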


COLUMN VECTOR APPROACH AUDIO

Audio Example 11: Column Vector

COLUMN VECTOR ANALYSIS

The sampled outputs produced with this approach most closely resemble what a human might guess if presented with the same task as the LSTM. Notes from the input are repeated, but in a different order. Some minor changes might be made to the material, but generally it remains recognizable. This is in stark contrast to the other methods, which sometimes produce outputs where the input data is only barely discernible.