This page is a brief overview of my dissertation work. It explains how LSTMs are trained and sampled, and compares four approaches to generating audio with them.
[Figure: input audio x and sampled output ŷ]
OVERVIEW
Long Short-Term Memory networks (LSTMs) are a type of Recurrent Neural Network (RNN) designed for modeling sequence data. At each timestep an LSTM computes a prediction and a new cell state. The process can be broken down into four conceptual steps.
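For reference, the four steps in the standard LSTM formulation are as follows (this is the common textbook notation, assumed here rather than taken from the dissertation):

f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)   % forget gate: what to discard from the cell state
i_t = \sigma(W_i [h_{t-1}, x_t] + b_i), \quad \tilde{C}_t = \tanh(W_C [h_{t-1}, x_t] + b_C)   % input gate and candidate values
C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t   % new cell state
o_t = \sigma(W_o [h_{t-1}, x_t] + b_o), \quad h_t = o_t \odot \tanh(C_t)   % output gate and prediction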
TRAINING
LSTMs learn by making predictions based on inputs. The pseudocode below shows the process of training a model. At each time step the model sees an array of inputs and predicts an output for each array element. The accuracy of the prediction is evaluated by comparing it to a label via the loss function. The error in the prediction, the loss, is passed to the optimizer, which updates the model's internal weights via backpropagation. The result is that the next time the model sees the same input, its prediction will be more accurate.
for each time step:
    input, label = get_new_input_and_label()
    prediction, new_state = model.predict(input, prev_state)
    current_loss = loss(label, prediction)
    optimizer.update_weights(current_loss)
    prev_state = new_state
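A minimal runnable version of this loop, sketched here in PyTorch (the framework, layer sizes, and stand-in data are illustrative assumptions, not the dissertation's actual setup):

import torch
import torch.nn as nn

model = nn.LSTM(input_size=128, hidden_size=128, batch_first=True)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

prev_state = None  # (h, c); None lets PyTorch initialize zeros
for step in range(1000):
    # stand-in data: each label is the input shifted one step forward
    input = torch.randn(1, 16, 128)
    label = torch.roll(input, shifts=-1, dims=1)
    prediction, new_state = model(input, prev_state)
    prev_state = tuple(s.detach() for s in new_state)  # truncate backprop through time
    current_loss = loss_fn(prediction, label)
    optimizer.zero_grad()
    current_loss.backward()
    optimizer.step()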
SAMPLING
To generate a sampled output, the model is shown an arbitrary seed and asked what it thinks comes next in the sequence. The model is then asked what comes after its first prediction, assuming that prediction was correct. This process repeats until the desired output length is reached.
To keep the seed the same length at each iteration, the final portion of the prediction is appended to the end of the seed and the first portion of the seed is dequeued.
seed = get_new_seed()
output = []
for each sampling time step:
    prediction, new_state = model.predict(seed, prev_state)
    output.append(prediction[-1])
    seed.append(prediction[-1])
    seed = seed[1:]
    prev_state = new_state
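The seed bookkeeping is the subtle part. This toy NumPy sketch, with a stand-in "model" that just increments its input, shows how the queue stays a fixed length while the output grows:

import numpy as np

def fake_predict(seed):
    return seed + 1  # stand-in for model.predict

seed = np.arange(8.0)  # fixed-length seed
output = []
for _ in range(5):
    prediction = fake_predict(seed)
    output.append(prediction[-1])               # keep the newest sample
    seed = np.append(seed, prediction[-1])[1:]  # enqueue new, dequeue oldest
print(output)  # [8.0, 9.0, 10.0, 11.0, 12.0]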
SEQUENTIAL
for each epoch:
    for each timestep:
        input = get_input()
        label = get_label()
        y_hat, state = model.guess(input, state)
        loss = loss(y_hat, label)
        model.update_weights(loss)
        if timestep % wait == 0:
            seed = get_seed()
            output = []
            for each dream_timestep:
                pred = model.guess(seed)
                output.append(pred[-1])
                seed.append(pred[-1])
                seed = seed[1:]
When sampling from a model, the simplest approach is to periodically pause training to sample an output. This algorithm is shown in the pseudocode above (SEQUENTIAL). In practice this method works some of the time but can get stuck predicting the same output over and over. The pseudocode below (CONCURRENT) shows a slightly different algorithm that performs better in practice: the model trains and samples concurrently.
CONCURRENT
seed = get_seed()
output = []
for each epoch:
    for each timestep:
        # train normally
        input = get_input()
        label = get_label()
        y_hat, state = model.guess(input, state)
        loss = loss(y_hat, label)
        model.update_weights(loss)
        # make a prediction
        pred = model.guess(seed)
        output.append(pred[-1])
        seed.append(pred[-1])
        seed = seed[1:]
        if len(output) == output_length:
            save_output()
            output = []
            seed = get_seed()
VECTOR APPROACH
The first approach to sampling from an LSTM has the model predict a vector of audio samples for each input vector. These vectors are the same size and can contain overlapping or non-overlapping samples. This approach is the fastest of the four presented here. The predicted vectors can sometimes have discontinuities at their extremities, which are audible as clicks.
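A sketch of the data preparation this implies, slicing raw audio into fixed-length vectors with an optional hop for overlap (the function name and sizes are illustrative, not from the dissertation):

import numpy as np

def frame_audio(audio, vector_length=512, hop=512):
    # hop == vector_length gives non-overlapping vectors;
    # hop < vector_length gives overlapping ones
    n = (len(audio) - vector_length) // hop + 1
    return np.stack([audio[i * hop : i * hop + vector_length]
                     for i in range(n)])

audio = np.random.randn(44100)          # one second of stand-in audio
vectors = frame_audio(audio, 512, 256)  # 50% overlap
print(vectors.shape)                    # (number_of_vectors, 512)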
VECTOR APPROACH AUDIO
Audio Example 1: Vector length 1024
Audio Example 2: Vector length 512
Audio Example 3: Vector length 256
Audio Example 4: Vector length 128
VECTOR APPROACH ANALYSIS
Changing the length of the predicted vectors drastically changes the quality of the sampled outputs. When the vectors are long, the model produces samples that evolve slowly. When the vectors are short, the model produces samples that change more rapidly but resemble the input data less.
MAGNITUDE SPECTRUM APPROACH
This approach formats the data in the same manner as the vector approach, except that each vector is a magnitude-spectrum window. The raw audio is broken into vectors as before, then each vector is transformed via the FFT. The phase information is discarded and the magnitudes are normalized to the range 0 to 1. The sampled outputs are resynthesized via a channel vocoder, which linearly interpolates between the transpose of the predicted windows. This produces an amplitude envelope at the audio sampling rate for each FFT bin. Each envelope is applied to an oscillator whose frequency is set to the center frequency of the corresponding bin.
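A sketch of that preprocessing under the same assumptions as before: frame the audio, take the FFT of each frame, discard the phase, and normalize the magnitudes to [0, 1]:

import numpy as np

def magnitude_windows(audio, window_size=1024, hop=256):
    n = (len(audio) - window_size) // hop + 1
    frames = np.stack([audio[i * hop : i * hop + window_size]
                       for i in range(n)])
    mags = np.abs(np.fft.rfft(frames, axis=1))  # phase discarded
    return mags / mags.max()                    # normalize to [0, 1]

audio = np.random.randn(44100)
windows = magnitude_windows(audio)
print(windows.shape)  # (number_of_windows, window_size // 2 + 1)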
VOCODER DESIGN
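A minimal sketch of the resynthesis just described, assuming one sine oscillator per FFT bin and linear interpolation of each bin's magnitude track up to the audio rate (parameter names and values are illustrative):

import numpy as np

def vocode(windows, window_size=1024, hop=256, sr=44100):
    n_windows, n_bins = windows.shape
    n_samples = n_windows * hop
    frame_times = np.arange(n_windows) * hop  # window positions in samples
    t = np.arange(n_samples)
    out = np.zeros(n_samples)
    for k in range(n_bins):
        env = np.interp(t, frame_times, windows[:, k])  # per-bin envelope
        freq = k * sr / window_size                     # bin center frequency
        out += env * np.sin(2 * np.pi * freq * t / sr)
    return out / n_bins

audio = vocode(np.random.rand(100, 513))

The per-bin loop is deliberately literal to mirror the description; a real implementation would vectorize it across bins.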
MAGNITUDE SPECTRUM APPROACH AUDIO
Audio Example 5: Hop Size 1024
Audio Example 6: Hop Size 512
Audio Example 7: Hop Size 256
Audio Example 8: Hop Size 128
Audio Example 9: Hop Size 64
MAGNITUDE SPECTRUM ANALYSIS
As with vector length in the vector approach, changing the hop size of the FFT used to produce the data drastically changes the quality of the sampled outputs. When the hop size is large, the frequencies overlap and blur into a subtle, evolving texture. As the hop size decreases, transitions between different pitches become increasingly clear. When the hop size is very small, sequences of notes from the input data can be observed; however, the model also tends to become somewhat unstable.
TRANSPOSE APPROACH
This approach features the most exotic handling of the model outputs.
The output of the LSTM is the same shape as the input, [number_of_unrollings, vector_length]. It is passed through a fully connected layer to change the shape to [number_of_unrollings, 1]. The transpose of this column vector is passed through another fully connected layer to change the shape to [1, 1].
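A shape walk-through of these two layers, with random weights standing in for the learned fully connected layers (illustrative only):

import numpy as np

number_of_unrollings, vector_length = 16, 512
lstm_output = np.random.randn(number_of_unrollings, vector_length)

fc1 = np.random.randn(vector_length, 1)
column = lstm_output @ fc1         # shape: (number_of_unrollings, 1)

fc2 = np.random.randn(number_of_unrollings, 1)
sample = column.T @ fc2            # shape: (1, 1) -- one audio sample
print(column.shape, sample.shape)  # (16, 1) (1, 1)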
TRANSPOSE APPROACH AUDIO
Audio Example 10: Transpose
TRANSPOSE ANALYSIS
This method hardly ever produces noisy outputs, and in that respect it is the most successful. The outputs themselves tend to blend all the frequencies in the seed. The biggest drawback of this approach is that it takes significantly more time to produce an output than the previous two approaches, because the model predicts one audio sample at a time rather than a vector of samples.
COLUMN VECTOR APPROACH
This is the most successful approach. The model predicts one sample for each input vector, accomplished by passing the output of the LSTM through a fully connected layer. The sampled outputs from this model tend to feature notes from the input data, but rarely in the order they originally appear. Occasionally the notes are surreally distorted: for instance, piano notes are sustained or grow in volume rather than decaying after their initial attack. Because of the way the input data is offset from the label data, only the final element in the predicted vector is not already present in the seed. This final audio sample is the only piece kept during sampling, so, like the transpose approach, this approach is very slow to produce outputs.
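A sketch of this output head, again with random weights as stand-ins for the learned layer: the fully connected layer maps each LSTM output vector to a single sample, and only the final one is new:

import numpy as np

number_of_unrollings, vector_length = 16, 512
lstm_output = np.random.randn(number_of_unrollings, vector_length)

fc = np.random.randn(vector_length, 1)
predictions = lstm_output @ fc   # shape: (number_of_unrollings, 1)

new_sample = predictions[-1, 0]  # the only sample not already in the seed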
COLUMN VECTOR APPROACH AUDIO
Audio Example 11: Column Vector
COLUMN VECTOR ANALYSIS
The sampled outputs produced with this approach most closely resemble what a human might guess if presented with the same task as the LSTM. Notes from the input are repeated, but in a different order. Some minor changes might be made to the data, but generally the data remains recognizable. This is in stark contrast to the other methods, whose outputs sometimes leave the data only barely discernible.