Brief intro to recurrent neural networks

Note: This post is meant to be a brief, intuitive summary of recurrent neural networks.  I have adapted this material from the Coursera deep learning course.  The value I hope to add is a summary that is (hopefully) easy to understand and can serve as reference or refresher material in the future.

Part 1: Recurrent Neural Networks:


Recurrent neural networks are a class of neural networks whose nodes/neurons form a directed graph along a sequence.  They are very effective at tasks such as natural language processing because they have a "memory"; in other words, they can receive context from previous inputs.  They take in input one time step at a time and pass information from one step to the next via a hidden activation layer.  This hidden activation serves as the "memory" or "context" of the network, and is used alongside each new input as it is processed.


RNN Cell:




A cell/node of an RNN takes an input x<t> and the activation/memory a<t-1> from the previous time step.   These two are multiplied by their corresponding weight matrices and added together along with a bias, then a tanh activation function is applied to get a<t>.   a<t> is the activation that gets passed into the next cell of the recurrent neural network, and it is also used to compute the prediction y<t> at the current time step.

Python/numpy:
 a_next = np.tanh(np.dot(Waa, a_prev) + np.dot(Wax, xt) + ba)   
 yt_pred = softmax(np.dot(Wya, a_next) + by)  
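The forward-pass code in the next section calls this computation through a helper named rnn_cell_forward.  Below is a minimal sketch of what that helper might look like; the parameters dictionary layout, the softmax helper, and the contents of the cache are assumptions made for illustration.

import numpy as np

def softmax(z):
    # numerically stable softmax over the first axis (one distribution per column/example)
    e = np.exp(z - np.max(z, axis=0, keepdims=True))
    return e / np.sum(e, axis=0, keepdims=True)

def rnn_cell_forward(xt, a_prev, parameters):
    # unpack the weights and biases (assumed dictionary layout)
    Wax, Waa, Wya = parameters["Wax"], parameters["Waa"], parameters["Wya"]
    ba, by = parameters["ba"], parameters["by"]

    # new hidden state and prediction, exactly as in the snippet above
    a_next = np.tanh(np.dot(Waa, a_prev) + np.dot(Wax, xt) + ba)
    yt_pred = softmax(np.dot(Wya, a_next) + by)

    # values a backward pass would need
    cache = (a_next, a_prev, xt, parameters)
    return a_next, yt_pred, cache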


Forward pass of a recurrent neural network:


Chain multiple RNN cells together in succession, and you have a recurrent neural network...

Python/numpy:
    a = np.zeros((n_a, m, T_x))
    y_pred = np.zeros((n_y, m, T_x))
    
    # Initialize a_next (≈1 line)
    a_next = a0
    
    # loop over all time-steps
    for t in range(T_x):
        # Update next hidden state, compute the prediction, get the cache (≈1 line)
        a_next, yt_pred, cache = rnn_cell_forward(x[:,:,t], a_next, parameters)
        # Save the value of the new "next" hidden state in a (≈1 line)
        a[:,:,t] = a_next
        # Save the value of the prediction in y (≈1 line)
        y_pred[:,:,t] = yt_pred
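
For concreteness, here is one way to set up the inputs for the loop above with small, made-up dimensions (the sizes are purely illustrative; the shapes follow the indexing used in the snippets, where x[:,:,t] selects all examples at time step t):

import numpy as np

n_x, n_a, n_y, m, T_x = 3, 5, 2, 10, 4   # illustrative sizes (hypothetical)

x  = np.random.randn(n_x, m, T_x)        # input sequence: x[:,:,t] is the t-th time step
a0 = np.random.randn(n_a, m)             # initial hidden state, fed in as a_next = a0
parameters = {
    "Wax": np.random.randn(n_a, n_x),    # input-to-hidden weights
    "Waa": np.random.randn(n_a, n_a),    # hidden-to-hidden weights
    "Wya": np.random.randn(n_y, n_a),    # hidden-to-output weights
    "ba":  np.zeros((n_a, 1)),
    "by":  np.zeros((n_y, 1)),
}

# after running the loop above: a.shape == (n_a, m, T_x) and y_pred.shape == (n_y, m, T_x)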



Part 2: Long Short Term Memory Network (LSTM):


While recurrent neural networks have a concept of "memory" through passing the hidden output layer a<t> to successive nodes, this memory is considered very "short term" because it gets overwritten at each successive cell.

The Long Short Term Memory (LSTM) network architecture attempts to solve this problem by introducing a long term memory vector c<t>.  This long term memory can "remember" things from the far past. 

LSTMs have many useful applications.   For example, in natural language processing, an LSTM can remember that the subject of a sentence is singular so that a verb much later in the sentence agrees with it, and it can update that memory when the subject changes to plural.


Forget gate:

One component of an LSTM is the forget gate Γf.  The forget gate consists of values between zero and one; when it is multiplied element-wise by the long term memory from the previous cell c<t-1>, it determines which values will be "remembered" and which will be "forgotten" going into the next cell.
$$\Gamma_f^{\langle t \rangle} = \sigma(W_f[a^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_f)\tag{1} $$



As you can see from the above equation, the hidden state from the previous cell a<t-1> is concatenated with x<t> to form [a<t-1>, x<t>].   This concatenation is then multiplied by a weight matrix Wf, a bias bf is added, and a sigmoid activation function is applied to get Γf.
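
In numpy terms, the concatenation and the forget-gate computation look something like the sketch below (the sigmoid helper and the specific sizes are just for illustration; the full LSTM cell code appears later in this section):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_x, n_a, m = 3, 5, 10                 # illustrative sizes (hypothetical)
xt     = np.random.randn(n_x, m)       # current input x<t>
a_prev = np.random.randn(n_a, m)       # previous hidden state a<t-1>

# stack a<t-1> on top of x<t> to form [a<t-1>, x<t>]
concat = np.concatenate((a_prev, xt), axis=0)   # shape (n_a + n_x, m)

Wf = np.random.randn(n_a, n_a + n_x)            # forget gate weight matrix
bf = np.zeros((n_a, 1))
gamma_f = sigmoid(np.dot(Wf, concat) + bf)      # values between 0 and 1, shape (n_a, m)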

Update gate:

The update gate Γu is computed through a similar process as the forget gate Γf.    Γu, which like Γf consists of values between zero and one, will be multiplied element-wise by the candidate memory c<t>~ when calculating c<t>.   This determines which values will be "updated."
$$\Gamma_u^{\langle t \rangle} = \sigma(W_u[a^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_u)\tag{2} $$


Using the forget gate and update gate to update the cell state:

To update the cell, we need to create a new vector of numbers c<t>~ to update the previous cell state.  Recall that the "long term memory" from the previous cell is c<t-1>.

c<t>~ represents the vector of candidate values, i.e., the new information that will be "updated" into the cell state.

$$ \tilde{c}^{\langle t \rangle} = \tanh(W_c[a^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_c)\tag{3} $$


Finally, the forget gate Γf is multiplied element-wise by the long term memory from the previous cell c<t-1>, and the update gate Γu is multiplied element-wise by c<t>~.   The two products are added together to get c<t>:

$$ c^{\langle t \rangle} = \Gamma_f^{\langle t \rangle}* c^{\langle t-1 \rangle} + \Gamma_u^{\langle t \rangle} *\tilde{c}^{\langle t \rangle} \tag{4} $$

c<t> represents the long term memory after passing through the "forget" and "update" gates.
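
To make equation (4) concrete, here is a tiny toy example (with made-up gate values pushed all the way to 0 and 1 for clarity) showing how the forget and update gates blend old and new memory element-wise:

import numpy as np

# hypothetical 3-element memory vectors, just for illustration
gamma_f = np.array([1.0, 0.0, 1.0])   # forget gate: keep the 1st and 3rd entries of old memory
gamma_u = np.array([0.0, 1.0, 0.0])   # update gate: write new information into the 2nd entry
c_prev  = np.array([0.5, -0.3, 0.8])  # long term memory from the previous cell, c<t-1>
c_tilde = np.array([0.9, 0.7, -0.2])  # candidate values, c<t>~

c_next = gamma_f * c_prev + gamma_u * c_tilde   # equation (4)
print(c_next)                                   # [0.5  0.7  0.8]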

Output gate:

Finally, we compute the output gate Γo, which determines which parts of the cell state are output as the hidden state a<t>.

$$ \Gamma_o^{\langle t \rangle}=  \sigma(W_o[a^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_o)\tag{5}$$

A tanh activation function is applied to c<t>.   The result is then multiplied element-wise by the output gate Γo to get the next hidden state a<t>:
$$ a^{\langle t \rangle} = \Gamma_o^{\langle t \rangle}* \tanh(c^{\langle t \rangle})\tag{6} $$


Python/numpy:
# concatenate a_prev and xt to form [a<t-1>, x<t>]
concat = np.concatenate((a_prev, xt), axis=0)

ft = sigmoid(np.dot(Wf, concat) + bf)        # forget gate, equation (1)
it = sigmoid(np.dot(Wi, concat) + bi)        # update gate, equation (2)
cct = np.tanh(np.dot(Wc, concat) + bc)       # candidate memory c<t>~, equation (3)
c_next = ft*c_prev + it*cct                  # new cell state c<t>, equation (4)
ot = sigmoid(np.dot(Wo, concat) + bo)        # output gate, equation (5)
a_next = ot * np.tanh(c_next)                # new hidden state a<t>, equation (6)
yt_pred = softmax(np.dot(Wy, a_next) + by)   # prediction y<t>


Forward Pass of an LSTM:


As with the vanilla RNN example, if you chain many LSTM cells together, you have a full LSTM network.


Python/numpy:
# initialize "a", "c" and "y" with zeros (≈3 lines)
a = np.zeros((n_a, m, T_x))
c = np.zeros((n_a, m, T_x))   # c needs its own array; c = a would just alias a
y = np.zeros((n_y, m, T_x))

# Initialize a_next and c_next (≈2 lines)
a_next = a0
c_next = np.zeros(a_next.shape)

# loop over all time-steps
for t in range(T_x):
    # Update next hidden state, next memory state, compute the prediction, get the cache (≈1 line)
    a_next, c_next, yt, cache = lstm_cell_forward(x[:,:,t], a_next, c_next, parameters)
    # Save the value of the new "next" hidden state in a (≈1 line)
    a[:,:,t] = a_next
    # Save the value of the prediction in y (≈1 line)
    y[:,:,t] = yt
    # Save the value of the next cell state (≈1 line)
    c[:,:,t]  = c_next


Part 3: Real World Application - Generating Music Using LSTMs:

Now, lets move on to a real world application -- generating music using LSTMs. 

We have an input X, of the shape (m, Tx, 78) where:
- m is the number of training examples
- Tx is the length of the music piece / sequence (i.e. 30)
- 78 is the length of the one-hot vector representing a musical note (i.e. there are 78 possible note values)

We have an output Y which is essentially the same as X, but shifted one step to the left along the Tx direction.  Thus, given a musical note x<t>, we want to predict the next note y<t> = x<t+1> (equivalently, x<t> = y<t-1>).
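
To make the shift concrete, here is a tiny illustration in numpy (the array sizes are made up, and this only shows the indexing relationship, not the exact preprocessing used in the course):

import numpy as np

m, Tx, n_values = 2, 5, 78             # m and Tx here are just illustrative
X = np.random.rand(m, Tx, n_values)    # sequences of one-hot note vectors

# y<t> = x<t+1>: the label at time t is the input note at time t+1
Y = X[:, 1:, :]                        # X shifted one step to the left along Tx

# check the relationship for one example at time step t
t = 2
assert np.array_equal(Y[0, t - 1], X[0, t])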


Creating the model:

In the code below, we use Keras to build an LSTM model object with the input/output dimensions specified above.
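
Note that the function below reuses three layer objects defined outside of it: reshapor, LSTM_cell, and densor.  Defining them once and calling them at every time step means the same weights are shared across the whole sequence.  A plausible way to define them, consistent with how they are used in the function (reshapor reshapes x to (1, n_values), LSTM_cell returns its hidden and cell state, and densor produces a softmax over the 78 note values), might be:

# Keras 2.x style imports (Input, Lambda and Model are used by the model-building code below)
from keras.layers import Dense, Input, Lambda, LSTM, Reshape
from keras.models import Model

n_a = 64         # number of LSTM units (assumed; matches the training code below)
n_values = 78    # number of possible note values

# shared layer objects, reused at every time step so the weights are shared
reshapor = Reshape((1, n_values))
LSTM_cell = LSTM(n_a, return_state=True)
densor = Dense(n_values, activation='softmax')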

def djmodel(Tx, n_a, n_values):
    """
    Implement the model
    
    Arguments:
    Tx -- length of the sequence in a corpus
    n_a -- the number of activations used in our model
    n_values -- number of unique values in the music data 
    
    Returns:
    model -- a Keras model instance
    """
    
    # Define the input of your model with shape (Tx, n_values)
    X = Input(shape=(Tx, n_values))
    
    # Define a0 and c0, the initial hidden state and cell state for the LSTM
    a0 = Input(shape=(n_a,), name='a0')
    c0 = Input(shape=(n_a,), name='c0')
    a = a0
    c = c0
    
    ### START CODE HERE ### 
    # Step 1: Create empty list to append the outputs while you iterate (≈1 line)
    outputs = []
    
    # Step 2: Loop
    for t in range(Tx):
        
        # Step 2.A: select the "t"th time step vector from X. 
        x = Lambda(lambda x: X[:,t,:])(X)
        # Step 2.B: Use reshapor to reshape x to be (1, n_values) (≈1 line)
        x = reshapor(x)
        # Step 2.C: Perform one step of the LSTM_cell
        a, _, c = LSTM_cell(x, initial_state=[a, c])
        # Step 2.D: Apply densor to the hidden state output of LSTM_Cell
        out = densor(a)
        # Step 2.E: add the output to "outputs"
        outputs.append(out)
        
    # Step 3: Create model instance
    model = Model([X, a0, c0], outputs)

    return model


Training the model:

Now that we have our model creation function, we will use it to create the model and train/fit it to our training data:

# instantiate the model:
model = djmodel(Tx=30, n_a=64, n_values=78)

# instantiate the optimizer and compile the model
opt = Adam(lr=0.01, beta_1=0.9, beta_2=0.999, decay=0.01)
model.compile(optimizer=opt, loss='categorical_crossentropy', metrics=['accuracy'])

# initialize the input hidden state and cell state (m is the number of training examples)
a0 = np.zeros((m, n_a))
c0 = np.zeros((m, n_a))

# fit the model; it has Tx separate outputs, so the targets are passed as a list
# with one array per time step
model.fit([X, a0, c0], list(Y), epochs=100)


Generating novel music using the trained LSTM:

Now that we have a trained model, we can use it to generate novel jazz music.





Using our trained model, we will implement a music inference model which takes the output y from the previous time step and feeds it in as the input x at the next time step.

At each time step, the inference model generates a probability distribution over all 78 possible notes.  We take an argmax over the last axis (the note axis) to find the index with the highest probability, then convert that index to a one-hot vector, which determines the next note that is fed back into the model.
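
As a rough numpy sketch of that argmax-and-one-hot step (the variable names and the dummy distribution here are purely illustrative; inside the Keras model this is handled by the one_hot Lambda layer):

import numpy as np

n_values = 78
out = np.random.rand(1, n_values)   # stand-in for the model's probability distribution
out = out / out.sum()               # normalize so it looks like a softmax output

idx = np.argmax(out, axis=-1)[0]    # index of the most likely note
x_next = np.zeros((1, n_values))    # one-hot vector for that note,
x_next[0, idx] = 1.0                # fed back in as the next input x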


def music_inference_model(LSTM_cell, densor, n_values = 78, n_a = 64, Ty = 100):
    """
    Uses the trained "LSTM_cell" and "densor" from model() to generate a sequence of values.
    
    Arguments:
    LSTM_cell -- the trained "LSTM_cell" from model(), Keras layer object
    densor -- the trained "densor" from model(), Keras layer object
    n_values -- integer, number of unique values
    n_a -- number of units in the LSTM_cell
    Ty -- integer, number of time steps to generate
    
    Returns:
    inference_model -- Keras model instance
    """
    
    # Define the input of your model with shape (1, n_values)
    x0 = Input(shape=(1, n_values))
    
    # Define a0 and c0, the initial hidden state and cell state for the LSTM
    a0 = Input(shape=(n_a,), name='a0')
    c0 = Input(shape=(n_a,), name='c0')
    a = a0
    c = c0
    x = x0

    ### START CODE HERE ###
    # Step 1: Create an empty list of "outputs" to later store your predicted values (≈1 line)
    outputs = []
    
    # Step 2: Loop over Ty and generate a value at every time step
    for t in range(Ty):
        
        # Step 2.A: Perform one step of LSTM_cell (≈1 line)
        a, _, c = LSTM_cell(x, initial_state=[a, c])
        
        # Step 2.B: Apply Dense layer to the hidden state output of the LSTM_cell (≈1 line)
        out = densor(a)

        # Step 2.C: Append the prediction "out" to "outputs". out.shape = (None, 78) (≈1 line)
        outputs.append(out)
        
        # Step 2.D: Select the next value according to "out", and set "x" to be the one-hot representation of the
        #           selected value, which will be passed as the input to LSTM_cell on the next step. We have provided 
        #           the line of code you need to do this. 
        x = Lambda(one_hot)(out)
        
    # Step 3: Create model instance with the correct "inputs" and "outputs" (≈1 line)
    inference_model = Model([x0, a0, c0], outputs)
    
    ### END CODE HERE ###
    
    return inference_model

Finally, we use this inference model to generate novel music...

to be continued...
