Understanding Recurrent Neural Networks

Imagine you are watching a video: a stream of images that paint a moving story. Your eyes and brain stitch those snapshots into smooth motion and pick up who is moving, what happened in the shot, and what might come next. Now imagine feeding those frames into a vanilla neural network. Each image is treated in isolation; the model has no idea what happened before or what will happen after. All it sees is a series of photos, with no sense of how they relate to each other, and so it cannot understand them.

Recurrent Neural Networks (RNNs) alleviate exactly this problem: they process sequences in a much smarter way than vanilla neural networks. They carry an inbuilt memory of the information they have processed so far, which allows them to excel at sequence modelling.

In this blog I will show how RNNs work, then build one in PyTorch and train it so that it is able to generate sentences.

Brief Overview of an RNN

Recurrent Cell

At each time step t, the RNN cell takes:

  • A new input vector x_t
  • The previous hidden state h_{t-1}

It then produces an updated hidden state, using one weight matrix for the input and a special weight matrix for the hidden state:

H_t = tanh(W_x * x_t + W_h * h_{t-1} + b_h)

Output Calculation

The idea is that we can now map our hidden state to an output; what that output looks like depends on the task (next-character prediction, sentiment analysis, etc.). In our scenario, we will connect our RNN layer to a fully connected layer that maps the hidden state to our vocabulary.
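For example, in our setup the fully connected layer turns each hidden state into one score per character in the vocabulary. Writing its weights as W_y and bias as b_y (illustrative names; in the code below this is simply self.fc), the output at each step is:

y_t = W_y * h_t + b_y

Applying a softmax over y_t then gives a probability distribution over the possible next characters.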

What makes it different

Using the hidden state, the RNN builds a representation of the sequence so far, which lets it take the previous items into account when generating the current output. Because of this, it can handle sequences of any length. Vanilla networks, on the other hand, have a predefined API: a fixed input size and a fixed output size. The RNN still has a fixed input size per timestep, but it can produce an output at every step, so the number of outputs grows with the length of the sequence and depends on the task at hand.

A good mental model is to see each item of our sequence as a single timestep: the RNN processes the items one by one while building up a representation of what has come before in the sequence. This pattern lets the RNN carry forward previous knowledge and connect the items in the sequence together.

Now let’s dive into the Python implementation of RNNs. Note: the code can be found here on my GitHub.

Implementation

Data preparation

I will not include the code for this because it is long and not exactly relevant to the topic of this blog, but here is the gist of what I do. I get some of Fyodor Dostoevsky’s famous texts from Project Gutenberg. I clean the data by lowercasing all the letters and removing newlines and double spaces. Next I create dictionaries to convert characters to ids and vice versa, because I need to map each id to a specific one-hot encoded vector. Then I define a get_sequences function to split the text into batches, and finally define a train and test dataset. A minimal sketch of this preprocessing is shown below.
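Here is that sketch. The file name dostoevsky.txt, the 90/10 split, and the exact shape returned by get_sequences are illustrative; my actual notebook differs in the details:

text = open("dostoevsky.txt", encoding="utf-8").read().lower()
text = " ".join(text.split())                # remove newlines and double spaces

chars = sorted(set(text))
stoi = {c: i for i, c in enumerate(chars)}   # character -> id
itos = {i: c for c, i in stoi.items()}       # id -> character
vocab_size = len(chars)

def get_sequences(text, seq_len=100):
  # slice the text into (input, target) id sequences, with the target shifted by one
  ids = [stoi[c] for c in text]
  sequences = []
  for start in range(0, len(ids) - seq_len - 1, seq_len):
    x = ids[start : start + seq_len]
    y = ids[start + 1 : start + seq_len + 1]
    sequences.append(([x], [y]))             # keep a batch dimension of 1
  return sequences

sequences = get_sequences(text)
split = int(0.9 * len(sequences))
train_sequences, test_sequences = sequences[:split], sequences[split:]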

Writing the RNN Layer

The RNNLayer is the tiny heart of our RNN. It implements the following equation:

H_t = tanh(W_x * x_t + W_h * h_{t-1} + b_h)
import torch
import torch.nn as nn

class RNNLayer(nn.Module):
  def __init__(self, input_size, hidden_size):
    super().__init__()
    # W: maps the current input into the hidden space (small random weights in [0, 0.01))
    self.W = nn.Parameter(torch.rand(input_size, hidden_size) * 0.01)
    # H: maps the previous hidden state to the next hidden state
    self.H = nn.Parameter(torch.rand(hidden_size, hidden_size) * 0.01)
    self.b = nn.Parameter(torch.zeros(hidden_size))

  def forward(self, x, hidden):
    # H_t = tanh(x_t @ W + h_{t-1} @ H + b)
    h_next = torch.tanh(x @ self.W + hidden @ self.H + self.b)
    return h_next

We have two weight matrices:

  • W handles how the new input influences the state.
  • H handles how the previous hidden state influences the next state.

We use tanh, which squashes each dimension of h_t into [−1, 1]. This keeps the hidden state bounded so the activations cannot blow up to huge values (though gradients can still vanish, which we’ll discuss later).
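As a quick sanity check of the shapes (a small usage sketch I am adding here, not part of the original notebook), we can push a tiny batch through the layer:

layer = RNNLayer(input_size=78, hidden_size=256)

# four one-hot input vectors over a 78-character vocabulary
x_t = torch.zeros(4, 78)
x_t[torch.arange(4), torch.tensor([5, 12, 40, 3])] = 1.0

h_prev = torch.zeros(4, 256)   # hidden state starts as zeros at t = 0
h_next = layer(x_t, h_prev)
print(h_next.shape)            # torch.Size([4, 256])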

Writing a simple RNN

Now that we have created an RNNLayer, we can use it to build a simple model that generates text! We call our class SimpleRNN.

Initialization

  def __init__(self, vocab_size, hidden_size):
    super().__init__()
    self.vocab_size = vocab_size
    self.hidden_size = hidden_size

    self.rnn_layer = RNNLayer(vocab_size, hidden_size)
    self.fc = nn.Linear(hidden_size, vocab_size)

Notice we have a fully connected layer along with our RNN layer. This is because the RNN layer only outputs a hidden state of size hidden_size; we need a way to map that hidden state back to scores over the ids we initially assigned to each character in our vocabulary.

Forward

  def forward(self, x, hidden = None):
    batch_size, seq_len = x.size()

    if hidden is None:
      hidden = torch.zeros(batch_size, self.hidden_size, device = x.device)

    outputs = []

    for t in range(seq_len):
      x_t_indices = x[:, t]

      x_t_one_hot = torch.zeros(batch_size, self.vocab_size, device=x.device)

      batch_idxs = torch.arange(batch_size, device=x.device)
      x_t_one_hot[batch_idxs, x_t_indices] = 1.0

      hidden = self.rnn_layer(x_t_one_hot, hidden)
      outputs.append(hidden.unsqueeze(1))

    outputs = torch.cat(outputs, dim = 1)
    logits = self.fc(outputs)

    return logits, hidden

In our forward function, we first convert each input token index into a one-hot vector, because our RNNLayer expects a vector of size vocab_size (78 for my dataset).

The hidden state is initialized as a vector of zeros. At each time step we take the previous hidden state and the current one-hot input and use them to generate the new hidden state.

We run a loop over t from 0 to seq_len - 1 and keep accumulating each hidden state. After collecting all the hidden states we stack them together and feed them through the fully connected layer to get a raw score for each character at each position. A good way to interpret the output of the fully connected layer is that, after a softmax, it becomes a probability distribution telling us which character is likely to come next in our sequence.

Finally, we return the logits and also the hidden state of the final computation.
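As an aside, the manual one-hot construction inside the loop can also be done with PyTorch’s built-in F.one_hot. This is just an equivalent sketch, not the code used above:

import torch.nn.functional as F

x_t_indices = torch.tensor([3, 0, 7])                        # one timestep of a batch of 3
x_t_one_hot = F.one_hot(x_t_indices, num_classes=78).float()
print(x_t_one_hot.shape)                                     # torch.Size([3, 78])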

Here is the code for the full model

class SimpleRNN(nn.Module):
  def __init__(self, vocab_size, hidden_size):
    super().__init__()
    self.vocab_size = vocab_size
    self.hidden_size = hidden_size

    self.rnn_layer = RNNLayer(vocab_size, hidden_size)
    self.fc = nn.Linear(hidden_size, vocab_size)

  def forward(self, x, hidden = None):
    batch_size, seq_len = x.size()

    if hidden is None:
      hidden = torch.zeros(batch_size, self.hidden_size, device = x.device)

    outputs = []

    for t in range(seq_len):
      x_t_indices = x[:, t]

      x_t_one_hot = torch.zeros(batch_size, self.vocab_size, device=x.device)

      batch_idxs = torch.arange(batch_size, device=x.device)
      x_t_one_hot[batch_idxs, x_t_indices] = 1.0

      hidden = self.rnn_layer(x_t_one_hot, hidden)
      outputs.append(hidden.unsqueeze(1))

    outputs = torch.cat(outputs, dim = 1)
    logits = self.fc(outputs)

    return logits, hidden

Training


model = SimpleRNN(vocab_size, 256)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

num_epochs = 5

for epoch in range(num_epochs):
    train_loss = train_epoch(model, train_sequences, optimizer, criterion, device)
    print(f"Epoch: {epoch} ---- Loss: {train_loss}")

Now we initialize our model and set the loss to CrossEntropyLoss, because it combines LogSoftmax and negative log-likelihood, which makes it well suited to this kind of sequence modelling. We will run our training for 5 full passes over the entire training dataset.
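To make the “combines LogSoftmax and negative log-likelihood” point concrete, here is a small illustrative check (not part of the training code) showing that the two formulations produce the same loss value:

import torch.nn.functional as F

logits = torch.randn(8, 78)              # 8 positions, 78-character vocabulary
targets = torch.randint(0, 78, (8,))

ce = nn.CrossEntropyLoss()(logits, targets)
nll = nn.NLLLoss()(F.log_softmax(logits, dim=-1), targets)
print(torch.allclose(ce, nll))           # True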

def train_epoch(model, train_sequences, optimizer, criterion, device):
    model.train()
    total_loss = 0.0
    for x_seq, y_seq in train_sequences:
        x = torch.tensor(x_seq, dtype=torch.long).to(device)
        y = torch.tensor(y_seq, dtype=torch.long).to(device)

        optimizer.zero_grad()
        logits, _ = model(x)  # logits: (1, seq_len, vocab_size)

        loss = criterion(logits.view(-1, logits.size(-1)), y.view(-1))
        loss.backward()
        optimizer.step()

        total_loss += loss.item()
    avg_loss = total_loss / len(train_sequences)
    return avg_loss

Here is a very simple PyTorch training loop. All we are doing is taking an input sequence x and feeding it to the model, then comparing the model’s output with the ideal output sequence y by feeding both to the loss function. The backward call then calculates how much each weight needs to change in order to lower the loss, and we update our parameters accordingly with optimizer.step(). To learn more about the training process, feel free to read my blog on neural networks.

Inference

Now, we know our model takes numbers as inputs and returns numbers as outputs, so we must write an additional function to interpret the output and translate it into characters that we understand.

def generate_text(model, start_sequence, gen_len=100, device='cuda'):
    model.eval()
    hidden = None

    # stoi and itos are the character <-> id dictionaries built during data preparation
    input_ids = [stoi[c] for c in start_sequence]
    x = torch.as_tensor([input_ids], dtype=torch.long, device=device)
    with torch.no_grad():
        _, hidden = model(x)

    last_idx = input_ids[-1]
    generated = list(start_sequence)

    for _ in range(gen_len):
        x = torch.as_tensor([[last_idx]], dtype=torch.long, device=device)
        logits, hidden = model(x, hidden)
        logits = logits[:, -1, :]
        probs  = F.softmax(logits, dim=-1)

        next_idx = torch.multinomial(probs, num_samples=1).item()

        generated.append(itos[next_idx])
        last_idx = next_idx

    return ''.join(generated)

We convert the start sequence into its respective token ids. Then, for a designated length, we run a for loop, feeding our hidden state and the current index into the model. We take the probabilities and sample from them using torch.multinomial, which gives us the id of the next character (sampling from the distribution rather than always picking the single most likely one). Recall that we created itos, which maps ids back to their respective characters; using that we get the character and add it to our output sequence. When the for loop ends we have our entire sequence!

I ran generate_text(model, "i came home") and got the following output: i came homeritents: he was wawhing ilussine keepiinately passes i'tess he was two very from dishote invonlember

Yeah, I know it makes no sense. However, given more epochs and training data it would definitely generate more meaningful sentences. I wrote another RNN using PyTorch’s inbuilt RNN layers and it was able to generate much better outputs. Feel free to try it out here

Observations

While training the RNN, I realized that it actually trained slower than when I used PyTorch’s inbuilt layers. That is probably because PyTorch uses clever optimizations and parallelization behind the scenes to speed up the process.
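For comparison, here is roughly what the same model looks like with PyTorch’s built-in layer. This is a sketch of the idea; the notebook linked above may differ in its details:

class BuiltinRNN(nn.Module):
  def __init__(self, vocab_size, hidden_size):
    super().__init__()
    self.vocab_size = vocab_size
    self.rnn = nn.RNN(vocab_size, hidden_size, batch_first=True)
    self.fc = nn.Linear(hidden_size, vocab_size)

  def forward(self, x, hidden=None):
    # one-hot encode the whole (batch, seq_len) tensor in a single call
    x_one_hot = F.one_hot(x, num_classes=self.vocab_size).float()
    outputs, hidden = self.rnn(x_one_hot, hidden)   # nn.RNN loops over the timesteps internally
    logits = self.fc(outputs)
    return logits, hidden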

Drawbacks of RNNs

Although RNNs were a great step in the history of language modelling and sequence modelling, they are not ideal, because they often suffer from an issue called vanishing gradients. During backpropagation through many timesteps we repeatedly multiply small numbers together, so the weight updates for early timesteps become so small that they make no significant difference to the performance of our network. To alleviate this issue, two other models were introduced (LSTMs and GRUs), which I will cover in future blogs!
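As a toy numeric illustration of the problem (not tied to the model above): the gradient that reaches an early timestep is a product of one factor per step, and each factor is typically smaller than 1 because of the tanh derivative and the small recurrent weights.

grad = 1.0
per_step_factor = 0.5    # e.g. tanh'(h) * w_h, often well below 1
for t in range(50):      # backpropagating 50 timesteps into the past
  grad *= per_step_factor
print(grad)              # ~8.9e-16: updates for early timesteps are effectively zero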