Hello everyone! In this article, we will build a model that writes poems using PyTorch by predicting the next characters of the text. First, we will learn what RNNs and LSTMs are and how they work. Then we will create our model: we load our data and pre-process it, use PyTorch to train the model, and save it. After that, we will make predictions with the model by giving it a small initial text, from which it will generate a complete paragraph or stanza of a poem.
 

What is an RNN?

In machine learning, simple problems like classifying an image as a dog or a cat can be solved by training a classifier on a set of labelled data. But what if our problem is more complex, for example predicting the next word in a paragraph? If we examine this problem closely, we will find that we humans don't solve it using only our existing knowledge of language and grammar. In this type of problem, we also use the previous words of the paragraph and the context of the paragraph or poem to predict the next word.

Traditional neural networks can't do this, because they treat each input independently: they are trained on a fixed set of data and then make predictions with no memory of what came before. RNNs are well suited to this type of problem. RNN stands for Recurrent Neural Network. We can think of an RNN as a neural network with a loop in it: it passes information from one step to the next, so information persists through the sequence and the network can use the previous context to make more accurate predictions.
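To make this idea concrete, below is a tiny illustrative sketch (not part of our poem model) of how PyTorch's built-in nn.RNN carries a hidden state along a sequence; the sizes are arbitrary and chosen only for demonstration.

import torch
from torch import nn

# a toy RNN: 10-dimensional inputs, 16-dimensional hidden state
rnn = nn.RNN(input_size=10, hidden_size=16, batch_first=True)

x = torch.randn(1, 5, 10)    # one sequence of 5 time steps
h0 = torch.zeros(1, 1, 16)   # initial hidden state: (num_layers, batch, hidden_size)

out, h_n = rnn(x, h0)        # out: output at every step, h_n: final hidden state
print(out.shape, h_n.shape)  # torch.Size([1, 5, 16]) torch.Size([1, 1, 16])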


 

What is an LSTM?

I know you must be thinking: if an RNN can handle sequence data, then why do we need an LSTM? To answer this question, let's have a look at the two examples below.

Example 1:

"Birds live in the nest." Here it is easy to predict the word "nest" because the context of "birds" appears right before it, and an RNN will work fine in this case.

Example 2:

"I grew up in India (......blah blah blah…..) so I can speak Hindi." Here the task of predicting the word "Hindi" is difficult for an RNN because the gap between the relevant context and the prediction is large. By looking only at "I can speak ..." we cannot predict the language; we need the extra context of "India" from much earlier. So we need some long-term dependency across the paragraph to understand the context.

For this purpose, we use the LSTM (Long Short-Term Memory) network. As the name suggests, it has both a long-term and a short-term memory, and the two are used together to make predictions. In terms of architecture, an LSTM cell contains four gates, namely the learn gate, forget gate, remember gate and use gate (these roughly correspond to the input, forget, cell-update and output gates in the standard formulation). To keep this article simple and hands-on, I am not going into the theory of the LSTM architecture, but maybe we will talk about it in an upcoming article (maybe in the next one 😉).
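For reference, here is a similar minimal sketch of PyTorch's nn.LSTM (again with arbitrary sizes, only for illustration). The main difference from a plain RNN is that its state is a tuple of a hidden state and a cell state, which act as the short-term and long-term memory.

import torch
from torch import nn

lstm = nn.LSTM(input_size=10, hidden_size=16, num_layers=2, batch_first=True)

x = torch.randn(1, 5, 10)    # (batch, seq_len, input_size)
h0 = torch.zeros(2, 1, 16)   # hidden state: (num_layers, batch, hidden_size)
c0 = torch.zeros(2, 1, 16)   # cell state:   (num_layers, batch, hidden_size)

out, (h_n, c_n) = lstm(x, (h0, c0))
print(out.shape)             # torch.Size([1, 5, 16])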


 

Let's Build Our Model.

Now that we are done with the theory, let's start the interesting part: building our model.

Loading and Preprocessing the Data

I will use a poetry dataset from Kaggle. It contains around 15,000 poems, which should be enough for our model to learn patterns from. Now let's load it into our notebook.

1. First import the libraries.

import numpy as np
import torch
from torch import nn
import torch.nn.functional as F

2. Now load data from the text file.

# open text file and read in data as `text`
with open('/data/poems_data.txt', 'r') as f:
    text = f.read()

3. We can verify our data by printing the first 100 characters.

text[:100]

4. As we know, our neural network does not understand text, so we have to convert our text data to integers. For this purpose we create two token dictionaries, one mapping each character to an integer and one mapping each integer back to its character. (A quick round-trip check follows the code below.)

# encode the text and map each character to an integer and vice versa

# we create two dictionaries:
# 1. int2char, which maps integers to characters
# 2. char2int, which maps characters to unique integers
chars = tuple(set(text))
int2char = dict(enumerate(chars))
char2int = {ch: ii for ii, ch in int2char.items()}

# encode the text
encoded = np.array([char2int[ch] for ch in text])
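As an optional sanity check (not required for the rest of the pipeline), we can decode the first few encoded integers back into characters and confirm that they match the start of the text.

# decode the first 20 integers back to text; it should match text[:20]
print(encoded[:20])
print(''.join(int2char[i] for i in encoded[:20]))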

5. One-hot encoding is used to represent each character. For example, if we have three characters a, b, c, we can represent them as [1,0,0], [0,1,0] and [0,0,1]: the position of that character is set to 1 and all the others are 0. In our case we have many characters and symbols, so our one-hot vectors will be long, but that's fine.

def one_hot_encode(arr, n_labels):
    
    # Initialize the encoded array
    one_hot = np.zeros((arr.size, n_labels), dtype=np.float32)
    
    # Fill the appropriate elements with ones
    one_hot[np.arange(one_hot.shape[0]), arr.flatten()] = 1.
    
    # Finally reshape it to get back to the original array
    one_hot = one_hot.reshape((*arr.shape, n_labels))
    
    return one_hot

Now test it in this way.

# check that the function works as expected
test_seq = np.array([[0, 5, 1]])
one_hot = one_hot_encode(test_seq, 8)

print(one_hot)
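For test_seq = [[0, 5, 1]] with 8 labels, the printed output should look like this:

[[[1. 0. 0. 0. 0. 0. 0. 0.]
  [0. 0. 0. 0. 0. 1. 0. 0.]
  [0. 1. 0. 0. 0. 0. 0. 0.]]]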

6. Let's create batches for our model; this is a crucial part. Here we choose a batch_size, which is the number of rows (sequences) in each batch, and a seq_length, which is the number of columns (characters) in each sequence. A quick check of the generator follows the function below.

def get_batches(arr, batch_size, seq_length):
    '''Create a generator that returns batches of size
       batch_size x seq_length from arr.
       
       Arguments
       ---------
       arr: Array you want to make batches from
       batch_size: Batch size, the number of sequences per batch
       seq_length: Number of encoded chars in a sequence
    '''
    
    batch_size_total = batch_size * seq_length
    # total number of batches we can make
    n_batches = len(arr)//batch_size_total
    
    # Keep only enough characters to make full batches
    arr = arr[:n_batches * batch_size_total]
    # Reshape into batch_size rows
    arr = arr.reshape((batch_size, -1))
    
    # iterate through the array, one sequence at a time
    for n in range(0, arr.shape[1], seq_length):
        # The features
        x = arr[:, n:n+seq_length]
        # The targets, shifted by one
        y = np.zeros_like(x)
        try:
            y[:, :-1], y[:, -1] = x[:, 1:], arr[:, n+seq_length]
        except IndexError:
            y[:, :-1], y[:, -1] = x[:, 1:], arr[:, 0]
        yield x, y
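As a quick optional check of the generator, we can grab one batch and look at the shapes and at the one-character shift between inputs and targets.

# grab one batch of 8 sequences, 50 characters each
batches = get_batches(encoded, batch_size=8, seq_length=50)
x, y = next(batches)

print(x.shape, y.shape)   # (8, 50) (8, 50)
print(x[0, :10])          # first 10 encoded characters of the first sequence
print(y[0, :10])          # the same characters shifted left by one position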

7. Now we can check if a GPU is available. (If a GPU is not available, keep the number of epochs low.)

# check if GPU is available
train_on_gpu = torch.cuda.is_available()
if(train_on_gpu):
    print('Training on GPU!')
else:
    print('No GPU available, training on CPU; consider making n_epochs very small.')

8. Here we create a class called CharRNN; it is the class for our model. In the __init__ method we define the layers of the model: a two-layer LSTM, a dropout layer (which helps avoid overfitting) and, for the output, a simple linear layer. A small shape check follows the class definition below.

class CharRNN(nn.Module):
        
        def __init__(self, tokens, n_hidden=256, n_layers=2,
                                   drop_prob=0.5, lr=0.001):
            super().__init__()
            self.drop_prob = drop_prob
            self.n_layers = n_layers
            self.n_hidden = n_hidden
            self.lr = lr
            
            # creating character dictionaries
            self.chars = tokens
            self.int2char = dict(enumerate(self.chars))
            self.char2int = {ch: ii for ii, ch in self.int2char.items()}
            
            # LSTM layer
            self.lstm = nn.LSTM(len(self.chars), n_hidden, n_layers,
                                dropout=drop_prob, batch_first=True)

            # dropout layer
            self.dropout = nn.Dropout(drop_prob)

            # output layer
            self.fc = nn.Linear(n_hidden, len(self.chars))
    
        
        def forward(self, x, hidden):
            ''' Forward pass through the network. 
                These inputs are x, and the hidden/cell state `hidden`. '''
            ## Get the outputs and the new hidden state from the lstm
            r_output, hidden = self.lstm(x, hidden)
            
            ## pass through a dropout layer
            out = self.dropout(r_output)
            
            # Stack up LSTM outputs using view
            # you may need to use contiguous to reshape the output
            out = out.contiguous().view(-1, self.n_hidden)
            
            ## put x through the fully-connected layer
            out = self.fc(out)
            return out, hidden
        
        
        def init_hidden(self, batch_size):
            ''' Initializes hidden state '''
            # Create two new tensors with sizes n_layers x batch_size x n_hidden,
            # initialized to zero, for hidden state and cell state of LSTM
            weight = next(self.parameters()).data
            
            if (train_on_gpu):
                hidden = (weight.new(self.n_layers, batch_size, self.n_hidden).zero_().cuda(),
                      weight.new(self.n_layers, batch_size, self.n_hidden).zero_().cuda())
            else:
                hidden = (weight.new(self.n_layers, batch_size, self.n_hidden).zero_(),
                          weight.new(self.n_layers, batch_size, self.n_hidden).zero_())
            
            return hidden
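Before training, it can be useful to push one small dummy batch through an untrained network, just to confirm that the shapes line up. This is an optional sanity check and not part of the training pipeline.

# optional sanity check: forward one small dummy batch through an untrained network
test_net = CharRNN(chars, n_hidden=256, n_layers=2)
xb, yb = next(get_batches(encoded, batch_size=4, seq_length=10))
xb = torch.from_numpy(one_hot_encode(xb, len(chars)))

if(train_on_gpu):
    test_net.cuda()
    xb = xb.cuda()

hb = test_net.init_hidden(4)
out, hb = test_net(xb, hb)
print(out.shape)   # torch.Size([40, len(chars)]): one prediction per input character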

9. Our model is ready and it's time to train it. For training we use an optimizer and a loss function: after each step we calculate the loss, loss.backward() back-propagates it, and the optimizer's step function updates the weights accordingly. The loss will slowly decrease, which means the model is getting better.

We also calculate a validation loss at regular intervals during training to check whether our model is under-fitting or over-fitting.

def train(net, data, epochs=10, batch_size=10, seq_length=50, lr=0.001, clip=5, val_frac=0.1, print_every=10):
        ''' Training a network 
        
            Arguments
            ---------
            
            net: CharRNN network
            data: text data to train the network
            epochs: Number of epochs to train
            batch_size: Number of mini-sequences per mini-batch, aka batch size
            seq_length: Number of character steps per mini-batch
            lr: learning rate
            clip: gradient clipping
            val_frac: Fraction of data to hold out for validation
            print_every: Number of steps for printing training and validation loss
        
        '''
        net.train()
        
        opt = torch.optim.Adam(net.parameters(), lr=lr)
        criterion = nn.CrossEntropyLoss()
        
        # create training and validation data
        val_idx = int(len(data)*(1-val_frac))
        data, val_data = data[:val_idx], data[val_idx:]
        
        if(train_on_gpu):
            net.cuda()
        
        counter = 0
        n_chars = len(net.chars)
        for e in range(epochs):
            # initialize hidden state
            h = net.init_hidden(batch_size)
            
            for x, y in get_batches(data, batch_size, seq_length):
                counter += 1
                
                # One-hot encode our data and make them Torch tensors
                x = one_hot_encode(x, n_chars)
                inputs, targets = torch.from_numpy(x), torch.from_numpy(y)
                
                if(train_on_gpu):
                    inputs, targets = inputs.cuda(), targets.cuda()
    
                # Creating new variables for the hidden state, otherwise
                # we'd backprop through the entire training history
                h = tuple([each.data for each in h])
    
                # zero accumulated gradients
                net.zero_grad()
                
                # get the output from the model
                output, h = net(inputs, h)
                
                # calculate the loss and perform backprop
                loss = criterion(output, targets.view(batch_size*seq_length).long())
                loss.backward()
                # `clip_grad_norm` helps prevent the exploding gradient problem in RNNs / LSTMs.
                nn.utils.clip_grad_norm_(net.parameters(), clip)
                opt.step()
                
                # loss stats
                if counter % print_every == 0:
                    # Get validation loss
                    val_h = net.init_hidden(batch_size)
                    val_losses = []
                    net.eval()
                    for x, y in get_batches(val_data, batch_size, seq_length):
                        # One-hot encode our data and make them Torch tensors
                        x = one_hot_encode(x, n_chars)
                        x, y = torch.from_numpy(x), torch.from_numpy(y)
                        
                        # Creating new variables for the hidden state, otherwise
                        # we'd backprop through the entire training history
                        val_h = tuple([each.data for each in val_h])
                        
                        inputs, targets = x, y
                        if(train_on_gpu):
                            inputs, targets = inputs.cuda(), targets.cuda()
    
                        output, val_h = net(inputs, val_h)
                        val_loss = criterion(output, targets.view(batch_size*seq_length).long())
                    
                        val_losses.append(val_loss.item())
                    
                    net.train() # reset to train mode after iterating through validation data
                    
                    print("Epoch: {}/{}...".format(e+1, epochs),
                          "Step: {}...".format(counter),
                          "Loss: {:.4f}...".format(loss.item()),
                          "Val Loss: {:.4f}".format(np.mean(val_losses)))

Now train it in the following way:

# define and print the net
n_hidden = 512
n_layers = 2

net = CharRNN(chars, n_hidden, n_layers)
print(net)
batch_size = 128
seq_length = 100
n_epochs = 10  # start small if you are just testing initial behavior

# train the model
train(net, encoded, epochs=n_epochs, batch_size=batch_size, seq_length=seq_length, lr=0.001, print_every=10)

10. We can save the model in the following way.

# change the name, for saving multiple files
model_name = 'poem_4_epoch.net'

checkpoint = {'n_hidden': net.n_hidden,
              'n_layers': net.n_layers,
              'state_dict': net.state_dict(),
              'tokens': net.chars}

with open(model_name, 'wb') as f:
    torch.save(checkpoint, f)
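If we want to reuse the trained model later (for example in a new session), we can load the checkpoint back and rebuild the network from it. This loading snippet is my own addition, based on the keys we stored in the checkpoint above.

# load the checkpoint and rebuild the model from it
# (add map_location='cpu' to torch.load if the checkpoint was saved on a GPU machine
#  and you are loading it on a CPU-only machine)
with open(model_name, 'rb') as f:
    checkpoint = torch.load(f)

loaded = CharRNN(checkpoint['tokens'],
                 n_hidden=checkpoint['n_hidden'],
                 n_layers=checkpoint['n_layers'])
loaded.load_state_dict(checkpoint['state_dict'])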

11. Now that our model is trained, we'll use a sample method to predict the next characters. To sample, we pass in a character and have the network predict the next character. Then we take that character, pass it back in, and get another predicted character; keep doing this and, tada, you end up with a whole paragraph! The top_k argument restricts sampling to the k most probable next characters, so the model only chooses among the most relevant options.

def predict(net, char, h=None, top_k=None):
        ''' Given a character, predict the next character.
            Returns the predicted character and the hidden state.
        '''
        
        # tensor inputs
        x = np.array([[net.char2int[char]]])
        x = one_hot_encode(x, len(net.chars))
        inputs = torch.from_numpy(x)
        
        if(train_on_gpu):
            inputs = inputs.cuda()
        
        # initialize the hidden state if none was passed in,
        # then detach it from its history
        if h is None:
            h = net.init_hidden(1)
        h = tuple([each.data for each in h])
        # get the output of the model
        out, h = net(inputs, h)

        # get the character probabilities
        p = F.softmax(out, dim=1).data
        if(train_on_gpu):
            p = p.cpu() # move to cpu
        
        # get top characters
        if top_k is None:
            top_ch = np.arange(len(net.chars))
        else:
            p, top_ch = p.topk(top_k)
            top_ch = top_ch.numpy().squeeze()
        
        # select the likely next character with some element of randomness
        p = p.numpy().squeeze()
        char = np.random.choice(top_ch, p=p/p.sum())
        
        # return the encoded value of the predicted char and the hidden state
        return net.int2char[char], h


def sample(net, size, prime='The', top_k=None):
        
    if(train_on_gpu):
        net.cuda()
    else:
        net.cpu()
    
    net.eval() # eval mode
    
    # First off, run through the prime characters
    chars = [ch for ch in prime]
    h = net.init_hidden(1)
    for ch in prime:
        char, h = predict(net, ch, h, top_k=top_k)

    chars.append(char)
    
    # Now pass in the previous character and get a new one
    for ii in range(size):
        char, h = predict(net, chars[-1], h, top_k=top_k)
        chars.append(char)

    return ''.join(chars)

12. Let’s use this sample method to make predictions.

print(sample(net, 500, prime='christmas', top_k=2))

The output will look something like this:

christmas a son of this
the sun wants the street of the stars, and the way the way
they went and too man and the star of the words
of a body of a street and the strange shoulder of the sky
and the sun, an end on the sun and the sun and so to the stars are stars
and the words of the water and the streets of the world
to see them to start a posture of the streets
on the street of the streets, and the sun and soul of the station
and so too too the world of a sound and stranger and to the world
to the sun a

As we can see, our model was able to generate some decent lines. The content doesn't make a lot of sense, but the model produces lines with mostly correct grammar. If we train it for longer, it can perform even better.

Conclusion

I hope you have found this article helpful. We have built our first poem model using PyTorch. You can also use this approach to draft an article or write a blog by training on a different data set. If you have any queries or suggestions, feel free to post them in the comment section below or contact me at yash@gkmit.co; I will be really glad to assist you.

Thanks for reading :)