Training a Classifier using PyTorch¶
As part of the "Getting Acquainted with Deep Learning Frameworks" series, in this article we shall explore the PyTorch library. PyTorch is a deep learning library developed by researchers at Facebook. PyTorch consists of 3 important modules, namely -
- Autograd Module - This module takes care of gradient computations.
- Optim Module - The optim module implements optimization algorithms (such as SGD and Adam) used to train neural networks.
- NN Module - The NN module in PyTorch implements different layers necessary for building complex neural networks.
In the following sections, let's go over the steps involved in building a multiclass classifier using PyTorch.
import os
import sys
import pandas as pd
import numpy as np
import torch
import torch.nn.functional as F
from torchvision import datasets, transforms
from torch import nn, optim
As always, it helps to come up with a road map before digging into the implementation aspects. We shall complete our goal of building and training a model using the following steps -
- Load Dataset
- Visualize some data
- Define a model
- Define loss
- Define optimizer
- Train a model
- Validate the model
Loading dataset in PyTorch¶
In this article, we shall be using the Fashion MNIST dataset which consists of 10 categories, viz.
- T-shirt/top
- Trouser
- Pullover
- Dress
- Coat
- Sandal
- Shirt
- Sneaker
- Bag
- Ankle boot.
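For later reference (for example, when inspecting predictions), it can be handy to keep these category names in a plain Python list. The helper below is purely illustrative and not part of the torchvision API; the index order follows the dataset's label encoding.
### illustrative helper: label index -> human-readable category name
FASHION_MNIST_CLASSES = ["T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
                         "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot"]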
Let's kick off the implementation by loading the Fashion MNIST dataset.
### loading dataset from pytorch data repository
### when you use the dataset for the first time, it will download the actual files.
BATCH_SIZE = 32
data_transforms = transforms.Compose([transforms.ToTensor()])
trainset = datasets.FashionMNIST('~/.pytorch/F_MNIST_data/', download=True, train=True, transform = data_transforms)
training_data_loader = torch.utils.data.DataLoader(trainset, batch_size= BATCH_SIZE, shuffle = True)
testset = datasets.FashionMNIST('~/.pytorch/F_MNIST_data/', download=True, train=False, transform = data_transforms)
test_data_loader = torch.utils.data.DataLoader(testset, batch_size= BATCH_SIZE, shuffle = True)
Data loading explained -¶
We shall be loading the data using the Compose class from torchvision's transforms module and the DataLoader class from torch.utils.data.
- We can load the supported datasets by providing the path where the data exists. If the data is not available locally, set the 'download' argument to True. The API for loading datasets is quite straightforward and can be explored further here.
- The important point to note here is the use of the transform argument while loading the data. The transform argument defines a function that accepts a PIL image and returns a transformed tensor. For the sake of simplicity, I will not be applying any augmenting transformations to the input dataset (a sketch of how extra transforms could be chained follows this list).
- However, the DataLoader expects tensors, not PIL images, at its input. Hence, we apply a single transformation that simply converts the PIL images to tensors so that the DataLoader can iterate over them.
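As an aside, if we did want to chain extra transforms, the Compose pipeline could be extended as in the sketch below. This is illustrative only and is not used in the rest of the article; the normalization constants are placeholders, not the dataset's true statistics.
### illustrative only - chaining extra transforms (this article uses ToTensor() alone)
### the normalization constants below are placeholders, not computed statistics
example_transforms = transforms.Compose([transforms.ToTensor(),
                                         transforms.Normalize((0.5,), (0.5,))])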
Note - All the datasets supported by PyTorch are implemented as subclasses of torch.utils.data.Dataset, i.e., they have __getitem__ and __len__ methods implemented. Hence, they can all be passed to a torch.utils.data.DataLoader, which can load multiple samples in parallel using torch.multiprocessing workers.
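To make the note above concrete, here is a minimal, hypothetical Dataset subclass (not used elsewhere in this article) showing the two methods a DataLoader relies on.
### a minimal, illustrative Dataset wrapping in-memory tensors
class ArrayDataset(torch.utils.data.Dataset):
    def __init__(self, features, labels):
        self.features = features
        self.labels = labels

    def __len__(self):
        ### number of samples in the dataset
        return len(self.features)

    def __getitem__(self, idx):
        ### return one (sample, label) pair; DataLoader takes care of batching
        return self.features[idx], self.labels[idx]

### usage sketch - wrap random tensors and iterate in batches of 4
dummy_dataset = ArrayDataset(torch.randn(100, 784), torch.randint(0, 10, (100,)))
dummy_loader = torch.utils.data.DataLoader(dummy_dataset, batch_size=4, shuffle=True)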
Code Explanation -¶
Based on the explanation provided above, let's review the code line by line -
- The first line defines the transformation that needs to be applied to our data. In this case, it is simply type-casting the input into tensors.
- The second line loads the dataset. Note that the transform defined in step 1 has been passed as an argument while loading data.
- The third line of code converts the dataset into an iterable over mini-batches so that we can train the model in batches. We have set the 'train' flag to True to get the training split. By analogy, a test loader can be defined by setting the 'train' flag to False. The 'shuffle' flag reshuffles the data at every epoch. For more information on DataLoader, refer to the official documentation.
While we are at it, let's quickly check the output of the DataLoader. We will have to convert the output of the DataLoader (an iterable) into an iterator object. Check out this Stack Overflow thread if you are confused about the difference between an iterable and an iterator.
itr = iter(training_data_loader)
sample_image_batch, sample_image_labels = next(itr)
print("Shape of sample batch is %s and shape of sample batch labels is %s"%(sample_image_batch.shape,
sample_image_labels.shape))
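With the settings above, the printed shapes should be torch.Size([32, 1, 28, 28]) for the image batch and torch.Size([32]) for the labels. Since "visualize some data" is one of the steps in our road map, here is a minimal sketch that displays the first image of the batch; it assumes matplotlib is installed.
import matplotlib.pyplot as plt

### show the first image of the sampled batch; squeeze() drops the channel dimension
plt.imshow(sample_image_batch[0].squeeze(), cmap='gray')
plt.title('Label index: %d' % sample_image_labels[0].item())
plt.show()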
Now that we have our data ready, let's dig into defining and training a classification model.
Building a model in PyTorch -¶
- Similar to Keras, torch provides us with a Sequential API where we can add modules/Layers to the container.
- The order in which layers are added to the container is important. It is important to keep track of dimensions while adding layers.
- All the layers - Linear (fully connected), Convolution, LSTM, etc. - are available in torch.nn.
- torch.nn also provides access to different activation functions. It is imperative to choose activation functions based on the type of problem you are trying to solve. For example, for regression the output activation can be linear (or ReLU when the target is non-negative), whereas for binary classification it can be sigmoid.
- For more information regarding the Sequential API, check out PyTorch's master documentation.
Data Specific Model -¶
- As seen above, each batch of our train data has a shape [batch_size, 1, 28, 28] which implies that there is only one channel with height and width of 28 and 28 respectively.
- In this article, let's keep things simple by using only Dense/ Linear/ Fully-connected layers.
- To use Dense layers, we need to reshape our training example to a tensor of shape $[batch_size, n_channels \times width \times height]$. In this case, it will be $1 \times 28 \times 28 = 784$. Thus each batch of input data can be reshaped to a size of [batch_size, 784].
- Keeping in mind the dimensions from the previous point, it is evident that our first Dense/Linear layer should have dimensions of 784 x $(n\_units)_1$, where $(n\_units)_1$ denotes the number of units in the first layer of the network.
- We shall use 256 units in the first hidden layer, followed by 2 more hidden layers with 128 and 64 units respectively.
- The last layer is the output layer, consisting of 10 neurons, which equals the number of classes in our dataset.
"""
define a model
"""
classifier_model = nn.Sequential(nn.Linear(784, 256),
nn.ReLU(),
nn.Linear(256, 128),
nn.ReLU(),
nn.Linear(128, 64),
nn.ReLU(),
nn.Linear(64, 10),
nn.LogSoftmax(dim = 1))
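As an optional sanity check, we can pass a dummy batch of flattened images through the model and confirm that we get one log-probability per class for each sample.
### optional sanity check - 4 fake flattened images in, 4 rows of 10 log-probabilities out
dummy_batch = torch.randn(4, 784)
print(classifier_model(dummy_batch).shape)   ### expected: torch.Size([4, 10])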
Model Explanation -¶
- As seen above, we have added 4 fully connected layers to our model, with ReLU activations between them.
- Notice the activation function in the last layer - LogSoftmax. As the name suggests, LogSoftmax takes the log of the softmax function.
Mathematically, for class $i$,
$$\text{Softmax}(z)_i = \frac{e^{z_i}}{\sum_{j = 1}^{n\_classes} e^{z_j}}$$
$$\therefore \text{LogSoftmax}(z)_i = \log \Big[ \frac{e^{z_i}}{\sum_{j = 1}^{n\_classes} e^{z_j}} \Big] = z_i - \log \Big[ \sum_{j = 1}^{n\_classes} e^{z_j} \Big]$$
People tend to prefer log-likelihood over likelihood functions for the following reasons -
- Log-likelihoods are more numerically stable, since taking the log turns divisions into subtractions.
- Log-likelihood is less computationally expensive, since the exponentials in the equations are eliminated.
- Lastly, log-likelihood penalizes bigger mistakes more heavily.
Let's consider a case where your true class is 1 and your model estimates the probability of the true class as 0.9. If your loss function is the L1 loss, the value of the loss is |1 - 0.9| = 0.1. On the other hand, if you are using the negative log-likelihood, the value of the loss is 0.105 (assuming the natural log).
On the other hand, if your estimated probability is 0.3, the likelihood-style (L1) loss is 0.7, while the negative log-likelihood loss is 1.20.
Now, comparing these two cases: with the standard likelihood-style loss (akin to softmax), the error increases by a factor of 7 (0.7 / 0.1) between the two examples. With the negative log-likelihood (akin to log-softmax), the error increases by a factor of roughly 11 (1.20 / 0.105). The point is, even though softmax and log-softmax are monotonically related, their effect on the relative values of the loss function changes.
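The numbers above are easy to verify. A quick sanity check of the arithmetic, assuming the likelihood-style loss is |1 - p| and the log-likelihood loss is -ln(p):
import math

for p in (0.9, 0.3):
    likelihood_loss = abs(1 - p)           ### L1-style loss between true label 1 and probability p
    log_likelihood_loss = -math.log(p)     ### negative log-likelihood (natural log)
    print(p, round(likelihood_loss, 3), round(log_likelihood_loss, 3))
### ratios between the two cases: 0.7/0.1 = 7 and 1.204/0.105 is roughly 11.4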
For a further deep dive into why log-likelihoods are used, refer to this Stack Overflow thread.
Also, notice the dim = 1 argument in the PyTorch activation. It signifies that the log-softmax is computed across columns, i.e., across the class dimension for each sample.
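A tiny sketch makes the dim argument concrete: with dim = 1, each row (sample) gets its own distribution over the 10 columns (classes), so exponentiating and summing across columns gives 1 for every row.
### dim=1 normalizes across the class dimension (columns), one distribution per row
dummy_logits = torch.randn(2, 10)
log_probs = F.log_softmax(dummy_logits, dim=1)
print(torch.exp(log_probs).sum(dim=1))   ### expected: tensor([1., 1.]) up to rounding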
Steps 4 and 5 - Defining the Loss and Optimizer
In this section, we shall define a loss function and optimizer.
- Since this is a classification problem, we shall be using the negative log-likelihood (NLL) loss.
- You can choose to use CrossEntropyLoss(), which internally combines LogSoftmax and NLLLoss. If you choose CrossEntropyLoss, make sure the output layer produces raw logits, i.e., drop the LogSoftmax() from the model (and do not add a Softmax() either), since the loss applies it internally - see the sketch after this list.
- Read the master documentation here for more information.
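For completeness, here is a sketch of the CrossEntropyLoss variant mentioned above. It is not used in the rest of this article; the last layer outputs raw logits and the loss applies LogSoftmax and NLLLoss internally.
### alternative setup (illustrative, not used below) - raw logits + CrossEntropyLoss
logits_model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(),
                             nn.Linear(256, 128), nn.ReLU(),
                             nn.Linear(128, 64), nn.ReLU(),
                             nn.Linear(64, 10))        ### no LogSoftmax here
ce_criterion = nn.CrossEntropyLoss()                   ### applies LogSoftmax + NLLLoss internally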
While defining an optimizer, PyTorch requires us to pass model parameters that are to be optimized.
## define loss
criterion = nn.NLLLoss()
## define optimizer
optimizer = optim.Adam(classifier_model.parameters(), lr= 0.003)
The Fun Part
Now, we get to the business end of the article. In this section, we shall train a classifier by iterating over mini-batches. For the sake of convenience, I have added comments to the code below so that each code block can be discussed in later sections.
epochs = 10

for e in range(epochs):
    ## code block 1 starts ##
    running_loss = 0
    train_accuracy = 0
    ## code block 1 ends ##
    for images, labels in training_data_loader:
        ## code block 2 starts ##
        images = images.view(images.shape[0], -1)
        ## code block 2 ends ##

        ## code block 3 starts ##
        optimizer.zero_grad()
        ## code block 3 ends ##

        ## code block 4 starts ##
        log_ps = classifier_model.forward(images)
        ## code block 4 ends ##

        ## code block 5 starts ##
        loss = criterion(log_ps, labels)
        ## code block 5 ends ##

        ## code block 6 starts ##
        ps = torch.exp(log_ps)
        top_p, top_class = ps.topk(1, dim=1)
        equals = top_class == labels.view(*top_class.shape)
        train_accuracy += torch.mean(equals.type(torch.FloatTensor))
        ## code block 6 ends ##

        ## code block 7 starts ##
        loss.backward()
        ## code block 7 ends ##

        ## code block 8 starts ##
        optimizer.step()
        ## code block 8 ends ##

        running_loss += loss.item()
    else:
        test_loss = 0
        test_accuracy = 0

        # Turn off gradients for validation, saves memory and computations
        ## code block 9 starts ##
        with torch.no_grad():
            ## code block 9 ends ##
            ## code block 10 starts ##
            classifier_model.eval()
            ## code block 10 ends ##
            for images, labels in test_data_loader:
                images = images.view(len(images), -1)
                log_ps = classifier_model.forward(images)
                test_loss += criterion(log_ps, labels)

                ps = torch.exp(log_ps)
                top_p, top_class = ps.topk(1, dim=1)
                equals = top_class == labels.view(*top_class.shape)
                test_accuracy += torch.mean(equals.type(torch.FloatTensor))

        ## code block 11 starts ##
        classifier_model.train()
        ## code block 11 ends ##

        print("Training loss: {0}, Train Accuracy: {1}, "
              "Test loss: {2}, Test Accuracy: {3}".format(running_loss / len(training_data_loader),
                                                          train_accuracy / len(training_data_loader),
                                                          test_loss / len(test_data_loader),
                                                          test_accuracy / len(test_data_loader)))
Code Explanation -¶
code block 1 -
In code block 1, we simply initialize running_loss and train_accuracy to 0 so that we can track progress while iterating over the data mini-batches.

code block 2 -
In code block 2, we reshape the input batch so that it matches the dimensions expected by our neural network. Tensors can be reshaped in PyTorch using the .view method. Note that for a tensor 't1' of shape (a, b, c), t1.view(a, b*c) is equivalent to t1.view(a, -1).

code block 3 -
In code block 3, we clear out any gradient values that may have accumulated from previous iterations.

code block 4 -
In code block 4, we pass the batch of images through the network. This is equivalent to a forward pass and can be implemented using the 'forward' method.

code block 5 -
After the forward pass, we need to compute the loss in order to run backpropagation. The loss is computed by passing the log-probabilities and the true labels through the criterion defined earlier.

code blocks 6 to 11 -
In code block 6, we exponentiate the log-probabilities, pick the most likely class with topk, and accumulate the batch accuracy. Code blocks 7 and 8 perform the backward pass and the optimizer step, respectively. Code block 9 turns off gradient tracking for validation, code block 10 puts the model into evaluation mode before we loop over the test data, and code block 11 switches it back to training mode for the next epoch.
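To ground the .view note from code block 2, here is a tiny check (shapes chosen arbitrarily):
### t1.view(a, b*c) and t1.view(a, -1) produce the same shape
t1 = torch.randn(8, 1, 28, 28)
print(t1.view(8, 1 * 28 * 28).shape, t1.view(8, -1).shape)   ### both torch.Size([8, 784])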
Final Comments -¶
- Hopefully, this article will help you to get started with PyTorch.
- I will try to cover more use cases in upcoming articles.