PRODUCT CLASSIFICATION USING IMAGE AND TEXT (DEEP LEARNING)¶
Image classification using deep learning has been proving its worth across business verticals. However, building an image-centric AI product is often hampered by -
- Unavailability of large amounts of data.
- Poor quality of images.
- Even when data is available, the training times needed to achieve convergence are very high.
- Often metadata and other information associated with the images is not utilized.
In this article, we shall explore an approach that can leverage the metadata and may even achieve better results.
About the Dataset -
I have customized the FIDS30 dataset for this article to better suit the problem statement.
For the sake of this tutorial, let's consider only 4 fruit categories from the FIDS30 dataset, viz. Apples, Pears, Peaches and Plums.
These categories were shortlisted because the fruits are visually similar, and it will be interesting to see how a deep learning algorithm like a CNN handles that ambiguity.
In addition, each image has metadata associated with it that describes the image. The metadata for the fruit categories has been scraped from Wikipedia. Please refer to the appendix for more details.
To make things interesting, an additional category has been added to the dataset - 'Iphone'. The metadata for the iphone images is the same as that of apples. Such noise in the textual data will cause purely text-based models to fail. For example, when the metadata for an image reads "Apple sales in India have increased by 30% in the last 2 years", it is very difficult to know whether the text is referring to Apple the fruit or Apple the iPhone maker. In such circumstances, ML models based purely on textual data struggle.
To make things clear, let's explore the data!¶
Let's start by loading the Python libraries needed to train the deep learning model.
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "none"
from IPython.display import display
from IPython.display import HTML
display(HTML('<style>.prompt{width: 0px; min-width: 0px; visibility: collapse}</style>'))
display(HTML("<style>.container { width:100% !important; }</style>"))
import os, sys, glob, re, codecs
import warnings
warnings.filterwarnings('ignore')
import numpy as np
import pandas as pd
pd.set_option('display.max_colwidth', -1)
import matplotlib.pyplot as plt
from sklearn import preprocessing
from sklearn.utils import shuffle
from sklearn.metrics import classification_report
## nltk library for working with textual data.
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
## import custom functions from shared library
from shared import utils
import keras
from keras import backend as K
from keras import optimizers, regularizers
from keras.models import Sequential, Model
from keras.layers import (Dense, Dropout, Activation,
                          Flatten, Embedding, Conv1D,
                          MaxPooling1D, GlobalMaxPooling1D,
                          concatenate, Concatenate, Conv2D, MaxPooling2D,
                          ZeroPadding2D, Input)
from keras.layers.normalization import BatchNormalization
from keras.preprocessing import sequence, image
from keras.preprocessing.text import Tokenizer
from keras.callbacks import EarlyStopping
from keras.utils import to_categorical
from keras.optimizers import SGD
## pretrained model architectures (ImageNet weights are downloaded when these are instantiated)
from keras.applications.inception_v3 import InceptionV3
from keras.applications.xception import Xception
from keras.applications.resnet50 import ResNet50
from keras.applications.vgg19 import VGG19
np.random.seed(0)
Set the Training Data Directory¶
train_data_basedir = os.path.join('.', 'data', 'image_text_data', 'raw_metadata')
Read a CSV file with the following columns -¶
- image_filepaths - filepaths for the downloaded images.
- metadata - textual data describing the image.
- label - Target label for the model to be trained on.
raw_data = pd.read_csv(os.path.join(train_data_basedir, 'server_images_text_raw_metadata1.csv'))
print('\n')
print("There are %d training images"%len(raw_data))
display(HTML('<font size=2>'+raw_data.head().to_html()+'</font>'))
Category Distribution -¶
A quick check on how the training dataset is distributed across the target categories will help us discover any skew in the data.
grouped_data = raw_data.groupby('label').agg({'image_filepaths': {'data_count': 'count'}})
grouped_data.columns = grouped_data.columns.droplevel(0)
grouped_data = grouped_data.reset_index()
display(grouped_data)
It can be seen that the iphone category has roughly half as many images as the other categories. Let's make a note of this and check its impact while evaluating the results. One possible mitigation, class weighting, is sketched below, although it is not used in the rest of this article.
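A minimal, optional sketch of how balanced class weights could be computed with scikit-learn's compute_class_weight. Note that Keras would expect the integer-encoded class indices as dictionary keys rather than the label names, so this is shown only to illustrate the idea.
## OPTIONAL (not used later in this article): compute balanced class weights that
## could be passed to fit / fit_generator via the class_weight argument
from sklearn.utils.class_weight import compute_class_weight
class_names = np.unique(raw_data['label'])
class_weights = compute_class_weight(class_weight='balanced',
                                     classes=class_names,
                                     y=raw_data['label'].values)
## mapping of label name -> weight; Keras would need encoded integer indices as keys
print(dict(zip(class_names, class_weights)))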
Next, let's quickly visualize the images in our dataset.¶
## fix the height and width of the image to be displayed
height, width =10, 10
# decide the number of images to be displayed from each category
columns = 5
# Find the number of unique categories in the training dataset
rows = raw_data['label'].nunique()
# simple function for shuffling the data and sampling n rows
def get_shuffled_data(grouped_data, n = 5):
shuffled_data = shuffle(grouped_data).head(n)
return shuffled_data
# Initialize subplots using matplotlib
fig, axes = plt.subplots(nrows=rows, ncols=columns, figsize=(width,height), sharex = True, sharey = True);
sampled_data = raw_data.groupby('label', as_index = False).apply(get_shuffled_data, columns).reset_index(drop = True)
labels = sampled_data['label'].unique()
for index in range((columns*rows)):
ax = fig.add_subplot(rows,columns,index+1)
image_array = image.load_img(sampled_data.loc[index, 'image_filepaths'])
ax.axis('off')
ax.imshow(image_array, cmap='gray', interpolation='nearest')
for ax, row in zip(axes[:, 0], labels):
ax.set_ylabel(row)
ax.set_yticks([])
for ax, row in zip(axes[0, :], labels):
ax.set_xticks([])
Visual Ambiguity -¶
- It can be seen that the images are visually quite similar. Green apples look very similar to pears.
- A close-up image of a peach looks almost like an apple.
- Many images contain more than one fruit.
- Certain images of peaches and plums are very difficult to distinguish.
All in all, the above set of sample images shows that modelling such a dataset will be a challenge.
Also, we will have to convert the labels into one-hot encoded vectors. Let's get that out of the way using the piece of code below.¶
le = preprocessing.LabelEncoder()
unique_classes = raw_data['label'].unique()
Y = le.fit_transform(raw_data['label'])
Y = to_categorical(Y, num_classes=len(unique_classes))
print("There are %d training data points and %d categories"%(len(Y), Y.shape[1]))
Now that we have explored and investigated the image data, let's turn our focus to the textual metadata at our disposal.¶
display(shuffle(raw_data[['metadata', 'label']]).head())
Need for Text Cleaning -¶
It can be seen that the text data consists of -
a. stop words
b. a mix of lower and upper case characters
c. punctuation
Let's clean the textual data.
MAX_NB_WORDS = 100000
tokenizer = RegexpTokenizer(r'\w+')
stop_words = set(stopwords.words('english'))
stop_words.update(['.', ',', '"', "'", ':', ';', '(', ')', '[', ']', '{', '}'])
re_pattern = re.compile(u'[^\u0000-\uD7FF\uE000-\uFFFF]', re.UNICODE)
print("pre-processing train data...")
processed_docs_train = []
def get_cleaned_text(raw_string):
try:
tokens = tokenizer.tokenize(raw_string)
        ## lowercase before the stop word check so that capitalized stop words are also removed
        filtered = ' '.join([word.lower() for word in tokens if word.lower() not in stop_words])
return filtered
except Exception as e:
print(e)
return ""
raw_data['filtered_text'] = raw_data['metadata'].apply(get_cleaned_text)
display(raw_data.head())
Converting Raw Text to Numeric Vector Representation -¶
As we know, deep learning models cannot comprehend textual words in the human sense. They can only work with numeric vectors. The following section highlights the process of converting raw text into vector representation.
Consider a sample training data with just 2 data points -
sample_data = ['The fruit matures in late summer or autumn',
'The skin of ripe apples is generally red, yellow, green, pink, or russetted although many bi- or tri-colored cultivars may be found']
1. Initialize a tokenizer for breaking down the raw text into tokens. In general, tokenizers are designed to work with words (n-grams) or with characters. For the sake of this article, we shall work with word level tokens. Feel free to experiment with character level tokenization.
[in] sample_tokenizer = Tokenizer(num_words=20, lower=True, char_level=False)
2. Fit the tokenization model on top of raw text to build a dictionary. The model basically extracts the top n words in raw text and assigns them an integer value.
[in] sample_tokenizer.fit_on_texts(sample_data)
[in] print(sample_tokenizer.word_index)
[out] {'summer': 7, 'ripe': 11, 'is': 13, 'in': 5, 'yellow': 16, 'autumn': 8, 'skin': 9, 'pink': 18, 'matures': 4, 'cultivars': 25, 'generally': 14, 'late': 6, 'russetted': 19, 'red': 15, 'be': 27, 'may': 26, 'bi': 22, 'fruit': 3, 'although': 20, 'tri': 23, 'colored': 24, 'of': 10, 'green': 17, 'apples': 12, 'many': 21, 'found': 28, 'the': 2, 'or': 1}
3. Convert each training example into a sequence of integers.
[in] word_sequence = sample_tokenizer.texts_to_sequences(sample_data)
[in] print(word_sequence)
[out] [[2, 3, 4, 5, 6, 7, 1, 8], [2, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 1, 19, 20, 21, 22, 1, 23, 24, 25, 26, 27, 28]]
4. Sequence padding
As can be seen from the output of step 3, the lengths of the word sequences are not uniform. Therefore, we cannot feed the output of step 3 directly to our deep learning model.
To solve this, let's pad the sequences to a fixed length using a dummy token as follows -
[in] sequence.pad_sequences(word_sequence, maxlen=20)
[out] [[ 0 0 0 0 0 0 0 0 0 0 0 0 2 3 4 5 6 7 1 8]
[11 12 13 14 15 16 17 18 1 19 20 21 22 1 23 24 25 26 27 28]]
For more information, please check out this blog
print("tokenizing input data...")
tokenizer1 = Tokenizer(num_words=MAX_NB_WORDS, lower=True, char_level=False)
## fit tokenization model
tokenizer1.fit_on_texts(raw_data['filtered_text'])
## convert raw text into a sequence of words
word_seq_train = tokenizer1.texts_to_sequences(raw_data['filtered_text'])
## heuristic for deciding the max length for padding text sequences
raw_data['doc_len'] = raw_data['filtered_text'].apply(lambda sentence: len(sentence.split(' ')))
# max_seq_len = np.round(raw_data['doc_len'].mean() + raw_data['doc_len'].std()).astype(int)
max_seq_len = raw_data['doc_len'].max()
print("The length of the input text will be capped off at %d"%max_seq_len)
## pad text input
word_seq_train = sequence.pad_sequences(word_seq_train, maxlen=max_seq_len)
## recover the word index
word_index = tokenizer1.word_index
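Before moving on, a quick check (using the objects created above) confirms the shape of the padded text matrix and the vocabulary size.
## quick check: every document is now a fixed-length sequence of integer token ids
print("padded text input has shape %s" % str(word_seq_train.shape))
print("the tokenizer has seen %d unique tokens" % len(word_index))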
We shall be using pre-trained GloVe embeddings to generate word vectors for the tokens in the training data.
#training params
batch_size = 25
num_epochs = 10
#model parameters
num_filters = 64
embed_dim = 100
weight_decay = 1e-4
#load embeddings
print('loading word embeddings...')
embeddings_index = {}
f = codecs.open(os.path.join('.', 'data', 'image_text_data' ,'glove.6B.100d.txt'), encoding='utf-8')
for line in f:
values = line.rstrip().rsplit(' ')
word = values[0]
coefs = np.asarray(values[1:], dtype='float32')
embeddings_index[word] = coefs
f.close()
print('found %s word vectors' % len(embeddings_index))
#embedding matrix
print('preparing embedding matrix...')
words_not_found = []
nb_words = min(MAX_NB_WORDS, len(word_index)+1)
embedding_matrix = np.zeros((nb_words, embed_dim))
for word, i in word_index.items():
if i >= nb_words:
continue
embedding_vector = embeddings_index.get(word)
if (embedding_vector is not None) and len(embedding_vector) > 0:
# words not found in embedding index will be all-zeros.
embedding_matrix[i] = embedding_vector
else:
words_not_found.append(word)
print('number of null word embeddings: %d' % np.sum(np.sum(embedding_matrix, axis=1) == 0))
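As a sanity check on embedding coverage, we can peek at the vector of a domain word and at a few tokens that were left as all-zero rows. This is a small sketch; it assumes 'apple' appears in both our word index and the GloVe vocabulary.
## inspect one embedding row and a few out-of-vocabulary tokens
if 'apple' in word_index and word_index['apple'] < nb_words:
    print("first 5 dimensions of the 'apple' embedding:", embedding_matrix[word_index['apple']][:5])
print("sample of words with no GloVe vector:", words_not_found[:10])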
def get_image_representation(image_filepaths, channel_orientation = 'channels_first'):
'''
Function for reading images.
input:
image_filepaths: list of images filepaths
channel_orientation - String value for controlling the channel orientation
'''
image_representation = []
for img_path in image_filepaths:
img = image.load_img(img_path, target_size=(256, 256))
img = image.img_to_array(img, data_format=channel_orientation)
img = img/255.
image_representation.append(img)
return np.array(image_representation)
def prepare_training_generators(train_df, len_unique_classes ,chunk_size = 5, channel_orientation = "channels_first"):
'''
This function generates mini batches of training data in the form of an iterator.
inputs:
train_df: pandas dataframe of training data.
        len_unique_classes - integer value giving the number of unique target labels in the training data.
        chunk_size - integer value giving the mini batch size. The default value is 5.
        channel_orientation - string value used to specify the channel orientation of the images to be read.
output:
iterator of text_data, image_data and output label
'''
index_tracker = 0
while True:
text_x = tokenizer1.texts_to_sequences(train_df.iloc[index_tracker:index_tracker+chunk_size]['filtered_text'])
text_x = sequence.pad_sequences(text_x, maxlen=max_seq_len)
image_x = get_image_representation(train_df.iloc[index_tracker:index_tracker+chunk_size]['image_filepaths'], channel_orientation)
y_le = le.transform(train_df.iloc[index_tracker:index_tracker+chunk_size]['label'])
y = to_categorical(y_le, num_classes=len_unique_classes)
index_tracker += chunk_size
if index_tracker >= len(train_df):
index_tracker = 0
train_df = shuffle(train_df).reset_index(drop = True)
yield [text_x,image_x],y
def prepare_test_generators(test_df, len_unique_classes ,chunk_size = 5, channel_orientation = "channels_first"):
'''
    This function generates mini batches of test data in the form of an iterator.
    inputs:
        test_df: pandas dataframe of test data.
        len_unique_classes - integer value giving the number of unique target labels in the data.
        chunk_size - integer value giving the mini batch size. The default value is 5.
        channel_orientation - string value used to specify the channel orientation of the images to be read.
    output:
        iterator of [text_data, image_data]; labels are not yielded.
'''
index_tracker = 0
while True:
text_x = tokenizer1.texts_to_sequences(test_df.iloc[index_tracker:index_tracker+chunk_size]['filtered_text'])
text_x = sequence.pad_sequences(text_x, maxlen=max_seq_len)
image_x = get_image_representation(test_df.iloc[index_tracker:index_tracker+chunk_size]['image_filepaths'], channel_orientation)
y_le = le.transform(test_df.iloc[index_tracker:index_tracker+chunk_size]['label'])
y = to_categorical(y_le, num_classes=len_unique_classes)
index_tracker += chunk_size
if index_tracker >= len(test_df):
index_tracker = 0
yield [text_x,image_x]
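To make sure the generator wiring is correct, here is a throwaway check that pulls a single mini batch and prints the shapes; it uses raw_data purely for illustration and does not affect the generators created later for training.
## quick check: draw one mini batch from the training generator and inspect the shapes
check_gen = prepare_training_generators(raw_data, len(unique_classes), chunk_size = 4,
                                         channel_orientation = "channels_last")
[text_batch, image_batch], y_batch = next(check_gen)
print("text batch:", text_batch.shape, "| image batch:", image_batch.shape, "| labels:", y_batch.shape)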
train_data = raw_data.groupby('label', group_keys=False).apply(lambda x: x.sample(frac = 0.8, random_state = 2))
test_data = raw_data.loc[~raw_data.index.isin(train_data.index)]
print("training data has %d rows"%len(train_data))
print("test_data has %d rows"%len(test_data))
tfidf_tokenizer = Tokenizer(num_words=2000)
tfidf_tokenizer.fit_on_texts(raw_data['filtered_text'])
x_train = tfidf_tokenizer.texts_to_matrix(train_data['filtered_text'], mode='tfidf')
x_test = tfidf_tokenizer.texts_to_matrix(test_data['filtered_text'], mode='tfidf')
IMAGE ONLY MODEL -
- The first model that we will fit on our training data is an image-only model.
- The function for building the image-only model relies on pre-trained weights available with Keras.
- We shall be adding some dense layers on top to serve our purpose.
- Please note that the text input from the generators will be ignored in an image-only model.
def build_image_only_model(text_input_shape =(26,), image_input_shape = (256,256,3), pretrained_model = 'vgg19', classes=245):
print('received pretrained model %s'%pretrained_model)
vis_input = Input(shape = text_input_shape, name = "vis_input")
if pretrained_model == 'inception':
pretrained_model = InceptionV3(
include_top=False,
input_shape=image_input_shape,
weights='imagenet'
)
elif pretrained_model == 'xception':
pretrained_model = Xception(
include_top=False,
input_shape=image_input_shape,
weights='imagenet'
)
elif pretrained_model == 'resnet50':
pretrained_model = ResNet50(
include_top=False,
input_shape=image_input_shape,
weights='imagenet'
)
elif pretrained_model == 'vgg19':
pretrained_model = VGG19(
include_top=False,
input_shape=image_input_shape,
weights='imagenet'
)
elif pretrained_model == 'all':
input = Input(shape=image_input_shape)
inception_model = InceptionV3(
include_top=False,
input_tensor=input,
weights='imagenet'
)
xception_model = Xception(
include_top=False,
input_tensor=input,
weights='imagenet'
)
resnet_model = ResNet50(
include_top=False,
input_tensor=input,
weights='imagenet'
)
flattened_outputs = [Flatten()(inception_model.output),
Flatten()(xception_model.output),
Flatten()(resnet_model.output)]
output = Concatenate()(flattened_outputs)
pretrained_model = Model(input, output)
'''
We can select from inception, xception, resnet50, vgg19, or a combination of the first three as the basis for our image classifier.
We specify include_top=False in these models in order to remove the top level classification layers.
These are the layers used to classify images into the categories of the ImageNet competition;
since our categories are different, we shall remove these top layers and replace them with our own.
'''
if pretrained_model.output.shape.ndims > 2:
output = Flatten()(pretrained_model.output)
else:
output = pretrained_model.output
output = BatchNormalization()(output)
output = Dropout(0.2)(output)
output = Dense(128, activation='relu')(output)
output = BatchNormalization()(output)
output = Dropout(0.2)(output)
output = Dense(256, activation='relu')(output)
output = BatchNormalization()(output)
output = Dropout(0.2)(output)
output = Dense(classes, activation='softmax')(output)
model = Model(inputs = [vis_input, pretrained_model.input], outputs = output, name = "model")
for layer in pretrained_model.layers:
layer.trainable = False
model.summary(line_length=200)
model.compile(optimizer='adam',
loss='categorical_crossentropy',
metrics=['accuracy'])
return model
## set up a TensorBoard callback for logging; the model summary is printed when the model is built below
tboard = keras.callbacks.TensorBoard(log_dir='./logs', histogram_freq=0, batch_size=32, write_graph=True, write_grads=False,
write_images=False, embeddings_freq=0, embeddings_layer_names=None, embeddings_metadata=None, embeddings_data=None)
validation_data_gen = prepare_training_generators(test_data, len(unique_classes), chunk_size = len(test_data) ,channel_orientation = "channels_last")
train_data_gen = prepare_training_generators(train_data, len(unique_classes), chunk_size = batch_size, channel_orientation = "channels_last")
image_only_model = build_image_only_model(text_input_shape =(max_seq_len,), image_input_shape = (256, 256, 3) ,classes=len(unique_classes))
history = image_only_model.fit_generator(train_data_gen,
steps_per_epoch=np.ceil(len(train_data) / batch_size),
epochs=30,
validation_data = validation_data_gen,
validation_steps = 1,
verbose=1, callbacks=[tboard])
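It can be helpful to plot the training curves of the image-only model before moving on. This is a minimal sketch over the history object; the metric key is assumed to be 'acc' in this Keras version ('accuracy' in newer releases), hence the fallback.
## plot training and validation accuracy for the image only model
acc_key = 'acc' if 'acc' in history.history else 'accuracy'
plt.figure(figsize=(8, 4))
plt.plot(history.history[acc_key], label='train accuracy')
plt.plot(history.history['val_' + acc_key], label='validation accuracy')
plt.xlabel('epoch')
plt.ylabel('accuracy')
plt.title('image only model')
plt.legend()
plt.show()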
TEXT ONLY MODEL -
- In this model, we shall rely only on the textual metadata for each image.
- We shall be using embeddings from the pre-trained GloVe model.
- Each word in the GloVe model is represented by a vector of size 100, trained on a Wikipedia corpus. There are many other models available, trained on different text corpora. For more detail, check out the GloVe project.
- Please note that for a text-only model, we shall be ignoring the image data.
def build_text_only_model(text_input_shape =(26,), image_input_shape = (256,256,3), pretrained_model = 'vgg19', classes=245):
print('received pretrained model %s'%pretrained_model)
vis_input = Input(shape = text_input_shape, name = "vis_input")
img_input = Input(shape=image_input_shape, name="img_input")
text_emb = Embedding(embedding_matrix.shape[0], embedding_matrix.shape[1], weights=[embedding_matrix], trainable=False)(vis_input)
text_emb = Conv1D(128, 3, padding='same')(text_emb)
text_emb = Activation('relu')(text_emb)
text_emb = MaxPooling1D(2)(text_emb)
text_emb = Conv1D(256, 3, padding='same')(text_emb)
text_emb = Activation('relu')(text_emb)
text_emb = MaxPooling1D(2)(text_emb)
text_emb = Dropout(0.2)(text_emb)
text_emb = Flatten()(text_emb)
text_emb = Dense(512, kernel_regularizer=regularizers.l2(weight_decay))(text_emb)
text_emb = Activation('relu')(text_emb)
final_output = Dense(classes, activation='softmax')(text_emb)
model = Model(inputs = [vis_input, img_input], outputs = final_output, name = "model")
model.summary(line_length=200)
opt = SGD(lr=0.01)
model.compile(optimizer= 'adam',
loss='categorical_crossentropy',
metrics=['accuracy'])
return model
tboard = keras.callbacks.TensorBoard(log_dir='./logs', histogram_freq=0, batch_size=32, write_graph=True, write_grads=False,
write_images=False, embeddings_freq=0, embeddings_layer_names=None, embeddings_metadata=None, embeddings_data=None)
validation_data_gen = prepare_training_generators(test_data, len(unique_classes), chunk_size = len(test_data) ,channel_orientation = "channels_last")
train_data_gen = prepare_training_generators(train_data, len(unique_classes), chunk_size = batch_size ,channel_orientation = "channels_last")
text_only_model = build_text_only_model(text_input_shape =(max_seq_len,), image_input_shape = (256, 256, 3) ,classes=len(unique_classes))
history3 = text_only_model.fit_generator(train_data_gen,
steps_per_epoch=np.ceil(len(train_data) / batch_size),
epochs=30,
validation_data = validation_data_gen,
validation_steps = 1,
verbose=1, callbacks=[tboard])
IMAGE PLUS TEXT MODEL -
- In this model, we shall leverage both the image data and the textual data.
- The image plus text model is essentially a concatenation of the embeddings from the image model and the embeddings from the text model.
- For a fair comparison, the image and text branches use exactly the same architectures as the models trained in the previous cases.
def build_image_text_model(text_input_shape =(26,), image_input_shape = (256,256,3), pretrained_model = 'vgg19', classes=245):
print('received pretrained model %s'%pretrained_model)
vis_input = Input(shape = text_input_shape, name = "vis_input")
if pretrained_model == 'inception':
pretrained_model = InceptionV3(
include_top=False,
input_shape=image_input_shape,
weights='imagenet'
)
elif pretrained_model == 'xception':
pretrained_model = Xception(
include_top=False,
input_shape=image_input_shape,
weights='imagenet'
)
elif pretrained_model == 'resnet50':
pretrained_model = ResNet50(
include_top=False,
input_shape=image_input_shape,
weights='imagenet'
)
elif pretrained_model == 'vgg19':
pretrained_model = VGG19(
include_top=False,
input_shape=image_input_shape,
weights='imagenet'
)
elif pretrained_model == 'all':
input = Input(shape=image_input_shape)
inception_model = InceptionV3(
include_top=False,
input_tensor=input,
weights='imagenet'
)
xception_model = Xception(
include_top=False,
input_tensor=input,
weights='imagenet'
)
resnet_model = ResNet50(
include_top=False,
input_tensor=input,
weights='imagenet'
)
flattened_outputs = [Flatten()(inception_model.output),
Flatten()(xception_model.output),
Flatten()(resnet_model.output)]
output = Concatenate()(flattened_outputs)
pretrained_model = Model(input, output)
# We can select from inception, xception, resnet50, vgg19, or a combination of the first three as the basis for our image classifier. We specify include_top=False in these models in order to remove the top level classification layers. These are the layers used to classify images into the categories of the ImageNet competition; since our categories are different, we can remove these top layers and replace them with our own.
if pretrained_model.output.shape.ndims > 2:
output = Flatten()(pretrained_model.output)
else:
output = pretrained_model.output
output = BatchNormalization()(output)
output = Dropout(0.2)(output)
output = Dense(128, activation='relu')(output)
output = BatchNormalization()(output)
output = Dropout(0.2)(output)
output = Dense(256, activation='relu')(output)
output = BatchNormalization()(output)
output = Dropout(0.2)(output)
text_emb = Embedding(embedding_matrix.shape[0], embedding_matrix.shape[1], weights=[embedding_matrix], trainable=False)(vis_input)
text_emb = Conv1D(128, 3, padding='same')(text_emb)
text_emb = Activation('relu')(text_emb)
text_emb = MaxPooling1D(2)(text_emb)
text_emb = Conv1D(256, 3, padding='same')(text_emb)
text_emb = Activation('relu')(text_emb)
text_emb = MaxPooling1D(2)(text_emb)
text_emb = Dropout(0.2)(text_emb)
text_emb = Flatten()(text_emb)
text_emb = Dense(512, kernel_regularizer=regularizers.l2(weight_decay))(text_emb)
text_emb = Activation('relu')(text_emb)
img_plus_text_emb = concatenate([output,text_emb],axis=-1)
final_output = Dense(classes, activation='softmax')(img_plus_text_emb)
model = Model(inputs = [vis_input, pretrained_model.input], outputs = final_output, name = "model")
for layer in pretrained_model.layers:
layer.trainable = False
model.summary(line_length=200)
opt = SGD(lr=0.02)
model.compile(optimizer= 'rmsprop',
loss='categorical_crossentropy',
metrics=['accuracy'])
return model
tboard = keras.callbacks.TensorBoard(log_dir='./logs', histogram_freq=0, batch_size=32, write_graph=True, write_grads=False,
write_images=False, embeddings_freq=0, embeddings_layer_names=None, embeddings_metadata=None, embeddings_data=None)
validation_data_gen = prepare_training_generators(test_data, len(unique_classes), chunk_size = len(test_data) ,channel_orientation = "channels_last")
train_data_gen = prepare_training_generators(train_data, len(unique_classes), chunk_size = batch_size, channel_orientation = "channels_last")
image_text_model = build_image_text_model(text_input_shape =(max_seq_len,), image_input_shape = (256, 256, 3) ,classes=len(unique_classes))
history2 = image_text_model.fit_generator(train_data_gen,
steps_per_epoch=np.ceil(len(train_data) / batch_size),
epochs=30,
validation_data = validation_data_gen,
validation_steps = 1,
verbose=1, callbacks=[tboard])
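With all three models trained, their validation curves can be compared side by side. This is a sketch over the history, history3 and history2 objects, with the same metric-key caveat as before.
## compare validation accuracy across the three models
val_key = 'val_acc' if 'val_acc' in history.history else 'val_accuracy'
plt.figure(figsize=(8, 4))
plt.plot(history.history[val_key], label='image only')
plt.plot(history3.history[val_key], label='text only')
plt.plot(history2.history[val_key], label='image + text')
plt.xlabel('epoch')
plt.ylabel('validation accuracy')
plt.legend()
plt.show()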
Evaluation and Predictions -
In this section, let's evaluate all the trained models and summarise the results.
- Start off by initializing generators for evaluation and testing. Since we have very little data to work with, we shall be using the same data (test_data) for evaluation as well as testing. Ideally, the entire dataset should be split into 3 datasets, viz. train, validation and test. It is important to note that the test data should be used only for computing the final estimate of accuracy.
- Next, we shall leverage the evaluate_generator and predict_generator functions for summarising the results.
validation_data_gen = prepare_training_generators(test_data, len(unique_classes), chunk_size = len(test_data) , channel_orientation = "channels_last")
test_data_gen = prepare_test_generators(test_data, len(unique_classes), chunk_size = len(test_data) , channel_orientation = "channels_last")
image_only_score = image_only_model.evaluate_generator(validation_data_gen, 1, use_multiprocessing=False)
print("image only model - Loss: ", image_only_score[0], "Accuracy: ", image_only_score[1])
text_only_score = text_only_model.evaluate_generator(validation_data_gen, 1, use_multiprocessing=False)
print("Text only model Loss: ", text_only_score[0], "Accuracy: ", text_only_score[1])
image_text_score = image_text_model.evaluate_generator(validation_data_gen, 1, use_multiprocessing=False)
print("Image Text model Loss: ", image_text_score[0], "Accuracy: ", image_text_score[1])
Classification Report -¶
y_pred_text_raw = text_only_model.predict_generator(test_data_gen, steps = 1)
y_pred_image_text_raw = image_text_model.predict_generator(test_data_gen, steps = 1)
y_pred_text = np.argmax(y_pred_text_raw, axis=-1)
y_pred_image_text = np.argmax(y_pred_image_text_raw, axis=-1)
y_test_labels = le.transform(test_data['label'])
y_pred_text_readable = le.inverse_transform(y_pred_text)
y_pred_image_text_readable = le.inverse_transform(y_pred_image_text)
y_true_readable = le.inverse_transform(y_test_labels)
print("classification report for text only model is - ")
print(classification_report(y_true_readable, y_pred_text_readable))
print("classification report for image plus text model is -")
print(classification_report(y_true_readable, y_pred_image_text_readable))
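A confusion matrix gives a more granular view of where the two models differ, in particular whether the iphone/apple ambiguity gets resolved. The sketch below uses scikit-learn's confusion_matrix on the arrays computed above.
## confusion matrices for the text only and the image plus text models
from sklearn.metrics import confusion_matrix
print("confusion matrix for the text only model -")
print(pd.DataFrame(confusion_matrix(y_true_readable, y_pred_text_readable, labels=le.classes_),
                   index=le.classes_, columns=le.classes_))
print("confusion matrix for the image plus text model -")
print(pd.DataFrame(confusion_matrix(y_true_readable, y_pred_image_text_readable, labels=le.classes_),
                   index=le.classes_, columns=le.classes_))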
HURRAY !!!¶
It can be clearly seen that the pure text-based model is the poorest performer, whereas the image plus text model is the best performing model.
Let's summarise some of the findings -
- Even so, the pure text-based model puts in a decent performance and trains faster than the other models, especially considering that an anomaly was intentionally induced by using the same metadata for Apple the fruit and Apple the iPhone.
- In my opinion, even a pure text-based model could do much better with more training data. Having said that, that is the whole point of this exercise - to check whether we can model the problem better by leveraging all the information at our disposal. Such a model is typically beneficial when only 'limited' training data is available.
- It was always going to be an uphill task for a pure image-based model, which was evident from the visualization exercise done at the start of this article. The fruits selected for training the model are visually very similar.
- I was expecting the image plus text model to do a little better than 85%, but it is nevertheless a sizeable gain in performance.
- If you closely observe the epoch accuracies, you can see that model training starts off slowly, with training accuracies in the range of 14% - 30% for the first 5-8 epochs. After that, the model takes a huge leap in training accuracy.
- It can be seen that the pitfalls of a pure text-based model are overcome by the image plus text model, which has successfully learnt to rely on pixel data when it comes to apples. I feel that's awesome!
References -
[1] https://en.wikipedia.org/wiki/Apple
[2] https://www.britannica.com/plant/apple-fruit-and-tree
[3] https://food.ndtv.com/food-drinks/apple-fruit-benefits-8-incredible-health-benefits-of-apple-that-you-may-not-have-known-1761603
[4] https://freecontent.manning.com/deep-learning-for-text/
[5] https://nlp.stanford.edu/projects/glove/