Product Classification Using Image and Text (Deep Learning)

by Pritish Jadhav, Mrunal Jadhav - Wed, 10 Oct 2018
Tags: #python

PRODUCT CLASSIFICATION USING IMAGE AND TEXT (DEEP LEARNING)

Image classification using deep learning has been proving its worth across business verticals. However, building an image-centric AI product is often marred by -

  1. Unavailability of large amounts of data.
  2. Poor quality of images.
  3. Even if data is available, the training times for achieving convergence are way too high.
  4. Often metadata and other information associated with the images is not utilized.

In this article, we shall explore an approach that can leverage the metadata and may even achieve better results.

About the Dataset -

  1. I have customized the FIDS30 dataset for this article to better suit the problem statement.

  2. For the sake of this tutorial, let's consider only 4 fruit categories from the FIDS30 dataset, viz. Apples, Pears, Peaches and Plums.

  3. The reason for shortlisting the above-mentioned fruit categories is that these fruits are visually similar, and it will be interesting to see how a deep learning algorithm like a CNN handles them.

  4. In addition to this, each of the images has metadata associated with it which describes the image. The metadata for the fruit categories has been scraped from Wikipedia. Please refer to the appendix for more details.

  5. To make things interesting, an additional category has been added to the dataset - 'Iphone'. The metadata for iphone images is the same as that of apples. Such noise in textual data will cause purely text-based models to fail. For example, when the metadata for an image is something like "Apple sales in India have increased by 30% in the last 2 years", it is very difficult to know whether the text is referring to Apple the fruit or Apple the iPhone. In such circumstances, ML models based purely on textual data struggle.

To make things clear, let's explore the data!

Let's start by loading the Python libraries needed to train the deep learning model.

In [1]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "none"

from IPython.display import display
from IPython.display import HTML
display(HTML('<style>.prompt{width: 0px; min-width: 0px; visibility: collapse}</style>'))
display(HTML("<style>.container { width:100% !important; }</style>"))

import os, sys, glob, re, codecs

import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
pd.set_option('display.max_colwidth', -1)

import matplotlib.pyplot as plt

from sklearn import preprocessing
from sklearn.utils import shuffle
from sklearn.metrics import classification_report

## nltk library for working with textual data.
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer

## import custom functions from shared library
from shared import utils

import keras
from keras import backend as K
from keras import optimizers, regularizers
from keras.models import Sequential, Model
from keras.layers import (Dense, Dropout, Activation, 
                          Flatten, Embedding, Conv1D, 
                          MaxPooling1D, GlobalMaxPooling1D,
                          concatenate, Conv2D, MaxPooling2D, ZeroPadding2D,
                         Input)
from keras.layers.normalization import BatchNormalization
from keras.preprocessing import sequence, image
from keras.preprocessing.text import Tokenizer
from keras.callbacks import EarlyStopping
from keras.utils import to_categorical
from keras.optimizers import SGD


## load pretrained weights
from keras.applications.inception_v3 import InceptionV3
from keras.applications.xception import Xception
from keras.applications.resnet50 import ResNet50
from keras.applications.vgg19 import VGG19


np.random.seed(0)

Using TensorFlow backend.

Set the Training Data Directory

In [2]:
train_data_basdir = os.path.join('.', 'data', 'image_text_data', 'raw_metadata')

Read a CSV file with the following columns -

  1. image_filepaths - filepaths for the downloaded images.
  2. metadata - textual data describing the image.
  3. label - Target label for the model to be trained on.
In [3]:
raw_data = pd.read_csv(os.path.join(train_data_basdir, 'server_images_text_raw_metadata1.csv'))

print('\n')
print("There are %d training images"%len(raw_data))

display(HTML('<font size=2>'+raw_data.head().to_html()+'</font>'))

There are 143 training images
image_filepaths metadata label
0 ./data/image_text_data/train/FIDS30/apples/41.jpg An apple is a sweet, edible fruit produced by an apple tree (Malus pumila). apple trees are cultivated worldwide, and are the most widely grown species in the genus Malus. apple have also been linked to enhancing brain power. apple up the acetylcholine production. apple a great source of water and fiber that act as cleansing agents.\nApples are frequently used as a pastry filling, apple pie being perhaps the archetypal American dessert. Especially in Europe, fried apple characteristically accompany certain dishes of sausage or pork. apples
1 ./data/image_text_data/train/FIDS30/apples/14.jpg The apple tree originated in Central Asia, where its wild ancestor, Malus sieversii, is still found today. apple have been grown for thousands of years in Asia and Europe, and were brought to North America by European colonists. A typical apple serving weighs 242 grams and provides 126 calories with a moderate content of dietary fiber (table). Otherwise, there is generally low content of essential nutrients (table). apple can be consumed various ways: juice, raw in salads, baked in pies, cooked into sauces and spreads like apple butter, and other baked dishes. Cider apple are typically too tart and astringent to eat fresh, but they give the beverage a rich flavor that dessert apple cannot. apple are often eaten raw. apples
2 ./data/image_text_data/train/FIDS30/apples/3.jpg apple have religious and mythological significance in many cultures, including Norse, Greek and European Christian traditions. apple trees are large if grown from seed. Generally apple cultivars are propagated by grafting onto rootstocks, which control the size of the resulting tree. Sliced apple consumption tripled in the US from 2004 to 2014 to 500 million apple annually due to its convenience.\nOrganic apple are commonly produced in the United States.Due to infestations by key insects and diseases, organic production is difficult in Europe. A light coating of kaolin, which forms a physical barrier to some pests, also may help prevent apple sun scalding. The soils in which apple trees grow must be well drained; fertilizers can be used if the yield is not high enough. Rolling hilltops or the sloping sides of hills are preferred because they provide “air drainage,” allowing the colder, heavier air to drain away to the valley below during frosty spring nights, when blossoms or young fruit would be destroyed by exposure to cold. apples
3 ./data/image_text_data/train/FIDS30/apples/47.jpg The apple is a deciduous tree, generally standing 6 to 15 ft (1.8 to 4.6 m) tall in cultivation and up to 30 ft (9.1 m) in the wild. When cultivated, the size, shape and branch density are determined by rootstock selection and trimming method. The leaves are alternately arranged dark green-colored simple ovals with serrated margins and slightly downy undersides. Phlorizin is a flavonoid that is found in apple trees, particularly in the leaves, and in only small amounts if at all in other plants, even other species of the genus Malus. Sliced apple consumption tripled in the US from 2004 to 2014 to 500 million apple annually due to its convenience.Since the apple requires a considerable period of dormancy, it thrives in areas having a distinct winter period, generally from latitude 30° to 60°, both north and south. Northward, apple growing is limited by low winter temperatures and a short growing season. apples
4 ./data/image_text_data/train/FIDS30/apples/25.jpg The fruit matures in late summer or autumn, and cultivars exist with a wide range of sizes. Commercial growers aim to produce an apple that is 2 3⁄4 to 3 1⁄4 in (7.0 to 8.3 cm) in diameter, due to market preference. Some consumers, especially those in Japan, prefer a larger apple, while apple below 2 1⁄4 in (5.7 cm) are generally used for making juice and have little fresh market value. The skin of ripe apple is generally red, yellow, green, pink, or russetted although many bi- or tri-colored cultivars may be found. The skin may also be wholly or partly russeted i.e. rough and brown. The skin is covered in a protective layer of epicuticular wax. The exocarp (flesh) is generally pale yellowish-white, though pink or yellow. exocarps also occur. Since the apple requires a considerable period of dormancy, it thrives in areas having a distinct winter period, generally from latitude 30° to 60°, both north and south. Northward, apple growing is limited by low winter temperatures and a short growing season. A certain favanoid phlorizin, found in apple skin, may help prevent bone loss associated with menopause, as it fights the inflammation and free radical production that leads to bone degeneration. apple are a rich source of various phytochemicals including flavonoids (e.g., catechins, flavanols, and quercetin) and other phenolic compounds (e.g., epicatechin and procyanidins) found in the skin, core, and pulp of the apple; they have unknown health value in humans. apples

Category Distribution -

A quick check on how the training dataset is distributed across the target categories will help us discover any skewness in the data.

In [4]:
grouped_data = raw_data.groupby('label').agg({'image_filepaths': 'count'})
grouped_data = grouped_data.rename(columns={'image_filepaths': 'data_count'}).reset_index()

display(grouped_data)
label data_count
0 apples 38
1 iphone 15
2 peaches 27
3 pears 32
4 plums 31

It can be seen that there are almost half as many iphone images as images in the other categories. Let's make a note of it and check its impact while evaluating the results.
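
Since the iphone class is underrepresented, one option worth keeping in mind is to weight the classes during training. The snippet below is only a sketch (it is not a cell from the original run) and assumes scikit-learn's compute_class_weight utility is available; the resulting dictionary could be passed to fit_generator via its class_weight argument.

from sklearn.utils.class_weight import compute_class_weight

## sketch (not an original cell): compute 'balanced' class weights from the label column
class_names = np.sort(raw_data['label'].unique())
weights = compute_class_weight(class_weight='balanced', classes=class_names, y=raw_data['label'])
class_weight_dict = dict(enumerate(weights))
print(class_weight_dict)   ## rarer classes such as iphone receive larger weights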

Next, let's quickly visualize the images in our dataset.

In [6]:
## fix the height and width of the image to be displayed
height, width =10, 10

# decide the number of images to be displayed from each category
columns = 5


# Find the number of unique categories in the training dataset
rows = raw_data['label'].nunique()
# simple function for shuffling the data.
def get_shuffled_data(grouped_data, n = 5):
    shuffled_data = shuffle(grouped_data).head(n)
    return shuffled_data

# Initialize subplots using matplotlib
fig, axes = plt.subplots(nrows=rows, ncols=columns, figsize=(width,height), sharex = True, sharey = True);

sampled_data = raw_data.groupby('label', as_index = False).apply(get_shuffled_data, columns).reset_index(drop = True)
labels = sampled_data['label'].unique()
        
for index in range((columns*rows)):
    ax = fig.add_subplot(rows,columns,index+1)
    image_array = image.load_img(sampled_data.loc[index, 'image_filepaths'])
    ax.axis('off')
    ax.imshow(image_array, cmap='gray', interpolation='nearest')

    
for ax, row in zip(axes[:, 0], labels):
    ax.set_ylabel(row)
    ax.set_yticks([])
    
for ax, row in zip(axes[0, :], labels):
    ax.set_xticks([])

Visual Ambiguity -

  1. It can be seen that the images are quite visually similar. The green apples are very similar to pears.
  2. A closeup image of a peach looks almost like an apple.
  3. Many images contain more than one fruit.
  4. Certain images of peaches and plums are very difficult to distinguish.

All in all, the above set of sample images shows that modeling such a dataset will be a challenge.

Also, we will have to convert the labels into one-hot encoded vectors. Let's get that out of the way using the piece of code below.

In [7]:
le = preprocessing.LabelEncoder()
unique_classes = raw_data['label'].unique()
Y = le.fit_transform(raw_data['label'])

Y = to_categorical(Y, num_classes=len(unique_classes))
print("There are %d training data points and %d categories"%(len(Y), Y.shape[1]))
There are 143 training data points and 5 categories
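
As a quick sanity check (not part of the original flow), the one-hot vectors can be mapped back to the original string labels through the fitted LabelEncoder:

## map the one-hot rows back to string labels via the fitted LabelEncoder
decoded_labels = le.inverse_transform(np.argmax(Y, axis=1))
print(decoded_labels[:3])   ## first few rows - should match raw_data['label'].head(3)
print(le.classes_)          ## the label ordering used for the one-hot columns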

Now that we have explored and investigated the image data, let's turn our focus to the textual metadata at our disposal.

In [8]:
display(shuffle(raw_data[['metadata', 'label']]).head())
metadata label
11 apple are self-incompatible; they must cross-pollinate to develop fruit. During the flowering each season, apple growers often utilize pollinators to carry pollen. The apple is a deciduous tree, generally standing 6 to 15 ft (1.8 to 4.6 m) tall in cultivation and up to 30 ft (9.1 m) in the wild. When cultivated, the size, shape and branch density are determined by rootstock selection and trimming method. The leaves are alternately arranged dark green-colored simple ovals with serrated margins and slightly downy undersides. Other desired qualities in modern commercial apple breeding are a colorful skin, absence of russeting, ease of shipping, lengthy storage ability, high yields, disease resistance, common apple shape, and developed flavor. apples
37 apple have also been linked to enhancing brain power. apple up the acetylcholine production. apple a great source of water and fiber that act as cleansing agents. apples
21 Organic apple are commonly produced in the United States.Due to infestations by key insects and diseases, organic production is difficult in Europe. A light coating of kaolin, which forms a physical barrier to some pests, also may help prevent apple sun scalding.Many apple grow readily from seeds. However, more than with most perennial fruits, apple must be propagated asexually by grafting to obtain the sweetness and other desirable characteristics of the parent. This is because seedling apple are an example of "extreme heterozygotes", in that rather than inheriting genes from their parents to create a new apple with parental characteristics, they are instead significantly different from their parents, perhaps to compete with the many pests. apple trees are cultivated worldwide, and are the most widely grown species in the genus Malus. apples
46 peach can be broadly classified into two varieties- free stone variety where the seed is free in the center of the fruit and clinging seed variety where the seed is firmly attached to the pulp. A peach is a soft, juicy and fleshy stone fruit produced by a peach tree. consuming peach is a great way of improving your beta carotene levels to maintain healthy eyes. The outer surface of a peach is fuzzy and features longitudinal depressions extending from the stem to the tip. peaches
14 apple are often eaten raw. Cultivars bred for raw consumption are termed dessert or table apple. In the UK, a toffee apple is a traditional confection made by coating an apple in hot toffee and allowing it to cool. Similar treats in the U.S. are candy apple (coated in a hard shell of crystallized sugar syrup), and caramel apple (coated with cooled caramel). apple are a rich source of various phytochemicals including flavonoids (e.g., catechins, flavanols, and quercetin) and other phenolic compounds (e.g., epicatechin and procyanidins) found in the skin, core, and pulp of the apple; they have unknown health value in humans. apples

Need for Text Cleaning -

It can be seen that the text data consists of -

a. stop words
b. mix of lower and upper case characters
c. punctuations

Let's clean the textual data.

In [9]:
MAX_NB_WORDS = 100000
tokenizer = RegexpTokenizer(r'\w+')
stop_words = set(stopwords.words('english'))
stop_words.update(['.', ',', '"', "'", ':', ';', '(', ')', '[', ']', '{', '}'])
re_pattern = re.compile(u'[^\u0000-\uD7FF\uE000-\uFFFF]', re.UNICODE)  
In [10]:
print("pre-processing train data...")
processed_docs_train = []

def get_cleaned_text(raw_string):
    try:
        ## tokenize, drop stop words (the comparison happens before lowercasing),
        ## and lowercase the surviving tokens
        tokens = tokenizer.tokenize(raw_string)
        filtered = ' '.join([word.lower() for word in tokens if word not in stop_words])
        return filtered
    except Exception as e:
        print(e)
        return ""
        
raw_data['filtered_text'] = raw_data['metadata'].apply(get_cleaned_text)
display(raw_data.head())
pre-processing train data...
image_filepaths metadata label filtered_text
0 ./data/image_text_data/train/FIDS30/apples/41.jpg An apple is a sweet, edible fruit produced by an apple tree (Malus pumila). apple trees are cultivated worldwide, and are the most widely grown species in the genus Malus. apple have also been linked to enhancing brain power. apple up the acetylcholine production. apple a great source of water and fiber that act as cleansing agents.\nApples are frequently used as a pastry filling, apple pie being perhaps the archetypal American dessert. Especially in Europe, fried apple characteristically accompany certain dishes of sausage or pork. apples an apple sweet edible fruit produced apple tree malus pumila apple trees cultivated worldwide widely grown species genus malus apple also linked enhancing brain power apple acetylcholine production apple great source water fiber act cleansing agents apples frequently used pastry filling apple pie perhaps archetypal american dessert especially europe fried apple characteristically accompany certain dishes sausage pork
1 ./data/image_text_data/train/FIDS30/apples/14.jpg The apple tree originated in Central Asia, where its wild ancestor, Malus sieversii, is still found today. apple have been grown for thousands of years in Asia and Europe, and were brought to North America by European colonists. A typical apple serving weighs 242 grams and provides 126 calories with a moderate content of dietary fiber (table). Otherwise, there is generally low content of essential nutrients (table). apple can be consumed various ways: juice, raw in salads, baked in pies, cooked into sauces and spreads like apple butter, and other baked dishes. Cider apple are typically too tart and astringent to eat fresh, but they give the beverage a rich flavor that dessert apple cannot. apple are often eaten raw. apples the apple tree originated central asia wild ancestor malus sieversii still found today apple grown thousands years asia europe brought north america european colonists a typical apple serving weighs 242 grams provides 126 calories moderate content dietary fiber table otherwise generally low content essential nutrients table apple consumed various ways juice raw salads baked pies cooked sauces spreads like apple butter baked dishes cider apple typically tart astringent eat fresh give beverage rich flavor dessert apple cannot apple often eaten raw
2 ./data/image_text_data/train/FIDS30/apples/3.jpg apple have religious and mythological significance in many cultures, including Norse, Greek and European Christian traditions. apple trees are large if grown from seed. Generally apple cultivars are propagated by grafting onto rootstocks, which control the size of the resulting tree. Sliced apple consumption tripled in the US from 2004 to 2014 to 500 million apple annually due to its convenience.\nOrganic apple are commonly produced in the United States.Due to infestations by key insects and diseases, organic production is difficult in Europe. A light coating of kaolin, which forms a physical barrier to some pests, also may help prevent apple sun scalding. The soils in which apple trees grow must be well drained; fertilizers can be used if the yield is not high enough. Rolling hilltops or the sloping sides of hills are preferred because they provide “air drainage,” allowing the colder, heavier air to drain away to the valley below during frosty spring nights, when blossoms or young fruit would be destroyed by exposure to cold. apples apple religious mythological significance many cultures including norse greek european christian traditions apple trees large grown seed generally apple cultivars propagated grafting onto rootstocks control size resulting tree sliced apple consumption tripled us 2004 2014 500 million apple annually due convenience organic apple commonly produced united states due infestations key insects diseases organic production difficult europe a light coating kaolin forms physical barrier pests also may help prevent apple sun scalding the soils apple trees grow must well drained fertilizers used yield high enough rolling hilltops sloping sides hills preferred provide air drainage allowing colder heavier air drain away valley frosty spring nights blossoms young fruit would destroyed exposure cold
3 ./data/image_text_data/train/FIDS30/apples/47.jpg The apple is a deciduous tree, generally standing 6 to 15 ft (1.8 to 4.6 m) tall in cultivation and up to 30 ft (9.1 m) in the wild. When cultivated, the size, shape and branch density are determined by rootstock selection and trimming method. The leaves are alternately arranged dark green-colored simple ovals with serrated margins and slightly downy undersides. Phlorizin is a flavonoid that is found in apple trees, particularly in the leaves, and in only small amounts if at all in other plants, even other species of the genus Malus. Sliced apple consumption tripled in the US from 2004 to 2014 to 500 million apple annually due to its convenience.Since the apple requires a considerable period of dormancy, it thrives in areas having a distinct winter period, generally from latitude 30° to 60°, both north and south. Northward, apple growing is limited by low winter temperatures and a short growing season. apples the apple deciduous tree generally standing 6 15 ft 1 8 4 6 tall cultivation 30 ft 9 1 wild when cultivated size shape branch density determined rootstock selection trimming method the leaves alternately arranged dark green colored simple ovals serrated margins slightly downy undersides phlorizin flavonoid found apple trees particularly leaves small amounts plants even species genus malus sliced apple consumption tripled us 2004 2014 500 million apple annually due convenience since apple requires considerable period dormancy thrives areas distinct winter period generally latitude 30 60 north south northward apple growing limited low winter temperatures short growing season
4 ./data/image_text_data/train/FIDS30/apples/25.jpg The fruit matures in late summer or autumn, and cultivars exist with a wide range of sizes. Commercial growers aim to produce an apple that is 2 3⁄4 to 3 1⁄4 in (7.0 to 8.3 cm) in diameter, due to market preference. Some consumers, especially those in Japan, prefer a larger apple, while apple below 2 1⁄4 in (5.7 cm) are generally used for making juice and have little fresh market value. The skin of ripe apple is generally red, yellow, green, pink, or russetted although many bi- or tri-colored cultivars may be found. The skin may also be wholly or partly russeted i.e. rough and brown. The skin is covered in a protective layer of epicuticular wax. The exocarp (flesh) is generally pale yellowish-white, though pink or yellow. exocarps also occur. Since the apple requires a considerable period of dormancy, it thrives in areas having a distinct winter period, generally from latitude 30° to 60°, both north and south. Northward, apple growing is limited by low winter temperatures and a short growing season. A certain favanoid phlorizin, found in apple skin, may help prevent bone loss associated with menopause, as it fights the inflammation and free radical production that leads to bone degeneration. apple are a rich source of various phytochemicals including flavonoids (e.g., catechins, flavanols, and quercetin) and other phenolic compounds (e.g., epicatechin and procyanidins) found in the skin, core, and pulp of the apple; they have unknown health value in humans. apples the fruit matures late summer autumn cultivars exist wide range sizes commercial growers aim produce apple 2 3 4 3 1 4 7 0 8 3 cm diameter due market preference some consumers especially japan prefer larger apple apple 2 1 4 5 7 cm generally used making juice little fresh market value the skin ripe apple generally red yellow green pink russetted although many bi tri colored cultivars may found the skin may also wholly partly russeted e rough brown the skin covered protective layer epicuticular wax the exocarp flesh generally pale yellowish white though pink yellow exocarps also occur since apple requires considerable period dormancy thrives areas distinct winter period generally latitude 30 60 north south northward apple growing limited low winter temperatures short growing season a certain favanoid phlorizin found apple skin may help prevent bone loss associated menopause fights inflammation free radical production leads bone degeneration apple rich source various phytochemicals including flavonoids e g catechins flavanols quercetin phenolic compounds e g epicatechin procyanidins found skin core pulp apple unknown health value humans

Converting Raw Text to Numeric Vector Representation -

As we know, deep learning models cannot comprehend textual words in the human sense. They can only work with numeric vectors. The following section highlights the process of converting raw text into vector representation.

Consider a sample training data with just 2 data points -
sample_data = ['The fruit matures in late summer or autumn', 'The skin of ripe apples is generally red, yellow, green, pink, or russetted although many bi- or tri-colored cultivars may be found']

1. Initialize a tokenizer for breaking down the raw text into tokens. In general, tokenizers are designed to work with words (n-grams) or with characters. For the sake of this article, we shall work with word level tokens. Feel free to experiment with character level tokenization.
[in] sample_tokenizer = Tokenizer(num_words=20, lower=True, char_level=False)

2. Fit the tokenization model on the raw text to build a dictionary. The model indexes the words in the raw text and assigns each one an integer value (more frequent words get smaller indices).
[in] sample_tokenizer.fit_on_texts(sample_data)
[in] print(sample_tokenizer.word_index)
[out] {'summer': 7, 'ripe': 11, 'is': 13, 'in': 5, 'yellow': 16, 'autumn': 8, 'skin': 9, 'pink': 18, 'matures': 4, 'cultivars': 25, 'generally': 14, 'late': 6, 'russetted': 19, 'red': 15, 'be': 27, 'may': 26, 'bi': 22, 'fruit': 3, 'although': 20, 'tri': 23, 'colored': 24, 'of': 10, 'green': 17, 'apples': 12, 'many': 21, 'found': 28, 'the': 2, 'or': 1}

3. Convert each training example into a sequence of integers.
[in] word_sequence = sample_tokenizer.texts_to_sequences(sample_data)
[in] print(word_sequence)
[out] [[2, 3, 4, 5, 6, 7, 1, 8], [2, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 1, 19, 20, 21, 22, 1, 23, 24, 25, 26, 27, 28]]

4. Sequence padding
As can be seen from the output of step 3, the lengths of the word sequences are not uniform. Therefore, we cannot feed the output of step 3 directly to our deep learning model.
To solve this, let's pad the sequences to a fixed length using a dummy token as follows -

[in] sequence.pad_sequences(word_sequence, maxlen=20)
[out] [[ 0 0 0 0 0 0 0 0 0 0 0 0 2 3 4 5 6 7 1 8] [11 12 13 14 15 16 17 18 1 19 20 21 22 1 23 24 25 26 27 28]]

For more information, please check out this blog
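
Putting the four steps together, here is the same walkthrough as a single runnable snippet, using the sample_data list and the Tokenizer / sequence utilities already imported above:

## steps 1-4 above, end to end
sample_tokenizer = Tokenizer(num_words=20, lower=True, char_level=False)   ## 1. initialize the tokenizer
sample_tokenizer.fit_on_texts(sample_data)                                 ## 2. build the vocabulary
word_sequence = sample_tokenizer.texts_to_sequences(sample_data)           ## 3. text -> integer sequences
padded = sequence.pad_sequences(word_sequence, maxlen=20)                  ## 4. pad to a fixed length
print(padded.shape)                                                        ## (2, 20) - uniform length input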

In [11]:
print("tokenizing input data...")
tokenizer1 = Tokenizer(num_words=MAX_NB_WORDS, lower=True, char_level=False)

## fit tokenization model
tokenizer1.fit_on_texts(raw_data['filtered_text'])

## convert raw text into a sequence of words
word_seq_train = tokenizer1.texts_to_sequences(raw_data['filtered_text'])

## heuristic for deciding the max length for padding text sequences
raw_data['doc_len'] = raw_data['filtered_text'].apply(lambda sentence: len(sentence.split(' ')))
# max_seq_len = np.round(raw_data['doc_len'].mean() + raw_data['doc_len'].std()).astype(int)
max_seq_len = raw_data['doc_len'].max()
print("The length of the input text will be capped off at %d"%max_seq_len)
## pad text input
word_seq_train = sequence.pad_sequences(word_seq_train, maxlen=max_seq_len)

## recover the word index 
word_index = tokenizer1.word_index
tokenizing input data...
The length of the input text will be capped off at 177

We shall be using the pre-trained GloVe model to generate word embeddings for the tokens in the training data.

In [12]:
#training params
batch_size = 25
num_epochs = 10

#model parameters
num_filters = 64 
embed_dim = 100
weight_decay = 1e-4


#load embeddings
print('loading word embeddings...')
embeddings_index = {}
f = codecs.open(os.path.join('.', 'data', 'image_text_data' ,'glove.6B.100d.txt'), encoding='utf-8')
for line in f:
    values = line.rstrip().rsplit(' ')
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
f.close()
print('found %s word vectors' % len(embeddings_index))

#embedding matrix
print('preparing embedding matrix...')
words_not_found = []
nb_words = min(MAX_NB_WORDS, len(word_index)+1)
embedding_matrix = np.zeros((nb_words, embed_dim))
for word, i in word_index.items():
    if i >= nb_words:
        continue
    embedding_vector = embeddings_index.get(word)

    if (embedding_vector is not None) and len(embedding_vector) > 0:
        # words not found in embedding index will be all-zeros.
        embedding_matrix[i] = embedding_vector
    else:
        words_not_found.append(word)
print('number of null word embeddings: %d' % np.sum(np.sum(embedding_matrix, axis=1) == 0))
loading word embeddings...
found 400000 word vectors
preparing embedding matrix...
number of null word embeddings: 68
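
As an optional sanity check (a small sketch, not part of the original notebook), the loaded embeddings_index can be queried directly to see whether semantically related words end up close together in GloVe space:

## optional: cosine similarity between a few word pairs from the loaded GloVe vectors
def cosine_similarity(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

for w1, w2 in [('apple', 'pear'), ('apple', 'iphone'), ('peach', 'plum')]:
    if w1 in embeddings_index and w2 in embeddings_index:
        print('%s / %s -> %.3f' % (w1, w2, cosine_similarity(embeddings_index[w1], embeddings_index[w2])))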
In [13]:
def get_image_representation(image_filepaths, channel_orientation = 'channels_first'):
    '''
    Function for reading images and scaling pixel values to [0, 1].
    input:
        image_filepaths: list of image filepaths
        channel_orientation - string value controlling the channel orientation ('channels_first' or 'channels_last')
    output:
        numpy array of images, shape (n_images, 256, 256, 3) for 'channels_last'
    '''
    image_representation = []
    for img_path in image_filepaths:
        img = image.load_img(img_path, target_size=(256, 256))
        img = image.img_to_array(img, data_format=channel_orientation)
        img = img/255.
        image_representation.append(img)
    return np.array(image_representation)
In [14]:
def prepare_training_generators(train_df, len_unique_classes ,chunk_size = 5, channel_orientation = "channels_first"):
    '''
    This function generates mini batches of training data in the form of an iterator.
    inputs:
        train_df: pandas dataframe of training data.
        len_unique_classes - integer value giving the number of unique target labels in the training data.
        chunk_size - integer value giving the mini batch size. The default value is set to 5.
        channel_orientation - string value representing the channel orientation of the images to be read.
    output:
        iterator of ([text_data, image_data], one-hot label) mini batches
    '''
    index_tracker = 0
    while True:
        text_x = tokenizer1.texts_to_sequences(train_df.iloc[index_tracker:index_tracker+chunk_size]['filtered_text'])
        text_x = sequence.pad_sequences(text_x, maxlen=max_seq_len)
        image_x = get_image_representation(train_df.iloc[index_tracker:index_tracker+chunk_size]['image_filepaths'], channel_orientation)
        y_le = le.transform(train_df.iloc[index_tracker:index_tracker+chunk_size]['label'])
        y = to_categorical(y_le, num_classes=len_unique_classes)
        index_tracker += chunk_size
        if index_tracker >= len(train_df):
            index_tracker = 0
            train_df = shuffle(train_df).reset_index(drop = True)
            
        yield [text_x,image_x],y

def prepare_test_generators(test_df, len_unique_classes ,chunk_size = 5, channel_orientation = "channels_first"):
    '''
    This function generates mini batches of test data in the form of an iterator.
    inputs:
        test_df: pandas dataframe of test data.
        len_unique_classes - integer value giving the number of unique target labels in the data.
        chunk_size - integer value giving the mini batch size. The default value is set to 5.
        channel_orientation - string value representing the channel orientation of the images to be read.
    output:
        iterator of [text_data, image_data] mini batches (no labels, used for prediction)
    '''
    index_tracker = 0
    while True:
        text_x = tokenizer1.texts_to_sequences(test_df.iloc[index_tracker:index_tracker+chunk_size]['filtered_text'])
        text_x = sequence.pad_sequences(text_x, maxlen=max_seq_len)
        image_x = get_image_representation(test_df.iloc[index_tracker:index_tracker+chunk_size]['image_filepaths'], channel_orientation)
        y_le = le.transform(test_df.iloc[index_tracker:index_tracker+chunk_size]['label'])
        y = to_categorical(y_le, num_classes=len_unique_classes)
        index_tracker += chunk_size
        if index_tracker >= len(test_df):
            index_tracker = 0
    
        yield [text_x,image_x]
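
Before wiring these generators into a model, it is worth pulling a single mini batch and confirming the shapes line up with what the model will expect. This is a quick sketch, not an original cell:

## pull one mini batch from the training generator and inspect the shapes
sample_gen = prepare_training_generators(raw_data, len(unique_classes),
                                         chunk_size=2, channel_orientation="channels_last")
[text_batch, image_batch], y_batch = next(sample_gen)
print(text_batch.shape)    ## (2, max_seq_len) - padded word index sequences
print(image_batch.shape)   ## (2, 256, 256, 3) - images in channels_last format
print(y_batch.shape)       ## (2, 5)           - one-hot labels for the 5 categories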
In [16]:
train_data = raw_data.groupby('label', group_keys=False).apply(lambda x: x.sample(frac = 0.8, random_state = 2))
test_data = raw_data.loc[~raw_data.index.isin(train_data.index)]
print("training data has %d rows"%len(train_data))
print("test_data has %d rows"%len(test_data))
training data has 115 rows
test_data has 28 rows
In [17]:
tdidf_tokenizer = Tokenizer(num_words=2000)
tdidf_tokenizer.fit_on_texts(raw_data['filtered_text'])

x_train = tdidf_tokenizer.texts_to_matrix(train_data['filtered_text'], mode='tfidf')
x_test = tdidf_tokenizer.texts_to_matrix(test_data['filtered_text'], mode='tfidf')

IMAGE ONLY MODEL -

  1. The first model that we will be fitting on our training data is an image-only model.
  2. The function for building an image-only model relies on the pretrained weights available with Keras.
  3. We shall be adding some dense layers to serve our purpose.
  4. Please note that the text input from the generators will be ignored in an image-only model.
In [28]:
def build_image_only_model(text_input_shape =(26,), image_input_shape = (256,256,3), pretrained_model = 'vgg19', classes=245):
    
    print('received pretrained model %s'%pretrained_model)
    vis_input = Input(shape = text_input_shape, name = "vis_input")
    if pretrained_model == 'inception':
        pretrained_model = InceptionV3(
            include_top=False,
            input_shape=image_input_shape,
            weights='imagenet'
        )
    elif pretrained_model == 'xception':
        pretrained_model = Xception(
            include_top=False,
            input_shape=image_input_shape,
            weights='imagenet'
        )
    elif pretrained_model == 'resnet50':
        pretrained_model = ResNet50(
            include_top=False,
            input_shape=image_input_shape,
            weights='imagenet'
        )
    elif pretrained_model == 'vgg19':
        pretrained_model = VGG19(
            include_top=False,
            input_shape=image_input_shape,
            weights='imagenet'
        )
    elif pretrained_model == 'all':
        input = Input(shape=image_input_shape)
        inception_model = InceptionV3(
            include_top=False,
            input_tensor=input,
            weights='imagenet'
        )
        xception_model = Xception(
            include_top=False,
            input_tensor=input,
            weights='imagenet'
        )
        resnet_model = ResNet50(
            include_top=False,
            input_tensor=input,
            weights='imagenet'
        )
        flattened_outputs = [Flatten()(inception_model.output),
                             Flatten()(xception_model.output),
                             Flatten()(resnet_model.output)]
        output = concatenate(flattened_outputs)
        pretrained_model = Model(input, output)
    '''
    We can select from inception, xception, resnet50, vgg19, or a combination of the first three as the basis for our image classifier. 
    We specify include_top=False in these models in order to remove the top level classification layers. 
    These are the layers used to classify images into the categories of the ImageNet competition; 
    since our categories are different, we shall remove these top layers and replace them with our own.
    '''
    if pretrained_model.output.shape.ndims > 2:
        output = Flatten()(pretrained_model.output)
    else:
        output = pretrained_model.output

    output = BatchNormalization()(output)
    output = Dropout(0.2)(output)
    output = Dense(128, activation='relu')(output)
    
    output = BatchNormalization()(output)
    output = Dropout(0.2)(output)
    output = Dense(256, activation='relu')(output)
    
    output = BatchNormalization()(output)
    output = Dropout(0.2)(output)
    output = Dense(classes, activation='softmax')(output)
    model = Model(inputs = [vis_input, pretrained_model.input], outputs = output, name = "model")
    for layer in pretrained_model.layers:
        layer.trainable = False
    model.summary(line_length=200)

    model.compile(optimizer='adam',
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model
In [29]:
'''
Build the data generators and the image-only model; the model summary is printed inside the build function.
'''
tboard = keras.callbacks.TensorBoard(log_dir='./logs', histogram_freq=0, batch_size=32, write_graph=True, write_grads=False,
                                       write_images=False, embeddings_freq=0, embeddings_layer_names=None, embeddings_metadata=None, embeddings_data=None)

validation_data_gen = prepare_training_generators(test_data, len(unique_classes), chunk_size = len(test_data) ,channel_orientation = "channels_last")
train_data_gen = prepare_training_generators(train_data, len(unique_classes), chunk_size = batch_size, channel_orientation = "channels_last")

image_only_model = build_image_only_model(text_input_shape =(max_seq_len,), image_input_shape = (256, 256, 3) ,classes=len(unique_classes))
received pretrained model vgg19
________________________________________________________________________________________________________________________________________________________________________________________________________
Layer (type)                                                                              Output Shape                                                                    Param #                       
========================================================================================================================================================================================================
input_3 (InputLayer)                                                                      (None, 256, 256, 3)                                                             0                             
________________________________________________________________________________________________________________________________________________________________________________________________________
block1_conv1 (Conv2D)                                                                     (None, 256, 256, 64)                                                            1792                          
________________________________________________________________________________________________________________________________________________________________________________________________________
block1_conv2 (Conv2D)                                                                     (None, 256, 256, 64)                                                            36928                         
________________________________________________________________________________________________________________________________________________________________________________________________________
block1_pool (MaxPooling2D)                                                                (None, 128, 128, 64)                                                            0                             
________________________________________________________________________________________________________________________________________________________________________________________________________
block2_conv1 (Conv2D)                                                                     (None, 128, 128, 128)                                                           73856                         
________________________________________________________________________________________________________________________________________________________________________________________________________
block2_conv2 (Conv2D)                                                                     (None, 128, 128, 128)                                                           147584                        
________________________________________________________________________________________________________________________________________________________________________________________________________
block2_pool (MaxPooling2D)                                                                (None, 64, 64, 128)                                                             0                             
________________________________________________________________________________________________________________________________________________________________________________________________________
block3_conv1 (Conv2D)                                                                     (None, 64, 64, 256)                                                             295168                        
________________________________________________________________________________________________________________________________________________________________________________________________________
block3_conv2 (Conv2D)                                                                     (None, 64, 64, 256)                                                             590080                        
________________________________________________________________________________________________________________________________________________________________________________________________________
block3_conv3 (Conv2D)                                                                     (None, 64, 64, 256)                                                             590080                        
________________________________________________________________________________________________________________________________________________________________________________________________________
block3_conv4 (Conv2D)                                                                     (None, 64, 64, 256)                                                             590080                        
________________________________________________________________________________________________________________________________________________________________________________________________________
block3_pool (MaxPooling2D)                                                                (None, 32, 32, 256)                                                             0                             
________________________________________________________________________________________________________________________________________________________________________________________________________
block4_conv1 (Conv2D)                                                                     (None, 32, 32, 512)                                                             1180160                       
________________________________________________________________________________________________________________________________________________________________________________________________________
block4_conv2 (Conv2D)                                                                     (None, 32, 32, 512)                                                             2359808                       
________________________________________________________________________________________________________________________________________________________________________________________________________
block4_conv3 (Conv2D)                                                                     (None, 32, 32, 512)                                                             2359808                       
________________________________________________________________________________________________________________________________________________________________________________________________________
block4_conv4 (Conv2D)                                                                     (None, 32, 32, 512)                                                             2359808                       
________________________________________________________________________________________________________________________________________________________________________________________________________
block4_pool (MaxPooling2D)                                                                (None, 16, 16, 512)                                                             0                             
________________________________________________________________________________________________________________________________________________________________________________________________________
block5_conv1 (Conv2D)                                                                     (None, 16, 16, 512)                                                             2359808                       
________________________________________________________________________________________________________________________________________________________________________________________________________
block5_conv2 (Conv2D)                                                                     (None, 16, 16, 512)                                                             2359808                       
________________________________________________________________________________________________________________________________________________________________________________________________________
block5_conv3 (Conv2D)                                                                     (None, 16, 16, 512)                                                             2359808                       
________________________________________________________________________________________________________________________________________________________________________________________________________
block5_conv4 (Conv2D)                                                                     (None, 16, 16, 512)                                                             2359808                       
________________________________________________________________________________________________________________________________________________________________________________________________________
block5_pool (MaxPooling2D)                                                                (None, 8, 8, 512)                                                               0                             
________________________________________________________________________________________________________________________________________________________________________________________________________
flatten_6 (Flatten)                                                                       (None, 32768)                                                                   0                             
________________________________________________________________________________________________________________________________________________________________________________________________________
batch_normalization_7 (BatchNormalization)                                                (None, 32768)                                                                   131072                        
________________________________________________________________________________________________________________________________________________________________________________________________________
dropout_10 (Dropout)                                                                      (None, 32768)                                                                   0                             
________________________________________________________________________________________________________________________________________________________________________________________________________
dense_12 (Dense)                                                                          (None, 128)                                                                     4194432                       
________________________________________________________________________________________________________________________________________________________________________________________________________
batch_normalization_8 (BatchNormalization)                                                (None, 128)                                                                     512                           
________________________________________________________________________________________________________________________________________________________________________________________________________
dropout_11 (Dropout)                                                                      (None, 128)                                                                     0                             
________________________________________________________________________________________________________________________________________________________________________________________________________
dense_13 (Dense)                                                                          (None, 256)                                                                     33024                         
________________________________________________________________________________________________________________________________________________________________________________________________________
batch_normalization_9 (BatchNormalization)                                                (None, 256)                                                                     1024                          
________________________________________________________________________________________________________________________________________________________________________________________________________
dropout_12 (Dropout)                                                                      (None, 256)                                                                     0                             
________________________________________________________________________________________________________________________________________________________________________________________________________
dense_14 (Dense)                                                                          (None, 5)                                                                       1285                          
========================================================================================================================================================================================================
Total params: 24,385,733
Trainable params: 4,295,045
Non-trainable params: 20,090,688
________________________________________________________________________________________________________________________________________________________________________________________________________
In [30]:
history = image_only_model.fit_generator(train_data_gen,
                      steps_per_epoch=np.ceil(len(train_data) / batch_size),
                      epochs=30,
                      validation_data = validation_data_gen,
                      validation_steps = 1,
                      verbose=1, callbacks=[tboard])
Epoch 1/30
5/5 [==============================] - 50s 10s/step - loss: 2.5140 - acc: 0.2000 - val_loss: 1.5906 - val_acc: 0.6071
Epoch 2/30
5/5 [==============================] - 46s 9s/step - loss: 1.3471 - acc: 0.5063 - val_loss: 1.0311 - val_acc: 0.5714
Epoch 3/30
5/5 [==============================] - 47s 9s/step - loss: 0.8266 - acc: 0.6694 - val_loss: 0.8242 - val_acc: 0.7143
Epoch 4/30
5/5 [==============================] - 47s 9s/step - loss: 0.5169 - acc: 0.8495 - val_loss: 0.7280 - val_acc: 0.7500
Epoch 5/30
5/5 [==============================] - 47s 9s/step - loss: 0.3619 - acc: 0.8775 - val_loss: 0.6781 - val_acc: 0.7500
Epoch 6/30
5/5 [==============================] - 47s 9s/step - loss: 0.2539 - acc: 0.9469 - val_loss: 0.6386 - val_acc: 0.8214
Epoch 7/30
5/5 [==============================] - 47s 9s/step - loss: 0.2649 - acc: 0.9181 - val_loss: 0.6266 - val_acc: 0.7857
Epoch 8/30
5/5 [==============================] - 47s 9s/step - loss: 0.1270 - acc: 0.9793 - val_loss: 0.6326 - val_acc: 0.7857
Epoch 9/30
5/5 [==============================] - 47s 9s/step - loss: 0.1418 - acc: 0.9306 - val_loss: 0.6553 - val_acc: 0.7500
Epoch 10/30
5/5 [==============================] - 47s 9s/step - loss: 0.1177 - acc: 0.9550 - val_loss: 0.6730 - val_acc: 0.7500
Epoch 11/30
5/5 [==============================] - 47s 9s/step - loss: 0.0944 - acc: 0.9838 - val_loss: 0.7067 - val_acc: 0.7143
Epoch 12/30
5/5 [==============================] - 47s 9s/step - loss: 0.0599 - acc: 0.9919 - val_loss: 0.7327 - val_acc: 0.7143
Epoch 13/30
5/5 [==============================] - 47s 9s/step - loss: 0.0429 - acc: 1.0000 - val_loss: 0.7472 - val_acc: 0.7143
Epoch 14/30
5/5 [==============================] - 47s 9s/step - loss: 0.0274 - acc: 1.0000 - val_loss: 0.7516 - val_acc: 0.7143
Epoch 15/30
5/5 [==============================] - 47s 9s/step - loss: 0.0384 - acc: 0.9919 - val_loss: 0.7452 - val_acc: 0.7500
Epoch 16/30
5/5 [==============================] - 47s 9s/step - loss: 0.0336 - acc: 1.0000 - val_loss: 0.7317 - val_acc: 0.7857
Epoch 17/30
5/5 [==============================] - 47s 9s/step - loss: 0.0465 - acc: 0.9919 - val_loss: 0.7301 - val_acc: 0.7143
Epoch 18/30
5/5 [==============================] - 47s 9s/step - loss: 0.0190 - acc: 1.0000 - val_loss: 0.7293 - val_acc: 0.7143
Epoch 19/30
5/5 [==============================] - 47s 9s/step - loss: 0.0369 - acc: 0.9919 - val_loss: 0.7326 - val_acc: 0.7143
Epoch 20/30
5/5 [==============================] - 47s 9s/step - loss: 0.0589 - acc: 0.9874 - val_loss: 0.7403 - val_acc: 0.7143
Epoch 21/30
5/5 [==============================] - 47s 9s/step - loss: 0.0273 - acc: 1.0000 - val_loss: 0.7521 - val_acc: 0.7143
Epoch 22/30
5/5 [==============================] - 47s 9s/step - loss: 0.0130 - acc: 1.0000 - val_loss: 0.7616 - val_acc: 0.7500
Epoch 23/30
5/5 [==============================] - 47s 9s/step - loss: 0.0324 - acc: 0.9838 - val_loss: 0.7763 - val_acc: 0.7143
Epoch 24/30
5/5 [==============================] - 47s 9s/step - loss: 0.0136 - acc: 1.0000 - val_loss: 0.8022 - val_acc: 0.6786
Epoch 25/30
5/5 [==============================] - 47s 9s/step - loss: 0.0148 - acc: 1.0000 - val_loss: 0.8119 - val_acc: 0.6786
Epoch 26/30
5/5 [==============================] - 47s 9s/step - loss: 0.0141 - acc: 1.0000 - val_loss: 0.8105 - val_acc: 0.6786
Epoch 27/30
5/5 [==============================] - 47s 9s/step - loss: 0.0126 - acc: 1.0000 - val_loss: 0.7799 - val_acc: 0.7143
Epoch 28/30
5/5 [==============================] - 47s 9s/step - loss: 0.0146 - acc: 1.0000 - val_loss: 0.7668 - val_acc: 0.7500
Epoch 29/30
5/5 [==============================] - 47s 9s/step - loss: 0.0168 - acc: 1.0000 - val_loss: 0.7632 - val_acc: 0.7500
Epoch 30/30
5/5 [==============================] - 47s 9s/step - loss: 0.0314 - acc: 1.0000 - val_loss: 0.7610 - val_acc: 0.7857
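
Before moving on to the text-only model, a per-class breakdown of the image-only results can be obtained by running the held-out set through prepare_test_generators and scoring the predictions with the classification_report imported earlier. This is a minimal sketch rather than a cell from the original run:

## sketch: per-class evaluation of the image-only model on the held-out data
test_gen = prepare_test_generators(test_data, len(unique_classes),
                                   chunk_size=len(test_data), channel_orientation="channels_last")
predictions = image_only_model.predict_generator(test_gen, steps=1)

predicted_labels = le.inverse_transform(np.argmax(predictions, axis=1))
print(classification_report(test_data['label'].values, predicted_labels))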

TEXT ONLY MODEL -

  1. In this model, we shall be relying only on the textual metadata for each of the images.
  2. We shall be using the embeddings from the pre-trained GloVe model.
  3. Each word in the GloVe model is represented by a vector of size 100, trained on a Wikipedia corpus. There are many other models available, trained on different text corpora. For more details, check out the GloVe project.
  4. Please note that for a text-only model, we shall be ignoring the image data.
In [20]:
def build_text_only_model(text_input_shape =(26,), image_input_shape = (256,256,3), pretrained_model = 'vgg19', classes=245):
    
    print('received pretrained model %s'%pretrained_model)
    vis_input = Input(shape = text_input_shape, name = "vis_input")
    # The image input is accepted only so that the same data generators (which
    # yield both text and image arrays) can be reused; it is never used in this
    # text only model.
    img_input = Input(shape=image_input_shape, name="img_input")
    
    # Pretrained GloVe embeddings (kept frozen during training).
    text_emb = Embedding(embedding_matrix.shape[0], embedding_matrix.shape[1], weights=[embedding_matrix], trainable=False)(vis_input)
    
    text_emb = Conv1D(128, 3, padding='same')(text_emb)
    text_emb = Activation('relu')(text_emb)
    text_emb = MaxPooling1D(2)(text_emb)
    
    text_emb = Conv1D(256, 3, padding='same')(text_emb)
    text_emb = Activation('relu')(text_emb)
    text_emb = MaxPooling1D(2)(text_emb)
    text_emb = Dropout(0.2)(text_emb)
    
    text_emb = Flatten()(text_emb)

    text_emb = Dense(512, kernel_regularizer=regularizers.l2(weight_decay))(text_emb)
    text_emb = Activation('relu')(text_emb)

    
    final_output = Dense(classes, activation='softmax')(text_emb)
    
    model = Model(inputs = [vis_input, img_input], outputs = final_output, name = "model")

    model.summary(line_length=200)
    # Adam (Keras defaults); categorical cross-entropy since the labels are one-hot encoded.
    model.compile(optimizer= 'adam',
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model
In [26]:
tboard = keras.callbacks.TensorBoard(log_dir='./logs', histogram_freq=0, batch_size=32, write_graph=True, write_grads=False,
                                       write_images=False, embeddings_freq=0, embeddings_layer_names=None, embeddings_metadata=None, embeddings_data=None)

validation_data_gen = prepare_training_generators(test_data, len(unique_classes), chunk_size = len(test_data) ,channel_orientation = "channels_last")
train_data_gen = prepare_training_generators(train_data, len(unique_classes), chunk_size = batch_size ,channel_orientation = "channels_last")
text_only_model = build_text_only_model(text_input_shape =(max_seq_len,), image_input_shape = (256, 256, 3) ,classes=len(unique_classes))
received pretrained model vgg19
________________________________________________________________________________________________________________________________________________________________________________________________________
Layer (type)                                                                              Output Shape                                                                    Param #                       
========================================================================================================================================================================================================
vis_input (InputLayer)                                                                    (None, 177)                                                                     0                             
________________________________________________________________________________________________________________________________________________________________________________________________________
embedding_3 (Embedding)                                                                   (None, 177, 100)                                                                184700                        
________________________________________________________________________________________________________________________________________________________________________________________________________
conv1d_5 (Conv1D)                                                                         (None, 177, 128)                                                                38528                         
________________________________________________________________________________________________________________________________________________________________________________________________________
activation_7 (Activation)                                                                 (None, 177, 128)                                                                0                             
________________________________________________________________________________________________________________________________________________________________________________________________________
max_pooling1d_5 (MaxPooling1D)                                                            (None, 88, 128)                                                                 0                             
________________________________________________________________________________________________________________________________________________________________________________________________________
conv1d_6 (Conv1D)                                                                         (None, 88, 256)                                                                 98560                         
________________________________________________________________________________________________________________________________________________________________________________________________________
activation_8 (Activation)                                                                 (None, 88, 256)                                                                 0                             
________________________________________________________________________________________________________________________________________________________________________________________________________
max_pooling1d_6 (MaxPooling1D)                                                            (None, 44, 256)                                                                 0                             
________________________________________________________________________________________________________________________________________________________________________________________________________
dropout_9 (Dropout)                                                                       (None, 44, 256)                                                                 0                             
________________________________________________________________________________________________________________________________________________________________________________________________________
flatten_5 (Flatten)                                                                       (None, 11264)                                                                   0                             
________________________________________________________________________________________________________________________________________________________________________________________________________
dense_10 (Dense)                                                                          (None, 512)                                                                     5767680                       
________________________________________________________________________________________________________________________________________________________________________________________________________
activation_9 (Activation)                                                                 (None, 512)                                                                     0                             
________________________________________________________________________________________________________________________________________________________________________________________________________
dense_11 (Dense)                                                                          (None, 5)                                                                       2565                          
========================================================================================================================================================================================================
Total params: 6,092,033
Trainable params: 5,907,333
Non-trainable params: 184,700
________________________________________________________________________________________________________________________________________________________________________________________________________
In [27]:
history3 = text_only_model.fit_generator(train_data_gen,
                    steps_per_epoch=np.ceil(len(train_data) / batch_size),
                    epochs=30,
                    validation_data = validation_data_gen,
                    validation_steps = 1,
                    verbose=1, callbacks=[tboard])
Epoch 1/30
5/5 [==============================] - 5s 1s/step - loss: 4.6729 - acc: 0.0568 - val_loss: 1.6772 - val_acc: 0.1786
Epoch 2/30
5/5 [==============================] - 5s 990ms/step - loss: 1.6183 - acc: 0.2531 - val_loss: 1.6669 - val_acc: 0.3214
Epoch 3/30
5/5 [==============================] - 5s 1s/step - loss: 1.6445 - acc: 0.5403 - val_loss: 1.6566 - val_acc: 0.2857
Epoch 4/30
5/5 [==============================] - 5s 1s/step - loss: 1.5847 - acc: 0.4982 - val_loss: 1.5738 - val_acc: 0.1786
Epoch 5/30
5/5 [==============================] - 5s 990ms/step - loss: 1.5017 - acc: 0.3668 - val_loss: 1.4814 - val_acc: 0.3214
Epoch 6/30
5/5 [==============================] - 5s 994ms/step - loss: 1.3332 - acc: 0.4458 - val_loss: 1.4417 - val_acc: 0.3214
Epoch 7/30
5/5 [==============================] - 5s 1000ms/step - loss: 1.1944 - acc: 0.6332 - val_loss: 1.4272 - val_acc: 0.4643
Epoch 8/30
5/5 [==============================] - 5s 989ms/step - loss: 1.0071 - acc: 0.7476 - val_loss: 1.3675 - val_acc: 0.4643
Epoch 9/30
5/5 [==============================] - 5s 993ms/step - loss: 0.7612 - acc: 0.8288 - val_loss: 1.2863 - val_acc: 0.5000
Epoch 10/30
5/5 [==============================] - 5s 996ms/step - loss: 0.5747 - acc: 0.8812 - val_loss: 1.1363 - val_acc: 0.6786
Epoch 11/30
5/5 [==============================] - 5s 992ms/step - loss: 0.3890 - acc: 0.9387 - val_loss: 1.0973 - val_acc: 0.6429
Epoch 12/30
5/5 [==============================] - 5s 995ms/step - loss: 0.2774 - acc: 0.9586 - val_loss: 1.0828 - val_acc: 0.6071
Epoch 13/30
5/5 [==============================] - 5s 995ms/step - loss: 0.1826 - acc: 0.9919 - val_loss: 1.1286 - val_acc: 0.6429
Epoch 14/30
5/5 [==============================] - 5s 988ms/step - loss: 0.1488 - acc: 0.9757 - val_loss: 1.0737 - val_acc: 0.6786
Epoch 15/30
5/5 [==============================] - 5s 992ms/step - loss: 0.1154 - acc: 0.9919 - val_loss: 1.1121 - val_acc: 0.6071
Epoch 16/30
5/5 [==============================] - 5s 993ms/step - loss: 0.1057 - acc: 0.9919 - val_loss: 1.2667 - val_acc: 0.6429
Epoch 17/30
5/5 [==============================] - 5s 995ms/step - loss: 0.1019 - acc: 0.9793 - val_loss: 1.1540 - val_acc: 0.6786
Epoch 18/30
5/5 [==============================] - 5s 996ms/step - loss: 0.0849 - acc: 1.0000 - val_loss: 1.4471 - val_acc: 0.7143
Epoch 19/30
5/5 [==============================] - 5s 993ms/step - loss: 0.0940 - acc: 0.9793 - val_loss: 1.4991 - val_acc: 0.6429
Epoch 20/30
5/5 [==============================] - 5s 996ms/step - loss: 0.0848 - acc: 0.9919 - val_loss: 1.4331 - val_acc: 0.6429
Epoch 21/30
5/5 [==============================] - 5s 995ms/step - loss: 0.0883 - acc: 0.9919 - val_loss: 1.1204 - val_acc: 0.7143
Epoch 22/30
5/5 [==============================] - 5s 991ms/step - loss: 0.0818 - acc: 0.9919 - val_loss: 1.2926 - val_acc: 0.7143
Epoch 23/30
5/5 [==============================] - 5s 988ms/step - loss: 0.1014 - acc: 0.9874 - val_loss: 1.3596 - val_acc: 0.6786
Epoch 24/30
5/5 [==============================] - 5s 991ms/step - loss: 0.0820 - acc: 0.9919 - val_loss: 1.1890 - val_acc: 0.6429
Epoch 25/30
5/5 [==============================] - 5s 989ms/step - loss: 0.0909 - acc: 0.9919 - val_loss: 1.3454 - val_acc: 0.6429
Epoch 26/30
5/5 [==============================] - 5s 997ms/step - loss: 0.0737 - acc: 1.0000 - val_loss: 1.3541 - val_acc: 0.6786
Epoch 27/30
5/5 [==============================] - 5s 997ms/step - loss: 0.0733 - acc: 0.9919 - val_loss: 1.2650 - val_acc: 0.7143
Epoch 28/30
5/5 [==============================] - 5s 1s/step - loss: 0.0772 - acc: 0.9838 - val_loss: 1.2754 - val_acc: 0.7143
Epoch 29/30
5/5 [==============================] - 5s 996ms/step - loss: 0.0784 - acc: 0.9919 - val_loss: 1.2389 - val_acc: 0.7143
Epoch 30/30
5/5 [==============================] - 5s 996ms/step - loss: 0.0793 - acc: 0.9919 - val_loss: 1.2980 - val_acc: 0.7143

IMAGE PLUS TEXT MODEL -

  1. In this model, we shall leverage the image as well as the textual data.
  2. The image plus text model is essentially a concatenation of the embeddings from the image model and the embeddings from the text model (a conceptual sketch follows this list).
  3. For a fair comparison, the architectures of the image and text branches are exactly the same as those trained in the previous cases.
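Before looking at the full function, here is a minimal sketch of the fusion idea on its own: the penultimate embedding of each branch is concatenated and fed to a softmax classifier. The tensor sizes (256 for the image branch, 512 for the text branch) mirror the model summary below, but the input names here are illustrative placeholders only.

from keras.layers import Input, Dense, concatenate
from keras.models import Model

img_emb_in = Input(shape=(256,), name="img_emb")    # output of the image branch
text_emb_in = Input(shape=(512,), name="text_emb")  # output of the text branch

fused = concatenate([img_emb_in, text_emb_in], axis=-1)   # shape (None, 768)
probs = Dense(5, activation='softmax')(fused)             # 5 product categories

fusion_sketch = Model(inputs=[img_emb_in, text_emb_in], outputs=probs)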
In [23]:
def build_image_text_model(text_input_shape =(26,), image_input_shape = (256,256,3), pretrained_model = 'vgg19', classes=245):
    
    print('received pretrained model %s'%pretrained_model)
    vis_input = Input(shape = text_input_shape, name = "vis_input")
    if pretrained_model == 'inception':
        pretrained_model = InceptionV3(
            include_top=False,
            input_shape=image_input_shape,
            weights='imagenet'
        )
    elif pretrained_model == 'xception':
        pretrained_model = Xception(
            include_top=False,
            input_shape=image_input_shape,
            weights='imagenet'
        )
    elif pretrained_model == 'resnet50':
        pretrained_model = ResNet50(
            include_top=False,
            input_shape=image_input_shape,
            weights='imagenet'
        )
    elif pretrained_model == 'vgg19':
        pretrained_model = VGG19(
            include_top=False,
            input_shape=image_input_shape,
            weights='imagenet'
        )
    elif pretrained_model == 'all':
        input = Input(shape=image_input_shape)
        inception_model = InceptionV3(
            include_top=False,
            input_tensor=input,
            weights='imagenet'
        )
        xception_model = Xception(
            include_top=False,
            input_tensor=input,
            weights='imagenet'
        )
        resnet_model = ResNet50(
            include_top=False,
            input_tensor=input,
            weights='imagenet'
        )
        flattened_outputs = [Flatten()(inception_model.output),
                             Flatten()(xception_model.output),
                             Flatten()(resnet_model.output)]
        output = Concatenate()(flattened_outputs)
        pretrained_model = Model(input, output)

    # We can select inception, xception, resnet50, vgg19, or a combination of the
    # first three as the basis for the image branch. We specify include_top=False
    # in these models in order to remove the top level classification layers:
    # the layers used to classify images into the categories of the ImageNet
    # competition. Since our categories are different, we remove these top layers
    # and replace them with our own.

    # build_image_text_model continued...
    if pretrained_model.output.shape.ndims > 2:
        output = Flatten()(pretrained_model.output)
    else:
        output = pretrained_model.output

    output = BatchNormalization()(output)
    output = Dropout(0.2)(output)
    output = Dense(128, activation='relu')(output)
    
    output = BatchNormalization()(output)
    output = Dropout(0.2)(output)
    output = Dense(256, activation='relu')(output)
    
    output = BatchNormalization()(output)
    output = Dropout(0.2)(output)
    
    text_emb = Embedding(embedding_matrix.shape[0], embedding_matrix.shape[1], weights=[embedding_matrix], trainable=False)(vis_input)
    
    text_emb = Conv1D(128, 3, padding='same')(text_emb)
    text_emb = Activation('relu')(text_emb)
    text_emb = MaxPooling1D(2)(text_emb)
    
    text_emb = Conv1D(256, 3, padding='same')(text_emb)
    text_emb = Activation('relu')(text_emb)
    text_emb = MaxPooling1D(2)(text_emb)
    text_emb = Dropout(0.2)(text_emb)
    
    text_emb = Flatten()(text_emb)

    text_emb = Dense(512, kernel_regularizer=regularizers.l2(weight_decay))(text_emb)
    text_emb = Activation('relu')(text_emb)
    
    img_plus_text_emb = concatenate([output,text_emb],axis=-1)
    final_output = Dense(classes, activation='softmax')(img_plus_text_emb)
    
    model = Model(inputs = [vis_input, pretrained_model.input], outputs = final_output, name = "model")
    for layer in pretrained_model.layers:
        layer.trainable = False
    model.summary(line_length=200)

    # RMSprop with Keras default settings for the combined model.
    model.compile(optimizer= 'rmsprop',
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model
In [24]:
tboard = keras.callbacks.TensorBoard(log_dir='./logs', histogram_freq=0, batch_size=32, write_graph=True, write_grads=False,
                                       write_images=False, embeddings_freq=0, embeddings_layer_names=None, embeddings_metadata=None, embeddings_data=None)

validation_data_gen = prepare_training_generators(test_data, len(unique_classes), chunk_size = len(test_data) ,channel_orientation = "channels_last")
train_data_gen = prepare_training_generators(train_data, len(unique_classes), channel_orientation = "channels_last")

image_text_model = build_image_text_model(text_input_shape =(max_seq_len,), image_input_shape = (256, 256, 3) ,classes=len(unique_classes))
received pretrained model vgg19
________________________________________________________________________________________________________________________________________________________________________________________________________
Layer (type)                                                      Output Shape                                Param #                 Connected to                                                      
========================================================================================================================================================================================================
input_2 (InputLayer)                                              (None, 256, 256, 3)                         0                                                                                         
________________________________________________________________________________________________________________________________________________________________________________________________________
block1_conv1 (Conv2D)                                             (None, 256, 256, 64)                        1792                    input_2[0][0]                                                     
________________________________________________________________________________________________________________________________________________________________________________________________________
block1_conv2 (Conv2D)                                             (None, 256, 256, 64)                        36928                   block1_conv1[0][0]                                                
________________________________________________________________________________________________________________________________________________________________________________________________________
block1_pool (MaxPooling2D)                                        (None, 128, 128, 64)                        0                       block1_conv2[0][0]                                                
________________________________________________________________________________________________________________________________________________________________________________________________________
block2_conv1 (Conv2D)                                             (None, 128, 128, 128)                       73856                   block1_pool[0][0]                                                 
________________________________________________________________________________________________________________________________________________________________________________________________________
block2_conv2 (Conv2D)                                             (None, 128, 128, 128)                       147584                  block2_conv1[0][0]                                                
________________________________________________________________________________________________________________________________________________________________________________________________________
block2_pool (MaxPooling2D)                                        (None, 64, 64, 128)                         0                       block2_conv2[0][0]                                                
________________________________________________________________________________________________________________________________________________________________________________________________________
block3_conv1 (Conv2D)                                             (None, 64, 64, 256)                         295168                  block2_pool[0][0]                                                 
________________________________________________________________________________________________________________________________________________________________________________________________________
block3_conv2 (Conv2D)                                             (None, 64, 64, 256)                         590080                  block3_conv1[0][0]                                                
________________________________________________________________________________________________________________________________________________________________________________________________________
block3_conv3 (Conv2D)                                             (None, 64, 64, 256)                         590080                  block3_conv2[0][0]                                                
________________________________________________________________________________________________________________________________________________________________________________________________________
block3_conv4 (Conv2D)                                             (None, 64, 64, 256)                         590080                  block3_conv3[0][0]                                                
________________________________________________________________________________________________________________________________________________________________________________________________________
block3_pool (MaxPooling2D)                                        (None, 32, 32, 256)                         0                       block3_conv4[0][0]                                                
________________________________________________________________________________________________________________________________________________________________________________________________________
block4_conv1 (Conv2D)                                             (None, 32, 32, 512)                         1180160                 block3_pool[0][0]                                                 
________________________________________________________________________________________________________________________________________________________________________________________________________
block4_conv2 (Conv2D)                                             (None, 32, 32, 512)                         2359808                 block4_conv1[0][0]                                                
________________________________________________________________________________________________________________________________________________________________________________________________________
block4_conv3 (Conv2D)                                             (None, 32, 32, 512)                         2359808                 block4_conv2[0][0]                                                
________________________________________________________________________________________________________________________________________________________________________________________________________
block4_conv4 (Conv2D)                                             (None, 32, 32, 512)                         2359808                 block4_conv3[0][0]                                                
________________________________________________________________________________________________________________________________________________________________________________________________________
block4_pool (MaxPooling2D)                                        (None, 16, 16, 512)                         0                       block4_conv4[0][0]                                                
________________________________________________________________________________________________________________________________________________________________________________________________________
block5_conv1 (Conv2D)                                             (None, 16, 16, 512)                         2359808                 block4_pool[0][0]                                                 
________________________________________________________________________________________________________________________________________________________________________________________________________
block5_conv2 (Conv2D)                                             (None, 16, 16, 512)                         2359808                 block5_conv1[0][0]                                                
________________________________________________________________________________________________________________________________________________________________________________________________________
block5_conv3 (Conv2D)                                             (None, 16, 16, 512)                         2359808                 block5_conv2[0][0]                                                
________________________________________________________________________________________________________________________________________________________________________________________________________
vis_input (InputLayer)                                            (None, 177)                                 0                                                                                         
________________________________________________________________________________________________________________________________________________________________________________________________________
block5_conv4 (Conv2D)                                             (None, 16, 16, 512)                         2359808                 block5_conv3[0][0]                                                
________________________________________________________________________________________________________________________________________________________________________________________________________
embedding_2 (Embedding)                                           (None, 177, 100)                            184700                  vis_input[0][0]                                                   
________________________________________________________________________________________________________________________________________________________________________________________________________
block5_pool (MaxPooling2D)                                        (None, 8, 8, 512)                           0                       block5_conv4[0][0]                                                
________________________________________________________________________________________________________________________________________________________________________________________________________
conv1d_3 (Conv1D)                                                 (None, 177, 128)                            38528                   embedding_2[0][0]                                                 
________________________________________________________________________________________________________________________________________________________________________________________________________
flatten_3 (Flatten)                                               (None, 32768)                               0                       block5_pool[0][0]                                                 
________________________________________________________________________________________________________________________________________________________________________________________________________
activation_4 (Activation)                                         (None, 177, 128)                            0                       conv1d_3[0][0]                                                    
________________________________________________________________________________________________________________________________________________________________________________________________________
batch_normalization_4 (BatchNormalization)                        (None, 32768)                               131072                  flatten_3[0][0]                                                   
________________________________________________________________________________________________________________________________________________________________________________________________________
max_pooling1d_3 (MaxPooling1D)                                    (None, 88, 128)                             0                       activation_4[0][0]                                                
________________________________________________________________________________________________________________________________________________________________________________________________________
dropout_5 (Dropout)                                               (None, 32768)                               0                       batch_normalization_4[0][0]                                       
________________________________________________________________________________________________________________________________________________________________________________________________________
conv1d_4 (Conv1D)                                                 (None, 88, 256)                             98560                   max_pooling1d_3[0][0]                                             
________________________________________________________________________________________________________________________________________________________________________________________________________
dense_6 (Dense)                                                   (None, 128)                                 4194432                 dropout_5[0][0]                                                   
________________________________________________________________________________________________________________________________________________________________________________________________________
activation_5 (Activation)                                         (None, 88, 256)                             0                       conv1d_4[0][0]                                                    
________________________________________________________________________________________________________________________________________________________________________________________________________
batch_normalization_5 (BatchNormalization)                        (None, 128)                                 512                     dense_6[0][0]                                                     
________________________________________________________________________________________________________________________________________________________________________________________________________
max_pooling1d_4 (MaxPooling1D)                                    (None, 44, 256)                             0                       activation_5[0][0]                                                
________________________________________________________________________________________________________________________________________________________________________________________________________
dropout_6 (Dropout)                                               (None, 128)                                 0                       batch_normalization_5[0][0]                                       
________________________________________________________________________________________________________________________________________________________________________________________________________
dropout_8 (Dropout)                                               (None, 44, 256)                             0                       max_pooling1d_4[0][0]                                             
________________________________________________________________________________________________________________________________________________________________________________________________________
dense_7 (Dense)                                                   (None, 256)                                 33024                   dropout_6[0][0]                                                   
________________________________________________________________________________________________________________________________________________________________________________________________________
flatten_4 (Flatten)                                               (None, 11264)                               0                       dropout_8[0][0]                                                   
________________________________________________________________________________________________________________________________________________________________________________________________________
batch_normalization_6 (BatchNormalization)                        (None, 256)                                 1024                    dense_7[0][0]                                                     
________________________________________________________________________________________________________________________________________________________________________________________________________
dense_8 (Dense)                                                   (None, 512)                                 5767680                 flatten_4[0][0]                                                   
________________________________________________________________________________________________________________________________________________________________________________________________________
dropout_7 (Dropout)                                               (None, 256)                                 0                       batch_normalization_6[0][0]                                       
________________________________________________________________________________________________________________________________________________________________________________________________________
activation_6 (Activation)                                         (None, 512)                                 0                       dense_8[0][0]                                                     
________________________________________________________________________________________________________________________________________________________________________________________________________
concatenate_1 (Concatenate)                                       (None, 768)                                 0                       dropout_7[0][0]                                                   
                                                                                                                                      activation_6[0][0]                                                
________________________________________________________________________________________________________________________________________________________________________________________________________
dense_9 (Dense)                                                   (None, 5)                                   3845                    concatenate_1[0][0]                                               
========================================================================================================================================================================================================
Total params: 30,477,761
Trainable params: 10,202,373
Non-trainable params: 20,275,388
________________________________________________________________________________________________________________________________________________________________________________________________________
In [25]:
history2 = image_text_model.fit_generator(train_data_gen,
                      steps_per_epoch=np.ceil(len(train_data) / batch_size),
                      epochs=30,
                      validation_data = validation_data_gen,
                      validation_steps = 1,
                      verbose=1, callbacks=[tboard])
Epoch 1/30
5/5 [==============================] - 20s 4s/step - loss: 0.5665 - acc: 0.8000 - val_loss: 11.2703 - val_acc: 0.2857
Epoch 2/30
5/5 [==============================] - 18s 4s/step - loss: 4.5975 - acc: 0.4400 - val_loss: 2.4112 - val_acc: 0.2500
Epoch 3/30
5/5 [==============================] - 17s 3s/step - loss: 1.7102 - acc: 0.5200 - val_loss: 2.2272 - val_acc: 0.3571
Epoch 4/30
5/5 [==============================] - 17s 3s/step - loss: 1.9602 - acc: 0.2800 - val_loss: 1.8663 - val_acc: 0.3571
Epoch 5/30
5/5 [==============================] - 18s 4s/step - loss: 1.7058 - acc: 0.3200 - val_loss: 1.8315 - val_acc: 0.2857
Epoch 6/30
5/5 [==============================] - 18s 4s/step - loss: 1.6524 - acc: 0.2800 - val_loss: 1.4536 - val_acc: 0.4286
Epoch 7/30
5/5 [==============================] - 18s 4s/step - loss: 1.9054 - acc: 0.2000 - val_loss: 1.2263 - val_acc: 0.6071
Epoch 8/30
5/5 [==============================] - 18s 4s/step - loss: 1.4382 - acc: 0.3600 - val_loss: 1.2956 - val_acc: 0.3929
Epoch 9/30
5/5 [==============================] - 18s 4s/step - loss: 1.5843 - acc: 0.2800 - val_loss: 0.9134 - val_acc: 0.6786
Epoch 10/30
5/5 [==============================] - 18s 4s/step - loss: 0.8839 - acc: 0.7600 - val_loss: 1.0124 - val_acc: 0.5714
Epoch 11/30
5/5 [==============================] - 18s 4s/step - loss: 0.6852 - acc: 0.6800 - val_loss: 0.7815 - val_acc: 0.7143
Epoch 12/30
5/5 [==============================] - 18s 4s/step - loss: 0.8132 - acc: 0.6400 - val_loss: 0.6712 - val_acc: 0.7857
Epoch 13/30
5/5 [==============================] - 18s 4s/step - loss: 0.4837 - acc: 0.8400 - val_loss: 0.8353 - val_acc: 0.6786
Epoch 14/30
5/5 [==============================] - 18s 4s/step - loss: 0.3001 - acc: 0.9600 - val_loss: 0.5340 - val_acc: 0.8214
Epoch 15/30
5/5 [==============================] - 18s 4s/step - loss: 0.1669 - acc: 1.0000 - val_loss: 1.0969 - val_acc: 0.7143
Epoch 16/30
5/5 [==============================] - 18s 4s/step - loss: 0.1598 - acc: 0.9600 - val_loss: 0.5919 - val_acc: 0.7500
Epoch 17/30
5/5 [==============================] - 18s 4s/step - loss: 0.2748 - acc: 0.8800 - val_loss: 3.1193 - val_acc: 0.5357
Epoch 18/30
5/5 [==============================] - 18s 4s/step - loss: 0.4611 - acc: 0.9200 - val_loss: 0.4943 - val_acc: 0.8571
Epoch 19/30
5/5 [==============================] - 18s 4s/step - loss: 0.1486 - acc: 0.9200 - val_loss: 0.4299 - val_acc: 0.8571
Epoch 20/30
5/5 [==============================] - 18s 4s/step - loss: 0.0507 - acc: 1.0000 - val_loss: 0.3753 - val_acc: 0.8929
Epoch 21/30
5/5 [==============================] - 18s 4s/step - loss: 0.0773 - acc: 1.0000 - val_loss: 0.3547 - val_acc: 0.9286
Epoch 22/30
5/5 [==============================] - 18s 4s/step - loss: 0.0479 - acc: 1.0000 - val_loss: 0.3802 - val_acc: 0.8571
Epoch 23/30
5/5 [==============================] - 18s 4s/step - loss: 0.0498 - acc: 1.0000 - val_loss: 0.3061 - val_acc: 0.8929
Epoch 24/30
5/5 [==============================] - 18s 4s/step - loss: 0.0738 - acc: 1.0000 - val_loss: 0.3495 - val_acc: 0.8214
Epoch 25/30
5/5 [==============================] - 18s 4s/step - loss: 0.0257 - acc: 1.0000 - val_loss: 0.3435 - val_acc: 0.8571
Epoch 26/30
5/5 [==============================] - 18s 4s/step - loss: 0.0267 - acc: 1.0000 - val_loss: 0.4079 - val_acc: 0.8214
Epoch 27/30
5/5 [==============================] - 18s 4s/step - loss: 0.0263 - acc: 1.0000 - val_loss: 0.2743 - val_acc: 0.9643
Epoch 28/30
5/5 [==============================] - 18s 4s/step - loss: 1.2119 - acc: 0.8800 - val_loss: 0.2463 - val_acc: 0.9286
Epoch 29/30
5/5 [==============================] - 18s 4s/step - loss: 0.0330 - acc: 1.0000 - val_loss: 0.2604 - val_acc: 0.8929
Epoch 30/30
5/5 [==============================] - 18s 4s/step - loss: 0.0269 - acc: 1.0000 - val_loss: 0.3244 - val_acc: 0.8571

Evaluation and Predictions -

In this section, let's evaluate all the trained models and summarise the results.

  1. Start off by initializing generators for evaluation and testing. Since we have very little data to work with, we shall use the same data (test_data) for evaluation as well as testing. Ideally, the entire dataset should be split into 3 parts, viz. train, validation and test (a hedged sketch of such a split follows this list). It is important to note that the test data should be used only for computing the true estimate of accuracy.
  2. Next, we shall leverage the evaluate_generator and predict_generator functions for summarising the results.
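For reference, a three-way split could look like the sketch below. This is not what the notebook does (the dataset is too small); full_data is a hypothetical dataframe holding all samples with a 'label' column, and stratification keeps every class represented in each partition.

from sklearn.model_selection import train_test_split

# 70% train, 15% validation, 15% test, stratified by class label.
train_df, temp_df = train_test_split(full_data, test_size=0.30,
                                     stratify=full_data['label'], random_state=42)
val_df, test_df = train_test_split(temp_df, test_size=0.50,
                                   stratify=temp_df['label'], random_state=42)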
In [44]:
validation_data_gen = prepare_training_generators(test_data, len(unique_classes), chunk_size = len(test_data) , channel_orientation = "channels_last")
test_data_gen = prepare_test_generators(test_data, len(unique_classes), chunk_size = len(test_data) , channel_orientation = "channels_last")
In [45]:
image_only_score = image_only_model.evaluate_generator(validation_data_gen, 1, use_multiprocessing=False)
print("image only model -  Loss: ", image_only_score[0], "Accuracy: ", image_only_score[1])

text_only_score = text_only_model.evaluate_generator(validation_data_gen, 1, use_multiprocessing=False)
print("Text only model Loss: ", text_only_score[0], "Accuracy: ", text_only_score[1])

image_text_score = image_text_model.evaluate_generator(validation_data_gen, 1, use_multiprocessing=False)
print("Image Text model Loss: ", image_text_score[0], "Accuracy: ", image_text_score[1])
image only model -  Loss:  0.7610096335411072 Accuracy:  0.7857142686843872
Text only model Loss:  1.2980186939239502 Accuracy:  0.7142857313156128
Image Text model Loss:  0.3243611454963684 Accuracy:  0.8571428656578064

Classification Report -

In [46]:
y_pred_text_raw = text_only_model.predict_generator(test_data_gen, steps = 1)
y_pred_image_text_raw = image_text_model.predict_generator(test_data_gen, steps = 1)

y_pred_text = np.argmax(y_pred_text_raw, axis=-1)
y_pred_image_text = np.argmax(y_pred_image_text_raw, axis=-1)

y_test_labels = le.transform(test_data['label'])

y_pred_text_readable = le.inverse_transform(y_pred_text)
y_pred_image_text_readable = le.inverse_transform(y_pred_image_text)
y_true_readable = le.inverse_transform(y_test_labels)
In [47]:
print("classification report for text only model is - ")
print(classification_report(y_true_readable, y_pred_text_readable))

print("classification report for image plus text model is -")
print(classification_report(y_true_readable, y_pred_image_text_readable))
classification report for text only model is - 
             precision    recall  f1-score   support

     apples       0.64      0.88      0.74         8
     iphone       0.00      0.00      0.00         3
    peaches       0.71      1.00      0.83         5
      pears       0.80      0.67      0.73         6
      plums       0.80      0.67      0.73         6

avg / total       0.65      0.71      0.67        28

classification report for image plus text model is -
             precision    recall  f1-score   support

     apples       1.00      1.00      1.00         8
     iphone       1.00      1.00      1.00         3
    peaches       1.00      0.60      0.75         5
      pears       1.00      0.67      0.80         6
      plums       0.60      1.00      0.75         6

avg / total       0.91      0.86      0.86        28

HURRAY !!!

It can be clearly seen that the pure text based model is the poorest performer, whereas the image plus text model is the best performing model.

Let's summarise some of the findings -

  1. A pure text based model gives decent performance and trains faster than the other models, especially considering that an anomaly was intentionally induced by using the same metadata for apple (the fruit) and Apple (the iPhone).

  2. In my opinion, even a pure text based model could do much better with more training data. Having said that, that is the whole point of this exercise - to check if we can make training more effective by leveraging all the information at our disposal. Such a model is typically beneficial when only limited training data is available.

  3. It was always going to be an uphill task for a pure image based model, which was evident from the visualization exercise done at the start of this article. The fruits selected for training the model are visually very similar.
  4. I was expecting the image plus text model to do a little better than 85%, but nevertheless it is a sizeable gain in performance.
  5. If you closely observe the epoch accuracies, it can be seen that model training starts off slow, with training accuracies in the range of 14% - 30% for the first 5-8 epochs. After that, the model takes a huge leap in training accuracy.
  6. It can be seen that all the pitfalls of a pure text based model are overcome by the image plus text model, which has successfully learnt to rely on pixel data when it comes to apples. I feel that's awesome !! (A quick confusion-matrix check of this per-class behaviour is sketched below.)
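One quick way to double-check the per-class behaviour described in point 6 is to print confusion matrices for the two models from the prediction arrays computed in the classification report cell. This snippet was not run as part of the notebook and is offered only as a sketch.

from sklearn.metrics import confusion_matrix

labels = sorted(set(y_true_readable))

print("text only model confusion matrix:")
print(pd.DataFrame(confusion_matrix(y_true_readable, y_pred_text_readable, labels=labels),
                   index=labels, columns=labels))

print("image plus text model confusion matrix:")
print(pd.DataFrame(confusion_matrix(y_true_readable, y_pred_image_text_readable, labels=labels),
                   index=labels, columns=labels))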