

Overview¶
This notebook is a basic run-through of a Kaggle competition in which contestants are tasked with predicting the breed of the dog in an image.
Preprocessing¶
The provided data includes 10,222 images of 120 different breeds. First, I index and one-hot encode the labels (breeds). Next, the images need to be resized with OpenCV to equal dimensions; I use 256 x 256 because of computational constraints. Oftentimes I would augment the images with keras’ ImageDataGenerator to artificially increase the sample size (a sketch of what that looks like follows the code below), but 10k+ images is plenty to work with. I split the data into 80% for training and 20% for validation. The resizing process takes a couple of minutes, so I pickle the resulting arrays to avoid repeating it if I have to rerun something later on.
import numpy as np
import pandas as pd
import cv2
from keras.utils import to_categorical
from sklearn.model_selection import train_test_split
import pickle
labelsDF = pd.read_csv(r"C:\Users\James\PycharmProjects\DogBreeds\labels.csv")
ids = labelsDF['id'].tolist()
labels = labelsDF['breed'].tolist()
print(labelsDF.head(10))
# Build a sorted index of the unique breed names
breedIndex = []
for breed in labels:
    if breed not in breedIndex:
        breedIndex.append(breed)
breedIndex = sorted(breedIndex)
# Encode each label as its position in the breed index
encodedLabels = []
for breed in labels:
    encodedLabels.append(breedIndex.index(breed))
print(len(encodedLabels))
print(encodedLabels[:10])
labels = np.array(encodedLabels)
print(labels.shape)
labels = to_categorical(labels)
print(labels.shape)
# Load each training image and resize it to 256 x 256
features = []
for file in ids:
    img = cv2.imread(r"C:\Users\James\PycharmProjects\DogBreeds\train/"+file+".jpg",1)
    img = cv2.resize(img,(256,256))
    features.append(img)
features = np.array(features)
print(features.shape)
featureTrain, featureTest, labelTrain, labelTest = train_test_split(
    features, labels, test_size=0.20)
print(featureTrain.shape, featureTest.shape, labelTrain.shape, labelTest.shape)
# saveFeatureTrain = open("featureTrain.pickle","wb")
# pickle.dump(featureTrain, saveFeatureTrain)
# saveFeatureTrain.close()
# saveLabelTrain = open("labelTrain.pickle","wb")
# pickle.dump(labelTrain, saveLabelTrain)
# saveLabelTrain.close()
# saveFeatureTest = open("featureTest.pickle","wb")
# pickle.dump(featureTest, saveFeatureTest)
# saveFeatureTest.close()
# saveLabelTest = open("labelTest.pickle","wb")
# pickle.dump(labelTest, saveLabelTest)
# saveLabelTest.close()
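For reference, here is a minimal sketch of the augmentation step I skipped, using keras’ ImageDataGenerator; the transform parameters are illustrative assumptions, not tuned values.
from keras.preprocessing.image import ImageDataGenerator

# Illustrative augmentation settings -- small rotations, shifts, and flips
# are usually safe for dog photos; the exact values here are assumptions.
augmenter = ImageDataGenerator(rotation_range=15,
                               width_shift_range=0.1,
                               height_shift_range=0.1,
                               horizontal_flip=True)

# flow() yields batches of randomly transformed copies of the training images,
# which the model can consume in place of the raw arrays.
# augmentedBatches = augmenter.flow(featureTrain, labelTrain, batch_size=32)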
Building & Training the Model¶
I use the pretrained Inception V3 model with ImageNet weights and modify the output layers to match the data dimensions. I use the Adam optimizer and the categorical_crossentropy cost function to match the one-hot labels. The model produces a likelihood for each of the 120 breeds. The cell is commented out because it takes an hour or so to run and I’ve already trained the model; one way to skip retraining automatically is sketched after the cell.
# from keras.models import Sequential, Model
# from keras import applications
# from keras.layers import Flatten, Dense, Dropout
# from keras.callbacks import ModelCheckpoint, EarlyStopping
# import pickle
# featureTrain = pickle.load(open("featureTrain.pickle","rb"))
# labelTrain = pickle.load(open("labelTrain.pickle","rb"))
# featureTest = pickle.load(open("featureTest.pickle","rb"))
# labelTest = pickle.load(open("labelTest.pickle","rb"))
# # Inception V3 base without its classification head
# baseModel = applications.InceptionV3(include_top=False,
#                                      input_shape=(256,256,3),
#                                      weights='imagenet')
# # New head: flatten, one dense layer, dropout, 120-way softmax
# addLayer = Sequential()
# addLayer.add(Flatten(input_shape=baseModel.output_shape[1:]))
# addLayer.add(Dense(256, activation='relu'))
# addLayer.add(Dropout(0.2))
# addLayer.add(Dense(120, activation='softmax'))
# model = Model(inputs=baseModel.input, outputs=addLayer(baseModel.output))
# model.compile(loss='categorical_crossentropy',
#               optimizer='adam',
#               metrics=['accuracy'])
# model.summary()
# model.fit(featureTrain, labelTrain,
#           batch_size=32, epochs=20,
#           validation_data=(featureTest, labelTest),
#           callbacks=[ModelCheckpoint('modelInception.model', monitor='val_acc', save_best_only=True),
#                      EarlyStopping(patience=5)])
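Rather than commenting the whole cell out, one way to skip retraining is to guard the fit call with a file check. A minimal sketch, assuming the model, data, and callbacks from the cell above are in scope:
import os
from keras.models import load_model

modelPath = 'modelInception.model'  # same path the ModelCheckpoint above writes to

if os.path.exists(modelPath):
    # A checkpoint already exists, so load it instead of retraining.
    model = load_model(modelPath)
else:
    model.fit(featureTrain, labelTrain,
              batch_size=32, epochs=20,
              validation_data=(featureTest, labelTest),
              callbacks=[ModelCheckpoint(modelPath, monitor='val_acc', save_best_only=True),
                         EarlyStopping(patience=5)])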
Making Predictions¶
Making predictions involves applying the same preprocessing procedure to the test data, which includes 10,357 images. Then I load the model and pass the test data through it to generate predictions.
import os
import cv2
from keras.models import load_model
import numpy as np
import pickle
path = os.listdir(r"C:\Users\James\PycharmProjects\DogBreeds\test/")
featuresTest = []
ids = []
for file in path:
    id = file[:-4]  # strip the ".jpg" extension to recover the image id
    ids.append(id)
    img = cv2.imread(r"C:\Users\James\PycharmProjects\DogBreeds\test/"+file,1)
    img = cv2.resize(img,(256,256))
    featuresTest.append(img)
featuresTest = np.array(featuresTest)
print(len(ids))
print(featuresTest.shape)
model = load_model(r"C:\Users\James\PycharmProjects\DogBreeds\modelInception.h5")
# predictions = model.predict(featuresTest)
# print(predictions.shape)
# print(predictions[0])
# savePredictions = open("modelPredictions.pickle","wb")
# pickle.dump(predictions, savePredictions)
# savePredictions.close()
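The submission section at the end loads the test ids from idsTest.pickle, so they need to be saved here alongside the predictions. A sketch following the same (commented-out, run-once) pickle convention as above:
# saveIds = open("idsTest.pickle","wb")
# pickle.dump(ids, saveIds)
# saveIds.close()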
Testing on My Dog¶
Here I’ll pass an image of my dog, who is a mix of collie, boxer, and other stuff, through the model and see what it predicts.
import cv2
import numpy as np
from keras.models import load_model
import pandas as pd
featuresTest = []
img = cv2.imread(r"C:\Users\James\Desktop\Finn.jpg")
img = cv2.resize(img,(256,256))
featuresTest.append(img)
featuresTest = np.array(featuresTest)
labelsDF = pd.read_csv(r"C:\Users\James\PycharmProjects\DogBreeds\labels.csv")
labels = labelsDF['breed'].tolist()
breedIndex = []
for breed in labels:
    if breed not in breedIndex:
        breedIndex.append(breed)
breedIndex = sorted(breedIndex)
model = load_model(r"C:\Users\James\PycharmProjects\DogBreeds\modelInception.h5")
predictions = model.predict(featuresTest)
bestPrediction = np.argmax(predictions)
prediction = breedIndex[bestPrediction]
print(prediction)
cv2.imshow(str(prediction),img)
cv2.waitKey(0)
cv2.destroyAllWindows()

The model thinks my dog is a Saint Bernard, which makes sense given his color scheme, so it’s kinda close.
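Since the model outputs a probability for all 120 breeds, the runner-up guesses are worth a look for a mixed-breed dog. A quick sketch using the predictions and breedIndex from the cell above:
# Print the five most probable breeds rather than just the argmax
top5 = np.argsort(predictions[0])[::-1][:5]
for i in top5:
    print(breedIndex[i], round(float(predictions[0][i]), 4))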
Submitting Predictions¶
You’ll want to submit the predictions to the Kaggle page to be scored. I think this model finished around the top 50% of the rankings, which is usually about where I land before ensembling and fine-tuning.
import pickle
import pandas as pd
import numpy as np
ids = pickle.load(open(r"C:\Users\James\PycharmProjects\DogBreeds\idsTest.pickle","rb"))
print(ids[:10])
predictions = pickle.load(open(r"C:\Users\James\PycharmProjects\DogBreeds\modelPredictions.pickle","rb"))
print(len(ids))
print(predictions.shape)
labelsDF = pd.read_csv(r"C:\Users\James\PycharmProjects\DogBreeds\labels.csv")
labels = labelsDF['breed'].tolist()
breedIndex = []
for breed in labels:
    if breed not in breedIndex:
        breedIndex.append(breed)
breedIndex = sorted(breedIndex)
# Build the submission frame: an id column plus one probability column per breed
predDF = pd.DataFrame(predictions.tolist(), columns=breedIndex)
predDF['id'] = ids
predDF = predDF[['id'] + breedIndex]
predDF.to_csv(r"C:\Users\James\PycharmProjects\DogBreeds\submissions2.csv",float_format='%.6f',index=False)
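Before uploading, it’s worth sanity-checking the file against the expected format. A quick check, assuming the test set has 10,357 images (so 10,357 rows and 121 columns), verifying each row’s probabilities sum to roughly 1:
# Reload the submission and check its shape and row sums
submission = pd.read_csv(r"C:\Users\James\PycharmProjects\DogBreeds\submissions2.csv")
print(submission.shape)  # expect (10357, 121): id plus 120 breed columns
rowSums = submission[breedIndex].sum(axis=1)
print(rowSums.min(), rowSums.max())  # softmax rows should each sum to ~1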
Improving Results¶
To improve results, I’d fine-tune hyperparameters, increase the image size and train on EC2, and ensemble with other NNs by averaging their outputs, then build a stacking ensemble by running the averaged results through a Random Forest.
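A minimal sketch of that averaging-plus-stacking idea, assuming a list of already-trained keras models (the model names in the usage line are hypothetical) and using scikit-learn’s RandomForestClassifier:
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def stackedEnsemble(models, features, oneHotLabels):
    # Average the softmax outputs of several trained networks
    averaged = np.mean([m.predict(features) for m in models], axis=0)
    # Stacking: fit a Random Forest on the averaged probabilities
    stacker = RandomForestClassifier(n_estimators=200)
    stacker.fit(averaged, np.argmax(oneHotLabels, axis=1))
    return stacker

# Usage sketch with hypothetical trained models:
# stacker = stackedEnsemble([modelInception, modelXception], featureTest, labelTest)
To avoid leakage, the Random Forest should be fit on predictions for data the base networks didn’t train on.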