If you are curious about image analysis and classification, as well as machine learning, keep reading. I will use a case I studied myself to introduce you to these topics.

In the first place, I was looking for a way to analyze the layout of resumes. I thought about building an "Accept" or "Reject" machine learning tool based on resume layout: a sloppy layout often goes with a not very serious candidate. Roughly speaking, my problem was a binary classification problem based on details extracted from a file.

To familiarize myself with image recognition, I decided to try comparing images of cars (Fiat 500) to images of giraffes. Don’t ask me why I chose this example, I’m still wondering!
However, the purpose of this article is simply to present my approach to this problem and to give you the key tools to carry out your own image classification.

The goal here is to create a machine learning classifier that could spot the difference between:

Fiat 500 vs Giraffes

Before starting, it is important to know that I worked with a Python 3.x environment for this first part.

1st attempt:

I downloaded, into separate folders, 20 images each of giraffes and Fiat 500s for training, and 16 for testing.

train
| +—Car
| \—Giraffe
test
| +—Car
| \—Giraffe

Then, I used the image_to_matrix function to resize all those images to a common size of 300 × 150 and convert them to 3-dimensional NumPy arrays of shape (150, 300, 3) (height, width, RGB channels). Each entry of the array is the RGB code of a pixel.
After flattening every matrix into a 1-dimensional vector of 300 × 150 × 3 values with the flatten_matrix function, I gathered all the vectors in a big array called data.

import numpy as np
from PIL import Image

COMMON_SIZE = (300, 150)

def image_to_matrix(filename):
    """Takes a filename and turns it into a numpy array of RGB pixels."""
    img = Image.open(filename)
    img = img.resize(COMMON_SIZE)
    img = np.array(img)
    return img

def flatten_matrix(matrix):
    """Takes an (m, n, 3) numpy array and flattens it
    into an array of shape (m * n * 3,)."""
    s = matrix.shape[0] * matrix.shape[1] * 3
    mat = matrix.reshape(1, s)
    return mat[0]
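To sanity-check these helpers, you can run their logic on a synthetically generated image instead of a file on disk (this sketch assumes Pillow is installed; note that NumPy puts the height first, so the resized array has shape (150, 300, 3)):

```python
import numpy as np
from PIL import Image

# Build a dummy 640x480 RGB image in memory instead of reading a file
img = Image.fromarray(np.zeros((480, 640, 3), dtype=np.uint8))
img = img.resize((300, 150))          # same resize as image_to_matrix
matrix = np.array(img)
print(matrix.shape)                   # (150, 300, 3): height comes first in NumPy

# Same flattening as flatten_matrix
flat = matrix.reshape(1, matrix.shape[0] * matrix.shape[1] * 3)[0]
print(flat.shape)                     # (135000,)
```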

import os

classes = ['Giraffe', 'Car']
train_image_dirs = ["train/Giraffe/", "train/Car/"]
train_images = []
train_labels = []
# Retrieval of the paths of all pictures from the "train" folder
for label, directory in zip(classes, train_image_dirs):
    temp_list = [directory + f for f in os.listdir(directory)]
    train_images.extend(temp_list)
    train_labels.extend([label] * len(temp_list))

# Creation of the whole training dataset
train_data = []
for image in train_images:
    image = image_to_matrix(image)
    image = flatten_matrix(image)
    train_data.append(image)
train_data = np.array(train_data)

I did the same with my test directory to obtain the test_data dataset, which is used to measure the accuracy of the classification.

test_labels = []
test_image_dirs = ["test/Giraffe/", "test/Car/"]
test_images = []
for label, directory in zip(classes, test_image_dirs):
    temp_list = [directory + f for f in os.listdir(directory)]
    test_images.extend(temp_list)
    test_labels.extend([label] * len(temp_list))

test_data = []
for image in test_images:
    image = image_to_matrix(image)
    image = flatten_matrix(image)
    test_data.append(image)
test_data = np.array(test_data)

I used RandomizedPCA from the sklearn library to reduce each vector representing an image to 2 dimensions. This way it was easier to get a rough idea of whether the classification would be accurate, thanks to a simple 2D plot of each picture's coordinates.

import pandas as pd
import pylab as pl
from sklearn.decomposition import RandomizedPCA

pca = RandomizedPCA(n_components=2)
X = pca.fit_transform(train_data)
df = pd.DataFrame({"x": X[:, 0], "y": X[:, 1], "class": np.asarray(train_labels)})
colors = {'Car': "blue", 'Giraffe': "yellow"}
figure, axes = pl.subplots()
axes.scatter(df['x'], df['y'], c=df['class'].apply(lambda x: colors[x]))

Cars are in blue while giraffes are in yellow

Intermediate conclusion: classification will not be very easy with this approach, because it is hard to see a clear separation between the blue and yellow dots. Let’s confirm this.
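If you want a number rather than an eyeball check, one option (not part of the original workflow) is to cross-validate a simple classifier on the 2D projection: a score near 50% confirms that the two classes overlap. A sketch with stand-in data in place of the real PCA coordinates:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Stand-in for the 2D PCA coordinates and labels from the plot above:
# two heavily overlapping point clouds, 20 samples per class
rng = np.random.RandomState(0)
X2d = np.vstack([rng.randn(20, 2), rng.randn(20, 2) + 0.5])
labels = np.array(['Giraffe'] * 20 + ['Car'] * 20)

scores = cross_val_score(LogisticRegression(), X2d, labels, cv=5)
print("Mean CV accuracy on the 2D projection: %.2f" % scores.mean())
```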

I then decided to reduce the dimensions of my train and test vectorized images to 10 components.
I used two classifiers from the sklearn library, RandomForestClassifier and KNeighborsClassifier, to see how often they could correctly predict whether a picture from the test directory was a car or a giraffe. Obviously, I trained them on the train dataset first.
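The 10-component reduction that produces train_X and test_X is not shown in the original snippets; here is a sketch of how it could be built. Note that in recent scikit-learn versions, RandomizedPCA has been replaced by PCA with svd_solver='randomized'; the data arrays below are random stand-ins for the real flattened images:

```python
import numpy as np
from sklearn.decomposition import PCA  # replaces RandomizedPCA in newer sklearn

# Stand-ins for the flattened image arrays built earlier
rng = np.random.RandomState(42)
train_data = rng.rand(40, 300 * 150 * 3)
test_data = rng.rand(16, 300 * 150 * 3)

pca = PCA(n_components=10, svd_solver='randomized')
train_X = pca.fit_transform(train_data)  # fit on the training images only
test_X = pca.transform(test_data)        # reuse the fitted PCA on the test set
print(train_X.shape, test_X.shape)       # (40, 10) (16, 10)
```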

from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

kn = KNeighborsClassifier(n_neighbors=6)
rf = RandomForestClassifier(random_state=42)

kn.fit(train_X, train_Y)
rf.fit(train_X, train_Y)
print("KNeighbors average score: %f" % kn.score(test_X, test_Y))
print("RF average score: %f" % rf.score(test_X, test_Y))

The average scores I obtained are the following: KNeighborsClassifier: at most 71% correct predictions, with the number of nearest neighbors n_neighbors set to 6.
RandomForestClassifier: 68% correct predictions.

This way of processing is not accurate enough, probably because it is only based on the colours found in the images. It might work well for problems such as classifying fruits or other objects with a fixed colour. Nevertheless, in my case, I had to find a more convincing classification model.

2nd attempt:

I continued surfing the web for better tools and soon came across a very powerful Python library when it comes to images: SimpleCV.

I installed it with the invaluable help of this website. I also installed Python 2.7: I had always worked with a 3.x version, but SimpleCV doesn’t support Python 3 yet.

Then I started looking into its main features.

It offers a large number of extractors, which are classes able to extract data from pictures. For example:

  • HueHistogramFeatureExtractor – gathers pixels into bins depending on their hue
  • HaarLikeFeatureExtractor – Haar-like features are relevant for detecting contrasts or variations in a picture’s pixels
  • EdgeFeatureExtractor – gathers into bins the lengths and directions of the lines in a picture
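To build an intuition for what a hue histogram extractor produces, here is a rough NumPy-only equivalent (my own sketch, not SimpleCV code): convert each pixel to a hue value and count the pixels falling into a fixed number of bins.

```python
import colorsys
import numpy as np

def hue_histogram(rgb_image, bins=10):
    """Rough stand-in for a hue histogram feature: one count per hue bin."""
    pixels = rgb_image.reshape(-1, 3) / 255.0
    hues = np.array([colorsys.rgb_to_hsv(r, g, b)[0] for r, g, b in pixels])
    hist, _ = np.histogram(hues, bins=bins, range=(0.0, 1.0))
    return hist

# A 4x4 all-blue image: every pixel lands in the same hue bin (blue hue ~ 0.67)
img = np.zeros((4, 4, 3), dtype=np.uint8)
img[..., 2] = 255
print(hue_histogram(img))  # all 16 pixels in bin 6
```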

The library offers some machine learning tools enabling users to use well-known models directly on pictures.

The process I used is the following:

  • Create train and test folders and fill them with subfolders ordered by class (already done in the previous attempt)
  • Initialize my extractors and classifiers
  • Train and test my classifiers and print the results
  • Try to change some parameters to improve my models’ accuracy
  • Try to classify new pictures
  1. See attempt 1
  2. Initialize your extractors and classifiers

I initialized three extractors. See the documentation for their parameters.

from SimpleCV import *  # brings in the feature extractors and classifiers

# Initialization of extractors
edgeExtractor = EdgeHistogramFeatureExtractor(20)
hueExtractor = HueHistogramFeatureExtractor(10)
morphoExtractor = MorphologyFeatureExtractor()

# Creation of a list of extractors to pass as an argument to the classifiers
extractorsList = [edgeExtractor, hueExtractor, morphoExtractor]

Then, you have to pass your list of extractors as input to your classifiers.

# Initialization of classifiers
svm = SVMClassifier(extractorsList)
tree = TreeClassifier(extractorsList, flavor='Boosted')
naiveBayes = NaiveBayesClassifier(extractorsList)
  3. Train and test steps

As in every machine learning process, you will have to train and test your classifying model on different datasets. In my case, the work is done on images but we will see that, behind the scenes, it still relies on datasets.

trainPaths = ['./train/Car/', './train/Giraffe/']
classes = ['Car', 'Giraffe']

# Training of all classifiers
svm.train(trainPaths, classes, savedata="mydata.txt", verbose=False)
tree.train(trainPaths, classes, verbose=False)
naiveBayes.train(trainPaths, classes, verbose=False)

It is interesting to notice that the train and test methods have a savedata parameter, which enables the user to save the extracted data as datasets in specified files.

After having trained the classifiers on the train images, let’s test them on the test images. This will enable us to analyze the performance of our models and to try modifying some parameters to improve them.

testPaths = ['./test/Car/', './test/Giraffe/']

# Test of all classifiers and display of results
print "---------------------Results of test session ----------------"
print "SVM:", svm.test(testPaths, classes, verbose=False) # [good, bad, confusion]
print "Tree:", tree.test(testPaths, classes, verbose=False)
print "Naive Bayes:", naiveBayes.test(testPaths, classes, verbose=False)

Looking at the results, it seems clear that TreeClassifier is the best choice for our problem. I also noticed that using EdgeHistogramFeatureExtractor as the only feature extractor gave the same or better scores, so don’t be surprised if you don’t see the others from here on.

  4. Find the best configuration for our problem

What I can change to improve the accuracy is the number of training images, the parameters of my extractors, the number of extractors, and the parameters of my classifier.

For example, we can compare the score of a TreeClassifier whose flavor parameter is set to ‘Forest’ with one set to ‘Boosted’ (lots of little trees vs highly optimized trees), as a function of the number of bins of my EdgeHistogramFeatureExtractor.

For this we will use the matplotlib library to display a graph of each classifier’s success score depending on the number of bins of the EdgeHistogramFeatureExtractor.

import numpy as np
import matplotlib.pyplot as plt

# Comparison of tree flavor "Forest" and tree flavor "Boosted"
treeTreeScores = []
treeBoostedScores = []

# Calculation for the Forest flavored tree
for i in range(8, 22):
	edge = EdgeHistogramFeatureExtractor(i)
	treeTree = TreeClassifier([edge], flavor='Forest')
	treeTree.train(trainPaths, classes, verbose=False)
	score = treeTree.test(testPaths, classes, verbose=False)[0]
	treeTreeScores.append(score)
	print "My forest flavored tree score is", score

# Calculation for the Boosted flavored tree
for i in range(8, 22):
	edge = EdgeHistogramFeatureExtractor(i)
	treeBoosted = TreeClassifier([edge], flavor='Boosted')
	treeBoosted.train(trainPaths, classes, verbose=False)
	score = treeBoosted.test(testPaths, classes, verbose=False)[0]
	treeBoostedScores.append(score)
	print "My boosted tree score is", score

x = np.array(range(8, 22))
y1 = np.asarray(treeTreeScores)
y2 = np.asarray(treeBoostedScores)

plt.plot(x, y1, label="Forest flavored")
plt.plot(x, y2, label="Boosted flavored")
plt.xlabel("Number of bins of the EdgeHistogramFeatureExtractor")
plt.ylabel("Percentage of good answers of our classifier")
plt.title("Comparison between Forest and Boosted flavors")
plt.legend()
plt.show()



The graph shows that the ‘Forest’ flavored classifier seems a bit steadier than the other across different numbers of bins. However, what matters is getting the highest score. That’s why I will continue with the ‘Boosted’ model and use the number of bins where it scores best: 20.
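Instead of reading the best bin count off the graph, you can pick it programmatically with argmax; a small sketch with made-up score values standing in for treeBoostedScores:

```python
import numpy as np

bin_counts = np.arange(8, 22)  # same range as the loops above
# Hypothetical scores, one per bin count, standing in for treeBoostedScores
treeBoostedScores = [62, 65, 70, 68, 71, 69, 73, 70, 72, 74, 71, 69, 80, 75]

best = bin_counts[int(np.argmax(treeBoostedScores))]
print("Best number of bins:", best)  # 20 for these made-up scores
```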

  5. Try to classify new pictures

Thanks to step 4, we were able to determine which custom classifier is the most accurate for our problem. Let’s confront our model with new pictures!

For that, we will use the ImageSet class which is convenient to show pictures and the results of classification.

edgeExtractor = EdgeHistogramFeatureExtractor(20)
tree = TreeClassifier([edgeExtractor], flavor='Boosted')
tree.train(trainPaths, classes, verbose=False)  # retrain with the chosen configuration

listImages = ImageSet()
for p in ['./newImages']: # 10 new images in the newImages folder
	listImages += ImageSet(p)

for image in listImages:
	className = tree.classify(image)
	image.drawText(className)
listImages.show() # show images with the name predicted by our classifier

On 10 random pictures, the model was correct 8 times (see video below).

The model is not perfect, but the accuracy is satisfactory given how little data was provided and how little time was devoted to parameter optimization.

It could easily be improved by doing the latter and by adding many more training pictures.
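One cheap way to "add more training pictures" without collecting new ones is to augment the existing arrays, for example with horizontal flips. This is a common trick, not part of the original pipeline; the array shapes below are stand-ins for a batch of training images:

```python
import numpy as np

# Stand-in for a batch of training images as (N, height, width, 3) arrays
rng = np.random.RandomState(0)
images = rng.randint(0, 256, size=(20, 150, 300, 3), dtype=np.uint8)
labels = np.array(['Car'] * 10 + ['Giraffe'] * 10)

flipped = images[:, :, ::-1, :]            # mirror each image left-right
augmented = np.concatenate([images, flipped])
augmented_labels = np.concatenate([labels, labels])
print(augmented.shape)                     # (40, 150, 300, 3)
```

A flip does not change whether a picture shows a car or a giraffe, so the labels are simply duplicated.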

To go further:

To bring image analysis back to a classical machine learning problem with datasets, I used the SimpleCV documentation to create a saveData function that converts images into datasets of extracted features. You can use saveData to turn your train folder into a train dataset.

import glob
import os
import orange
from SimpleCV import Image

def trainImageSet(imageset, className, extractors, rawDataset):
	count = 0
	badFeat = False
	for img in imageset:
		featureVector = []
		for extractor in extractors:
			feats = extractor.extract(img)
			if feats is not None:
				featureVector.extend(feats)
			else:
				badFeat = True
		if badFeat:  # skip images for which an extractor failed
			badFeat = False
			continue
		featureVector.extend([className])
		rawDataset.append(featureVector)
		count = count + 1
		del img
	return count

def trainPath(path, className, extractors, rawDataset):
	count = 0
	files = []
	for ext in ["*.jpg", "*.png"]:
		files.extend(glob.glob(os.path.join(path, ext)))
	nfiles = len(files)
	badFeat = False
	for i in range(nfiles):
		infile = files[i]
		img = Image(infile)
		featureVector = []
		for extractor in extractors:
			feats = extractor.extract(img)
			if feats is not None:
				featureVector.extend(feats)
			else:
				badFeat = True
		if badFeat:  # skip images for which an extractor failed
			badFeat = False
			continue
		featureVector.extend([className])
		rawDataset.append(featureVector)
		count = count + 1
		del img
	return count

def saveData(imagesPath, classes, extractors, outputFile):
	count = 0
	rawDataset = []
	for i in range(len(classes)):
		if isinstance(imagesPath[i], str):
			count = count + trainPath(imagesPath[i], classes[i], extractors, rawDataset)
		else:
			count = count + trainImageSet(imagesPath[i], classes[i], extractors, rawDataset)
	colNames = []
	for extractor in extractors:
		colNames.extend(extractor.getFieldNames())
	if count <= 0:
		print "Warning: No features were extracted"
		return None
	orangeDomain = orange.Domain(map(orange.FloatVariable, colNames), orange.EnumVariable("type", values=classes))
	datasetOrange = orange.ExampleTable(orangeDomain, rawDataset)
	orange.saveTabDelimited(outputFile, datasetOrange)

saveData(["./train/Car/", "./train/Giraffe/"], ["Car", "Giraffe"], extractorsList, "train_data.txt")

After that, you can try to create and tune your own classifiers with an existing machine learning library such as the famous scikit-learn. Indeed, SimpleCV classifiers are not very customizable, so it would be profitable to use another library to get more accurate results.
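Once the features are extracted into a plain array, plugging them into scikit-learn is straightforward. Here is a sketch using GradientBoostingClassifier (scikit-learn's boosted trees, analogous to the 'Boosted' flavor above); the feature vectors are random stand-ins for real extracted features such as 20 edge-histogram bins:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Stand-in for feature vectors extracted from images (e.g. 20 edge-histogram bins)
rng = np.random.RandomState(1)
features = rng.rand(40, 20)
labels = np.array(['Car'] * 20 + ['Giraffe'] * 20)

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.25, random_state=1)

clf = GradientBoostingClassifier()  # boosted trees with many tunable parameters
clf.fit(X_train, y_train)
print("Accuracy on held-out features: %.2f" % clf.score(X_test, y_test))
```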

How about you try to classify your own objects now? I would be glad to hear about the results you get whether they are good or bad!