Getting Down and Dirty with Image Classification

Nathan Torento
7 min read · Oct 22, 2020
This is a stock image of a farmer overlaid with a generic “data” background. The intention is to evoke sentiments of manual labor and intentional nurturing essential to growing your precious data science skills.

You’re a fledgling data scientist finally getting into… what data science topic is it this time?

Image Classification. Ahh yes, how practical of you.

You’ve also heard of Kaggle countless times. This time, you actually want to join one of their competitions… so you scroll through the most recent challenges in Image Classification, and select a challenge that piques your interest.

Cancer detection. Ahh yes, how humanitarian of you.

Now seems like the time. Time to fatten your scrawny body of applicable data science skills. Time to get through an entire Kaggle challenge alone, perhaps with a team of friends. Time to apply all the pre-processing, model training, and model testing technical and critical-thinking skills you’ve developed.

Well, this is what my team and I accomplished recently (article to come soon). What was my role in the group? Initial data exploration, structuring the Google Colab Python notebook so that everyone’s sections flowed into each other’s, and commenting everybody’s work so that both readers and the team would always know what was happening in the code.

This article will show you exactly how I contributed to our collaboration, why those contributions mattered, and hopefully pass along some practical tips. I’ll show you how we imported the Kaggle competition’s data, how we explored it to gain insights for later analysis, and how I set up the overall Google Colab iPython notebook so that it would be easy for readers and ourselves to follow along.

Creating an Introduction

For all intents and purposes, you could be working through an entire Kaggle challenge alone, and that’s perfectly fine. It just might take more time. For this article, let’s say you’re working in a team (or treat your future self as a separate team member).

As a team, one of your first priorities is to identify the exact problem, brainstorm and finalize a proposed solution, then fairly allocate tasks. Refer to the introduction I wrote for my team’s collaborative code. In it, I made sure to explain to readers what the challenge was, crucial details and context about the dataset, what the dataset looked like, and how the notebook and its sections were going to be divided. Note that the code was designed to run on Google Colab.

Introduction
This paper's dataset is taken from the Kaggle competition on Histopathologic Cancer Detection. It uses the PatchCamelyon (PCam) dataset: around 300k fixed-size, colored histopathology scans (histopathology is the study of tissue disease) of lymph nodes from all around the body. The specific challenge in the original dataset and competition is to train a model that can most accurately detect metastatic cancer. The overall .zip file contains pictures and train-test .csv files. The .csv files contain only two columns, id and label, where id is the unique id or name of the picture, and label indicates whether the picture is indeed indicative of metastatic cancer. This paper is created by Abdul Qadir, Asmaa Alaa Aly, Wei-Ting Yap, and Nathan Torento. For their and the reader's convenience, code and text are all written in this Google Colab notebook. It consists of four parts that they've split amongst themselves.

Next, Qadir and I worked together to instruct you on how to download the Kaggle dataset within Colab. Notice the use of the “google.colab” package and consider how that might change in other coding environments.

1. Data Preparation
Downloading the Dataset from Kaggle (Guide)
It’s difficult to make a >6 GB dataset easily accessible. Simply follow the instructions below, and at some point, you will gain permission to download the data from the Kaggle website yourself.

# Install and upgrade the kaggle package before you can download the dataset
! pip install -q kaggle
! pip install --upgrade --force-reinstall --no-deps kaggle

Follow the steps at the link below; you should have a kaggle.json file at the end of it.
https://www.kaggle.com/general/74235

# Run this cell, then upload your "kaggle.json" file when prompted.

from google.colab import files
files.upload()

# Below is the code to gain permission to download the dataset
# An exclamation mark runs the rest of the line in your shell
! mkdir ~/.kaggle # create a hidden .kaggle directory in your home folder
! cp kaggle.json ~/.kaggle/ # cp is copy; copy kaggle.json into that directory
! chmod 600 ~/.kaggle/kaggle.json # chmod 600 restricts read/write access to the file's owner

# Download the desired dataset (in the default zip format)
! kaggle competitions download -c histopathologic-cancer-detection

# Unzip and load the dataset onto your Colab runtime
import zipfile

with zipfile.ZipFile('histopathologic-cancer-detection.zip') as zip_file:
    zip_file.extractall()

# Count the images in each folder to verify a successful import,
# or open the “Files” icon on the menu on the left side of the notebook
import os

print(len(os.listdir('../content/train')))
print(len(os.listdir('../content/test')))

Now comes the time to pre-process. Note the main decisions we made: augmentation, class balancing, and train_test_split. Another teammate, Abdul Qadir, created the augmentation function, and was thus responsible for choosing the specific types of augmentation and justifying each choice. Notice how I make sure to mention that, at this stage, we must already format the data to fit the requirements of whatever exact model we will train. This requires looking at the model function’s documentation and some examples, which is always good practice for any task where you’re using a package and its functions.

2. Data Processing
Now that the data has been properly loaded and set up, we must pre-process it: in our case, we mainly subset the data, augment the images, and split them into train and test sets. Note that we decided early on to use a deep neural network, so the format of our data must be set up to fit the input requirements of the tensorflow.keras.models.Sequential() function.
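To make that requirement concrete, here is a minimal sketch of a Sequential model, assuming TensorFlow/Keras. It is an illustration of how input_shape dictates the exact format the pre-processed images must take, not our team’s actual architecture; PCam patches are 96x96 RGB scans, so the pipeline has to deliver arrays of shape (96, 96, 3).

# A minimal sketch, not our team's actual model: the point is that
# input_shape fixes the array format the pre-processing must produce
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(96, 96, 3)),            # pre-processed images must match this shape
    layers.Conv2D(32, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(1, activation='sigmoid'),      # binary label: metastatic cancer or not
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()                                 # prints the expected input/output shapes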

# Load the train_labels csv (image id and label for every picture) into a DataFrame
import pandas as pd

data = pd.read_csv('../content/train_labels.csv')

Justification for augmentation
This Kaggle challenge is ultimately a supervised machine learning challenge. Supervised machine learning, however, requires plenty of diverse training data to accurately predict future, possibly characteristically different data points without overfitting or underfitting. We already have hundreds of thousands of data points, so why would we want to add more? The original images are all of a similar format, in that the size of the space for possible tumor tissue at the center of each image is the same. Therefore, we felt that creating augmented versions of these images would mitigate overfitting.
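To give you a feel for what image augmentation looks like in code, here is a sketch using Keras’ ImageDataGenerator. To be clear, this is not Qadir’s actual augmentation function; the specific transformations below are my own illustrative assumptions.

# Illustrative augmentation pipeline; the transformations chosen here are
# assumptions for the sake of example, not the ones our team used
from tensorflow.keras.preprocessing.image import ImageDataGenerator

augmenter = ImageDataGenerator(
    rescale=1.0 / 255,       # scale pixel values to [0, 1]
    rotation_range=20,       # small random rotations
    horizontal_flip=True,    # tissue scans have no canonical orientation
    vertical_flip=True,
    zoom_range=0.1,          # slight random zooms
)
# Typical usage is augmenter.flow_from_dataframe(...) or augmenter.flow(...),
# which yields augmented batches during training instead of storing new images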

Justification for class balancing
A histogram of the image classification labels (via data[‘label’], shown below) reveals a noticeable imbalance. Although it’s not drastic, class imbalance is a common problem in machine learning that is known to worsen the predictive performance or accuracy of a model, because one label gets “stronger” training. A model trained on this data might become more sensitive to identifying 0s and predict more 0s than 1s. Hence, we decided to sample 80,000 of each class (a number another notebook recommended as high enough not to change the accuracy compared to using all the data, but small enough to improve model training speed).

Cheaply made “label” histogram; not pretty but it gave me the insight I needed.
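If you want to reproduce that kind of quick plot yourself, a few lines of pandas and matplotlib are enough; this is a sketch of the general approach rather than the exact cell from our notebook.

# Bar chart of the label counts; enough to spot the class imbalance at a glance
import matplotlib.pyplot as plt

data['label'].value_counts().plot(kind='bar')
plt.xlabel('label (0 = no tumor tissue, 1 = tumor tissue)')
plt.ylabel('number of images')
plt.show()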
Balance the target distribution
As decided earlier with the variable SAMPLE_SIZE, we will subset our original data into 160,000 images: half labelled 0, the other half labelled 1.
# SAMPLE_SIZE (80,000 per class) was decided above; define it here so the snippet runs on its own
SAMPLE_SIZE = 80000
df_data = data

# take a random sample of SAMPLE_SIZE images from class 0
df_0 = df_data[df_data['label'] == 0].sample(SAMPLE_SIZE, random_state=101)
# take a random sample of SAMPLE_SIZE images from class 1
df_1 = df_data[df_data['label'] == 1].sample(SAMPLE_SIZE, random_state=101)
# concatenate the two samples into one balanced dataframe
df_data = pd.concat([df_0, df_1], axis=0).reset_index(drop=True)
# shuffle the rows so the two classes are interleaved
from sklearn.utils import shuffle
df_data = shuffle(df_data)
df_data['label'].value_counts()

Justification for train_test_split
The train test split is a well-established and quite essential process in machine learning. In cases where the dataset is diverse and large enough, not too imbalanced, or the provided data is augmented and/or bootstrapped, splitting the data into a train and test set lets us measure, and therefore optimize, the predictive power of our model on data it has never seen.

# stratify=y preserves the 50/50 label split in both the train and validation sets
from sklearn.model_selection import train_test_split

y = df_data['label']
df_train, df_val = train_test_split(df_data, test_size=0.10, random_state=101, stratify=y)
print(df_train.shape)
print(df_val.shape)

Other teammates were responsible for the model creation, training, and analysis portions. However, I helped set the tasks for the team members responsible for the next stages by doing two things: leaving comments summarizing the format of the pre-processed data they would need for the model, and leaving blank sections to remind them to justify any major decisions. These were the comments I left, based on the ideas they proposed.

For Section 3: Model creation assessment:
- why a Deep Neural Network
- why the specific number of layers
- why the selected activation function
- why compare with the VGG function (a sketch of the general VGG16 pattern follows after these lists)
For Section 4: Presentation of findings:
- Note all the important accuracy results, compare, and provide logical deductions or inductions for why each model performed so.
- How might different aspects of the pre-processing have affected the final result? Ex: not augmenting, different ratio of train_test_splits, not balancing the labels
- How do our results answer the main question?
- What were our biggest technical or practical challenges and what can we improve for this task, and for working as a group next time?
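On that last Section 3 point, a common way to set up a VGG comparison is transfer learning with the pre-trained VGG16 from Keras Applications. The sketch below shows that general pattern only; it is not the code my teammates wrote for Section 3.

# Illustrative transfer-learning setup with VGG16, not our team's Section 3 code
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

base = VGG16(weights='imagenet', include_top=False, input_shape=(96, 96, 3))
base.trainable = False                       # freeze the pre-trained convolutional base

vgg_model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(256, activation='relu'),
    layers.Dense(1, activation='sigmoid'),   # binary output: metastatic cancer or not
])
vgg_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])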

I hope you gained truly practical insights from my experience. Good luck to you in your journey towards becoming a more skilled data scientist. May the data be with you.

