Text Classification with NLP on Kaggle Twitter Data
A 4-person team NLP Pipeline proposal on the “Real or Not? NLP with Disaster Tweets” Kaggle Competition
0. Introduction
Overview
This paper is based on the Kaggle competition Real or Not? NLP with Disaster Tweets, an introductory challenge meant as practice for Natural Language Processing with a focus on Text Classification. The competition creators gathered 10875 tweets that may be reporting an emergency or a man-made or natural disaster; how the tweets were selected is left unspecified.
Dataset
There are three provided files:
- train.csv — the training set
- test.csv — the test set
- sample_submission.csv — the framework for official competition submissions
The training dataset contains these columns:
- id: a unique numeric identifier for each tweet
- text: the actual content in the tweet
- keyword: keywords from the tweet manually selected by the competition creators (may be blank)
- location: the location the tweet was sent from (may be blank)
- target: the label, hand-classified by the creators as either 0 for non-disaster or 1 for disaster
The test set has the same columns as the training set, except that it has no target column.
The sample_submission file contains the id of every test tweet along with a target column, which competitors must populate with the predictions of their proposed models.
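To make the file layout concrete, here is a minimal sketch of how the columns described above might be read and how submission rows could be assembled. It uses only Python's standard csv module. The sample rows, the helper names load_rows and make_submission, and the always-0 baseline are assumptions made for illustration; they are not part of the competition files or of the team's pipeline.

```python
import csv
import io

# Hypothetical rows that mimic the train.csv layout described above.
# The tweets are invented for illustration, not taken from the dataset.
TRAIN_CSV = """id,keyword,location,text,target
1,,,Forest fire spreading near the highway,1
2,ablaze,London,This new album is ablaze,0
3,flood,,Streets flooded after heavy rain,1
"""


def load_rows(csv_text):
    """Parse CSV text into a list of dicts, mapping blank fields to None."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        {key: (value if value != "" else None) for key, value in row.items()}
        for row in reader
    ]


def make_submission(test_rows, predict):
    """Build submission rows (id, target) from a prediction function."""
    return [{"id": row["id"], "target": predict(row["text"])} for row in test_rows]


train_rows = load_rows(TRAIN_CSV)

# The test set has the same columns minus target, so we simulate it here
# by dropping that column from the hypothetical training rows.
test_rows = [
    {key: value for key, value in row.items() if key != "target"}
    for row in train_rows
]

# Placeholder baseline (an assumption, not a real model): always predict 0.
submission = make_submission(test_rows, lambda text: 0)
```

With real data, the in-memory string would be replaced by the contents of train.csv and test.csv, and the placeholder lambda by a trained classifier.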
Goal
The challenge of the competition is to create a model that most accurately predicts the target for each tweet in the test set and achieves a high F1 Score (read more at https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html).
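For readers unfamiliar with the metric, the F1 score is the harmonic mean of precision and recall. The sketch below computes it by hand for binary labels so the arithmetic is visible. The function name f1_score_binary and the example labels are assumptions made for illustration; in practice the scikit-learn function linked above would normally be used instead.

```python
def f1_score_binary(y_true, y_pred):
    """Compute the F1 score for binary labels, treating 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    true_pos = sum(1 for t, p in pairs if t == 1 and p == 1)
    false_pos = sum(1 for t, p in pairs if t == 0 and p == 1)
    false_neg = sum(1 for t, p in pairs if t == 1 and p == 0)

    # With no true positives, precision and recall are zero or undefined;
    # this sketch returns 0.0 in that case.
    if true_pos == 0:
        return 0.0

    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


# Invented labels: 1 = disaster tweet, 0 = non-disaster tweet.
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]

# 3 true positives, 1 false positive, 1 false negative,
# so precision = recall = 0.75 and the F1 score is 0.75.
score = f1_score_binary(y_true, y_pred)
```

Because F1 balances precision against recall, a model cannot score well simply by labeling every tweet as a disaster or every tweet as a non-disaster.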
This team, however, hoped to use the challenge as a final academic opportunity to apply as many of the skills they learned in a Practical Data Science Tutorial as possible, while exploring additional Data Science fields and methods that interested them.
Thus, the team decided to build an NLP pipeline for this challenge, split it into four parts, and divide the parts among themselves, as described in the Team Roles section below.
Important Note: Each member worked on a different section of the pipeline, and those sections naturally vary in the effort they require. Thus, while the team coordinated so that each section’s output fed into the next, each member also individually explored aspects of the data or task that were not strictly necessary for the pipeline. Explore their individual contributions through the articles linked below.