Here’s one way to teach an introductory class to NLP

Nathan Torento
8 min read · Dec 2, 2020

Hint: it required a fair bit of preparation from the class participants, but even more from me to prepare the class.

Creating an NLP pipeline on Python through the NLTK package for simple sentiment analysis of movie reviews

Screenshot of me teaching the class

There’s nothing that says “I know a lot about this topic” more than creating an entire class on it and receiving overall positive feedback from the class #humblebrag.

Here’s a breakdown of my lesson plan for an introductory class on NLP, with an emphasis on practical application and long-term retention by having the class create an NLP pipeline. Keep in mind that this was tailored to fit my Practical Data Science Tutorial class in my senior year at my university, Minerva Schools at KGI. By this point, the class had covered fundamental programming and data science skills like preprocessing, model selection, and evaluation. Each class lasts 1.5 hours and operates under the “active learning” theory, where students come to class having already self-studied the content (here’s a doc with the readings I prescribed).

This is the general format of a class in Minerva from the perspective of the instructor:

  • Explain the Learning Goals to accomplish by the end of the class
  • Collect Pre-class work and give a Prep Assessment Poll
  • Activity 1: Start an activity in breakout groups (or as a class) with the aim of developing existing knowledge and/or clearing up any confusion
  • Debrief Activity 1: Discuss the insights gained in Activity 1 as a class
  • Activity 2: Start an activity that further extends or creatively applies existing knowledge
  • Debrief Activity 2: Discuss the insights gained in Activity 2 as a class
  • Send out a Reflection Poll that nudges the class to summarize the entire class and confirm they took away the most important information

For each section, I will give a rough estimate of the time required, describe how I chose to spend that time, followed by the rationale and a glimpse of the content. I’ve put in links to all the relevant resources and code. Feel free to reach out to me if any of them don’t work.

[0:00–0:05] Learning Goals

I begin class by explaining the Learning Goals below.

- Be able to build an NLP pipeline
- Gain experience and familiarity with the NLTK package

Before we go into the actual content of the class, let me explain the rationale behind my most important decisions.

Why build a pipeline?

Natural Language Processing is a field of data science covering the theories and methods programmers use to gather, process, and present information about spoken and written text, i.e. natural language. Like all data-science tasks in the real world, programmers will often be told to “make sense of this data”. In NLP, this means they will have to collect, process, and present some text, often all by themselves.

For those with some experience in the field, this is no simple task, or at least not a quick one. If you are skeptical about how I managed to fit a seemingly gargantuan task into 1.5 hours, much as my professor originally was, read on and find out below.

Why Sentiment Analysis?

For my work-study, I had to analyze the theses of my peers, and I noticed that a majority of the topics involved sentiment analysis. Apart from that, sentiment analysis can be implemented simply (in a few lines of code), and it’s a skill with several benefits.

Why the NLTK library?

The NLTK library is the go-to introductory library for Python users diving into NLP. It contains crucial functions for everything NLP-related, from the pre-processing to the analysis stage, ships with multiple *corpora for training or testing, is constantly updated, and has in-depth documentation.

*corpus (plural corpora): a collection of written or spoken text made electronically accessible for educational or practical purposes.

Why Rotten Tomatoes reviews?

Movie reviews from Rotten Tomatoes are among the most popular and most accessible binary reviews out there today: real, accessible text that can be classified as “positive” or “negative”. Furthermore, no complicated code or APIs are required to collect the information.

[0:05–0:11] Pre-class work (PCW) and Prep Assessment Poll

I give the class a few minutes to quickly submit the link to their code for pre-class work (described below), and then later, ask them to answer the prep assessment poll (also described below).

Pre-class work

1. Search for your favorite movie on Rotten Tomatoes.
2. Manually copy down 10 critic reviews: 5 “fresh” (good) reviews and 5 “rotten” (bad) reviews. Separate each by a new line and save them in a plain ‘review.txt’ file which you can easily open and access later.
3. In less than 100 lines, perform a sentiment analysis. Be sure to include three functions or code chunks, and to use the nltk package effectively to cut down the amount of code you’ll need to write yourself:
   1. clean(reviews) # make sure to transform text into lowercase, remove numbers, remove stop words, remove punctuation
   2. lemmatize(text) # hint: from nltk.stem import WordNetLemmatizer, PorterStemmer
   3. sentiment analysis # train a model based on the built-in nltk movie_reviews corpus to categorize each of your selected reviews as either ‘positive’ or ‘negative’
4. Have a final output that reports the accuracy of your model (out of 10) in successfully classifying your selected reviews.

Prep assessment poll

Identify the pre-processing steps you took for PCW and explain their importance to your larger goal of sentiment analysis.

Yes, quite a bit is expected of the students, learning this new package and putting it into practice for an “intro” class. Alas, ’tis the standard of computer science classes at Minerva. Trust, however, that this effort is put to good use in the following activities.

[0:11–0:26] Activity 1: Discuss PCW in breakout groups

Compare and contrast how everyone in the breakout group implemented pre-class work, particularly with how they trained the model. Note differences that affected runtime, amount of code, and final accuracy.

[0:26–0:41] Debrief Activity 1: Share insights

Everyone share at least one insight they gained from the discussion. What were some practical challenges you ran into? How did someone else overcome those challenges?

This is a crucial time for people to debug code and exchange implementation ideas, such as which packages to use. Given that the PCW capped the number of lines of code, that this is an introductory session, and that this is a “Practical Data Science” class, the simplest implementation is the most beneficial to participants, both in real life and for the upcoming activities.

[0:41–1:01] Activity 2: Work on separate parts of an example NLP pipeline

Combining all your reviews, create an NLP pipeline of movie reviews with sentiment analysis as the end goal. Make sure to label each review for easy accuracy testing later!

Input: combined reviews
Output: accuracy score

The pipeline, in working Python code, should be able to:
- combine everyone’s movie review text files into one (by code or manually)
- pre-process text, with well-commented cleaning elements of your choice
- lemmatize the text
- train a sentiment analysis model on some data (any justifiable classification algorithm trained on nltk.corpus.movie_reviews or on a k-fold cross-validation of your combined data)
- output a final accuracy score
Instructions: One person will probably have to work on their computer and share their screen.
Number of breakout groups: 3
An outline of the NLP Pipeline

For this activity, I split the class into three groups. Remember that the ultimate goal of the class was to create a functioning NLP pipeline, and each group is responsible for a section of it. For the activity, they are tasked with working through a Python notebook with pre-filled code and specific instructions that test their practical knowledge. If they run out of time or get stuck, they are free to call on their instructor (me) for help, or to view the answer key if they really need to. See the notebooks below.

Group 1: Pre-processing (notebook)
Group 2: Model training (notebook)
Group 3: Sentiment Analysis (notebook)

[1:01–1:21] Debrief Activity 2: Synthesize the code and run the pipeline

Paste each group's code into the pipeline and run the pipeline.

We then spent a few minutes copy-pasting every group’s code into the respective blocks. Should the code not work, I had backup code at the bottom, already run to completion, to show the class what the pipeline should look like and what it would have output had it worked.

Crucial pre-filled code for Group 3

Important Note: Because Group 2’s model-training section depends on Group 1’s pre-processing section being complete, I pre-filled Group 2’s and Group 3’s notebooks.
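For a sense of what Group 3's final scoring step boils down to, here is a sketch: given a trained classifier and the class's labeled, feature-extracted reviews, count how many are classified correctly. The function name `score_reviews` and the toy training set are my own illustrations, not the notebook's code.

```python
import nltk

def score_reviews(classifier, labeled_reviews):
    """labeled_reviews: (features, 'pos'/'neg') pairs.
    Returns (number classified correctly, total)."""
    correct = sum(1 for feats, label in labeled_reviews
                  if classifier.classify(feats) == label)
    return correct, len(labeled_reviews)

# Toy demonstration with a two-example training set.
train_set = [({"great": True}, "pos"), ({"awful": True}, "neg")]
clf = nltk.NaiveBayesClassifier.train(train_set)
print(score_reviews(clf, [({"great": True}, "pos"),
                          ({"awful": True}, "neg")]))
```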

We then had a discussion with the prompt below.

What were some technical challenges of working on this task as a group? What were some practical challenges of working on this task as a group?

Of course, my specific implementation did not necessarily choose the objectively “easiest” or “quickest” method.

[1:21–1:30] Reflection Poll

What technical skills or experience were you lacking that hindered your ability to finish PCW and Activity 2? How could they have made the tasks easier?

So, class went well. This is some of the positive feedback I received from my peers.

  • they now have practical experience of how to perform a sentiment analysis; in fact, my notebooks now serve as code they can use for future reference
  • although the pre-class required quite a bit of work, they felt that their effort really paid off because they were forced to directly apply their learnings in the activities
  • having the pre-class work, activity 1, and activity 2 build off each other ensured that the class was engaging and flowed naturally

Points of improvement for the future:

  • implement text scraping of websites: it’s often too much work to manually copy text from its original source, so programmers scrape… but let’s leave the diverse and nuanced topic of scraping for another class
  • format the input so that reviews are labelled, perhaps as a two-column .csv file; judging by Kaggle, this is usually how text for sentiment analysis comes pre-formatted
  • make sure that cells are pre-run, so that if the example code doesn’t run during class, the class can clearly see what was supposed to happen (I forgot to run the final output code for the entire pipeline)

I hope you all enjoyed this and found it useful! As much as the positive and constructive feedback gave me a much-needed ego and mood boost, I felt most fulfilled knowing I had taught genuinely useful skills and helped the class produce genuinely relevant output.


Nathan Torento

A compassion-first data scientist and problem solver.