25 November 2017

Code and data can be found on Github

Figured I should chime in here since it’s been… let’s see here… ah, 4 months. I had previously promised to write about some NCAA hockey projects I had been working on. That… will have to wait -_-. In the meantime, let’s look instead at some Saturday Night Live data!

For those not familiar, Saturday Night Live, commonly abbreviated as ‘SNL’, is a comedy-variety show that airs live on NBC on select Saturday nights over the course of a ‘season’, which runs from September through May. SNL first aired in 1975, and is currently airing its 43rd season. Each episode features a celebrity guest host who participates in most of the sketches, along with a musical guest who performs twice over the course of the show.

A simple question came to me after watching a disappointing episode of SNL hosted by Kumail Nanjiani a few weeks ago. This episode just happened to be the third in a sequence of episodes airing over consecutive weeks. Knowing the grueling schedule that SNL writers operate on, I was curious if, as in sports such as hockey, some kind of fatigue, in this case something akin to writer’s block, may impact the quality of an episode.

In this post, I’ll perform exploratory analysis on SNL episode quality to get an idea of whether or not my ‘writer’s fatigue’ hypothesis makes sense. Episode quality will be based on averaged ratings (on a 0-10 scale) of users on IMDb, inspired by previous analyses of TV shows such as ‘The Simpsons’. I’ll note that, while writing this post, I thought the most recent episode hosted by Chance the Rapper was fantastic, and also happened to be the third episode in a stretch of three consecutive weeks of episodes.

I’ll also make clear right at the onset that this post is meant mostly as a fun exercise. There are a number of caveats to using IMDb data, such as the variable number of votes given to each episode, the composition of users that determine the ratings, and the rounding of final ratings to 1 decimal place. SNL ratings in particular are likely highly driven by topical events that occurred around the time of each episode, as well as the quality/notoriety/likability of the host. While a more determined analyst would do well to adjust for these factors, a much more involved cleaning job on the data will likely be required (something for the future perhaps?).

All that being said, let’s get right into it, starting by loading our handy libraries.

library(dplyr)     # Lifeblood of data science
library(rvest)     # Scrape scrape scrape
library(purrr)     # 'map' function
library(zoo)       # 'locf' function for carrying last observation forward
library(lubridate) # Working with dates 
library(stringr)   # Working with strings
library(ggplot2)   # Plots
library(ggridges)  # Fancy plots
library(viridis)   # User-friendly color scheme

Scrape, Scrape, Scrape!

For those not interested in the data collection portion of this post, but who still want to play along at home, full data can be found here, and feel free to skip directly to the analysis section.

As is usually the case for any project, the required data will have to be collected in pieces from disparate sources, cleaned separately and merged into a tidy format. First we need the ratings of each episode, which we scrape from IMDb using the read_html, html_nodes, and html_table functions from the rvest package. The main table of interest is the first table collected under the 'table' tag, which we convert to an R data frame.

url <- "http://www.imdb.com/title/tt0072562/epdate" #IMDb URL

raw.ratings <- url %>% 
  read_html %>% 
  html_nodes('table') %>% # Ratings table is taggeed as 'table'
  html_table(header=F) %>% # Extract tables
  `[[`(1) %>% # Retain first table with ratings
  data.frame(stringsAsFactors=F) # Convert to data frame
##    X1                                       X2         X3        X4
## 1   #                                  Episode UserRating UserVotes
## 2 1.1    George Carlin/Billy Preston/Janis Ian        7.6       266
## 3 1.2      Paul Simon/Randy Newman/Phoebe Snow        6.5       158
## 4 1.3                               Rob Reiner        7.0       129
## 5 1.4           Candice Bergen/Esther Phillips        7.2       126
## 6 1.5 Robert Klein/ABBA, Loudon Wainwright III        6.8       109
##                                           X5
## 1                                           
## 2 1\n2\n3\n4\n5\n6\n7\n8\n9\n10\n\n7.6/10\nX
## 3 1\n2\n3\n4\n5\n6\n7\n8\n9\n10\n\n6.5/10\nX
## 4 1\n2\n3\n4\n5\n6\n7\n8\n9\n10\n\n7.0/10\nX
## 5 1\n2\n3\n4\n5\n6\n7\n8\n9\n10\n\n7.2/10\nX
## 6 1\n2\n3\n4\n5\n6\n7\n8\n9\n10\n\n6.8/10\nX

I prefer scraping tables specifying header=F, as the headers of tables on websites tend to be in odd formats that screw up the scraping process (plus I usually end up changing variable names anyway). We remove the first row in our scraped table (which contains the headers we see on the web), then remove the last column of gibberish, and assign our own variable names. From there, we extract the season number from the first column of data, which is the number that appears before the . in that field. Finally, we filter out episodes from the current season (the 43rd season).

# Helper function for extracting the season from the full episode number
extract.season <- function(x){
  
  token.list <- x %>%
    as.character %>%
    strsplit(split=".", fixed=T) %>% # Season number appears before the '.'
    unlist
  
  return(as.numeric(token.list[1]))  # Convert to numeric before returning
}

# Clean SNL ratings data
snl.ratings <- raw.ratings %>%
  tail(-1) %>% #  Remove first row
  select(1:4) %>% # Retain first 4 columns
  rename(EpisodeNumber = X1,
         Guests = X2,
         Rating = X3,
         Votes = X4) %>% # Rename variables
  rowwise %>%
  mutate(Season = extract.season(EpisodeNumber)) %>% # Get season number from episode number
  select(Season, Guests, Rating, Votes) %>% # Retain variables
  ungroup %>%
  filter(Season < 43) # Not going to consider episodes from the currently airing season
##   Season                                   Guests Rating Votes
## 1      1    George Carlin/Billy Preston/Janis Ian    7.6   266
## 2      1      Paul Simon/Randy Newman/Phoebe Snow    6.5   158
## 3      1                               Rob Reiner    7.0   129
## 4      1           Candice Bergen/Esther Phillips    7.2   126
## 5      1 Robert Klein/ABBA, Loudon Wainwright III    6.8   109
## 6      1                              Lily Tomlin    7.2   108

So we have our ratings, but if we are to look at how breaks and consecutive runs of episodes affect ratings, we also need the date each episode aired (the ‘airdate’), which is not found in the data we collected. Various sources contain airdate information, including IMDb, though not in a handy table format, meaning a scraper would be quite a bit more tedious to design. Luckily, the airdates happen to be conveniently laid out on Wikipedia, and so we’ll simply go with that.

Tables of airdates, one for each season, are spread across three sections. We collect all tables across all three sections and bind these together. Note that tables in the last section contain some TV ratings information from Nielsen (these estimate number of viewers, not necessarily episode quality), which we omit when combining.

# Helper function for scraping SNL airdate tables from Wikipedia
extract.wiki.table <- function(url){
  
  raw.data <- url %>% 
    read_html %>% 
    html_nodes('.wikiepisodetable') %>% # Tables from wikipedia page are tagged with '.wikiepisodetable'
    map(html_table, header=F, fill=T) %>% # Convert scraped tables to data frames
    bind_rows # Bind the set of data frames into one large data frame
  
  return(raw.data)
}

# Gather and combine raw SNL airdate tables
raw.snl.1 <- extract.wiki.table("https://en.wikipedia.org/wiki/List_of_Saturday_Night_Live_episodes_(seasons_1%E2%80%9315)#Episodes")
raw.snl.2 <- extract.wiki.table("https://en.wikipedia.org/wiki/List_of_Saturday_Night_Live_episodes_(seasons_16%E2%80%9330)#Episodes")
raw.snl.3 <- extract.wiki.table("https://en.wikipedia.org/wiki/List_of_Saturday_Night_Live_episodes#Episodes") %>% select(1:5) # This table has the number of viewers in an extra column, which we leave out

raw.snl.combined <- rbind(raw.snl.1,raw.snl.2,raw.snl.3) # Bind the three data frames together
##   X           X1             X2             X3
## 1 1 No.\noverall No. in\nseason           Host
## 2 2            1              1  George Carlin
## 3 3            2              2     Paul Simon
## 4 4            3              3     Rob Reiner
## 5 5            4              4 Candice Bergen
## 6 6            5              5   Robert Klein
##                                                               X4
## 1                                               Musical guest(s)
## 2                                      Billy Preston & Janis Ian
## 3 Randy Newman, Phoebe Snow, Art Garfunkel & Jessy Dixon Singers
## 4                                                           none
## 5                                                Esther Phillips
## 6                                   ABBA & Loudon Wainwright III
##                               X5
## 1              Original air date
## 2  October?11,?1975?(1975-10-11)
## 3  October?18,?1975?(1975-10-18)
## 4  October?25,?1975?(1975-10-25)
## 5  November?8,?1975?(1975-11-08)
## 6 November?15,?1975?(1975-11-15)

Other than removing the website headers collected within the table (conveniently the string ‘Original air date’ appears in each case in the fifth column) and renaming our variables, we extract the airdate from the final column and convert to a more R-friendly format using lubridate.

# Helper function for extracting the correct airdate
extract.airdate <- function(airdate){
  
  clean.airdate <- airdate %>% 
    strsplit(split="(", fixed=T) %>% # Split off date string after '('
    unlist %>% # Convert to vector
    `[`(1) %>% # Extract first element
    trimws %>% # Remove whitespace from beginning and end
    mdy # Convert to simple date format
  
  return(clean.airdate)
}

# Clean SNL airdate data
snl.airdates <- raw.snl.combined %>% 
  rename(EpNumber = X1,
         SeasonEpNumber = X2,
         Host = X3,
         MusicalGuest = X4,
         AirDate = X5) %>% # Rename variables
  filter(AirDate != "Original air date") %>% # Remove rows not pertaining to actual episodes
  mutate(EpNumber = as.numeric(EpNumber),
         SeasonEpNumber = as.numeric(SeasonEpNumber)) %>% # Convert numeric columns
  rowwise %>%
  mutate(AirDate = extract.airdate(AirDate)) %>% # Convert airdate to R friendly format
  ungroup

Notice this data set does not contain the season number, but since we have the episode number within each season, we can extract the first episode of each season and assign its season number, then create a new variable in the original data set with the season number for these episodes and ‘NA’s for the rest. We then use the locf function from the zoo package to carry the last observation forward, filling out the season number for the remaning episodes.

# Extract first episodes from each season and assign a season number
first.episodes <- snl.airdates %>%
  filter(SeasonEpNumber == 1) %>%
  mutate(Season = 1:n()) %>%
  select(EpNumber, Season)
##   EpNumber Season
## 1        1      1
## 2       25      2
## 3       47      3
## 4       67      4
## 5       87      5
## 6      107      6
snl.airdates <- snl.airdates %>%
  left_join(first.episodes, by="EpNumber") %>% # Bind season number based on episode number
  mutate(Season = na.locf(Season)) %>% # Carry season number forward to remaining episodes in season
  select(Season, SeasonEpNumber, Host, MusicalGuest, AirDate) %>% # Retain variables of interest
  filter(Season < 43) # Not going to consider the season currently airing
##   Season SeasonEpNumber           Host
## 1      1              1  George Carlin
## 2      1              2     Paul Simon
## 3      1              3     Rob Reiner
## 4      1              4 Candice Bergen
## 5      1              5   Robert Klein
## 6      1              6    Lily Tomlin
##                                                     MusicalGuest
## 1                                      Billy Preston & Janis Ian
## 2 Randy Newman, Phoebe Snow, Art Garfunkel & Jessy Dixon Singers
## 3                                                           none
## 4                                                Esther Phillips
## 5                                   ABBA & Loudon Wainwright III
## 6                  Tomlin with Howard Shore & the All Nurse Band
##      AirDate
## 1 1975-10-11
## 2 1975-10-18
## 3 1975-10-25
## 4 1975-11-08
## 5 1975-11-15
## 6 1975-11-22

Note that, as in the previous case, we omit the episodes from the most recent season.

Merging our Data Sets

So we have two tables of data for what we assume is the same set of episodes. Simply bind the columns together and we’re good to go, right? Well, not so fast…

snl.airdates.season.counts <- snl.airdates %>%
  group_by(Season) %>%
  summarize(N1=n()) %>%
  arrange(Season)

snl.ratings.season.counts <- snl.ratings %>%
  group_by(Season) %>%
  summarize(N2=n()) %>%
  arrange(Season)

snl.airdates.season.counts %>%
  left_join(snl.ratings.season.counts, by="Season") %>%
  mutate(SameCounts = N1==N2) %>%
  filter(!SameCounts)
## # A tibble: 2 x 4
##   Season    N1    N2 SameCounts
##    <int> <int> <int>      <lgl>
## 1      2    22    23      FALSE
## 2     10    17    18      FALSE

It seems the number of episodes in each seasons don’t line up exactly; there are extra episodes in the ratings dataset for Seasons 2 and 10.

Typically, in a larger data setting, one would employ some kind of fuzzy matching scheme to match the Host field from the airdate dataset and the Guest field from the ratings dataset (especially since we can’t generally guarantee that the correct order of data points has been preserved). Since the data here is not particularly large, and there appears to only be two errors, we will inspect the situation manually.

# Seasons 2 and 10 have an extra episode in the ratings data set, let's find them!
snl.airdates %>%
  filter(Season == 2) %>%
  .$Host
##  [1] "Lily Tomlin"        "Norman Lear"        "Eric Idle"         
##  [4] "Karen Black"        "Steve Martin"       "Buck Henry"        
##  [7] "Dick Cavett"        "Paul Simon"         "Jodie Foster"      
## [10] "Candice Bergen"     "Ralph Nader"        "Ruth Gordon"       
## [13] "Fran Tarkenton"     "Steve Martin"       "Sissy Spacek"      
## [16] "Broderick Crawford" "Jack Burns"         "Julian Bond"       
## [19] "Elliott Gould"      "Eric Idle"          "Shelley Duvall"    
## [22] "Buck Henry"
snl.ratings %>%
  filter(Season == 2) %>%
  .$Guests 
##  [1] "Lily Tomlin/James Taylor"                                          
##  [2] "Norman Lear/Boz Scaggs"                                            
##  [3] "Eric Idle/Joe Cocker/Stuff"                                        
##  [4] "Karen Black/John Prine"                                            
##  [5] "Steve Martin/Kinky Friedman"                                       
##  [6] "Buck Henry/The Band"                                               
##  [7] "Dick Cavett/Ry Cooder"                                             
##  [8] "Paul Simon/George Harrison"                                        
##  [9] "Jodie Foster/Brian Wilson"                                         
## [10] "Candice Bergen/Frank Zappa"                                        
## [11] "Ralph Nader/George Benson"                                         
## [12] "Ruth Gordon/Chuck Berry"                                           
## [13] "Fran Tarkenton/Leo Sayer, Donnie Harper and the Voices of Tomorrow"
## [14] "Live from Mardi Gras"                                              
## [15] "Steve Martin/The Kinks"                                            
## [16] "Sissy Spacek/Richard Baskin"                                       
## [17] "Broderick Crawford/Levon Helm/Dr. John/The Meters"                 
## [18] "Jack Burns/Santana"                                                
## [19] "Julian Bond/Tom Waits, Brick"                                      
## [20] "Elliott Gould/Kate & Anna McGarrigle/Roslyn Kind"                  
## [21] "Eric Idle/Neil Innes"                                              
## [22] "Shelley Duvall/Joan Armatrading"                                   
## [23] "Buck Henry/Jennifer Warnes/Kenny Vance"

The episode ‘Live from Mardi Gras’ appears in the ratings data but not the airdate data.

snl.airdates %>%
  filter(Season == 10) %>%
  .$Host
##  [1] "(none)"            "Bob Uecker"        "Jesse Jackson"    
##  [4] "Michael McKean"    "George Carlin"     "Ed Asner"         
##  [7] "Ed Begley, Jr."    "Ringo Starr"       "Eddie Murphy"     
## [10] "Kathleen Turner"   "Roy Scheider"      "Alex Karras"      
## [13] "Harry Anderson"    "Pamela Sue Martin" "Mr. THulk Hogan"  
## [16] "Christopher Reeve" "Howard Cosell"
snl.airdates %>%
  filter(Season == 10, SeasonEpNumber == 1) %>%
  .$MusicalGuest
## [1] "Thompson Twins"
snl.ratings %>%
  filter(Season == 10) %>%
  .$Guests 
##  [1] "Thompson Twins"                               
##  [2] "Bob Uecker/Peter Wolf"                        
##  [3] "Jesse Jackson/Andrae Crouch, Wintley Phipps"  
##  [4] "Michael McKean/Chaka Khan/The Folksmen"       
##  [5] "George Carlin/Frankie Goes to Hollywood"      
##  [6] "Ed Asner/The Kinks"                           
##  [7] "Ed Begley Jr/Billy Squier"                    
##  [8] "Ringo Starr/Herbie Hancock"                   
##  [9] "Eddie Murphy/Robert Plant & The Honeydrippers"
## [10] "Kathleen Turner/John Waite"                   
## [11] "Roy Scheider/Billy Ocean"                     
## [12] "Alex Karras/Tina Turner"                      
## [13] "Harry Anderson/Bryan Adams"                   
## [14] "Pamela Sue Martin/Power Station"              
## [15] "SNL Film Festival"                            
## [16] "Mr. T and Hulk Hogan/The Commodores"          
## [17] "Christopher Reeve/Santana"                    
## [18] "Howard Cosell/Greg Kihn"

The extra season 10 episode in the ratings dataset is ‘SNL Film Festival’. Note that while the host of the season premiere is listed as ‘(none)’ in the airdate dataset, the musical guest (‘Thompson Twins’) matches the Guest field from the ratings dataset (in some SNL episodes, the host doubles as the musical guest).

We filter out these two extra episodes, and then bind the two datasets together.

snl.ratings <- snl.ratings %>%
  filter(!(Guests %in% c("Live from Mardi Gras","SNL Film Festival")))
# Combine data frames
snl.data <- cbind(snl.airdates, snl.ratings %>% select(-Season)) %>%
  select(Season,SeasonEpNumber,Guests,Host,MusicalGuest,AirDate,Rating,Votes)

Below are the episodes were the Host field string (originally from the airdate dataset) is not detected within the Guest field string (originally from the ratings dataset. I leave it to you to check that there are no issues other than typos (e.g. ‘Donald Pleasance’ vs. ‘Donald Pleasence’), improper capitalizations (e.g. ‘Elle MacPherson’ vs. ‘Elle Macpherson’), non-matching abbreviations (e.g. ‘Louis Gossett Jr’ vs. ‘Louis Gossett Jr.’), alternate names (e.g. ‘Tom & Dick Smothers’ vs. ‘The Smothers Brothers’), spacing issues (e.g. O.J. Simpson vs. O. J. Simpson) or combined hosts/musical guests that only appear in one column and not the other. Note that I’ve truncated the strings to make them easier to display.

# Quick sanity check, which episodes don't match hosts between data sets?
# Does the 'Host' from the Airdate data appear in the 'Guest' from the Ratings data?
snl.data %>%
  mutate(MatchDetected = str_detect(Guests,fixed(Host))) %>% 
  filter(!MatchDetected) %>%
  mutate(Guests = str_sub(Guests,0,20),
         Host = str_sub(Host,0,20),
         MusicalGuest = str_sub(MusicalGuest,0,20)) %>%
  select(Season,Guests,Host,MusicalGuest)
##    Season               Guests                 Host         MusicalGuest
## 1       3 O.J. Simpson/Ashford        O. J. Simpson  Ashford and Simpson
## 2       5 Paul Simon, James Ta                 none Paul SimonJames Tayl
## 3       5 Richard Benjamin, Pa Richard BenjaminPaul    The Grateful Dead
## 4       6 Jr. Walker & the All                 none Jr. Walker & the All
## 5       7          Rod Stewart               (none)          Rod Stewart
## 6       7 Donald Pleasance/Fea     Donald Pleasence                 Fear
## 7       8 Louis Gossett Jr/Geo   Louis Gossett, Jr. George Thorogood & t
## 8       8 Tom & Dick Smothers/ The Smothers Brother       Laura Branigan
## 9       8 Rick Moranis and Dav Rick MoranisDave Tho         The Bus Boys
## 10      8 Beau and Jeff Bridge Beau BridgesJeff Bri         Randy Newman
## 11      8 Robert Guilaume/Dura     Robert Guillaume          Duran Duran
## 12      9 Danny DeVito & Rhea  Danny DeVito & Rhea            Eddy Grant
## 13      9 Father Guido Sarducc Father Guido Sarducc Huey Lewis and the N
## 14      9 Billy Crystal/Ed Koc Billy Crystal, Ed Ko             The Cars
## 15     10       Thompson Twins               (none)       Thompson Twins
## 16     10 Ed Begley Jr/Billy S       Ed Begley, Jr.         Billy Squier
## 17     10 Mr. T and Hulk Hogan      Mr. THulk Hogan       The Commodores
## 18     11 Pee Wee Herman/Queen Paul Reubens as Pee- Queen Ida & the Bon 
## 19     11 George Wendt and Fra George WendtFrancis          Philip Glass
## 20     11 Catherine Oxenberg a Catherine OxenbergPa Paul SimonLadysmith 
## 21     11 Anjelica Huston and  Anjelica HustonBilly George ClintonParlia
## 22     12 Chevy Chase/Steve Ma Chevy Chase, Steve M         Randy Newman
## 23     12 Walter Payton & Joe  Joe MontanaWalter Pa        Deborah Harry
## 24     17               Hammer            MC Hammer            MC Hammer
## 25     17 Jason Priestly/Teena      Jason Priestley      Teenage Fanclub
## 26     17 Roseanne & Tom Arnol Roseanne ArnoldTom A Red Hot Chili Pepper
## 27     19 Shannon Doherty/Cypr      Shannen Doherty         Cypress Hill
## 28     19 Alec Baldwin & Kim B Alec Baldwin and Kim                 UB40
## 29     21 Elle MacPherson/Stin      Elle Macpherson                Sting
## 30     22 Robert Downey Jr/Fio   Robert Downey, Jr.          Fiona Apple
## 31     22 Pamela Lee/Rollins B      Pamela Anderson         Rollins Band
## 32     23 Rudy Giuliani/Sarah      Rudolph Giuliani      Sarah McLachlan
## 33     24 Cuba Gooding, Jr./Ri     Cuba Gooding Jr.         Ricky Martin
## 34     25       The Rock/AC/DC The Rock (Dwayne Joh                AC/DC
## 35     27 Ellen Degeneres/No D      Ellen DeGeneres             No Doubt
## 36     27 The Rock/Andrew W.K.       Dwayne Johnson          Andrew W.K.
## 37     29 Jessica Simpson and  Jessica Simpson & Ni               G-Unit
## 38     29 Mary-Kate & Ashley O Mary-Kate and Ashley               J-Kwon
## 39     40 J.K. Simmons/D'Angel        J. K. Simmons             D'Angelo

SNL Ratings Distributions

Alright, we have our data, now let’s make some charts! First let’s take a quick look at ratings curves by season. We use the excellent ggridges package, by Claus O. Wilke (available from CRAN), to display the large number of density curves (note that you may have heard of this package by another name). I also use the viridis package to display the median episode rating for each season by a user-friendly color.

# Extract median episode rating for each season
median.ratings <- snl.data %>% 
  group_by(Season) %>% 
  summarize(MedianRating = median(Rating))

# Merge medians to SNL data
snl.data <- snl.data %>%
  left_join(median.ratings, by="Season")

# Plot rating densities
ggplot(data=snl.data, aes(x=Rating, y=as.factor(Season), fill=MedianRating)) +
  geom_density_ridges(alpha=0.7, rel_min_height=0.01) +
  scale_fill_viridis(name = "Median Rating") +
  scale_y_discrete(position="right") +
  theme_bw(16) +
  xlim(0,10) +
  xlab("Rating") +
  ylab("Season") +
  annotate("text",x=5,y=20,col="red",label=paste0("@mathieubray ",lubridate::year(lubridate::today())),
           alpha=0.15,cex=15,fontface="bold",angle=30) +
  ggtitle("SNL Ratings Density by Season", subtitle = "Color represents median episode rating from the given season. Data from imdb.com")

plot of chunk snl_1

This chart gives a nice snapshot of viewer’s impressions of SNL over the years, with strong early seasons followed by a comparatively mediocre stretch from Seasons 9-16 (corresponding to 1983-1990), and another slightly worse stretch around Season 22 to Season 30 (1996-2004). While recent seasons appear to be more favorably rated in general, there does not seem to be much evidence to suggest any major SNL renaissance (which one would assume if we look at the number of Emmys earned last season).

Check out that rough patch in Season 41 where the density perks up far to the left of the main ratings curve. Tough to tell what may have caused that…

snl.data %>% 
  filter(Season==41, Rating < 4.0) %>%
  select(Season, SeasonEpNumber, Host, AirDate, Rating, Votes)# One of the least popular episodes
##   Season SeasonEpNumber         Host    AirDate Rating Votes
## 1     41              4 Donald Trump 2015-11-07    3.7   492

I’ll reiterate here that the IMDb ratings are reported as the average of ratings on a 0-10 scale. I should note that the number of votes tend to be high for recent episodes as well as early episodes, while episodes from most of the other seasons have comparatively much fewer votes.

mean.votes <- snl.data %>% 
  group_by(Season) %>% 
  summarize(MeanVotes = mean(Votes))

snl.data <- snl.data %>%
  left_join(mean.votes, by="Season")

ggplot(data=snl.data, aes(x=as.factor(Season), y=Votes, color = MeanVotes)) +
  scale_color_viridis(name = "Mean Number of Votes", direction=-1) +
  geom_boxplot(size=0.75) +
  xlab("Season") +
  theme_bw(16) +
  coord_flip() +
  annotate("text",x=21,y=250,col="red",label=paste0("@mathieubray ",lubridate::year(lubridate::today())),
           alpha=0.15,cex=12,fontface="bold",angle=30) +
  ggtitle("Number of Votes for SNL Rating is Not Consistent Season to Season",
          subtitle="Color represents mean number of votes by season. Data from imdb.com" )

plot of chunk snl_2

The overall distribution of ratings is shown below, as both a histogram (recall that ratings are collected rounded to 1 decimal place, effectively discretizing the data), as well as a density curve.

ggplot(data=snl.data, aes(Rating)) +
  geom_histogram(stat="count",alpha=0.8,fill="blue") +
  theme_bw(16) +
  xlim(0,10) +
  ylim(0,50) +
  xlab("Rating") +
  ylab("Number of Episodes") +
  annotate("text",x=5,y=25,col="red",label=paste0("@mathieubray ",lubridate::year(lubridate::today())),
           alpha=0.15,cex=15,fontface="bold",angle=30) +
  ggtitle("SNL Ratings Histogram",
          subtitle="Data from imdb.com" )

plot of chunk snl_3

ggplot(data=snl.data, aes(x=Rating)) +
  geom_density(alpha=0.5, fill="blue") +
  theme_bw(16) +
  xlim(0,10) +
  xlab("Rating") +
  ylab("Density") +
  annotate("text",x=5,y=0.25,col="red",label=paste0("@mathieubray ",lubridate::year(lubridate::today())),
           alpha=0.15,cex=15,fontface="bold",angle=30) +
  ggtitle("SNL Ratings Density",
          subtitle="Data from imdb.com" )

plot of chunk snl_4

We observe a bimodal density curve with most episodes earning ratings around 6.0 (‘mediocre’ episodes), or around 7.5 (‘good-great’ episodes). We continue to use densities going forward, mostly based on their elegance and ease of qualitative interpretation.

SNL Ratings Based on Number of Weeks Between Episodes

We are interested in observing whether some kind of ‘writer fatigue’ may affect the quality of SNL episodes. First, we will calculate for each episode the number of weeks since the previous episode, a.k.a. the break betwen each episode. This can be achieved using the lag function (to collect in a separate variable the airdate of the episode previous to each episode), and the as.period function from the lubridate package, to extract the difference in days between two airdates, which we then convert to a number of weeks. A snippet of the data with the new column is shown below.

### Ratings based on number of weeks since last episode

# Add in the number of weeks since the last episode
snl.data.enhanced <- snl.data %>%
  mutate(PreviousAirDate = lag(AirDate), # Get airdate of previous episode
         WeeksSinceLastEpisode = as.period(AirDate - PreviousAirDate)@day / 7) %>% # Calculate number of weeks since last episode
  select(-PreviousAirDate)
snl.data.enhanced %>% 
  filter(Season==42) %>%
  select(SeasonEpNumber,Host,AirDate,WeeksSinceLastEpisode) %>%
  head(10)
##    SeasonEpNumber                 Host    AirDate WeeksSinceLastEpisode
## 1               1        Margot Robbie 2016-10-01                    19
## 2               2   Lin-Manuel Miranda 2016-10-08                     1
## 3               3          Emily Blunt 2016-10-15                     1
## 4               4            Tom Hanks 2016-10-22                     1
## 5               5 Benedict Cumberbatch 2016-11-05                     2
## 6               6       Dave Chappelle 2016-11-12                     1
## 7               7         Kristen Wiig 2016-11-19                     1
## 8               8           Emma Stone 2016-12-03                     2
## 9               9            John Cena 2016-12-10                     1
## 10             10        Casey Affleck 2016-12-17                     1

Note that the season premiere will always have a large break (SNL does not air over the summer). Let’s look at the distribution of break lengths.

# Standardize these weeks into categories
snl.data.enhanced %>%
  group_by(WeeksSinceLastEpisode) %>%
  count
## # A tibble: 18 x 2
## # Groups:   WeeksSinceLastEpisode [18]
##    WeeksSinceLastEpisode     n
##                    <dbl> <int>
##  1                     1   481
##  2                     2   132
##  3                     3   117
##  4                     4    41
##  5                     5    10
##  6                     6     4
##  7                     7     1
##  8                     8     1
##  9                    16     1
## 10                    17     2
## 11                    18     3
## 12                    19    17
## 13                    20    11
## 14                    21     3
## 15                    25     2
## 16                    30     1
## 17                    32     1
## 18                    NA     1

(The NA above is due to the first episode not having a previous episode from which to calculate the number of weeks since the last episode). We discretize our variable, using the case_when function from the dplyr package (which I first learned from here, along with some other neat dplyr tricks), designating any break longer than 4 weeks as 4+ to balance the categories.

snl.data.enhanced <- snl.data.enhanced %>%
  mutate(WeeksCat = case_when(.$WeeksSinceLastEpisode >= 4 ~ "4+",
                            is.na(WeeksSinceLastEpisode) ~ "4+",
                            TRUE ~ as.character(WeeksSinceLastEpisode))) # Truncate at 4+ weeks break
  
snl.data.enhanced %>%
  group_by(WeeksCat) %>%
  count
## # A tibble: 4 x 2
## # Groups:   WeeksCat [4]
##   WeeksCat     n
##      <chr> <int>
## 1        1   481
## 2        2   132
## 3        3   117
## 4       4+    99

The ratings densities for each group of break lengths are shown below, with the overall rating density in the background in grey.

# Plot distributions based on number of weeks between episodes
ggplot(data=snl.data.enhanced, aes(x=Rating, fill=WeeksCat)) +
  geom_density(data = snl.data.enhanced %>% select(-WeeksCat), fill = "darkgrey") +
  geom_density(alpha = 0.5) +
  facet_wrap(~WeeksCat) +
  scale_fill_viridis(name="Weeks Since Last Episode",discrete=T) +
  theme_bw(16) +
  xlim(0,10) +
  ylim(0,0.62) +
  xlab("Rating") +
  ylab("Density") +
  annotate("text",x=5,y=0.31,col="red",label=paste0("@mathieubray ",lubridate::year(lubridate::today())),
           alpha=0.15,cex=6,fontface="bold",angle=30) +
  ggtitle("Does the Break Before an SNL Episode Affect its Rating?",
          subtitle="Grey curves represent overall ratings density. Data from imdb.com" )

plot of chunk snl_5

Some takeaways from this set of charts:

  • The ‘good-great’ mode is lower for episodes with no break (equivalently, 1 week since last episode)
  • Episodes with a week break prior (equivalently, 2 weeks since last episode) see a more even distribution of episode ratings
  • For episodes with long breaks prior to air, the distribution skews much higher toward ‘good-great’

This seems to lend evidence to support the fatigue hypothesis, though there are a few other factors we should probably keep in mind. Let’s look at episodes written with at least a week break versus those without.

snl.data.enhanced <- snl.data.enhanced %>%
  mutate(FreshEpisode = WeeksCat != "1") # Mark whether the episode is 'fresh' (at least 1 week break between episodes)

snl.data.enhanced %>%
  group_by(FreshEpisode) %>%
  count
## # A tibble: 2 x 2
## # Groups:   FreshEpisode [2]
##   FreshEpisode     n
##          <lgl> <int>
## 1        FALSE   481
## 2         TRUE   348
ggplot(data=snl.data.enhanced, aes(x=Rating, fill=FreshEpisode)) +
  geom_density(alpha=0.5) +
  scale_fill_viridis(discrete=T, name = "", labels=c("No Break","Break")) +
  theme_bw(16) +
  xlim(0,10) +
  ylim(0,0.5) +
  xlab("Rating") +
  ylab("Density") +
  annotate("text",x=5,y=0.25,col="red",label=paste0("@mathieubray ",lubridate::year(lubridate::today())),
           alpha=0.15,cex=15,fontface="bold",angle=30) +
  ggtitle("Are Ratings Affected by Whether an Episode Airs After a Break?",
          subtitle="Color represents whether episodes are the first of their seasons ('Premiere') or not ('Not Premiere').\nData from imdb.com" )

plot of chunk snl_6

The ‘No Break’ curve is equivalent to the top-left graph from the previous chart. We once again observe a bimodal distribution for episodes that aired with a break, though with fewer ‘medicore’ episodes and more ‘good-great’ episodes. It is interesting that we also observe a small third cluster of very poorly rated episodes among episodes that aired with a break.

Most of the episodes with at least 4 weeks of break are season premieres, ostensibly with a full summer of ideas fresh in the writers’ minds. Let’s look at the distribution of ratings for season premieres versus non-season premieres.

snl.data.enhanced <- snl.data.enhanced %>%
  mutate(Premiere = SeasonEpNumber == 1)

snl.data.enhanced %>%
  group_by(Premiere) %>%
  count
## # A tibble: 2 x 2
## # Groups:   Premiere [2]
##   Premiere     n
##      <lgl> <int>
## 1    FALSE   787
## 2     TRUE    42

With the obvious caveat that there are only 42 season premieres with which to draw the density, the two curves are shown below.

ggplot(data=snl.data.enhanced, aes(x=Rating, fill=Premiere)) +
  geom_density(position="identity",alpha=0.5) +
  scale_fill_viridis(discrete=T, name = "", labels=c("Not Premiere","Premiere")) +
  theme_bw(16) +
  xlim(0,10) +
  ylim(0,0.5) +
  xlab("Rating") +
  ylab("Density") +
  annotate("text",x=5,y=0.25,col="red",label=paste0("@mathieubray ",lubridate::year(lubridate::today())),
           alpha=0.15,cex=15,fontface="bold",angle=30) +
  ggtitle("Are Ratings Affected by Whether an Episode is a Season Premiere?",
          subtitle="Color represents whether episodes are the first of their seasons ('Premiere') or not ('Not Premiere').\nData from imdb.com" )

plot of chunk snl_7

Season premieres appear in general to be very highly rated compared to other episodes.

SNL Ratings Based on Consecutive Episode Sequences

Similar the length of the break prior to each episode, we can also look at the sequence of consecutive episodes and how that affects episode ratings. Episodes that air at the end of a sequence of consecutive episodes may well suffer in quality compared to those at the beginning of the sequence (i.e. those that aired after at least a week break).

The code to determine the placement of each episode within a sequence of consecutive episodes is included below. I used a ‘for’ loop here (gasp!), since I couldn’t quite figure out how to do this effectively in a more R-friendly way. If anyone has any suggestions, shoot me a line in the comments or on the Twitters.

num.episodes <- length(snl.data.enhanced$FreshEpisode)

consecutive.weeks <- numeric(num.episodes)
counter <- 1

for (i in 1:num.episodes){
  
  if (snl.data.enhanced$FreshEpisode[i]){ # If the episode is fresh
    counter <- 1 # Reset counter
  } else {
    consecutive.weeks[i] <- counter # For episode i, counter = number of episodes in consecutive weeks
    counter <- counter + 1 # Augment counter
  }
}

snl.data.enhanced$ConsecutiveWeeks <- consecutive.weeks

snl.data.enhanced %>% 
  filter(Season==42) %>%
  select(SeasonEpNumber,Host,AirDate,ConsecutiveWeeks) %>%
  head(10)
##    SeasonEpNumber                 Host    AirDate ConsecutiveWeeks
## 1               1        Margot Robbie 2016-10-01                0
## 2               2   Lin-Manuel Miranda 2016-10-08                1
## 3               3          Emily Blunt 2016-10-15                2
## 4               4            Tom Hanks 2016-10-22                3
## 5               5 Benedict Cumberbatch 2016-11-05                0
## 6               6       Dave Chappelle 2016-11-12                1
## 7               7         Kristen Wiig 2016-11-19                2
## 8               8           Emma Stone 2016-12-03                0
## 9               9            John Cena 2016-12-10                1
## 10             10        Casey Affleck 2016-12-17                2

We do a quick check of the counts of the number of episodes aired consecutively prior to each episode.

snl.data.enhanced %>%
  group_by(ConsecutiveWeeks) %>%
  count
## # A tibble: 4 x 2
## # Groups:   ConsecutiveWeeks [4]
##   ConsecutiveWeeks     n
##              <dbl> <int>
## 1                0   348
## 2                1   332
## 3                2   142
## 4                3     7

The longest string of episodes airing on consecutive weeks is 4 (i.e. episodes with 3 consecutive episodes prior), and there are only 7 such instances. We thus categorize the variable and include any episode with 3 consecutive episodes prior into the ‘2-3’ category. Densities are plotted below.

snl.data.enhanced <- snl.data.enhanced %>%
  mutate(ConsecutiveCat = case_when(.$ConsecutiveWeeks >= 2 ~ "2-3",
                          is.na(ConsecutiveWeeks) ~ "2-3",
                          TRUE ~ as.character(ConsecutiveWeeks)))
ggplot(data=snl.data.enhanced, aes(x=Rating, fill=ConsecutiveCat)) +
  geom_density(data = snl.data.enhanced %>% select(-ConsecutiveCat), fill = "darkgrey") +
  geom_density(alpha=0.7) +
  facet_wrap(~ConsecutiveCat) +
  scale_fill_viridis(name = "Number of Consecutive Episodes Prior", discrete=T) +
  theme_bw(16) +
  xlim(0,10) +
  xlab("Rating") +
  ylab("Density") +
  annotate("text",x=5,y=0.31,col="red",label=paste0("@mathieubray ",lubridate::year(lubridate::today())),
           alpha=0.15,cex=6,fontface="bold",angle=30) +
  theme(legend.position = "bottom") +
  ggtitle("Do Long Sequences of Episodes Affect Ratings?",
          subtitle="Grey curves represent overall ratings density. Data from imdb.com" )

plot of chunk snl_8

The leftmost panel is equivalent to the ‘Break’ curve shown previously. Episodes that aired as the second in a sequence of episodes (middle panel) see slightly more ‘mediocre’ episodes and slightly fewer ‘good-great’ episodes. The rightmost panel, which include episodes airing at the end of a long sequence of episodes, appears to be more spread out, with dips in both of the main modes. While interesting, there may be more at play here. For example, it may make a difference if, for example, the episode is the second in a long sequence of consecutive episodes versus the second episode in a sequence of two consecutive episodes. Perhaps a question for another day…

That’s about as much effort as I’m willing to put into this right now. As I said, not super deep, though I have a few ideas on how one can augment this data for some possible additional insights. Anyway, that’s it for now. Tune in for more fun stuff from me, and Live… from Ann Arbor… it’s Saturday Night! (spent staying in and procrastinating on actual work by writing this instead!)



blog comments powered by Disqus