# Welcome to Night Vale proverbs

Following a recommendation from a friend, I recently got really into the wonderfully weird and often outright bizarre world of the Welcome to Night Vale podcast. From Wikipedia:See here. “Welcome to Night Vale is a podcast presented as a radio show for the fictional town of Night Vale, reporting on the strange events that occur within it.”

One of my favourite parts of the show comes right at the very end, when the credit announcer reads out “today’s proverb”. These are without fail very strange, but always give me laugh. For example:

It’s not the size of the dog in the fight, it’s the size of the other dog in the fight.

Inspired by other fun R packages such as fortunesSee here., I wanted to create a function that would print a random, nicely formatted WTNV proverb. Additionally, I was keen to practice my webscraping and data-cleaning, and this was a nice little problem to play around with.

## Getting the data

Fortunately, some more diligent fans than I have already spent some time extracting the proverbs into a table, available from the WTNV Fandom site.See here.

My first step was to scrap the data from this table into R, by passing the tr class (used to define a row in a HTML table) to the extract_nodes() function of rvest, and then extracting the text contained in these elements.

I used the annotater package to automagically create the descriptions of each package.

# Load the relevant packages
library(rvest) # Easily Harvest (Scrape) Web Pages
library(dplyr) # A Grammar of Data Manipulation
library(tidyr) # Tidy Messy Data
library(stringr) # Simple, Consistent Wrappers for Common String Operations

#Extract nodes
html_nodes("tr") %>%
# Extract
html_text()

head(tab)
## [1] "Episode No.\n\nTitle\n\nRelease Date\n"
## [2] "1\n\n\"Pilot\"\n\n15 June 2012\n"
## [3] "Look to the north. Keep looking. There's nothing coming from the south.\n"
## [4] "2\n\n\"Glow Cloud\"\n\n1 July 2012\n"
## [5] "\nMen are from Mars; women are from Venus; Earth is a hallucination; podcasts are dreams.\n\n"
## [6] "3\n\n\"Station Management\"\n\n15 July 2012\n"

This gives us a character string of 398 elements, where each element is a unique row in the table. This presented two immediate problems:

• The way the table is laid out, the information for each episode is split across two rows, and so I need to extract every odd and even element seperately and then rejoin them to create the proper dataset.
• Multiple variables (the episode number, title and airdate) are contained within a single element of the vector (e.g. ‘1\n\n“Pilot”\n\n15 June 2012\n’).

## Cleaning the data

### Meta-data

I decided to start with the slightly harder task of cleaning the episode meta-data (number, title and airdate) first.

Having removed the header row (tab[-1]), I converted the vector to a dataframe and extracted every odd row (once I removed the header, the meta-data went from being on every even row to every odd row), making use of the slice() and row_number() functions from dplyr.

After this, I removed information on the proverbs from the live shows, which are presented in a second table at the end of the scraped webpage, by removing all rows after (and including) the row containing the header for that table.

I then needed to deal with the fact that three variables were contained in a single column in the dataframe. Fortunately, the variables were consistently seperated by a common delimiter (“\n\n”) which allowed me to use separate() from tidyr to split the single column into three new variables.

Finally, I removed all quotation marks from the three new variables,I appreciate this could be done in one step using dplyr::across(), but R kept hanging when I tried it and it wasn’t the focus of this project, so I moved on! and removed the trailing newline charater from airdate.

meta <- tab[-1] %>%
# Convert vector to dataframe
data.frame(number = .) %>%
# Extract every odd row
slice(which(row_number() %% 2 == 1)) %>%
# Remove live shows
slice(1:which(.$number == "Show No.\n\nTitle\n\nRelease Date\n") - 1) %>% # Split into component parts separate(number, c("number", "title", "airdate"), sep = '\n\n') %>% # Remove quotes from all variables mutate( title = stringr::str_replace_all(title, '\"', ""), number = stringr::str_replace_all(number, '\"', ""), airdate = stringr::str_replace_all(airdate, '\"', "") ) %>% # Remove trailing newline character from airdate/ mutate(airdate = stringr::str_replace(airdate, "\n", "")) head(meta) ## number title airdate ## 1 1 Pilot 15 June 2012 ## 2 2 Glow Cloud 1 July 2012 ## 3 3 Station Management 15 July 2012 ## 4 4 PTA Meeting 1 August 2012 ## 5 5 The Shape in Grove Park 15 August 2012 ## 6 6 The Drawbridge 1 September 2012 ### Proverb text I used a similar approach to clean the proverb data, taking every even row from the tab vector and again removing those proverbs from the live shows. Out of interest, I also created two additional variables, which count the number of characters and words in each proverb. Once I had this done, I removed any leading and trailing newline characters (I needed to retain internal newline characters as some proverbs are list-like) and create a unique numeric ID for each episode, as the number variable often had part indicators (19A and 19B). Finally, I merged the proverb data with the meta-data obtained above to create the final dataset. proverb_df <- tab[-1] %>% # Convert vector to dataframe data.frame(proverb = .) %>% # Extract every even row slice(which(row_number() %% 2 == 0)) %>% # Remove proverbs from live shows slice(1:which(.$proverb == "1\n\n\"Condos\"\n\n25 September 2013\n") -
1) %>%
# Generate variables with # of character/words for each proverb
mutate(
proverb_len_char = nchar(proverb),
proverb_len_word = str_count(proverb, '\\w+')
) %>%
# Clear random newlines at start and end of proverbs
mutate(
proverb = str_replace(proverb, "\n\n$", ""), proverb = str_replace(proverb, "\n$", ""),
proverb = str_replace(proverb, "^\n", "")
) %>%
mutate(id = seq_along(1:n())) %>%
# Bind to metadata and reorder
cbind(meta, .) %>%
select(id, everything()) %>%
tibble()

head(proverb_df)
## # A tibble: 6 x 7
##      id number title   airdate  proverb        proverb_len_char proverb_len_word
##   <int> <chr>  <chr>   <chr>    <chr>                     <int>            <int>
## 1     1 1      Pilot   15 June~ Look to the n~               72               13
## 2     2 2      Glow C~ 1 July ~ Men are from ~               90               15
## 3     3 3      Statio~ 15 July~ There's a spe~               66               13
## 4     4 4      PTA Me~ 1 Augus~ What has four~              141               30
## 5     5 5      The Sh~ 15 Augu~ A million dol~               64               13
## 6     6 6      The Dr~ 1 Septe~ Lost? Confuse~               72               12

## Get a random proverb

The last thing to do was to create a little function that returns a nicely formatted version of a random proverb or of a certain proverb given an ID.

nightvale_proverb <- function(id) {

# If an ID is provided, use that
# If not, get a random ID
if (!hasArg(id)) {
id <- sample(proverb_df$id,1) } # Format the resulting message message(proverb_df$proverb[id],
"\n    -- \"",
proverb_df$title[id], "\", ", proverb_df$airdate[id])

}

# Get a random proverb
nightvale_proverb()
## The word “motel” is an amalgam of the words “hotel” and “murder.”
##     -- "Civic Changes", 15 September 2015
# And a personal favourite
nightvale_proverb(22)
## Ask your doctor if right is left for you.
##     -- "A Memory of Europe", 15 April 2013

And that’s it - a fun half-hour exercise, and an excuse to do some webscraping and data-cleaning. I plan to convert this into a small little GitHub-only package in the future, but it might not be for a while. So for now…

Good night, Night Vale. Good night.

Luke A McGuinness

An NIHR Doctoral Research Fellow at the University of Bristol, Luke is passionate about dementia epidemiology, open science and R.