I’m a big fan of Doctor Who, so when the new series came out earlier this year I was inspired to do some exploratory analysis of scripts from the show. Doctor Who is about a time-travelling alien (the Doctor) who explores the universe with a series of companions, generally getting himself into lots of trouble and saving the world a few times along the way. The Doctor has the ability to regenerate instead of dying, which means that the show can fairly gracefully replace the actor playing him. The current Doctor is the thirteenth overall and the fifth since the show relaunched in 2005. The characters the Doctor travels with also change every so often. This means the show has an interesting mix of recurring and non-recurring characters, with the Doctor as the one fixed character.
Scripts of the show are available on this site, so this project gave me the chance to practice scraping and cleaning data. While that process is not going to be the focus of this post, you can see the script I used for this on my GitLab. For this analysis, I’ll be focusing on episodes from 2005 onwards. This gives me eleven seasons of scripts to work with (actually, eleven and a bit, since the current season is still running).
Initial exploration of the script data
I’ve processed the scripts into a ‘tidy’ form, which makes them easier to work with, as all the information I need is included in each row. There is a row for each line of dialogue, recording the name of the episode, the scene number, the character speaking, and which Doctor is in the episode. Adding the doctor column is important because the Doctor almost always appears in the scripts as just ‘DOCTOR’, and at some point I may want to look at differences between Doctors. The resulting data.table has over 75,000 lines and includes over 1,000 different characters.
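To make the structure concrete, here is a small hypothetical illustration of what the table looks like (the column names match those used in the code below; the values are invented):

library(data.table)

# Hypothetical illustration of the tidy structure; values are made up,
# but the columns match those used throughout this post
dt_example <- data.table(
  episode_num = c(1L, 1L),
  episode     = c("Rose", "Rose"),
  scene       = c(3L, 3L),
  doctor      = c(9L, 9L),
  character   = c("DOCTOR", "ROSE"),
  dialogue    = c("Run!", "Who are you?")
)
dt_example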
We can now start to explore the data and form some questions.
library(magrittr)
library(treemap)
dt_scripts[, .N, by = "character"][!(character %like% "\\+") & N > 200] %>%
  treemap(index = "character",
          vSize = "N",
          type = "index",
          palette = "Spectral",
          title = "Which characters have the most lines?",
          fontface.labels = 2)
Here I’ve used a treemap to get a sense of the number of lines spoken by each character across all episodes (only characters with more than 200 lines are shown). Unsurprisingly, the Doctor has by far the most lines. Companions also have quite a lot of lines, and we see other recurring characters who are friends (like Vastra and Jenny) or enemies (Dalek, the Master, and Missy). These 27 characters account for over 60% of all the lines spoken.
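That last figure can be checked directly from the line counts; a quick sketch using the same filter as the treemap:

# Share of all lines spoken by the 27 characters shown in the treemap
top_chars <- dt_scripts[, .N, by = character][!(character %like% "\\+")][N > 200]
top_chars[, sum(N)] / nrow(dt_scripts)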
library(ggplot2)
dt_scripts[, .N, by = episode] %>%
  ggplot() +
  geom_histogram(aes(N), binwidth = 20, fill = "midnightblue") +
  theme_classic() +
  scale_y_continuous(expand = c(0, 0)) +
  xlab("Number of lines") +
  ylab("Number of episodes") +
  ggtitle("How many lines are there in each episode?") +
  theme(plot.title = element_text(hjust = 0.5))
I’ve also made a histogram showing the number of lines per episode, in part to get a sense of whether my scraping and cleaning of the scripts might have failed. This could manifest in some oddly long or oddly short episodes. There are a small number of episodes with < 150 lines that look a bit suspicious to me, but on inspection these all turn out to be special episodes that are very short.
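Those suspicious episodes are easy to pull out for manual inspection; a quick sketch (the 150-line cut-off is just the eyeball threshold mentioned above):

# Episodes with unusually few lines, worth checking by hand
dt_scripts[, .N, by = episode][N < 150][order(N)]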
Visualising interactions between characters
I would like to find a way to visualise the relationships between characters. One proxy for interactions is the number of scenes two characters both appear in, which I can get from my dataset. I’m going to use the igraph package to display these relationships, with vertices representing characters and edges representing the scenes they both appear in.
This isn’t a perfect approach, as two characters appearing in a scene doesn’t necessarily mean they interact (for example, one could leave before the other enters or one could be unaware the other is there). That said, this is a fairly straightforward way to get a sense of character interactions and I can use my domain knowledge of Doctor Who to figure out if the results make sense or not. Another potential problem with this approach is that sometimes different characters have the same name. For example, daleks often appear in the script simply as ‘DALEK’, so all the lines spoken by daleks end up being attributed to one character. This is also a problem for Clara’s boyfriend Danny, as there is a different character called Danny in an episode in an earlier season.
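One crude workaround for such name collisions, sketched below on a copy of the table, would be to qualify ambiguous names with the episode number. I haven’t applied it here, since it would also split genuinely recurring characters (like the Daleks) across episodes.

# A possible (crude) fix, applied to a copy so the main table is untouched:
# append the episode number to names known to be ambiguous
ambiguous_names <- c("DANNY", "DALEK")   # hypothetical list of clashing names
dt_disambiguated <- copy(dt_scripts)
dt_disambiguated[character %in% ambiguous_names,
                 character := paste(character, episode_num)]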
Note that here I’m using igraph in a very simple way to visualise the relationships on the show, but you can use graphs to tackle much more complex questions. For example, graphs can be used to model social networks, where you can use them to answer questions about which people are most central to the network and the different ways people can be connected.
I’ll begin by creating a data.table showing the number of interactions between different pairs of characters that will be the edges of my graph. I’m only going to visualise the top 27 characters (by total number of lines), as with many more this will get very messy to plot.
top_27 <- dt_scripts[, .N, by = character][order(-N)][1:27, character]
dt_scripts_graph <- dt_scripts[character %in% top_27]
dt_scripts_graph[character %like% "DOCTOR", character := "DOCTOR"]
dt_scripts_graph[character %like% "YASMIN", character := "YAZ"]
dt_scripts_graph <- unique(dt_scripts_graph,
                           by = c("character", "episode", "scene"))
dt_scripts_graph[, `:=`(doctor = NULL, dialogue = NULL)]
dt_scripts_graph[, character := stringr::str_to_title(character)]
dt_linecount <- dt_scripts_graph[, .N , by = character]
character_linecounts <- dt_linecount$N
names(character_linecounts) <- dt_linecount$character
dt_edges <- dt_scripts_graph[dt_scripts_graph,
                             on = .(scene = scene, episode = episode),
                             allow.cartesian = TRUE
                             ][character != i.character
                             ][character < i.character
                             ][, .(count = .N), by = .(character, i.character)]
dt_edges[order(-count)]
## character i.character count
## 1: Clara Doctor 522
## 2: Amy Doctor 409
## 3: Doctor Rose 322
## 4: Amy Rory 263
## 5: Doctor Rory 247
## ---
## 164: Donna Missy 1
## 165: Jack Missy 1
## 166: Jenny Missy 1
## 167: Amy Missy 1
## 168: Dalek Graham 1
Even without plotting the graph, we already get some sense of what the most important interactions are going to be. Almost all of the most frequent interactions are the Doctor interacting with companions.
Now we can visualise these interactions. I’m also incorporating some extra information (such as the total number of lines spoken by each character and their role) that I can add to the final visualisation. igraph allows you to make graphs in a number of different ways, but here I’ll use my data.table of edges.
library(igraph)
character_graph <- graph_from_edgelist(as.matrix(dt_edges[, 1:2]),
                                       directed = FALSE)
E(character_graph)$weight <- dt_edges$count
E(character_graph)$width <- log2(E(character_graph)$weight) / 2
V(character_graph)[name == "Doctor"]$role <- "doctor"
companions <- c("Clara", "Amy", "Rose", "Martha", "Bill",
                "Donna", "Yaz", "Ryan", "Graham")
V(character_graph)[name %in% companions]$role <- "companion"
V(character_graph)[is.na(role)]$role <- "other"
role_colours <- c(doctor = "#6c6ca4", companion = "#73aa73", other = "#bf7373")
V(character_graph)$color <- role_colours[V(character_graph)$role]
V(character_graph)$line_count <- character_linecounts[V(character_graph)$name]
V(character_graph)$names <- stringr::str_to_title(names(V(character_graph)))
plot(character_graph,
     vertex.color = adjustcolor(V(character_graph)$color),
     vertex.label.color = "black",
     vertex.label.family = "sans",
     vertex.size = log2(V(character_graph)$line_count) * 1.2,
     vertex.frame.color = NA,
     layout = layout_with_fr(character_graph),
     main = "Interactions between characters on Doctor Who")
The resulting graph conveys a lot of information rather succinctly. A line between two characters indicates that those characters are in at least one scene together. The thickness of the lines between characters is proportional to the number of scenes both occur in, the size of the vertex represents the total number of lines for that character, and the colours indicate the character’s role in the show.
We can easily see the centrality of the Doctor to the show, as he has interactions with all other characters and also has the most lines. The Doctor interacts most frequently with companions, whereas other characters are more peripheral. This makes sense given that the show focuses on the Doctor’s travels with various companions. We can also see connections between characters such as Rose, Mickey and Jackie that reflect relationships independent of the Doctor.
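We could also put a number on that centrality rather than just eyeballing the plot; a minimal sketch using two standard igraph measures on the same graph:

# Weighted degree ("strength"): total number of shared scenes per character
sort(strength(character_graph), decreasing = TRUE)[1:5]
# Betweenness centrality, ignoring weights (igraph treats edge weights as
# distances, which isn't what we want for scene counts)
sort(betweenness(character_graph, weights = NA), decreasing = TRUE)[1:5]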
Preparing the script dataset for textual analysis with quanteda
As this is a text-heavy dataset, one of the things I’m most interested in is performing a textual analysis. R has a lot of packages that can be used for this purpose, but here I’ll be using quanteda.
The first step in using quanteda is to create a corpus. A corpus is simply a collection of texts. What you define as a text depends on your dataset and the kind of analysis you want to do, though it’s worth noting that once you have a corpus, you can then reshape it to break it apart into paragraphs or sentences. A corpus can also have associated document-level variables (docvars), which may be necessary for your analysis. In this case I’ll initially be treating each line as a text, which will result in an enormous corpus. Each line will have document-level variables such as the character who speaks the line and the episode the line is from. It could also make sense to aggregate by episode, depending on what you want to do. I’m initially interested in the differences between characters, so combining dialogue from different characters by episode doesn’t really make sense here.
library(quanteda)
dt_scripts[character == "DOCTOR", character := paste("DOCTOR", doctor)]
corpus_scripts <- corpus(dt_scripts, text_field = "dialogue")
Here I’ve also changed the DOCTOR to DOCTOR {number} in the character column so I can distinguish dialogue from different Doctors.
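You can check the line-level documents and their docvars directly on the corpus, and the reshaping mentioned above is a one-liner; a quick sketch:

# Document-level variables carried by the corpus (character, episode, etc.)
head(docvars(corpus_scripts))
# If needed, the line-level corpus can be reshaped into sentence-level texts
corpus_sentences <- corpus_reshape(corpus_scripts, to = "sentences")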
Aside from the corpus, the two other main types of objects you’ll work with in quanteda are tokens and document-feature matrices (DFMs). Tokens are essentially just words (or groups of words), though there’s often a lot of filtering and manipulation involved in going from your text to tokens. Word order is preserved in a tokens object, but it is lost once you move to a DFM.
tokens_scripts <- corpus_scripts %>%
  tokens(
    remove_numbers = TRUE,
    remove_punct = TRUE,
    remove_hyphens = TRUE,
    include_docvars = TRUE,
    remove_symbols = TRUE
  ) %>%
  tokens_wordstem()
Here I have created tokens from my corpus. Most of the options I’ve chosen speak for themselves, such as removing numbers, punctuation, hyphens, and symbols. I’ve also opted to keep the docvars associated with each text/line.
The use of tokens_wordstem() requires some explanation. Stemming shortens words to their stem; for example, the word ‘running’ becomes ‘run’ and ‘dogs’ becomes ‘dog’. This can be a useful step, since these pairs of words involve the same basic concept and so it probably doesn’t make sense to look at them separately. An alternative to stemming is lemmatization. Lemmatization goes a little further than stemming and can change a word beyond just shortening it. For example, the word ‘went’ would become ‘go’. Both approaches will reduce the total number of tokens, but lemmatization will generally reduce it more.
In this analysis I’ve opted to use stemming instead of lemmatization, in part because it’s a more conservative approach. For example, Rose is the name of one of the Doctor’s companions and so her name appears frequently in the dialogue. However, ‘rose’ is also the past tense of ‘rise’, so if I lemmatize, the name ‘Rose’ will be changed to ‘rise’. In this context that would be very confusing and would mess up my analysis. While lemmatization can be very powerful, it’s important to bear issues like this in mind and sanity check your findings for anything that might distort your results.
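A quick illustration of what the stemmer does to a toy sentence (note that ‘Rose’ stays intact, which is exactly why I prefer stemming here):

# Stemming: "dogs" -> "dog", "running" -> "run", while "Rose" stays "Rose"
tokens("The dogs were running when Rose arrived") %>% tokens_wordstem()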
Once you have tokens, the next step is to make a document-feature matrix (DFM). You can actually do this directly from a corpus object, but this gives you less control over the tokenisation.
dfm_all <- dfm(tokens_scripts,
               tolower = TRUE,
               remove = stopwords(source = "smart"))
The document-feature matrix (DFM) is a huge matrix in which each column is a different token and each row is a text, with each cell holding the count of that token in that text. The docvars from the corpus and tokens objects carry through, so the character and episode information is still there. This means that if I decide I want to filter out dialogue from some characters or aggregate by episode, I can do this directly from my existing DFM rather than having to go back to the corpus object. The DFM will be the object I work with most often in the subsequent analysis, and it will often require some filtering or aggregation to answer specific questions.
Here I’ve chosen to remove stop words, which are commonly occurring words like ‘am’ and ‘there’ that are so frequent that including them doesn’t add anything to the analysis.
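As an example of working directly from the DFM, the line-level documents can be rolled up into one document per episode without going back to the corpus; a small sketch:

# Aggregate the line-level DFM into one document per episode
dfm_by_episode <- dfm_group(dfm_all, groups = "episode")
ndoc(dfm_by_episode)  # one document per episode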
Identifying differences between characters based on their dialogue
I’m interested in understanding how different the dialogue of various characters is. I’ll restrict this analysis to the top 15 characters in the show. This means subsetting my existing DFM and grouping by character. The quanteda library has a function topfeatures() that shows you the tokens with the highest counts in the DFM for each row (in this case, for each character). I can then reformat the output to view it more easily.
top_15 <- dt_scripts[ , .N , by = character][order(-N)][1:15, character]
dfm_top <- dfm_all %>%
  dfm_subset(character %in% top_15) %>%
  dfm_group(groups = "character")
features_top15 <- topfeatures(dfm_top, 10, group = "character")
dt_features <- as.data.table(as.data.frame(sapply(features_top15,
                                                  function(x) names(x))))
AMY | BILL | CLARA | DOCTOR 10 | DOCTOR 11 | DOCTOR 12 | DOCTOR 13 | DOCTOR 9 | DONNA | GRAHAM | MARTHA | RIVER | RORY | ROSE | RYAN |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
doctor | yeah | doctor | time | time | you’r | you’r | you’r | doctor | yeah | doctor | doctor | doctor | doctor | yeah |
rori | doctor | whi | you’r | you’r | time | whi | rose | you’r | doc | you’r | time | ami | yeah | hey |
yeah | whi | you’r | yeah | ami | clara | time | time | yeah | you’r | yeah | you’r | yeah | you’r | we’r |
whi | you’r | yeah | back | whi | whi | back | thing | whi | ryan | time | love | er | mum | doctor |
you’r | er | sorri | thing | back | veri | we’r | back | back | we’r | thing | back | you’r | time | thing |
time | time | time | good | yeah | becaus | veri | you’v | thing | thing | sorri | whi | sorri | back | you’r |
happen | thing | er | sorri | good | thing | thing | yeah | becaus | grace | back | man | time | thing | yaz |
becaus | someth | realli | whi | thing | they’r | work | good | time | er | someth | they’r | whi | happen | happen |
pleas | they’r | thing | stop | becaus | good | they’r | world | someth | hey | we’r | kill | back | whi | back |
someth | peopl | happen | you’v | sorri | back | yaz | human | we’r | mate | you’v | die | happen | someth | er |
While this is a good start, we can start to see some problems with this approach. For a start, we can see that the words ‘doctor’ and ‘time’ are very highly ranked for many of the characters. This tells us that these are commonly occurring words regardless of character, which is good to know but not very interesting. Secondly, there are a lot of character names. While this can be interesting to see, including character names in this kind of comparison can be misleading. This is because the names a character uses in speech are not really related to how they talk, but rather who they’re with. Incidentally, you can also see the effects of the stemming in these results, which leaves some words looking a bit odd (such as ‘sorri’). Some character names have also been mangled a bit.
To try to identify words that are more specific to each character, we’ll use a statistic called TF-IDF (term frequency - inverse document frequency). This is a measure of how frequently a word is used in one document (in this case, a character) relative to how frequently it’s used generally (by all characters). A term with a very high TF-IDF will be used very frequently by one character and almost never by others. Ranking by TF-IDF will give us a better idea of what is distinctive about each character’s dialogue. I will also filter out character names (and stemmed variants of character names, where possible), since the presence of those tells us more about who a character spends time with than about their language choices.
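For reference, quanteda’s dfm_tfidf() defaults multiply a term’s within-document count by log10(number of documents / number of documents containing the term), so a term used by every character scores zero. A toy illustration:

# Toy example: "doctor" appears in all three documents, so its tf-idf is zero,
# while "sweetie" and "spoilers" are boosted because only one document uses them
dfm_toy <- dfm(tokens(c("sweetie spoilers doctor",
                        "doctor tardis run",
                        "doctor run run")))
dfm_tfidf(dfm_toy)

With that in mind, I can strip out the character names and apply the weighting to the character-level DFM: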
names <- unique(dt_scripts$character)
names_stemmed <- tokens(names) %>% tokens_wordstem()
dfm_top_filtered <- dfm_top %>%
  dfm_remove(names) %>%
  dfm_remove(names_stemmed) %>%
  dfm_tfidf()
features_top15 <- topfeatures(dfm_top_filtered, 10, group = "character")
dt_features <- as.data.table(as.data.frame(sapply(features_top15,
                                                  function(x) names(x))))
AMY | BILL | CLARA | DOCTOR 10 | DOCTOR 11 | DOCTOR 12 | DOCTOR 13 | DOCTOR 9 | DONNA | GRAHAM | MARTHA | RIVER | RORY | ROSE | RYAN |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
angel | nardol | pink | viperox | pond | nardol | yaz | satellit | temp | doc | indigo | sweeti | jennif | what’r | yaz |
pond | puddl | oswald | reinett | angel | vardi | pting | narrow | binari | yaz | shakespear | ramon | nurs | wilson | nan |
cube | ars | souffl | allon | georg | oswald | kerblam | rift | gramp | cockl | annalis | spoiler | leadworth | tyler | logan |
raggedi | sutcliff | gallifrey | wormhol | madg | pott | praxeus | nanogen | lanc | poli | jone | angel | nephew | shareen | sire |
william | lectur | porridg | racnoss | cube | hybrid | resus | charl | clement | cos | osterhagen | aplan | doorknob | isolus | reload |
melodi | um | maisi | infostamp | sophi | bank | ux | tyler | chiswick | nan | dalekanium | archaeologist | beard | moonlight | hann |
pregnant | portion | franni | jone | dear | karabraxo | shaw | chula | dumbo | steve | arr | pandorica | drove | union | sinclair |
mel | defect | blackpool | adelaid | gillyflow | gemston | orb | fantast | chaplin | grandson | mast | diamond | ma’am | jacki | luther |
paisley | pott | muh | void | wifi | pe | dreg | what’r | hotter | sire | infinit | dear | astronaut | danc | cos |
petrichor | verita | moat | torchwood | snow | rob | cos | calcium | neri | son | sir | darillium | attention | anymor | bike |
This is a much more interesting list, and it tells us a lot more about differences between the characters. For example, looking at River’s top-ranked terms we see ‘sweetie’, ‘archaeologist’ and ‘spoiler’, which are terms that I certainly associate with her. There are still some names appearing such as “williams” (Rory’s last name), but this is an improvement over the initial list.
Examining the relationship between the Doctor’s sentiment and overall episode sentiment
Sentiment analysis is often incorporated into text mining and analysis. This involves attempting to identify the emotions associated with a particular text. Here I’m using the sentimentr library, since unlike many other libraries that can be used for sentiment analysis it takes valence shifters into account. This means that instead of detecting the word ‘good’ in a sentence and concluding that the sentiment is positive, it can recognise when that word is part of the phrase ‘not good’ and therefore carries a negative sentiment. sentimentr can also identify emotions and profanity, which can also be interesting to look at, though I don’t do so here.
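A minimal illustration of the valence-shifter handling:

library(sentimentr)
# "not good" is scored as negative rather than positive
sentiment(get_sentences(c("This is good.", "This is not good.")))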
I’ll first get the average sentiment score for every line of dialogue and look at the highest and lowest ranked to get some sense of how this works.
library(sentimentr)
dt_sentiment_short <- with(dt_scripts,
                           sentiment_by(
                             get_sentences(dialogue),
                             list(character, episode_num,
                                  episode, scene, dialogue)
                           ))
dt_sentiment_short[order(ave_sentiment)] %>% head()
## character episode_num episode scene
## 1: VASTRA 114 Deep Breath 11
## 2: DOCTOR 10 55 Midnight 19
## 3: DOCTOR 9 10 The Doctor Dances 3
## 4: DOCTOR 12 139 The Husbands of River Song 4
## 5: DANNY 123 In the Forest of the Night 36
## 6: DONNA 54 Forest of the Dead 18
## dialogue word_count sd ave_sentiment
## 1: I don't know, but I fear devilment. 7 NA -1.488235
## 2: Knock, knock. 2 NA -1.414214
## 3: I wish. 2 NA -1.414214
## 4: Slash murderer slash thief. 4 NA -1.400000
## 5: You're worrying too much. 4 NA -1.375000
## 6: Sorry, but you're dead. 4 NA -1.375000
dt_sentiment_short[order(ave_sentiment)] %>% tail()
## character episode_num episode scene
## 1: DOCTOR 11 73 Cold Blood 21
## 2: RIVER 92 The Wedding of River Song 23
## 3: DOCTOR 10 112 The Day of the Doctor 14
## 4: PHARMACIST 1 33 Gridlock 5
## 5: RORY 76 The Pandorica Opens 16
## 6: DOCTOR 10 50 The Poison Sky 105
## dialogue word_count
## 1: Not to interrupt, but just a quick reminder to stay calm. 11
## 2: Please, my love, please, please just run! 7
## 3: Whoa, whoa, whoa, whoa, whoa. Oh, very clever. 8
## 4: Happy, Happy, lovely happy Happy! 5
## 5: Whoa, whoa, whoa. 3
## 6: Please, please, please, please, please, please, please. 7
## sd ave_sentiment
## 1: NA 1.413334
## 2: NA 1.417367
## 3: 1.030004 1.507745
## 4: NA 1.677051
## 5: NA 1.732051
## 6: NA 2.645751
Amongst the examples of dialogue with the lowest sentiment scores are “I don’t know, but I fear devilment”, “Sorry, but you’re dead”, and “Slash murderer slash thief”, which do seem solidly negative. Surprisingly, we also have “knock, knock” and “I wish”.
Similarly, the dialogue with the highest sentiment scores includes “Whoa, whoa, whoa, whoa, whoa. Oh, very clever” and “Happy, Happy, lovely happy Happy!”, but also “Please, my love, please, please just run!” and “Not to interrupt, but just a quick reminder to stay calm”.
Now I’m going to look at sentiment scores between Doctors to identify possible differences.
library(ggbeeswarm)
dt_sentiment <- with(dt_scripts,
                     sentiment_by(
                       get_sentences(dialogue),
                       list(character, episode_num, episode)))
dt_sentiment <- dt_sentiment[word_count > 50]
dt_doctors <- dt_sentiment[character %in% paste("DOCTOR", 9:13)]
dt_doctors[, character := factor(character, levels = paste("DOCTOR", 9:13))]
ggplot(dt_doctors, aes(character, ave_sentiment)) +
  geom_beeswarm(aes(colour = character)) +
  stat_summary(fun.y = "median", fun.ymax = "median", fun.ymin = "median",
               geom = "crossbar",
               width = 0.3) +
  scale_color_viridis_d(end = 0.9) +
  theme_classic() +
  ggtitle("How does average sentiment vary by Doctor?") +
  ylab("Average sentiment") +
  theme(plot.title = element_text(hjust = 0.5),
        legend.position = "none",
        axis.title.x = element_blank())
This beeswarm plot shows the average sentiment for each episode for each Doctor. I’ve excluded instances where the Doctor has fewer than 50 words in an episode, since in those cases the averages tend to be very high or very low.
There don’t seem to be large overall differences between Doctors, though Doctors 10 and 11 are a little cheerier than the others.
We can identify the episodes where the average sentiment for the Doctor is highest and lowest.
dt_doctors[which.min(ave_sentiment)]
## character episode_num episode word_count sd ave_sentiment
## 1: DOCTOR 12 137 Heaven Sent 2587 0.2358146 -0.06130917
dt_doctors[which.max(ave_sentiment)]
## character episode_num episode word_count sd ave_sentiment
## 1: DOCTOR 10 56 Turn Left 171 0.2484174 0.2429945
The episode with the lowest average sentiment is Heaven Sent, which is the episode after Clara dies. This makes sense, as it is a fairly dark episode. Surprisingly though, Turn Left is the episode with the highest average sentiment. Turn Left is about a dystopian alternative reality, so it’s not a particularly happy episode. Its appearance here could be because this analysis focuses only on the Doctor’s dialogue; since the episode actually centres on Donna, we don’t capture its overall tone. Indeed, the Doctor has only 171 words in the episode.
How does the average sentiment per episode compare to the sentiment of the Doctor’s dialogue only? Are Doctors 10 and 11 simply more chipper than the others because the episodes they appear in are cheerier overall? One way to approach this is by examining the relationship between the average sentiment of the Doctor’s dialogue and that of all other dialogue.
There are several possibilities for how the Doctor’s mood or sentiment in an episode could relate to the overall mood of the episode (as reflected in dialogue):
- The sentiment of the Doctor’s dialogue is not strongly related to that of others. This would suggest that the differences we see in the sentiment scores of different Doctors are due mostly to differences in personality
- The Doctor’s mood drives the overall mood of the episode. That is, a happy Doctor leads to other characters’ dialogue also being more cheerful
- The Doctor’s mood is mainly influenced by the circumstances and tone of the episode
In the first case, we should see only a weak correlation between the sentiment of the Doctor’s dialogue and that of other characters. However, the final two cases will be difficult to distinguish. In both cases, we would see a strong correlation but we would not necessarily be able to say why this occurs.
library(ggpubr)
dt_sentiment_nodoc <- with(dt_scripts[!(character %like% "DOCTOR")],
                           sentiment_by(
                             get_sentences(dialogue),
                             list(episode_num, episode, doctor)))
dt_sentiment_episode_doctor <- dt_sentiment_nodoc[
  dt_doctors, on = "episode_num"
][
  , doctor := factor(paste("DOCTOR", doctor), levels = paste("DOCTOR", 9:13))
]
ggplot(dt_sentiment_episode_doctor, aes(ave_sentiment, i.ave_sentiment)) +
  geom_point(aes(color = doctor)) +
  geom_smooth(method = "lm", se = FALSE, colour = "black") +
  stat_regline_equation(label.y = -0.042) +
  stat_cor(label.y = -0.055) +
  scale_colour_viridis_d() +
  ggtitle("How does the Doctor's sentiment inform the episode's overall sentiment?") +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5),
        legend.title = element_blank()) +
  xlab("Average sentiment (Doctor)") +
  ylab("Average sentiment (all others)")
The relationship between the average sentiment of the Doctor and that of the other characters across episodes is quite weak; that is, the Doctor’s sentiment doesn’t track very well with the overall sentiment of the episode. This suggests that the differences in sentiment between Doctors that we saw above reflect real differences in characterisation, not just differences in the tone of their episodes. In general, we see that the Doctor’s overall sentiment is lower than that of the other characters in a given episode.
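For completeness, the strength of that relationship can also be read straight off the joined table:

# Correlation between the Doctor's per-episode sentiment and everyone else's
# (cor() is symmetric, so the column order doesn't matter)
dt_sentiment_episode_doctor[, cor(ave_sentiment, i.ave_sentiment)]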
Summary
This Doctor Who script dataset was a lot of fun to play around with and to practice my scraping skills. My analysis here has been very exploratory, but this dataset could also be used to answer specific questions about the show.
Resources
- You can find the website I scraped the scripts from here.
- I discovered another analysis of Doctor Who scripts while I was in the late stages of writing up this post. This analysis by Jean-Michel D tackles different questions than I do here, so the two posts complement each other quite well.
- Debbie Liske’s three-part tutorial on using NLP to analyse the lyrics of Prince’s music is a great read and gave me a lot of ideas on the kinds of questions you can use NLP to answer.
- Although I didn’t use the tidytext package, I did consult Text Mining with R by Julia Silge and David Robinson a lot for background on text mining and NLP. This is a fantastic resource.
- The Quanteda tutorials site walks through many examples of text analysis using quanteda.
- I found this tutorial about igraph by Katya Ognyanova extremely useful as I was figuring out igraph.
- You can find the script I used to scrape and clean the Doctor Who script data on GitLab here.