I’m a big fan of Doctor Who, so when the new series came out earlier this year I was inspired to do some exploratory analysis of scripts from the show. Doctor Who is about a time-travelling alien (the Doctor) who explores the universe with a series of companions, generally getting himself into lots of trouble and saving the world a few times along the way. The Doctor has the ability to regenerate instead of dying, which means that the show can fairly gracefully replace the actor playing him. The current Doctor is the thirteenth overall and the fifth since the show relaunched in 2005. The characters the Doctor travels with also change every so often. This means the show has an interesting mix of recurring and non-recurring characters, with the Doctor as the one fixed character.
Scripts of the show are available on this site, so this project gave me the chance to practice scraping and cleaning data. While that process is not going to be the focus of this post, you can see the script I used for this on my GitLab. For this analysis, I’ll be focusing on episodes from 2005 onwards. This gives me eleven seasons of scripts to work with (actually, eleven and a bit, since the current season is still running).
Initial exploration of the script data
I’ve processed the scripts into a ‘tidy’ form, which makes them easier to work with, as all the information I need is included in each row. There is a row for each line of dialogue, recording the name of the episode, the scene number, the character speaking, and which Doctor is in the episode. Adding the doctor column is important because the Doctor almost always appears in the scripts as just ‘DOCTOR’, and at some point I may want to look at differences between Doctors. The resulting data.table has over 75,000 lines and includes over 1,000 different characters.
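To make the structure concrete, here is a small hypothetical illustration of what the table looks like (the column names match those used in the code below; the values are invented):

library(data.table)

# Hypothetical illustration of the tidy structure; values are made up,
# but the columns match those used throughout this post
dt_example <- data.table(
  episode_num = c(1L, 1L),
  episode     = c("Rose", "Rose"),
  scene       = c(3L, 3L),
  doctor      = c(9L, 9L),
  character   = c("DOCTOR", "ROSE"),
  dialogue    = c("Run!", "Who are you?")
)
dt_example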
We can now start to explore the data and form some questions.
library(magrittr)
library(treemap)
dt_scripts[, .N, by = "character"][!(character %like% "\\+") & N > 200] %>%
  treemap(index = "character",
          vSize = "N",
          type = "index",
          palette = "Spectral",
          title = "Which characters have the most lines?",
          fontface.labels = 2)
Here I’ve used a treemap to get a sense of the number of lines spoken by each character across all episodes (only characters with more than 200 lines are shown). Unsurprisingly, the Doctor has by far the most lines. Companions also have quite a lot of lines, and we see other recurring characters who are friends (like Vastra and Jenny) or enemies (Dalek, the Master, and Missy). These 27 characters account for over 60% of all the lines spoken.
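That last figure can be checked directly from the line counts; a quick sketch using the same filter as the treemap:

# Share of all lines spoken by the 27 characters shown in the treemap
top_chars <- dt_scripts[, .N, by = character][!(character %like% "\\+")][N > 200]
top_chars[, sum(N)] / nrow(dt_scripts)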
library(ggplot2)
dt_scripts[, .N, by = episode] %>%
  ggplot() +
  geom_histogram(aes(N), binwidth = 20, fill = "midnightblue") +
  theme_classic() +
  scale_y_continuous(expand = c(0, 0)) +
  xlab("Number of lines") +
  ylab("Number of episodes") +
  ggtitle("How many lines are there in each episode?") +
  theme(plot.title = element_text(hjust = 0.5))
I’ve also made a histogram showing the number of lines per episode, in part to get a sense of whether my scraping and cleaning of the scripts might have failed. This could manifest in some oddly long or oddly short episodes. There are a small number of episodes with < 150 lines that look a bit suspicious to me, but on inspection these all turn out to be special episodes that are very short.
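Those suspicious episodes are easy to pull out for manual inspection; a quick sketch (the 150-line cut-off is just the eyeball threshold mentioned above):

# Episodes with unusually few lines, worth checking by hand
dt_scripts[, .N, by = episode][N < 150][order(N)]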
Visualising interactions between characters
I would like to find a way to visualise the relationships between characters. One proxy for interactions is the number of scenes two characters both appear in, which I can get from my dataset. I’m going to use the igraph package to display these relationships, with vertices representing characters and edges representing the scenes they both appear in.
This isn’t a perfect approach, as two characters appearing in a scene doesn’t necessarily mean they interact (for example, one could leave before the other enters or one could be unaware the other is there). That said, this is a fairly straightforward way to get a sense of character interactions and I can use my domain knowledge of Doctor Who to figure out if the results make sense or not. Another potential problem with this approach is that sometimes different characters have the same name. For example, daleks often appear in the script simply as ‘DALEK’, so all the lines spoken by daleks end up being attributed to one character. This is also a problem for Clara’s boyfriend Danny, as there is a different character called Danny in an episode in an earlier season.
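One crude workaround for such name collisions, sketched below on a copy of the table, would be to qualify ambiguous names with the episode number. I haven’t applied it here, since it would also split genuinely recurring characters (like the Daleks) across episodes.

# A possible (crude) fix, applied to a copy so the main table is untouched:
# append the episode number to names known to be ambiguous
ambiguous_names <- c("DANNY", "DALEK")   # hypothetical list of clashing names
dt_disambiguated <- copy(dt_scripts)
dt_disambiguated[character %in% ambiguous_names,
                 character := paste(character, episode_num)]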
Note that here I’m using igraph in a very simple way to visualise the relationships on the show, but you can use graphs to tackle much more complex questions. For example, graphs can be used to model social networks, where you can use them to answer questions about which people are most central to the network and the different ways people can be connected.
I’ll begin by creating a data.table showing the number of interactions between different pairs of characters that will be the edges of my graph. I’m only going to visualise the top 27 characters (by total number of lines), as with many more this will get very messy to plot.
top_27 <- dt_scripts[, .N, by = character][order(-N)][1:27, character]
dt_scripts_graph <- dt_scripts[character %in% top_27]
dt_scripts_graph[character %like% "DOCTOR", character := "DOCTOR"]
dt_scripts_graph[character %like% "YASMIN", character := "YAZ"]
dt_scripts_graph <- unique(dt_scripts_graph,
                           by = c("character", "episode", "scene"))
dt_scripts_graph[, `:=`(doctor = NULL, dialogue = NULL)]
dt_scripts_graph[, character := stringr::str_to_title(character)]
dt_linecount <- dt_scripts_graph[, .N , by = character]
character_linecounts <- dt_linecount$N
names(character_linecounts) <- dt_linecount$character
dt_edges <- dt_scripts_graph[dt_scripts_graph,
                             on = .(scene = scene, episode = episode),
                             allow.cartesian = TRUE
                             ][character != i.character
                             ][character < i.character
                             ][, .(count = .N), by = .(character, i.character)]
dt_edges[order(-count)]
## character i.character count
## 1: Clara Doctor 522
## 2: Amy Doctor 409
## 3: Doctor Rose 322
## 4: Amy Rory 263
## 5: Doctor Rory 247
## ---
## 164: Donna Missy 1
## 165: Jack Missy 1
## 166: Jenny Missy 1
## 167: Amy Missy 1
## 168: Dalek Graham 1
Even without plotting the graph, we already get some sense of what the most important interactions are going to be. Almost all of the most frequent interactions are the Doctor interacting with companions.
Now we can visualise these interactions. I’m also incorporating some extra information (such as the total number of lines spoken by each character and their role) that I can add to the final visualisation. igraph allows you to make graphs in a number of different ways, but here I’ll use my data.table of edges.
library(igraph)
character_graph <- graph_from_edgelist(as.matrix(dt_edges[, 1:2]),
                                       directed = FALSE)
E(character_graph)$weight <- dt_edges$count
E(character_graph)$width <- log2(E(character_graph)$weight) / 2
V(character_graph)[name == "Doctor"]$role <- "doctor"
companions <- c("Clara", "Amy", "Rose", "Martha", "Bill",
                "Donna", "Yaz", "Ryan", "Graham")
V(character_graph)[name %in% companions]$role <- "companion"
V(character_graph)[is.na(role)]$role <- "other"
role_colours <- c(doctor = "#6c6ca4", companion = "#73aa73", other = "#bf7373")
V(character_graph)$color <- role_colours[V(character_graph)$role]
V(character_graph)$line_count <- character_linecounts[V(character_graph)$name]
V(character_graph)$names <- stringr::str_to_title(names(V(character_graph)))
plot(character_graph,
     vertex.color = adjustcolor(V(character_graph)$color),
     vertex.label.color = "black",
     vertex.label.family = "sans",
     vertex.size = log2(V(character_graph)$line_count) * 1.2,
     vertex.frame.color = NA,
     layout = layout_with_fr(character_graph),
     main = "Interactions between characters on Doctor Who")
The resulting graph conveys a lot of information rather succinctly. A line between two characters indicates that those characters are in at least one scene together. The thickness of the lines between characters is proportional to the number of scenes both occur in, the size of the vertex represents the total number of lines for that character, and the colours indicate the character’s role in the show.
We can easily see the centrality of the Doctor to the show, as he has interactions with all other characters and also has the most lines. The Doctor interacts most frequently with companions, whereas other characters are more peripheral. This makes sense given that the show focuses on the Doctor’s travels with various companions. We can also see connections between characters such as Rose, Mickey and Jackie that reflect relationships independent of the Doctor.
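We could also put a number on that centrality rather than just eyeballing the plot; a minimal sketch using two standard igraph measures on the same graph:

# Weighted degree ("strength"): total number of shared scenes per character
sort(strength(character_graph), decreasing = TRUE)[1:5]
# Betweenness centrality, ignoring weights (igraph treats edge weights as
# distances, which isn't what we want for scene counts)
sort(betweenness(character_graph, weights = NA), decreasing = TRUE)[1:5]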
Preparing the script dataset for textual analysis with quanteda
As this is a text-heavy dataset, one of the things I’m most interested in is performing a textual analysis. R has a lot of packages that can be used for this purpose, but here I’ll be using quanteda.
The first step in using quanteda is to create a corpus. A corpus is simply a collection of texts. What you define as a text depends on your dataset and the kind of analysis you want to do, though it’s worth noting that once you have a corpus, you can then reshape it to break it apart into paragraphs or sentences. A corpus can also have associated document-level variables (docvars), which may be necessary for your analysis. In this case I’ll initially be treating each line as a text, which will result in an enormous corpus. Each line will have document-level variables such as the character who speaks the line and the episode the line is from. It could also make sense to aggregate by episode, depending on what you want to do. I’m initially interested in the differences between characters, so combining dialogue from different characters by episode doesn’t really make sense here.
library(quanteda)
dt_scripts[character == "DOCTOR", character := paste("DOCTOR", doctor)]
corpus_scripts <- corpus(dt_scripts, text_field = "dialogue")
Here I’ve also changed the DOCTOR to DOCTOR {number} in the character column so I can distinguish dialogue from different Doctors.
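You can check the line-level documents and their docvars directly on the corpus, and the reshaping mentioned above is a one-liner; a quick sketch:

# Document-level variables carried by the corpus (character, episode, etc.)
head(docvars(corpus_scripts))
# If needed, the line-level corpus can be reshaped into sentence-level texts
corpus_sentences <- corpus_reshape(corpus_scripts, to = "sentences")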
Aside from the corpus, the two other main types of objects you’ll work with in quanteda are tokens and document-feature matrices (DFMs). Tokens are essentially just words (or groups of words), though there’s often a lot of filtering and manipulation involved in going from your text to tokens. Word order is preserved in a tokens object, but it is lost once you move to a DFM.
tokens_scripts <- corpus_scripts %>%
  tokens(
    remove_numbers = TRUE,
    remove_punct = TRUE,
    remove_hyphens = TRUE,
    include_docvars = TRUE,
    remove_symbols = TRUE
  ) %>%
  tokens_wordstem()
Here I have created tokens from my corpus. Most of the options I’ve chosen speak for themselves, such as removing numbers, punctuation, hyphens, and symbols. I’ve also opted to keep the docvars associated with each text/line.
The use of tokens_wordstem() requires some explanation. Stemming shortens words to their stem; for example, the word ‘running’ becomes ‘run’ and ‘dogs’ becomes ‘dog’. This can be a useful step, since these pairs of words involve the same basic concept and so it probably doesn’t make sense to look at them separately. An alternative to stemming is lemmatization. Lemmatization goes a little further than stemming and can change a word beyond just shortening it. For example, the word ‘went’ would become ‘go’. Both approaches will reduce the total number of tokens, but lemmatization will generally reduce it more.
In this analysis I’ve opted to use stemming instead of lemmatization, in part because it’s a more conservative approach. For example, Rose is the name of one of the Doctor’s companions and so her name appears frequently in the dialogue. However, ‘rose’ is also the past tense of ‘rise’, so if I lemmatize, the name ‘Rose’ will be changed to ‘rise’. In this context that would be very confusing and would mess up my analysis. While lemmatization can be very powerful, it’s important to bear issues like this in mind and sanity check your findings for anything that might distort your results.
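A quick illustration of what the stemmer does to a toy sentence (note that ‘Rose’ stays intact, which is exactly why I prefer stemming here):

# Stemming: "dogs" -> "dog", "running" -> "run", while "Rose" stays "Rose"
tokens("The dogs were running when Rose arrived") %>% tokens_wordstem()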
Once you have tokens, the next step is to make a document-feature matrix (DFM). You can actually do this directly from a corpus object, but this gives you less control over the tokenisation.
dfm_all <- dfm(tokens_scripts,
               tolower = TRUE,
               remove = stopwords(source = "smart"))
The document-feature matrix (DFM) is a huge matrix in which each column is a different token and each row is a text, with each cell holding the count of that token in that text. The docvars from the corpus and tokens objects carry through, so the character and episode information is still there. This means that if I decide I want to filter out dialogue from some characters or aggregate by episode, I can do this directly from my existing DFM rather than having to go back to the corpus object. The DFM will be the object I work with most often in the subsequent analysis, and it will often require some filtering or aggregation to answer specific questions.
Here I’ve chosen to remove stop words, which are commonly occurring words like ‘am’ and ‘there’ that are so frequent that including them doesn’t add anything to the analysis.
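As an example of working directly from the DFM, the line-level documents can be rolled up into one document per episode without going back to the corpus; a small sketch:

# Aggregate the line-level DFM into one document per episode
dfm_by_episode <- dfm_group(dfm_all, groups = "episode")
ndoc(dfm_by_episode)  # one document per episode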
Identifying differences between characters based on their dialogue
I’m interested in understanding how different the dialogue of various characters is. I’ll restrict this analysis to the top 15 characters in the show. This means subsetting my existing DFM and grouping by character. The quanteda library has a function topfeatures() that shows you the tokens with the highest counts in the DFM for each row (in this case, for each character). I can then reformat the output to view it more easily.
top_15 <- dt_scripts[ , .N , by = character][order(-N)][1:15, character]
dfm_top <- dfm_all %>%
  dfm_subset(character %in% top_15) %>%
  dfm_group(groups = "character")
features_top15 <- topfeatures(dfm_top, 10, group = "character")
dt_features <- as.data.table(as.data.frame(sapply(features_top15,
                                                  function(x) names(x))))
AMY | BILL | CLARA | DOCTOR 10 | DOCTOR 11 | DOCTOR 12 | DOCTOR 13 | DOCTOR 9 | DONNA | GRAHAM | MARTHA | RIVER | RORY | ROSE | RYAN |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
doctor | yeah | doctor | time | time | you’r | you’r | you’r | doctor | yeah | doctor | doctor | doctor | doctor | yeah |
rori | doctor | whi | you’r | you’r | time | whi | rose | you’r | doc | you’r | time | ami | yeah | hey |
yeah | whi | you’r | yeah | ami | clara | time | time | yeah | you’r | yeah | you’r | yeah | you’r | we’r |
whi | you’r | yeah | back | whi | whi | back | thing | whi | ryan | time | love | er | mum | doctor |
you’r | er | sorri | thing | back | veri | we’r | back | back | we’r | thing | back | you’r | time | thing |
time | time | time | good | yeah | becaus | veri | you’v | thing | thing | sorri | whi | sorri | back | you’r |
happen | thing | er | sorri | good | thing | thing | yeah | becaus | grace | back | man | time | thing | yaz |
becaus | someth | realli | whi | thing | they’r | work | good | time | er | someth | they’r | whi | happen | happen |
pleas | they’r | thing | stop | becaus | good | they’r | world | someth | hey | we’r | kill | back | whi | back |
someth | peopl | happen | you’v | sorri | back | yaz | human | we’r | mate | you’v | die | happen | someth | er |
While this is a good start, we can start to see some problems with this approach. For a start, we can see that the words ‘doctor’ and ‘time’ are very highly ranked for many of the characters. This tells us that these are commonly occurring words regardless of character, which is good to know but not very interesting. Secondly, there are a lot of character names. While this can be interesting to see, including character names in this kind of comparison can be misleading. This is because the names a character uses in speech are not really related to how they talk, but rather who they’re with. Incidentally, you can also see the effects of the stemming in these results, which leaves some words looking a bit odd (such as ‘sorri’). Some character names have also been mangled a bit.
To try to identify words that are more specific to each character, we’ll use a statistic called TF-IDF (term frequency - inverse document frequency). This is a measure of how frequently a word is used in one document (in this case, a character) relative to how frequently it’s used generally (by all characters). A term with a very high TF-IDF will be used very frequently by one character and almost never by others. Ranking by TF-IDF will give us a better idea of what is distinctive about each character’s dialogue. I will also filter out character names (and stemmed variants of character names, where possible), since the presence of those tells us more about who a character spends time with than about their language choices.
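For reference, quanteda’s dfm_tfidf() defaults multiply a term’s within-document count by log10(number of documents / number of documents containing the term), so a term used by every character scores zero. A toy illustration:

# Toy example: "doctor" appears in all three documents, so its tf-idf is zero,
# while "sweetie" and "spoilers" are boosted because only one document uses them
dfm_toy <- dfm(tokens(c("sweetie spoilers doctor",
                        "doctor tardis run",
                        "doctor run run")))
dfm_tfidf(dfm_toy)

With that in mind, I can strip out the character names and apply the weighting to the character-level DFM: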
names <- unique(dt_scripts$character)
names_stemmed <- tokens(names) %>% tokens_wordstem()
dfm_top_filtered <- dfm_top %>%
  dfm_remove(names) %>%
  dfm_remove(names_stemmed) %>%
  dfm_tfidf()
features_top15 <- topfeatures(dfm_top_filtered, 10, group = "character")
dt_features <- as.data.table(as.data.frame(sapply(features_top15,
                                                  function(x) names(x))))
AMY | BILL | CLARA | DOCTOR 10 | DOCTOR 11 | DOCTOR 12 | DOCTOR 13 | DOCTOR 9 | DONNA | GRAHAM | MARTHA | RIVER | RORY | ROSE | RYAN |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
angel | nardol | pink | viperox | pond | nardol | yaz | satellit | temp | doc | indigo | sweeti | jennif | what’r | yaz |
pond | puddl | oswald | reinett | angel | vardi | pting | narrow | binari | yaz | shakespear | ramon | nurs | wilson | nan |
cube | ars | souffl | allon | georg | oswald | kerblam | rift | gramp | cockl | annalis | spoiler | leadworth | tyler | logan |
raggedi | sutcliff | gallifrey | wormhol | madg | pott | praxeus | nanogen | lanc | poli | jone | angel | nephew | shareen | sire |
william | lectur | porridg | racnoss | cube | hybrid | resus | charl | clement | cos | osterhagen | aplan | doorknob | isolus | reload |
melodi | um | maisi | infostamp | sophi | bank | ux | tyler | chiswick | nan | dalekanium | archaeologist | beard | moonlight | hann |
pregnant | portion | franni | jone | dear | karabraxo | shaw | chula | dumbo | steve | arr | pandorica | drove | union | sinclair |
mel | defect | blackpool | adelaid | gillyflow | gemston | orb | fantast | chaplin | grandson | mast | diamond | ma’am | jacki | luther |
paisley | pott | muh | void | wifi | pe | dreg | what’r | hotter | sire | infinit | dear | astronaut | danc | cos |
petrichor | verita | moat | torchwood | snow | rob | cos | calcium | neri | son | sir | darillium | attention | anymor | bike |
This is a much more interesting list, and it tells us a lot more about differences between the characters. For example, looking at River’s top-ranked terms we see ‘sweetie’, ‘archaeologist’ and ‘spoiler’, which are terms that I certainly associate with her. There are still some names appearing such as “williams” (Rory’s last name), but this is an improvement over the initial list.
Examining the relationship between the Doctor’s sentiment and overall episode sentiment
Sentiment analysis is often incorporated into text mining and analysis. This involves attempting to identify the emotions associated with a particular text. Here I’m using the sentimentr library, since unlike many other libraries that can be used for sentiment analysis it takes valence shifters into account. This means that instead of detecting the word ‘good’ in a sentence and concluding that the sentiment is positive, it can recognise when that word is part of the phrase ‘not good’ and therefore carries a negative sentiment. sentimentr can also identify emotions and profanity, which can also be interesting to look at, though I don’t do so here.
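A minimal illustration of the valence-shifter handling:

library(sentimentr)
# "not good" is scored as negative rather than positive
sentiment(get_sentences(c("This is good.", "This is not good.")))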
I’ll first get the average sentiment score for every line of dialogue and look at the highest and lowest ranked to get some sense of how this works.
library(sentimentr)
dt_sentiment_short <- with(dt_scripts,
                           sentiment_by(
                             get_sentences(dialogue),
                             list(character, episode_num,
                                  episode, scene, dialogue)
                           ))
dt_sentiment_short[order(ave_sentiment)] %>% head()
## character episode_num episode scene
## 1: VASTRA 114 Deep Breath 11
## 2: DOCTOR 10 55 Midnight 19
## 3: DOCTOR 9 10 The Doctor Dances 3
## 4: DOCTOR 12 139 The Husbands of River Song 4
## 5: DANNY 123 In the Forest of the Night 36
## 6: DONNA 54 Forest of the Dead 18
## dialogue word_count sd ave_sentiment
## 1: I don't know, but I fear devilment. 7 NA -1.488235
## 2: Knock, knock. 2 NA -1.414214
## 3: I wish. 2 NA -1.414214
## 4: Slash murderer slash thief. 4 NA -1.400000
## 5: You're worrying too much. 4 NA -1.375000
## 6: Sorry, but you're dead. 4 NA -1.375000
dt_sentiment_short[order(ave_sentiment)] %>% tail()
## character episode_num episode scene
## 1: DOCTOR 11 73 Cold Blood 21
## 2: RIVER 92 The Wedding of River Song 23
## 3: DOCTOR 10 112 The Day of the Doctor 14
## 4: PHARMACIST 1 33 Gridlock 5
## 5: RORY 76 The Pandorica Opens 16
## 6: DOCTOR 10 50 The Poison Sky 105
## dialogue word_count
## 1: Not to interrupt, but just a quick reminder to stay calm. 11
## 2: Please, my love, please, please just run! 7
## 3: Whoa, whoa, whoa, whoa, whoa. Oh, very clever. 8
## 4: Happy, Happy, lovely happy Happy! 5
## 5: Whoa, whoa, whoa. 3
## 6: Please, please, please, please, please, please, please. 7
## sd ave_sentiment
## 1: NA 1.413334
## 2: NA 1.417367
## 3: 1.030004 1.507745
## 4: NA 1.677051
## 5: NA 1.732051
## 6: NA 2.645751
Amongst the examples of dialogue with the lowest sentiment scores are “I don’t know, but I fear devilment”, “Sorry, but you’re dead”, and “Slash murderer slash thief”, which do seem solidly negative. Surprisingly, we also have “knock, knock” and “I wish”.
Similarly, the dialogue with the highest sentiment scores includes “Whoa, whoa, whoa, whoa, whoa. Oh, very clever” and “Happy, Happy, lovely happy Happy!”, but also “Please, my love, please, please just run!” and “Not to interrupt, but just a quick reminder to stay calm”.
Now I’m going to look at sentiment scores between Doctors to identify possible differences.
library(ggbeeswarm)
dt_sentiment <- with(dt_scripts,
                     sentiment_by(
                       get_sentences(dialogue),
                       list(character, episode_num, episode)))
dt_sentiment <- dt_sentiment[word_count > 50]
dt_doctors <- dt_sentiment[character %in% paste("DOCTOR", 9:13)]
dt_doctors[, character := factor(character, levels = paste("DOCTOR", 9:13))]
ggplot(dt_doctors, aes(character, ave_sentiment)) +
  geom_beeswarm(aes(colour = character)) +
  stat_summary(fun.y = "median", fun.ymax = "median", fun.ymin = "median",
               geom = "crossbar",
               width = 0.3) +
  scale_color_viridis_d(end = 0.9) +
  theme_classic() +
  ggtitle("How does average sentiment vary by Doctor?") +
  ylab("Average sentiment") +
  theme(plot.title = element_text(hjust = 0.5),
        legend.position = "none",
        axis.title.x = element_blank())
This beeswarm plot shows the average sentiment for each episode for each Doctor. I’ve excluded instances where the Doctor has fewer than 50 words in an episode, since in those cases the averages tend to be very high or very low.
There don’t seem to be large overall differences between Doctors, though Doctors 10 and 11 are a little cheerier than the others.
We can identify the episodes where the average sentiment for the Doctor is highest and lowest.
dt_doctors[which.min(ave_sentiment)]
## character episode_num episode word_count sd ave_sentiment
## 1: DOCTOR 12 137 Heaven Sent 2587 0.2358146 -0.06130917
dt_doctors[which.max(ave_sentiment)]
## character episode_num episode word_count sd ave_sentiment
## 1: DOCTOR 10 56 Turn Left 171 0.2484174 0.2429945
The episode with the lowest average sentiment is Heaven Sent, which is the episode after Clara dies. This makes sense, as it is a fairly dark episode. Surprisingly though, Turn Left is the episode with the highest average sentiment. Turn Left is about a dystopian alternative reality, so it’s not a particularly happy episode. Its appearance here could be because this analysis focuses only on the Doctor’s dialogue; since the episode actually centres on Donna, we don’t capture its overall tone. Indeed, the Doctor has only 171 words in the episode.
How does the average sentiment per episode compare to the sentiment of the Doctor’s dialogue only? Are Doctors 10 and 11 simply more chipper than the others because the episodes they appear in are cheerier overall? One way to approach this is by examining the relationship between the average sentiment of the Doctor’s dialogue and that of all other dialogue.
There are several possibilities for how the Doctor’s mood or sentiment in an episode could relate to the overall mood of the episode (as reflected in dialogue):
- The sentiment of the Doctor’s dialogue is not strongly related to that of others. This would suggest that the differences we see in the sentiment scores of different Doctors are due mostly to differences in personality
- The Doctor’s mood drives the overall mood of the episode. That is, a happy Doctor leads to other characters’ dialogue also being more cheerful
- The Doctor’s mood is mainly influenced by the circumstances and tone of the episode
In the first case, we should see only a weak correlation between the sentiment of the Doctor’s dialogue and that of other characters. However, the final two cases will be difficult to distinguish. In both cases, we would see a strong correlation but we would not necessarily be able to say why this occurs.
library(ggpubr)
dt_sentiment_nodoc <- with(dt_scripts[!(character %like% "DOCTOR")],
                           sentiment_by(
                             get_sentences(dialogue),
                             list(episode_num, episode, doctor)))
dt_sentiment_episode_doctor <- dt_sentiment_nodoc[
  dt_doctors, on = "episode_num"
][
  , doctor := factor(paste("DOCTOR", doctor), levels = paste("DOCTOR", 9:13))
]
ggplot(dt_sentiment_episode_doctor, aes(ave_sentiment, i.ave_sentiment)) +
  geom_point(aes(color = doctor)) +
  geom_smooth(method = "lm", se = FALSE, colour = "black") +
  stat_regline_equation(label.y = -0.042) +
  stat_cor(label.y = -0.055) +
  scale_colour_viridis_d() +
  ggtitle("How does the Doctor's sentiment inform the episode's overall sentiment?") +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5),
        legend.title = element_blank()) +
  xlab("Average sentiment (Doctor)") +
  ylab("Average sentiment (all others)")
The relationship between the average sentiment of the Doctor and that of the other characters across episodes is quite weak; that is, the Doctor’s sentiment doesn’t track very well with the overall sentiment of the episode. This suggests that the differences in sentiment between Doctors that we saw above reflect real differences in characterisation, not just differences in the tone of their episodes. In general, we see that the Doctor’s overall sentiment is lower than that of the other characters in a given episode.
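For completeness, the strength of that relationship can also be read straight off the joined table:

# Correlation between the Doctor's per-episode sentiment and everyone else's
# (cor() is symmetric, so the column order doesn't matter)
dt_sentiment_episode_doctor[, cor(ave_sentiment, i.ave_sentiment)]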
Summary
This Doctor Who script dataset was a lot of fun to play around with and to practice my scraping skills. My analysis here has been very exploratory, but this dataset could also be used to answer specific questions about the show.
Resources
- You can find the website I scraped the scripts from here.
- I discovered another analysis of Doctor Who scripts while I was in the late stages of writing up this post. This analysis by Jean-Michel D tackles different questions than I do here, so the two posts complement each other quite well.
- Debbie Liske’s three-part tutorial on using NLP to analyse the lyrics of Prince’s music is a great read and gave me a lot of ideas on the kinds of questions you can use NLP to answer.
- Although I didn’t use the tidytext package, I did consult Text Mining with R by Julia Silge and David Robinson a lot for background on text mining and NLP. This is a fantastic resource.
- The Quanteda tutorials site walks through many examples of text analysis using quanteda.
- I found this tutorial about igraph by Katya Ognyanova extremely useful as I was figuring out igraph.
- You can find the script I used to scrape and clean the Doctor Who script data on GitLab here.