
6. Token distribution analysis in The Lord of the Rings


This project documents the occurrences of a range of words in The Lord of the Rings, analyzed in three distinct sections: The Fellowship of the Ring, The Two Towers, and The Return of the King. Section 4 synthesizes the findings from all three texts. The project is also designed to test the scalability of Project 5 (can we use the same code on three texts, testing for the same qualities, and compare the results in a synthesis?). The R code (via GitHub) for the entirety of Project 6 can be found here.

​

1. THE FELLOWSHIP OF THE RING ________________________________________________________

​

1.1. The text file (.txt) for The Fellowship of the Ring can be found here.

 

1.2. I cleaned the text of The Fellowship of the Ring (shortened to FOTR); a sketch of these steps in R appears after the list.

  • removing all punctuation

  • removing the spaces between words

  • converting all text to lowercase

  • putting the words into a single character vector

  • indexing the vector of words for easy searching
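Below is a minimal sketch of these cleaning steps in R. The file name (fotr.txt) and variable names (such as fotr_words) are assumptions for illustration, not the project's actual code:

  # A minimal sketch of the cleaning steps above (assumed names)
  fotr_raw   <- readLines("fotr.txt", warn = FALSE)
  fotr_text  <- tolower(paste(fotr_raw, collapse = " "))   # lowercase, one long string
  fotr_text  <- gsub("[[:punct:]]", "", fotr_text)         # remove punctuation
  fotr_words <- unlist(strsplit(fotr_text, "\\s+"))        # split on spaces into a word vector
  fotr_words <- fotr_words[fotr_words != ""]               # drop any empty strings
  fotr_words[1:10]                                         # the vector is indexed by position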

​

1.3. I performed basic statistical analyses on the text (a sketch in R follows this list).

  • identifying how many times "adventure" occurs and at which index positions

  • identifying how many times "home" occurs and at which index positions

  • identifying the number of unique word types in the text

  • plotting the frequency of words and comparing the top 10 most frequently occurring words with Zipf's law
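Here is a minimal sketch of these analyses, assuming fotr_words is the cleaned word vector from the previous sketch:

  # how many times "adventure" and "home" occur, and at which index positions
  adventure_positions <- which(fotr_words == "adventure")
  home_positions      <- which(fotr_words == "home")
  length(adventure_positions)
  length(home_positions)

  # number of unique word types
  length(unique(fotr_words))

  # absolute frequency of every word, sorted, and a plot of the top 10
  word_freqs <- sort(table(fotr_words), decreasing = TRUE)
  top10      <- head(word_freqs, 10)
  barplot(top10, las = 2, ylab = "absolute frequency")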

[Figure: frequency of the top 10 words in FOTR compared with Zipf's law]

Zipf's law says that the frequency of any word in a corpus is inversely proportional to its rank or position in the overall frequency distribution. In other words, the second most frequent word will occur about half as often as the most frequent word, the third about a third as often, and so on.

​

Unsurprisingly, "the" occurred the most and "and" occurred second most. FOTR doesn't follow Zipf's law very well: 7500 is at least roughly half of 11734, but 5085 is nowhere near half of 7563, and so on. Even setting aside the exact frequencies that R prints in the chart, a visual inspection of the graph shows that the counts are certainly not halving from one rank to the next.
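One way to make this comparison explicit is sketched below, assuming top10 is the sorted frequency table from the sketch in section 1.3:

  observed <- as.numeric(top10)
  # under Zipf's law, the word at rank r occurs about 1/r times as often
  # as the most frequent word
  expected <- observed[1] / seq_along(observed)
  data.frame(word = names(top10), rank = seq_along(observed),
             observed = observed, zipf_expected = round(expected))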

​

1.4. Accessing and comparing word frequency data

​

Below is the table that R produces for the absolute frequencies of every single word in FOTR (i.e. the number of occurrences):

[Figure: absolute frequency table for FOTR]

We can also calculate the relative frequency from this absolute frequency table.
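A minimal sketch of that calculation, assuming word_freqs (from the section 1.3 sketch) and fotr_words (from the section 1.2 sketch):

  # relative frequency = count / total number of tokens
  rel_freqs <- word_freqs / length(fotr_words)
  # rates per 1,000 words are often easier to read than raw proportions
  round(rel_freqs[c("home", "adventure")] * 1000, 2)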

 

But as our real interest lies in the terms "adventure" and "home," we see that "home" occurs 5 times more frequently than "adventure": "home" has 75 absolute occurrences while "adventure" has only 15.

​

Now we have to see where those two terms occur in order to support any close reading analysis.

​

1.5. Token distribution and regular expressions

​

Creating a dispersion plot, or 'token distribution plot,' will help us visually determine where in the story the words "adventure" and "home" tend to occur. When we say "where" in the story, we really mean "when"; that is, the further along in a text a word is, the more time has passed. Therefore, we will call the x-axis of such plots "novelistic time," and the y-axis records the YES/NO occurrences of the word. The y-axis needs no numerical values, since a black line indicates a YES and a blank/white space indicates a NO. A sketch of such a plot in R follows.
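A minimal sketch of such a plot, assuming fotr_words from the section 1.2 sketch (variable names are illustrative, not the project's actual code):

  # mark positions where "adventure" occurs; NA positions stay blank
  adventure_hits <- rep(NA, length(fotr_words))
  adventure_hits[which(fotr_words == "adventure")] <- 1
  plot(adventure_hits, type = "h",
       xlab = "novelistic time (word position)", ylab = "adventure",
       yaxt = "n",   # no numeric labels on the y-axis: a black line means YES
       main = "Token distribution of 'adventure' in FOTR")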

[Figures: token distribution plots for "adventure" and "home" in FOTR]

As we can see, "adventure" clusters mostly in the first half of The Fellowship of the Ring, while "home" clusters mostly in the second half. Overall, though, "home" occurs far more often than "adventure" (75 occurrences versus 15).

​

2. THE TWO TOWERS ____________________________________________________________________

2.1. The text file (.txt) for The Two Towers can be found here.

​

2.2. I cleaned the text of The Two Towers (shortened to TT), as in section 1.2.

  • removing all punctuation

  • removing the spaces between words

  • converting all text to lowercase

  • putting the words into a single character vector

  • indexing the vector of words for easy searching

​

2.3. I performed basic statistical analyses on the text, as in section 1.3.

  • identifying how many times "adventure" occurs and at which index positions

  • identifying how many times "home" occurs and at which index positions

  • identifying the number of unique word types in the text

  • plotting the frequency of words and comparing the top 10 most frequently occurring words with Zipf's law

[Figure: frequency of the top 10 words in TT compared with Zipf's law]

This data does not match Zipf's law.

 

2.4. Accessing and comparing word frequency data

​

Below is the table that R produces for the absolute frequencies of every single word in TT (i.e. the number of occurrences):

[Figure: absolute frequency table for TT]

We can also calculate the relative frequency from this absolute frequency table.

​

But as our real interest lies in the terms "home" and "adventure," we see that "home" occurs 29 times while "adventure" occurs 0 times. 

​

Now we have to see where those two terms occur in order to support any close reading analysis.

​

2.5. Token distribution and regular expressions

[Figures: token distribution plots for "adventure" and "home" in TT]

It is here that I noticed a source of error in the data collection method. I should have searched for "adventure," "adventures," "adventuring," and "adventured," or otherwise collected the words in a way that accounts for every ending of the "adventur-" stem. In future tests, I could use a wildcard expression such as "adventur*" or write a specific regular expression to find this. [This error is noted in Project 1 under "non-wildcard search methods."] Instead of using which() to search, I could use grep(), a function used to search a text for patterns that match a regular expression.
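For example, here is a minimal sketch of a stem-based search with grep(), assuming tt_words is the cleaned word vector for TT (an illustrative name, not the project's actual code):

  # "^adventur" matches adventure, adventures, adventuring, adventured, ...
  adventure_positions <- grep("^adventur", tt_words)
  length(adventure_positions)    # how many tokens share the stem
  tt_words[adventure_positions]  # which forms actually occur

  # for comparison, which() only finds exact matches
  which(tt_words == "adventure")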

​

3. THE RETURN OF THE KING ____________________________________________________________

 

3.1. The text file (.txt) for The Return of the King can be found here.​ 

​

3.2. I cleaned the text of The Return of the King (shortened to ROTK), as in section 1.2.

  • removing all punctuation

  • removing the spaces between words

  • converting all text to lowercase

  • putting the words into a single character vector

  • indexing the vector of words for easy searching

 

3.3. I performed basic statistical analyses on the text, as in section 1.3.

  • identifying how many times "adventure" occurs and at which index positions

  • identifying how many times "home" occurs and at which index positions

  • identifying the number of unique word types in the text

  • plotting the frequency of words and comparing the top 10 most frequently occurring words with Zipf's law

[Figure: frequency of the top 10 words in ROTK compared with Zipf's law]

This data does not match Zipf's law.

​

3.4. Accessing and comparing word frequency data

Below is the table that R produces for the absolute frequencies of every single word in ROTK (i.e. the number of occurrences):

[Figure: absolute frequency table for ROTK]

We can also calculate the relative frequency from this absolute frequency table.

​

But our real interest lies in the terms "adventure" and "home": "home" occurs 28 times while "adventure" occurs only 2 times, i.e. "home" occurs 14 times more often than "adventure."

​

3.5. Token distribution and regular expressions

[Figures: token distribution plots for "adventure" and "home" in ROTK]

As we can see, there is little to compare between the distributions of "adventure" and "home" here, because "adventure" occurs too few times (only twice) to form a clear cluster.

​

4. SYNTHESIZING THE RESULTS FROM ALL THREE TEXTS___________________________________

 

Below I have compiled all the statistics gathered from sections 1-3 above. It is important to emphasize that this project (Project 6) did not follow the "hypothesis -> test with R" method, since I am not looking for a particular statistic to support an overarching argument. Instead, it demonstrates the scalability of Project 5's code, showing that the same tests run across three different texts can provide useful literary insights after the fact. From this test we learn what questions we can ask of texts next time, before even designing and running R code; that is, we can then follow the "create a hypothesis first, then test it with R" method.

[Table: summary statistics for FOTR, TT, and ROTK]

For example, here are some immediate conclusions we can draw from this synthesis, and examples of how we could use them to support hypotheses in the future:

​

  • ROTK has many more sentences than FOTR and TT, but fewer words overall (and therefore fewer words per sentence on average). Thus, Tolkien uses shorter sentences in the last book of the trilogy; an argument could be made that this heightens the tension of the plot and helps drive the story along.

  • The top 10 words of all three sections of The Lord of the Rings are the same words, only in different orders; in fact, for the first two sections, the order is exactly the same. This data could be used to support arguments about differences in word importance between the first two texts and the last.

  • Seeing that the word "home" appears many more times than the word "adventure" in all three texts, we could support arguments about the importance of such themes in the three texts, or we might say that the lack of "adventure" actually draws more attention to it.

  • Seeing that the word "he" appears many more times than the word "she" in all three texts, we might make an argument about the roles of men and women in the texts. Of course, this is subject to close reading analysis: "she" and "he" do not always refer to people, but sometimes to inanimate objects as well.

  • I could calculate the type-token ratio of each text (the number of unique word types divided by the total number of words) as a measure of vocabulary richness and compare that value across the three texts; or, on a more detailed scale, I could do it just for Bilbo's birthday party (FOTR) and the Scouring of the Shire (ROTK). These two scenes, both set in the familiar Shire, have very different emotional backgrounds; perhaps their vocabulary richness can tell us something about that. (A sketch of this calculation appears after this list.)
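A minimal sketch of the type-token ratio calculation, assuming fotr_words, tt_words, and rotk_words are the three cleaned word vectors (illustrative names, not the project's actual code):

  type_token_ratio <- function(words) length(unique(words)) / length(words)
  ttr <- c(FOTR = type_token_ratio(fotr_words),
           TT   = type_token_ratio(tt_words),
           ROTK = type_token_ratio(rotk_words))
  round(ttr, 3)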

​

All of the hypotheses (arguments) I suggest above are only suggestions; each is subject to close reading analysis before any conclusions are drawn. But they give a good idea of the direction and scope of such possibilities.

​

​

Speaking of sources of error, here are some that I have spotted (or avoided) in this test while running the code:

​

  • I checked that R (and the text file) can handle accented letters, like "Sméagol" or "Éomer." In later projects, this should also hold true for other types of accents on different letters. [This potential source of error is noted in Project 1, under "Accented letters."] To be extra thorough, in Project 1 I also ran a test on all the various accents that Tolkien is known to use, such as the acute and diaeresis marks.

​

  • A source of error occurs when searching each text for spaces to remove. As I have noted in comments in the code itself, the program oddly identifies "" (an empty string) as a word, most likely because splitting on spaces produces empty strings wherever two separators sit next to each other (e.g. a double space). Again, this is something to be aware of when setting margins of error for serious scholarly research in the future. (The sketch below shows how such empty strings arise and how they can be dropped.)
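A minimal sketch showing how the empty strings can arise and how to filter them out (the example text is purely illustrative):

  raw_text <- "so  it  begins"                 # double spaces for illustration
  tokens   <- unlist(strsplit(raw_text, " "))
  tokens                                       # note the "" elements
  tokens   <- tokens[tokens != ""]             # drop the empty strings
  tokens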

​

​
