5. Token distribution analysis in Leaf by Niggle

Link: Token distribution code

This project looks at the occurrences of "picture" (i.e. the painting) and "tree" in Leaf by Niggle in an attempt to understand how their positions in the text might reflect the path of the plot, from earth to paradise. A tentative argument could be made, subject to close-reading analysis, that Tolkien shifts his emphasis from the more artificial word, "picture," to the realized vision, "tree," as Niggle travels from his imperfect home to a vision of perfection.

1. The original short story can be found as a text (.txt) file here.

 

2. I cleaned the text

  • removing all punctuation

  • removing the whitespace between words, so that only the words themselves remain

  • putting all text into lowercase

  • putting all text into a single string, and

  • indexing the resulting list of words for easy searching.
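
The cleaning steps above might look roughly like this in base R. This is a minimal sketch under stated assumptions, not the project's actual code: the file name `leaf_by_niggle.txt` and all variable names are hypothetical, and two invented sample lines stand in for the story so that the snippet runs on its own.

```r
# Invented sample lines standing in for the story. With the real file
# (hypothetical name) one might instead use:
# text_lines <- scan("leaf_by_niggle.txt", what = "character", sep = "\n")
text_lines <- c("Niggle painted a picture.",
                "The picture showed a tree; the tree grew.")

# Put all text into a single string, then into lowercase
story <- tolower(paste(text_lines, collapse = " "))

# Split on runs of non-word characters, which drops the punctuation and
# the whitespace between words in one step
words <- unlist(strsplit(story, "\\W+"))

# Discard any empty strings left over at the edges of the text
words <- words[words != ""]

# The result is indexed by position: words[1], words[2], ...
print(words)
```

Splitting on `\\W+` treats apostrophes and hyphens as separators too, so a word such as "don't" would become two tokens; whether that matters depends on the terms being searched for.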

3. I performed basic statistical analyses on the text

  • identifying how many times "picture" occurs and at which index positions

  • identifying how many times "tree" occurs and at which index positions

  • identifying the number of unique word types in the short story, and

  • plotting the frequency of words and comparing the top 10 most frequently occurring words with Zipf's law.
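
The analyses listed above could be sketched in base R as follows. The word vector here is an invented sample and the variable names are hypothetical; the real analysis would run on the cleaned story instead.

```r
# Invented sample vector standing in for the cleaned story
words <- c("niggle", "painted", "a", "picture", "the", "picture",
           "showed", "a", "tree", "the", "tree", "grew")

# Index positions and counts for each term (exact matches only)
picture_positions <- which(words == "picture")
tree_positions    <- which(words == "tree")
picture_count <- length(picture_positions)
tree_count    <- length(tree_positions)

# Number of unique word types
n_types <- length(unique(words))

# Raw word counts, sorted from most to least frequent
freqs   <- sort(table(words), decreasing = TRUE)
top_ten <- head(freqs, 10)

print(picture_positions)
print(tree_positions)
print(n_types)

# Plot the counts of the most frequent words, labelled by word
plot(as.vector(top_ten), type = "b", xaxt = "n",
     xlab = "Top ten words", ylab = "Word count")
axis(1, at = seq_along(top_ten), labels = names(top_ten))
```

Because the comparison uses `==`, forms such as "trees" or "pictures" are not counted; matching them would need a pattern-based search instead.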

[Figure: frequency plot of the top 10 words in Leaf by Niggle]

Zipf's law says that the frequency of any word in a corpus is inversely proportional to its rank in the overall frequency distribution. In other words, the second most frequent word will occur about half as often as the most frequent word, the third most frequent word about a third as often, and so on.

Unsurprisingly, "the" occurred most frequently, and "he" occurred second most frequently. Leaf by Niggle does not follow Zipf's law very well: 265 is well above half of 354 (177), and 235 is well above a third of 354 (118), these being the counts given to us by R in the chart. Even if we ignored these numbers, we can visually inspect the graph to see that the frequencies fall off far more slowly than Zipf's law predicts.
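
To make the comparison with Zipf's law concrete, here is a small sketch that sets the three counts quoted above against what an idealized Zipf distribution would predict from the top count. The variable names are illustrative only.

```r
# Counts of the three most frequent words, as quoted in the text
observed <- c(354, 265, 235)
ranks <- seq_along(observed)

# Under an idealized Zipf's law, the word at rank r occurs about
# (count of the top-ranked word) / r times
expected <- observed[1] / ranks

comparison <- data.frame(rank = ranks,
                         observed = observed,
                         zipf_expected = expected,
                         ratio = observed / expected)
print(comparison)
```

A ratio near 1 at every rank would indicate a close fit; here the second- and third-ranked words occur roughly one and a half and two times as often as the idealized law would predict.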

4. Accessing and comparing word frequency data

 

Below is the table that R produces for the relative frequencies of every single word in Leaf by Niggle (i.e. the number of occurrences divided by the total number of words, here expressed as a percentage):

[Table: relative word frequencies in Leaf by Niggle]

When comparing the relative frequencies of the top two words in the text, "the" and "he," we see that "the" occurs 3.98021138 / 2.97953677 = 1.335849055 times as frequently as "he." Plotting the relative frequencies of the top 10 words in Leaf by Niggle, we see:

[Figure: relative frequencies of the top 10 words in Leaf by Niggle]

 

But as our real interest lies in the terms "picture" and "tree," we can ask R to make the same comparison for them. We see that "picture" occurs 1.2 times as frequently as "tree."

Now we have to see where those two terms occur in order to support our argument about the journey from earth to heaven.
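
The relative-frequency comparisons in this step can be sketched as below. The word vector is an invented sample, and `freq_ratio` is a hypothetical helper rather than anything from the project's code; only the final division uses figures quoted in the text.

```r
# Invented sample vector standing in for the cleaned story
words <- c("the", "picture", "the", "tree", "the",
           "picture", "he", "tree", "he", "picture")

# Relative frequency of each word type, as a percentage of all tokens
rel_freqs  <- 100 * table(words) / length(words)
sorted_rel <- sort(rel_freqs, decreasing = TRUE)
print(sorted_rel)

# How many times as frequent word a is compared with word b
freq_ratio <- function(a, b) {
  as.numeric(rel_freqs[a]) / as.numeric(rel_freqs[b])
}
print(freq_ratio("picture", "tree"))

# The same calculation on the two percentages quoted for "the" and "he"
print(3.98021138 / 2.97953677)
```

Since both relative frequencies share the same denominator, the ratio is identical to the ratio of the raw counts.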

5. Token distribution and regular expressions

Creating a dispersion plot, or "token distribution plot," will help us visually determine where in the story the words "picture" and "tree" tend to occur. When we say "where" in the story, we really mean "when"; that is, the further along in a text a word is, the more time has passed. Therefore, we will call the x-axis of such plots "novelistic time," and the y-axis will show the YES/NO occurrences of each word. There are no numerical values for the y-axis, since a black line indicates a YES and a blank/white space indicates a NO.
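
A dispersion plot of this kind can be drawn with base R graphics. The sketch below is one possible approach, using an invented sample vector and a hypothetical helper, `dispersion_vector`; it matches exact word forms only and is not taken from the project's code.

```r
# Invented sample vector standing in for the cleaned story
words <- c("niggle", "painted", "a", "picture", "the", "picture",
           "showed", "a", "tree", "the", "tree", "grew")

# Build a vector over "novelistic time": 1 where the term occurs (YES),
# NA everywhere else (NO), so that only the hits are drawn
dispersion_vector <- function(words, term) {
  hits <- rep(NA_real_, length(words))
  hits[which(words == term)] <- 1
  hits
}

# Stack the two plots so they share the same horizontal scale
par(mfrow = c(2, 1))
for (term in c("picture", "tree")) {
  plot(dispersion_vector(words, term),
       type = "h", ylim = c(0, 1), yaxt = "n",
       main = paste0("Dispersion of \"", term, "\""),
       xlab = "Novelistic time", ylab = term)
}
```

Setting `type = "h"` draws a vertical line at each occurrence, and `yaxt = "n"` suppresses the y-axis ticks, since the vertical dimension only records presence or absence.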

[Figures: dispersion plots for "picture" and "tree" in Leaf by Niggle]


As we can see, "picture" clusters most densely in the first half of Leaf by Niggle, while "tree" clusters most densely in the second half.
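
One way to check this visual impression numerically would be to count each term's occurrences in the two halves of the word vector. The following is a sketch on an invented sample, with `count_by_half` as a hypothetical helper; it does not report results for the actual story.

```r
# Invented sample vector standing in for the cleaned story
words <- c("niggle", "painted", "a", "picture", "the", "picture",
           "showed", "a", "tree", "the", "tree", "grew")

# Count how often a term falls in the first and second half of the text
count_by_half <- function(words, term) {
  positions <- which(words == term)
  midpoint  <- length(words) / 2
  c(first_half  = sum(positions <= midpoint),
    second_half = sum(positions > midpoint))
}

print(count_by_half(words, "picture"))
print(count_by_half(words, "tree"))
```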

 

6. Literary analysis   (Click here for the Word document)

​
