An Extensive Example for Sets



Python and the Best Novel

Analysing James Joyce

This chapter deals with natural languages and literature. It will also be an extensive example and use case for Python sets. Novices in Python often think that sets are just a toy for mathematicians and that there is no real use case in programming. The contrary is true. There are multiple use cases for sets. They are used, for example, to get rid of duplicates (multiple occurrences of the same element) in a list, i.e. to make the elements of a list unique.
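Removing duplicates from a list can be sketched in a few lines. The list "items" is made up for illustration. Converting to a set loses the original order, so the sketch also shows the common dict.fromkeys idiom for the case where the order of first occurrence matters:

```python
# hypothetical sample data, only for illustration
items = ["b", "a", "b", "c", "a"]

# a set keeps each element once, but in no particular order
unique_unordered = list(set(items))
print(sorted(unique_unordered))

# dict keys are unique and keep insertion order (Python 3.7+)
unique_ordered = list(dict.fromkeys(items))
print(unique_ordered)
```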

In the following example we will use sets to determine the different words occurring in a novel. Our use case is built around a novel which has been praised by many as the best novel in the English language and also as the hardest to read. We are talking about the novel "Ulysses" by James Joyce. We will not talk about or examine the beauty of the language or the language style. We will study the novel by having a close look at the words used in it. Our approach will be purely statistical. The claim is that James Joyce used more different words in his novel than any other author. His vocabulary is said to go above and beyond that of all other authors, maybe even Shakespeare's.

Besides "Ulysses" we will use the novels "Sons and Lovers" by D.H. Lawrence, "The Way of All Flesh" by Samuel Butler, "Robinson Crusoe" by Daniel Defoe, "To the Lighthouse" by Virginia Woolf, "Moby Dick" by Herman Melville and the short story "Metamorphosis" by Franz Kafka.

Before you continue with this chapter of our tutorial, it might be a good idea to read the chapter Sets and Frozen Sets and the two chapters on regular expressions and advanced regular expressions.

Different Words of a Text

To cut out all the words of the novel "Ulysses" we can use the function findall from the module "re":

import re

# we don't care about case sensitivity and therefore use lower:
with open("books/james_joyce_ulysses.txt") as fh:
    ulysses_txt = fh.read().lower()

# a word is a run of word characters and hyphens between word boundaries
words = re.findall(r"\b[\w-]+\b", ulysses_txt)
print("The novel ulysses contains " + str(len(words)) + " words")
The novel ulysses contains 272452 words
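To see what this pattern treats as a word without needing the book files, here is a minimal sketch on a made-up sentence. The names "sample" and "tokens" are assumptions for illustration only:

```python
import re

# hypothetical sample text, not taken from any of the novels
sample = "Well-known words don't repeat; words repeat."

# same pattern as above: word characters and hyphens between word boundaries
tokens = re.findall(r"\b[\w-]+\b", sample.lower())
print(tokens)
print(len(tokens))
```

The hyphenated "well-known" stays in one piece, while the apostrophe is not part of the character class, so "don't" is split into "don" and "t". This is a limitation of the simple pattern worth keeping in mind when reading the counts.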

This number counts every occurrence of every word, and many words occur multiple times:

for word in ["the", "while", "good", "bad", "ireland", "irish"]:
    print("The word '" + word + "' occurs " +
          str(words.count(word)) + " times in the novel!")
The word 'the' occurs 15112 times in the novel!
The word 'while' occurs 123 times in the novel!
The word 'good' occurs 321 times in the novel!
The word 'bad' occurs 90 times in the novel!
The word 'ireland' occurs 90 times in the novel!
The word 'irish' occurs 117 times in the novel!
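Each call to words.count walks through the whole list again. If many words are to be counted, collections.Counter from the standard library can build all the counts in one pass. The following is only a sketch on a made-up word list, not the method used for the numbers above:

```python
from collections import Counter

# hypothetical word list standing in for the words of a novel
sample_words = ["the", "good", "the", "bad", "the", "good"]

# one pass over the list yields the count of every word
counts = Counter(sample_words)

for word in ["the", "good", "irish"]:
    # a Counter returns 0 for words it has never seen
    print("The word '" + word + "' occurs " + str(counts[word]) + " times!")

# the most frequent word together with its count
print(counts.most_common(1))
```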

272452 is surely a huge number of words for a novel, but on the other hand there are lots of novels with even more words. More interesting, and more telling about the richness of a novel's vocabulary, is the number of different words. This is the moment where we will finally need "set". We will turn the list of words "words" into a set. Applying "len" to this set will give us the number of different words:

diff_words = set(words)
print("'Ulysses' contains " + str(len(diff_words)) + " different words!")
'Ulysses' contains 29422 different words!

This is indeed an impressive number. You can see this if you compare it with the other novels in our folder "books":

novels = ['sons_and_lovers_lawrence.txt', 
          'metamorphosis_kafka.txt', 
          'the_way_of_all_flash_butler.txt', 
          'robinson_crusoe_defoe.txt', 
          'to_the_lighthouse_woolf.txt', 
          'james_joyce_ulysses.txt', 
          'moby_dick_melville.txt']
for novel in novels:
    txt = open("books/" + novel).read().lower()
    words = re.findall(r"\b[\w-]+\b", txt)
    diff_words = set(words)
    n = len(diff_words)
    print("{name:38s}: {n:5d}".format(name=novel[:-4], n=n))
sons_and_lovers_lawrence              : 10822
metamorphosis_kafka                   :  3027
the_way_of_all_flash_butler           : 11434
robinson_crusoe_defoe                 :  6595
to_the_lighthouse_woolf               : 11415
james_joyce_ulysses                   : 29422
moby_dick_melville                    : 18922

Special Words in Ulysses

We will subtract all the words occurring in the other novels from "Ulysses" in the following little Python program. It is amazing how many words are used by James Joyce and by none of the other authors:

words_in_novel = {}
for novel in novels:
    txt = open("books/" + novel).read().lower()
    words = re.findall(r"\b[\w-]+\b", txt)
    words_in_novel[novel] = words
    
words_only_in_ulysses =  set(words_in_novel['james_joyce_ulysses.txt'])
novels.remove('james_joyce_ulysses.txt')
for novel in novels:
    words_only_in_ulysses -= set(words_in_novel[novel])
    
with open("books/words_only_in_ulysses.txt", "w") as fh:
    txt = " ".join(words_only_in_ulysses)
    fh.write(txt)
    
print(len(words_only_in_ulysses))
    
15314

By the way, Dr. Seuss wrote a book with only 50 different words: "Green Eggs and Ham".

The file with the words occurring only in "Ulysses" contains strange or rarely used words like:

huntingcrop tramtrack pappin kithogue pennyweight undergarments scission nagyaságos wheedling begad dogwhip hawthornden turnbull calumet covey repudiated pendennis waistcoatpocket nostrum
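The subtraction loop used above can also be written as a single call, because the set method difference accepts any number of other sets. A minimal sketch with made-up vocabularies (the names "main_vocab" and "other_vocabs" are assumptions for illustration):

```python
# hypothetical vocabularies, only for illustration
main_vocab = {"stately", "plump", "the", "tramtrack", "sea"}
other_vocabs = [{"the", "sea", "whale"}, {"the", "plump", "island"}]

# remove every word that occurs in at least one of the other sets
only_in_main = main_vocab.difference(*other_vocabs)
print(sorted(only_in_main))
```

Unlike the "-=" operator, difference returns a new set and leaves "main_vocab" unchanged.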

Common Words

It is also possible to find the words which occur in every book. To accomplish this, we need the set intersection:

# we start with the words in ulysses
common_words = set(words_in_novel['james_joyce_ulysses.txt'])
for novel in novels:
    common_words &= set(words_in_novel[novel])
    
print(len(common_words))
1745
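The same result can be obtained without a loop, since set.intersection takes any number of sets. The following sketch uses made-up vocabularies; "vocabularies" and "common" are names assumed for illustration:

```python
# hypothetical vocabularies of three texts
vocabularies = [
    {"the", "sea", "one", "whale"},
    {"the", "one", "island", "sea"},
    {"one", "the", "lighthouse"},
]

# keep only the words which are contained in every set
common = set.intersection(*vocabularies)
print(sorted(common))
```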

Doing it Right

We made a slight mistake in the previous calculations. If you look at the texts, you will notice that they have a header and a footer added by Project Gutenberg, which don't belong to the texts. The words of these parts were counted as if they were part of each book. The actual texts are positioned between the lines:

***START OF THE PROJECT GUTENBERG EBOOK THE WAY OF ALL FLESH***

and

***END OF THE PROJECT GUTENBERG EBOOK THE WAY OF ALL FLESH***

or

*** START OF THIS PROJECT GUTENBERG EBOOK ULYSSES ***

and

*** END OF THIS PROJECT GUTENBERG EBOOK ULYSSES ***
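Both marker styles can be matched by one regular expression if the text is converted to lower case first. The following sketch tries this on a small in-memory sample instead of a real file. The sample text and the names "start_marker", "end_marker" and "body" are assumptions for illustration:

```python
import re

# the space after the asterisks is optional, "this" and "the" are both accepted
start_marker = re.compile(
    r"\*\*\* ?start of (this|the) project gutenberg ebook[^*]*\*\*\*")
end_marker = re.compile(
    r"\*\*\* ?end of (this|the) project gutenberg ebook[^*]*\*\*\*")

# hypothetical miniature "book" with header and footer
sample = """Header text added by the distributor
*** START OF THIS PROJECT GUTENBERG EBOOK SAMPLE ***
Stately, plump Buck Mulligan
***END OF THE PROJECT GUTENBERG EBOOK SAMPLE***
Footer with licence text""".lower()

# cut out everything between the end of the start marker
# and the beginning of the end marker
beg = start_marker.search(sample).end()
end = end_marker.search(sample).start()
body = sample[beg:end].strip()
print(body)
```

If a marker is missing, search returns None and the call to end or start raises an AttributeError, so real files should be checked for the markers first.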

The function read_text takes care of this:

def read_text(fname):
    # the patterns are written in lower case, because the text is lowered below
    beg_e = re.compile(r"\*\*\* ?start of (this|the) project gutenberg ebook[^*]*\*\*\*")
    end_e = re.compile(r"\*\*\* ?end of (this|the) project gutenberg ebook[^*]*\*\*\*")
    with open("books/" + fname) as fh:
        txt = fh.read().lower()
    # keep only the text between the start marker and the end marker
    beg = beg_e.search(txt).end()
    end = end_e.search(txt).start()
    return txt[beg:end]

words_in_novel = {}
for novel in novels + ['james_joyce_ulysses.txt']:
    txt = read_text(novel)
    words = re.findall(r"\b[\w-]+\b", txt)
    words_in_novel[novel] = words
words_in_ulysses =  set(words_in_novel['james_joyce_ulysses.txt'])
for novel in novels:
    words_in_ulysses -= set(words_in_novel[novel])
    
with open("books/words_in_ulysses.txt", "w") as fh:
    txt = " ".join(words_in_ulysses)
    fh.write(txt)
    
print(len(words_in_ulysses))
15341

# we start with the words in ulysses
common_words = set(words_in_novel['james_joyce_ulysses.txt'])
for novel in novels:
    common_words &= set(words_in_novel[novel])
    
print(len(common_words))
1279

The words of the set "common_words" belong to the most frequently used words of the English language. Let's have a look at 30 arbitrary words of this set:

counter = 0
for word in common_words:
    print(word, end=", ")
    counter += 1
    if counter == 30:
        break
ancient, broke, breathing, laugh, divided, forced, wealth, ring, outside, throw, person, spend, better, errand, school, sought, knock, tell, inner, run, packed, another, since, touched, bearing, repeated, bitter, experienced, often, one,
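A set has no defined order, so which 30 words are printed can differ from run to run. The counting loop can also be replaced by itertools.islice, and sorting first gives a reproducible selection. A minimal sketch on a made-up set (the small set below stands in for the real "common_words"):

```python
from itertools import islice

# hypothetical stand-in for the set of common words
common_words = {"one", "often", "since", "better", "tell"}

# take the first three elements in whatever order the set yields them
arbitrary = list(islice(common_words, 3))
print(", ".join(arbitrary))

# sorting first makes the selection the same on every run
alphabetical = sorted(common_words)[:3]
print(", ".join(alphabetical))
```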