## An Extensive Example for Sets

### Python and the Best Novel

This chapter deals with natural languages and literature. It will also serve as an extensive example and use case for Python sets. Novices in Python often think that sets are just a toy for mathematicians and that there is no real use case for them in programming. The contrary is true: there are multiple use cases for sets. They are used, for example, to get rid of duplicates (multiple occurrences of the same element) in a list, i.e. to make the elements of a list unique.
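As a minimal sketch of that last use case (the sample list is made up for illustration), converting a list to a set drops the repeated elements. If the original order has to be preserved, `dict.fromkeys` is one possible alternative:

```python
# Made-up sample data, not taken from any of the novels.
words = ["stately", "plump", "buck", "plump", "stately", "mulligan"]

# A set keeps each element once, but does not preserve the order.
unique_words = set(words)

# dict keys are unique and keep insertion order (Python 3.7+).
unique_in_order = list(dict.fromkeys(words))

print(len(words), len(unique_words))
print(unique_in_order)
```

The set is the natural choice when only membership and size matter, which is the case in the rest of this chapter.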

In the following example we will use sets to determine the different words occurring in a novel. Our use case is built around a novel which has been praised by many and is regarded as the best novel in the English language and also as the hardest to read. We are talking about the novel "Ulysses" by James Joyce. We will not talk about or examine the beauty of the language or the language style. We will study the novel by having a close look at the words used in it. Our approach will be purely statistical. The claim is that James Joyce used more words in his novel than any other author. His vocabulary is said to exceed that of all other authors, maybe even Shakespeare.

Besides "Ulysses" we will use the novels "Sons and Lovers" by D.H. Lawrence, "The Way of All Flesh" by Samuel Butler, "Robinson Crusoe" by Daniel Defoe, "To the Lighthouse" by Virginia Woolf, "Moby Dick" by Herman Melville and the short story "Metamorphosis" by Franz Kafka.

Before you continue with this chapter of our tutorial, it might be a good idea to read the chapter "Sets and Frozen Sets" and the two chapters on regular expressions and advanced regular expressions.

### Different Words of a Text

To cut out all the words of the novel "Ulysses" we can use the function findall from the module "re":

import re

# we don't care about case sensitivity and therefore use lower;
# the explicit encoding is an assumption (UTF-8 files) and avoids a
# UnicodeDecodeError on systems whose default codec is cp1252:
ulysses_txt = open("books/james_joyce_ulysses.txt", encoding="utf-8").read().lower()

words = re.findall(r"\b[\w-]+\b", ulysses_txt)
print("The novel ulysses contains " + str(len(words)))

The novel ulysses contains 272452

This number counts every occurrence of every word, so words that occur multiple times are counted each time:

for word in ["the", "while", "good", "bad", "ireland", "irish"]:
    print("The word '" + word + "' occurs " +
          str(words.count(word)) + " times in the novel!")

The word 'the' occurs 15112 times in the novel!
The word 'while' occurs 123 times in the novel!
The word 'good' occurs 321 times in the novel!
The word 'bad' occurs 90 times in the novel!
The word 'ireland' occurs 90 times in the novel!
The word 'irish' occurs 117 times in the novel!
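Calling `words.count` once per word scans the whole list each time. A possible alternative, sketched here on a made-up word list rather than on the novel, is `collections.Counter`, which tallies all words in a single pass:

```python
from collections import Counter

# Made-up sample standing in for the list returned by re.findall.
sample_words = ["the", "good", "the", "bad", "the", "irish", "good"]

counts = Counter(sample_words)  # one pass over the list

for word in ["the", "good", "ireland"]:
    # A Counter returns 0 for words that never occur.
    print("The word '" + word + "' occurs " + str(counts[word]) + " times!")
```

For a handful of lookups the difference hardly matters, but it adds up when many words are queried against a list with hundreds of thousands of entries.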


272452 is surely a huge number of words for a novel, but on the other hand there are lots of novels with even more words. More interesting, and saying more about the quality of a novel, is the number of different words. This is the moment where we will finally need "set". We will turn the list "words" into a set. Applying "len" to this set will give us the number of different words:

diff_words = set(words)
print("'Ulysses' contains " + str(len(diff_words)) + " different words!")

'Ulysses' contains 29422 different words!
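To see what the pattern `\b[\w-]+\b` and the set conversion do together, here is a small sketch on a made-up sentence. The helper name `distinct_words` is hypothetical and not part of the chapter's code:

```python
import re

def distinct_words(text):
    """Return the set of lower-cased words found in text."""
    return set(re.findall(r"\b[\w-]+\b", text.lower()))

# Made-up sample sentence with repeated and hyphenated words.
sample = "Yes I said yes I will Yes. A well-read man!"

all_words = re.findall(r"\b[\w-]+\b", sample.lower())

print(len(all_words))               # every occurrence is counted
print(len(distinct_words(sample)))  # each word is counted once
print("well-read" in distinct_words(sample))
```

Note that the hyphen inside the character class keeps hyphenated words such as "well-read" together as one word.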


This is indeed an impressive number. You can see this if you compare it with the numbers for the other novels below:

novels = ['sons_and_lovers_lawrence.txt',
          'metamorphosis_kafka.txt',
          'the_way_of_all_flash_butler.txt',
          'robinson_crusoe_defoe.txt',
          'to_the_lighthouse_woolf.txt',
          'james_joyce_ulysses.txt',
          'moby_dick_melville.txt']

for novel in novels:
    # encoding="utf-8" is an assumption about the files
    txt = open("books/" + novel, encoding="utf-8").read().lower()
    words = re.findall(r"\b[\w-]+\b", txt)
    diff_words = set(words)
    n = len(diff_words)
    print("{name:38s}: {n:5d}".format(name=novel[:-4], n=n))

sons_and_lovers_lawrence              : 10822
metamorphosis_kafka                   :  3027
the_way_of_all_flash_butler           : 11434
robinson_crusoe_defoe                 :  6595
to_the_lighthouse_woolf               : 11415
james_joyce_ulysses                   : 29422
moby_dick_melville                    : 18922


### Special Words in Ulysses

In the following little Python program we will subtract all the words occurring in the other novels from the words of "Ulysses". It is amazing how many words are used by James Joyce and by none of the other authors:

words_in_novel = {}
for novel in novels:
    # encoding="utf-8" is an assumption about the files
    txt = open("books/" + novel, encoding="utf-8").read().lower()
    words = re.findall(r"\b[\w-]+\b", txt)
    words_in_novel[novel] = words

words_only_in_ulysses = set(words_in_novel['james_joyce_ulysses.txt'])
# from here on, "novels" contains only the other books
novels.remove('james_joyce_ulysses.txt')
for novel in novels:
    words_only_in_ulysses -= set(words_in_novel[novel])

with open("books/words_only_in_ulysses.txt", "w", encoding="utf-8") as fh:
    txt = " ".join(words_only_in_ulysses)
    fh.write(txt)

print(len(words_only_in_ulysses))

15314
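The loop above applies `-=` once per novel. As a sketch on toy data (the three tiny word sets are invented for illustration), the same kind of result can be obtained in one call, because `set.difference` accepts any number of arguments:

```python
# Invented miniature "vocabularies", only for illustration.
vocab = {
    "ulysses": {"the", "sea", "snotgreen", "tower", "kithogue"},
    "moby_dick": {"the", "sea", "whale", "tower"},
    "crusoe": {"the", "island", "sea"},
}

# All word sets except the one we subtract from.
others = [words for name, words in vocab.items() if name != "ulysses"]

# Words in "ulysses" that appear in none of the other sets.
only_in_ulysses = vocab["ulysses"].difference(*others)

print(sorted(only_in_ulysses))
```

Unlike `-=`, `difference` returns a new set and leaves the original set unchanged.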


By the way, Dr. Seuss wrote a book with only 50 different words: "Green Eggs and Ham".

The file with the words occurring only in "Ulysses" contains strange or seldom used words like:

huntingcrop tramtrack pappin kithogue pennyweight undergarments scission nagyaságos wheedling begad dogwhip hawthornden turnbull calumet covey repudiated pendennis waistcoatpocket nostrum

### Common Words

It is also possible to find the words which occur in every book. To accomplish this, we need the set intersection:

# we start with the words in ulysses
common_words = set(words_in_novel['james_joyce_ulysses.txt'])
for novel in novels:
    common_words &= set(words_in_novel[novel])

print(len(common_words))

1745
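The intersection can also be computed in a single call. The following sketch uses invented toy sets instead of the novels and relies on `set.intersection` accepting several sets at once:

```python
# Invented miniature word sets, only for illustration.
word_sets = [
    {"the", "sea", "snotgreen", "tower"},
    {"the", "sea", "whale", "tower"},
    {"the", "island", "sea"},
]

# Words that occur in every one of the sets.
common = set.intersection(*word_sets)

print(sorted(common))
```

A word such as "tower" is dropped because it is missing from at least one of the sets.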


### Doing it Right

We made a slight mistake in the previous calculations. If you look at the texts, you will notice that they have a header and a footer added by Project Gutenberg, which don't belong to the texts. The actual texts are positioned between the lines:

START OF THE PROJECT GUTENBERG EBOOK THE WAY OF ALL FLESH

and

END OF THE PROJECT GUTENBERG EBOOK THE WAY OF ALL FLESH

or

START OF THIS PROJECT GUTENBERG EBOOK ULYSSES

and

END OF THIS PROJECT GUTENBERG EBOOK ULYSSES

The function read_text takes care of this:

def read_text(fname):
    beg_e = re.compile(r"\*\*\* ?start of (this|the) project gutenberg ebook[^*]*\*\*\*")
    end_e = re.compile(r"\*\*\* ?end of (this|the) project gutenberg ebook[^*]*\*\*\*")
    # encoding="utf-8" is an assumption about the files
    txt = open("books/" + fname, encoding="utf-8").read().lower()
    # keep only the part between the start and the end marker
    beg = beg_e.search(txt).end()
    end = end_e.search(txt).start()
    return txt[beg:end]

words_in_novel = {}
for novel in novels + ['james_joyce_ulysses.txt']:
    txt = read_text(novel)
    words = re.findall(r"\b[\w-]+\b", txt)
    words_in_novel[novel] = words

words_in_ulysses = set(words_in_novel['james_joyce_ulysses.txt'])
for novel in novels:
    words_in_ulysses -= set(words_in_novel[novel])

with open("books/words_in_ulysses.txt", "w") as fh:
    txt = " ".join(words_in_ulysses)
    fh.write(txt)

print(len(words_in_ulysses))

# we start with the words in ulysses
common_words = set(words_in_novel['james_joyce_ulysses.txt'])
for novel in novels:
    common_words &= set(words_in_novel[novel])

print(len(common_words))

15341
1279
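The marker handling can be tried out without any files. The sketch below applies the same two regular expressions to an in-memory string; the function name `strip_gutenberg` and the fallback for missing markers are assumptions made for this illustration, not part of the original program:

```python
import re

BEG_E = re.compile(r"\*\*\* ?start of (this|the) project gutenberg ebook[^*]*\*\*\*")
END_E = re.compile(r"\*\*\* ?end of (this|the) project gutenberg ebook[^*]*\*\*\*")

def strip_gutenberg(txt):
    """Return the lower-cased text between the two markers.

    Assumption: if either marker is missing, the whole
    lower-cased text is returned instead of raising an error.
    """
    txt = txt.lower()
    beg = BEG_E.search(txt)
    end = END_E.search(txt)
    if beg is None or end is None:
        return txt
    return txt[beg.end():end.start()]

# Made-up miniature "book" with a header and a footer.
sample = (
    "License header\n"
    "*** START OF THIS PROJECT GUTENBERG EBOOK ULYSSES ***\n"
    "Stately, plump Buck Mulligan\n"
    "*** END OF THIS PROJECT GUTENBERG EBOOK ULYSSES ***\n"
    "License footer\n"
)

body = strip_gutenberg(sample)
print(body.strip())
```

Checking the result of `search` for `None` is a defensive choice: calling `.end()` directly, as in `read_text`, raises an `AttributeError` when a file does not contain the expected marker.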


The words in the set "common_words" belong to the most frequently used words of the English language. Let's have a look at 30 arbitrary words of this set:

counter = 0
for word in common_words:
    print(word, end=", ")
    counter += 1
    if counter == 30:
        break

send, such, mentioned, writing, found, speak, fond, food, their, mother, household, through, prepared, flew, gently, work, station, naturally, near, empty, filled, move, unknown, left, alarm, listening, waited, showed, broke, laugh,

ancient, broke, breathing, laugh, divided, forced, wealth, ring, outside, throw, person, spend, better, errand, school, sought, knock, tell, inner, run, packed, another, since, touched, bearing, repeated, bitter, experienced, often, one,
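An alternative to counting by hand, sketched here with a made-up stand-in for the real set, is `itertools.islice`, which stops after a given number of elements. Because the iteration order of a set of strings is arbitrary and can change between interpreter runs, the selected words may differ from run to run, which would be consistent with the two different outputs shown above:

```python
from itertools import islice

# Made-up stand-in for the real set of common words.
common_words = {"send", "such", "found", "speak", "their", "mother"}

# Take the first three elements in whatever order the set yields them.
first_three = list(islice(common_words, 3))

print(", ".join(first_three))
```

If a reproducible selection is needed, sorting the set first, for example with `sorted(common_words)[:3]`, is one option.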