An Extensive Example for Sets
Python and the Best Novel
This chapter deals with natural languages and literature. It will also be an extensive example and use case for Python sets. Novices in Python often think that sets are just a toy for mathematicians and that there is no real use case in programming. The contrary is true. There are multiple use cases for sets. They are used, for example, to get rid of duplicates (multiple occurrences of elements) in a list, i.e. to make the elements of a list unique.
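As a minimal sketch of this deduplication use case (the sample list and the variable names are made up for illustration), turning a list into a set drops the repeated elements. Note that a set does not keep the original order; if the order matters, `dict.fromkeys` is one possible alternative.

```python
# Hypothetical sample data, not taken from any of the novels.
cities = ["Dublin", "Paris", "Dublin", "Zurich", "Paris", "Trieste"]

# A set keeps each element only once, but it does not preserve the order.
unique_cities = set(cities)
print(len(cities), len(unique_cities))

# If the original order matters, dict.fromkeys keeps the first occurrence
# of each element (dicts preserve insertion order since Python 3.7).
unique_in_order = list(dict.fromkeys(cities))
print(unique_in_order)
```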
In the following example we will use sets to determine the different words occurring in a novel. Our use case is built around a novel which has been praised by many and is regarded by some as the best novel in the English language, and also as the hardest to read. We are talking about the novel "Ulysses" by James Joyce. We will not talk about or examine the beauty of the language or the language style. We will study the novel by taking a close look at the words it uses. Our approach will be purely statistical. The claim is that James Joyce used more different words in this novel than any other author. His vocabulary is said to exceed that of all other authors, maybe even Shakespeare.
Besides "Ulysses" we will use the novels "Sons and Lovers" by D.H. Lawrence, "The Way of All Flesh" by Samuel Butler, "Robinson Crusoe" by Daniel Defoe, "To the Lighthouse" by Virginia Woolf, "Moby Dick" by Herman Melville and the short story "Metamorphosis" by Franz Kafka.
Before you continue with this chapter of our tutorial, it might be a good idea to read the chapter Sets and Frozen Sets and the two chapters on regular expressions and advanced regular expressions.
Different Words of a Text
To cut out all the words of the novel "Ulysses" we can use the function findall from the module "re":
import re
# we don't care about case sensitivity and therefore use lower:
ulysses_txt = open("books/james_joyce_ulysses.txt").read().lower()
words = re.findall(r"\b[\w-]+\b", ulysses_txt)
print("The novel 'Ulysses' contains " + str(len(words)) + " words")
This number counts every occurrence of every word, so words that occur multiple times are counted each time they appear:
for word in ["the", "while", "good", "bad", "ireland", "irish"]:
    print("The word '" + word + "' occurs " + \
          str(words.count(word)) + " times in the novel!" )
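The loop above calls `words.count` once per word, and every call scans the whole list again. As a hedged alternative, `collections.Counter` from the standard library tallies all words in a single pass. The sketch below uses a made-up sample sentence instead of the novel, so it runs without the book files. It also shows that the pattern `\b[\w-]+\b` keeps hyphenated words such as "well-fed" together.

```python
import re
from collections import Counter

# Hypothetical sample sentence standing in for the text of a novel.
sample = "The cat saw the well-fed dog; the dog saw the cat."

# Same pattern as in the tutorial: word characters and hyphens.
words = re.findall(r"\b[\w-]+\b", sample.lower())
print(words)

# Counter builds a mapping word -> number of occurrences in one pass.
counts = Counter(words)
for word in ["the", "dog", "well-fed", "bird"]:
    # A Counter returns 0 for words which do not occur at all.
    print(word, counts[word])
```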
272452 surely is a huge number of words for a novel, but there are plenty of novels with even more words. The number of different words is more interesting and says more about the quality of a novel. This is the moment where we finally need "set". We turn the list "words" into a set. Applying "len" to this set gives us the number of different words:
diff_words = set(words)
print("'Ulysses' contains " + str(len(diff_words)) + " different words!")
This is indeed an impressive number. You can see this if you compare it with the numbers for the other novels below:
novels = ['sons_and_lovers_lawrence.txt',
          'metamorphosis_kafka.txt',
          'the_way_of_all_flash_butler.txt',
          'robinson_crusoe_defoe.txt',
          'to_the_lighthouse_woolf.txt',
          'james_joyce_ulysses.txt',
          'moby_dick_melville.txt']

for novel in novels:
    txt = open("books/" + novel).read().lower()
    words = re.findall(r"\b[\w-]+\b", txt)
    diff_words = set(words)
    n = len(diff_words)
    print("{name:38s}: {n:5d}".format(name=novel[:-4], n=n))
# collect the complete word list of every novel
words_in_novel = {}
for novel in novels:
    txt = open("books/" + novel).read().lower()
    words = re.findall(r"\b[\w-]+\b", txt)
    words_in_novel[novel] = words

# start with all the words of Ulysses and remove the words of the other novels
words_only_in_ulysses = set(words_in_novel['james_joyce_ulysses.txt'])
novels.remove('james_joyce_ulysses.txt')
for novel in novels:
    words_only_in_ulysses -= set(words_in_novel[novel])

with open("books/words_only_in_ulysses.txt", "w") as fh:
    txt = " ".join(words_only_in_ulysses)
    fh.write(txt)

print(len(words_only_in_ulysses))
By the way, Dr. Seuss wrote a book with only 50 different words: "Green Eggs and Ham"
The file with the words only occurring in Ulysses contains strange or seldom used words like:
huntingcrop tramtrack pappin kithogue pennyweight undergarments scission nagyaságos wheedling begad dogwhip hawthornden turnbull calumet covey repudiated pendennis waistcoatpocket nostrum
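The repeated `-=` used to build this file is a set difference applied once per novel. As a small sketch with made-up vocabularies (the book names and words below are purely illustrative), the same result can be obtained with a single call to the set method `difference`, which accepts several sets at once and leaves the original set unchanged:

```python
# Hypothetical miniature vocabularies, not derived from the real novels.
vocab = {
    "book_a": {"the", "sea", "whale", "ship"},
    "book_b": {"the", "sea", "island", "goat"},
    "book_c": {"the", "lighthouse", "sea"},
}

# Words of book_a which occur in none of the other books.
only_in_a = vocab["book_a"].difference(vocab["book_b"], vocab["book_c"])
print(sorted(only_in_a))

# In contrast to "-=", difference() does not modify the original set.
print(len(vocab["book_a"]))
```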
Common Words
It is also possible to find the words which occur in every book. To accomplish this, we need the set intersection:
# we start with the words in ulysses
common_words = set(words_in_novel['james_joyce_ulysses.txt'])
for novel in novels:
    common_words &= set(words_in_novel[novel])

print(len(common_words))
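The loop with `&=` can also be written as one call to `set.intersection`, which takes any number of sets. The following sketch again uses made-up miniature vocabularies in place of the real word lists, so the names and words are assumptions for illustration only:

```python
# Hypothetical miniature vocabularies standing in for words_in_novel.
vocab = {
    "book_a": {"the", "sea", "whale", "ship"},
    "book_b": {"the", "sea", "island", "goat"},
    "book_c": {"the", "lighthouse", "sea"},
}

# Unpack all the sets into one intersection call:
# only the words contained in every book remain.
common = set.intersection(*vocab.values())
print(sorted(common))
```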
Doing it Right
We made a slight mistake in the previous calculations. If you look at the texts, you will notice that they have a header and a footer added by Project Gutenberg, which don't belong to the novels themselves. The actual texts are positioned between the lines:
START OF THE PROJECT GUTENBERG EBOOK THE WAY OF ALL FLESH
and
END OF THE PROJECT GUTENBERG EBOOK THE WAY OF ALL FLESH
or
START OF THIS PROJECT GUTENBERG EBOOK ULYSSES
and
END OF THIS PROJECT GUTENBERG EBOOK ULYSSES
The function read_text takes care of this:
def read_text(fname):
    beg_e = re.compile(r"\*\*\* ?start of (this|the) project gutenberg ebook[^*]*\*\*\*")
    end_e = re.compile(r"\*\*\* ?end of (this|the) project gutenberg ebook[^*]*\*\*\*")
    txt = open("books/" + fname).read().lower()
    beg = beg_e.search(txt).end()
    end = end_e.search(txt).start()
    return txt[beg:end]
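To see what the two regular expressions cut out, the following sketch applies the same patterns to a made-up string instead of a file. The helper name `strip_gutenberg_markers` and the sample text are hypothetical. The sketch assumes, as the patterns require, that the marker lines are wrapped in three asterisks. Unlike `read_text`, it raises an explicit `ValueError` if a marker is missing, because `search` returns `None` in that case; this is a design choice of the sketch, not part of the original function.

```python
import re

# Marker patterns copied from read_text; they expect lower-cased input.
BEG_E = re.compile(r"\*\*\* ?start of (this|the) project gutenberg ebook[^*]*\*\*\*")
END_E = re.compile(r"\*\*\* ?end of (this|the) project gutenberg ebook[^*]*\*\*\*")


def strip_gutenberg_markers(raw_text):
    """Hypothetical helper: return the lower-cased text between the markers."""
    txt = raw_text.lower()
    beg_match = BEG_E.search(txt)
    end_match = END_E.search(txt)
    if beg_match is None or end_match is None:
        # search() returns None if a marker line is not present
        raise ValueError("Project Gutenberg start/end marker not found")
    return txt[beg_match.end():end_match.start()]


# Made-up miniature file content for illustration.
sample = (
    "Some licence header\n"
    "*** START OF THIS PROJECT GUTENBERG EBOOK SAMPLE ***\n"
    "Actual body text.\n"
    "*** END OF THIS PROJECT GUTENBERG EBOOK SAMPLE ***\n"
    "Some licence footer\n"
)
body = strip_gutenberg_markers(sample)
print(body.strip())
```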
words_in_novel = {}
for novel in novels + ['james_joyce_ulysses.txt']:
    txt = read_text(novel)
    words = re.findall(r"\b[\w-]+\b", txt)
    words_in_novel[novel] = words

words_in_ulysses = set(words_in_novel['james_joyce_ulysses.txt'])
for novel in novels:
    words_in_ulysses -= set(words_in_novel[novel])

with open("books/words_in_ulysses.txt", "w") as fh:
    txt = " ".join(words_in_ulysses)
    fh.write(txt)

print(len(words_in_ulysses))

# we start with the words in ulysses
common_words = set(words_in_novel['james_joyce_ulysses.txt'])
for novel in novels:
    common_words &= set(words_in_novel[novel])

print(len(common_words))
The words of the set "common_words" belong to the most frequently used words of the English language. Let's have a look at 30 arbitrary words of this set:
counter = 0
for word in common_words:
    print(word, end=", ")
    counter += 1
    if counter == 30:
        break
ancient, broke, breathing, laugh, divided, forced, wealth, ring, outside, throw, person, spend, better, errand, school, sought, knock, tell, inner, run, packed, another, since, touched, bearing, repeated, bitter, experienced, often, one,
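The manual counter in the loop above can be replaced by `itertools.islice`, which stops after a given number of elements. Because a set has no defined order, the selected words are arbitrary and may differ between runs; sorting first gives a reproducible selection. The sketch uses a small made-up set as a stand-in for "common_words".

```python
from itertools import islice

# Hypothetical stand-in for the much larger set common_words.
common_words = {"ancient", "broke", "laugh", "ring", "tell", "run", "one"}

# Take three arbitrary words; which ones we get depends on the set's
# internal ordering.
first_three = list(islice(common_words, 3))
print(", ".join(first_three))

# Sorting makes the selection reproducible.
alphabetical = sorted(common_words)[:3]
print(", ".join(alphabetical))
```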