## Encoding Text for Machine Learning

### Introduction

We mentioned in the introductory chapter of our tutorial that a spam filter for emails is a typical example of machine learning. Emails are based on text, which is why a classifier for emails must be able to process text as input. If we look at the previous examples with neural networks, they always work directly with numerical values and have a fixed input length. In the end, the characters of a text also consist of numerical values, but it is obvious that we cannot simply use a text as it is as input for a neural network. This means that the text has to be converted into a numerical representation, e.g. vectors or arrays of numbers.

We will learn in this tutorial how to encode text in a way which is suitable for machine processing.

### Bag-of-Words Model

If we want to use texts in machine learning, we need a representation of the text which is usable for Machine Learning purposes. This means we need a numerical representation. We cannot use texts directly.

In natural language processing and information retrieval the bag-of-words model is of crucial importance. The bag-of-words model can be used to represent text data in a way which is suitable for machine learning algorithms. Furthermore, this model is easy and efficient to implement. In the bag-of-words model, a text (such as a sentence or a document) is represented as the so-called bag (a set or multiset) of its words. Grammar and word order are ignored.
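To make the idea concrete, here is a minimal sketch of a bag of words built with the standard library only. The helper name `bag_of_words` and the naive tokenisation (lowercasing, then keeping runs of letters) are assumptions made for illustration; they are not what scikit-learn does internally.

```python
import re
from collections import Counter

def bag_of_words(sentence):
    """Return a multiset (Counter) of the lowercased words in `sentence`."""
    # naive tokenisation: runs of letters only, punctuation is dropped
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return Counter(tokens)

bag = bag_of_words("To be, or not to be, that is the question:")
print(bag)

# word order is ignored: a reordered sentence yields the same bag
same_bag = bag_of_words("The question is that: to be or not, to be")
print(bag == same_bag)
```

Because a `Counter` only stores how often each word occurs, the two sentences above are indistinguishable in this representation.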

We will use in the following a list of three strings to demonstrate the bag-of-words approach. In linguistics, the collection of texts used for the experiments or tests is usually called a corpus:

corpus = ["To be, or not to be, that is the question:",
"Whether 'tis nobler in the mind to suffer",
"The slings and arrows of outrageous fortune,"]


We will use the submodule text from sklearn.feature_extraction. This module contains utilities to build feature vectors from text documents.

from sklearn.feature_extraction import text


CountVectorizer is a class in the module sklearn.feature_extraction.text. It is useful for building a corpus vocabulary. In addition, it produces the numerical representation of text that we need: a sparse matrix of token counts, which can be converted into a NumPy array.

First we need an instance of this class. When we instantiate a CountVectorizer, we can pass some optional parameters, but it is possible to call it with no arguments, as we will do in the following. Printing the vectorizer gives us useful information about the default values used when the instance was created:

vectorizer = text.CountVectorizer()
print(vectorizer)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
lowercase=True, max_df=1.0, max_features=None, min_df=1,
ngram_range=(1, 1), preprocessor=None, stop_words=None,
strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
tokenizer=None, vocabulary=None)


We now have an instance of CountVectorizer, but it has not seen any texts so far. We will use the method fit to process our previously defined corpus. In doing so, we learn a vocabulary dictionary of all the tokens (strings) of the corpus:

x = vectorizer.fit(corpus)
print(x)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
lowercase=True, max_df=1.0, max_features=None, min_df=1,
ngram_range=(1, 1), preprocessor=None, stop_words=None,
strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
tokenizer=None, vocabulary=None)


fit created the vocabulary structure vocabulary_. This dictionary contains the words of the text as keys and a unique integer index for each word as values. As the default value for the parameter lowercase is True, the To at the beginning of the text has been turned into to. You may also notice that the vocabulary contains only words, without any punctuation or special characters. You can change this behaviour by assigning a regular expression to the keyword parameter token_pattern when you instantiate CountVectorizer. The default is (?u)\\b\\w\\w+\\b. The (?u) part of this regular expression is not necessary, because it switches on the re.U (re.UNICODE) flag, which is the default for string patterns in Python 3 anyway. The \\w\\w+ part means that the minimal word length is two characters:
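The effect of the default token pattern can be checked directly with Python's re module. This is only a sketch of the tokenisation step, under the assumption that lowercasing happens before the pattern is applied; the variable names are illustrative.

```python
import re

default_pattern = r"(?u)\b\w\w+\b"     # two or more word characters
single_char_pattern = r"(?u)\b\w+\b"   # also accepts one-character tokens

sentence = "I wish I were a Python programmer".lower()

tokens_default = re.findall(default_pattern, sentence)
tokens_single = re.findall(single_char_pattern, sentence)

print(tokens_default)   # 'i' and 'a' are missing
print(tokens_single)    # every word is kept
```

A pattern like the second one could be passed as token_pattern if one-character words such as "I" or "a" should end up in the vocabulary.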

print("Vocabulary: ", vectorizer.vocabulary_)

Vocabulary:  {'to': 18, 'be': 2, 'or': 10, 'not': 8, 'that': 15, 'is': 5, 'the': 16, 'question': 12, 'whether': 19, 'tis': 17, 'nobler': 7, 'in': 4, 'mind': 6, 'suffer': 14, 'slings': 13, 'and': 0, 'arrows': 1, 'of': 9, 'outrageous': 11, 'fortune': 3}


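If you only want to see the words without the indices, you can use the method get_feature_names. (This method was deprecated in scikit-learn 1.0 and removed in 1.2; newer versions provide get_feature_names_out instead, which returns a NumPy array.)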

print(vectorizer.get_feature_names())

['and', 'arrows', 'be', 'fortune', 'in', 'is', 'mind', 'nobler', 'not', 'of', 'or', 'outrageous', 'question', 'slings', 'suffer', 'that', 'the', 'tis', 'to', 'whether']


Alternatively, you can apply keys to the vocabulary, which keeps the order in which the words were first encountered:

print(list(vectorizer.vocabulary_.keys()))

['to', 'be', 'or', 'not', 'that', 'is', 'the', 'question', 'whether', 'tis', 'nobler', 'in', 'mind', 'suffer', 'slings', 'and', 'arrows', 'of', 'outrageous', 'fortune']


With the aid of transform we will extract the token counts out of the raw text documents. The call will use the vocabulary which we created with fit:

token_count_matrix = vectorizer.transform(corpus)
print(token_count_matrix)

  (0, 2)	2
(0, 5)	1
(0, 8)	1
(0, 10)	1
(0, 12)	1
(0, 15)	1
(0, 16)	1
(0, 18)	2
(1, 4)	1
(1, 6)	1
(1, 7)	1
(1, 14)	1
(1, 16)	1
(1, 17)	1
(1, 18)	1
(1, 19)	1
(2, 0)	1
(2, 1)	1
(2, 3)	1
(2, 9)	1
(2, 11)	1
(2, 13)	1
(2, 16)	1


The connection between the corpus, the Vocabulary vocabulary_ and the vector created by transform can be seen in the following image:

We will apply the method toarray on our object token_count_matrix. It will return a dense ndarray representation of this matrix.

Just in case: you might see that people sometimes use todense instead of toarray.
Do not use todense! It returns a numpy.matrix, a class whose use is no longer recommended.

dense_tcm = token_count_matrix.toarray()
dense_tcm

Output:
array([[0, 0, 2, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 2, 0],
[0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1],
[1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0]])

The rows of this array correspond to the strings of our corpus. The length of a row corresponds to the length of the vocabulary. The i-th value in a row corresponds to the i-th entry of the list returned by the CountVectorizer method get_feature_names. If the value of dense_tcm[i][j] is equal to k, we know that the word with the index j in the vocabulary occurs k times in the string with the index i in the corpus.
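The relationship between corpus, vocabulary and count matrix can also be reproduced by hand. The following is a sketch in plain Python, assuming the default tokenisation (lowercasing, tokens of at least two word characters) and an alphabetically sorted vocabulary; the helper `tokenize` and the variable names are made up for this illustration.

```python
import re

corpus = ["To be, or not to be, that is the question:",
          "Whether 'tis nobler in the mind to suffer",
          "The slings and arrows of outrageous fortune,"]

def tokenize(s):
    """Lowercase the string and return all tokens of two or more word characters."""
    return re.findall(r"\b\w\w+\b", s.lower())

# vocabulary: every distinct word gets a column index (alphabetical order)
words = sorted({w for doc in corpus for w in tokenize(doc)})
vocabulary = {w: j for j, w in enumerate(words)}

# count matrix: one row per document, one column per vocabulary word
matrix = [[0] * len(vocabulary) for _ in corpus]
for i, doc in enumerate(corpus):
    for w in tokenize(doc):
        matrix[i][vocabulary[w]] += 1

j = vocabulary["be"]
print(j, matrix[0][j])   # column index of 'be' and its count in the first string
```

Here matrix[i][j] plays the role of dense_tcm[i][j]: the row selects the string, the column selects the word.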

This is visualized in the following diagram:

feature_names = vectorizer.get_feature_names()
for el in vectorizer.vocabulary_:
    print(el)

to
be
or
not
that
is
the
question
whether
tis
nobler
in
mind
suffer
slings
and
arrows
of
outrageous
fortune

import pandas as pd

pd.DataFrame(data=dense_tcm,
index=['corpus_0', 'corpus_1', 'corpus_2'],
columns=vectorizer.get_feature_names())

Output:
and arrows be fortune in is mind nobler not of or outrageous question slings suffer that the tis to whether
corpus_0 0 0 2 0 0 1 0 0 1 0 1 0 1 0 0 1 1 0 2 0
corpus_1 0 0 0 0 1 0 1 1 0 0 0 0 0 0 1 0 1 1 1 1
corpus_2 1 1 0 1 0 0 0 0 0 1 0 1 0 1 0 0 1 0 0 0

word = "be"
j = vectorizer.vocabulary_[word]   # column index of the word
print("number of times '" + word + "' occurs in:")
for i in range(len(corpus)):
    print("    '" + corpus[i] + "': " + str(dense_tcm[i][j]))

number of times 'be' occurs in:
'To be, or not to be, that is the question:': 2
'Whether 'tis nobler in the mind to suffer': 0
'The slings and arrows of outrageous fortune,': 0


We will now extract the token counts from a new text document. Let's use a variation of Hamlet's famous monologue of doubtful literary merit and check what transform has to say about it. transform will use the vocabulary which was previously fitted with fit.

txt = "That is the question and it is nobler in the mind."
vectorizer.transform([txt]).toarray()

Output:
array([[1, 0, 0, 0, 1, 2, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 2, 0, 0, 0]])
print(vectorizer.get_feature_names())

['and', 'arrows', 'be', 'fortune', 'in', 'is', 'mind', 'nobler', 'not', 'of', 'or', 'outrageous', 'question', 'slings', 'suffer', 'that', 'the', 'tis', 'to', 'whether']

print(vectorizer.vocabulary_)

{'to': 18, 'be': 2, 'or': 10, 'not': 8, 'that': 15, 'is': 5, 'the': 16, 'question': 12, 'whether': 19, 'tis': 17, 'nobler': 7, 'in': 4, 'mind': 6, 'suffer': 14, 'slings': 13, 'and': 0, 'arrows': 1, 'of': 9, 'outrageous': 11, 'fortune': 3}


### Word Importance

If you look at words like "the", "and" or "of", you will see that they occur in nearly all English texts. If you keep in mind that our ultimate goal is to differentiate between texts and assign them to classes, such words carry hardly any information. The following corpus contains common words like "you" and "I" as well as more significant words like "Python", "lottery" or "programmer":

from sklearn.feature_extraction import text
corpus = ["It does not matter what you are doing, just do it!",
          "Would you work if you won the lottery?",
          # note: no comma follows the next string, so Python joins it with the
          # string after it; the corpus therefore consists of four documents
          "You like Python, he likes Python, we like Python, everybody loves Python!"
          "You said: 'I wish I were a Python programmer'",
          "You can stay here, if you want to. I would, if I were you."
         ]

vectorizer = text.CountVectorizer()
vectorizer.fit(corpus)

token_count_matrix = vectorizer.transform(corpus)
print(token_count_matrix)

  (0, 0)	1
(0, 2)	1
(0, 3)	1
(0, 4)	1
(0, 9)	2
(0, 10)	1
(0, 15)	1
(0, 16)	1
(0, 26)	1
(0, 31)	1
(1, 8)	1
(1, 13)	1
(1, 21)	1
(1, 28)	1
(1, 29)	1
(1, 30)	1
(1, 31)	2
(2, 5)	1
(2, 6)	1
(2, 11)	2
(2, 12)	1
(2, 14)	1
(2, 17)	1
(2, 18)	5
(2, 19)	1
(2, 24)	1
(2, 25)	1
(2, 27)	1
(2, 31)	2
(3, 1)	1
(3, 7)	1
(3, 8)	2
(3, 20)	1
(3, 22)	1
(3, 23)	1
(3, 25)	1
(3, 30)	1
(3, 31)	3

tf_idf = text.TfidfTransformer()
tf_idf.fit(token_count_matrix)

tf_idf.idf_

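Output:
array([1.91629073, 1.91629073, 1.91629073, 1.91629073, 1.91629073,
1.91629073, 1.91629073, 1.91629073, 1.51082562, 1.91629073,
1.91629073, 1.91629073, 1.91629073, 1.91629073, 1.91629073,
1.91629073, 1.91629073, 1.91629073, 1.91629073, 1.91629073,
1.91629073, 1.91629073, 1.91629073, 1.91629073, 1.91629073,
1.51082562, 1.91629073, 1.91629073, 1.91629073, 1.91629073,
1.51082562, 1.        ])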
tf_idf.idf_[vectorizer.vocabulary_['python']]

Output:
1.916290731874155
da = vectorizer.transform(corpus).toarray()
i = 0

# check how often the word 'would' occurs in the i'th sentence:
word_ind = vectorizer.vocabulary_['would']
da[i][word_ind]
# ... and how often it occurs in each sentence of the corpus:
da[:, word_ind]

Output:
array([0, 1, 0, 1])
word_weight_list = list(zip(vectorizer.get_feature_names(), tf_idf.idf_))

word_weight_list.sort(key=lambda x: x[1])  # sort list by the weights (2nd component)
for word, idf_weight in word_weight_list:
    print(f"{word:15s}: {idf_weight:4.3f}")

you            : 1.000
if             : 1.511
were           : 1.511
would          : 1.511
are            : 1.916
can            : 1.916
do             : 1.916
does           : 1.916
doing          : 1.916
everybody      : 1.916
he             : 1.916
here           : 1.916
it             : 1.916
just           : 1.916
like           : 1.916
likes          : 1.916
lottery        : 1.916
loves          : 1.916
matter         : 1.916
not            : 1.916
programmer     : 1.916
python         : 1.916
said           : 1.916
stay           : 1.916
the            : 1.916
to             : 1.916
want           : 1.916
we             : 1.916
what           : 1.916
wish           : 1.916
won            : 1.916
work           : 1.916

from numpy import log
from sklearn.feature_extraction import text

corpus = ["It does not matter what you are doing, just do it!",
          "Would you work if you won the lottery?",
          # note: no comma follows the next string, so Python joins it with the
          # string after it; the corpus therefore consists of four documents
          "You like Python, he likes Python, we like Python, everybody loves Python!"
          "You said: 'I wish I were a Python programmer'",
          "You can stay here, if you want to. I would, if I were you."
         ]

n = len(corpus)

# the following variables are used globally (as free variables) in the functions :-(
vectorizer = text.CountVectorizer()
vectorizer.fit(corpus)
da = vectorizer.transform(corpus).toarray()


#### Term Frequency

We will first define a function for the term frequency.

Some notations:

• $f_{t, d}$ denotes the number of times that a term t occurs in document d
• ${wc}_d$ denotes the number of words in a document $d$

The simplest choice to define tf(t,d) is to use the raw count of a term in a document, i.e., the number of times that term t occurs in document d, which we denote as $f_{t, d}$.

We can define tf(t, d) in different ways:

• raw count of a term: $tf(t, d) = f_{t, d}$
• term frequency adjusted for document length: $tf(t, d) = \frac{f_{t, d}}{{wc}_d}$
• logarithmically scaled frequency: $tf(t, d) = \log(1 + f_{t, d})$
• augmented frequency, to prevent a bias towards longer documents, e.g. raw frequency of the term divided by the raw frequency of the most occurring term in the document: $tf(t, d) = 0.5 + 0.5 \cdot \frac{f_{t, d}}{\max_{t' \in d}{\{f_{t', d}\}} }$

def tf(t, d, mode="raw"):
    """ The Term Frequency 'tf' calculates how often a term 't'
        occurs in a document 'd'.  ('d': document index)
        'mode' selects one of the definitions given above:
        "raw", "length", "log" or "augfreq".
    """

    if t in vectorizer.vocabulary_:
        word_ind = vectorizer.vocabulary_[t]
        t_occurrences = da[d, word_ind]    # 'd' is the document index
    else:
        t_occurrences = 0
    if mode == "raw":
        result = t_occurrences
    elif mode == "length":
        # note: this divides by the number of different terms in d,
        # not by the total word count wc_d (which would be da[d].sum())
        all_terms = (da[d] > 0).sum()
        result = t_occurrences / all_terms
    elif mode == "log":
        result = log(1 + t_occurrences)
    elif mode == "augfreq":
        result = 0.5 + 0.5 * t_occurrences / da[d].max()
    else:
        raise ValueError(f"unknown mode: {mode}")

    return result


We will check the word frequencies for some words:

print("   raw    length log   augmented freq")
for term in ['matter', 'python', 'would']:
    for docu_index in range(len(corpus)):
        d = corpus[docu_index]
        print(f"\n'{term}' in '{d}''")
        for mode in ['raw', 'length', 'log', 'augfreq']:
            x = tf(term, docu_index, mode=mode)
            print(f"{x:7.2f}", end="")

   raw    length log   augmented freq

'matter' in 'It does not matter what you are doing, just do it!''
1.00   0.10   0.69   0.75
'matter' in 'Would you work if you won the lottery?''
0.00   0.00   0.00   0.50
'matter' in 'You like Python, he likes Python, we like Python, everybody loves Python!You said: 'I wish I were a Python programmer'''
0.00   0.00   0.00   0.50
'matter' in 'You can stay here, if you want to. I would, if I were you.''
0.00   0.00   0.00   0.50
'python' in 'It does not matter what you are doing, just do it!''
0.00   0.00   0.00   0.50
'python' in 'Would you work if you won the lottery?''
0.00   0.00   0.00   0.50
'python' in 'You like Python, he likes Python, we like Python, everybody loves Python!You said: 'I wish I were a Python programmer'''
5.00   0.42   1.79   1.00
'python' in 'You can stay here, if you want to. I would, if I were you.''
0.00   0.00   0.00   0.50
'would' in 'It does not matter what you are doing, just do it!''
0.00   0.00   0.00   0.50
'would' in 'Would you work if you won the lottery?''
1.00   0.14   0.69   0.75
'would' in 'You like Python, he likes Python, we like Python, everybody loves Python!You said: 'I wish I were a Python programmer'''
0.00   0.00   0.00   0.50
'would' in 'You can stay here, if you want to. I would, if I were you.''
1.00   0.11   0.69   0.67

#### Document Frequency

The document frequency df of a term t is defined as the number of documents in the document set that contain the term t.

$df(t) = |{\{d \in D: t \in d\}}|$

#### Inverse Document Frequency

The inverse document frequency is a measure of how much information the word provides, i.e., whether it is common or rare across all documents. It is the logarithmically scaled inverse fraction of the documents that contain the term. The effect of adding 1 to the idf in the equation below is that terms with zero idf, i.e., terms that occur in all documents of a training set, will not be entirely ignored.

$idf(t) = \log( \frac{n}{df(t)} ) + 1$

n is the number of documents in the corpus: $n = |D|$

(Note that the idf formula above differs from the standard textbook notation that defines the idf as $idf(t) = \log( \frac{n}{df(t) + 1})$.)

The formula above is used when TfidfTransformer() is called with smooth_idf=False. If it is called with smooth_idf=True (the default), the constant 1 is added to the numerator and denominator of the fraction, as if an extra document containing every term of the collection exactly once had been seen. This prevents divisions by zero:

$idf(t) = \log( \frac{n + 1}{df(t) + 1} ) + 1$
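As a quick numerical check, both variants of the formula can be evaluated with the standard library. The function name `idf_from_counts` is hypothetical and only serves this illustration; it takes the document frequency and the corpus size as plain numbers. We use n = 4, the size of the corpus above.

```python
from math import log

def idf_from_counts(df_t, n, smooth_idf=True):
    """Inverse document frequency for a term occurring in df_t of n documents."""
    if smooth_idf:
        return log((1 + n) / (1 + df_t)) + 1
    return log(n / df_t) + 1

n = 4
for df_t in (4, 2, 1):
    smoothed = idf_from_counts(df_t, n)
    unsmoothed = idf_from_counts(df_t, n, smooth_idf=False)
    print(f"df = {df_t}: smoothed {smoothed:.3f}, unsmoothed {unsmoothed:.3f}")
```

With smoothing, the values for df = 4, 2 and 1 are 1.000, 1.511 and 1.916, which are the weights listed earlier for "you", "would" and "python". A term that occurs in every document gets the weight 1 in both variants.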

#### Term frequency–Inverse document frequency

$\text{tf-idf}$ is calculated as the product of $tf(t, d)$ and $idf(t)$:

$\text{tf-idf}(t, d) = tf(t, d) \cdot idf(t)$

A high value of tf–idf means that the term has a high "term frequency" in the given document and a low "document frequency" in the corpus as a whole. This means that this weight can be used to filter out common terms.

The help text of text.TfidfTransformer explains how tf_idf is calculated. We will program these functions manually in the following:

def df(t):
    """ df(t) is the document frequency of t; the document frequency is
        the number of documents in the document set that contain the term t. """

    word_ind = vectorizer.vocabulary_[t]

    tf_in_docus = da[:, word_ind]  # vector with the frequencies of word_ind in all docus
    existence_in_docus = tf_in_docus > 0  # boolean vector, existence of word in docus
    return existence_in_docus.sum()

# df("would")

def idf(t, smooth_idf=True):
    """ idf(t) is the inverse document frequency of t (see the formulas above) """
    if smooth_idf:
        return log((1 + n) / (1 + df(t))) + 1
    else:
        return log(n / df(t)) + 1

def tf_idf(t, d):
    return idf(t) * tf(t, d)

res_idf = []
for word in vectorizer.get_feature_names():
    res_idf.append([word, idf(word)])

res_idf.sort(key=lambda x: x[1])
for item in res_idf:
    print(item)

['you', 1.0]
['if', 1.5108256237659907]
['were', 1.5108256237659907]
['would', 1.5108256237659907]
['are', 1.916290731874155]
['can', 1.916290731874155]
['do', 1.916290731874155]
['does', 1.916290731874155]
['doing', 1.916290731874155]
['everybody', 1.916290731874155]
['he', 1.916290731874155]
['here', 1.916290731874155]
['it', 1.916290731874155]
['just', 1.916290731874155]
['like', 1.916290731874155]
['likes', 1.916290731874155]
['lottery', 1.916290731874155]
['loves', 1.916290731874155]
['matter', 1.916290731874155]
['not', 1.916290731874155]
['programmer', 1.916290731874155]
['python', 1.916290731874155]
['said', 1.916290731874155]
['stay', 1.916290731874155]
['the', 1.916290731874155]
['to', 1.916290731874155]
['want', 1.916290731874155]
['we', 1.916290731874155]
['what', 1.916290731874155]
['wish', 1.916290731874155]
['won', 1.916290731874155]
['work', 1.916290731874155]

corpus

Output:
['It does not matter what you are doing, just do it!',
'Would you work if you won the lottery?',
"You like Python, he likes Python, we like Python, everybody loves Python!You said: 'I wish I were a Python programmer'",
'You can stay here, if you want to. I would, if I were you.']

for word, word_index in vectorizer.vocabulary_.items():
    print(f"\n{word:12s}: ", end="")
    for d_index in range(len(corpus)):
        print(f"{d_index:1d} {tf_idf(word, d_index):3.2f}, ", end="")

it          : 0 3.83, 1 0.00, 2 0.00, 3 0.00,
does        : 0 1.92, 1 0.00, 2 0.00, 3 0.00,
not         : 0 1.92, 1 0.00, 2 0.00, 3 0.00,
matter      : 0 1.92, 1 0.00, 2 0.00, 3 0.00,
what        : 0 1.92, 1 0.00, 2 0.00, 3 0.00,
you         : 0 1.00, 1 2.00, 2 2.00, 3 3.00,
are         : 0 1.92, 1 0.00, 2 0.00, 3 0.00,
doing       : 0 1.92, 1 0.00, 2 0.00, 3 0.00,
just        : 0 1.92, 1 0.00, 2 0.00, 3 0.00,
do          : 0 1.92, 1 0.00, 2 0.00, 3 0.00,
would       : 0 0.00, 1 1.51, 2 0.00, 3 1.51,
work        : 0 0.00, 1 1.92, 2 0.00, 3 0.00,
if          : 0 0.00, 1 1.51, 2 0.00, 3 3.02,
won         : 0 0.00, 1 1.92, 2 0.00, 3 0.00,
the         : 0 0.00, 1 1.92, 2 0.00, 3 0.00,
lottery     : 0 0.00, 1 1.92, 2 0.00, 3 0.00,
like        : 0 0.00, 1 0.00, 2 3.83, 3 0.00,
python      : 0 0.00, 1 0.00, 2 9.58, 3 0.00,
he          : 0 0.00, 1 0.00, 2 1.92, 3 0.00,
likes       : 0 0.00, 1 0.00, 2 1.92, 3 0.00,
we          : 0 0.00, 1 0.00, 2 1.92, 3 0.00,
everybody   : 0 0.00, 1 0.00, 2 1.92, 3 0.00,
loves       : 0 0.00, 1 0.00, 2 1.92, 3 0.00,
said        : 0 0.00, 1 0.00, 2 1.92, 3 0.00,
wish        : 0 0.00, 1 0.00, 2 1.92, 3 0.00,
were        : 0 0.00, 1 0.00, 2 1.51, 3 1.51,
programmer  : 0 0.00, 1 0.00, 2 1.92, 3 0.00,
can         : 0 0.00, 1 0.00, 2 0.00, 3 1.92,
stay        : 0 0.00, 1 0.00, 2 0.00, 3 1.92,
here        : 0 0.00, 1 0.00, 2 0.00, 3 1.92,
want        : 0 0.00, 1 0.00, 2 0.00, 3 1.92,
to          : 0 0.00, 1 0.00, 2 0.00, 3 1.92, 

#### Another Simple Example

We will use another simple example to illustrate the previously introduced concepts. We use a sentence in which every word occurs only once. The corpus consists of this sentence and shortened versions of it, which we create by cutting off words from the end of the sentence.

from sklearn.feature_extraction import text

words = "Cold wind blows over the cornfields".split()
corpus = []
for i in range(1, len(words)+1):
    corpus.append(" ".join(words[:i]))

print(corpus)

['Cold', 'Cold wind', 'Cold wind blows', 'Cold wind blows over', 'Cold wind blows over the', 'Cold wind blows over the cornfields']

vectorizer = text.CountVectorizer()

vectorizer = vectorizer.fit(corpus)
vectorized_text = vectorizer.transform(corpus)

tf_idf = text.TfidfTransformer()
tf_idf.fit(vectorized_text)

tf_idf.idf_

Output:
array([1.33647224, 1.        , 2.25276297, 1.55961579, 1.84729786,
1.15415068])

word_weight_list = list(zip(vectorizer.get_feature_names(), tf_idf.idf_))
word_weight_list.sort(key=lambda x: x[1])  # sort list by the weights (2nd component)
for word, idf_weight in word_weight_list:
    print(f"{word:15s}: {idf_weight:4.3f}")

cold           : 1.000
wind           : 1.154
blows          : 1.336
over           : 1.560
the            : 1.847
cornfields     : 2.253

tfidf_transformer = text.TfidfTransformer(smooth_idf=True, use_idf=True)
tfidf = tfidf_transformer.fit_transform(vectorized_text)

# idf weights of the newly fitted transformer (both arguments are the default values)
word_weight_list = list(zip(vectorizer.get_feature_names(), tfidf_transformer.idf_))
word_weight_list.sort(key=lambda x: x[1])  # sort list by the weights (2nd component)
for word, idf_weight in word_weight_list:
    print(f"{word:15s}: {idf_weight:4.3f}")

cold           : 1.000
wind           : 1.154
blows          : 1.336
over           : 1.560
the            : 1.847
cornfields     : 2.253
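These weights can be verified by hand, because the document frequencies follow directly from the way the corpus was built: the k-th word of the sentence (counting from 0) is missing from the first k prefixes, so it occurs in n - k of the n documents. The sketch below applies the smoothed idf formula with the standard library only; the dictionary name `idf_by_word` is chosen for this illustration.

```python
from math import log

words = "Cold wind blows over the cornfields".lower().split()
n = len(words)            # six documents: the prefixes of length 1 to 6

idf_by_word = {}
for k, word in enumerate(words):
    df_t = n - k          # the word occurs in the prefixes of length k+1 to n
    idf_by_word[word] = log((1 + n) / (1 + df_t)) + 1

for word, idf_weight in idf_by_word.items():
    print(f"{word:15s}: {idf_weight:4.3f}")
```

The earlier a word appears in the sentence, the more documents contain it and the lower its weight becomes.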


### Working With Real Data

scikit-learn contains a dataset from real newsgroups, which can be used for our purposes:

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer

import numpy as np

# Create our vectorizer
vectorizer = CountVectorizer()

# Let's fetch all the possible text data
newsgroups_data = fetch_20newsgroups()


Let us have a closer look at this data. As with all the other data sets in sklearn we can find the actual data under the attribute data:

print(newsgroups_data.data[0])

From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15

I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.

Thanks,
- IL
---- brought to you by your neighborhood Lerxst ----


print(newsgroups_data.data[200])

Subject: Re: "Proper gun control?" What is proper gun cont
From: kim39@scws8.harvard.edu (John Kim)
Organization: Harvard University Science Center
Nntp-Posting-Host: scws8.harvard.edu
Lines: 17

In article <C5JGz5.34J@SSD.intel.com> hays@ssd.intel.com (Kirk Hays) writes:
>I'd like to point out that I was in error - "Terminator" began posting only
>six months before he purchased his first firearm, according to private email
>from him.
>I can't produce an archived posting of his earlier than January 1992,
>and he purchased his first firearm in March 1992.
>I guess it only seemed like years.
>Kirk Hays - NRA Life, seventh generation.

I first read and consulted rec.guns in the summer of 1991.  I
just purchased my first firearm in early March of this year.

NOt for lack of desire for a firearm, you understand.  I could
have purchased a rifle or shotgun but didn't want one.
-Case Kim



We fit the vectorizer to the newsgroup data:

vectorizer.fit(newsgroups_data.data)

Output:
CountVectorizer(analyzer='word', binary=False, decode_error='strict',
dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
lowercase=True, max_df=1.0, max_features=None, min_df=1,
ngram_range=(1, 1), preprocessor=None, stop_words=None,
strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
tokenizer=None, vocabulary=None)

Let's have a look at the first few entries of the vocabulary:

counter = 0
n = 10
for word, index in vectorizer.vocabulary_.items():
    print(word, index)
    counter += 1
    if counter > n:   # the loop stops after n + 1 entries
        break

from 56979
lerxst 75358
wam 123162
umd 118280
edu 50527
where 124031
my 85354
thing 114688
subject 111322
what 123984
car 37780


We can turn the newsgroup postings into arrays. We do it with the first one:

a = vectorizer.transform([newsgroups_data.data[0]]).toarray()[0]
print(a)

[0 0 0 ... 0 0 0]


The vocabulary is huge. This is why we see mostly zeros.

len(vectorizer.vocabulary_)

Output:
130107
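A count vector that consists mostly of zeros is the reason why transform returns a sparse matrix rather than a dense array. The idea can be sketched in plain Python by keeping only the non-zero entries. The dictionary layout below is an illustration only, not the storage format which scikit-learn (via SciPy) actually uses; the example row is the first row of dense_tcm from the Hamlet corpus.

```python
# first row of the dense token count matrix of the Hamlet corpus
dense_row = [0, 0, 2, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 2, 0]

# sparse view: column index -> count, zeros are simply not stored
sparse_row = {j: count for j, count in enumerate(dense_row) if count != 0}
print(sparse_row)

# fraction of entries that actually need to be stored
density = len(sparse_row) / len(dense_row)
print(density)
```

The pairs in sparse_row correspond to the lines (0, 2) 2, (0, 5) 1 and so on in the printed sparse matrix. The larger the vocabulary, the smaller the fraction of non-zero entries per document tends to be.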

There are a lot of 'rubbish' words in this vocabulary; rubbish, that is, from the perspective of machine learning. For machine learning purposes words like 'Subject', 'From', 'Organization', 'Nntp-Posting-Host', 'Lines' and many others are useless, because they occur in all or in most postings. This technical 'garbage' from the newsgroup can easily be stripped off. We can fetch the data differently, stating that we do not want 'headers', 'footers' and 'quotes':

newsgroups_data_cleaned = fetch_20newsgroups(remove=('headers', 'footers', 'quotes'))

print(newsgroups_data_cleaned.data[0])

I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.


Let's have a look at the complete posting:

print(newsgroups_data.data[0])

From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15

I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.

Thanks,
- IL
---- brought to you by your neighborhood Lerxst ----


vectorizer_cleaned = vectorizer.fit(newsgroups_data_cleaned.data)
len(vectorizer_cleaned.vocabulary_)

Output:
101631

So, we got rid of more than 28000 words, but with more than 100000 words the vocabulary is still very large.

We can also directly separate the newsgroup feeds into a train and test set:

newsgroups_train = fetch_20newsgroups(subset='train')
newsgroups_test = fetch_20newsgroups(subset='test')

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer

from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

vectorizer = CountVectorizer()

train_data = vectorizer.fit_transform(newsgroups_train.data)

# creating a classifier
classifier = MultinomialNB(alpha=.01)
classifier.fit(train_data, newsgroups_train.target)

test_data = vectorizer.transform(newsgroups_test.data)

predictions = classifier.predict(test_data)
accuracy_score = metrics.accuracy_score(newsgroups_test.target,
predictions)
f1_score = metrics.f1_score(newsgroups_test.target,
predictions,
average='macro')

print("Accuracy score: ", accuracy_score)
print("F1 score: ", f1_score)


### Stop Words

So far we added all the words to the vocabulary. However, it is questionable whether words like "the", "am", "were" or similar words should be included at all, since they usually do not contribute significantly to the meaning of a text. In other words: they have limited predictive power. It therefore makes sense to exclude such words from the vocabulary. This means we have to provide a list of words which should be ignored, i.e. filtered out before or after the text is processed. In natural language processing such words are usually called "stop words". There is no single universal list of stop words used by all natural language processing tools. Usually, stop words consist of the most frequently used words in a language, but they can also be chosen individually for a given task.
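The filtering step itself is simple. The following sketch removes stop words from a tokenised sentence using only the standard library; the tiny stop word set and the helper name `remove_stop_words` are made up for illustration.

```python
import re

# a deliberately tiny, hand-picked stop word set (illustration only)
stop_words = {"the", "am", "were", "is", "of"}

def remove_stop_words(sentence, stop_words):
    """Tokenise the sentence and drop every token found in stop_words."""
    tokens = re.findall(r"\b\w\w+\b", sentence.lower())
    return [t for t in tokens if t not in stop_words]

filtered = remove_stop_words("The slings and arrows of outrageous fortune",
                             stop_words)
print(filtered)
```

A set is used for the stop words because membership tests on a set are fast, which matters once the list contains hundreds of words.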

By the way, stop words are an idea which is quite old. It goes back to 1959 and Hans Peter Luhn, one of the pioneers in information retrieval.

There are different ways to provide stop words in sklearn:

• Explicit list of stop words
• Automatically created stop words

#### Individual Stop Words

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["A horse, a horse, my kingdom for a horse!",
          "Horse sense is the thing a horse has which keeps it from betting on people.",
          "I’ve often said there is nothing better for the inside of the man, than the outside of the horse.",
          "A man on a horse is spiritually, as well as physically, bigger then a man on foot.",
          "No heaven can heaven be, if my horse isn’t there to welcome me."]

# the corpus is passed to fit_transform, not to the constructor
cv = CountVectorizer(stop_words=["my", "for", "the", "has", "than", "if",
                                 "from", "on", "of", "it", "there", "ve",
                                 "as", "no", "be", "which", "isn", "to",
                                 "me", "is", "can", "then"])
count_vector = cv.fit_transform(corpus)
count_vector.shape

cv.vocabulary_

Output:
{'horse': 5,
'kingdom': 8,
'sense': 16,
'thing': 18,
'keeps': 7,
'betting': 1,
'people': 13,
'often': 11,
'said': 15,
'nothing': 10,
'better': 0,
'inside': 6,
'man': 9,
'outside': 12,
'spiritually': 17,
'well': 20,
'physically': 14,
'bigger': 2,
'foot': 3,
'heaven': 4,
'welcome': 19}

sklearn contains a default list of stop words, which is implemented as a frozenset and can be accessed with text.ENGLISH_STOP_WORDS:

from sklearn.feature_extraction import text
n = 25
print(str(n) + " arbitrary words from ENGLISH_STOP_WORDS:")
counter = 0
for word in text.ENGLISH_STOP_WORDS:
    if counter == n - 1:
        print(word)
        break
    print(word, end=", ")
    counter += 1

25 arbitrary words from ENGLISH_STOP_WORDS:
has, hereafter, thick, are, nothing, before, you, this, however, any, often, third, other, own, yourself, they, least, yet, something, fill, cannot, as, within, whereupon, where


We can use stop words in our 20newsgroups classification problem:

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer

from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

# stop_words="english" selects the built-in list ENGLISH_STOP_WORDS
vectorizer = CountVectorizer(stop_words="english")

vectors = vectorizer.fit_transform(newsgroups_train.data)

# creating a classifier
classifier = MultinomialNB(alpha=.01)
classifier.fit(vectors, newsgroups_train.target)

vectors_test = vectorizer.transform(newsgroups_test.data)

predictions = classifier.predict(vectors_test)
accuracy_score = metrics.accuracy_score(newsgroups_test.target,
predictions)
f1_score = metrics.f1_score(newsgroups_test.target,
predictions,
average='macro')

print("accuracy score: ", accuracy_score)
print("F1-score: ", f1_score)


#### Automatically Created Stop Words

As in many other cases, it is a good idea to look for ways to define the list of stop words automatically, so that the list is adapted as closely as possible to the problem at hand.

To automatically create a stop word list, we will start with the parameter min_df of CountVectorizer. When you set this threshold parameter, terms that have a document frequency strictly lower than the given threshold will be ignored. This value is also called cut-off in the literature. If a float value in the range of [0.0, 1.0] is used, the parameter represents a proportion of documents. An integer will be treated as absolute counts. This parameter is ignored if vocabulary is not None.

corpus = ["""People say you cannot live without love,
but I think oxygen is more important""",
          "Sometimes, when you close your eyes, you cannot see.",
          "A horse, a horse, my kingdom for a horse!",
          """Horse sense is the thing a horse has which
keeps it from betting on people.""",
          """I’ve often said there is nothing better for
the inside of the man, than the outside of the horse.""",
          """A man on a horse is spiritually, as well as physically,
bigger then a man on foot.""",
          """No heaven can heaven be, if my horse isn’t there
to welcome me."""]

# the corpus is passed to fit_transform, not to the constructor
cv = CountVectorizer(min_df=2)
count_vector = cv.fit_transform(corpus)
cv.vocabulary_


Hardly any words from our corpus are left. Because the corpus contains only a few documents (strings) and these texts are very short, the number of words which occur in fewer than two documents is very high. With min_df=2 we eliminated all the words which occur in fewer than two documents.
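
To make the cut-off concrete, the document frequency of a word can also be counted by hand. The following is a minimal pure-Python sketch under stated assumptions, not scikit-learn's implementation: the helper names document_frequencies and split_by_min_df and the three-sentence corpus are hypothetical, and the tokenization imitates the default of CountVectorizer (lowercased tokens of at least two word characters).

```python
import re
from collections import Counter

TOKEN_PATTERN = re.compile(r"\b\w\w+\b")  # two or more word characters


def document_frequencies(docs):
    """Count in how many documents each token occurs."""
    doc_freq = Counter()
    for doc in docs:
        # set(): a word counts once per document, however often it occurs in it
        doc_freq.update(set(TOKEN_PATTERN.findall(doc.lower())))
    return doc_freq


def split_by_min_df(docs, min_df):
    """Return (kept, dropped) tokens for an integer cut-off min_df."""
    doc_freq = document_frequencies(docs)
    kept = sorted(t for t, df in doc_freq.items() if df >= min_df)
    dropped = sorted(t for t, df in doc_freq.items() if df < min_df)
    return kept, dropped


# hypothetical mini corpus
docs = ["A horse, a horse, my kingdom for a horse!",
        "A man on a horse is bigger than a man on foot.",
        "No man is an island."]
kept, dropped = split_by_min_df(docs, min_df=2)
print("kept:   ", kept)
print("dropped:", dropped)
```

Note that "horse" occurs three times in the first sentence but contributes only one to its document frequency; what matters is the number of documents, not the number of occurrences.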

We can also see the words which have been chosen as stopwords by looking at cv.stop_words_:

cv.stop_words_

print("min_df, size of vocabulary, stop_words list size")
# start at 1: a cut-off of 0 would remove nothing anyway
for i in range(1, len(corpus) + 1):
    cv = CountVectorizer(min_df=i)
    try:
        count_vector = cv.fit_transform(corpus)
    except ValueError:
        # no word occurs in at least i documents, so the vocabulary would be empty
        break
    len_voc = len(cv.vocabulary_)
    len_stop_words = len(cv.stop_words_)
    print(f"{i:10d} {len_voc:15d} {len_stop_words:19d}")



Another parameter of CountVectorizer with which we can create a corpus-specific stop word list is max_df. It can be a float value between 0.0 and 1.0 or an integer. The default value is 1.0, i.e. the float value 1.0 and not the integer 1! When building the vocabulary, all terms that have a document frequency strictly higher than the given threshold will be ignored. If this parameter is given as a float between 0.0 and 1.0, it represents a proportion of documents. This parameter is ignored if vocabulary is not None.
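
The difference between the float 1.0 and the integer 1 can be illustrated with a small pure-Python sketch. It only imitates the rule described above and is not scikit-learn's code; the function names max_doc_count and too_frequent as well as the document frequencies are made up for the example.

```python
from numbers import Integral


def max_doc_count(max_df, n_docs):
    """Translate max_df into an absolute number of documents."""
    if isinstance(max_df, Integral):
        return max_df          # integer: absolute count
    return max_df * n_docs     # float: proportion of all documents


def too_frequent(doc_freq, max_df, n_docs):
    """Terms whose document frequency is strictly higher than the threshold."""
    limit = max_doc_count(max_df, n_docs)
    return sorted(t for t, df in doc_freq.items() if df > limit)


# hypothetical document frequencies in a corpus of five documents
doc_freq = {"horse": 5, "man": 2, "heaven": 1}
n_docs = 5

as_float = too_frequent(doc_freq, 1.0, n_docs)  # threshold: 100 % of the documents
as_int = too_frequent(doc_freq, 1, n_docs)      # threshold: one document
print(as_float)
print(as_int)
```

With the float 1.0 nothing is removed, because no word can occur in more than all documents. With the integer 1 every word that occurs in more than one document is removed.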

Let us use again our previous corpus for an example.

# the corpus is passed to fit_transform, not to the constructor
cv = CountVectorizer(max_df=0.20)
count_vector = cv.fit_transform(corpus)
cv.stop_words_


### Exercises

#### Exercise 1

In the subdirectory 'books' you will find some books:

Use these novels as the corpus and create a word count vector.

#### Exercise 2

Turn the previously calculated 'word count vector' into a dense ndarray representation.

#### Exercise 3

Let us have another example with a different corpus. The five strings are famous quotes from

1. William Shakespeare
2. W.C. Fields
3. Ronald Reagan
4. John Steinbeck
5. Author unknown

Compute the IDF values!

quotes = ["A horse, a horse, my kingdom for a horse!",
          "Horse sense is the thing a horse has which keeps it from betting on people.",
          "I’ve often said there is nothing better for the inside of the man, than the outside of the horse.",
          "A man on a horse is spiritually, as well as physically, bigger then a man on foot.",
          "No heaven can heaven be, if my horse isn’t there to welcome me."]


### Solutions

#### Solution to Exercise 1

books = ["night_and_day_virginia_woolf.txt",
         "the_way_of_all_flash_butler.txt",
         "moby_dick_melville.txt",
         "sons_and_lovers_lawrence.txt",
         "robinson_crusoe_defoe.txt",
         "james_joyce_ulysses.txt"]
path = "books"

corpus = []
for book in books:
    # read each novel completely into one string
    with open(path + "/" + book) as fh:
        corpus.append(fh.read())

# show the first 30 characters of every text
[book[:30] for book in corpus]


We have to get rid of the Gutenberg header and footer, because they do not belong to the novels. Looking at the texts, we can see that the author's work begins after a line of the first kind below and ends before a line of the second kind:

***START OF THIS PROJECT GUTENBERG ... ***

***END OF THIS PROJECT GUTENBERG EBOOK ...***

There may or may not be a space after the first three stars, and instead of "THE" there may be "THIS".

We can use regular expressions to find the starting point of the novels:

from sklearn.feature_extraction import text
import re

books = ["night_and_day_virginia_woolf.txt",
         "the_way_of_all_flash_butler.txt",
         "moby_dick_melville.txt",
         "sons_and_lovers_lawrence.txt",
         "robinson_crusoe_defoe.txt",
         "james_joyce_ulysses.txt"]
path = "books"

corpus = []
for book in books:
    with open(path + "/" + book) as fh:
        txt = fh.read()
    # marker lines: optional space after the stars, "THE" or "THIS"
    text_begin = re.search(r"\*\*\* ?START OF (THE|THIS) PROJECT.*?\*\*\*", txt, re.DOTALL)
    text_end = re.search(r"\*\*\* ?END OF (THE|THIS) PROJECT.*?\*\*\*", txt, re.DOTALL)
    # keep only the text between the two marker lines
    corpus.append(txt[text_begin.end():text_end.start()])
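
The effect of the two regular expressions can be tried out on a small synthetic string without the book files. The sketch below is only an illustration: the function name strip_gutenberg and the sample text are invented, and the fallback for texts without marker lines is an addition that the solution above does not have (there, a missing marker would raise an AttributeError).

```python
import re

START = re.compile(r"\*\*\* ?START OF (?:THE|THIS) PROJECT.*?\*\*\*", re.DOTALL)
END = re.compile(r"\*\*\* ?END OF (?:THE|THIS) PROJECT.*?\*\*\*", re.DOTALL)


def strip_gutenberg(txt):
    """Return the text between the START and END marker lines."""
    begin = START.search(txt)
    end = END.search(txt)
    # fall back to the full text if a marker line is missing
    first = begin.end() if begin else 0
    last = end.start() if end else len(txt)
    return txt[first:last].strip()


# synthetic sample, not taken from a real Gutenberg file
sample = ("Produced by volunteers.\n"
          "*** START OF THIS PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
          "Call me Ishmael.\n"
          "***END OF THE PROJECT GUTENBERG EBOOK EXAMPLE***\n"
          "License text follows.")
novel = strip_gutenberg(sample)
print(novel)
```

The sample deliberately mixes both variants: a space and "THIS" in the first marker line, no space and "THE" in the second.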

vectorizer = text.CountVectorizer()
vectorizer.fit(corpus)
token_count_matrix = vectorizer.transform(corpus)
print(token_count_matrix)

print("Number of words in vocabulary: ", len(vectorizer.vocabulary_))


#### Solution to Exercise 2

All you have to do is apply the method toarray to token_count_matrix:

token_count_matrix.toarray()
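
Printing the sparse matrix lists only the non-zero counts together with their (document, word index) positions, whereas the dense representation also stores all the zeros. The following pure-Python sketch shows the idea of this conversion under simplifying assumptions; it does not use SciPy, and the function name to_dense and the entries are hypothetical.

```python
def to_dense(entries, n_rows, n_cols):
    """Build a dense table (list of rows) from {(row, column): count}."""
    dense = [[0] * n_cols for _ in range(n_rows)]
    for (row, col), count in entries.items():
        dense[row][col] = count
    return dense


# hypothetical non-zero counts of two documents over a vocabulary of five words
entries = {(0, 2): 3, (0, 3): 1, (1, 0): 1, (1, 2): 1}
dense = to_dense(entries, n_rows=2, n_cols=5)
for row in dense:
    print(row)
```

For whole novels the vocabulary is large, so the dense form needs noticeably more memory than the sparse one.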


#### Solution to Exercise 3

from sklearn.feature_extraction import text

# our corpus:
quotes = ["A horse, a horse, my kingdom for a horse!",
          "Horse sense is the thing a horse has which keeps it from betting on people.",
          "I’ve often said there is nothing better for the inside of the man, than the outside of the horse.",
          "A man on a horse is spiritually, as well as physically, bigger then a man on foot.",
          "No heaven can heaven be, if my horse isn’t there to welcome me."]

vectorizer = text.CountVectorizer()
vectorized_text = vectorizer.fit_transform(quotes)

tfidf_transformer = text.TfidfTransformer(smooth_idf=True, use_idf=True)
tfidf_transformer.fit(vectorized_text)

"""
alternative way to output the data:
import pandas as pd
df_idf = pd.DataFrame(tfidf_transformer.idf_,
                      index=vectorizer.get_feature_names_out(),
                      columns=["idf_weight"])
df_idf = df_idf.sort_values(by=["idf_weight"])  # sorting data
print(df_idf)

"""
print(f"{'word':15s}: idf_weight")
# get_feature_names_out replaces the former get_feature_names
word_weight_list = list(zip(vectorizer.get_feature_names_out(), tfidf_transformer.idf_))
word_weight_list.sort(key=lambda x: x[1])  # sort list by the weights (2nd component)
for word, idf_weight in word_weight_list:
    print(f"{word:15s}: {idf_weight:4.3f}")
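
The printed weights can be checked by hand. With smooth_idf=True, the scikit-learn documentation gives the weight of a term t as idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) the number of documents containing t. The sketch below evaluates this formula in plain Python; the function name smooth_idf and the chosen document frequencies are only for illustration.

```python
import math


def smooth_idf(n_docs, doc_freq):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1


n_docs = 5  # five quotes
for doc_freq in (5, 2, 1):
    print(f"df = {doc_freq}: idf = {smooth_idf(n_docs, doc_freq):4.3f}")
```

A word that occurs in every document gets the lowest possible weight of 1, and the weight grows as the word becomes rarer.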


### Footnotes

1 Logically, toarray and todense do the same thing, but toarray returns an ndarray whereas todense returns a matrix. If you consider what the official NumPy documentation has to say about the numpy.matrix class, you shouldn't use todense! "It is no longer recommended to use this class, even for linear algebra. Instead use regular arrays. The class may be removed in the future." (numpy.matrix)