Encoding Text for Machine Learning

Introduction


We mentioned in the introductory chapter of our tutorial that a spam filter for emails is a typical example of machine learning. Emails consist of text, so a classifier for emails must be able to process text as input. The neural networks in the previous examples always worked directly on numerical values and had a fixed input length. The characters of a text are ultimately stored as numerical values as well, but we obviously cannot feed a text as it is into a neural network. The text has to be converted into a numerical representation first, e.g. vectors or arrays of numbers.

We will learn in this tutorial how to encode text in a way which is suitable for machine processing.

Bag-of-Words Model

If we want to use texts in machine learning, we need a representation of the text which is usable for Machine Learning purposes. This means we need a numerical representation. We cannot use texts directly.

Bag of Words

In natural language processing and information retrieval the bag-of-words model is of crucial importance. The bag-of-words model can be used to represent text data in a way which is suitable for machine learning algorithms. Furthermore, this model is easy and efficient to implement. In the bag-of-words model, a text (such as a sentence or a document) is represented as the so-called bag (a multiset) of its words: we record how often each word occurs, while grammar and word order are ignored.
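To make the idea of a "bag" concrete, here is a minimal sketch that uses only the Python standard library. The helper name bag_of_words and the two example sentences are made up for illustration; this is not how scikit-learn implements the model.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Return a multiset (Counter) of the lowercased words in text."""
    return Counter(re.findall(r"\w+", text.lower()))

a = bag_of_words("the dog bites the man")
b = bag_of_words("the man bites the dog")

print(a)        # counts per word; the order of the words is not stored
print(a == b)   # True: both sentences are mapped to the same bag
```

The two sentences mean very different things, but their bags are identical. This is exactly the information that the bag-of-words model gives up.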

We will use in the following a list of three strings to demonstrate the bag-of-words approach. In linguistics, the collection of texts used for the experiments or tests is usually called a corpus:

corpus = ["To be, or not to be, that is the question:",
          "Whether 'tis nobler in the mind to suffer",
          "The slings and arrows of outrageous fortune,"]

We will use the submodule text from sklearn.feature_extraction. This module contains utilities to build feature vectors from text documents.

from sklearn.feature_extraction import text

CountVectorizer is a class in the module sklearn.feature_extraction.text. It is useful for building a corpus vocabulary. In addition, it produces the numerical representation of text that we need, i.e. a matrix of token counts (a SciPy sparse matrix, which can be converted into a NumPy array).

First we need an instance of this class. When we instantiate a CountVectorizer, we can pass some optional parameters, but it is possible to call it with no arguments, as we will do in the following. Printing the vectorizer gives us useful information about the default values used when the instance was created:

vectorizer = text.CountVectorizer()
print(vectorizer)
CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 1), preprocessor=None, stop_words=None,
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)

We now have an instance of CountVectorizer, but it has not seen any texts so far. We will use the method fit to process our previously defined corpus. This learns a vocabulary dictionary of all the tokens (strings) of the corpus:

x = vectorizer.fit(corpus)
print(x)
CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 1), preprocessor=None, stop_words=None,
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)

fit created the vocabulary structure vocabulary_. This contains the words of the text as keys and a unique integer value for each word. As the default value for the parameter lowercase is True, the To at the beginning of the text has been turned into to. You may also notice that the vocabulary contains only words, without any punctuation or special characters. You can change this behaviour by passing a regular expression to the keyword parameter token_pattern when you instantiate CountVectorizer. The default is (?u)\b\w\w+\b. The (?u) part of this regular expression is not strictly necessary, because it switches on the re.U (re.UNICODE) flag, which is the default for string patterns in Python 3 anyway. The pattern requires at least two word characters, so the minimal word length is two characters:

print("Vocabulary: ", vectorizer.vocabulary_)
Vocabulary:  {'to': 18, 'be': 2, 'or': 10, 'not': 8, 'that': 15, 'is': 5, 'the': 16, 'question': 12, 'whether': 19, 'tis': 17, 'nobler': 7, 'in': 4, 'mind': 6, 'suffer': 14, 'slings': 13, 'and': 0, 'arrows': 1, 'of': 9, 'outrageous': 11, 'fortune': 3}
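The tokenization that leads to these words can be imitated with the re module of the standard library. The following is only a sketch of the idea, assuming lowercasing followed by a search with the default token pattern; the function name tokenize is made up for this example.

```python
import re

TOKEN_PATTERN = r"(?u)\b\w\w+\b"   # the default token pattern of CountVectorizer

def tokenize(doc, pattern=TOKEN_PATTERN):
    """Lowercase the document and return all tokens matching the pattern."""
    return re.findall(pattern, doc.lower())

tokens = tokenize("To be, or not to be, that is the question:")
print(tokens)
# ['to', 'be', 'or', 'not', 'to', 'be', 'that', 'is', 'the', 'question']

# words with a single character such as 'I' and 'a' do not match
short = tokenize("I wish I were a Python programmer")
print(short)
# ['wish', 'were', 'python', 'programmer']
```

Punctuation disappears because it never matches \w, and one-letter words are dropped because the pattern demands at least two word characters.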

If you only want to see the words without the indices, you can use the method get_feature_names (newer scikit-learn versions provide get_feature_names_out instead):

print(vectorizer.get_feature_names())
['and', 'arrows', 'be', 'fortune', 'in', 'is', 'mind', 'nobler', 'not', 'of', 'or', 'outrageous', 'question', 'slings', 'suffer', 'that', 'the', 'tis', 'to', 'whether']

Alternatively, you can apply keys to the vocabulary. This keeps the order in which the words were first encountered:

print(list(vectorizer.vocabulary_.keys()))
['to', 'be', 'or', 'not', 'that', 'is', 'the', 'question', 'whether', 'tis', 'nobler', 'in', 'mind', 'suffer', 'slings', 'and', 'arrows', 'of', 'outrageous', 'fortune']

With the aid of transform we will extract the token counts out of the raw text documents. The call will use the vocabulary which we created with fit:

token_count_matrix = vectorizer.transform(corpus)
print(token_count_matrix)
  (0, 2)	2
  (0, 5)	1
  (0, 8)	1
  (0, 10)	1
  (0, 12)	1
  (0, 15)	1
  (0, 16)	1
  (0, 18)	2
  (1, 4)	1
  (1, 6)	1
  (1, 7)	1
  (1, 14)	1
  (1, 16)	1
  (1, 17)	1
  (1, 18)	1
  (1, 19)	1
  (2, 0)	1
  (2, 1)	1
  (2, 3)	1
  (2, 9)	1
  (2, 11)	1
  (2, 13)	1
  (2, 16)	1
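Each line of this output has the form (row, column) followed by a count: the row is the index of the document in the corpus, the column is the index of the word in the vocabulary. Entries with a count of zero are not stored at all. The following sketch, which uses plain Python data structures rather than the SciPy sparse matrix itself, expands the entries of the first document into a full row:

```python
# (row, column) -> count, copied from the entries of document 0 above
entries = {(0, 2): 2, (0, 5): 1, (0, 8): 1, (0, 10): 1,
           (0, 12): 1, (0, 15): 1, (0, 16): 1, (0, 18): 2}

n_features = 20   # size of the vocabulary fitted above

row0 = [0] * n_features
for (row, col), count in entries.items():
    if row == 0:
        row0[col] = count

print(row0)
# [0, 0, 2, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 2, 0]
```

Only 8 of the 20 positions are non-zero here. With a large vocabulary the share of zeros becomes much bigger, which is why a sparse format is used.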

The connection between the corpus, the Vocabulary vocabulary_ and the vector created by transform can be seen in the following image:

BoW, Corpus and Transform vector

We will apply the method toarray on our object token_count_matrix. It will return a dense ndarray representation of this matrix.

Just in case: you might see that people sometimes use todense instead of toarray.
Do not use todense! It returns a numpy.matrix object instead of an ndarray.

dense_tcm = token_count_matrix.toarray()
dense_tcm
Output:
array([[0, 0, 2, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 2, 0],
       [0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1],
       [1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0]])

The rows of this array correspond to the strings of our corpus. The length of a row corresponds to the length of the vocabulary. The j-th value in a row corresponds to the j-th entry of the vocabulary. If the value of dense_tcm[i][j] is equal to k, we know that the word with the index j in the vocabulary occurs k times in the string with the index i in the corpus.

This is visualized in the following diagram:

Dense Array and words of vocabulary

word = "be"
j = vectorizer.vocabulary_[word]   # column index of the word
print("number of times '" + word + "' occurs in:")
for i in range(len(corpus)):
    print("    '" + corpus[i] + "': " + str(dense_tcm[i][j]))
number of times 'be' occurs in:
    'To be, or not to be, that is the question:': 2
    'Whether 'tis nobler in the mind to suffer': 0
    'The slings and arrows of outrageous fortune,': 0
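The relationship between corpus, vocabulary and count matrix can also be reproduced without scikit-learn. The following sketch assumes the default behaviour described earlier (lowercasing, tokens of at least two word characters, column indices in alphabetical order of the words). It is meant as an illustration of the data structure, not as a replacement for CountVectorizer.

```python
import re

corpus = ["To be, or not to be, that is the question:",
          "Whether 'tis nobler in the mind to suffer",
          "The slings and arrows of outrageous fortune,"]

def tokenize(doc):
    """Lowercase the document and keep tokens of two or more word characters."""
    return re.findall(r"\b\w\w+\b", doc.lower())

# vocabulary: alphabetically sorted words mapped to column indices
all_words = sorted({token for doc in corpus for token in tokenize(doc)})
vocabulary = {word: index for index, word in enumerate(all_words)}

# one row per document, one column per vocabulary entry
matrix = [[0] * len(vocabulary) for _ in corpus]
for i, doc in enumerate(corpus):
    for token in tokenize(doc):
        matrix[i][vocabulary[token]] += 1

for row in matrix:
    print(row)
```

matrix[i][vocabulary[word]] then plays the same role as dense_tcm[i][j] in the text.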

We will now extract the token counts out of a new text document. Let's use a variation of Hamlet's famous monologue of doubtful literary quality and check what transform has to say about it. transform will use the vocabulary which was previously fitted with fit. Words which are not part of this vocabulary, like "it", are ignored.

txt = "That is the question and it is nobler in the mind."
vectorizer.transform([txt]).toarray()
Output:
array([[1, 0, 0, 0, 1, 2, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 2, 0, 0, 0]])
print(vectorizer.get_feature_names())
['and', 'arrows', 'be', 'fortune', 'in', 'is', 'mind', 'nobler', 'not', 'of', 'or', 'outrageous', 'question', 'slings', 'suffer', 'that', 'the', 'tis', 'to', 'whether']
print(vectorizer.vocabulary_)
{'to': 18, 'be': 2, 'or': 10, 'not': 8, 'that': 15, 'is': 5, 'the': 16, 'question': 12, 'whether': 19, 'tis': 17, 'nobler': 7, 'in': 4, 'mind': 6, 'suffer': 14, 'slings': 13, 'and': 0, 'arrows': 1, 'of': 9, 'outrageous': 11, 'fortune': 3}
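What happens to words that were not seen during fit can be illustrated with a small stand-in for transform. This is a sketch in plain Python; the function transform_sketch is hypothetical and only mimics the counting step for a single document.

```python
import re

# vocabulary as printed above (word -> column index)
vocabulary = {'to': 18, 'be': 2, 'or': 10, 'not': 8, 'that': 15, 'is': 5,
              'the': 16, 'question': 12, 'whether': 19, 'tis': 17,
              'nobler': 7, 'in': 4, 'mind': 6, 'suffer': 14, 'slings': 13,
              'and': 0, 'arrows': 1, 'of': 9, 'outrageous': 11, 'fortune': 3}

def transform_sketch(doc, vocabulary):
    """Count the known words of doc; words outside the vocabulary are skipped."""
    row = [0] * len(vocabulary)
    for token in re.findall(r"\b\w\w+\b", doc.lower()):
        index = vocabulary.get(token)
        if index is not None:
            row[index] += 1
    return row

txt = "That is the question and it is nobler in the mind."
row = transform_sketch(txt, vocabulary)
print(row)
# [1, 0, 0, 0, 1, 2, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 2, 0, 0, 0]

unknown = [t for t in re.findall(r"\b\w\w+\b", txt.lower()) if t not in vocabulary]
print(unknown)
# ['it']
```

The word "it" leaves no trace in the resulting vector, because the fitted vocabulary has no column for it.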

Word Importance

If you look at words like "the", "and" or "of", you will see that they occur in nearly all English texts. If you keep in mind that our ultimate goal will be to differentiate between texts and attribute them to classes, words like these will bear hardly any meaning. In the following corpus, for example, the word "you" occurs in every document:

from sklearn.feature_extraction import text
corpus = ["It does not matter what you are doing, just do it!",
          "Would you work if you won the lottery?",
          # there is no comma after the next string, so Python joins it with
          # the string that follows: the corpus consists of four documents
          "You like Python, he likes Python, we like Python, everybody loves Python!"
          "You said: 'I wish I were a Python programmer'",
          "You can stay here, if you want to. I would, if I were you."
         ]
vectorizer = text.CountVectorizer()
vectorizer.fit(corpus)

token_count_matrix = vectorizer.transform(corpus)
print(token_count_matrix)
  (0, 0)	1
  (0, 2)	1
  (0, 3)	1
  (0, 4)	1
  (0, 9)	2
  (0, 10)	1
  (0, 15)	1
  (0, 16)	1
  (0, 26)	1
  (0, 31)	1
  (1, 8)	1
  (1, 13)	1
  (1, 21)	1
  (1, 28)	1
  (1, 29)	1
  (1, 30)	1
  (1, 31)	2
  (2, 5)	1
  (2, 6)	1
  (2, 11)	2
  (2, 12)	1
  (2, 14)	1
  (2, 17)	1
  (2, 18)	5
  (2, 19)	1
  (2, 24)	1
  (2, 25)	1
  (2, 27)	1
  (2, 31)	2
  (3, 1)	1
  (3, 7)	1
  (3, 8)	2
  (3, 20)	1
  (3, 22)	1
  (3, 23)	1
  (3, 25)	1
  (3, 30)	1
  (3, 31)	3
tf_idf = text.TfidfTransformer()
tf_idf.fit(token_count_matrix)

tf_idf.idf_
Output:
array([1.91629073, 1.91629073, 1.91629073, 1.91629073, 1.91629073,
       1.91629073, 1.91629073, 1.91629073, 1.51082562, 1.91629073,
       1.91629073, 1.91629073, 1.91629073, 1.91629073, 1.91629073,
       1.91629073, 1.91629073, 1.91629073, 1.91629073, 1.91629073,
       1.91629073, 1.91629073, 1.91629073, 1.91629073, 1.91629073,
       1.51082562, 1.91629073, 1.91629073, 1.91629073, 1.91629073,
       1.51082562, 1.        ])
tf_idf.idf_[vectorizer.vocabulary_['python']]
Output:
1.916290731874155
da = vectorizer.transform(corpus).toarray()

# check how often the word 'would' occurs in each document:
word_ind = vectorizer.vocabulary_['would']
da[:, word_ind]
Output:
array([0, 1, 0, 1])
word_weight_list = list(zip(vectorizer.get_feature_names(), tf_idf.idf_))

word_weight_list.sort(key=lambda x:x[1])  # sort list by the weights (2nd component)
for word, idf_weight in word_weight_list:
    print(f"{word:15s}: {idf_weight:4.3f}")
you            : 1.000
if             : 1.511
were           : 1.511
would          : 1.511
are            : 1.916
can            : 1.916
do             : 1.916
does           : 1.916
doing          : 1.916
everybody      : 1.916
he             : 1.916
here           : 1.916
it             : 1.916
just           : 1.916
like           : 1.916
likes          : 1.916
lottery        : 1.916
loves          : 1.916
matter         : 1.916
not            : 1.916
programmer     : 1.916
python         : 1.916
said           : 1.916
stay           : 1.916
the            : 1.916
to             : 1.916
want           : 1.916
we             : 1.916
what           : 1.916
wish           : 1.916
won            : 1.916
work           : 1.916

The docstring of text.TfidfTransformer explains how tf-idf is calculated:

The formula that is used to compute the tf-idf for a term t of a document d in a document set is tf-idf(t, d) = tf(t, d) * idf(t), and the idf is computed as idf(t) = log [ n / df(t) ] + 1 (if smooth_idf=False), where n is the total number of documents in the document set and df(t) is the document frequency of t; the document frequency is the number of documents in the document set that contain the term t. The effect of adding "1" to the idf in the equation above is that terms with zero idf, i.e., terms that occur in all documents in a training set, will not be entirely ignored.

(Note that the idf formula above differs from the standard textbook notation that defines the idf as idf(t) = log [ n / (df(t) + 1) ].) If smooth_idf=True (the default), the constant "1" is added to the numerator and denominator of the idf as if an extra document was seen containing every term in the collection exactly once, which prevents zero divisions: idf(d, t) = log [ (1 + n) / (1 + df(d, t)) ] + 1.
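As a quick numerical check of the quoted formulas, the following sketch evaluates them for a document set of n = 4 documents and every possible document frequency. It uses only math.log; the function name idf_value is chosen for this example and is not part of scikit-learn.

```python
from math import log

def idf_value(df, n, smooth_idf=True):
    """idf for a term that occurs in df out of n documents."""
    if smooth_idf:
        return log((1 + n) / (1 + df)) + 1
    return log(n / df) + 1

n = 4
for df in (4, 3, 2, 1):
    print(f"df={df}: smooth={idf_value(df, n):.4f}, "
          f"not smooth={idf_value(df, n, smooth_idf=False):.4f}")
```

With smoothing, a term that occurs in all four documents gets the weight 1.0, and a term that occurs in only one document gets about 1.9163. These are the smallest and largest values in the idf_ array shown earlier.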

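We will program this manually in the following. Note that the output below was produced with a slightly modified corpus, which is printed after the list of idf values: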

# manually calculating the weights
from numpy import log

n = len(corpus)   # number of documents


# the following variables are used globally (as free variables) in the functions :-(
vectorizer = text.CountVectorizer()
vectorizer.fit(corpus)
da = vectorizer.transform(corpus).toarray()


def df(t):
    """ df(t) is the document frequency of t; the document frequency is 
        the number of documents in the document set that contain the term t. """
    word_ind = vectorizer.vocabulary_[t]
    existence_in_docus = da[:, word_ind] > 0  # boolean vector: does the word occur in the docu?
    return existence_in_docus.sum()


def idf(t, smooth_idf=True):
    """ inverse document frequency of the term t """
    if smooth_idf:
        return log((1 + n) / (1 + df(t))) + 1
    else:
        return log(n / df(t)) + 1


def tf(t, d):
    """ The term frequency 'tf' measures how often a term 't' 
        occurs in a document 'd'.  ('d': document index)
        If t_in_d = number of times the term t appears in the document d
        and no_terms_d = number of different terms in the document d, 
        tf(t, d) = t_in_d / no_terms_d
    """
    word_ind = vectorizer.vocabulary_[t]
    t_occurrences = da[d, word_ind]    # 'd' is the document index
    all_terms = (da[d] > 0).sum()      # number of different terms in d
    return t_occurrences / all_terms


def tf_idf(t, d):
    return idf(t) * tf(t, d)


res_idf = []
for word in vectorizer.get_feature_names():
    res_idf.append([word, idf(word)])

res_idf.sort(key=lambda x: x[1])
for item in res_idf:
    print(item)
['if', 1.2231435513142097]
['were', 1.2231435513142097]
['would', 1.2231435513142097]
['you', 1.5108256237659907]
['and', 1.916290731874155]
['beggars', 1.916290731874155]
['can', 1.916290731874155]
['he', 1.916290731874155]
['here', 1.916290731874155]
['horses', 1.916290731874155]
['like', 1.916290731874155]
['likes', 1.916290731874155]
['lottery', 1.916290731874155]
['programmer', 1.916290731874155]
['python', 1.916290731874155]
['ride', 1.916290731874155]
['said', 1.916290731874155]
['she', 1.916290731874155]
['stay', 1.916290731874155]
['the', 1.916290731874155]
['to', 1.916290731874155]
['want', 1.916290731874155]
['we', 1.916290731874155]
['wish', 1.916290731874155]
['wishes', 1.916290731874155]
['won', 1.916290731874155]
['work', 1.916290731874155]
corpus
Output:
["Wishes and Beggars: 'If Wishes Were Horses, Beggars Would Ride'",
 'Would you work if you won the lottery?',
 "She likes Python, he likes Python, we like Python!He said: 'I wish I were a Python programmer'",
 'You can stay here, if you want to. I would, if I were you.']
for word, word_index in vectorizer.vocabulary_.items():
    print(f"\n{word:12s}: ", end="")
    for d_index in range(len(corpus)):
        print(f"{d_index:1d} {tf_idf(word, d_index):3.2f}, ",  end="" )
wishes      : 0 0.48, 1 0.00, 2 0.00, 3 0.00, 
and         : 0 0.24, 1 0.00, 2 0.00, 3 0.00, 
beggars     : 0 0.48, 1 0.00, 2 0.00, 3 0.00, 
if          : 0 0.15, 1 0.17, 2 0.00, 3 0.27, 
were        : 0 0.15, 1 0.00, 2 0.12, 3 0.14, 
horses      : 0 0.24, 1 0.00, 2 0.00, 3 0.00, 
would       : 0 0.15, 1 0.17, 2 0.00, 3 0.14, 
ride        : 0 0.24, 1 0.00, 2 0.00, 3 0.00, 
you         : 0 0.00, 1 0.43, 2 0.00, 3 0.50, 
work        : 0 0.00, 1 0.27, 2 0.00, 3 0.00, 
won         : 0 0.00, 1 0.27, 2 0.00, 3 0.00, 
the         : 0 0.00, 1 0.27, 2 0.00, 3 0.00, 
lottery     : 0 0.00, 1 0.27, 2 0.00, 3 0.00, 
she         : 0 0.00, 1 0.00, 2 0.19, 3 0.00, 
likes       : 0 0.00, 1 0.00, 2 0.38, 3 0.00, 
python      : 0 0.00, 1 0.00, 2 0.77, 3 0.00, 
he          : 0 0.00, 1 0.00, 2 0.38, 3 0.00, 
we          : 0 0.00, 1 0.00, 2 0.19, 3 0.00, 
like        : 0 0.00, 1 0.00, 2 0.19, 3 0.00, 
said        : 0 0.00, 1 0.00, 2 0.19, 3 0.00, 
wish        : 0 0.00, 1 0.00, 2 0.19, 3 0.00, 
programmer  : 0 0.00, 1 0.00, 2 0.19, 3 0.00, 
can         : 0 0.00, 1 0.00, 2 0.00, 3 0.21, 
stay        : 0 0.00, 1 0.00, 2 0.00, 3 0.21, 
here        : 0 0.00, 1 0.00, 2 0.00, 3 0.21, 
want        : 0 0.00, 1 0.00, 2 0.00, 3 0.21, 
to          : 0 0.00, 1 0.00, 2 0.00, 3 0.21, 
tf_idf = text.TfidfTransformer()   # this rebinds the name tf_idf, which was a function above
tf_idf.fit(token_count_matrix)

da = vectorizer.transform(corpus).toarray()
print(da)

for word, word_index in vectorizer.vocabulary_.items():
    print(f"\n{word:12s}: ", end="")
    # caution: the idf is looked up for 'beggars' in every iteration,
    # so the same idf value is printed and used for all the words
    print(tf_idf.idf_[vectorizer.vocabulary_['beggars']])
    all_terms = (da[:] > 0).sum(axis=1)   # number of different terms per document
    print(all_terms, da[:, word_index])
    tf_word = da[:, word_index] / all_terms
    print(tf_word)
    print(tf_idf.idf_[vectorizer.vocabulary_['beggars']])
    print(tf_word * tf_idf.idf_[vectorizer.vocabulary_['beggars']])
[[1 2 0 0 0 1 1 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 2 0 0 1 0]
 [0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 1 1 2]
 [0 0 0 2 0 0 0 1 2 0 1 4 0 1 1 0 0 0 0 1 1 1 0 0 0 0 0]
 [0 0 1 0 1 0 2 0 0 0 0 0 0 0 0 1 0 1 1 0 1 0 0 0 0 1 3]]

wishes      : 1.916290731874155
[ 8  7 10  9] [2 0 0 0]
[0.25 0.   0.   0.  ]
1.916290731874155
[0.47907268 0.         0.         0.        ]

and         : 1.916290731874155
[ 8  7 10  9] [1 0 0 0]
[0.125 0.    0.    0.   ]
1.916290731874155
[0.23953634 0.         0.         0.        ]

beggars     : 1.916290731874155
[ 8  7 10  9] [2 0 0 0]
[0.25 0.   0.   0.  ]
1.916290731874155
[0.47907268 0.         0.         0.        ]

if          : 1.916290731874155
[ 8  7 10  9] [1 1 0 2]
[0.125      0.14285714 0.         0.22222222]
1.916290731874155
[0.23953634 0.27375582 0.         0.42584238]

were        : 1.916290731874155
[ 8  7 10  9] [1 0 1 1]
[0.125      0.         0.1        0.11111111]
1.916290731874155
[0.23953634 0.         0.19162907 0.21292119]

horses      : 1.916290731874155
[ 8  7 10  9] [1 0 0 0]
[0.125 0.    0.    0.   ]
1.916290731874155
[0.23953634 0.         0.         0.        ]

would       : 1.916290731874155
[ 8  7 10  9] [1 1 0 1]
[0.125      0.14285714 0.         0.11111111]
1.916290731874155
[0.23953634 0.27375582 0.         0.21292119]

ride        : 1.916290731874155
[ 8  7 10  9] [1 0 0 0]
[0.125 0.    0.    0.   ]
1.916290731874155
[0.23953634 0.         0.         0.        ]

you         : 1.916290731874155
[ 8  7 10  9] [0 2 0 3]
[0.         0.28571429 0.         0.33333333]
1.916290731874155
[0.         0.54751164 0.         0.63876358]

work        : 1.916290731874155
[ 8  7 10  9] [0 1 0 0]
[0.         0.14285714 0.         0.        ]
1.916290731874155
[0.         0.27375582 0.         0.        ]

won         : 1.916290731874155
[ 8  7 10  9] [0 1 0 0]
[0.         0.14285714 0.         0.        ]
1.916290731874155
[0.         0.27375582 0.         0.        ]

the         : 1.916290731874155
[ 8  7 10  9] [0 1 0 0]
[0.         0.14285714 0.         0.        ]
1.916290731874155
[0.         0.27375582 0.         0.        ]

lottery     : 1.916290731874155
[ 8  7 10  9] [0 1 0 0]
[0.         0.14285714 0.         0.        ]
1.916290731874155
[0.         0.27375582 0.         0.        ]

she         : 1.916290731874155
[ 8  7 10  9] [0 0 1 0]
[0.  0.  0.1 0. ]
1.916290731874155
[0.         0.         0.19162907 0.        ]

likes       : 1.916290731874155
[ 8  7 10  9] [0 0 2 0]
[0.  0.  0.2 0. ]
1.916290731874155
[0.         0.         0.38325815 0.        ]

python      : 1.916290731874155
[ 8  7 10  9] [0 0 4 0]
[0.  0.  0.4 0. ]
1.916290731874155
[0.         0.         0.76651629 0.        ]

he          : 1.916290731874155
[ 8  7 10  9] [0 0 2 0]
[0.  0.  0.2 0. ]
1.916290731874155
[0.         0.         0.38325815 0.        ]

we          : 1.916290731874155
[ 8  7 10  9] [0 0 1 0]
[0.  0.  0.1 0. ]
1.916290731874155
[0.         0.         0.19162907 0.        ]

like        : 1.916290731874155
[ 8  7 10  9] [0 0 1 0]
[0.  0.  0.1 0. ]
1.916290731874155
[0.         0.         0.19162907 0.        ]

said        : 1.916290731874155
[ 8  7 10  9] [0 0 1 0]
[0.  0.  0.1 0. ]
1.916290731874155
[0.         0.         0.19162907 0.        ]

wish        : 1.916290731874155
[ 8  7 10  9] [0 0 1 0]
[0.  0.  0.1 0. ]
1.916290731874155
[0.         0.         0.19162907 0.        ]

programmer  : 1.916290731874155
[ 8  7 10  9] [0 0 1 0]
[0.  0.  0.1 0. ]
1.916290731874155
[0.         0.         0.19162907 0.        ]

can         : 1.916290731874155
[ 8  7 10  9] [0 0 0 1]
[0.         0.         0.         0.11111111]
1.916290731874155
[0.         0.         0.         0.21292119]

stay        : 1.916290731874155
[ 8  7 10  9] [0 0 0 1]
[0.         0.         0.         0.11111111]
1.916290731874155
[0.         0.         0.         0.21292119]

here        : 1.916290731874155
[ 8  7 10  9] [0 0 0 1]
[0.         0.         0.         0.11111111]
1.916290731874155
[0.         0.         0.         0.21292119]

want        : 1.916290731874155
[ 8  7 10  9] [0 0 0 1]
[0.         0.         0.         0.11111111]
1.916290731874155
[0.         0.         0.         0.21292119]

to          : 1.916290731874155
[ 8  7 10  9] [0 0 0 1]
[0.         0.         0.         0.11111111]
1.916290731874155
[0.         0.         0.         0.21292119]

Another Simple Example

We will use another simple example to illustrate the previously introduced concepts. We use a sentence in which every word occurs only once. The corpus consists of this sentence and shortened versions of it, which we get by cutting off words from the end of the sentence.

from sklearn.feature_extraction import text

words = "Cold wind blows over the cornfields".split()
corpus = []
for i in range(1, len(words)+1):
    corpus.append(" ".join(words[:i]))
    
print(corpus)
['Cold', 'Cold wind', 'Cold wind blows', 'Cold wind blows over', 'Cold wind blows over the', 'Cold wind blows over the cornfields']
vectorizer = text.CountVectorizer()

vectorizer = vectorizer.fit(corpus)
vectorized_text = vectorizer.transform(corpus)
vectorized_text
Output:
<6x6 sparse matrix of type '<class 'numpy.int64'>'
	with 21 stored elements in Compressed Sparse Row format>
tf_idf = text.TfidfTransformer()
tf_idf.fit(vectorized_text)

tf_idf.idf_
Output:
array([1.33647224, 1.        , 2.25276297, 1.55961579, 1.84729786,
       1.15415068])
word_weight_list = list(zip(vectorizer.get_feature_names(), tf_idf.idf_))
word_weight_list.sort(key=lambda x:x[1])  # sort list by the weights (2nd component)
for word, idf_weight in word_weight_list:
    print(f"{word:15s}: {idf_weight:4.3f}")
cold           : 1.000
wind           : 1.154
blows          : 1.336
over           : 1.560
the            : 1.847
cornfields     : 2.253
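These weights can be traced back to the document frequencies. Because of the way the corpus was built, "cold" occurs in all six documents, "wind" in five, and so on down to "cornfields", which occurs in only one. The following sketch recomputes the weights with the smoothed idf formula, using plain Python instead of TfidfTransformer and a simple split instead of the vectorizer's tokenizer:

```python
from math import log

words = "Cold wind blows over the cornfields".lower().split()
# each document is a prefix of the sentence, here already split into tokens
corpus_tokens = [words[:i] for i in range(1, len(words) + 1)]

n = len(corpus_tokens)
idf_weights = {}
for word in words:
    df = sum(word in doc for doc in corpus_tokens)   # document frequency
    idf_weights[word] = log((1 + n) / (1 + df)) + 1
    print(f"{word:15s}: df={df}  idf={idf_weights[word]:4.3f}")
```

The more documents a word appears in, the smaller its weight becomes.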
TfidF = text.TfidfTransformer(smooth_idf=True, use_idf=True)
tfidf = TfidF.fit_transform(vectorized_text)

word_weight_list = list(zip(vectorizer.get_feature_names(), TfidF.idf_))
word_weight_list.sort(key=lambda x:x[1])  # sort list by the weights (2nd component)
for word, idf_weight in word_weight_list:
    print(f"{word:15s}: {idf_weight:4.3f}")
cold           : 1.000
wind           : 1.154
blows          : 1.336
over           : 1.560
the            : 1.847
cornfields     : 2.253

Working With Real Data

scikit-learn contains a dataset from real newsgroups, which can be used for our purposes:

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer

import numpy as np

# Create our vectorizer
vectorizer = CountVectorizer()

# Let's fetch all the possible text data
newsgroups_data = fetch_20newsgroups()

Let us have a closer look at this data. As with all the other data sets in sklearn we can find the actual data under the attribute data:

print(newsgroups_data.data[0])
From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15

 I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.

Thanks,
- IL
   ---- brought to you by your neighborhood Lerxst ----





print(newsgroups_data.data[200])
Subject: Re: "Proper gun control?" What is proper gun cont
From: kim39@scws8.harvard.edu (John Kim)
Organization: Harvard University Science Center
Nntp-Posting-Host: scws8.harvard.edu
Lines: 17

In article <C5JGz5.34J@SSD.intel.com> hays@ssd.intel.com (Kirk Hays) writes:
>I'd like to point out that I was in error - "Terminator" began posting only 
>six months before he purchased his first firearm, according to private email
>from him.
>I can't produce an archived posting of his earlier than January 1992,
>and he purchased his first firearm in March 1992.
>I guess it only seemed like years.
>Kirk Hays - NRA Life, seventh generation.

I first read and consulted rec.guns in the summer of 1991.  I
just purchased my first firearm in early March of this year.

 NOt for lack of desire for a firearm, you understand.  I could 
have purchased a rifle or shotgun but didn't want one.
-Case Kim



We fit the vectorizer on the postings:

vectorizer.fit(newsgroups_data.data)
Output:
CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 1), preprocessor=None, stop_words=None,
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)

Let's have a look at the first n words:

counter = 0
n = 10
for word, index in vectorizer.vocabulary_.items():
    print(word, index)
    counter += 1
    if counter >= n:   # stop after exactly n words
        break
from 56979
lerxst 75358
wam 123162
umd 118280
edu 50527
where 124031
my 85354
thing 114688
subject 111322
what 123984

We can turn the newsgroup postings into arrays. We do it with the first one:

a = vectorizer.transform([newsgroups_data.data[0]]).toarray()[0]
print(a)
[0 0 0 ... 0 0 0]

The vocabulary is huge. This is why we see mostly zeros.

len(vectorizer.vocabulary_)
Output:
130107

There are a lot of 'rubbish' words in this vocabulary; rubbish, that is, from the perspective of machine learning. For machine learning purposes, words like 'Subject', 'From', 'Organization', 'Nntp-Posting-Host', 'Lines' and many others are useless, because they occur in all or in most postings. This technical 'garbage' from the newsgroup can easily be stripped off: we can fetch the data differently, stating that we do not want 'headers', 'footers' and 'quotes':

newsgroups_data_cleaned = fetch_20newsgroups(remove=('headers', 'footers', 'quotes'))
print(newsgroups_data_cleaned.data[0])
I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.

Let's have a look at the complete posting:

print(newsgroups_data.data[0])
From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15

 I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.

Thanks,
- IL
   ---- brought to you by your neighborhood Lerxst ----





vectorizer_cleaned = vectorizer.fit(newsgroups_data_cleaned.data)
len(vectorizer_cleaned.vocabulary_)
Output:
101631

So we got rid of more than 28000 words, but with more than 100000 words the vocabulary is still very large.

We can also directly separate the newsgroup feeds into a train and test set:

newsgroups_train = fetch_20newsgroups(subset='train',
                                      remove=('headers', 'footers', 'quotes'))
newsgroups_test = fetch_20newsgroups(subset='test',
                                     remove=('headers', 'footers', 'quotes'))
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer

from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

vectorizer = CountVectorizer()
train_data = vectorizer.fit_transform(newsgroups_train.data)

# creating a classifier
classifier = MultinomialNB(alpha=.01)
classifier.fit(train_data, newsgroups_train.target)

test_data = vectorizer.transform(newsgroups_test.data)

predictions = classifier.predict(test_data)
accuracy_score = metrics.accuracy_score(newsgroups_test.target, 
                                        predictions)
f1_score = metrics.f1_score(newsgroups_test.target, 
                            predictions, 
                            average='macro')

print("Accuracy score: ", accuracy_score)
print("F1 score: ", f1_score)
Accuracy score:  0.6460435475305364
F1 score:  0.6203806145034193
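To see what the two scores measure, here is a small sketch that computes accuracy and the macro-averaged F1 score by hand for a made-up example with three classes. The label lists are invented for illustration. The functions are a plain-Python rendering of the usual definitions and are not taken from sklearn.metrics.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    pairs = list(zip(y_true, y_pred))
    scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denominator = 2 * tp + fp + fn
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)

y_true = [0, 0, 1, 1, 2, 2]   # hypothetical true classes
y_pred = [0, 1, 1, 1, 2, 0]   # hypothetical predictions

print("Accuracy score: ", accuracy(y_true, y_pred))   # 4 of 6 correct
print("F1 score: ", macro_f1(y_true, y_pred))         # mean of 0.5, 0.8 and 2/3
```

With average='macro' every class contributes equally to the F1 score, regardless of how many postings belong to it. This is why the two numbers can differ.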

Stop Words

So far we have added all the words to the vocabulary. However, it is questionable whether words like "the", "am", "were" or similar words should be included at all, since they usually do not contribute much to the meaning of a text. In other words: they have limited predictive power. It therefore makes sense to exclude such words from processing, i.e. from inclusion in the dictionary. This means we have to provide a list of words which should be ignored, i.e. filtered out before or after processing the text. In natural language processing such words are usually called "stop words". There is no single universal list of stop words that is used by all natural language processing tools. Usually, stop words consist of the most frequently used words in a language, but they can also be chosen individually for a given task.

By the way, stop words are an idea which is quite old. It goes back to 1959 and Hans Peter Luhn, one of the pioneers in information retrieval.
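The filtering idea can be sketched in a few lines of plain Python. The stop word list below is a hypothetical, hand-picked example, and the function name is made up for this illustration.

```python
import re

stop_words = {"a", "my", "for", "the", "is"}   # hypothetical, task-specific list

def tokens_without_stop_words(doc, stop_words):
    """Tokenize the lowercased document and drop every stop word."""
    tokens = re.findall(r"\b\w+\b", doc.lower())
    return [token for token in tokens if token not in stop_words]

remaining = tokens_without_stop_words("A horse, a horse, my kingdom for a horse!",
                                      stop_words)
print(remaining)
# ['horse', 'horse', 'kingdom', 'horse']
```

Only the words that carry the content of the sentence survive; they are the ones that end up in the vocabulary.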

There are different ways to provide stop words in sklearn:

  • Explicit list of stop words
  • Automatically created stop words

We will start with individual stop words:

Individual Stop Words

corpus = ["A horse, a horse, my kingdom for a horse!",
          "Horse sense is the thing a horse has which keeps it from betting on people.",
          "I’ve often said there is nothing better for the inside of the man, than the outside of the horse.",
          "A man on a horse is spiritually, as well as physically, bigger then a man on foot.",
          "No heaven can heaven be, if my horse isn’t there to welcome me."]

# the corpus is not passed to the constructor; it is passed to fit_transform below
cv = CountVectorizer(stop_words=["my", "for","the", "has", 
                                 "from", "on", "of", "it",
                                 "ve", "as", "no", "be",
                                 "isn", "to", "me", "is"])
count_vector = cv.fit_transform(corpus)
count_vector.shape

cv.vocabulary_
Output: :
{'horse': 6,
 'kingdom': 10,
 'sense': 18,
 'thing': 23,
 'which': 26,
 'keeps': 9,
 'betting': 1,
 'people': 15,
 'often': 13,
 'said': 17,
 'there': 22,
 'nothing': 12,
 'better': 0,
 'inside': 8,
 'man': 11,
 'than': 20,
 'outside': 14,
 'spiritually': 19,
 'well': 25,
 'physically': 16,
 'bigger': 2,
 'then': 21,
 'foot': 4,
 'heaven': 5,
 'can': 3,
 'if': 7,
 'welcome': 24}

sklearn comes with a default list of English stop words. It is implemented as a frozenset and can be accessed with text.ENGLISH_STOP_WORDS:

from itertools import islice
from sklearn.feature_extraction import text

n = 25
print(f"{n} arbitrary words from ENGLISH_STOP_WORDS:")
# a frozenset is unordered, so which words come first is arbitrary
print(", ".join(islice(text.ENGLISH_STOP_WORDS, n)))
25 arbitrary words from ENGLISH_STOP_WORDS:
somewhere, are, couldnt, her, neither, sixty, further, a, found, was, call, ltd, same, as, perhaps, mine, inc, above, the, these, anything, should, below, even, since

We can use stop words in our 20newsgroups classification problem:

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer

from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

# "english" selects the built-in list text.ENGLISH_STOP_WORDS
vectorizer = CountVectorizer(stop_words="english")

vectors = vectorizer.fit_transform(newsgroups_train.data)

# creating a classifier
classifier = MultinomialNB(alpha=.01)
classifier.fit(vectors, newsgroups_train.target)

vectors_test = vectorizer.transform(newsgroups_test.data)

predictions = classifier.predict(vectors_test)
accuracy_score = metrics.accuracy_score(newsgroups_test.target, 
                                        predictions)
f1_score = metrics.f1_score(newsgroups_test.target, 
                            predictions, 
                            average='macro')

print("accuracy score: ", accuracy_score)
print("F1-score: ", f1_score)
accuracy score:  0.6526818906001062
F1-score:  0.6268816896587931

Automatically Created Stop Words

As in many other cases, it is a good idea to look for ways to define a list of stop words automatically, ideally a list that is adapted to the problem at hand.

To automatically create a stop word list, we will start with the parameter min_df of CountVectorizer. The document frequency of a term is the number of documents in which it occurs. When min_df is set, terms with a document frequency strictly lower than this threshold are ignored. This value is also called cut-off in the literature. If a float in the range [0.0, 1.0] is used, the parameter represents a proportion of documents. An integer is treated as an absolute count. This parameter is ignored if vocabulary is not None.
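The effect of such a threshold can be sketched without sklearn. The following is only an illustration of the rule described above, not sklearn's implementation; the helper names, the tokenizer pattern and the three short documents are made up for this example.

```python
import re

def document_frequencies(docs):
    """Map each term to the number of documents in which it occurs."""
    df = {}
    for doc in docs:
        # set(): a term counts once per document, however often it occurs
        for term in set(re.findall(r"\b\w\w+\b", doc.lower())):
            df[term] = df.get(term, 0) + 1
    return df

def apply_min_df(df, n_docs, min_df):
    """Split terms into kept and ignored ones for a given min_df threshold."""
    # A float is read as a proportion of documents, an integer as a count
    threshold = min_df * n_docs if isinstance(min_df, float) else min_df
    kept = {term for term, count in df.items() if count >= threshold}
    ignored = set(df) - kept
    return kept, ignored

docs = ["a horse is a horse",
        "the horse and the man",
        "the man is tall"]
df = document_frequencies(docs)
kept, ignored = apply_min_df(df, len(docs), min_df=2)
print(sorted(kept))     # terms occurring in at least two documents
print(sorted(ignored))  # terms occurring in only one document
```

With these three documents, min_df=2 and min_df=0.5 select the same terms, because half of three documents is 1.5 and a term needs at least two documents to reach it.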

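# Note: there is no comma after the second and after the fourth string
# literal below, so Python joins each of them with the literal that follows.
# The list therefore contains five documents, not seven, and all outputs
# shown in this section are based on these five documents.
corpus = ["""People say you cannot live without love, 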
             but I think oxygen is more important""",
          "Sometimes, when you close your eyes, you cannot see."
          "A horse, a horse, my kingdom for a horse!",
          """Horse sense is the thing a horse has which 
          keeps it from betting on people."""
          """I’ve often said there is nothing better for 
          the inside of the man, than the outside of the horse.""",
          """A man on a horse is spiritually, as well as physically, 
          bigger then a man on foot.""",
          """No heaven can heaven be, if my horse isn’t there 
          to welcome me."""]

# the corpus is passed to fit_transform, not to the constructor
cv = CountVectorizer(min_df=2)
count_vector = cv.fit_transform(corpus)
cv.vocabulary_
Output: :
{'people': 7,
 'you': 9,
 'cannot': 0,
 'is': 3,
 'horse': 2,
 'my': 5,
 'for': 1,
 'on': 6,
 'there': 8,
 'man': 4}

Hardly any words from our corpus are left. Because our corpus contains only a few documents (strings), and because these texts are very short, the number of words which occur in fewer than two documents is very high. We have eliminated all of these words.

We can also see the words which have been chosen as stop words by looking at cv.stop_words_:

cv.stop_words_
Output: :
{'as',
 'be',
 'better',
 'betting',
 'bigger',
 'but',
 'can',
 'close',
 'eyes',
 'foot',
 'from',
 'has',
 'heaven',
 'if',
 'important',
 'inside',
 'isn',
 'it',
 'keeps',
 'kingdom',
 'live',
 'love',
 'me',
 'more',
 'no',
 'nothing',
 'of',
 'often',
 'outside',
 'oxygen',
 'physically',
 'said',
 'say',
 'see',
 'sense',
 'sometimes',
 'spiritually',
 'than',
 'the',
 'then',
 'thing',
 'think',
 'to',
 've',
 'welcome',
 'well',
 'when',
 'which',
 'without',
 'your'}

print("min_df, size of vocabulary, stop_words list size")
# start at 1: an integer min_df of 0 would have the same effect as 1
# and is rejected by recent scikit-learn versions
for min_df in range(1, len(corpus)):
    cv = CountVectorizer(min_df=min_df)
    count_vector = cv.fit_transform(corpus)
    len_voc = len(cv.vocabulary_)
    len_stop_words = len(cv.stop_words_)
    print(f"{min_df:10d} {len_voc:15d} {len_stop_words:19d}")

min_df, size of vocabulary, stop_words list size
         1              60                   0
         2              10                  50
         3               2                  58
         4               1                  59

Another parameter of CountVectorizer with which we can create a corpus-specific stop word list is max_df. It can be a float between 0.0 and 1.0 or an integer. The default value is 1.0, i.e. the float 1.0 and not the integer 1! When the vocabulary is built, all terms with a document frequency strictly higher than the given threshold are ignored. If the parameter is given as a float between 0.0 and 1.0, it represents a proportion of documents. This parameter is ignored if vocabulary is not None.

Let us use our previous corpus again as an example.

cv = CountVectorizer(max_df=0.20)
count_vector = cv.fit_transform(corpus)
cv.stop_words_
Output: :
{'cannot', 'for', 'horse', 'is', 'man', 'my', 'on', 'people', 'there', 'you'}
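How a float or integer threshold turns into a decision per term can be sketched as follows. This is a simplified illustration of the rule described above, not sklearn's code; the function name and the document frequencies are hypothetical.

```python
def apply_max_df(doc_freq, n_docs, max_df):
    """Return the terms that a max_df threshold would treat as stop words."""
    # A float is a proportion of documents; an integer is an absolute count.
    # The float 1.0 means "all documents", the integer 1 means "one document".
    limit = max_df * n_docs if isinstance(max_df, float) else max_df
    return {term for term, count in doc_freq.items() if count > limit}

# Hypothetical document frequencies for a corpus of five documents
doc_freq = {"horse": 4, "is": 3, "man": 2, "kingdom": 1, "oxygen": 1}

print(sorted(apply_max_df(doc_freq, 5, 0.20)))  # limit is 1.0 document
print(sorted(apply_max_df(doc_freq, 5, 1.0)))   # limit is 5.0: nothing is removed
print(sorted(apply_max_df(doc_freq, 5, 1)))     # limit is 1: same as 0.20 here
```

The last two calls show why the difference between the float 1.0 and the integer 1 matters: the first keeps every term, the second removes every term that occurs in more than one document.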

Exercises

Exercise 1

In the subdirectory 'books' you will find some books:

Use these novels as the corpus and create a word count vector.

Exercise 2

Turn the previously calculated 'word count vector' into a dense ndarray representation.

Exercise 3

Let us have another example with a different corpus. The five strings are famous quotes from

  1. William Shakespeare
  2. W.C. Fields
  3. Ronald Reagan
  4. John Steinbeck
  5. Author unknown

Compute the IDF values!

quotes = ["A horse, a horse, my kingdom for a horse!",
          "Horse sense is the thing a horse has which keeps it from betting on people.",
          "I’ve often said there is nothing better for the inside of the man, than the outside of the horse.",
          "A man on a horse is spiritually, as well as physically, bigger then a man on foot.",
          "No heaven can heaven be, if my horse isn’t there to welcome me."]

Solutions

Solution to Exercise 1

books = ["night_and_day_virginia_woolf.txt",
         "the_way_of_all_flash_butler.txt",
         "moby_dick_melville.txt",
         "sons_and_lovers_lawrence.txt",
         "robinson_crusoe_defoe.txt",
         "james_joyce_ulysses.txt"]
path = "books"

corpus = []
for book in books:
    # the with statement closes each file after reading
    with open(path + "/" + book) as fh:
        corpus.append(fh.read())

# show the first 30 characters of each book
[book[:30] for book in corpus]
Output: :
['The Project Gutenberg EBook of',
 'The Project Gutenberg eBook, T',
 '\nThe Project Gutenberg EBook o',
 'The Project Gutenberg EBook of',
 'The Project Gutenberg eBook, T',
 '\nThe Project Gutenberg EBook o']

We have to get rid of the Gutenberg header and footer, because they do not belong to the novels. Looking at the texts, we can see that the author's work begins after a line of the following kind:

***START OF THIS PROJECT GUTENBERG ... ***

The footer of each text starts with this line:

***END OF THIS PROJECT GUTENBERG EBOOK ...***

There may or may not be a space after the first three stars, and the marker line may contain either "THE" or "THIS".

We can use regular expressions to find the starting point of the novels:

from sklearn.feature_extraction import text
import re

books = ["night_and_day_virginia_woolf.txt",
         "the_way_of_all_flash_butler.txt",
         "moby_dick_melville.txt",
         "sons_and_lovers_lawrence.txt",
         "robinson_crusoe_defoe.txt",
         "james_joyce_ulysses.txt"]
path = "books"

corpus = []
for book in books:
    with open(path + "/" + book) as fh:
        txt = fh.read()
    # locate the marker lines which enclose the actual novel
    text_begin = re.search(r"\*\*\* ?START OF (THE|THIS) PROJECT.*?\*\*\*", txt, re.DOTALL)
    text_end = re.search(r"\*\*\* ?END OF (THE|THIS) PROJECT.*?\*\*\*", txt, re.DOTALL)
    # keep only the text between the two markers
    corpus.append(txt[text_begin.end():text_end.start()])

vectorizer = text.CountVectorizer()
vectorizer.fit(corpus)
token_count_matrix = vectorizer.transform(corpus)
print(token_count_matrix)
  (0, 4)	2
  (0, 35)	1
  (0, 60)	1
  (0, 79)	1
  (0, 131)	1
  (0, 221)	1
  (0, 724)	6
  (0, 731)	5
  (0, 734)	1
  (0, 743)	5
  (0, 761)	1
  (0, 773)	1
  (0, 779)	1
  (0, 780)	1
  (0, 781)	23
  (0, 790)	1
  (0, 804)	1
  (0, 809)	412
  (0, 810)	36
  (0, 817)	2
  (0, 823)	4
  (0, 824)	19
  (0, 825)	3
  (0, 828)	11
  (0, 829)	1
  :	:
  (5, 42156)	5
  (5, 42157)	1
  (5, 42158)	1
  (5, 42159)	2
  (5, 42160)	2
  (5, 42161)	106
  (5, 42165)	1
  (5, 42166)	2
  (5, 42167)	1
  (5, 42172)	2
  (5, 42173)	4
  (5, 42174)	1
  (5, 42175)	1
  (5, 42176)	1
  (5, 42177)	1
  (5, 42178)	3
  (5, 42181)	1
  (5, 42182)	1
  (5, 42183)	3
  (5, 42184)	1
  (5, 42185)	2
  (5, 42186)	1
  (5, 42187)	1
  (5, 42188)	2
  (5, 42189)	1
print("Number of words in vocabulary: ", len(vectorizer.vocabulary_))
Number of words in vocabulary:  42192
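To see how the two regular expressions behave, they can be tried on a tiny made-up text. The sample string and the helper function strip_gutenberg are hypothetical and only serve as an illustration. Unlike the loop above, this sketch falls back to the full text if a marker is missing instead of raising an error.

```python
import re

START = r"\*\*\* ?START OF (THE|THIS) PROJECT.*?\*\*\*"
END = r"\*\*\* ?END OF (THE|THIS) PROJECT.*?\*\*\*"

def strip_gutenberg(txt):
    """Return the text between the start and the end marker line."""
    begin = re.search(START, txt, re.DOTALL)
    end = re.search(END, txt, re.DOTALL)
    if begin is None or end is None:
        # fall back to the full text if a marker is missing
        return txt
    return txt[begin.end():end.start()]

# made-up miniature "book" which uses both marker variants
sample = ("Header text\n"
          "*** START OF THE PROJECT GUTENBERG EBOOK SAMPLE ***\n"
          "Call me Ishmael.\n"
          "***END OF THIS PROJECT GUTENBERG EBOOK SAMPLE***\n"
          "Footer text\n")

print(repr(strip_gutenberg(sample)))
```

Whether a silent fallback or an explicit error is preferable depends on the task. For a small, known set of files an error is often more helpful, because it reveals an unexpected file format immediately.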

Solution to Exercise 2

All you have to do is apply the method toarray to the token_count_matrix:

token_count_matrix.toarray()
Output: :
array([[ 0,  0,  0, ...,  0,  0,  0],
       [19,  0,  0, ...,  0,  0,  0],
       [20,  0,  0, ...,  0,  1,  1],
       [ 0,  0,  1, ...,  0,  0,  0],
       [ 0,  0,  0, ...,  0,  0,  0],
       [11,  1,  0, ...,  1,  0,  0]])
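The difference between the two representations can be illustrated in plain Python. The sparse output of Solution 1 lists only the non-zero entries as (row, column) pairs with their counts, whereas the dense array stores every cell. The following sketch uses hypothetical entries and a made-up helper to_dense; it is not how toarray is implemented.

```python
def to_dense(entries, n_rows, n_cols):
    """Build a dense list-of-lists matrix from (row, column, count) triples."""
    dense = [[0] * n_cols for _ in range(n_rows)]
    for row, col, count in entries:
        dense[row][col] = count
    return dense

# hypothetical sparse entries: only the non-zero counts are stored
entries = [(0, 1, 2), (0, 3, 1), (1, 0, 5), (2, 3, 4)]

dense = to_dense(entries, n_rows=3, n_cols=4)
for line in dense:
    print(line)
```

For our six books and 42192 words, the dense array has 6 * 42192 = 253152 cells, regardless of how many of them are zero.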

Solution to Exercise 3

from sklearn.feature_extraction import text

# our corpus of five quotes:
quotes = ["A horse, a horse, my kingdom for a horse!",
          "Horse sense is the thing a horse has which keeps it from betting on people.",
          "I’ve often said there is nothing better for the inside of the man, than the outside of the horse.",
          "A man on a horse is spiritually, as well as physically, bigger then a man on foot.",
          "No heaven can heaven be, if my horse isn’t there to welcome me."]

vectorizer = text.CountVectorizer()
# fit_transform learns the vocabulary and counts the words in one step
vectorized_text = vectorizer.fit_transform(quotes)

tfidf_transformer = text.TfidfTransformer(smooth_idf=True, use_idf=True)
tfidf_transformer.fit(vectorized_text)

"""
alternative way to output the data:
import pandas as pd
df_idf = pd.DataFrame(tfidf_transformer.idf_,
                      index=vectorizer.get_feature_names_out(),
                      columns=["idf_weight"])
df_idf = df_idf.sort_values(by="idf_weight")  # sorting data
print(df_idf)

"""
print(f"{'word':15s}: idf_weight")
# get_feature_names_out replaces the older method get_feature_names
word_weight_list = list(zip(vectorizer.get_feature_names_out(), tfidf_transformer.idf_))
word_weight_list.sort(key=lambda x: x[1])  # sort list by the weights (2nd component)
for word, idf_weight in word_weight_list:
    print(f"{word:15s}: {idf_weight:4.3f}")
word           : idf_weight
horse          : 1.000
is             : 1.405
for            : 1.693
man            : 1.693
my             : 1.693
on             : 1.693
the            : 1.693
there          : 1.693
as             : 2.099
be             : 2.099
better         : 2.099
betting        : 2.099
bigger         : 2.099
can            : 2.099
foot           : 2.099
from           : 2.099
has            : 2.099
heaven         : 2.099
if             : 2.099
inside         : 2.099
isn            : 2.099
it             : 2.099
keeps          : 2.099
kingdom        : 2.099
me             : 2.099
no             : 2.099
nothing        : 2.099
of             : 2.099
often          : 2.099
outside        : 2.099
people         : 2.099
physically     : 2.099
said           : 2.099
sense          : 2.099
spiritually    : 2.099
than           : 2.099
then           : 2.099
thing          : 2.099
to             : 2.099
ve             : 2.099
welcome        : 2.099
well           : 2.099
which          : 2.099
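The weights can be checked by hand. With smooth_idf=True, the scikit-learn documentation gives the IDF of a term as ln((1 + n) / (1 + df)) + 1, where n is the number of documents and df is the document frequency of the term. The following sketch evaluates this formula for a hypothetical corpus size; the function name smoothed_idf is made up for this illustration.

```python
import math

def smoothed_idf(n_docs, doc_freq):
    """IDF with smoothing: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

n_docs = 10  # hypothetical corpus size
for doc_freq in (10, 5, 1):
    print(f"df = {doc_freq:2d}: idf = {smoothed_idf(n_docs, doc_freq):.3f}")
```

A term which occurs in every document, like "horse" in our quotes, always gets the weight 1.0, because the logarithm of 1 is 0. The rarer a term is, the higher its weight.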

Footnotes

1 Logically, toarray and todense do the same thing, but toarray returns an ndarray whereas todense returns a matrix. Considering what the official NumPy documentation has to say about the numpy.matrix class, you should not use todense: "It is no longer recommended to use this class, even for linear algebra. Instead use regular arrays. The class may be removed in the future." (numpy.matrix)