Numerical & Scientific Computing with Python: Introduction to NumPy

Numpy Tutorial


Introduction

Visualisation of a matrix using a Hinton diagram

NumPy is short for "Numeric Python" or "Numerical Python". It is an open source extension module for Python that provides fast precompiled functions for mathematical and numerical routines. Furthermore, NumPy enriches Python with powerful data structures for efficient computation with multi-dimensional arrays and matrices. The implementation is designed to handle even very large matrices and arrays. In addition, the module supplies a large library of high-level mathematical functions that operate on these matrices and arrays.

SciPy (Scientific Python) is often mentioned together with NumPy. SciPy extends the capabilities of NumPy with further useful functions for minimization, regression, Fourier transformation and many other tasks.

Neither NumPy nor SciPy is usually installed by default. NumPy has to be installed before SciPy. NumPy can be downloaded from the website:

http://www.numpy.org

(Comment: The image on the right is the graphical visualisation of a matrix with 14 rows and 20 columns, a so-called Hinton diagram. The size of a square within this diagram corresponds to the magnitude of the depicted matrix value. The colour indicates whether the value is positive or negative. In our example, red denotes negative values and green denotes positive values.)
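
The encoding described in this comment can be sketched with NumPy alone. The snippet below is only an illustration under stated assumptions: the names weights, side_lengths and colours are hypothetical, the random matrix merely stands in for the one in the image, and letting the area of a square grow with the magnitude is one common convention, not necessarily the one used for the picture. Drawing the squares is left out.

```python
import numpy as np

# Hypothetical stand-in for the matrix in the image: 14 rows, 20 columns
rng = np.random.default_rng(seed=0)
weights = rng.uniform(-1.0, 1.0, size=(14, 20))

# Assumption: the area of a square is proportional to the magnitude of the
# value, so the side length grows with the square root of |value| / max|value|
side_lengths = np.sqrt(np.abs(weights) / np.abs(weights).max())

# The sign decides the colour: green for positive values, red otherwise
colours = np.where(weights > 0, "green", "red")

print(side_lengths.shape, colours.shape)
print(colours[0, :5], side_lengths[0, :5].round(2))
```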

NumPy is based on two earlier Python modules dealing with arrays. One of these is Numeric. Like NumPy, Numeric is a Python module for high-performance numeric computing, but it is obsolete nowadays. The other predecessor of NumPy is Numarray, a complete rewrite of Numeric, which is deprecated as well. NumPy is a merger of those two, i.e. it is built on the code of Numeric and the features of Numarray.


The Python Alternative to Matlab

Python in combination with NumPy, SciPy and Matplotlib can be used as a replacement for MATLAB. The combination is a free alternative to MATLAB, free both as in "free beer" and as in "freedom". Even though MATLAB has a huge number of additional toolboxes available, NumPy has the advantage that Python is a more modern and complete programming language and, as mentioned above, open source. SciPy adds even more MATLAB-like functionality to Python, and the module Matplotlib rounds out the picture with MATLAB-like plotting functionality.

Overview diagram: Comparison between Python and Matlab


Comparison between Core Python and Numpy

When we say "Core Python", we mean Python without any special modules, in particular without NumPy.

The advantages of Core Python:

Advantages of using Numpy with Python:


A Simple Numpy Example

Before we can use NumPy we will have to import it. It has to be imported like any other module:

import numpy

But you will hardly ever see this form. NumPy is usually imported under the alias np:

import numpy as np

We have a list with values, e.g. temperatures in Celsius:

cvalues = [20.1, 20.8, 21.9, 22.5, 22.7, 22.3, 21.8, 21.2, 20.9, 20.1]

We will turn our list "cvalues" into a one-dimensional numpy array:

C = np.array(cvalues)
print(C)
After having executed the Python code above we received the following output:
[ 20.1  20.8  21.9  22.5  22.7  22.3  21.8  21.2  20.9  20.1]
Let's assume we want to convert the values to degrees Fahrenheit. This is very easy to accomplish with a NumPy array: the arithmetic operators are applied to every element, so a scalar multiplication, a scalar division and a scalar addition solve the problem:
print(C * 9 / 5 + 32)
The above Python code returned the following:
[ 68.18  69.44  71.42  72.5   72.86  72.14  71.24  70.16  69.62  68.18]

The array C has not been changed by this expression:

print(C)
This gets us the following output:
[ 20.1  20.8  21.9  22.5  22.7  22.3  21.8  21.2  20.9  20.1]
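
C stays untouched because arithmetic operators on an array build a new array. A brief sketch of the difference to an in-place operation follows; the names F and D are hypothetical and do not appear in the text above:

```python
import numpy as np

C = np.array([20.1, 20.8, 21.9, 22.5, 22.7, 22.3, 21.8, 21.2, 20.9, 20.1])

# The expression creates a new array; C itself keeps its values
F = C * 9 / 5 + 32
print(F is C)            # False: F is a separate object

# An augmented assignment, in contrast, modifies an array in place
D = C.copy()             # work on a copy so that C stays intact
D *= 2
print(D[0], C[0])
```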

Compared to this, the solution for our Python list looks awkward:

fvalues = [ x*9/5 + 32 for x in cvalues] 
print(fvalues)
The above Python code returned the following:
[68.18, 69.44, 71.42, 72.5, 72.86, 72.14, 71.24000000000001, 70.16, 69.62, 68.18]
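
The list version prints 71.24000000000001 where the array version shows 71.24. This is a matter of display: print shortens the elements of an array for output, while the computed numbers agree with those of the list. A minimal check, reusing the names from above and introducing F as a hypothetical name for the array result:

```python
import numpy as np

cvalues = [20.1, 20.8, 21.9, 22.5, 22.7, 22.3, 21.8, 21.2, 20.9, 20.1]
C = np.array(cvalues)

fvalues = [x * 9 / 5 + 32 for x in cvalues]   # pure Python list
F = C * 9 / 5 + 32                            # NumPy array

# Both approaches yield the same numbers up to floating point tolerance
print(np.allclose(F, fvalues))

# Taken out of the array, the element at index 6 is printed as a plain float
print(float(F[6]))
```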

So far, we have referred to C as an array. The internal type is "ndarray" or, to be more precise, C is an instance of the class numpy.ndarray:

type(C)
The above code returned the following result:
numpy.ndarray

In the following, we will use the terms "array" and "ndarray" synonymously in most cases.
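
An ndarray also carries information about its layout and element type. The following sketch inspects a few standard attributes of the array C from above:

```python
import numpy as np

C = np.array([20.1, 20.8, 21.9, 22.5, 22.7, 22.3, 21.8, 21.2, 20.9, 20.1])

print(isinstance(C, np.ndarray))   # True
print(C.ndim)                      # 1, i.e. one-dimensional
print(C.shape)                     # (10,), ten elements along the only axis
print(C.dtype)                     # float64, the common type of all elements
```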



Graphical Representation of the Values

Although we will not cover the module matplotlib until a later chapter, we want to demonstrate how we can use it to depict our temperature values. To do this, we use the package pyplot from matplotlib.

If you use the Jupyter notebook, you might be well advised to include the following line of code to prevent an external window from popping up and to have your diagram included in the notebook:

%matplotlib inline

The code to generate a plot for our values looks like this:

import matplotlib.pyplot as plt
plt.plot(C)
plt.show()

The function plot uses the values of the array C for the ordinate, i.e. the y-axis. The indices of the array C are taken as values for the abscissa, i.e. the x-axis.



Memory Consumption: ndarray and list

The main benefits of using NumPy arrays should be lower memory consumption and better runtime behaviour. In this subchapter we look at the memory usage of NumPy arrays and compare it to the memory consumption of Python lists.

Python lists: internal memory structure

To calculate the memory consumption of the list from the above picture, we will use the function getsizeof from the module sys.

from sys import getsizeof as size
lst = [24, 12, 57]
size_of_list_object = size(lst)   # only green box
size_of_elements = len(lst) * size(lst[0]) # 24, 12, 57
total_list_size = size_of_list_object + size_of_elements
print("Size without the size of the elements: ", size_of_list_object)
print("Size of all the elements: ", size_of_elements)
print("Total size of list, including elements: ", total_list_size)
This gets us the following output:
Size without the size of the elements:  88
Size of all the elements:  84
Total size of list, including elements:  172

The size of a Python list consists of the general list information, the size needed for the references to the elements and the size of all the elements of the list. If we apply sys.getsizeof to a list, we get only the size without the size of the elements.

We will now check how the memory usage changes if we add another integer element to the list. We also look at an empty list:

lst = [24, 12, 57, 42]
size_of_list_object = size(lst)   # only green box
size_of_elements = len(lst) * size(lst[0]) # 24, 12, 57, 42
total_list_size = size_of_list_object + size_of_elements
print("Size without the size of the elements: ", size_of_list_object)
print("Size of all the elements: ", size_of_elements)
print("Total size of list, including elements: ", total_list_size)
 
lst = []
print("Empty list size: ", size(lst))
The above code returned the following result:
Size without the size of the elements:  96
Size of all the elements:  112
Total size of list, including elements:  208
Empty list size:  64

We can conclude from this that every new element requires another eight bytes for the reference to the new object. The new integer object itself consumes 28 bytes. For the Python build used here, the size of a list "lst" without the size of the elements can be calculated with:

64 + 8 * len(lst)

To get the complete size of an arbitrary list of integers, we have to add the sum of all the sizes of the integers.
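
The calculation above multiplies the size of the first element by the length of the list, which is only correct as long as all elements have the same size. The small sketch below sums the sizes of the individual elements instead. The function name total_size is a hypothetical helper, not part of sys, and the concrete numbers depend on the Python version:

```python
from sys import getsizeof

def total_size(lst):
    """Size of the list object plus the sizes of its elements, in bytes.

    Assumes a flat list; nested containers are not followed and an
    object referenced several times is counted several times.
    """
    return getsizeof(lst) + sum(getsizeof(element) for element in lst)

small = [24, 12, 57]
mixed = [24, 12, 10**100]    # the last integer needs far more digits

print(total_size(small))
print(total_size(mixed))     # larger, because 10**100 occupies more bytes than 57
```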

We will now examine the memory consumption of a numpy.array. For this purpose, we will have a look at the implementation in the following picture:

Numpy arrays: internal memory structure

We will create this numpy array and calculate the memory usage:

a = np.array([24, 12, 57])
print(size(a))
After having executed the Python code above we received the following:
120

We get the memory usage for the general array information by creating an empty array:

e = np.array([])
print(size(e))
The above Python code returned the following:
96

We can see that the difference between the empty array "e" and the array "a" with three integers is 24 bytes. This means that an arbitrary integer array of length "n" in numpy needs

96 + n * 8 Bytes

whereas a list of integers needs, as we have seen before

64 + 8 * len(lst) + 28 * len(lst)

This is a lower bound, as Python integers can use more than 28 bytes.
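
The data part of the estimate for arrays can be read off the array itself: the attribute itemsize holds the number of bytes per element and nbytes the size of all elements together. In the short sketch below, only the data part is commented with values, because the fixed overhead reported by getsizeof depends on the Python and NumPy versions:

```python
import numpy as np
from sys import getsizeof

a = np.array([24, 12, 57], dtype=np.int64)

print(a.itemsize)    # 8 bytes per int64 element
print(a.nbytes)      # 24 = 3 * 8 bytes for the element data

# Overhead of the array object itself, without the element data
print(getsizeof(a) - a.nbytes)
```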

When we define a NumPy array, NumPy automatically chooses a fixed integer size, in our example "int64". We can specify the size of the integers when we define an array:

a = np.array([24, 12, 57], np.int8)
print(size(a) - 96)
a = np.array([24, 12, 57], np.int16)
print(size(a) - 96)
a = np.array([24, 12, 57], np.int32)
print(size(a) - 96)
a = np.array([24, 12, 57], np.int64)
print(size(a) - 96)
After having executed the Python code above we received the following:
3
6
12
24
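
A smaller integer type saves memory but also narrows the range of values that can be stored. The limits of each type can be queried with np.iinfo; the loop below is a small sketch of this:

```python
import numpy as np

for int_type in (np.int8, np.int16, np.int32, np.int64):
    info = np.iinfo(int_type)
    # itemsize is the number of bytes one element of this type occupies
    print(np.dtype(int_type).itemsize, info.min, info.max)
```

The first line of output shows, for example, that an int8 element occupies one byte and can hold values from -128 to 127.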



Time Comparison between Python Lists and Numpy Arrays

One of the main advantages of NumPy is its speed compared to standard Python. Let's look at the following functions:

import time
size_of_vec = 1000
def pure_python_version():
    t1 = time.time()
    X = range(size_of_vec)
    Y = range(size_of_vec)
    Z = [X[i] + Y[i] for i in range(len(X)) ]
    return time.time() - t1
def numpy_version():
    t1 = time.time()
    X = np.arange(size_of_vec)
    Y = np.arange(size_of_vec)
    Z = X + Y
    return time.time() - t1

Let's call these functions and see the time consumption:

t1 = pure_python_version()
t2 = numpy_version()
print(t1, t2)
print("Numpy is in this example " + str(t1/t2) + " times faster!")
After having executed the Python code above we received the following:
0.00036597251892089844 0.00028777122497558594
Numpy is in this example 1.2717481358740679 times faster!
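
A single measurement of such a short run is easily distorted by noise and fixed overhead, which may explain the modest factor. The sketch below is a slightly more robust variant. It assumes that time.perf_counter, a clock intended for measuring short durations, replaces time.time and that a larger vector is used. The helper name measure and the size 100_000 are choices made for this illustration only:

```python
import time
import numpy as np

size_of_vec = 100_000

def measure(func):
    """Return the wall-clock time in seconds of a single call of func."""
    start = time.perf_counter()
    func()
    return time.perf_counter() - start

def pure_python_version():
    X = range(size_of_vec)
    Y = range(size_of_vec)
    return [X[i] + Y[i] for i in range(len(X))]

def numpy_version():
    X = np.arange(size_of_vec)
    Y = np.arange(size_of_vec)
    return X + Y

t1 = measure(pure_python_version)
t2 = measure(numpy_version)
print(t1, t2)
```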

An easier and above all better way to measure the times is to use the timeit module. We will use the Timer class in the following script.

The constructor of a Timer object takes a statement to be timed, an additional statement used for setup, and a timer function. Both statements default to 'pass'.

The statements may contain newlines, as long as they don't contain multi-line string literals.
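
A minimal sketch of these constructor arguments follows. The statement, the setup code and the variable name values are made up for this illustration. The setup statement runs once per call of timeit and is not part of the measured time:

```python
from timeit import Timer

# stmt is the statement to be timed, setup prepares the names it needs
timer_obj = Timer(stmt="total = sum(values)",
                  setup="values = list(range(1000))")

# Execute the statement 1000 times and return the total time in seconds
print(timer_obj.timeit(number=1000))

# A statement may span several lines
multi_line = Timer("total = 0\nfor v in values:\n    total += v",
                   setup="values = list(range(1000))")
print(multi_line.timeit(number=1000))
```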

import numpy as np
from timeit import Timer
size_of_vec = 1000
X_list = range(size_of_vec)
Y_list = range(size_of_vec)
X = np.arange(size_of_vec)
Y = np.arange(size_of_vec)
def pure_python_version():
    Z = [X_list[i] + Y_list[i] for i in range(len(X_list)) ]
def numpy_version():
    Z = X + Y
#timer_obj = Timer("x = x + 1", "x = 0")
timer_obj1 = Timer("pure_python_version()", "from __main__ import pure_python_version")
timer_obj2 = Timer("numpy_version()", "from __main__ import numpy_version")
print(timer_obj1.timeit(10))
print(timer_obj2.timeit(10))
The code above returned the following:
0.0038678800010529812
0.0001549119988339953
print(timer_obj1.repeat(repeat=3, number=10000))
print(timer_obj2.repeat(repeat=3, number=10000))
The above Python code returned the following output:
[3.86909913800082, 3.7541254040006606, 3.9974926359991514]
[0.01793182599976717, 0.01334530399981304, 0.011095521000243025]
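
When several repetitions are available, the smallest value of each list is usually the most meaningful one, because larger values tend to reflect interference from other processes rather than the code itself. The sketch below derives a speed-up factor this way, with fewer runs than above to keep it quick. The names best_python and best_numpy are introduced here for illustration:

```python
import numpy as np
from timeit import Timer

size_of_vec = 1000
X_list = range(size_of_vec)
Y_list = range(size_of_vec)
X = np.arange(size_of_vec)
Y = np.arange(size_of_vec)

def pure_python_version():
    return [X_list[i] + Y_list[i] for i in range(len(X_list))]

def numpy_version():
    return X + Y

# Passing the functions directly avoids the import from __main__
best_python = min(Timer(pure_python_version).repeat(repeat=3, number=1000))
best_numpy = min(Timer(numpy_version).repeat(repeat=3, number=1000))

print(best_python, best_numpy)
print("Speed-up factor:", best_python / best_numpy)
```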