python-course.eu

32. Dealing with NaN

By Bernd Klein. Last modified: 26 Apr 2023.

Introduction

Dealing with Nan

NaN was introduced, at least officially, by the IEEE Standard for Floating-Point Arithmetic (IEEE 754). It is a technical standard for floating-point computation established in 1985 - many years before Python was invented, and even a longer time befor Pandas was created - by the Institute of Electrical and Electronics Engineers (IEEE). It was introduced to solve problems found in many floating point implementations that made them difficult to use reliably and portably.

This standard added NaN to the arithmetic formats: "arithmetic formats: sets of binary and decimal floating-point data, which consist of finite numbers (including signed zeros and subnormal numbers), infinities, and special 'not a number' values (NaNs)"

Live Python training

instructor-led training course

Enjoying this page? We offer live Python training courses covering the content of this site.

See: Live Python courses overview

Enrol here

'nan' in Python

Python knows NaN values as well. We can create it with "float":

n1 = float("nan")
n2 = float("Nan")
n3 = float("NaN")
n4 = float("NAN")
print(n1, n2, n3, n4)

OUTPUT:

nan nan nan nan

"nan" is also part of the math module since Python 3.5:

import math
n1 = math.nan
print(n1)
print(math.isnan(n1))

OUTPUT:

nan
True

Warning: Do not perform comparison between "NaN" values or "Nan" values and regular numbers. A simple or simplified reasoning is this: Two things are "not a number", so they can be anything but most probably not the same. Above all there is no way of ordering NaNs:

print(n1 == n2)
print(n1 == 0)
print(n1 == 100)
print(n2 < 0)

OUTPUT:

False
False
False
False

NaN in Pandas

Example without NaNs

Before we will work with NaN data, we will process a file without any NaN values. The data file temperatures.csv contains the temperature data of six sensors taken every 15 minuts between 6:00 to 19.15 o'clock.

Reading in the data file can be done with the read_csv function:

import pandas as pd

df = pd.read_csv("../data1/temperatures.csv",
                 sep=";",
                 decimal=",",
                 index_col=0)
df[:5]
sensor1 sensor2 sensor3 sensor4 sensor5 sensor6
time
06:00:00 14.3 13.7 14.2 14.3 13.5 13.6
06:15:00 14.5 14.5 14.0 15.0 14.5 14.7
06:30:00 14.6 15.1 14.8 15.3 14.0 14.2
06:45:00 14.8 14.5 15.6 15.2 14.7 14.6
07:00:00 15.0 14.9 15.7 15.6 14.0 15.3

We want to calculate the avarage temperatures per measuring point over all the sensors. We can use the DataFrame method 'mean'. If we use 'mean' without parameters it will sum up the sensor columns, which isn't what we want, but it may be interesting as well:

df.mean()

OUTPUT:

sensor1    19.775926
sensor2    19.757407
sensor3    19.840741
sensor4    20.187037
sensor5    19.181481
sensor6    19.437037
dtype: float64
average_temp_series = df.mean(axis=1)
print(average_temp_series[:8])

OUTPUT:

time
06:00:00    13.933333
06:15:00    14.533333
06:30:00    14.666667
06:45:00    14.900000
07:00:00    15.083333
07:15:00    15.116667
07:30:00    15.283333
07:45:00    15.116667
dtype: float64
sensors = df.columns.values[0:]
# all columns will be removed:
df = df.drop(sensors, axis=1)
print(df[:5])

OUTPUT:

Empty DataFrame
Columns: []
Index: [06:00:00, 06:15:00, 06:30:00, 06:45:00, 07:00:00]

We will assign now the average temperature values as a new column 'temperature':

# best practice:
df = df.assign(temperature=average_temp_series)  # inplace option not available

# alternatively:
#df.loc[:,"temperature"] = average_temp_series
df[:3]
temperature
time
06:00:00 13.933333
06:15:00 14.533333
06:30:00 14.666667

Example with NaNs

We will use now a data file similar to the previous temperature csv, but this time we will have to cope with NaN data, when the sensors malfunctioned.

We will create a temperature DataFrame, in which some data is not defined, i.e. NaN.

We will use and change the data from the the temperatures.csv file:

import pandas as pd
temp_df = pd.read_csv("../data1/temperatures.csv",
                      sep=";",
                      index_col=0,
                      decimal=",")

temp_df[:8]
sensor1 sensor2 sensor3 sensor4 sensor5 sensor6
time
06:00:00 14.3 13.7 14.2 14.3 13.5 13.6
06:15:00 14.5 14.5 14.0 15.0 14.5 14.7
06:30:00 14.6 15.1 14.8 15.3 14.0 14.2
06:45:00 14.8 14.5 15.6 15.2 14.7 14.6
07:00:00 15.0 14.9 15.7 15.6 14.0 15.3
07:15:00 15.2 15.2 14.6 15.3 15.5 14.9
07:30:00 15.4 15.3 15.6 15.6 14.7 15.1
07:45:00 15.5 14.8 15.4 15.5 14.6 14.9

We will randomly assign some NaN values into the data frame. For this purpose, we will use the where method from DataFrame. If we apply where to a DataFrame object df, i.e. df.where(cond, other_df), it will return an object of same shape as df and whose corresponding entries are from df where the corresponding element of cond is True and otherwise are taken from other_df.

Before we continue with our task, we will demonstrate the way of working of where with some simple examples:

s = pd.Series(range(5)) s.where(s > 0)

import numpy as np

A = np.random.randint(1, 30, (4, 2))

df = pd.DataFrame(A, columns=['Foo', 'Bar'])
m = df % 2 == 0
df.where(m, -df, inplace=True)
df
Foo Bar
0 18 -27
1 14 -23
2 28 -21
3 6 -23

For our task, we need to create a DataFrame 'nan_df', which consists purely of NaN values and has the same shape as our temperature DataFrame 'temp_df'. We will use this DataFrame in 'where'. We also need a DataFrame with the conditions "df_bool" as True values. For this purpose we will create a DataFrame with random values between 0 and 1 and by applying 'random_df < 0.8' we get the df_bool DataFrame, in which about 20 % of the values will be True:

random_df = pd.DataFrame(np.random.random(size=temp_df.shape), 
                         columns=temp_df.columns.values, 
                         index=temp_df.index)
nan_df = pd.DataFrame(np.nan,
                      columns=temp_df.columns.values, 
                      index=temp_df.index)
df_bool = random_df<0.8

df_bool[:5]
sensor1 sensor2 sensor3 sensor4 sensor5 sensor6
time
06:00:00 False True True True True False
06:15:00 False True True True False True
06:30:00 False True True True True True
06:45:00 True True True True True False
07:00:00 True False True True True False

Finally, we have everything toghether to create our DataFrame with distrubed measurements:

disturbed_data = temp_df.where(df_bool, nan_df)

disturbed_data.to_csv("../data1/temperatures_with_NaN.csv")
disturbed_data[:10]
sensor1 sensor2 sensor3 sensor4 sensor5 sensor6
time
06:00:00 NaN 13.7 14.2 14.3 13.5 NaN
06:15:00 NaN 14.5 14.0 15.0 NaN 14.7
06:30:00 NaN 15.1 14.8 15.3 14.0 14.2
06:45:00 14.8 14.5 15.6 15.2 14.7 NaN
07:00:00 15.0 NaN 15.7 15.6 14.0 NaN
07:15:00 15.2 15.2 14.6 15.3 15.5 14.9
07:30:00 NaN 15.3 15.6 NaN 14.7 15.1
07:45:00 15.5 14.8 15.4 15.5 14.6 14.9
08:00:00 15.7 15.6 15.9 16.2 15.4 15.4
08:15:00 15.9 15.8 NaN 16.9 NaN 16.2

Live Python training

instructor-led training course

Enjoying this page? We offer live Python training courses covering the content of this site.

See: Live Python courses overview

Upcoming online Courses

Enrol here

Using dropna on the DataFrame

'dropna' is a DataFrame method. If we call this method without arguments, it will return an object where every row is ommitted, in which data are missing, i.e. some value is NaN:

df = disturbed_data.dropna()
df
sensor1 sensor2 sensor3 sensor4 sensor5 sensor6
time
07:15:00 15.2 15.2 14.6 15.3 15.5 14.9
07:45:00 15.5 14.8 15.4 15.5 14.6 14.9
08:00:00 15.7 15.6 15.9 16.2 15.4 15.4
08:30:00 16.1 15.7 16.1 15.9 14.9 15.2
09:15:00 17.1 17.5 17.5 17.3 16.6 16.8
09:45:00 18.4 19.0 19.0 19.4 18.4 18.3
10:30:00 20.4 19.4 20.0 21.0 20.2 19.8
12:15:00 23.8 23.7 24.8 25.1 22.2 22.4
13:30:00 22.9 21.9 22.9 24.3 22.9 23.0
14:15:00 22.3 22.9 21.9 22.3 22.5 21.1
16:00:00 21.1 21.6 20.7 20.6 19.9 21.4
17:45:00 19.9 20.4 19.4 21.1 20.0 20.5
18:30:00 19.5 19.1 19.2 19.7 18.3 18.3

'dropna' can also be used to drop all columns in which some values are NaN. This can be achieved by assigning 1 to the axis parameter. The default value is False, as we have seen in our previous example. As every column from our sensors contain NaN values, they will all disappear:

df = disturbed_data.dropna(axis=1)
df[:5]
time
06:00:00
06:15:00
06:30:00
06:45:00
07:00:00

Let us change our task: We only want to get rid of all the rows, which contain more than one NaN value. The parameter 'thresh' is ideal for this task. It can be set to the minimum number. 'thresh' is set to an integer value, which defines the minimum number of non-NaN values. We have six temperature values in every row. Setting 'thresh' to 5 makes sure that we will have at least 5 valid floats in every remaining row:

cleansed_df = disturbed_data.dropna(thresh=3, axis=0)
cleansed_df[:7]
sensor1 sensor2 sensor3 sensor4 sensor5 sensor6
time
06:00:00 NaN 13.7 14.2 14.3 13.5 NaN
06:15:00 NaN 14.5 14.0 15.0 NaN 14.7
06:30:00 NaN 15.1 14.8 15.3 14.0 14.2
06:45:00 14.8 14.5 15.6 15.2 14.7 NaN
07:00:00 15.0 NaN 15.7 15.6 14.0 NaN
07:15:00 15.2 15.2 14.6 15.3 15.5 14.9
07:30:00 NaN 15.3 15.6 NaN 14.7 15.1

Now we will calculate the mean values again, but this time on the DataFrame 'cleansed_df', i.e. where we have taken out all the rows, where more than one NaN value occurred.

average_temp_series = cleansed_df.mean(axis=1)
sensors = cleansed_df.columns.values
df = cleansed_df.drop(sensors, axis=1)

# best practice:
df = df.assign(temperature=average_temp_series)  # inplace option not available
df[:6]
temperature
time
06:00:00 13.925000
06:15:00 14.550000
06:30:00 14.680000
06:45:00 14.960000
07:00:00 15.075000
07:15:00 15.116667

Live Python training

instructor-led training course

Enjoying this page? We offer live Python training courses covering the content of this site.

See: Live Python courses overview

Upcoming online Courses

Enrol here