# 28. Dealing with NaN

## Introduction

NaN was introduced, at least officially, by the IEEE Standard for Floating-Point Arithmetic (IEEE 754). It is a technical standard for floating-point computation established in 1985 - many years before Python was invented, and even a longer time befor Pandas was created - by the Institute of Electrical and Electronics Engineers (IEEE). It was introduced to solve problems found in many floating point implementations that made them difficult to use reliably and portably.

This standard added NaN to the arithmetic formats: "arithmetic formats: sets of binary and decimal floating-point data, which consist of finite numbers (including signed zeros and subnormal numbers), infinities, and special 'not a number' values (NaNs)"

## 'nan' in Python

Python knows NaN values as well. We can create it with "float":

n1 = float("nan")
n2 = float("Nan")
n3 = float("NaN")
n4 = float("NAN")
print(n1, n2, n3, n4)


### OUTPUT:

nan nan nan nan


"nan" is also part of the math module since Python 3.5:

import math
n1 = math.nan
print(n1)
print(math.isnan(n1))


### OUTPUT:

nan
True


Warning: Do not perform comparison between "NaN" values or "Nan" values and regular numbers. A simple or simplified reasoning is this: Two things are "not a number", so they can be anything but most probably not the same. Above all there is no way of ordering NaNs:

print(n1 == n2)
print(n1 == 0)
print(n1 == 100)
print(n2 < 0)


### OUTPUT:

False
False
False
False


## NaN in Pandas

### Example without NaNs

Before we will work with NaN data, we will process a file without any NaN values. The data file temperatures.csv contains the temperature data of six sensors taken every 15 minuts between 6:00 to 19.15 o'clock.

Reading in the data file can be done with the read_csv function:

import pandas as pd

sep=";",
decimal=",")
df.loc[:3]

time sensor1 sensor2 sensor3 sensor4 sensor5 sensor6
0 06:00:00 14.3 13.7 14.2 14.3 13.5 13.6
1 06:15:00 14.5 14.5 14.0 15.0 14.5 14.7
2 06:30:00 14.6 15.1 14.8 15.3 14.0 14.2
3 06:45:00 14.8 14.5 15.6 15.2 14.7 14.6

We want to calculate the avarage temperatures per measuring point over all the sensors. We can use the DataFrame method 'mean'. If we use 'mean' without parameters it will sum up the sensor columns, which isn't what we want, but it may be interesting as well:

df.mean()


### OUTPUT:

sensor1    19.775926
sensor2    19.757407
sensor3    19.840741
sensor4    20.187037
sensor5    19.181481
sensor6    19.437037
dtype: float64
average_temp_series = df.mean(axis=1)
print(average_temp_series[:8])


### OUTPUT:

0    13.933333
1    14.533333
2    14.666667
3    14.900000
4    15.083333
5    15.116667
6    15.283333
7    15.116667
dtype: float64

sensors = df.columns.values[1:]
# all columns except the time column will be removed:
df = df.drop(sensors, axis=1)
print(df[:5])


### OUTPUT:

       time
0  06:00:00
1  06:15:00
2  06:30:00
3  06:45:00
4  07:00:00


We will assign now the average temperature values as a new column 'temperature':

# best practice:
df = df.assign(temperature=average_temp_series)  # inplace option not available

# alternatively:
#df.loc[:,"temperature"] = average_temp_series

df[:3]

time temperature
0 06:00:00 13.933333
1 06:15:00 14.533333
2 06:30:00 14.666667

### Example with NaNs

We will use now a data file similar to the previous temperature csv, but this time we will have to cope with NaN data, when the sensors malfunctioned.

We will create a temperature DataFrame, in which some data is not defined, i.e. NaN.

We will use and change the data from the the temperatures.csv file:

import pandas as pd
sep=";",
index_col=0,
decimal=",")

temp_df[:8]

sensor1 sensor2 sensor3 sensor4 sensor5 sensor6
time
06:00:00 14.3 13.7 14.2 14.3 13.5 13.6
06:15:00 14.5 14.5 14.0 15.0 14.5 14.7
06:30:00 14.6 15.1 14.8 15.3 14.0 14.2
06:45:00 14.8 14.5 15.6 15.2 14.7 14.6
07:00:00 15.0 14.9 15.7 15.6 14.0 15.3
07:15:00 15.2 15.2 14.6 15.3 15.5 14.9
07:30:00 15.4 15.3 15.6 15.6 14.7 15.1
07:45:00 15.5 14.8 15.4 15.5 14.6 14.9

We will randomly assign some NaN values into the data frame. For this purpose, we will use the where method from DataFrame. If we apply where to a DataFrame object df, i.e. df.where(cond, other_df), it will return an object of same shape as df and whose corresponding entries are from df where the corresponding element of cond is True and otherwise are taken from other_df.

Before we continue with our task, we will demonstrate the way of working of where with some simple examples:

s = pd.Series(range(5)) s.where(s > 0)

import numpy as np

A = np.random.randint(1, 30, (4, 2))

df = pd.DataFrame(A, columns=['Foo', 'Bar'])
m = df % 2 == 0
df.where(m, -df, inplace=True)
df

Foo Bar
0 18 -5
1 22 -7
2 14 -3
3 16 -23

For our task, we need to create a DataFrame 'nan_df', which consists purely of NaN values and has the same shape as our temperature DataFrame 'temp_df'. We will use this DataFrame in 'where'. We also need a DataFrame with the conditions "df_bool" as True values. For this purpose we will create a DataFrame with random values between 0 and 1 and by applying 'random_df < 0.8' we get the df_bool DataFrame, in which about 20 % of the values will be True:

random_df = pd.DataFrame(np.random.random(size=temp_df.shape),
columns=temp_df.columns.values,
index=temp_df.index)
nan_df = pd.DataFrame(np.nan,
columns=temp_df.columns.values,
index=temp_df.index)
df_bool = random_df<0.8

df_bool[:5]

sensor1 sensor2 sensor3 sensor4 sensor5 sensor6
time
06:00:00 True True True True True True
06:15:00 True True True False True True
06:30:00 True True True True True False
06:45:00 True True True True False True
07:00:00 True True True True True True

Finally, we have everything toghether to create our DataFrame with distrubed measurements:

disturbed_data = temp_df.where(df_bool, nan_df)

disturbed_data.to_csv("/data1/temperatures_with_NaN.csv")
disturbed_data[:10]

sensor1 sensor2 sensor3 sensor4 sensor5 sensor6
time
06:00:00 14.3 13.7 14.2 14.3 13.5 13.6
06:15:00 14.5 14.5 14.0 NaN 14.5 14.7
06:30:00 14.6 15.1 14.8 15.3 14.0 NaN
06:45:00 14.8 14.5 15.6 15.2 NaN 14.6
07:00:00 15.0 14.9 15.7 15.6 14.0 15.3
07:15:00 NaN 15.2 14.6 15.3 15.5 14.9
07:30:00 15.4 15.3 15.6 15.6 14.7 NaN
07:45:00 15.5 14.8 15.4 15.5 14.6 14.9
08:00:00 15.7 15.6 15.9 16.2 15.4 15.4
08:15:00 15.9 15.8 15.9 NaN 16.0 16.2

## Using dropna on the DataFrame

'dropna' is a DataFrame method. If we call this method without arguments, it will return an object where every row is ommitted, in which data are missing, i.e. some value is NaN:

df = disturbed_data.dropna()
df

sensor1 sensor2 sensor3 sensor4 sensor5 sensor6
time
06:00:00 14.3 13.7 14.2 14.3 13.5 13.6
07:00:00 15.0 14.9 15.7 15.6 14.0 15.3
07:45:00 15.5 14.8 15.4 15.5 14.6 14.9
08:00:00 15.7 15.6 15.9 16.2 15.4 15.4
09:45:00 18.4 19.0 19.0 19.4 18.4 18.3
10:00:00 19.0 19.7 18.8 18.9 17.5 18.9
11:45:00 24.2 23.1 25.3 23.7 24.5 24.8
12:15:00 23.8 23.7 24.8 25.1 22.2 22.4
12:45:00 23.4 22.6 23.7 24.4 21.8 23.8
15:00:00 21.8 22.9 22.2 22.8 21.0 21.9
16:00:00 21.1 21.6 20.7 20.6 19.9 21.4
17:00:00 20.4 19.5 20.3 21.8 19.9 19.2
17:15:00 20.3 20.7 19.6 21.3 19.8 19.0
17:45:00 19.9 20.4 19.4 21.1 20.0 20.5
18:15:00 19.6 19.9 19.2 19.9 20.0 18.6
18:30:00 19.5 19.1 19.2 19.7 18.3 18.3

'dropna' can also be used to drop all columns in which some values are NaN. This can be achieved by assigning 1 to the axis parameter. The default value is False, as we have seen in our previous example. As every column from our sensors contain NaN values, they will all disappear:

df = disturbed_data.dropna(axis=1)
df[:5]


time
06:00:00
06:15:00
06:30:00
06:45:00
07:00:00

Let us change our task: We only want to get rid of all the rows, which contain more than one NaN value. The parameter 'thresh' is ideal for this task. It can be set to the minimum number. 'thresh' is set to an integer value, which defines the minimum number of non-NaN values. We have six temperature values in every row. Setting 'thresh' to 5 makes sure that we will have at least 5 valid floats in every remaining row:

cleansed_df = disturbed_data.dropna(thresh=5, axis=0)
cleansed_df[:7]

sensor1 sensor2 sensor3 sensor4 sensor5 sensor6
time
06:00:00 14.3 13.7 14.2 14.3 13.5 13.6
06:15:00 14.5 14.5 14.0 NaN 14.5 14.7
06:30:00 14.6 15.1 14.8 15.3 14.0 NaN
06:45:00 14.8 14.5 15.6 15.2 NaN 14.6
07:00:00 15.0 14.9 15.7 15.6 14.0 15.3
07:15:00 NaN 15.2 14.6 15.3 15.5 14.9
07:30:00 15.4 15.3 15.6 15.6 14.7 NaN

Now we will calculate the mean values again, but this time on the DataFrame 'cleansed_df', i.e. where we have taken out all the rows, where more than one NaN value occurred.

average_temp_series = cleansed_df.mean(axis=1)
sensors = cleansed_df.columns.values
df = cleansed_df.drop(sensors, axis=1)

# best practice:
df = df.assign(temperature=average_temp_series)  # inplace option not available
df[:6]

temperature
time
06:00:00 13.933333
06:15:00 14.440000
06:30:00 14.760000
06:45:00 14.940000
07:00:00 15.083333
07:15:00 15.100000