Dealing with NaN



Introduction

Dealing with Nan

NaN was introduced, at least officially, by the IEEE Standard for Floating-Point Arithmetic (IEEE 754). It is a technical standard for floating-point computation established in 1985 - many years before Python was invented, and even a longer time befor Pandas was created - by the Institute of Electrical and Electronics Engineers (IEEE). It was introduced to solve problems found in many floating point implementations that made them difficult to use reliably and portably.

This standard added NaN to the arithmetic formats: "arithmetic formats: sets of binary and decimal floating-point data, which consist of finite numbers (including signed zeros and subnormal numbers), infinities, and special 'not a number' values (NaNs)"



'nan' in Python

Python knows NaN values as well. We can create it with "float":

n1 = float("nan")
n2 = float("Nan")
n3 = float("NaN")
n4 = float("NAN")
print(n1, n2, n3, n4)
nan nan nan nan

"nan" is also part of the math module since Python 3.5:

import math
n1 = math.nan
print(n1)
print(math.isnan(n1))
nan
True

Warning: Do not perform comparison between "NaN" values or "Nan" values and regular numbers. A simple or simplified reasoning is this: Two things are "not a number", so they can be anything but most probably not the same. Above all there is no way of ordering NaNs:

print(n1 == n2)
print(n1 == 0)
print(n1 == 100)
print(n2 < 0)
False
False
False
False



NaN in Pandas

Example without NaNs

Before we will work with NaN data, we will process a file without any NaN values. The data file temperatures.csv contains the temperature data of six sensors taken every 15 minuts between 6:00 to 19.15 o'clock.

Reading in the data file can be done with the read_csv function:

import pandas as pd
df = pd.read_csv("data1/temperatures.csv",
                 sep=";",
                 decimal=",")
df.loc[:3]
The previous code returned the following result:
time sensor1 sensor2 sensor3 sensor4 sensor5 sensor6
0 06:00:00 14.3 13.7 14.2 14.3 13.5 13.6
1 06:15:00 14.5 14.5 14.0 15.0 14.5 14.7
2 06:30:00 14.6 15.1 14.8 15.3 14.0 14.2
3 06:45:00 14.8 14.5 15.6 15.2 14.7 14.6

We want to calculate the avarage temperatures per measuring point over all the sensors. We can use the DataFrame method 'mean'. If we use 'mean' without parameters it will sum up the sensor columns, which isn't what we want, but it may be interesting as well:

df.mean()
The above Python code returned the following result:
sensor1    19.775926
sensor2    19.757407
sensor3    19.840741
sensor4    20.187037
sensor5    19.181481
sensor6    19.437037
dtype: float64
average_temp_series = df.mean(axis=1)
print(average_temp_series[:8])
0    13.933333
1    14.533333
2    14.666667
3    14.900000
4    15.083333
5    15.116667
6    15.283333
7    15.116667
dtype: float64
sensors = df.columns.values[1:]
# all columns except the time column will be removed:
df = df.drop(sensors, axis=1)
print(df[:5])
       time
0  06:00:00
1  06:15:00
2  06:30:00
3  06:45:00
4  07:00:00

We will assign now the average temperature values as a new column 'temperature':

# best practice:
df = df.assign(temperature=average_temp_series)  # inplace option not available
# alternatively:
#df.loc[:,"temperature"] = average_temp_series
df[:3]
The above code returned the following result:
time temperature
0 06:00:00 13.933333
1 06:15:00 14.533333
2 06:30:00 14.666667

Example with NaNs

We will use now a data file similar to the previous temperature csv, but this time we will have to cope with NaN data, when the sensors malfunctioned.

We will create a temperature DataFrame, in which some data is not defined, i.e. NaN.

We will use and change the data from the the temperatures.csv file:

temp_df = pd.read_csv("data1/temperatures.csv",
                      sep=";",
                      index_col=0,
                      decimal=",")

We will randomly assign some NaN values into the data frame. For this purpose, we will use the where method from DataFrame. If we apply where to a DataFrame object df, i.e. df.where(cond, other_df), it will return an object of same shape as df and whose corresponding entries are from df where the corresponding element of cond is True and otherwise are taken from other_df.

Before we continue with our task, we will demonstrate the way of working of where with some simple examples:

s = pd.Series(range(5)) s.where(s > 0)

import numpy as np
A = np.random.randint(1, 30, (4, 2))
df = pd.DataFrame(A, columns=['Foo', 'Bar'])
m = df % 2 == 0
df.where(m, -df, inplace=True)
df
After having executed the Python code above we received the following:
Foo Bar
0 -9 4
1 -7 24
2 -29 12
3 -15 2

For our task, we need to create a DataFrame 'nan_df', which consists purely of NaN values and has the same shape as our temperature DataFrame 'temp_df'. We will use this DataFrame in 'where'. We also need a DataFrame with the conditions "df_bool" as True values. For this purpose we will create a DataFrame with random values between 0 and 1 and by applying 'random_df < 0.8' we get the df_bool DataFrame, in which about 20 % of the values will be True:

random_df = pd.DataFrame(np.random.random(size=temp_df.shape), 
                         columns=temp_df.columns.values, 
                         index=temp_df.index)
nan_df = pd.DataFrame(np.nan,
                      columns=temp_df.columns.values, 
                      index=temp_df.index)
df_bool = random_df<0.8
df_bool[:5]
The above Python code returned the following output:
sensor1 sensor2 sensor3 sensor4 sensor5 sensor6
time
06:00:00 False True True True True True
06:15:00 True True True True True True
06:30:00 True True False False False False
06:45:00 True False True True True True
07:00:00 False True True False True True

Finally, we have everything toghether to create our DataFrame with distrubed measurements:

disturbed_data = temp_df.where(df_bool, nan_df)
disturbed_data.to_csv("data1/temperatures_with_NaN.csv")
disturbed_data[:10]
We received the following result:
sensor1 sensor2 sensor3 sensor4 sensor5 sensor6
time
06:00:00 NaN 13.7 14.2 14.3 13.5 13.6
06:15:00 14.5 14.5 14.0 15.0 14.5 14.7
06:30:00 14.6 15.1 NaN NaN NaN NaN
06:45:00 14.8 NaN 15.6 15.2 14.7 14.6
07:00:00 NaN 14.9 15.7 NaN 14.0 15.3
07:15:00 NaN 15.2 14.6 15.3 15.5 14.9
07:30:00 15.4 15.3 15.6 15.6 NaN 15.1
07:45:00 15.5 14.8 15.4 15.5 14.6 14.9
08:00:00 15.7 15.6 15.9 NaN 15.4 NaN
08:15:00 15.9 15.8 NaN 16.9 NaN 16.2

Using dropna on the DataFrame

'dropna' is a DataFrame method. If we call this method without arguments, it will return an object where every row is ommitted, in which data are missing, i.e. some value is NaN:

df = disturbed_data.dropna()
df
The code above returned the following:
sensor1 sensor2 sensor3 sensor4 sensor5 sensor6
time
06:15:00 14.5 14.5 14.0 15.0 14.5 14.7
07:45:00 15.5 14.8 15.4 15.5 14.6 14.9
09:00:00 16.8 17.3 17.7 17.8 15.9 16.1
10:00:00 19.0 19.7 18.8 18.9 17.5 18.9
12:30:00 23.6 24.2 23.6 24.1 22.1 22.5
13:45:00 22.7 23.3 22.2 22.2 21.0 23.0
14:00:00 22.5 23.0 22.1 24.1 22.5 22.7
15:30:00 21.4 21.3 21.7 21.9 21.0 21.7
16:15:00 20.9 20.6 20.8 20.5 20.3 21.6
17:30:00 20.1 20.5 19.7 19.7 18.7 19.7
19:00:00 19.2 18.7 20.1 19.9 18.3 19.3
19:15:00 19.0 19.7 18.9 19.2 18.5 19.4

'dropna' can also be used to drop all columns in which some values are NaN. This can be achieved by assigning 1 to the axis parameter. The default value is False, as we have seen in our previous example. As every column from our sensors contain NaN values, they will all disappear:

df = disturbed_data.dropna(axis=1)
df[:5]
The above code returned the following:
time
06:00:00
06:15:00
06:30:00
06:45:00
07:00:00

Let us change our task: We only want to get rid of all the rows, which contain more than one NaN value. The parameter 'thresh' is ideal for this task. It can be set to the minimum number. 'thresh' is set to an integer value, which defines the minimum number of non-NaN values. We have six temperature values in every row. Setting 'thresh' to 5 makes sure that we will have at least 5 valid floats in every remaining row:

cleansed_df = disturbed_data.dropna(thresh=5, axis=0)
cleansed_df[:7]
The previous code returned the following output:
sensor1 sensor2 sensor3 sensor4 sensor5 sensor6
time
06:15:00 NaN 14.5 14.0 15.0 14.5 14.7
06:45:00 14.8 NaN 15.6 15.2 14.7 14.6
07:00:00 15.0 14.9 15.7 15.6 14.0 15.3
07:30:00 15.4 15.3 15.6 15.6 14.7 15.1
07:45:00 15.5 14.8 15.4 15.5 14.6 14.9
08:15:00 15.9 15.8 15.9 16.9 NaN 16.2
08:30:00 16.1 15.7 16.1 15.9 14.9 15.2

Now we will calculate the mean values again, but this time on the DataFrame 'cleansed_df', i.e. where we have taken out all the rows, where more than one NaN value occurred.

average_temp_series = cleansed_df.mean(axis=1)
sensors = cleansed_df.columns.values
df = cleansed_df.drop(sensors, axis=1)
# best practice:
df = df.assign(temperature=average_temp_series)  # inplace option not available
df[:6]
The above Python code returned the following output:
temperature
time
06:15:00 14.540000
06:45:00 14.980000
07:00:00 15.083333
07:30:00 15.283333
07:45:00 15.116667
08:15:00 16.140000