Accessing and Changing values of DataFrames

Pandas Replacing Values

We have seen in the previous chapters of our tutorial many ways to create Series and DataFrames. We also learned how to access and replace complete columns. This chapter of our Pandas and Python tutorial will show various ways to access and change selectively values in Pandas DataFrames and Series. We will show ways how to change single value or values matching strings or regular expressions.

For this purpose we will learn to know the methods loc, at and replace

We will work with the following DataFrame structure in our examples:

import pandas as pd

first = ('Mike', 'Dorothee', 'Tom', 'Bill', 'Pete', 'Kate')
last = ('Meyer', 'Maier', 'Meyer', 'Mayer', 'Meyr', 'Mair')
job = ('data analyst', 'programmer', 'computer scientist', 
       'data scientist', 'accountant', 'psychiatrist')
language = ('Python', 'Perl', 'Java', 'Java', 'Cobol', 'Brainfuck')

df = pd.DataFrame(list(zip(last, job, language)), 
                  columns =['last', 'job', 'language'],
                  index=first) 
df
The code above returned the following:
last job language
Mike Meyer data analyst Python
Dorothee Maier programmer Perl
Tom Meyer computer scientist Java
Bill Mayer data scientist Java
Pete Meyr accountant Cobol
Kate Mair psychiatrist Brainfuck

Changing one value in DataFrame

Pandas provides two ways, i.e. loc and at, to access or change a single value of a DataFrame. We will experiment with the height of Bill in the following Python code:

# accessing the job of Bill:
print(df.loc['Bill', 'job'])
# alternative way to access it with at:
print(df.at['Bill', 'job'])

# setting the job of Bill to 'data analyst' with 'loc'
df.loc['Bill', 'job'] = 'data analyst'
# let us check it:
print(df.loc['Bill', 'job'])

# setting the job of Bill to 'computer scientist' with 'at'
df.at['Pete', 'language'] = 'Python'
data scientist
data scientist
data analyst

The following image shows what we have done:

Replacing values in Pandas with loc and at

You will ask yourself now which one you should use? The help on the at method says the following: "Access a single value for a row/column label pair. Similar to loc, in that both provide label-based lookups. Use at if you only need to get or set a single value in a DataFrame or Series." loc on the other hand can be used to access a single value but also to access a group of rows and columns by a label or labels.

Another intestering question is about the speed of both methods in comparison. We will measure the time behaviour in the following code examples:

%timeit df.loc['Bill', 'language'] = 'Python'
179 µs ± 9.02 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit df.at['Bill', 'language'] = 'Python'
7.6 µs ± 1.5 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)

When it comes to speed the answer is clear: we should definitely use at.

replace

The syntax of replace:

replace(self, 
        to_replace=None, 
        value=None, 
        inplace=False, 
        limit=None, 
        regex=False, 
        method='pad')

This method replaces values given in to_replace with value. This differs from updating with .loc or .iloc, which requires you to specify a location to update with some value. With replace it is possible to replace values in a Series or DataFrame without knowing where they occur.

replace works both with Series and DataFrames.

We will explain the way of working of the method replace by discussing the different data types of the parameter to_replace individually:

to_replace can be numeric value, string or regular expression:

Data Type processing
numeric numeric values in the DataFrame (or Series) which are equal to to_replace will be replaced with parameter value
str If the parameter to_replace consists of a string, all the strings of the DataFrame (or Series) which exactly match this string will be replaced by the value of the parameter value
regex All strings inside of the DataFrame (or Series) which match the regular expression of to_replace will be replaced with value

Let's start with a Series example. We will change one value into another one. Be aware of the fact that replace by default creates a copy of the object in which all the values are replaced. This means that the parameter inplace is set to False by default.

s = pd.Series([27, 33, 13, 19])
s.replace(13, 42)
The previous code returned the following output:
0    27
1    33
2    42
3    19
dtype: int64

If we really want to change the object s is referencing, we should set the inplace parameter to True:

s = pd.Series([27, 33, 13, 19])
s.replace(13, 42, inplace=True)
s
The previous code returned the following:
0    27
1    33
2    42
3    19
dtype: int64

We can also change multiple values into one single value, as you can see in the following example.

s = pd.Series([0, 1, 2, 3, 4])
s.replace([0, 1, 2], 42, inplace=True)
s
The above code returned the following output:
0    42
1    42
2    42
3     3
4     4
dtype: int64

We will show now, how replace can be used on DataFrames. For this purpose we will recreate the example from the beginning of this chapter of our tutorial. It will not be exactly the same though. More people have learned Python now, but unfortunately somebody has put in some spelling errors in the word Python. Pete has become a programmer now:

import pandas as pd

first = ('Mike', 'Dorothee', 'Tom', 'Bill', 'Pete', 'Kate')
last = ('Meyer', 'Maier', 'Meyer', 'Mayer', 'Meyr', 'Mair')
job = ('data analyst', 'programmer', 'computer scientist', 
       'data scientist', 'programmer', 'psychiatrist')
language = ('Python', 'Perl', 'Java', 'Pithon', 'Pythen', 'Brainfuck')

df = pd.DataFrame(list(zip(last, job, language)), 
                  columns =['last', 'job', 'language'],
                  index=first) 

df
We received the following result:
last job language
Mike Meyer data analyst Python
Dorothee Maier programmer Perl
Tom Meyer computer scientist Java
Bill Mayer data scientist Pithon
Pete Meyr programmer Pythen
Kate Mair psychiatrist Brainfuck

Both ambitious Dorothee and enthusiastic Pete have finished a degree in Computer Science in the meantime. So we have to change our DataFrame accodingly:

df.replace("programmer", 
           "computer scientist", 
           inplace=True)
df
We received the following result:
last job language
Mike Meyer data analyst Python
Dorothee Maier computer scientist Perl
Tom Meyer computer scientist Java
Bill Mayer data scientist Pithon
Pete Meyr computer scientist Pythen
Kate Mair psychiatrist Brainfuck

to_replace can be a list consisting of str, regex or numeric objects:

to_replace can be a list consisting of str, regex or numeric objects. If both the values of to_replace and value are both lists, they must be the same length.

First, if to_replace and value are both lists, they must be the same length. Second, if regex=True then all of the strings in both lists will be interpreted as regexs otherwise they will match directly. We slightly changed our DataFrame for the next example. The first names build a column now!

df = pd.DataFrame(list(zip(first, last, job, language)), 
                  columns =['first', 'last', 'job', 'language']) 
df
The previous code returned the following:
first last job language
0 Mike Meyer data analyst Python
1 Dorothee Maier programmer Perl
2 Tom Meyer computer scientist Java
3 Bill Mayer data scientist Pithon
4 Pete Meyr programmer Pythen
5 Kate Mair psychiatrist Brainfuck

'Mike' and 'Tom' are abbreviations of the real first names, so we will change them now. We are also glad to announce that Dorothee finally migrated from Perl to Python:

df.replace(to_replace=['Mike', 'Tom', 'Perl'],
           value= ['Michael', 'Thomas', 'Python'], 
           inplace=True)
df
The above Python code returned the following:
first last job language
0 Michael Meyer data analyst Python
1 Dorothee Maier programmer Python
2 Thomas Meyer computer scientist Java
3 Bill Mayer data scientist Pithon
4 Pete Meyr programmer Pythen
5 Kate Mair psychiatrist Brainfuck

Using Regular Expressions

Now it is time to do something about the surnames in our DataFrame. All these names sound the same - at least for German speakers. Let us assume that we just found out that they should all be spelled the same, i.e. Mayer. Now the time has come for regular expression. The parameter to_replace can also be used with regular expressions. We will use a regular expression to unify the surnames. We will also fix the misspellings of Python:

df.replace(to_replace=[r'M[ea][iy]e?r', r'P[iy]th[eo]n'],
           value=['Mayer', 'Python'],
           regex=True, 
           inplace=True)
df
The previous code returned the following output:
first last job language
0 Michael Mayer data analyst Python
1 Dorothee Mayer programmer Python
2 Thomas Mayer computer scientist Java
3 Bill Mayer data scientist Python
4 Pete Mayer programmer Python
5 Kate Mayer psychiatrist Brainfuck

to_replace can be a dict

Dictionaries can be used to specify different replacement values for different existing values. For example, {'a': 'b', 'y': 'z'} replaces the value 'a' with 'b' and 'y' with 'z'. To use a dict in this way the value parameter should be None.

Ìf replace is applied on a DataFrame, a dict can specify that different values should be replaced in different columns. For example, {'a': 1, 'b': 'z'} looks for the value 1 in column 'a' and the value 'z' in column 'b' and replaces these values with whatever is specified in value. The value parameter should not be None in this case. You can treat this as a special case of passing two lists except that you are specifying the column to search in.

If we use nested dictionaries like {'A': {'foo': 'bar'}}, they are interpreted like this: look in column 'A' for the value 'foo' and replace it with 'bar'. The value parameter has to be None in this case.

df = pd.DataFrame({'A': [0, 1, 2, 3, 4],
                   'B': ['foo', 'bar', 'bloo', 'blee', 'bloo'],
                   'C': ['green', 'red', 'blue', 'yellow', 'green']})

df.replace(to_replace={"A": {0: 42, 3: 33}, 'B': {'bloo': 'vloo'}},
           inplace=True)
df
The above Python code returned the following output:
A B C
0 42 foo green
1 1 bar red
2 2 vloo blue
3 33 blee yellow
4 4 vloo green

The method parameter of replace:

When the parameter value is None and the parameter to_replace is a scalar, list or tuple, the method replace will use the parameter method to decide which replacement to perform. The possible values for method are pad, ffill, bfill, None. The default value is pad.

First of all, you have to know that pad and ffill are equivalent.

We will explain the way of working with the following Python example code. We will replace the values NN and LN of the following DataFrame in different ways by using the parameter method.

import pandas as pd

df = pd.DataFrame({
    'name':['Ben', 'Kate', 'Agnes', 'Ashleigh', 'Tom'],
    'job':['programmer', 'NN', 'NN', 'engineer', 'teacher'],
    'language':['Java', 'Python', 'LN', 'LN', 'C']})

Replacing values in Pandas with replace and the parameter method

We will start with changing the 'NN' values by using ffill in the parameter method:


Replacing values in Pandas with replace and the parameter method
The picture above shows how the following call to replace. In words: Every occurrence of NN in the job column will have to be replaced. The value is defined by the parameter method. ffill means 'forward fill', i.e. we will take the preceding value to the first NN as the fill value. This is why we will replace the values by 'programmer':

# method is backfill
df.replace(to_replace='NN', 
           value=None, 
           method='ffill')
The above Python code returned the following output:
name job language
0 Ben programmer Java
1 Kate programmer Python
2 Agnes programmer LN
3 Ashleigh engineer LN
4 Tom teacher C

Instead of using a single value for to_replace we can also use a list or tuple. We will replace in the following all occurrences of NN and LN accordingly, as we can see in the following picture:


Replacing values in Pandas with replace and the parameter method

df.replace(to_replace=['NN', 'LN'],
           value=None,
           method='ffill')
We received the following output:
name job language
0 Ben programmer Java
1 Kate programmer Python
2 Agnes programmer Python
3 Ashleigh engineer Python
4 Tom teacher C

We will show now what happens, if we use bfill (backward fill). Now the occurrences of 'LN' become 'C' instead of Python. We also turn the 'NN's into engineers instead of programmers.


Replacing values in Pandas with replace and the parameter method

df.replace(['NN', 'LN'], value=None, method='bfill')
The above Python code returned the following output:
name job language
0 Ben programmer Java
1 Kate engineer Python
2 Agnes engineer C
3 Ashleigh engineer C
4 Tom teacher C
df.replace('NN', 
           value=None,
           inplace=True,
           method='bfill')
df.replace('LN', 
           value=None,
           inplace=True,
           method='ffill')
df
The above Python code returned the following:
name job language
0 Ben programmer Java
1 Kate engineer Python
2 Agnes engineer Python
3 Ashleigh engineer Python
4 Tom teacher C

Other Examples

We will present some more examples, which are taken from the help file of loc:

df = pd.DataFrame([[1, 2], [4, 5], [7, 8]],
                  index=['cobra', 'viper', 'sidewinder'],
                  columns=['max_speed', 'shield'])

df
The Python code above returned the following:
max_speed shield
cobra 1 2
viper 4 5
sidewinder 7 8

Single label. Note this returns the row as a Series:

 df.loc['viper']
The above code returned the following result:
max_speed    4
shield       5
Name: viper, dtype: int64

List of labels. Note using [[]] returns a DataFrame:

df.loc[['viper', 'sidewinder']]
The above Python code returned the following output:
max_speed shield
viper 4 5
sidewinder 7 8

Single label for row and column

df.loc['cobra', 'shield']
The previous Python code returned the following output:
2

Slice with labels for row and single label for column. As mentioned above, note that both the start and stop of the slice are included.

df.loc['cobra':'viper', 'max_speed']
We received the following output:
cobra    1
viper    4
Name: max_speed, dtype: int64

Boolean list with the same length as the row axis

 df.loc[[False, False, True]]
The previous code returned the following:
max_speed shield
sidewinder 7 8

Conditional that returns a boolean Series

df.loc[df['shield'] > 6]
After having executed the Python code above we received the following:
max_speed shield
sidewinder 7 8

Conditional that returns a boolean Series with column labels specified

df.loc[df['shield'] > 6, ['max_speed']]
The previous code returned the following result:
max_speed
sidewinder 7

Callable that returns a boolean Series

df.loc[lambda df: df['shield'] == 8]
This gets us the following:
max_speed shield
sidewinder 7 8

Set value for all items matching the list of labels

df.loc[['viper', 'sidewinder'], ['shield']] = 50
df
The above code returned the following:
max_speed shield
cobra 1 2
viper 4 50
sidewinder 7 50

Set value for an entire row

df.loc['cobra'] = 10
df
We received the following result:
max_speed shield
cobra 10 10
viper 4 50
sidewinder 7 50

Set value for an entire column

df.loc[:, 'max_speed'] = 30
df
The above Python code returned the following:
max_speed shield
cobra 30 10
viper 30 50
sidewinder 30 50

Set value for rows matching callable condition

df.loc[df['shield'] > 35] = 0
df
After having executed the Python code above we received the following result:
max_speed shield
cobra 30 10
viper 0 0
sidewinder 0 0