29. Pandas: groupby
By Bernd Klein. Last modified: 21 Feb 2024.
This chapter of our Pandas tutorial deals with an extremely important functionality, i.e. groupby
. It is not really complicated, but it is not obvious at first glance and is sometimes found to be difficult. Completely wrong, as we shall see. It is also very important to become familiar with 'groupby' because it can be used to solve important problems that would not be possible without it. The Pandas groupby
operation involves some combination of splitting the object, applying a function, and combining the results. We can split a DataFrame object into groups based on various criteria and row and column-wise, i.e. using axis
.
'Applying' means
- to filter the data,
- transform the data or
- aggregate the data.
groupby
can be applied to Pandas Series objects and DataFrame objects! We will learn to understand how it works with many small practical examples in this tutorial.
goupby with Series
We create with the following Python program a Series object with an index of size nvalues
. The index will not be unique, because the strings for the index are taken from the list fruits
, which has less elements than nvalues
:
import pandas as pd
import numpy as np
import random
nvalues = 30
# we create random values, which will be used as the Series values:
values = np.random.randint(1, 20, (nvalues,))
fruits = ["bananas", "oranges", "apples", "clementines", "cherries", "pears"]
fruits_index = np.random.choice(fruits, (nvalues,))
s = pd.Series(values, index=fruits_index)
print(s[:10])
OUTPUT:
pears 3 pears 1 oranges 14 clementines 15 bananas 18 oranges 3 bananas 9 cherries 3 cherries 4 apples 12 dtype: int64
grouped = s.groupby(s.index)
grouped
OUTPUT:
<pandas.core.groupby.generic.SeriesGroupBy object at 0x7f85100f1b90>
We can see that we get a SeriesGroupBy
object, if we apply groupby
on the index of our series object s
. The result of this operation grouped
is iterable. In every step we get a tuple object returned, which consists of an index label and a series object. The series object is s
reduced to this label.
grouped = s.groupby(s.index)
for fruit, s_obj in grouped:
print(f"===== {fruit} =====")
print(s_obj)
OUTPUT:
===== apples ===== apples 12 apples 9 apples 11 dtype: int64 ===== bananas ===== bananas 18 bananas 9 bananas 12 bananas 15 bananas 5 dtype: int64 ===== cherries ===== cherries 3 cherries 4 cherries 3 cherries 13 cherries 18 cherries 2 cherries 11 dtype: int64 ===== clementines ===== clementines 15 clementines 7 clementines 18 clementines 9 clementines 18 clementines 4 dtype: int64 ===== oranges ===== oranges 14 oranges 3 oranges 16 oranges 9 oranges 3 dtype: int64 ===== pears ===== pears 3 pears 1 pears 4 pears 19 dtype: int64
We could have got the same result - except for the order - without using `` groupby '' with the following Python code.
for fruit in set(s.index):
print(f"===== {fruit} =====")
print(s[fruit])
OUTPUT:
===== clementines ===== clementines 15 clementines 7 clementines 18 clementines 9 clementines 18 clementines 4 dtype: int64 ===== bananas ===== bananas 18 bananas 9 bananas 12 bananas 15 bananas 5 dtype: int64 ===== pears ===== pears 3 pears 1 pears 4 pears 19 dtype: int64 ===== oranges ===== oranges 14 oranges 3 oranges 16 oranges 9 oranges 3 dtype: int64 ===== cherries ===== cherries 3 cherries 4 cherries 3 cherries 13 cherries 18 cherries 2 cherries 11 dtype: int64 ===== apples ===== apples 12 apples 9 apples 11 dtype: int64
Live Python training
Enjoying this page? We offer live Python training courses covering the content of this site.
groupby with DataFrames
We will start with a very simple DataFrame. The DataFRame has two columns one containing names Name
and the other one Coffee
contains integers which are the number of cups of coffee the person drank.
import pandas as pd
beverages = pd.DataFrame({'Name': ['Robert', 'Melinda', 'Brenda',
'Samantha', 'Melinda', 'Robert',
'Melinda', 'Brenda', 'Samantha'],
'Coffee': [3, 0, 2, 2, 0, 2, 0, 1, 3],
'Tea': [0, 4, 2, 0, 3, 0, 3, 2, 0]})
beverages
Name | Coffee | Tea | |
---|---|---|---|
0 | Robert | 3 | 0 |
1 | Melinda | 0 | 4 |
2 | Brenda | 2 | 2 |
3 | Samantha | 2 | 0 |
4 | Melinda | 0 | 3 |
5 | Robert | 2 | 0 |
6 | Melinda | 0 | 3 |
7 | Brenda | 1 | 2 |
8 | Samantha | 3 | 0 |
It's simple, and we've already seen in the previous chapters of our tutorial how to calculate the total number of coffee cups. The task is to sum a column of a DatFrame, i.e. the 'Coffee' column:
beverages['Coffee'].sum()
OUTPUT:
13
Let's compute now the total number of coffees and teas:
beverages[['Coffee', 'Tea']].sum()
OUTPUT:
Coffee 13 Tea 14 dtype: int64
'groupby' has not been necessary for the previous tasks. Let's have a look at our DataFrame again. We can see that some of the names appear multiple times. So it will be very interesting to see how many cups of coffee and tea each person drank in total. That means we are applying 'groupby' to the 'Name' column. Thereby we split the DatFrame. Then we apply 'sum' to the results of 'groupby':
res = beverages.groupby(['Name']).sum()
print(res)
OUTPUT:
Coffee Tea Name Brenda 3 4 Melinda 0 10 Robert 5 0 Samantha 5 0
We can see that the names are now the index of the resulting DataFrame:
print(res.index)
OUTPUT:
Index(['Brenda', 'Melinda', 'Robert', 'Samantha'], dtype='object', name='Name')
There is only one column left, i.e. the Coffee
column:
print(res.columns)
OUTPUT:
Index(['Coffee', 'Tea'], dtype='object')
We can also calculate the average number of coffee and tea cups the persons had:
beverages.groupby(['Name']).mean()
Coffee | Tea | |
---|---|---|
Name | ||
Brenda | 1.5 | 2.000000 |
Melinda | 0.0 | 3.333333 |
Robert | 2.5 | 0.000000 |
Samantha | 2.5 | 0.000000 |
Another Example
The following Python code is used to create the data, we will use in our next groupby
example. It is not necessary to understand the following Python code for the content following afterwards.
The module faker
has to be installed. In cae of an Anaconda installation this can be done by executing one of the following commands in a shell:
conda install -c conda-forge faker
conda install -c conda-forge/label/gcc7 faker
conda install -c conda-forge/label/cf201901 faker
conda install -c conda-forge/label/cf202003 faker
from faker import Faker
import numpy as np
from itertools import chain
fake = Faker('de_DE')
number_of_names = 10
names = []
for _ in range(number_of_names):
names.append(fake.first_name())
data = {}
workweek = ("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
weekend = ("Saturday", "Sunday")
for day in chain(workweek, weekend):
data[day] = np.random.randint(0, 10, (number_of_names,))
data_df = pd.DataFrame(data, index=names)
data_df
Monday | Tuesday | Wednesday | Thursday | Friday | Saturday | Sunday | |
---|---|---|---|---|---|---|---|
Kristina | 8 | 9 | 7 | 4 | 0 | 7 | 6 |
Zofia | 2 | 9 | 7 | 8 | 6 | 7 | 9 |
Sahin | 1 | 7 | 4 | 2 | 6 | 1 | 0 |
Giuseppe | 2 | 5 | 8 | 3 | 6 | 1 | 6 |
Ulrich | 7 | 6 | 3 | 0 | 4 | 8 | 2 |
Hans-Michael | 2 | 0 | 8 | 3 | 4 | 7 | 9 |
Karl-Werner | 7 | 0 | 5 | 0 | 4 | 6 | 0 |
Ludmila | 5 | 3 | 6 | 0 | 8 | 0 | 3 |
Peter | 0 | 4 | 5 | 2 | 7 | 1 | 5 |
Margret | 2 | 2 | 3 | 7 | 4 | 5 | 0 |
print(names)
OUTPUT:
['Kristina', 'Zofia', 'Sahin', 'Giuseppe', 'Ulrich', 'Hans-Michael', 'Karl-Werner', 'Ludmila', 'Peter', 'Margret']
names = ('Ortwin', 'Mara', 'Siegrun', 'Sylvester', 'Metin', 'Adeline', 'Utz', 'Susan', 'Gisbert', 'Senol')
data = {'Monday': np.array([0, 9, 2, 3, 7, 3, 9, 2, 4, 9]),
'Tuesday': np.array([2, 6, 3, 3, 5, 5, 7, 7, 1, 0]),
'Wednesday': np.array([6, 1, 1, 9, 4, 0, 8, 6, 8, 8]),
'Thursday': np.array([1, 8, 6, 9, 9, 4, 1, 7, 3, 2]),
'Friday': np.array([3, 5, 6, 6, 5, 2, 2, 4, 6, 5]),
'Saturday': np.array([8, 4, 8, 2, 3, 9, 3, 4, 9, 7]),
'Sunday': np.array([0, 8, 7, 8, 9, 7, 2, 0, 5, 2])}
data_df = pd.DataFrame(data, index=names)
data_df
Monday | Tuesday | Wednesday | Thursday | Friday | Saturday | Sunday | |
---|---|---|---|---|---|---|---|
Ortwin | 0 | 2 | 6 | 1 | 3 | 8 | 0 |
Mara | 9 | 6 | 1 | 8 | 5 | 4 | 8 |
Siegrun | 2 | 3 | 1 | 6 | 6 | 8 | 7 |
Sylvester | 3 | 3 | 9 | 9 | 6 | 2 | 8 |
Metin | 7 | 5 | 4 | 9 | 5 | 3 | 9 |
Adeline | 3 | 5 | 0 | 4 | 2 | 9 | 7 |
Utz | 9 | 7 | 8 | 1 | 2 | 3 | 2 |
Susan | 2 | 7 | 6 | 7 | 4 | 4 | 0 |
Gisbert | 4 | 1 | 8 | 3 | 6 | 9 | 5 |
Senol | 9 | 0 | 8 | 2 | 5 | 7 | 2 |
We will demonstrate with this DataFrame how to combine columns by a function.
def is_weekend(day):
if day in {'Saturday', 'Sunday'}:
return "Weekend"
else:
return "Workday"
for res_func, df in data_df.T.groupby(by=is_weekend):
print(df)
OUTPUT:
Ortwin Mara Siegrun Sylvester Metin Adeline Utz Susan \ Saturday 8 4 8 2 3 9 3 4 Sunday 0 8 7 8 9 7 2 0 Gisbert Senol Saturday 9 7 Sunday 5 2 Ortwin Mara Siegrun Sylvester Metin Adeline Utz Susan \ Monday 0 9 2 3 7 3 9 2 Tuesday 2 6 3 3 5 5 7 7 Wednesday 6 1 1 9 4 0 8 6 Thursday 1 8 6 9 9 4 1 7 Friday 3 5 6 6 5 2 2 4 Gisbert Senol Monday 4 9 Tuesday 1 0 Wednesday 8 8 Thursday 3 2 Friday 6 5
data_df.T.groupby(by=is_weekend).sum()
Ortwin | Mara | Siegrun | Sylvester | Metin | Adeline | Utz | Susan | Gisbert | Senol | |
---|---|---|---|---|---|---|---|---|---|---|
Weekend | 8 | 12 | 15 | 10 | 12 | 16 | 5 | 4 | 14 | 9 |
Workday | 12 | 29 | 18 | 30 | 30 | 14 | 27 | 26 | 22 | 24 |
Exercises
Exercise 1
Calculate the average prices of the products of the following DataFrame:
import pandas as pd
d = {"products": ["Oppilume", "Dreaker", "Lotadilo",
"Crosteron", "Wazzasoft", "Oppilume",
"Dreaker", "Lotadilo", "Wazzasoft"],
"colours": ["blue", "blue", "blue",
"green", "blue", "green",
"green", "green", "red"],
"customer_price": [2345.89, 2390.50, 1820.00,
3100.00, 1784.50, 2545.89,
2590.50, 2220.00, 2084.50],
"non_customer_price": [2445.89, 2495.50, 1980.00,
3400.00, 1921.00, 2645.89,
2655.50, 2140.00, 2190.00]}
product_prices = pd.DataFrame(d)
product_prices
products | colours | customer_price | non_customer_price | |
---|---|---|---|---|
0 | Oppilume | blue | 2345.89 | 2445.89 |
1 | Dreaker | blue | 2390.50 | 2495.50 |
2 | Lotadilo | blue | 1820.00 | 1980.00 |
3 | Crosteron | green | 3100.00 | 3400.00 |
4 | Wazzasoft | blue | 1784.50 | 1921.00 |
5 | Oppilume | green | 2545.89 | 2645.89 |
6 | Dreaker | green | 2590.50 | 2655.50 |
7 | Lotadilo | green | 2220.00 | 2140.00 |
8 | Wazzasoft | red | 2084.50 | 2190.00 |
Exercise 2
Calculate the sum of the price according to the colours.
Exercise 3
Read in the project_times.txt
file from the data1
directory. This rows of this file contain comma separated the date, the name of the programmer, the name of the project, the time the programmer spent on the project.
Calculate the time spend on all the projects per day
Exercise 4
Create a DateFrame containing the total times spent on a project per day by all the programmers
Exercise 5
Calculate the total times spent on the projects over the whole month.
Exercise 6
Calculate the monthly times of each programmer regardless of the projects
Exercise 7
Rearrange the DataFrame with a MultiIndex consisting of the date and the project names, the columns should be the programmer names and the data of the columns the time of the programmers spent on the projects.
time
programmer Antonie Elise Fatima Hella Mariola
date project
2020-01-01 BIRDY NaN NaN NaN 1.50 1.75
NSTAT NaN NaN 0.25 NaN 1.25
XTOR NaN NaN NaN 1.00 3.50
2020-01-02 BIRDY NaN NaN NaN 1.75 2.00
NSTAT 0.5 NaN NaN NaN 1.75
Replace the NaN values by 0.
Exercise 8:
The folder data
contains a file donation.txt
with the following data:
firstname,surname,city,job,income,donations
Janett,Schwital,Karlsruhe,Politician,244400,2512
Daniele,Segebahn,Freiburg,Student,16800,336
Kirstin,Klapp,Hamburg,Engineer,116900,1479
Oswald,Segebahn,Köln,Musician,57700,1142
group the data by the job of the persons.
Live Python training
Enjoying this page? We offer live Python training courses covering the content of this site.
Upcoming online Courses
Solutions
Solution to Exercise 1
average_prices = product_prices.groupby(["products", "colours"]).mean()
print(average_prices)
OUTPUT:
customer_price non_customer_price products colours Crosteron green 3100.00 3400.00 Dreaker blue 2390.50 2495.50 green 2590.50 2655.50 Lotadilo blue 1820.00 1980.00 green 2220.00 2140.00 Oppilume blue 2345.89 2445.89 green 2545.89 2645.89 Wazzasoft blue 1784.50 1921.00 red 2084.50 2190.00
Solution to Exercise 2
x = product_prices[['colours', 'customer_price', 'non_customer_price']].groupby('colours').sum()
x
customer_price | non_customer_price | |
---|---|---|
colours | ||
blue | 8340.89 | 8842.39 |
green | 10456.39 | 10841.39 |
red | 2084.50 | 2190.00 |
# better
sum_of_prices = product_prices.drop(columns=["products"]).groupby("colours").sum()
print(sum_of_prices)
OUTPUT:
customer_price non_customer_price colours blue 8340.89 8842.39 green 10456.39 10841.39 red 2084.50 2190.00
# or
product_prices[['colours', 'customer_price', 'non_customer_price']].groupby("colours").sum()
customer_price | non_customer_price | |
---|---|---|
colours | ||
blue | 8340.89 | 8842.39 |
green | 10456.39 | 10841.39 |
red | 2084.50 | 2190.00 |
Solution to Exercise 3
import pandas as pd
df = pd.read_csv("../data1/project_times.txt", index_col=0)
df
programmer | project | time | |
---|---|---|---|
date | |||
2020-01-01 | Hella | XTOR | 1.00 |
2020-01-01 | Hella | BIRDY | 1.50 |
2020-01-01 | Fatima | NSTAT | 0.25 |
2020-01-01 | Mariola | NSTAT | 0.50 |
2020-01-01 | Mariola | BIRDY | 1.75 |
... | ... | ... | ... |
2030-01-30 | Antonie | XTOR | 0.50 |
2030-01-31 | Hella | BIRDY | 1.25 |
2030-01-31 | Hella | BIRDY | 1.75 |
2030-01-31 | Mariola | BIRDY | 1.00 |
2030-01-31 | Hella | BIRDY | 1.00 |
17492 rows × 3 columns
times_per_day = df['time'].groupby(df.index).sum()
print(times_per_day[:10])
OUTPUT:
date 2020-01-01 9.25 2020-01-02 6.00 2020-01-03 2.50 2020-01-06 5.75 2020-01-07 15.00 2020-01-08 13.25 2020-01-09 10.25 2020-01-10 17.00 2020-01-13 4.75 2020-01-14 10.00 Name: time, dtype: float64
Solution to Exercise 4
times_per_day_project = df[['project', 'time']].groupby([df.index, 'project']).sum()
print(times_per_day_project[:10])
OUTPUT:
time date project 2020-01-01 BIRDY 3.25 NSTAT 1.50 XTOR 4.50 2020-01-02 BIRDY 3.75 NSTAT 2.25 2020-01-03 BIRDY 1.00 NSTAT 0.25 XTOR 1.25 2020-01-06 BIRDY 2.75 NSTAT 0.75
Solution to Exercise 5
df[['project', 'time']].groupby(['project']).sum()
time | |
---|---|
project | |
BIRDY | 9605.75 |
NSTAT | 8707.75 |
XTOR | 6427.50 |
Solution to Exercise 6
df[['programmer', 'time']].groupby(['programmer']).sum()
time | |
---|---|
programmer | |
Antonie | 1511.25 |
Elise | 80.00 |
Fatima | 593.00 |
Hella | 10642.00 |
Mariola | 11914.75 |
Solution to Exercise 7
x = df.groupby([df.index, 'project', 'programmer']).sum()
x = x.unstack()
x
time | ||||||
---|---|---|---|---|---|---|
programmer | Antonie | Elise | Fatima | Hella | Mariola | |
date | project | |||||
2020-01-01 | BIRDY | NaN | NaN | NaN | 1.50 | 1.75 |
NSTAT | NaN | NaN | 0.25 | NaN | 1.25 | |
XTOR | NaN | NaN | NaN | 1.00 | 3.50 | |
2020-01-02 | BIRDY | NaN | NaN | NaN | 1.75 | 2.00 |
NSTAT | 0.5 | NaN | NaN | NaN | 1.75 | |
... | ... | ... | ... | ... | ... | ... |
2030-01-29 | XTOR | NaN | NaN | NaN | 1.00 | 5.50 |
2030-01-30 | BIRDY | NaN | NaN | NaN | 0.75 | 4.75 |
NSTAT | NaN | NaN | NaN | 3.75 | NaN | |
XTOR | 0.5 | NaN | NaN | 0.75 | NaN | |
2030-01-31 | BIRDY | NaN | NaN | NaN | 4.00 | 1.00 |
7037 rows × 5 columns
x = x.fillna(0)
print(x[:10])
OUTPUT:
time programmer Antonie Elise Fatima Hella Mariola date project 2020-01-01 BIRDY 0.00 0.0 0.00 1.50 1.75 NSTAT 0.00 0.0 0.25 0.00 1.25 XTOR 0.00 0.0 0.00 1.00 3.50 2020-01-02 BIRDY 0.00 0.0 0.00 1.75 2.00 NSTAT 0.50 0.0 0.00 0.00 1.75 2020-01-03 BIRDY 0.00 0.0 1.00 0.00 0.00 NSTAT 0.25 0.0 0.00 0.00 0.00 XTOR 0.00 0.0 0.00 0.50 0.75 2020-01-06 BIRDY 0.00 0.0 0.00 2.50 0.25 NSTAT 0.00 0.0 0.00 0.00 0.75
Solution to Exercise 8:
import pandas as pd
data = pd.read_csv('../data/donations.txt')
data
firstname | surname | city | job | income | donations | |
---|---|---|---|---|---|---|
0 | Janett | Schwital | Karlsruhe | Politician | 244400 | 2512 |
1 | Daniele | Segebahn | Freiburg | Student | 16800 | 336 |
2 | Kirstin | Klapp | Hamburg | Engineer | 116900 | 1479 |
3 | Oswald | Segebahn | Köln | Musician | 57700 | 1142 |
4 | Heinz-Joachim | Wagner | Stuttgart | Engineer | 109300 | 1592 |
... | ... | ... | ... | ... | ... | ... |
95 | Georgine | Köster | Stuttgart | Manager | 364300 | 1487 |
96 | Dagmar | Käster | Stuttgart | Student | 12800 | 256 |
97 | Hubert | Mosemann | Köln | Engineer | 119300 | 1308 |
98 | Susanne | Heidrich | Karlsruhe | Politician | 295600 | 3364 |
99 | Philomena | Grein Groth | Freiburg | Engineer | 100700 | 1288 |
100 rows × 6 columns
data_sum = data[['job', 'income', 'donations']].groupby(['job']).sum()
data_sum.sort_values(by='donations')
income | donations | |
---|---|---|
job | ||
Student | 372900 | 7458 |
Musician | 1448700 | 24376 |
Engineer | 2067200 | 25564 |
Politician | 4118300 | 30758 |
Manager | 12862600 | 87475 |
data_sum['relative'] = data_sum.donations * 100 / data_sum.income
data_sum.sort_values(by='relative')
income | donations | relative | |
---|---|---|---|
job | |||
Manager | 12862600 | 87475 | 0.680072 |
Politician | 4118300 | 30758 | 0.746862 |
Engineer | 2067200 | 25564 | 1.236649 |
Musician | 1448700 | 24376 | 1.682612 |
Student | 372900 | 7458 | 2.000000 |
Live Python training
Enjoying this page? We offer live Python training courses covering the content of this site.
Upcoming online Courses