30. Pandas: groupby

By Bernd Klein. Last modified: 03 Feb 2025.

On this page ➤

This chapter of our Pandas tutorial deals with an extremely important functionality, i.e. groupby. It is not really complicated, but it is not obvious at first glance and is sometimes found to be difficult. Completely wrong, as we shall see. It is also very important to become familiar with 'groupby' because it can be used to solve important problems that would not be possible without it. The Pandas groupby operation involves some combination of splitting the object, applying a function, and combining the results. We can split a DataFrame object into groups based on various criteria and row and column-wise, i.e. using axis.

splitted banana

'Applying' means

to filter the data,
transform the data or
aggregate the data.

groupby can be applied to Pandas Series objects and DataFrame objects! We will learn to understand how it works with many small practical examples in this tutorial.

goupby with Series

We create with the following Python program a Series object with an index of size nvalues. The index will not be unique, because the strings for the index are taken from the list fruits, which has less elements than nvalues:

import pandas as pd
import numpy as np
import random

nvalues = 30
# we create random values, which will be used as the Series values:
values = np.random.randint(1, 20, (nvalues,))
fruits = ["bananas", "oranges", "apples", "clementines", "cherries", "pears"]
fruits_index = np.random.choice(fruits, (nvalues,))

s = pd.Series(values, index=fruits_index)
print(s[:10])

OUTPUT:

pears           3
pears           1
oranges        14
clementines    15
bananas        18
oranges         3
bananas         9
cherries        3
cherries        4
apples         12
dtype: int64

grouped = s.groupby(s.index)
grouped

OUTPUT:

<pandas.core.groupby.generic.SeriesGroupBy object at 0x7f85100f1b90>

We can see that we get a SeriesGroupBy object, if we apply groupby on the index of our series object s. The result of this operation grouped is iterable. In every step we get a tuple object returned, which consists of an index label and a series object. The series object is s reduced to this label.

grouped = s.groupby(s.index)

for fruit, s_obj in grouped:
    print(f"===== {fruit} =====")
    print(s_obj)

OUTPUT:

===== apples =====
apples    12
apples     9
apples    11
dtype: int64
===== bananas =====
bananas    18
bananas     9
bananas    12
bananas    15
bananas     5
dtype: int64
===== cherries =====
cherries     3
cherries     4
cherries     3
cherries    13
cherries    18
cherries     2
cherries    11
dtype: int64
===== clementines =====
clementines    15
clementines     7
clementines    18
clementines     9
clementines    18
clementines     4
dtype: int64
===== oranges =====
oranges    14
oranges     3
oranges    16
oranges     9
oranges     3
dtype: int64
===== pears =====
pears     3
pears     1
pears     4
pears    19
dtype: int64

We could have got the same result - except for the order - without using `` groupby '' with the following Python code.

for fruit in set(s.index):
    print(f"===== {fruit} =====")
    print(s[fruit])

OUTPUT:

===== clementines =====
clementines    15
clementines     7
clementines    18
clementines     9
clementines    18
clementines     4
dtype: int64
===== bananas =====
bananas    18
bananas     9
bananas    12
bananas    15
bananas     5
dtype: int64
===== pears =====
pears     3
pears     1
pears     4
pears    19
dtype: int64
===== oranges =====
oranges    14
oranges     3
oranges    16
oranges     9
oranges     3
dtype: int64
===== cherries =====
cherries     3
cherries     4
cherries     3
cherries    13
cherries    18
cherries     2
cherries    11
dtype: int64
===== apples =====
apples    12
apples     9
apples    11
dtype: int64

Live Python training

Enjoying this page? We offer live Python training courses covering the content of this site.

See our Python training courses

See our Machine Learning with Python training courses

groupby with DataFrames

We will start with a very simple DataFrame. The DataFRame has two columns one containing names Name and the other one Coffee contains integers which are the number of cups of coffee the person drank.

import pandas as pd
beverages = pd.DataFrame({'Name': ['Robert', 'Melinda', 'Brenda',
                                   'Samantha', 'Melinda', 'Robert',
                                   'Melinda', 'Brenda', 'Samantha'],
                          'Coffee': [3, 0, 2, 2, 0, 2, 0, 1, 3],
                          'Tea':    [0, 4, 2, 0, 3, 0, 3, 2, 0]})
    
beverages

	Name	Coffee	Tea
0	Robert	3	0
1	Melinda	0	4
2	Brenda	2	2
3	Samantha	2	0
4	Melinda	0	3
5	Robert	2	0
6	Melinda	0	3
7	Brenda	1	2
8	Samantha	3	0

It's simple, and we've already seen in the previous chapters of our tutorial how to calculate the total number of coffee cups. The task is to sum a column of a DatFrame, i.e. the 'Coffee' column:

beverages['Coffee'].sum()

OUTPUT:

Let's compute now the total number of coffees and teas:

beverages[['Coffee', 'Tea']].sum()

OUTPUT:

Coffee    13
Tea       14
dtype: int64

'groupby' has not been necessary for the previous tasks. Let's have a look at our DataFrame again. We can see that some of the names appear multiple times. So it will be very interesting to see how many cups of coffee and tea each person drank in total. That means we are applying 'groupby' to the 'Name' column. Thereby we split the DatFrame. Then we apply 'sum' to the results of 'groupby':

res = beverages.groupby(['Name']).sum()
print(res)

OUTPUT:

          Coffee  Tea
Name                 
Brenda         3    4
Melinda        0   10
Robert         5    0
Samantha       5    0

We can see that the names are now the index of the resulting DataFrame:

print(res.index)

OUTPUT:

Index(['Brenda', 'Melinda', 'Robert', 'Samantha'], dtype='object', name='Name')

There is only one column left, i.e. the Coffee column:

print(res.columns)

OUTPUT:

Index(['Coffee', 'Tea'], dtype='object')

We can also calculate the average number of coffee and tea cups the persons had:

beverages.groupby(['Name']).mean()

	Coffee	Tea
Name
Brenda	1.5	2.000000
Melinda	0.0	3.333333
Robert	2.5	0.000000
Samantha	2.5	0.000000

Another Example

The following Python code is used to create the data, we will use in our next groupby example. It is not necessary to understand the following Python code for the content following afterwards. The module faker has to be installed. In cae of an Anaconda installation this can be done by executing one of the following commands in a shell:

conda install -c conda-forge faker
conda install -c conda-forge/label/gcc7 faker
conda install -c conda-forge/label/cf201901 faker
conda install -c conda-forge/label/cf202003 faker

from faker import Faker
import numpy as np
from itertools import chain

fake = Faker('de_DE')

number_of_names = 10
names = []
for _ in range(number_of_names):
    names.append(fake.first_name())


data = {}
workweek = ("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
weekend = ("Saturday", "Sunday")

for day in chain(workweek, weekend):
    data[day] = np.random.randint(0, 10, (number_of_names,))
    
data_df = pd.DataFrame(data, index=names)
data_df

	Monday	Tuesday	Wednesday	Thursday	Friday	Saturday	Sunday
Kristina	8	9	7	4	0	7	6
Zofia	2	9	7	8	6	7	9
Sahin	1	7	4	2	6	1	0
Giuseppe	2	5	8	3	6	1	6
Ulrich	7	6	3	0	4	8	2
Hans-Michael	2	0	8	3	4	7	9
Karl-Werner	7	0	5	0	4	6	0
Ludmila	5	3	6	0	8	0	3
Peter	0	4	5	2	7	1	5
Margret	2	2	3	7	4	5	0

print(names)

OUTPUT:

['Kristina', 'Zofia', 'Sahin', 'Giuseppe', 'Ulrich', 'Hans-Michael', 'Karl-Werner', 'Ludmila', 'Peter', 'Margret']

names = ('Ortwin', 'Mara', 'Siegrun', 'Sylvester', 'Metin', 'Adeline', 'Utz', 'Susan', 'Gisbert', 'Senol')
data = {'Monday': np.array([0, 9, 2, 3, 7, 3, 9, 2, 4, 9]),
        'Tuesday': np.array([2, 6, 3, 3, 5, 5, 7, 7, 1, 0]),
        'Wednesday': np.array([6, 1, 1, 9, 4, 0, 8, 6, 8, 8]),
        'Thursday': np.array([1, 8, 6, 9, 9, 4, 1, 7, 3, 2]),
        'Friday': np.array([3, 5, 6, 6, 5, 2, 2, 4, 6, 5]),
        'Saturday': np.array([8, 4, 8, 2, 3, 9, 3, 4, 9, 7]),
        'Sunday': np.array([0, 8, 7, 8, 9, 7, 2, 0, 5, 2])}

data_df = pd.DataFrame(data, index=names)
data_df

	Monday	Tuesday	Wednesday	Thursday	Friday	Saturday	Sunday
Ortwin	0	2	6	1	3	8	0
Mara	9	6	1	8	5	4	8
Siegrun	2	3	1	6	6	8	7
Sylvester	3	3	9	9	6	2	8
Metin	7	5	4	9	5	3	9
Adeline	3	5	0	4	2	9	7
Utz	9	7	8	1	2	3	2
Susan	2	7	6	7	4	4	0
Gisbert	4	1	8	3	6	9	5
Senol	9	0	8	2	5	7	2

We will demonstrate with this DataFrame how to combine columns by a function.

def is_weekend(day):
    if day in {'Saturday', 'Sunday'}:
        return "Weekend"
    else:
        return "Workday"

for res_func, df in data_df.T.groupby(by=is_weekend):
    print(df)

OUTPUT:

          Ortwin  Mara  Siegrun  Sylvester  Metin  Adeline  Utz  Susan  \
Saturday       8     4        8          2      3        9    3      4   
Sunday         0     8        7          8      9        7    2      0   

          Gisbert  Senol  
Saturday        9      7  
Sunday          5      2  
           Ortwin  Mara  Siegrun  Sylvester  Metin  Adeline  Utz  Susan  \
Monday          0     9        2          3      7        3    9      2   
Tuesday         2     6        3          3      5        5    7      7   
Wednesday       6     1        1          9      4        0    8      6   
Thursday        1     8        6          9      9        4    1      7   
Friday          3     5        6          6      5        2    2      4   

           Gisbert  Senol  
Monday           4      9  
Tuesday          1      0  
Wednesday        8      8  
Thursday         3      2  
Friday           6      5

data_df.T.groupby(by=is_weekend).sum()

	Ortwin	Mara	Siegrun	Sylvester	Metin	Adeline	Utz	Susan	Gisbert	Senol
Weekend	8	12	15	10	12	16	5	4	14	9
Workday	12	29	18	30	30	14	27	26	22	24

Exercises

Exercise 1

Calculate the average prices of the products of the following DataFrame:

import pandas as pd

d = {"products": ["Oppilume", "Dreaker", "Lotadilo", 
                  "Crosteron", "Wazzasoft", "Oppilume", 
                  "Dreaker", "Lotadilo", "Wazzasoft"],
     "colours": ["blue", "blue", "blue", 
                 "green", "blue", "green", 
                 "green", "green", "red"],
     "customer_price": [2345.89, 2390.50, 1820.00, 
                        3100.00, 1784.50, 2545.89,
                        2590.50, 2220.00, 2084.50],
     "non_customer_price": [2445.89, 2495.50, 1980.00, 
                            3400.00, 1921.00, 2645.89, 
                            2655.50, 2140.00, 2190.00]}

product_prices = pd.DataFrame(d)
product_prices

	products	colours	customer_price	non_customer_price
0	Oppilume	blue	2345.89	2445.89
1	Dreaker	blue	2390.50	2495.50
2	Lotadilo	blue	1820.00	1980.00
3	Crosteron	green	3100.00	3400.00
4	Wazzasoft	blue	1784.50	1921.00
5	Oppilume	green	2545.89	2645.89
6	Dreaker	green	2590.50	2655.50
7	Lotadilo	green	2220.00	2140.00
8	Wazzasoft	red	2084.50	2190.00

Exercise 2

Calculate the sum of the price according to the colours.

Exercise 3

Read in the project_times.txt file from the data1 directory. This rows of this file contain comma separated the date, the name of the programmer, the name of the project, the time the programmer spent on the project.

Calculate the time spend on all the projects per day

Exercise 4

Create a DateFrame containing the total times spent on a project per day by all the programmers

Exercise 5

Calculate the total times spent on the projects over the whole month.

Exercise 6

Calculate the monthly times of each programmer regardless of the projects

Exercise 7

Rearrange the DataFrame with a MultiIndex consisting of the date and the project names, the columns should be the programmer names and the data of the columns the time of the programmers spent on the projects.

                   time
programmer         Antonie  Elise  Fatima  Hella  Mariola
date     project
2020-01-01 BIRDY   NaN      NaN    NaN     1.50   1.75
           NSTAT   NaN      NaN    0.25    NaN    1.25
           XTOR    NaN      NaN    NaN     1.00   3.50
2020-01-02 BIRDY   NaN      NaN    NaN     1.75   2.00
           NSTAT   0.5      NaN    NaN     NaN    1.75

Replace the NaN values by 0.

Exercise 8:

The folder data contains a file donation.txt with the following data:

firstname,surname,city,job,income,donations
Janett,Schwital,Karlsruhe,Politician,244400,2512
Daniele,Segebahn,Freiburg,Student,16800,336
Kirstin,Klapp,Hamburg,Engineer,116900,1479
Oswald,Segebahn,Köln,Musician,57700,1142

group the data by the job of the persons.

Live Python training

Enjoying this page? We offer live Python training courses covering the content of this site.

Upcoming online Courses

Python Intensive Course

08 Sep to 12 Sep 2025
20 Oct to 24 Oct 2025
01 Dec to 05 Dec 2025

Data Analysis with Python

10 Sep to 12 Sep 2025
22 Oct to 24 Oct 2025
03 Dec to 05 Dec 2025

Efficient Data Analysis with Pandas

08 Sep to 09 Sep 2025
20 Oct to 21 Oct 2025
01 Dec to 02 Dec 2025

Python Text Processing Course

10 Sep to 12 Sep 2025
22 Oct to 24 Oct 2025

See our Python training courses

See our Machine Learning with Python training courses

Solutions

Solution to Exercise 1

average_prices = product_prices.groupby(["products", "colours"]).mean()

print(average_prices)

OUTPUT:

                   customer_price  non_customer_price
products  colours                                    
Crosteron green           3100.00             3400.00
Dreaker   blue            2390.50             2495.50
          green           2590.50             2655.50
Lotadilo  blue            1820.00             1980.00
          green           2220.00             2140.00
Oppilume  blue            2345.89             2445.89
          green           2545.89             2645.89
Wazzasoft blue            1784.50             1921.00
          red             2084.50             2190.00

Solution to Exercise 2

x = product_prices[['colours', 'customer_price', 'non_customer_price']].groupby('colours').sum()
x

	customer_price	non_customer_price
colours
blue	8340.89	8842.39
green	10456.39	10841.39
red	2084.50	2190.00

# better
sum_of_prices = product_prices.drop(columns=["products"]).groupby("colours").sum()

print(sum_of_prices)

OUTPUT:

         customer_price  non_customer_price
colours                                    
blue            8340.89             8842.39
green          10456.39            10841.39
red             2084.50             2190.00

# or
product_prices[['colours', 'customer_price', 'non_customer_price']].groupby("colours").sum()

	customer_price	non_customer_price
colours
blue	8340.89	8842.39
green	10456.39	10841.39
red	2084.50	2190.00

Solution to Exercise 3

import pandas as pd

df = pd.read_csv("../data1/project_times.txt", index_col=0)
df

	programmer	project	time
date
2020-01-01	Hella	XTOR	1.00
2020-01-01	Hella	BIRDY	1.50
2020-01-01	Fatima	NSTAT	0.25
2020-01-01	Mariola	NSTAT	0.50
2020-01-01	Mariola	BIRDY	1.75
...	...	...	...
2030-01-30	Antonie	XTOR	0.50
2030-01-31	Hella	BIRDY	1.25
2030-01-31	Hella	BIRDY	1.75
2030-01-31	Mariola	BIRDY	1.00
2030-01-31	Hella	BIRDY	1.00

17492 rows × 3 columns

times_per_day = df['time'].groupby(df.index).sum()
print(times_per_day[:10])

OUTPUT:

date
2020-01-01     9.25
2020-01-02     6.00
2020-01-03     2.50
2020-01-06     5.75
2020-01-07    15.00
2020-01-08    13.25
2020-01-09    10.25
2020-01-10    17.00
2020-01-13     4.75
2020-01-14    10.00
Name: time, dtype: float64

Solution to Exercise 4

times_per_day_project = df[['project', 'time']].groupby([df.index, 'project']).sum()
print(times_per_day_project[:10])

OUTPUT:

                    time
date       project      
2020-01-01 BIRDY    3.25
           NSTAT    1.50
           XTOR     4.50
2020-01-02 BIRDY    3.75
           NSTAT    2.25
2020-01-03 BIRDY    1.00
           NSTAT    0.25
           XTOR     1.25
2020-01-06 BIRDY    2.75
           NSTAT    0.75

Solution to Exercise 5

df[['project', 'time']].groupby(['project']).sum()

	time
project
BIRDY	9605.75
NSTAT	8707.75
XTOR	6427.50

Solution to Exercise 6

df[['programmer', 'time']].groupby(['programmer']).sum()

	time
programmer
Antonie	1511.25
Elise	80.00
Fatima	593.00
Hella	10642.00
Mariola	11914.75

Solution to Exercise 7

x = df.groupby([df.index, 'project', 'programmer']).sum()

x = x.unstack()
x

		time
	programmer	Antonie	Elise	Fatima	Hella	Mariola
date	project
2020-01-01	BIRDY	NaN	NaN	NaN	1.50	1.75
	NSTAT	NaN	NaN	0.25	NaN	1.25
	XTOR	NaN	NaN	NaN	1.00	3.50
2020-01-02	BIRDY	NaN	NaN	NaN	1.75	2.00
2020-01-02	NSTAT	0.5	NaN	NaN	NaN	1.75
...	...	...	...	...	...	...
2030-01-29	XTOR	NaN	NaN	NaN	1.00	5.50
2030-01-30	BIRDY	NaN	NaN	NaN	0.75	4.75
	NSTAT	NaN	NaN	NaN	3.75	NaN
	XTOR	0.5	NaN	NaN	0.75	NaN
2030-01-31	BIRDY	NaN	NaN	NaN	4.00	1.00

7037 rows × 5 columns

x = x.fillna(0)
print(x[:10])

OUTPUT:

                      time                           
programmer         Antonie Elise Fatima Hella Mariola
date       project                                   
2020-01-01 BIRDY      0.00   0.0   0.00  1.50    1.75
           NSTAT      0.00   0.0   0.25  0.00    1.25
           XTOR       0.00   0.0   0.00  1.00    3.50
2020-01-02 BIRDY      0.00   0.0   0.00  1.75    2.00
           NSTAT      0.50   0.0   0.00  0.00    1.75
2020-01-03 BIRDY      0.00   0.0   1.00  0.00    0.00
           NSTAT      0.25   0.0   0.00  0.00    0.00
           XTOR       0.00   0.0   0.00  0.50    0.75
2020-01-06 BIRDY      0.00   0.0   0.00  2.50    0.25
           NSTAT      0.00   0.0   0.00  0.00    0.75

Solution to Exercise 8:

import pandas as pd

data = pd.read_csv('../data/donations.txt')
data

	firstname	surname	city	job	income	donations
0	Janett	Schwital	Karlsruhe	Politician	244400	2512
1	Daniele	Segebahn	Freiburg	Student	16800	336
2	Kirstin	Klapp	Hamburg	Engineer	116900	1479
3	Oswald	Segebahn	Köln	Musician	57700	1142
4	Heinz-Joachim	Wagner	Stuttgart	Engineer	109300	1592
...	...	...	...	...	...	...
95	Georgine	Köster	Stuttgart	Manager	364300	1487
96	Dagmar	Käster	Stuttgart	Student	12800	256
97	Hubert	Mosemann	Köln	Engineer	119300	1308
98	Susanne	Heidrich	Karlsruhe	Politician	295600	3364
99	Philomena	Grein Groth	Freiburg	Engineer	100700	1288

100 rows × 6 columns

data_sum = data[['job', 'income', 'donations']].groupby(['job']).sum()
data_sum.sort_values(by='donations')

	income	donations
job
Student	372900	7458
Musician	1448700	24376
Engineer	2067200	25564
Politician	4118300	30758
Manager	12862600	87475

data_sum['relative'] = data_sum.donations * 100 / data_sum.income

data_sum.sort_values(by='relative')

	income	donations	relative
job
Manager	12862600	87475	0.680072
Politician	4118300	30758	0.746862
Engineer	2067200	25564	1.236649
Musician	1448700	24376	1.682612
Student	372900	7458	2.000000

Live Python training

Enjoying this page? We offer live Python training courses covering the content of this site.

Upcoming online Courses

Python Intensive Course

08 Sep to 12 Sep 2025
20 Oct to 24 Oct 2025
01 Dec to 05 Dec 2025

Data Analysis with Python

10 Sep to 12 Sep 2025
22 Oct to 24 Oct 2025
03 Dec to 05 Dec 2025

Efficient Data Analysis with Pandas

08 Sep to 09 Sep 2025
20 Oct to 21 Oct 2025
01 Dec to 02 Dec 2025

Python Text Processing Course

10 Sep to 12 Sep 2025
22 Oct to 24 Oct 2025

See our Python training courses

See our Machine Learning with Python training courses