python-course.eu

27. Pandas: groupby

By Bernd Klein. Last modified: 24 Mar 2022.


This chapter of our Pandas tutorial deals with an extremely important functionality: groupby. It is not really complicated, but it is not obvious at first glance, and it is sometimes considered difficult. Wrongly so, as we shall see. It is also very important to become familiar with 'groupby', because it solves important problems that would be cumbersome to solve without it. The Pandas groupby operation involves some combination of splitting the object, applying a function, and combining the results. We can split a DataFrame object into groups based on various criteria, both row-wise and column-wise, i.e. by using the axis parameter.

'Applying' means

groupby can be applied to Pandas Series objects and DataFrame objects! We will learn to understand how it works with many small practical examples in this tutorial.
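
The three steps named above (splitting, applying, combining) can be spelled out by hand. The following is only a minimal sketch on made-up toy data, not taken from the rest of this chapter; the names groups, sums, manual and with_groupby are our own choice. It is meant to show that groupby bundles these three steps into one call:

```python
import pandas as pd

# Hypothetical toy data with a non-unique index
s = pd.Series([1, 2, 3, 4], index=["a", "b", "a", "b"])

# 1. split: collect the values belonging to each index label
groups = {label: s[s.index == label] for label in s.index.unique()}

# 2. apply: compute one value per group, here the sum
sums = {label: group.sum() for label, group in groups.items()}

# 3. combine: put the per-group results into a new Series
manual = pd.Series(sums).sort_index()

# groupby performs the same three steps in a single expression
with_groupby = s.groupby(s.index).sum()

print(manual)
print(with_groupby)
```

Both printed Series should contain one row per distinct label.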

groupby with Series

With the following Python program we create a Series object with nvalues entries. The index will not be unique, because the index labels are drawn from the list fruits, which has fewer elements than nvalues:

import pandas as pd
import numpy as np
import random

nvalues = 30
# we create random values, which will be used as the Series values:
values = np.random.randint(1, 20, (nvalues,))
fruits = ["bananas", "oranges", "apples", "clementines", "cherries", "pears"]
fruits_index = np.random.choice(fruits, (nvalues,))

s = pd.Series(values, index=fruits_index)
print(s[:10])

OUTPUT:

oranges        14
cherries        8
clementines     1
apples          8
bananas         9
apples          9
cherries       18
clementines    18
clementines     5
bananas         5
dtype: int64
grouped = s.groupby(s.index)
grouped

OUTPUT:

<pandas.core.groupby.generic.SeriesGroupBy object at 0x7f2cec564f10>

We can see that we get a SeriesGroupBy object if we apply groupby to the index of our Series object s. The result of this operation, grouped, is iterable. In every step we get a tuple consisting of an index label and a Series object. This Series object is s restricted to that label.

grouped = s.groupby(s.index)

for fruit, s_obj in grouped:
    print(f"===== {fruit} =====")
    print(s_obj)

OUTPUT:

===== apples =====
apples     8
apples     9
apples    15
apples     6
dtype: int64
===== bananas =====
bananas    9
bananas    5
bananas    4
dtype: int64
===== cherries =====
cherries     8
cherries    18
cherries     7
cherries     5
cherries     7
cherries    10
cherries     5
cherries     5
cherries    11
cherries     2
cherries    13
dtype: int64
===== clementines =====
clementines     1
clementines    18
clementines     5
clementines     6
clementines     6
dtype: int64
===== oranges =====
oranges    14
oranges    14
dtype: int64
===== pears =====
pears     8
pears     2
pears    13
pears    18
pears    13
dtype: int64

We could have got the same result, except for the order, without using 'groupby', with the following Python code:

for fruit in set(s.index):
    print(f"===== {fruit} =====")
    print(s[fruit])

OUTPUT:

===== cherries =====
cherries     8
cherries    18
cherries     7
cherries     5
cherries     7
cherries    10
cherries     5
cherries     5
cherries    11
cherries     2
cherries    13
dtype: int64
===== oranges =====
oranges    14
oranges    14
dtype: int64
===== pears =====
pears     8
pears     2
pears    13
pears    18
pears    13
dtype: int64
===== bananas =====
bananas    9
bananas    5
bananas    4
dtype: int64
===== apples =====
apples     8
apples     9
apples    15
apples     6
dtype: int64
===== clementines =====
clementines     1
clementines    18
clementines     5
clementines     6
clementines     6
dtype: int64
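
Iterating over the groups is rarely the final goal. Usually a function is applied to every group and the results are combined. A small sketch under stated assumptions: the Series below is typed in by hand from the first ten values printed earlier (a real run produces other random numbers), and the names sizes and totals are our own:

```python
import pandas as pd

# The first ten values of the Series shown above, entered manually
s = pd.Series([14, 8, 1, 8, 9, 9, 18, 18, 5, 5],
              index=["oranges", "cherries", "clementines", "apples",
                     "bananas", "apples", "cherries", "clementines",
                     "clementines", "bananas"])

grouped = s.groupby(s.index)

sizes = grouped.size()    # number of values per fruit
totals = grouped.sum()    # sum of the values per fruit

print(sizes)
print(totals)
```

Each result is again a Series, indexed by the distinct fruit names.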


groupby with DataFrames

We will start with a very simple DataFrame. The DataFrame has three columns: Name contains names, while Coffee and Tea contain integers, namely the number of cups of coffee and tea the person drank.

import pandas as pd
beverages = pd.DataFrame({'Name': ['Robert', 'Melinda', 'Brenda',
                                   'Samantha', 'Melinda', 'Robert',
                                   'Melinda', 'Brenda', 'Samantha'],
                          'Coffee': [3, 0, 2, 2, 0, 2, 0, 1, 3],
                          'Tea':    [0, 4, 2, 0, 3, 0, 3, 2, 0]})
    
beverages
       Name  Coffee  Tea
0    Robert       3    0
1   Melinda       0    4
2    Brenda       2    2
3  Samantha       2    0
4   Melinda       0    3
5    Robert       2    0
6   Melinda       0    3
7    Brenda       1    2
8  Samantha       3    0

It's simple, and we've already seen in the previous chapters of our tutorial how to calculate the total number of coffee cups. The task is to sum a column of a DataFrame, i.e. the 'Coffee' column:

beverages['Coffee'].sum()

OUTPUT:

13

Let's now compute the total number of coffees and teas:

beverages[['Coffee', 'Tea']].sum()

OUTPUT:

Coffee    13
Tea       14
dtype: int64

'groupby' has not been necessary for the previous tasks. Let's have a look at our DataFrame again. We can see that some of the names appear multiple times. So it will be very interesting to see how many cups of coffee and tea each person drank in total. That means we apply 'groupby' to the 'Name' column and thereby split the DataFrame. Then we apply 'sum' to the results of 'groupby':

res = beverages.groupby(['Name']).sum()
print(res)

OUTPUT:

          Coffee  Tea
Name                 
Brenda         3    4
Melinda        0   10
Robert         5    0
Samantha       5    0

We can see that the names are now the index of the resulting DataFrame:

print(res.index)

OUTPUT:

Index(['Brenda', 'Melinda', 'Robert', 'Samantha'], dtype='object', name='Name')

Two columns are left, i.e. the Coffee and the Tea column:

print(res.columns)

OUTPUT:

Index(['Coffee', 'Tea'], dtype='object')
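
If the names should stay an ordinary column instead of becoming the index, there are two common ways to achieve this. The following is a minimal sketch; the variable names res_flat and res_reset are our own choice:

```python
import pandas as pd

beverages = pd.DataFrame({'Name': ['Robert', 'Melinda', 'Brenda',
                                   'Samantha', 'Melinda', 'Robert',
                                   'Melinda', 'Brenda', 'Samantha'],
                          'Coffee': [3, 0, 2, 2, 0, 2, 0, 1, 3],
                          'Tea':    [0, 4, 2, 0, 3, 0, 3, 2, 0]})

# as_index=False keeps 'Name' as a regular column of the result
res_flat = beverages.groupby('Name', as_index=False).sum()

# alternatively, turn the index back into a column afterwards
res_reset = beverages.groupby('Name').sum().reset_index()

print(res_flat)
print(res_reset.columns)
```

In both cases the result has a default integer index and the three columns Name, Coffee and Tea.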

We can also calculate the average number of coffee and tea cups the persons had:

beverages.groupby(['Name']).mean()
          Coffee       Tea
Name
Brenda       1.5  2.000000
Melinda      0.0  3.333333
Robert       2.5  0.000000
Samantha     2.5  0.000000

Another Example

The following Python code creates the data we will use in our next groupby example. It is not necessary to understand this code in order to follow the rest of the chapter. The module faker has to be installed. In case of an Anaconda installation, this can be done by executing one of the following commands in a shell:

conda install -c conda-forge faker
conda install -c conda-forge/label/gcc7 faker
conda install -c conda-forge/label/cf201901 faker
conda install -c conda-forge/label/cf202003 faker
from faker import Faker
import numpy as np
from itertools import chain

fake = Faker('de_DE')

number_of_names = 10
names = []
for _ in range(number_of_names):
    names.append(fake.first_name())


data = {}
workweek = ("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
weekend = ("Saturday", "Sunday")

for day in chain(workweek, weekend):
    data[day] = np.random.randint(0, 10, (number_of_names,))
    
data_df = pd.DataFrame(data, index=names)
data_df
           Monday  Tuesday  Wednesday  Thursday  Friday  Saturday  Sunday
Hans-Karl       6        3          2         9       9         3       7
Marek           7        7          3         3       7         8       9
Alexej          1        6          5         7       8         8       3
Liesbeth        4        7          2         9       9         1       3
Henryk          5        6          4         0       2         0       3
Josip           2        4          9         5       2         3       7
Mario           7        3          9         1       4         5       7
Hilda           9        1          2         2       9         0       2
Cristina        1        6          6         8       2         4       6
Johanna         6        3          1         1       2         3       8
print(names)

OUTPUT:

['Hans-Karl', 'Marek', 'Alexej', 'Liesbeth', 'Henryk', 'Josip', 'Mario', 'Hilda', 'Cristina', 'Johanna']

Because both the names and the numbers are generated randomly, the result differs with every run. To make the following examples reproducible, we define the names and the data explicitly:

names = ('Ortwin', 'Mara', 'Siegrun', 'Sylvester', 'Metin', 'Adeline', 'Utz', 'Susan', 'Gisbert', 'Senol')
data = {'Monday': np.array([0, 9, 2, 3, 7, 3, 9, 2, 4, 9]),
        'Tuesday': np.array([2, 6, 3, 3, 5, 5, 7, 7, 1, 0]),
        'Wednesday': np.array([6, 1, 1, 9, 4, 0, 8, 6, 8, 8]),
        'Thursday': np.array([1, 8, 6, 9, 9, 4, 1, 7, 3, 2]),
        'Friday': np.array([3, 5, 6, 6, 5, 2, 2, 4, 6, 5]),
        'Saturday': np.array([8, 4, 8, 2, 3, 9, 3, 4, 9, 7]),
        'Sunday': np.array([0, 8, 7, 8, 9, 7, 2, 0, 5, 2])}

data_df = pd.DataFrame(data, index=names)
data_df
           Monday  Tuesday  Wednesday  Thursday  Friday  Saturday  Sunday
Ortwin          0        2          6         1       3         8       0
Mara            9        6          1         8       5         4       8
Siegrun         2        3          1         6       6         8       7
Sylvester       3        3          9         9       6         2       8
Metin           7        5          4         9       5         3       9
Adeline         3        5          0         4       2         9       7
Utz             9        7          8         1       2         3       2
Susan           2        7          6         7       4         4       0
Gisbert         4        1          8         3       6         9       5
Senol           9        0          8         2       5         7       2

We will demonstrate with this DataFrame how to combine columns by a function.

def is_weekend(day):
    if day in {'Saturday', 'Sunday'}:
        return "Weekend"
    else:
        return "Workday"
        
for res_func, df in data_df.groupby(by=is_weekend, axis=1):
    print(df)

OUTPUT:

           Saturday  Sunday
Ortwin            8       0
Mara              4       8
Siegrun           8       7
Sylvester         2       8
Metin             3       9
Adeline           9       7
Utz               3       2
Susan             4       0
Gisbert           9       5
Senol             7       2
           Monday  Tuesday  Wednesday  Thursday  Friday
Ortwin          0        2          6         1       3
Mara            9        6          1         8       5
Siegrun         2        3          1         6       6
Sylvester       3        3          9         9       6
Metin           7        5          4         9       5
Adeline         3        5          0         4       2
Utz             9        7          8         1       2
Susan           2        7          6         7       4
Gisbert         4        1          8         3       6
Senol           9        0          8         2       5
data_df.groupby(by=is_weekend, axis=1).sum()
           Weekend  Workday
Ortwin           8       12
Mara            12       29
Siegrun         15       18
Sylvester       10       30
Metin           12       30
Adeline         16       14
Utz              5       27
Susan            4       26
Gisbert         14       22
Senol            9       24
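
Newer pandas releases have deprecated the axis=1 argument of DataFrame.groupby, so the column-wise grouping shown above may trigger a warning or stop working, depending on the installed version. A possible workaround, sketched here on a small hand-picked excerpt of data_df, is to transpose the DataFrame, group the rows, and transpose the result back:

```python
import pandas as pd

# Excerpt of data_df: two persons, four of the seven days
data_df = pd.DataFrame({'Monday': [0, 9],
                        'Friday': [3, 5],
                        'Saturday': [8, 4],
                        'Sunday': [0, 8]},
                       index=['Ortwin', 'Mara'])

def is_weekend(day):
    # maps a day name to the name of its group
    if day in {'Saturday', 'Sunday'}:
        return "Weekend"
    return "Workday"

# After transposing, the day names form the row index, so the
# function is applied to them with the default axis=0
res = data_df.T.groupby(is_weekend).sum().T
print(res)
```

The result has the persons as rows again and one column per group, i.e. Weekend and Workday.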

Exercises

Exercise 1

Calculate the average prices of the products of the following DataFrame:

import pandas as pd

d = {"products": ["Oppilume", "Dreaker", "Lotadilo", 
                  "Crosteron", "Wazzasoft", "Oppilume", 
                  "Dreaker", "Lotadilo", "Wazzasoft"],
     "colours": ["blue", "blue", "blue", 
                 "green", "blue", "green", 
                 "green", "green", "red"],
     "customer_price": [2345.89, 2390.50, 1820.00, 
                        3100.00, 1784.50, 2545.89,
                        2590.50, 2220.00, 2084.50],
     "non_customer_price": [2445.89, 2495.50, 1980.00, 
                            3400.00, 1921.00, 2645.89, 
                            2655.50, 2140.00, 2190.00]}

product_prices = pd.DataFrame(d)
product_prices
    products colours  customer_price  non_customer_price
0   Oppilume    blue         2345.89             2445.89
1    Dreaker    blue         2390.50             2495.50
2   Lotadilo    blue         1820.00             1980.00
3  Crosteron   green         3100.00             3400.00
4  Wazzasoft    blue         1784.50             1921.00
5   Oppilume   green         2545.89             2645.89
6    Dreaker   green         2590.50             2655.50
7   Lotadilo   green         2220.00             2140.00
8  Wazzasoft     red         2084.50             2190.00

Exercise 2

Calculate the sum of the prices for each colour.

Exercise 3

Read in the project_times.txt file from the data1 directory. Each row of this file contains, separated by commas, the date, the name of the programmer, the name of the project, and the time the programmer spent on the project.

Calculate the time spent on all the projects per day.

Exercise 4

Create a DataFrame containing the total times spent on a project per day by all the programmers.

Exercise 5

Calculate the total times spent on the projects over the whole month.

Exercise 6

Calculate the monthly times of each programmer regardless of the projects

Exercise 7

Rearrange the DataFrame so that it has a MultiIndex consisting of the date and the project names. The columns should be the programmer names, and the column data should be the time the programmers spent on the projects.

                   time
programmer         Antonie  Elise  Fatima  Hella  Mariola
date     project
2020-01-01 BIRDY   NaN      NaN    NaN     1.50   1.75
           NSTAT   NaN      NaN    0.25    NaN    1.25
           XTOR    NaN      NaN    NaN     1.00   3.50
2020-01-02 BIRDY   NaN      NaN    NaN     1.75   2.00
           NSTAT   0.5      NaN    NaN     NaN    1.75

Replace the NaN values by 0.

Exercise 8:

The folder data contains a file donation.txt with the following data:

firstname,surname,city,job,income,donations
Janett,Schwital,Karlsruhe,Politician,244400,2512
Daniele,Segebahn,Freiburg,Student,16800,336
Kirstin,Klapp,Hamburg,Engineer,116900,1479
Oswald,Segebahn,Köln,Musician,57700,1142

Group the data by the job of the persons.


Solutions

Solution to Exercise 1

# numeric_only=True skips the text column 'colours', for which no mean exists
x = product_prices.groupby("products").mean(numeric_only=True)
x
           customer_price  non_customer_price
products
Crosteron         3100.00             3400.00
Dreaker           2490.50             2575.50
Lotadilo          2020.00             2060.00
Oppilume          2445.89             2545.89
Wazzasoft         1934.50             2055.50
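
Depending on the installed pandas version, aggregating a grouped DataFrame that still contains text columns (here: colours) may raise an error or produce unexpected extra columns. One way to stay independent of this, shown as a sketch with a variable name of our own choice (avg_prices), is to select the price columns explicitly before aggregating:

```python
import pandas as pd

d = {"products": ["Oppilume", "Dreaker", "Lotadilo",
                  "Crosteron", "Wazzasoft", "Oppilume",
                  "Dreaker", "Lotadilo", "Wazzasoft"],
     "colours": ["blue", "blue", "blue",
                 "green", "blue", "green",
                 "green", "green", "red"],
     "customer_price": [2345.89, 2390.50, 1820.00,
                        3100.00, 1784.50, 2545.89,
                        2590.50, 2220.00, 2084.50],
     "non_customer_price": [2445.89, 2495.50, 1980.00,
                            3400.00, 1921.00, 2645.89,
                            2655.50, 2140.00, 2190.00]}
product_prices = pd.DataFrame(d)

# Only the two selected numeric columns take part in the aggregation
price_columns = ["customer_price", "non_customer_price"]
avg_prices = product_prices.groupby("products")[price_columns].mean()
print(avg_prices)
```

Selecting the columns first also documents in the code which values are meant to be averaged.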

Solution to Exercise 2

# numeric_only=True restricts the sum to the two price columns
x = product_prices.groupby("colours").sum(numeric_only=True)
x
         customer_price  non_customer_price
colours
blue            8340.89             8842.39
green          10456.39            10841.39
red             2084.50             2190.00

Solution to Exercise 3

import pandas as pd

df = pd.read_csv("../data1/project_times.txt", index_col=0)
df
           programmer project  time
date
2020-01-01      Hella    XTOR  1.00
2020-01-01      Hella   BIRDY  1.50
2020-01-01     Fatima   NSTAT  0.25
2020-01-01    Mariola   NSTAT  0.50
2020-01-01    Mariola   BIRDY  1.75
...               ...     ...   ...
2030-01-30    Antonie    XTOR  0.50
2030-01-31      Hella   BIRDY  1.25
2030-01-31      Hella   BIRDY  1.75
2030-01-31    Mariola   BIRDY  1.00
2030-01-31      Hella   BIRDY  1.00

17492 rows × 3 columns

# numeric_only=True sums only the 'time' column and skips the text columns
times_per_day = df.groupby(df.index).sum(numeric_only=True)
print(times_per_day[:10])

OUTPUT:

             time
date             
2020-01-01   9.25
2020-01-02   6.00
2020-01-03   2.50
2020-01-06   5.75
2020-01-07  15.00
2020-01-08  13.25
2020-01-09  10.25
2020-01-10  17.00
2020-01-13   4.75
2020-01-14  10.00

Solution to Exercise 4

# numeric_only=True skips the remaining text column 'programmer'
times_per_day_project = df.groupby([df.index, 'project']).sum(numeric_only=True)
print(times_per_day_project[:10])

OUTPUT:

                    time
date       project      
2020-01-01 BIRDY    3.25
           NSTAT    1.50
           XTOR     4.50
2020-01-02 BIRDY    3.75
           NSTAT    2.25
2020-01-03 BIRDY    1.00
           NSTAT    0.25
           XTOR     1.25
2020-01-06 BIRDY    2.75
           NSTAT    0.75

Solution to Exercise 5

df.groupby(['project']).sum(numeric_only=True)
            time
project
BIRDY    9605.75
NSTAT    8707.75
XTOR     6427.50

Solution to Exercise 6

df.groupby(['programmer']).sum(numeric_only=True)
                time
programmer
Antonie      1511.25
Elise          80.00
Fatima        593.00
Hella       10642.00
Mariola     11914.75

Solution to Exercise 7

x = df.groupby([df.index, 'project', 'programmer']).sum()

x = x.unstack()
x
                      time
programmer         Antonie Elise Fatima Hella Mariola
date       project
2020-01-01 BIRDY       NaN   NaN    NaN  1.50    1.75
           NSTAT       NaN   NaN   0.25   NaN    1.25
           XTOR        NaN   NaN    NaN  1.00    3.50
2020-01-02 BIRDY       NaN   NaN    NaN  1.75    2.00
           NSTAT       0.5   NaN    NaN   NaN    1.75
...                    ...   ...    ...   ...     ...
2030-01-29 XTOR        NaN   NaN    NaN  1.00    5.50
2030-01-30 BIRDY       NaN   NaN    NaN  0.75    4.75
           NSTAT       NaN   NaN    NaN  3.75     NaN
           XTOR        0.5   NaN    NaN  0.75     NaN
2030-01-31 BIRDY       NaN   NaN    NaN  4.00    1.00

7037 rows × 5 columns

x = x.fillna(0)
print(x[:10])

OUTPUT:

                      time                           
programmer         Antonie Elise Fatima Hella Mariola
date       project                                   
2020-01-01 BIRDY      0.00   0.0   0.00  1.50    1.75
           NSTAT      0.00   0.0   0.25  0.00    1.25
           XTOR       0.00   0.0   0.00  1.00    3.50
2020-01-02 BIRDY      0.00   0.0   0.00  1.75    2.00
           NSTAT      0.50   0.0   0.00  0.00    1.75
2020-01-03 BIRDY      0.00   0.0   1.00  0.00    0.00
           NSTAT      0.25   0.0   0.00  0.00    0.00
           XTOR       0.00   0.0   0.00  0.50    0.75
2020-01-06 BIRDY      0.00   0.0   0.00  2.50    0.25
           NSTAT      0.00   0.0   0.00  0.00    0.75
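
The two steps unstack and fillna can also be merged, because unstack accepts a fill_value argument. The following sketch uses a hypothetical miniature version of the project data, built by hand from the first five rows shown in the solution to Exercise 3, since the original file is not reproduced here:

```python
import pandas as pd

# Hand-made stand-in for the first rows of project_times.txt
df = pd.DataFrame(
    {"programmer": ["Hella", "Hella", "Fatima", "Mariola", "Mariola"],
     "project": ["XTOR", "BIRDY", "NSTAT", "NSTAT", "BIRDY"],
     "time": [1.00, 1.50, 0.25, 0.50, 1.75]},
    index=pd.Index(["2020-01-01"] * 5, name="date"))

x = df.groupby([df.index, "project", "programmer"]).sum()

# fill_value=0 replaces the missing combinations while unstacking
x = x.unstack(fill_value=0)
print(x)
```

As in the solution above, the columns form a MultiIndex with 'time' on the upper level and the programmer names on the lower level.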

Solution to Exercise 8:

import pandas as pd

data = pd.read_csv('../data/donations.txt')
# numeric_only=True sums only 'income' and 'donations' and skips the text columns
data_sum = data.groupby(['job']).sum(numeric_only=True)
data_sum.sort_values(by='donations')
              income  donations
job
Student       372900       7458
Musician     1448700      24376
Engineer     2067200      25564
Politician   4118300      30758
Manager     12862600      87475
data_sum['relative'] = data_sum.donations * 100 / data_sum.income

data_sum.sort_values(by='relative')
              income  donations  relative
job
Manager     12862600      87475  0.680072
Politician   4118300      30758  0.746862
Engineer     2067200      25564  1.236649
Musician     1448700      24376  1.682612
Student       372900       7458  2.000000
