python-course.eu

30. Pandas Groupby Example

By Bernd Klein. Last modified: 21 Feb 2024.


Person on a heap of coins

In this part of our Python course we would like to give you a detailed example of how to use groupby. We will use a data file donations.txt.

The first few lines of this data file look like this

firstname,surname,city,job,income,donations
Janett,Schwital,Karlsruhe,Politician,244400,2512
Daniele,Segebahn,Freiburg,Student,16800,336
Kirstin,Klapp,Hamburg,Engineer,116900,1479
Oswald,Segebahn,Köln,Musician,57700,1142
Heinz-Joachim,Wagner,Stuttgart,Engineer,109300,1592

The data of this is artifially created by the Faker module, so any possible correspondence with real existing persons is purely coincidental and not intended. Each line contains the firstname and the surname of a fictive person plus the persons city, the job position (one of five politician, student, engineer, musician and manager), the early income and the sum of donations per year.

We first read in the data file, which is of course a csv file:

import pandas as pd

fname = '../data/donations.txt'
data = pd.read_csv(fname, usecols=[2, 3, 4, 5])
data[:10]
city job income donations
0 Karlsruhe Politician 244400 2512
1 Freiburg Student 16800 336
2 Hamburg Engineer 116900 1479
3 Köln Musician 57700 1142
4 Stuttgart Engineer 109300 1592
5 Hamburg Student 12500 250
6 Freiburg Engineer 128700 1984
7 Stuttgart Politician 161300 822
8 Freiburg Engineer 129000 2159
9 Stuttgart Musician 108800 1516

First, let's look at how much each occupational group earned and donated in total. We can do this be using the Pandas groupby together with sum:

data_sum = data[['job', 'income', 'donations']].groupby(['job']).sum()
data_sum
income donations
job
Engineer 2067200 25564
Manager 12862600 87475
Musician 1448700 24376
Politician 4118300 30758
Student 372900 7458

By using sort_values on the donations column, we can sort this DataFrame:

data_sum.sort_values(by='donations')
income donations
job
Student 372900 7458
Musician 1448700 24376
Engineer 2067200 25564
Politician 4118300 30758
Manager 12862600 87475

Let's look at the following three people from our datafile:

Janett,Schwital,Karlsruhe,Politician,244400,2512
Daniele,Segebahn,Freiburg,Student,16800,336
Kirstin,Klapp,Hamburg,Engineer,116900,1479

Which is the most generous of them? The politician Janett Schwital, because 2512 Euros is nearly 8 times more than what the student Daniele Segebahn has donated and nearly two times as much as the engineer Kirstin Klapp has donated.

Most people would think that we have to see the donations in relation to their incomes. Then we get the following:

The politician spent about 1 % of the income, the engineer about 1.3 % and the student 2 %. So we could say that Daniele is the most generous one in relation to her income!

Let's do the same thing on the previous DataFrame:

data_sum['relative'] = data_sum.donations * 100 / data_sum.income
data_sum.sort_values(by='relative')
income donations relative
job
Manager 12862600 87475 0.680072
Politician 4118300 30758 0.746862
Engineer 2067200 25564 1.236649
Musician 1448700 24376 1.682612
Student 372900 7458 2.000000

So, we see that the managers and the politicians are very stingy and the students are extremely generous. We are not sure, if we can general this by saying: The less people have the more generous they are. You shouldn't forget that the whole data set is fake!

We want to count now the number of people in each job group for each city. Using count and groupby in the following way is helpful but not nice. The numbers in the columns 'income' and 'donations' show the wanted numbers but the column names are misleading and we don't need the results twice:

x = data.groupby(['city', 'job']).count()
x
income donations
city job
Berlin Engineer 3 3
Manager 3 3
Musician 2 2
Politician 1 1
Student 2 2
Freiburg Engineer 3 3
Manager 5 5
Musician 3 3
Politician 3 3
Student 5 5
Hamburg Engineer 4 4
Musician 4 4
Politician 1 1
Student 2 2
Karlsruhe Engineer 1 1
Manager 3 3
Politician 7 7
Student 2 2
Konstanz Engineer 2 2
Manager 5 5
Musician 3 3
Politician 4 4
Student 3 3
Köln Engineer 1 1
Manager 3 3
Musician 2 2
Politician 1 1
Student 3 3
Stuttgart Engineer 4 4
Manager 4 4
Musician 4 4
Politician 3 3
Student 4 4

We could do the following to get a nice result, but the whole way is clumsy:

people_job_city = data.groupby(['city', 'job']).count()
people_job_city.drop(columns=['income'], inplace=True)
people_job_city.rename(columns={'donations': 'number of people'})
number of people
city job
Berlin Engineer 3
Manager 3
Musician 2
Politician 1
Student 2
Freiburg Engineer 3
Manager 5
Musician 3
Politician 3
Student 5
Hamburg Engineer 4
Musician 4
Politician 1
Student 2
Karlsruhe Engineer 1
Manager 3
Politician 7
Student 2
Konstanz Engineer 2
Manager 5
Musician 3
Politician 4
Student 3
Köln Engineer 1
Manager 3
Musician 2
Politician 1
Student 3
Stuttgart Engineer 4
Manager 4
Musician 4
Politician 3
Student 4

There is a better way to solve this with groupby of Pandas. In the following solution, the groupby method groups the DataFrame by the specified columns (in this case, "city" and "job"), creating a new DataFrame with each group. The size method returns the count of each group, resulting in a new DataFrame with the counts for each combination of city and job. Note that the result is a Series object:

result = data[['city', 'job']].groupby(['city', 'job']).size()
result

OUTPUT:

city       job       
Berlin     Engineer      3
           Manager       3
           Musician      2
           Politician    1
           Student       2
Freiburg   Engineer      3
           Manager       5
           Musician      3
           Politician    3
           Student       5
Hamburg    Engineer      4
           Musician      4
           Politician    1
           Student       2
Karlsruhe  Engineer      1
           Manager       3
           Politician    7
           Student       2
Konstanz   Engineer      2
           Manager       5
           Musician      3
           Politician    4
           Student       3
Köln       Engineer      1
           Manager       3
           Musician      2
           Politician    1
           Student       3
Stuttgart  Engineer      4
           Manager       4
           Musician      4
           Politician    3
           Student       4
dtype: int64

What's more interesting than the previous result is the number of donations and especially donations in relation to the income for each job and city:

city_job_data = data.groupby(['city', 'job']).sum()
city_job_data[:12]  # the first 12 lines
income donations
city job
Berlin Engineer 334300 3732
Manager 1916000 16290
Musician 135500 2644
Politician 246700 1103
Student 31200 624
Freiburg Engineer 358400 5431
Manager 2423700 17811
Musician 273800 4640
Politician 489300 4352
Student 87500 1750
Hamburg Engineer 448400 5390
Musician 358800 5359

Again, we prefer to see the donations in relation to the income:

city_job_data['rel_donations'] = (city_job_data['donations'] * 100 / city_job_data['income']).round(2)
city_job_data.drop(['income', 'donations'], axis=1, inplace=True)
city_job_data[:12]
rel_donations
city job
Berlin Engineer 1.12
Manager 0.85
Musician 1.95
Politician 0.45
Student 2.00
Freiburg Engineer 1.52
Manager 0.73
Musician 1.69
Politician 0.89
Student 2.00
Hamburg Engineer 1.20
Musician 1.49

Maybe, you want to know now, in which city the most generous people live?

data
city job income donations
0 Karlsruhe Politician 244400 2512
1 Freiburg Student 16800 336
2 Hamburg Engineer 116900 1479
3 Köln Musician 57700 1142
4 Stuttgart Engineer 109300 1592
... ... ... ... ...
95 Stuttgart Manager 364300 1487
96 Stuttgart Student 12800 256
97 Köln Engineer 119300 1308
98 Karlsruhe Politician 295600 3364
99 Freiburg Engineer 100700 1288

100 rows × 4 columns

cities_donations = data[['city', 'income', 'donations']].groupby(['city']).sum()
cities_donations['relative'] = cities_donations['donations'] * 100 / cities_donations['income']
cities_donations.sort_values(by='relative')
income donations relative
city
Karlsruhe 3457600 22265 0.643944
Konstanz 4165500 29386 0.705462
Köln 2369500 21158 0.892931
Stuttgart 3522900 31891 0.905249
Berlin 2663700 24393 0.915756
Freiburg 3632700 33984 0.935503
Hamburg 1057800 12554 1.186803

Live Python training

instructor-led training course

Enjoying this page? We offer live Python training courses covering the content of this site.

See: Live Python courses overview

Upcoming online Courses

Enrol here