DataFrame

Playing Pandas

The underlying idea of a DataFrame is based on spreadsheets. We can see the data structure of a DataFrame as tabular and spreadsheet-like. A DataFrame logically corresponds to a "sheet" of an Excel document. A DataFrame has both a row and a column index.

Like a spreadsheet or Excel sheet, a DataFrame object contains an ordered collection of columns. Each column consists of a unique data typye, but different columns can have different types, e.g. the first column may consist of integers, while the second one consists of boolean values and so on.

There is a close connection between the DataFrames and the Series of Pandas. A DataFrame can be seen as a concatenation of Series, each Series having the same index, i.e. the index of the DataFrame.

We will demonstrate this in the following example.

We define the following three Series:

import pandas as pd
years = range(2014, 2018)
shop1 = pd.Series([2409.14, 2941.01, 3496.83, 3119.55], index=years)
shop2 = pd.Series([1203.45, 3441.62, 3007.83, 3619.53], index=years)
shop3 = pd.Series([3412.12, 3491.16, 3457.19, 1963.10], index=years)

What happens, if we concatenate these "shop" Series? Pandas provides a concat function for this purpose:

pd.concat([shop1, shop2, shop3])
The above code returned the following result:
2014    2409.14
2015    2941.01
2016    3496.83
2017    3119.55
2014    1203.45
2015    3441.62
2016    3007.83
2017    3619.53
2014    3412.12
2015    3491.16
2016    3457.19
2017    1963.10
dtype: float64

This result is not what we have intended or expected. The reason is that concat used 0 as the default for the axis parameter. Let's do it with "axis=1":

shops_df = pd.concat([shop1, shop2, shop3], axis=1)
shops_df
The above code returned the following:
0 1 2
2014 2409.14 1203.45 3412.12
2015 2941.01 3441.62 3491.16
2016 3496.83 3007.83 3457.19
2017 3119.55 3619.53 1963.10

Let's do some fine sanding by giving names to the columns:

cities = ["Zürich", "Winterthur", "Freiburg"]
shops_df.columns = cities 
print(shops_df)
# alternative way: give names to series:
shop1.name = "Zürich"
shop2.name = "Winterthur"
shop3.name = "Freiburg"
print("------")
shops_df2 = pd.concat([shop1, shop2, shop3], axis=1)
print(shops_df2)
       Zürich  Winterthur  Freiburg
2014  2409.14     1203.45   3412.12
2015  2941.01     3441.62   3491.16
2016  3496.83     3007.83   3457.19
2017  3119.55     3619.53   1963.10
------
       Zürich  Winterthur  Freiburg
2014  2409.14     1203.45   3412.12
2015  2941.01     3441.62   3491.16
2016  3496.83     3007.83   3457.19
2017  3119.55     3619.53   1963.10

This was nice, but what kind of data type is our result?

print(type(shops_df))
<class 'pandas.core.frame.DataFrame'>

This means, we can arrange or concat Series into DataFrames!

DataFrames from Dictionaries

A DataFrame has a row and column index; it's like a dict of Series with a common index.

cities = {"name": ["London", "Berlin", "Madrid", "Rome", 
                   "Paris", "Vienna", "Bucharest", "Hamburg", 
                   "Budapest", "Warsaw", "Barcelona", 
                   "Munich", "Milan"],
          "population": [8615246, 3562166, 3165235, 2874038,
                         2273305, 1805681, 1803425, 1760433,
                         1754000, 1740119, 1602386, 1493900,
                         1350680],
          "country": ["England", "Germany", "Spain", "Italy",
                      "France", "Austria", "Romania", 
                      "Germany", "Hungary", "Poland", "Spain",
                      "Germany", "Italy"]}
city_frame = pd.DataFrame(cities)
city_frame
This gets us the following:
country name population
0 England London 8615246
1 Germany Berlin 3562166
2 Spain Madrid 3165235
3 Italy Rome 2874038
4 France Paris 2273305
5 Austria Vienna 1805681
6 Romania Bucharest 1803425
7 Germany Hamburg 1760433
8 Hungary Budapest 1754000
9 Poland Warsaw 1740119
10 Spain Barcelona 1602386
11 Germany Munich 1493900
12 Italy Milan 1350680

Retrieving the Column Names

It's possible to get the names of the columns as a list:

city_frame.columns.values
The Python code above returned the following:
array(['country', 'name', 'population'], dtype=object)

Custom Index

We can see that an index (0,1,2, ...) has been automatically assigned to the DataFrame. We can also assign a custom index to the DataFrame object:

ordinals = ["first", "second", "third", "fourth",
            "fifth", "sixth", "seventh", "eigth",
            "ninth", "tenth", "eleventh", "twelvth",
            "thirteenth"]
city_frame = pd.DataFrame(cities, index=ordinals)
city_frame
The previous code returned the following:
country name population
first England London 8615246
second Germany Berlin 3562166
third Spain Madrid 3165235
fourth Italy Rome 2874038
fifth France Paris 2273305
sixth Austria Vienna 1805681
seventh Romania Bucharest 1803425
eigth Germany Hamburg 1760433
ninth Hungary Budapest 1754000
tenth Poland Warsaw 1740119
eleventh Spain Barcelona 1602386
twelvth Germany Munich 1493900
thirteenth Italy Milan 1350680

Rearranging the Order of Columns

We can also define and rearrange the order of the columns at the time of creation of the DataFrame. This makes also sure that we will have a defined ordering of our columns, if we create the DataFrame from a dictionary. Dictionaries are not ordered, as you have seen in our chapter on Dictionaries in our Python tutorial, so we cannot know in advance what the ordering of our columns will be:

city_frame = pd.DataFrame(cities,
                          columns=["name", 
                                   "country", 
                                   "population"])
city_frame
This gets us the following result:
name country population
0 London England 8615246
1 Berlin Germany 3562166
2 Madrid Spain 3165235
3 Rome Italy 2874038
4 Paris France 2273305
5 Vienna Austria 1805681
6 Bucharest Romania 1803425
7 Hamburg Germany 1760433
8 Budapest Hungary 1754000
9 Warsaw Poland 1740119
10 Barcelona Spain 1602386
11 Munich Germany 1493900
12 Milan Italy 1350680

But what if you want to change the column names and the ordering of an existing DataFrame?

city_frame.reindex(["country", "name", "population"])
city_frame
This gets us the following:
name country population
0 London England 8615246
1 Berlin Germany 3562166
2 Madrid Spain 3165235
3 Rome Italy 2874038
4 Paris France 2273305
5 Vienna Austria 1805681
6 Bucharest Romania 1803425
7 Hamburg Germany 1760433
8 Budapest Hungary 1754000
9 Warsaw Poland 1740119
10 Barcelona Spain 1602386
11 Munich Germany 1493900
12 Milan Italy 1350680

Now, we want to rename our columns. For this purpose, we will use the DataFrame method 'rename'. This method supports two calling conventions

We will rename the columns of our DataFrame into Romanian names in the following example. We set the parameter inplace to True so that our DataFrame will be changed instead of returning a new DataFrame, if inplace is set to False, which is the default!

city_frame.rename(columns={"name":"Nume", "country":"țară", "population":"populație"},
                 inplace=True)
city_frame
The above code returned the following:
Nume țară populație
0 London England 8615246
1 Berlin Germany 3562166
2 Madrid Spain 3165235
3 Rome Italy 2874038
4 Paris France 2273305
5 Vienna Austria 1805681
6 Bucharest Romania 1803425
7 Hamburg Germany 1760433
8 Budapest Hungary 1754000
9 Warsaw Poland 1740119
10 Barcelona Spain 1602386
11 Munich Germany 1493900
12 Milan Italy 1350680

Existing Column as the Index of a DataFrame

We want to create a more useful index in the following example. We will use the country name as the index, i.e. the list value associated to the key "country" of our cities dictionary:

city_frame = pd.DataFrame(cities,
                          columns=["name", "population"],
                          index=cities["country"])
city_frame
The above code returned the following:
name population
England London 8615246
Germany Berlin 3562166
Spain Madrid 3165235
Italy Rome 2874038
France Paris 2273305
Austria Vienna 1805681
Romania Bucharest 1803425
Germany Hamburg 1760433
Hungary Budapest 1754000
Poland Warsaw 1740119
Spain Barcelona 1602386
Germany Munich 1493900
Italy Milan 1350680

Alternatively, we can change an existing DataFrame. We can us the method set_index to turn a column into an index. "set_index" does not work in-place, it returns a new data frame with the chosen column as the index:

city_frame = pd.DataFrame(cities)
city_frame2 = city_frame.set_index("country")
print(city_frame2)
              name  population
country                       
England     London     8615246
Germany     Berlin     3562166
Spain       Madrid     3165235
Italy         Rome     2874038
France       Paris     2273305
Austria     Vienna     1805681
Romania  Bucharest     1803425
Germany    Hamburg     1760433
Hungary   Budapest     1754000
Poland      Warsaw     1740119
Spain    Barcelona     1602386
Germany     Munich     1493900
Italy        Milan     1350680

We saw in the previous example that the set_index method returns a new DataFrame object and doesn't change the original DataFrame. If we set the optional parameter "inplace" to True, the DataFrame will be changed in place, i.e. no new object will be created:

city_frame = pd.DataFrame(cities)
city_frame.set_index("country", inplace=True)
print(city_frame)
              name  population
country                       
England     London     8615246
Germany     Berlin     3562166
Spain       Madrid     3165235
Italy         Rome     2874038
France       Paris     2273305
Austria     Vienna     1805681
Romania  Bucharest     1803425
Germany    Hamburg     1760433
Hungary   Budapest     1754000
Poland      Warsaw     1740119
Spain    Barcelona     1602386
Germany     Munich     1493900
Italy        Milan     1350680

Label-Indexing on the Rows

So far we have indexed DataFrames via the columns. We will demonstrate now, how we can access rows from DataFrames via the locators 'loc' and 'iloc'. ('ix' is deprecated and will be removed in the future)

city_frame = pd.DataFrame(cities, 
                          columns=("name", "population"), 
                          index=cities["country"])
print(city_frame.loc["Germany"])
            name  population
Germany   Berlin     3562166
Germany  Hamburg     1760433
Germany   Munich     1493900
print(city_frame.loc[["Germany", "France"]])
            name  population
Germany   Berlin     3562166
Germany  Hamburg     1760433
Germany   Munich     1493900
France     Paris     2273305
print(city_frame.loc[city_frame.population>2000000])
           name  population
England  London     8615246
Germany  Berlin     3562166
Spain    Madrid     3165235
Italy      Rome     2874038
France    Paris     2273305

Sum and Cumulative Sum

We can calculate the sum of all the columns of a DataFrame or the sum of certain columns:

print(city_frame.sum())
name          LondonBerlinMadridRomeParisViennaBucharestHamb...
population                                             33800614
dtype: object
city_frame["population"].sum()
The above code returned the following output:
33800614

We can use "cumsum" to calculate the cumulative sum:

x = city_frame["population"].cumsum()
print(x)
England     8615246
Germany    12177412
Spain      15342647
Italy      18216685
France     20489990
Austria    22295671
Romania    24099096
Germany    25859529
Hungary    27613529
Poland     29353648
Spain      30956034
Germany    32449934
Italy      33800614
Name: population, dtype: int64

Assigning New Values to Columns

x is a Pandas Series. We can reassign the previously calculated cumulative sums to the population column:

city_frame["population"] = x
print(city_frame)
              name  population
England     London     8615246
Germany     Berlin    12177412
Spain       Madrid    15342647
Italy         Rome    18216685
France       Paris    20489990
Austria     Vienna    22295671
Romania  Bucharest    24099096
Germany    Hamburg    25859529
Hungary   Budapest    27613529
Poland      Warsaw    29353648
Spain    Barcelona    30956034
Germany     Munich    32449934
Italy        Milan    33800614

Instead of replacing the values of the population column with the cumulative sum, we want to add the cumulative population sum as a new culumn with the name "cum_population".

city_frame = pd.DataFrame(cities,
                          columns=["country", 
                                   "population",
                                   "cum_population"],
                          index=cities["name"])
city_frame
The above code returned the following result:
country population cum_population
London England 8615246 NaN
Berlin Germany 3562166 NaN
Madrid Spain 3165235 NaN
Rome Italy 2874038 NaN
Paris France 2273305 NaN
Vienna Austria 1805681 NaN
Bucharest Romania 1803425 NaN
Hamburg Germany 1760433 NaN
Budapest Hungary 1754000 NaN
Warsaw Poland 1740119 NaN
Barcelona Spain 1602386 NaN
Munich Germany 1493900 NaN
Milan Italy 1350680 NaN

We can see that the column "cum_population" is set to Nan, as we haven't provided any data for it.

We will assign now the cumulative sums to this column:

city_frame["cum_population"] = city_frame["population"].cumsum()
city_frame
The previous code returned the following result:
country population cum_population
London England 8615246 8615246
Berlin Germany 3562166 12177412
Madrid Spain 3165235 15342647
Rome Italy 2874038 18216685
Paris France 2273305 20489990
Vienna Austria 1805681 22295671
Bucharest Romania 1803425 24099096
Hamburg Germany 1760433 25859529
Budapest Hungary 1754000 27613529
Warsaw Poland 1740119 29353648
Barcelona Spain 1602386 30956034
Munich Germany 1493900 32449934
Milan Italy 1350680 33800614

We can also include a column name which is not contained in the dictionary, when we create the DataFrame from the dictionary. In this case, all the values of this column will be set to NaN:

city_frame = pd.DataFrame(cities,
                          columns=["country", 
                                   "area",
                                   "population"],
                          index=cities["name"])
print(city_frame)
           country area  population
London     England  NaN     8615246
Berlin     Germany  NaN     3562166
Madrid       Spain  NaN     3165235
Rome         Italy  NaN     2874038
Paris       France  NaN     2273305
Vienna     Austria  NaN     1805681
Bucharest  Romania  NaN     1803425
Hamburg    Germany  NaN     1760433
Budapest   Hungary  NaN     1754000
Warsaw      Poland  NaN     1740119
Barcelona    Spain  NaN     1602386
Munich     Germany  NaN     1493900
Milan        Italy  NaN     1350680

Accessing the Columns of a DataFrame

There are two ways to access a column of a DataFrame. The result is in both cases a Series:

# in a dictionary-like way:
print(city_frame["population"])
London       8615246
Berlin       3562166
Madrid       3165235
Rome         2874038
Paris        2273305
Vienna       1805681
Bucharest    1803425
Hamburg      1760433
Budapest     1754000
Warsaw       1740119
Barcelona    1602386
Munich       1493900
Milan        1350680
Name: population, dtype: int64
# as an attribute
print(city_frame.population)
London       8615246
Berlin       3562166
Madrid       3165235
Rome         2874038
Paris        2273305
Vienna       1805681
Bucharest    1803425
Hamburg      1760433
Budapest     1754000
Warsaw       1740119
Barcelona    1602386
Munich       1493900
Milan        1350680
Name: population, dtype: int64
print(type(city_frame.population))
<class 'pandas.core.series.Series'>
city_frame.population
We received the following output:
London       8615246
Berlin       3562166
Madrid       3165235
Rome         2874038
Paris        2273305
Vienna       1805681
Bucharest    1803425
Hamburg      1760433
Budapest     1754000
Warsaw       1740119
Barcelona    1602386
Munich       1493900
Milan        1350680
Name: population, dtype: int64

From the previous example, we can see that we have not copied the population column. "p" is a view on the data of city_frame.

Assigning New Values to a Column

The column area is still not defined. We can set all elements of the column to the same value:

city_frame["area"] = 1572
print(city_frame)
           country  area  population
London     England  1572     8615246
Berlin     Germany  1572     3562166
Madrid       Spain  1572     3165235
Rome         Italy  1572     2874038
Paris       France  1572     2273305
Vienna     Austria  1572     1805681
Bucharest  Romania  1572     1803425
Hamburg    Germany  1572     1760433
Budapest   Hungary  1572     1754000
Warsaw      Poland  1572     1740119
Barcelona    Spain  1572     1602386
Munich     Germany  1572     1493900
Milan        Italy  1572     1350680

In this case, it will be definitely better to assign the exact area to the cities. The list with the area values needs to have the same length as the number of rows in our DataFrame.

# area in square km:
area = [1572, 891.85, 605.77, 1285, 
        105.4, 414.6, 228, 755, 
        525.2, 517, 101.9, 310.4, 
        181.8]
# area could have been designed as a list, a Series, an array or a scalar   
city_frame["area"] = area
print(city_frame)
           country     area  population
London     England  1572.00     8615246
Berlin     Germany   891.85     3562166
Madrid       Spain   605.77     3165235
Rome         Italy  1285.00     2874038
Paris       France   105.40     2273305
Vienna     Austria   414.60     1805681
Bucharest  Romania   228.00     1803425
Hamburg    Germany   755.00     1760433
Budapest   Hungary   525.20     1754000
Warsaw      Poland   517.00     1740119
Barcelona    Spain   101.90     1602386
Munich     Germany   310.40     1493900
Milan        Italy   181.80     1350680

Sorting DataFrames

Let's sort our DataFrame according to the city area:

city_frame = city_frame.sort_values(by="area", ascending=False)
print(city_frame)
           country     area  population
London     England  1572.00     8615246
Rome         Italy  1285.00     2874038
Berlin     Germany   891.85     3562166
Hamburg    Germany   755.00     1760433
Madrid       Spain   605.77     3165235
Budapest   Hungary   525.20     1754000
Warsaw      Poland   517.00     1740119
Vienna     Austria   414.60     1805681
Munich     Germany   310.40     1493900
Bucharest  Romania   228.00     1803425
Milan        Italy   181.80     1350680
Paris       France   105.40     2273305
Barcelona    Spain   101.90     1602386

Let's assume, we have only the areas of London, Hamburg and Milan. The areas are in a series with the correct indices. We can assign this series as well:

city_frame = pd.DataFrame(cities,
                          columns=["country", 
                                   "area",
                                   "population"],
                          index=cities["name"])
some_areas = pd.Series([1572, 755, 181.8], 
                    index=['London', 'Hamburg', 'Milan'])
city_frame['area'] = some_areas
print(city_frame)
           country    area  population
London     England  1572.0     8615246
Berlin     Germany     NaN     3562166
Madrid       Spain     NaN     3165235
Rome         Italy     NaN     2874038
Paris       France     NaN     2273305
Vienna     Austria     NaN     1805681
Bucharest  Romania     NaN     1803425
Hamburg    Germany   755.0     1760433
Budapest   Hungary     NaN     1754000
Warsaw      Poland     NaN     1740119
Barcelona    Spain     NaN     1602386
Munich     Germany     NaN     1493900
Milan        Italy   181.8     1350680

Inserting new columns into existing DataFrames

In the previous example we have added the column area at creation time. Quite often it will be necessary to add or insert columns into existing DataFrames. For this purpose the DataFrame class provides a method "insert", which allows us to insert a column into a DataFrame at a specified location:

insert(self, loc, column, value, allow_duplicates=False)`

The parameters are specified as:

Parameter Meaning
loc int
This value should be within
the range 0 <= loc <= len(columns)
column the column name
value can be a list, a Series an array or a scalar
allow_duplicates If allow_duplicates is False,
an Exception will be raised,
if column is already contained
in the DataFrame.
city_frame = pd.DataFrame(cities,
                          columns=["country", 
                                   "population"],
                          index=cities["name"])
idx = 1
city_frame.insert(loc=idx, column='area', value=area)
city_frame
<class 'pandas.core.frame.DataFrame'>
The above Python code returned the following output:
country area population
London England 1572.00 8615246
Berlin Germany 891.85 3562166
Madrid Spain 605.77 3165235
Rome Italy 1285.00 2874038
Paris France 105.40 2273305
Vienna Austria 414.60 1805681
Bucharest Romania 228.00 1803425
Hamburg Germany 755.00 1760433
Budapest Hungary 525.20 1754000
Warsaw Poland 517.00 1740119
Barcelona Spain 101.90 1602386
Munich Germany 310.40 1493900
Milan Italy 181.80 1350680

DataFrame from Nested Dictionaries

A nested dictionary of dicts can be passed to a DataFrame as well. The indices of the outer dictionary are taken as the the columns and the inner keys. i.e. the keys of the nested dictionaries, are used as the row indices:

growth = {"Switzerland": {"2010": 3.0, "2011": 1.8, "2012": 1.1, "2013": 1.9},
          "Germany": {"2010": 4.1, "2011": 3.6, "2012":	0.4, "2013": 0.1},
          "France": {"2010":2.0,  "2011":2.1, "2012": 0.3, "2013": 0.3},
          "Greece": {"2010":-5.4, "2011":-8.9, "2012":-6.6, "2013":	-3.3},
          "Italy": {"2010":1.7, "2011":	0.6, "2012":-2.3, "2013":-1.9}
          } 
growth_frame = pd.DataFrame(growth)
growth_frame
The previous code returned the following:
France Germany Greece Italy Switzerland
2010 2.0 4.1 -5.4 1.7 3.0
2011 2.1 3.6 -8.9 0.6 1.8
2012 0.3 0.4 -6.6 -2.3 1.1
2013 0.3 0.1 -3.3 -1.9 1.9

You like to have the years in the columns and the countries in the rows? No problem, you can transpose the data:

growth_frame.T
The above Python code returned the following output:
2010 2011 2012 2013
France 2.0 2.1 0.3 0.3
Germany 4.1 3.6 0.4 0.1
Greece -5.4 -8.9 -6.6 -3.3
Italy 1.7 0.6 -2.3 -1.9
Switzerland 3.0 1.8 1.1 1.9
growth_frame = growth_frame.T
growth_frame2 = growth_frame.reindex(["Switzerland", 
                                      "Italy", 
                                      "Germany", 
                                      "Greece"])
print(growth_frame2)
             2010  2011  2012  2013
Switzerland   3.0   1.8   1.1   1.9
Italy         1.7   0.6  -2.3  -1.9
Germany       4.1   3.6   0.4   0.1
Greece       -5.4  -8.9  -6.6  -3.3

Filling a DataFrame with random values:

import numpy as np
names = ['Frank', 'Eve', 'Stella', 'Guido', 'Lara']
index = ["January", "February", "March",
         "April", "May", "June",
         "July", "August", "September",
         "October", "November", "December"]
df = pd.DataFrame((np.random.randn(12, 5)*1000).round(2),
                  columns=names,
                  index=index)
df
The Python code above returned the following:
Frank Eve Stella Guido Lara
January -1715.54 -860.69 -1602.44 431.37 384.42
February -221.95 -366.25 -1829.61 386.04 886.35
March -1287.75 980.25 -1210.07 399.73 -822.63
April -87.91 -2158.54 738.34 -1519.46 -1204.74
May -151.25 700.41 438.22 332.79 544.60
June 888.29 46.74 1188.26 -2083.93 541.45
July -512.79 765.51 1563.98 -2365.39 858.46
August 1397.45 -899.42 -162.13 604.49 -319.63
September 739.99 482.76 -1163.16 18.69 -868.65
October 2524.67 216.33 22.52 -585.97 2127.54
November -140.05 -1630.00 -1611.42 -2681.52 -923.53
December -744.64 817.47 -802.55 -740.33 798.00