The underlying idea of a DataFrame is based on spreadsheets. We can see the data structure of a DataFrame as tabular and spreadsheet-like. A DataFrame logically corresponds to a "sheet" of an Excel document. A DataFrame has both a row and a column index.
Like a spreadsheet or Excel sheet, a DataFrame object contains an ordered collection of columns. Each column consists of a unique data typye, but different columns can have different types, e.g. the first column may consist of integers, while the second one consists of boolean values and so on.
There is a close connection between the DataFrames and the Series of Pandas. A DataFrame can be seen as a concatenation of Series, each Series having the same index, i.e. the index of the DataFrame.
We will demonstrate this in the following example.
We define the following three Series:
import pandas as pd
years = range(2014, 2018)
shop1 = pd.Series([2409.14, 2941.01, 3496.83, 3119.55], index=years)
shop2 = pd.Series([1203.45, 3441.62, 3007.83, 3619.53], index=years)
shop3 = pd.Series([3412.12, 3491.16, 3457.19, 1963.10], index=years)
What happens, if we concatenate these "shop" Series? Pandas provides a concat function for this purpose:
pd.concat([shop1, shop2, shop3])
This result is not what we have intended or expected. The reason is that concat used 0 as the default for the axis parameter. Let's do it with "axis=1":
shops_df = pd.concat([shop1, shop2, shop3], axis=1)
shops_df
Let's do some fine sanding by giving names to the columns:
cities = ["Zürich", "Winterthur", "Freiburg"]
shops_df.columns = cities
print(shops_df)
# alternative way: give names to series:
shop1.name = "Zürich"
shop2.name = "Winterthur"
shop3.name = "Freiburg"
print("------")
shops_df2 = pd.concat([shop1, shop2, shop3], axis=1)
print(shops_df2)
This was nice, but what kind of data type is our result?
print(type(shops_df))
This means, we can arrange or concat Series into DataFrames!
cities = {"name": ["London", "Berlin", "Madrid", "Rome",
"Paris", "Vienna", "Bucharest", "Hamburg",
"Budapest", "Warsaw", "Barcelona",
"Munich", "Milan"],
"population": [8615246, 3562166, 3165235, 2874038,
2273305, 1805681, 1803425, 1760433,
1754000, 1740119, 1602386, 1493900,
1350680],
"country": ["England", "Germany", "Spain", "Italy",
"France", "Austria", "Romania",
"Germany", "Hungary", "Poland", "Spain",
"Germany", "Italy"]}
city_frame = pd.DataFrame(cities)
city_frame
city_frame.columns.values
ordinals = ["first", "second", "third", "fourth",
"fifth", "sixth", "seventh", "eigth",
"ninth", "tenth", "eleventh", "twelvth",
"thirteenth"]
city_frame = pd.DataFrame(cities, index=ordinals)
city_frame
Rearranging the Order of Columns
We can also define and rearrange the order of the columns at the time of creation of the DataFrame. This makes also sure that we will have a defined ordering of our columns, if we create the DataFrame from a dictionary. Dictionaries are not ordered, as you have seen in our chapter on Dictionaries in our Python tutorial, so we cannot know in advance what the ordering of our columns will be:
city_frame = pd.DataFrame(cities,
columns=["name",
"country",
"population"])
city_frame
We change both the column order and the ordering of the index with the function reindex
with the following code:
city_frame.reindex(index=[0, 2, 4, 6, 8, 10, 12, 1, 3, 5, 7, 9, 11],
columns=['country', 'name', 'population'])
Now, we want to rename our columns. For this purpose, we will use the DataFrame method 'rename'. This method supports two calling conventions
- (index=index_mapper, columns=columns_mapper, ...)
- (mapper, axis={'index', 'columns'}, ...)
We will rename the columns of our DataFrame into Romanian names in the following example. We set the parameter inplace to True so that our DataFrame will be changed instead of returning a new DataFrame, if inplace is set to False, which is the default!
city_frame.rename(columns={"name":"Soyadı",
"country":"Ülke",
"population":"Nüfus"},
inplace=True)
city_frame
city_frame = pd.DataFrame(cities,
columns=["name", "population"],
index=cities["country"])
city_frame
Alternatively, we can change an existing DataFrame. We can us the method set_index to turn a column into an index. "set_index" does not work in-place, it returns a new data frame with the chosen column as the index:
city_frame = pd.DataFrame(cities)
city_frame2 = city_frame.set_index("country")
print(city_frame2)
We saw in the previous example that the set_index method returns a new DataFrame object and doesn't change the original DataFrame. If we set the optional parameter "inplace" to True, the DataFrame will be changed in place, i.e. no new object will be created:
city_frame = pd.DataFrame(cities)
city_frame.set_index("country", inplace=True)
print(city_frame)
Accessing Rows via Index Values
So far we have accessed DataFrames via the columns. It is often necessary to select certain rows via the index names. We will demonstrate now, how we can access rows from DataFrames via the locators 'loc' and 'iloc'. We will not cover 'ix' because it is deprecated and will be removed in the future.
We select all the German cities in the following example by using 'loc'. The result is a DataFrame:
city_frame = pd.DataFrame(cities,
columns=("name", "population"),
index=cities["country"])
print(city_frame.loc["Germany"])
It is also possible to simultaneously extracting rows by chosen more than on index labels. To do this we use a list of indices:
print(city_frame.loc[["Germany", "France"]])
We will also need to select pandas DataFrame rows based on conditions, which are applied to column values. We can use the operators '>', '=', '=', '<=', '!=' for this purpose. We select all cities with a population of more than two million in the following example:
condition = city_frame.population>2000000
condition
We can use this Boolean DataFrame condition
with loc
to finally create the selection:
print(city_frame.loc[condition])
It is also possible to logically combine more than one condition with &
and |
:
condition1 = (city_frame.population>1500000)
condition2 = (city_frame['name'].str.contains("m"))
print(city_frame.loc[condition1 & condition2])
We use a logical or |
in the following example tot see all cities of the Pandas DataFrame, where either the city name contains the letter 'm' or the population number is greater than three million:
condition1 = (city_frame.population>3000000)
condition2 = (city_frame['name'].str.contains("m"))
print(city_frame.loc[condition1 | condition2])
Accessing Rows by Position
The iloc
method of a Pandas DataFrame object can be used to select rows and columns by number, i.e. in the order that they appear in the data frame. iloc
allows selections of the rows, as if they were numbered by integers 0
, 1
, 2
, ....
We demonstrate this in the following example:
df = city_frame.iloc[3]
print(df)
To get a DataFrame with selected rows by numbers, we use a list of integers. We can see that we can change the order of the rows and we are also able to select rows multiple times:
df = city_frame.iloc[[3, 2, 0, 5, 0]]
print(df)
Sum and Cumulative Sum
The DataFrame object of Pandas provides a method to sum both columns and rows. Before we will explain the usage of the sum method, we will create a new DataFrame object on which we will apply our examples. We will start by creating an empty DataFrame without columns but an index. We populate this DataFrame by adding columns with random values:
years = range(2014, 2019)
cities = ["Zürich", "Freiburg", "München", "Konstanz", "Saarbrücken"]
shops = pd.DataFrame(index=years)
for city in cities:
shops.insert(loc=len(shops.columns),
column=city,
value=(np.random.uniform(0.7, 1, (5,)) * 1000).round(2))
print(shops)
Let's apply sum
to the DataFrame shops
:
shops.sum()
We can see that it summed up all the columns of our DataFrame. What about calculating the sum of the rows? We can do this by using the axis
parameter of sum
.
shops.sum(axis=1)
You only want to the the sums for the first, third and the last column and for all the years:
s = shops.iloc[:, [0, 2, -1]]
print(s)
print("and now the sum:")
print(s.sum())
Of course, you could have also have achieved it in the following way, if the column names are known:
shops[["Zürich", "München", "Saarbrücken"]].sum()
We can use "cumsum" to calculate the cumulative sum over the years:
x = shops.cumsum()
print(x)
Using the keyword parameter axis
with the value 1, we can build the cumulative sum over the rows:
x = shops.cumsum(axis=1)
print(x)
city_frame["population"] = x
print(city_frame)
Instead of replacing the values of the population column with the cumulative sum, we want to add the cumulative population sum as a new culumn with the name "cum_population".
city_frame = pd.DataFrame(cities,
columns=["country",
"population",
"cum_population"],
index=cities["name"])
city_frame
We can see that the column "cum_population" is set to Nan, as we haven't provided any data for it.
We will assign now the cumulative sums to this column:
city_frame["cum_population"] = city_frame["population"].cumsum()
city_frame
We can also include a column name which is not contained in the dictionary, when we create the DataFrame from the dictionary. In this case, all the values of this column will be set to NaN:
city_frame = pd.DataFrame(cities,
columns=["country",
"area",
"population"],
index=cities["name"])
print(city_frame)
# in a dictionary-like way:
print(city_frame["population"])
# as an attribute
print(city_frame.population)
print(type(city_frame.population))
city_frame.population
From the previous example, we can see that we have not copied the population column. "p" is a view on the data of city_frame.
city_frame["area"] = 1572
print(city_frame)
In this case, it will be definitely better to assign the exact area to the cities. The list with the area values needs to have the same length as the number of rows in our DataFrame.
# area in square km:
area = [1572, 891.85, 605.77, 1285,
105.4, 414.6, 228, 755,
525.2, 517, 101.9, 310.4,
181.8]
# area could have been designed as a list, a Series, an array or a scalar
city_frame["area"] = area
print(city_frame)
city_frame = city_frame.sort_values(by="area", ascending=False)
print(city_frame)
Let's assume, we have only the areas of London, Hamburg and Milan. The areas are in a series with the correct indices. We can assign this series as well:
city_frame = pd.DataFrame(cities,
columns=["country",
"area",
"population"],
index=cities["name"])
some_areas = pd.Series([1572, 755, 181.8],
index=['London', 'Hamburg', 'Milan'])
city_frame['area'] = some_areas
print(city_frame)
Inserting new columns into existing DataFrames
In the previous example we have added the column area at creation time. Quite often it will be necessary to add or insert columns into existing DataFrames. For this purpose the DataFrame class provides a method "insert", which allows us to insert a column into a DataFrame at a specified location:
insert(self, loc, column, value, allow_duplicates=False)`
The parameters are specified as:
Parameter | Meaning |
---|---|
loc | int |
This value should be within the range 0 <= loc <= len(columns) |
|
column | the column name |
value | can be a list, a Series an array or a scalar |
allow_duplicates | If allow_duplicates is False,an Exception will be raised, if column is already contained in the DataFrame. |
city_frame = pd.DataFrame(cities,
columns=["country",
"population"],
index=cities["name"])
idx = 1
city_frame.insert(loc=idx, column='area', value=area)
city_frame
growth = {"Switzerland": {"2010": 3.0, "2011": 1.8, "2012": 1.1, "2013": 1.9},
"Germany": {"2010": 4.1, "2011": 3.6, "2012": 0.4, "2013": 0.1},
"France": {"2010":2.0, "2011":2.1, "2012": 0.3, "2013": 0.3},
"Greece": {"2010":-5.4, "2011":-8.9, "2012":-6.6, "2013": -3.3},
"Italy": {"2010":1.7, "2011": 0.6, "2012":-2.3, "2013":-1.9}
}
growth_frame = pd.DataFrame(growth)
growth_frame
You like to have the years in the columns and the countries in the rows? No problem, you can transpose the data:
growth_frame.T
growth_frame = growth_frame.T
growth_frame2 = growth_frame.reindex(["Switzerland",
"Italy",
"Germany",
"Greece"])
print(growth_frame2)
import numpy as np
names = ['Frank', 'Eve', 'Stella', 'Guido', 'Lara']
index = ["January", "February", "March",
"April", "May", "June",
"July", "August", "September",
"October", "November", "December"]
df = pd.DataFrame((np.random.randn(12, 5)*1000).round(2),
columns=names,
index=index)
df