The underlying idea of a DataFrame is based on spreadsheets. We can see the data structure of a DataFrame as tabular and spreadsheet-like. A DataFrame logically corresponds to a "sheet" of an Excel document. A DataFrame has both a row and a column index.
Like a spreadsheet or Excel sheet, a DataFrame object contains an ordered collection of columns. Each column consists of a unique data typye, but different columns can have different types, e.g. the first column may consist of integers, while the second one consists of boolean values and so on.
There is a close connection between the DataFrames and the Series of Pandas. A DataFrame can be seen as a concatenation of Series, each Series having the same index, i.e. the index of the DataFrame.
We will demonstrate this in the following example.
We define the following three Series:
import pandas as pd
years = range(2014, 2018)
shop1 = pd.Series([2409.14, 2941.01, 3496.83, 3119.55], index=years)
shop2 = pd.Series([1203.45, 3441.62, 3007.83, 3619.53], index=years)
shop3 = pd.Series([3412.12, 3491.16, 3457.19, 1963.10], index=years)
What happens, if we concatenate these "shop" Series? Pandas provides a concat function for this purpose:
pd.concat([shop1, shop2, shop3])
This result is not what we have intended or expected. The reason is that concat used 0 as the default for the axis parameter. Let's do it with "axis=1":
shops_df = pd.concat([shop1, shop2, shop3], axis=1)
shops_df
Let's do some fine sanding by giving names to the columns:
cities = ["Zürich", "Winterthur", "Freiburg"]
shops_df.columns = cities
print(shops_df)
# alternative way: give names to series:
shop1.name = "Zürich"
shop2.name = "Winterthur"
shop3.name = "Freiburg"
print("------")
shops_df2 = pd.concat([shop1, shop2, shop3], axis=1)
print(shops_df2)
This was nice, but what kind of data type is our result?
print(type(shops_df))
This means, we can arrange or concat Series into DataFrames!
cities = {"name": ["London", "Berlin", "Madrid", "Rome",
"Paris", "Vienna", "Bucharest", "Hamburg",
"Budapest", "Warsaw", "Barcelona",
"Munich", "Milan"],
"population": [8615246, 3562166, 3165235, 2874038,
2273305, 1805681, 1803425, 1760433,
1754000, 1740119, 1602386, 1493900,
1350680],
"country": ["England", "Germany", "Spain", "Italy",
"France", "Austria", "Romania",
"Germany", "Hungary", "Poland", "Spain",
"Germany", "Italy"]}
city_frame = pd.DataFrame(cities)
city_frame
city_frame.columns.values
ordinals = ["first", "second", "third", "fourth",
"fifth", "sixth", "seventh", "eigth",
"ninth", "tenth", "eleventh", "twelvth",
"thirteenth"]
city_frame = pd.DataFrame(cities, index=ordinals)
city_frame
Rearranging the Order of Columns
We can also define and rearrange the order of the columns at the time of creation of the DataFrame. This makes also sure that we will have a defined ordering of our columns, if we create the DataFrame from a dictionary. Dictionaries are not ordered, as you have seen in our chapter on Dictionaries in our Python tutorial, so we cannot know in advance what the ordering of our columns will be:
city_frame = pd.DataFrame(cities,
columns=["name",
"country",
"population"])
city_frame
We change both the column order and the ordering of the index with the function reindex
with the following code:
city_frame.reindex(index=[0, 2, 4, 6, 8, 10, 12, 1, 3, 5, 7, 9, 11],
columns=['country', 'name', 'population'])
Now, we want to rename our columns. For this purpose, we will use the DataFrame method 'rename'. This method supports two calling conventions
- (index=index_mapper, columns=columns_mapper, ...)
- (mapper, axis={'index', 'columns'}, ...)
We will rename the columns of our DataFrame into Romanian names in the following example. We set the parameter inplace to True so that our DataFrame will be changed instead of returning a new DataFrame, if inplace is set to False, which is the default!
city_frame.rename(columns={"name":"Nume",
"country":"țară",
"population":"populație"},
inplace=True)
city_frame
city_frame = pd.DataFrame(cities,
columns=["name", "population"],
index=cities["country"])
city_frame
Alternatively, we can change an existing DataFrame. We can us the method set_index to turn a column into an index. "set_index" does not work in-place, it returns a new data frame with the chosen column as the index:
city_frame = pd.DataFrame(cities)
city_frame2 = city_frame.set_index("country")
print(city_frame2)
We saw in the previous example that the set_index method returns a new DataFrame object and doesn't change the original DataFrame. If we set the optional parameter "inplace" to True, the DataFrame will be changed in place, i.e. no new object will be created:
city_frame = pd.DataFrame(cities)
city_frame.set_index("country", inplace=True)
print(city_frame)
city_frame = pd.DataFrame(cities,
columns=("name", "population"),
index=cities["country"])
print(city_frame.loc["Germany"])
print(city_frame.loc[["Germany", "France"]])
print(city_frame.loc[city_frame.population>2000000])
print(city_frame.sum())
city_frame["population"].sum()
We can use "cumsum" to calculate the cumulative sum:
x = city_frame["population"].cumsum()
print(x)
city_frame["population"] = x
print(city_frame)
Instead of replacing the values of the population column with the cumulative sum, we want to add the cumulative population sum as a new culumn with the name "cum_population".
city_frame = pd.DataFrame(cities,
columns=["country",
"population",
"cum_population"],
index=cities["name"])
city_frame
We can see that the column "cum_population" is set to Nan, as we haven't provided any data for it.
We will assign now the cumulative sums to this column:
city_frame["cum_population"] = city_frame["population"].cumsum()
city_frame
We can also include a column name which is not contained in the dictionary, when we create the DataFrame from the dictionary. In this case, all the values of this column will be set to NaN:
city_frame = pd.DataFrame(cities,
columns=["country",
"area",
"population"],
index=cities["name"])
print(city_frame)
# in a dictionary-like way:
print(city_frame["population"])
# as an attribute
print(city_frame.population)
print(type(city_frame.population))
city_frame.population
From the previous example, we can see that we have not copied the population column. "p" is a view on the data of city_frame.
city_frame["area"] = 1572
print(city_frame)
In this case, it will be definitely better to assign the exact area to the cities. The list with the area values needs to have the same length as the number of rows in our DataFrame.
# area in square km:
area = [1572, 891.85, 605.77, 1285,
105.4, 414.6, 228, 755,
525.2, 517, 101.9, 310.4,
181.8]
# area could have been designed as a list, a Series, an array or a scalar
city_frame["area"] = area
print(city_frame)
city_frame = city_frame.sort_values(by="area", ascending=False)
print(city_frame)
Let's assume, we have only the areas of London, Hamburg and Milan. The areas are in a series with the correct indices. We can assign this series as well:
city_frame = pd.DataFrame(cities,
columns=["country",
"area",
"population"],
index=cities["name"])
some_areas = pd.Series([1572, 755, 181.8],
index=['London', 'Hamburg', 'Milan'])
city_frame['area'] = some_areas
print(city_frame)
Inserting new columns into existing DataFrames
In the previous example we have added the column area at creation time. Quite often it will be necessary to add or insert columns into existing DataFrames. For this purpose the DataFrame class provides a method "insert", which allows us to insert a column into a DataFrame at a specified location:
insert(self, loc, column, value, allow_duplicates=False)`
The parameters are specified as:
Parameter | Meaning |
---|---|
loc | int |
This value should be within the range 0 <= loc <= len(columns) |
|
column | the column name |
value | can be a list, a Series an array or a scalar |
allow_duplicates | If allow_duplicates is False,an Exception will be raised, if column is already contained in the DataFrame. |
city_frame = pd.DataFrame(cities,
columns=["country",
"population"],
index=cities["name"])
idx = 1
city_frame.insert(loc=idx, column='area', value=area)
city_frame
growth = {"Switzerland": {"2010": 3.0, "2011": 1.8, "2012": 1.1, "2013": 1.9},
"Germany": {"2010": 4.1, "2011": 3.6, "2012": 0.4, "2013": 0.1},
"France": {"2010":2.0, "2011":2.1, "2012": 0.3, "2013": 0.3},
"Greece": {"2010":-5.4, "2011":-8.9, "2012":-6.6, "2013": -3.3},
"Italy": {"2010":1.7, "2011": 0.6, "2012":-2.3, "2013":-1.9}
}
growth_frame = pd.DataFrame(growth)
growth_frame
You like to have the years in the columns and the countries in the rows? No problem, you can transpose the data:
growth_frame.T
growth_frame = growth_frame.T
growth_frame2 = growth_frame.reindex(["Switzerland",
"Italy",
"Germany",
"Greece"])
print(growth_frame2)
import numpy as np
names = ['Frank', 'Eve', 'Stella', 'Guido', 'Lara']
index = ["January", "February", "March",
"April", "May", "June",
"July", "August", "September",
"October", "November", "December"]
df = pd.DataFrame((np.random.randn(12, 5)*1000).round(2),
columns=names,
index=index)
df