Numerical & Scientific Computing with Python: Pandas Tutorial

# Introduction to Pandas

The pandas we are writing about in this chapter have nothing to do with the cute panda bears, and cute bears are not what our visitors expect in a Python tutorial either. Pandas is a Python module that rounds out the capabilities of NumPy, SciPy and Matplotlib. The name pandas is derived from "Python and data analysis" and "panel data".

There is often some confusion about whether Pandas is an alternative to NumPy, SciPy and Matplotlib. The truth is that it is built on top of NumPy. This means that NumPy is required by pandas. SciPy and Matplotlib, on the other hand, are not required by pandas, but they are extremely useful. That's why the Pandas project lists them as "optional dependencies".

Pandas is a software library written for the Python programming language. It is used for data manipulation and analysis. It provides special data structures and operations for the manipulation of numerical tables and time series. Pandas is free software released under the three-clause BSD license.

## Data Structures

We will start with the following two important data structures of Pandas:

• Series and
• DataFrame

### Series

A Series is a one-dimensional labelled array-like object. It is capable of holding any data type, e.g. integers, floats, strings, Python objects, and so on. It can be seen as a data structure with two arrays: one functions as the index, i.e. the labels, and the other one contains the actual data.

We define a simple Series object in the following example by instantiating a Pandas Series object with a list. We will see later that we can also use other data objects, for example NumPy arrays and dictionaries, to instantiate a Series object.

import pandas as pd
S = pd.Series([11, 28, 72, 3, 5, 8])
S

The above code returned the following output:
0    11
1    28
2    72
3     3
4     5
5     8
dtype: int64

We haven't defined an index in our example, but we see two columns in our output: The right column contains our data, whereas the left column contains the index. Pandas created a default index starting with 0 going to 5, which is the length of the data minus 1.

We can directly access the index and the values of our Series S:

print(S.index)
print(S.values)

RangeIndex(start=0, stop=6, step=1)
[11 28 72  3  5  8]


If we compare this to creating an array in numpy, there are still lots of similarities:

import numpy as np
X = np.array([11, 28, 72, 3, 5, 8])
print(X)
print(S.values)
# both are the same type:
print(type(S.values), type(X))

[11 28 72  3  5  8]
[11 28 72  3  5  8]
<class 'numpy.ndarray'> <class 'numpy.ndarray'>


So far our Series have not been very different from the ndarrays of NumPy. This changes as soon as we start defining Series objects with individual indices:

fruits = ['apples', 'oranges', 'cherries', 'pears']
quantities = [20, 33, 52, 10]
S = pd.Series(quantities, index=fruits)
S

The previous Python code returned the following result:
apples      20
oranges     33
cherries    52
pears       10
dtype: int64

A big advantage over NumPy arrays is obvious from the previous example: we can use arbitrary indices.

If we add two series with the same indices, we get a new series with the same index, and the corresponding values will be added:

fruits = ['apples', 'oranges', 'cherries', 'pears']
S = pd.Series([20, 33, 52, 10], index=fruits)
S2 = pd.Series([17, 13, 31, 32], index=fruits)
print(S + S2)
print("sum of S: ", sum(S))

apples      37
oranges     46
cherries    83
pears       42
dtype: int64
sum of S:  115


The indices do not have to be the same for the Series addition. The index of the result will be the "union" of both indices. If a label doesn't occur in both Series, the corresponding value of the result will be NaN:

fruits = ['peaches', 'oranges', 'cherries', 'pears']
fruits2 = ['raspberries', 'oranges', 'cherries', 'pears']
S = pd.Series([20, 33, 52, 10], index=fruits)
S2 = pd.Series([17, 13, 31, 32], index=fruits2)
print(S + S2)

cherries       83.0
oranges        46.0
peaches         NaN
pears          42.0
raspberries     NaN
dtype: float64

In the extreme case, the two indices share no labels at all, so every value of the result is NaN:

fruits = ['apples', 'oranges', 'cherries', 'pears']
fruits_gr = ['μήλα', 'πορτοκάλια', 'κεράσια', 'αχλάδια']
S = pd.Series([20, 33, 52, 10], index=fruits)
S2 = pd.Series([17, 13, 31, 32], index=fruits_gr)
print(S+S2)

apples       NaN
cherries     NaN
oranges      NaN
pears        NaN
αχλάδια      NaN
κεράσια      NaN
μήλα         NaN
πορτοκάλια   NaN
dtype: float64
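
If the NaN values are not wanted, the method "add" with its "fill_value" parameter is one way around them. The following is a minimal sketch that reuses the fruit data from above; treating a missing fruit as a stock of 0 is our assumption, not something the data dictates:

```python
import pandas as pd

S = pd.Series([20, 33, 52, 10],
              index=['peaches', 'oranges', 'cherries', 'pears'])
S2 = pd.Series([17, 13, 31, 32],
               index=['raspberries', 'oranges', 'cherries', 'pears'])

# A label that is missing from one Series is treated as 0 (our assumption)
# instead of turning the result into NaN
total = S.add(S2, fill_value=0)
print(total)
```

The result is still a float Series, but "peaches" and "raspberries" now keep the value they had in the one Series that contains them.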


It's possible to access single values of a Series or more than one value by a list of indices:

print(S['apples'])

20

print(S[['apples', 'oranges', 'cherries']])

apples      20
oranges     33
cherries    52
dtype: int64
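
Besides the square bracket notation, a Series offers the indexers "loc" (access by label) and "iloc" (access by position). The following sketch rebuilds the fruit Series from above to contrast the two; the variable names are chosen for illustration only:

```python
import pandas as pd

S = pd.Series([20, 33, 52, 10],
              index=['apples', 'oranges', 'cherries', 'pears'])

by_label = S.loc['cherries']     # access by index label
by_position = S.iloc[2]          # access by integer position

# Label slices include both end points, position slices exclude the end
label_slice = S.loc['oranges':'cherries']
position_slice = S.iloc[1:2]

print(by_label, by_position)
print(label_slice)
print(position_slice)
```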


Similar to Numpy we can use scalar operations or mathematical functions on a series:

import numpy as np
print((S + 3) * 4)
print("======================")
print(np.sin(S))

apples       92
oranges     144
cherries    220
pears        52
dtype: int64
======================
apples      0.912945
oranges     0.999912
cherries    0.986628
pears      -0.544021
dtype: float64


#### pandas.Series.apply

Series.apply(func, convert_dtype=True, args=(), **kwds)

The function "func" will be applied to the Series and it returns either a Series or a DataFrame, depending on "func".

| Parameter | Meaning |
| --- | --- |
| func | A function: either a NumPy function that is applied to the entire Series or a Python function that is applied to every single value of the Series. |
| convert_dtype | A boolean value. If True (default), apply tries to find a better dtype for the elementwise results; if False, the result is left as dtype=object. Deprecated since pandas 2.1. |
| args | Positional arguments that are passed to "func" in addition to the values from the Series. |
| **kwds | Additional keyword arguments that are passed to "func". |

Example:

S.apply(np.sin)

The previous Python code returned the following output:
apples      0.912945
oranges     0.999912
cherries    0.986628
pears      -0.544021
dtype: float64

We can also use Python lambda functions. Let's assume we have the following task: we check the amount of fruit for every kind. If there are fewer than 50 available, we will augment the stock by 10:

S.apply(lambda x: x if x >= 50 else x + 10)

The above code returned the following output:
apples      30
oranges     43
cherries    52
pears       20
dtype: int64
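
The parameters "args" and "**kwds" from the table above can be sketched with a named function instead of a lambda. The function "restock" and its parameters "threshold" and "extra" are hypothetical names that we introduce only for this illustration:

```python
import pandas as pd

S = pd.Series([20, 33, 52, 10],
              index=['apples', 'oranges', 'cherries', 'pears'])

def restock(amount, threshold, extra=10):
    """Hypothetical helper: add `extra` if the stock is below `threshold`."""
    return amount if amount >= threshold else amount + extra

# `args` supplies `threshold` positionally; the keyword `extra` is
# forwarded to `restock` unchanged
restocked = S.apply(restock, args=(50,), extra=10)
print(restocked)
```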

Filtering with a boolean array:

S[S>30]

The previous Python code returned the following:
oranges     33
cherries    52
dtype: int64
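
Boolean arrays can also be combined. The sketch below assumes the fruit Series from above; note that the element-wise operators "&", "|" and "~" are needed here, and that each comparison has to be put in parentheses:

```python
import pandas as pd

S = pd.Series([20, 33, 52, 10],
              index=['apples', 'oranges', 'cherries', 'pears'])

# Element-wise "and" of two conditions
mask = (S > 15) & (S < 50)

medium = S[mask]     # values between 15 and 50
rest = S[~mask]      # everything else
print(medium)
print(rest)
```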

A series can be seen as an ordered Python dictionary with a fixed length.

"apples" in S

The Python code above returned the following:
True

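We can even pass a dictionary to Series when we create it. We get a Series with the dict's keys as the index. In older pandas versions, such as the one that produced the output below, the index was sorted; since pandas 0.23 (with Python 3.6 or later) the insertion order of the dictionary is kept.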

cities = {"London":   8615246,
"Berlin":   3562166,
"Rome":     2874038,
"Paris":    2273305,
"Vienna":   1805681,
"Bucharest":1803425,
"Hamburg":  1760433,
"Budapest": 1754000,
"Warsaw":   1740119,
"Barcelona":1602386,
"Munich":   1493900,
"Milan":    1350680}
city_series = pd.Series(cities)
print(city_series)

Barcelona    1602386
Berlin       3562166
Bucharest    1803425
Budapest     1754000
Hamburg      1760433
London       8615246
Milan        1350680
Munich       1493900
Paris        2273305
Rome         2874038
Vienna       1805681
Warsaw       1740119
dtype: int64
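
If a particular order is needed, it is safer not to rely on the order in which the Series was created. A minimal sketch with a shortened version of the cities dictionary, using the methods "sort_index" and "sort_values":

```python
import pandas as pd

# Shortened version of the dictionary from above
cities = {"London": 8615246,
          "Berlin": 3562166,
          "Rome": 2874038,
          "Paris": 2273305}
city_series = pd.Series(cities)

by_name = city_series.sort_index()                  # alphabetical by label
by_size = city_series.sort_values(ascending=False)  # largest city first
print(by_name)
print(by_size)
```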


### NaN - Missing Data

A common problem in data analysis is missing data. Pandas makes it as easy as possible to work with missing data.

If we look once more at our previous example, we can see that the index of our series is the same as the keys of the dictionary we used to create city_series. Now we want to use an index which only partly overlaps with the dictionary keys. We have already seen that we can pass a list or a tuple to the keyword argument 'index' to define the index. In our next example, the list (or tuple) passed to the keyword parameter 'index' will not be equal to the keys. This means that some cities from the dictionary will be missing, and that two cities ("Zurich" and "Stuttgart") don't occur in the dictionary.

my_cities = ["London", "Paris", "Zurich", "Berlin",
"Stuttgart", "Hamburg"]
my_city_series = pd.Series(cities,
index=my_cities)
my_city_series

We received the following output:
London       8615246.0
Paris        2273305.0
Zurich             NaN
Berlin       3562166.0
Stuttgart          NaN
Hamburg      1760433.0
dtype: float64

Due to the NaN values, the population values for the other cities are turned into floats. There is no missing data in the following example, so the values are integers:

my_cities = ["London", "Paris", "Berlin", "Hamburg"]
my_city_series = pd.Series(cities,
index=my_cities)
my_city_series

The above code returned the following:
London     8615246
Paris      2273305
Berlin     3562166
Hamburg    1760433
dtype: int64
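
If the integers should survive missing data, the nullable integer dtype "Int64" (with a capital "I") of newer pandas versions is one option. The sketch below is only one possible way to do this and uses a shortened cities dictionary; missing entries are shown as <NA> instead of NaN:

```python
import pandas as pd

# Shortened version of the dictionary from above
cities = {"London": 8615246,
          "Paris": 2273305,
          "Berlin": 3562166,
          "Hamburg": 1760433}
my_cities = ["London", "Paris", "Zurich", "Berlin",
             "Stuttgart", "Hamburg"]

# Convert to the nullable integer dtype first, then bring in the new index
int_series = pd.Series(cities).astype("Int64").reindex(my_cities)
print(int_series)
```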

#### The Methods isnull() and notnull()

We can see that the cities which are not included in the dictionary get the value NaN assigned. NaN stands for "not a number". It can also be seen as meaning "missing" in our example.

We can check for missing values with the methods isnull and notnull:

my_cities = ["London", "Paris", "Zurich", "Berlin",
"Stuttgart", "Hamburg"]
my_city_series = pd.Series(cities,
index=my_cities)
print(my_city_series.isnull())

London       False
Paris        False
Zurich        True
Berlin       False
Stuttgart     True
Hamburg      False
dtype: bool

print(my_city_series.notnull())

London        True
Paris         True
Zurich       False
Berlin        True
Stuttgart    False
Hamburg       True
dtype: bool


We also get a NaN if a value in the dictionary is None:

d = {"a":23, "b":45, "c":None, "d":0}
S = pd.Series(d)
print(S)

a    23.0
b    45.0
c     NaN
d     0.0
dtype: float64

pd.isnull(S)

The above code returned the following result:
a    False
b    False
c     True
d    False
dtype: bool

pd.notnull(S)

The above Python code returned the following:
a     True
b     True
c    False
d     True
dtype: bool
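
Because the results of isnull and notnull are boolean Series, they can be summed up or used as filters. A small sketch with the dictionary from above; note that the value 0 of "d" is not counted as missing:

```python
import pandas as pd

S = pd.Series({"a": 23, "b": 45, "c": None, "d": 0})

# True counts as 1, so the sum is the number of missing values
n_missing = int(S.isnull().sum())

# Keep only the entries which are present
present = S[S.notnull()]

print(n_missing)
print(present)
```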

#### Filtering out Missing Data

It's possible to filter out missing data with the Series method dropna. It returns a Series which consists only of non-null data:

print(my_city_series.dropna())

London     8615246.0
Paris      2273305.0
Berlin     3562166.0
Hamburg    1760433.0
dtype: float64


#### Filling in Missing Data

In many cases you don't want to filter out missing data, but you want to fill in appropriate data for the empty gaps. A suitable method in many situations will be fillna:

print(my_city_series.fillna(0))

London       8615246.0
Paris        2273305.0
Zurich             0.0
Berlin       3562166.0
Stuttgart          0.0
Hamburg      1760433.0
dtype: float64


Okay, that's not what we call "fill in appropriate data for the empty gaps". If we call fillna with a dict, we can provide the appropriate data, i.e. the population of Zurich and Stuttgart:

missing_cities = {"Stuttgart":597939, "Zurich":378884}
my_city_series.fillna(missing_cities)

The above Python code returned the following:
London       8615246.0
Paris        2273305.0
Zurich        378884.0
Berlin       3562166.0
Stuttgart     597939.0
Hamburg      1760433.0
dtype: float64

### DataFrame

The underlying idea of a DataFrame is based on spreadsheets. We can see the data structure of a DataFrame as tabular and spreadsheet-like. It contains an ordered collection of columns. Each column consists of a single data type, but different columns can have different types, e.g. the first column may consist of integers, while the second one consists of boolean values and so on.

A DataFrame has a row and column index; it's like a dict of Series with a common index.

cities = {"name": ["London", "Berlin", "Madrid", "Rome",
"Paris", "Vienna", "Bucharest", "Hamburg",
"Budapest", "Warsaw", "Barcelona",
"Munich", "Milan"],
"population": [8615246, 3562166, 3165235, 2874038,
2273305, 1805681, 1803425, 1760433,
1754000, 1740119, 1602386, 1493900,
1350680],
"country": ["England", "Germany", "Spain", "Italy",
"France", "Austria", "Romania",
"Germany", "Hungary", "Poland", "Spain",
"Germany", "Italy"]}
city_frame = pd.DataFrame(cities)
city_frame

This gets us the following:
    country       name  population
0   England     London     8615246
1   Germany     Berlin     3562166
2     Spain     Madrid     3165235
3     Italy       Rome     2874038
4    France      Paris     2273305
5   Austria     Vienna     1805681
6   Romania  Bucharest     1803425
7   Germany    Hamburg     1760433
8   Hungary   Budapest     1754000
9    Poland     Warsaw     1740119
10    Spain  Barcelona     1602386
11  Germany     Munich     1493900
12    Italy      Milan     1350680

#### Custom Index

We can see that an index (0,1,2, ...) has been automatically assigned to the DataFrame. We can also assign a custom index to the DataFrame object:

ordinals = ["first", "second", "third", "fourth",
"fifth", "sixth", "seventh", "eigth",
"ninth", "tenth", "eleventh", "twelvth",
"thirteenth"]
city_frame = pd.DataFrame(cities, index=ordinals)
city_frame

After having executed the Python code above we received the following output:
            country       name  population
first       England     London     8615246
second      Germany     Berlin     3562166
third         Spain     Madrid     3165235
fourth        Italy       Rome     2874038
fifth        France      Paris     2273305
sixth       Austria     Vienna     1805681
seventh     Romania  Bucharest     1803425
eigth       Germany    Hamburg     1760433
ninth       Hungary   Budapest     1754000
tenth        Poland     Warsaw     1740119
eleventh      Spain  Barcelona     1602386
twelvth     Germany     Munich     1493900
thirteenth    Italy      Milan     1350680

#### Rearranging the Order of Columns

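We can also define or rearrange the order of the columns with the "columns" parameter. The pandas version used for the outputs above sorted the columns of a DataFrame built from a dict alphabetically; since pandas 0.23 (with Python 3.6 or later) the insertion order of the dict is kept.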

city_frame = pd.DataFrame(cities,
columns=["name",
"country",
"population"],
index=ordinals)
city_frame

The code above returned the following:
                 name  country  population
first          London  England     8615246
second         Berlin  Germany     3562166
third          Madrid    Spain     3165235
fourth           Rome    Italy     2874038
fifth           Paris   France     2273305
sixth          Vienna  Austria     1805681
seventh     Bucharest  Romania     1803425
eigth         Hamburg  Germany     1760433
ninth        Budapest  Hungary     1754000
tenth          Warsaw   Poland     1740119
eleventh    Barcelona    Spain     1602386
twelvth        Munich  Germany     1493900
thirteenth      Milan    Italy     1350680

#### Existing Column as the Index of a DataFrame

We want to create a more useful index in the following example. We will use the country name as the index:

city_frame = pd.DataFrame(cities,
columns=["name", "population"],
index=cities["country"])
city_frame

The above Python code returned the following result:
              name  population
England     London     8615246
Germany     Berlin     3562166
Spain       Madrid     3165235
Italy         Rome     2874038
France       Paris     2273305
Austria     Vienna     1805681
Romania  Bucharest     1803425
Germany    Hamburg     1760433
Hungary   Budapest     1754000
Poland      Warsaw     1740119
Spain    Barcelona     1602386
Germany     Munich     1493900
Italy        Milan     1350680

Alternatively, we can use the method set_index to turn a column into an index. "set_index" does not work in place by default; it returns a new DataFrame with the chosen column as the index:

city_frame = pd.DataFrame(cities)
city_frame2 = city_frame.set_index("country")
print(city_frame2)

              name  population
country
England     London     8615246
Germany     Berlin     3562166
Spain       Madrid     3165235
Italy         Rome     2874038
France       Paris     2273305
Austria     Vienna     1805681
Romania  Bucharest     1803425
Germany    Hamburg     1760433
Hungary   Budapest     1754000
Poland      Warsaw     1740119
Spain    Barcelona     1602386
Germany     Munich     1493900
Italy        Milan     1350680


We saw in the previous example that the set_index method returns a new DataFrame object and doesn't change the original DataFrame. If we set the optional parameter "inplace" to True, the DataFrame will be changed in place, i.e. no new object will be created:

city_frame = pd.DataFrame(cities)
city_frame.set_index("country", inplace=True)
print(city_frame)

              name  population
country
England     London     8615246
Germany     Berlin     3562166
Spain       Madrid     3165235
Italy         Rome     2874038
France       Paris     2273305
Austria     Vienna     1805681
Romania  Bucharest     1803425
Germany    Hamburg     1760433
Hungary   Budapest     1754000
Poland      Warsaw     1740119
Spain    Barcelona     1602386
Germany     Munich     1493900
Italy        Milan     1350680
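
The reverse operation is "reset_index", which turns the index back into an ordinary column and restores the default integer index. A minimal sketch with a shortened cities table (the three rows are chosen for brevity only):

```python
import pandas as pd

city_frame = pd.DataFrame({"name": ["London", "Berlin", "Madrid"],
                           "country": ["England", "Germany", "Spain"],
                           "population": [8615246, 3562166, 3165235]})

indexed = city_frame.set_index("country")   # "country" becomes the index
restored = indexed.reset_index()            # "country" is a column again

print(indexed)
print(restored)
```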


#### Sum and Cumulative Sum

We can calculate the sum of all the columns of a DataFrame or the sum of certain columns:

print(city_frame.sum())

name          LondonBerlinMadridRomeParisViennaBucharestHamb...
population                                             33800614
dtype: object

city_frame["population"].sum()

The previous code returned the following output:
33800614
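
The concatenated city names in the result of city_frame.sum() are rarely what we want. One way to restrict the sum to the numerical columns is the parameter "numeric_only"; the sketch below uses a shortened cities table:

```python
import pandas as pd

city_frame = pd.DataFrame({"name": ["London", "Berlin", "Madrid"],
                           "population": [8615246, 3562166, 3165235]},
                          index=["England", "Germany", "Spain"])

# Only numerical columns take part in the sum; "name" is skipped
totals = city_frame.sum(numeric_only=True)
print(totals)
```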

We can use "cumsum" to calculate the cumulative sum:

x = city_frame["population"].cumsum()
print(x)

country
England     8615246
Germany    12177412
Spain      15342647
Italy      18216685
France     20489990
Austria    22295671
Romania    24099096
Germany    25859529
Hungary    27613529
Poland     29353648
Spain      30956034
Germany    32449934
Italy      33800614
Name: population, dtype: int64


#### Assigning New Values to Columns

x is a Pandas Series. We can reassign the previously calculated cumulative sums to the population column:

city_frame["population"] = x
print(city_frame)

              name  population
country
England     London     8615246
Germany     Berlin    12177412
Spain       Madrid    15342647
Italy         Rome    18216685
France       Paris    20489990
Austria     Vienna    22295671
Romania  Bucharest    24099096
Germany    Hamburg    25859529
Hungary   Budapest    27613529
Poland      Warsaw    29353648
Spain    Barcelona    30956034
Germany     Munich    32449934
Italy        Milan    33800614


Instead of replacing the values of the population column with the cumulative sum, we want to add the cumulative population sum as a new column with the name "cum_population".

city_frame = pd.DataFrame(cities,
columns=["country",
"population",
"cum_population"],
index=cities["name"])
city_frame

The above Python code returned the following:
           country  population  cum_population
London     England     8615246             NaN
Berlin     Germany     3562166             NaN
Madrid       Spain     3165235             NaN
Rome         Italy     2874038             NaN
Paris       France     2273305             NaN
Vienna     Austria     1805681             NaN
Bucharest  Romania     1803425             NaN
Hamburg    Germany     1760433             NaN
Budapest   Hungary     1754000             NaN
Warsaw      Poland     1740119             NaN
Barcelona    Spain     1602386             NaN
Munich     Germany     1493900             NaN
Milan        Italy     1350680             NaN

We can see that the column "cum_population" is set to NaN, as we haven't provided any data for it.

We will now assign the cumulative sums to this column:

city_frame["cum_population"] = city_frame["population"].cumsum()
city_frame

After having executed the Python code above we received the following:
           country  population  cum_population
London     England     8615246         8615246
Berlin     Germany     3562166        12177412
Madrid       Spain     3165235        15342647
Rome         Italy     2874038        18216685
Paris       France     2273305        20489990
Vienna     Austria     1805681        22295671
Bucharest  Romania     1803425        24099096
Hamburg    Germany     1760433        25859529
Budapest   Hungary     1754000        27613529
Warsaw      Poland     1740119        29353648
Barcelona    Spain     1602386        30956034
Munich     Germany     1493900        32449934
Milan        Italy     1350680        33800614
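
A new column does not have to be declared in advance. As an alternative sketch, the method "assign" returns a new DataFrame with the additional column and leaves the original untouched; the shortened table below is for illustration only:

```python
import pandas as pd

city_frame = pd.DataFrame({"country": ["England", "Germany", "Spain"],
                           "population": [8615246, 3562166, 3165235]},
                          index=["London", "Berlin", "Madrid"])

# assign() returns a copy with the extra column; city_frame is not changed
with_cum = city_frame.assign(
    cum_population=lambda frame: frame["population"].cumsum())
print(with_cum)
```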

We can also include a column name which is not contained in the dictionary. In this case, all the values of this column will be set to NaN:

city_frame = pd.DataFrame(cities,
columns=["country",
"area",
"population"],
index=cities["name"])
print(city_frame)

           country area  population
London     England  NaN     8615246
Berlin     Germany  NaN     3562166
Madrid       Spain  NaN     3165235
Rome         Italy  NaN     2874038
Paris       France  NaN     2273305
Vienna     Austria  NaN     1805681
Bucharest  Romania  NaN     1803425
Hamburg    Germany  NaN     1760433
Budapest   Hungary  NaN     1754000
Warsaw      Poland  NaN     1740119
Barcelona    Spain  NaN     1602386
Munich     Germany  NaN     1493900
Milan        Italy  NaN     1350680


#### Accessing the Columns of a DataFrame

There are two ways to access a column of a DataFrame. The result is in both cases a Series:

# in a dictionary-like way:
print(city_frame["population"])

London       8615246
Berlin       3562166
Madrid       3165235
Rome         2874038
Paris        2273305
Vienna       1805681
Bucharest    1803425
Hamburg      1760433
Budapest     1754000
Warsaw       1740119
Barcelona    1602386
Munich       1493900
Milan        1350680
Name: population, dtype: int64

# as an attribute
print(city_frame.population)

London       8615246
Berlin       3562166
Madrid       3165235
Rome         2874038
Paris        2273305
Vienna       1805681
Bucharest    1803425
Hamburg      1760433
Budapest     1754000
Warsaw       1740119
Barcelona    1602386
Munich       1493900
Milan        1350680
Name: population, dtype: int64

print(type(city_frame.population))

<class 'pandas.core.series.Series'>

city_frame.population

This gets us the following:
London       8615246
Berlin       3562166
Madrid       3165235
Rome         2874038
Paris        2273305
Vienna       1805681
Bucharest    1803425
Hamburg      1760433
Budapest     1754000
Warsaw       1740119
Barcelona    1602386
Munich       1493900
Milan        1350680
Name: population, dtype: int64

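From the previous examples, we can see that accessing a column does not require copying it. In the pandas version used here, the Series we get from city_frame.population is a view on the data of city_frame, so use the method "copy" if you need an independent Series.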
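
Whether a column access gives us a view or a copy depends on the pandas version, so an explicit copy is the safe choice when the column is going to be changed. A minimal sketch with a shortened table; the name "p" is just an illustrative variable:

```python
import pandas as pd

city_frame = pd.DataFrame({"population": [8615246, 3562166, 3165235]},
                          index=["London", "Berlin", "Madrid"])

# An explicit copy is independent of the DataFrame
p = city_frame["population"].copy()
p["London"] = 0

print(p["London"])
print(city_frame.loc["London", "population"])
```

The change to "p" does not show up in city_frame.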

#### Assigning New Values to a Column

The column area is still not defined. We can set all elements of the column to the same value:

city_frame["area"] = 1572
print(city_frame)

           country  area  population
London     England  1572     8615246
Berlin     Germany  1572     3562166
Madrid       Spain  1572     3165235
Rome         Italy  1572     2874038
Paris       France  1572     2273305
Vienna     Austria  1572     1805681
Bucharest  Romania  1572     1803425
Hamburg    Germany  1572     1760433
Budapest   Hungary  1572     1754000
Warsaw      Poland  1572     1740119
Barcelona    Spain  1572     1602386
Munich     Germany  1572     1493900
Milan        Italy  1572     1350680


In this case, it will definitely be better to assign the exact area to each city. The list with the area values needs to have the same length as the number of rows in our DataFrame.

# area in square km:
area = [1572, 891.85, 605.77, 1285,
105.4, 414.6, 228, 755,
525.2, 517, 101.9, 310.4,
181.8]
city_frame["area"] = area
print(city_frame)

           country     area  population
London     England  1572.00     8615246
Berlin     Germany   891.85     3562166
Madrid       Spain   605.77     3165235
Rome         Italy  1285.00     2874038
Paris       France   105.40     2273305
Vienna     Austria   414.60     1805681
Bucharest  Romania   228.00     1803425
Hamburg    Germany   755.00     1760433
Budapest   Hungary   525.20     1754000
Warsaw      Poland   517.00     1740119
Barcelona    Spain   101.90     1602386
Munich     Germany   310.40     1493900
Milan        Italy   181.80     1350680
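
A column can also be computed from other columns. As a sketch, we derive the population density from "population" and "area"; the column name "density" is our own choice, and the table is shortened to three cities:

```python
import pandas as pd

city_frame = pd.DataFrame({"area": [1572.0, 891.85, 605.77],
                           "population": [8615246, 3562166, 3165235]},
                          index=["London", "Berlin", "Madrid"])

# The division is done element-wise, row by row (inhabitants per square km)
city_frame["density"] = city_frame["population"] / city_frame["area"]
print(city_frame)
```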


#### Accessing the Rows of a DataFrame

We can also access the rows directly. We access the data of Hamburg by its index label with "loc" (the older "ix" indexer was removed in pandas 1.0):

city_frame.loc["Hamburg"]

The above Python code returned the following output:
country       Germany
area              755
population    1760433
Name: Hamburg, dtype: object
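
Rows can be selected by label with "loc" or by position with "iloc". The following sketch uses a shortened table with three cities to show both, together with the selection of a single cell and of a sub-table:

```python
import pandas as pd

city_frame = pd.DataFrame(
    {"country": ["England", "Germany", "Italy"],
     "area": [1572.0, 755.0, 181.8],
     "population": [8615246, 1760433, 1350680]},
    index=["London", "Hamburg", "Milan"])

by_label = city_frame.loc["Hamburg"]      # row as a Series, chosen by label
by_position = city_frame.iloc[1]          # the same row, chosen by position
population = city_frame.loc["Hamburg", "population"]   # a single cell
subset = city_frame.loc[["London", "Milan"], ["country", "area"]]

print(by_label)
print(population)
print(subset)
```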

#### Sorting DataFrames

Let's sort our DataFrame according to the city area:

city_frame = city_frame.sort_values(by="area", ascending=False)
print(city_frame)

           country     area  population
London     England  1572.00     8615246
Rome         Italy  1285.00     2874038
Berlin     Germany   891.85     3562166
Hamburg    Germany   755.00     1760433
Madrid       Spain   605.77     3165235
Budapest   Hungary   525.20     1754000
Warsaw      Poland   517.00     1740119
Vienna     Austria   414.60     1805681
Munich     Germany   310.40     1493900
Bucharest  Romania   228.00     1803425
Milan        Italy   181.80     1350680
Paris       France   105.40     2273305
Barcelona    Spain   101.90     1602386


Let's assume we only have the areas of London, Hamburg and Milan. The areas are in a Series with the correct indices. We can assign this Series as well; the rows without a matching label are filled with NaN:

city_frame = pd.DataFrame(cities,
columns=["name",
"country",
"area",
"population"],
index=ordinals)
some_areas = pd.Series([1572, 755, 181.8],
index=['first', 'eigth', 'thirteenth'])
city_frame['area'] = some_areas
print(city_frame)

                 name  country    area  population
first          London  England  1572.0     8615246
second         Berlin  Germany     NaN     3562166
third          Madrid    Spain     NaN     3165235
fourth           Rome    Italy     NaN     2874038
fifth           Paris   France     NaN     2273305
sixth          Vienna  Austria     NaN     1805681
seventh     Bucharest  Romania     NaN     1803425
eigth         Hamburg  Germany   755.0     1760433
ninth        Budapest  Hungary     NaN     1754000
tenth          Warsaw   Poland     NaN     1740119
eleventh    Barcelona    Spain     NaN     1602386
twelvth        Munich  Germany     NaN     1493900
thirteenth      Milan    Italy   181.8     1350680


A nested dictionary of dicts can be passed to a DataFrame as well. The keys of the outer dictionary are taken as the columns, and the inner keys, i.e. the keys of the nested dictionaries, are used as the row indices:

growth = {"Switzerland": {"2010": 3.0, "2011": 1.8, "2012": 1.1, "2013": 1.9},
"Germany": {"2010": 4.1, "2011": 3.6, "2012":	0.4, "2013": 0.1},
"France": {"2010":2.0,  "2011":2.1, "2012": 0.3, "2013": 0.3},
"Greece": {"2010":-5.4, "2011":-8.9, "2012":-6.6, "2013":	-3.3},
"Italy": {"2010":1.7, "2011":	0.6, "2012":-2.3, "2013":-1.9}
}

growth_frame = pd.DataFrame(growth)
growth_frame

The previous Python code returned the following result:
      France  Germany  Greece  Italy  Switzerland
2010     2.0      4.1    -5.4    1.7          3.0
2011     2.1      3.6    -8.9    0.6          1.8
2012     0.3      0.4    -6.6   -2.3          1.1
2013     0.3      0.1    -3.3   -1.9          1.9

Would you like to have the years in the columns and the countries in the rows? No problem, you can transpose the data:

growth_frame.T

The above code returned the following:
             2010  2011  2012  2013
France        2.0   2.1   0.3   0.3
Germany       4.1   3.6   0.4   0.1
Greece       -5.4  -8.9  -6.6  -3.3
Italy         1.7   0.6  -2.3  -1.9
Switzerland   3.0   1.8   1.1   1.9

growth_frame = growth_frame.T
growth_frame2 = growth_frame.reindex(["Switzerland",
"Italy",
"Germany",
"Greece"])
print(growth_frame2)

             2010  2011  2012  2013
Switzerland   3.0   1.8   1.1   1.9
Italy         1.7   0.6  -2.3  -1.9
Germany       4.1   3.6   0.4   0.1
Greece       -5.4  -8.9  -6.6  -3.3
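
"reindex" does not only reorder and drop rows. If a label is requested which is not in the original index, a new row is created. The sketch below uses a shortened growth table; "Spain" is a label we add for illustration, and filling it with 0.0 is purely an assumption to show the parameter "fill_value":

```python
import pandas as pd

# Shortened version of the growth data from above
growth = {"Switzerland": {"2010": 3.0, "2011": 1.8},
          "Germany": {"2010": 4.1, "2011": 3.6}}
growth_frame = pd.DataFrame(growth).T

# "Germany" is dropped, the unknown label "Spain" becomes a row of NaN
with_nan = growth_frame.reindex(["Switzerland", "Spain"])

# The same, but the new row is filled with a value of our choice
with_zero = growth_frame.reindex(["Switzerland", "Spain"], fill_value=0.0)

print(with_nan)
print(with_zero)
```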


#### Filling a DataFrame with random values:

import numpy as np
names = ['Frank', 'Eve', 'Stella', 'Guido', 'Lara']
index = ["January", "February", "March",
"April", "May", "June",
"July", "August", "September",
"October", "November", "December"]
df = pd.DataFrame(np.random.randn(12, 5)*1000,
columns=names,
index=index)
df

After having executed the Python code above we received the following (the values are random, so your output will differ):
                  Frank           Eve        Stella         Guido          Lara
January       26.682579  -1910.853342    414.792564     61.359616   -343.289129
February     589.933361   -397.636536   -350.355907  -1687.693688   2472.369862
March       -885.948225   -258.235038   -509.921313   1156.018346   -402.605559
April        -61.782245    907.893918  -1314.143911   -587.755316   -550.862545
May         -159.261854   1054.479234   -122.259786   -476.353132    365.718142
June         673.164789    849.095629    359.573705    946.862095    -96.540020
July        2233.529280   -917.197411   -492.496425  -1834.028866   -462.349069
August      -495.832124  -1269.013450   -606.736727   -203.656006  -1938.187807
September  -1000.806914  -1229.204807   1883.344890   -732.405557   -990.587027
October     2514.200089   1058.995673   -817.665647    424.993285    706.702958
November    -927.163926   -977.320218  -1003.029741    648.017429  -1476.915087
December     355.977666    249.110737  -1343.996907    345.108074    833.047705
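
To get the same random numbers on every run, we can use a seeded random number generator. This is a sketch with NumPy's "default_rng"; the seed 42 is an arbitrary choice, and the resulting values are not the ones shown above:

```python
import numpy as np
import pandas as pd

names = ['Frank', 'Eve', 'Stella', 'Guido', 'Lara']
index = ["January", "February", "March",
         "April", "May", "June",
         "July", "August", "September",
         "October", "November", "December"]

def random_frame(seed):
    """Build a 12 x 5 DataFrame of scaled standard normal values."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame(rng.standard_normal((12, 5)) * 1000,
                        columns=names,
                        index=index)

df = random_frame(42)
df_again = random_frame(42)   # the same seed gives the same values
print(df.equals(df_again))
```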

#### Reading a csv File

We want to read in a csv file with the population data of all countries (July 2014). The delimiter of the file is a space, and commas are used to separate groups of thousands in the numbers:

pop = pd.read_csv("countries_population.csv",
names=["Country", "Population"],
index_col=0,
quotechar="'",
sep=" ",
thousands=",")
print(pop)

                                               Population
Country
China                                          1355692576
India                                          1236344631
European Union                                  511434812
United States                                   318892103
Indonesia                                       253609643
Brazil                                          202656788
Pakistan                                        196174380
Nigeria                                         177155754
Russia                                          142470272
Japan                                           127103388
Mexico                                          120286655
Philippines                                     107668231
Ethiopia                                         96633458
Vietnam                                          93421835
Egypt                                            86895099
Turkey                                           81619392
Germany                                          80996685
Iran                                             80840713
Congo, Democratic Republic of the                77433744
Thailand                                         67741401
France                                           66259012
United Kingdom                                   63742977
Italy                                            61680122
Burma                                            55746253
Tanzania                                         49639138
Korea, South                                     49039986
South Africa                                     48375645
Spain                                            47737941
Colombia                                         46245297
...                                                   ...
Saint Kitts and Nevis                               51538
Northern Mariana Islands                            51483
Faroe Islands                                       49947
Turks and Caicos Islands                            49070
Sint Maarten                                        39689
Liechtenstein                                       37313
San Marino                                          32742
British Virgin Islands                              32680
Saint Martin                                        31530
Monaco                                              30508
Gibraltar                                           29185
Palau                                               21186
Anguilla                                            16086
Wallis and Futuna                                   15561
Tuvalu                                              10782
Cook Islands                                        10134
Nauru                                                9488
Saint Helena, Ascension, and Tristan da Cunha        7776
Saint Barthelemy                                     7267
Saint Pierre and Miquelon                            5716
Montserrat                                           5215
Falkland Islands (Islas Malvinas)                    3361
Norfolk Island                                       2210
Svalbard                                             1872
Christmas Island                                     1530
Tokelau                                              1337
Niue                                                 1190
Holy See (Vatican City)                               842
Cocos (Keeling) Islands                               596
Pitcairn Islands                                       48
[238 rows x 1 columns]
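
The file countries_population.csv itself is not shown here. To try out the parameters without it, we can read from an in-memory text instead. The three lines below are our assumption of what such a file might look like (space as delimiter, single quotes around names containing a space, commas as thousands separators); the numbers are taken from the output above:

```python
import io
import pandas as pd

# Assumed file layout, not the actual content of the original file
sample = """China 1,355,692,576
India 1,236,344,631
'United States' 318,892,103
"""

pop = pd.read_csv(io.StringIO(sample),
                  names=["Country", "Population"],
                  index_col=0,
                  quotechar="'",
                  sep=" ",
                  thousands=",")
print(pop)
```

Due to the parameter "thousands" the population values are parsed as integers and not as strings.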
