32. Reading and Writing Data in Pandas

By Bernd Klein. Last modified: 03 Feb 2025.

All the powerful data structures like Series and DataFrame would be of little use if the Pandas module did not provide powerful functionality for reading in and writing out data. It is not only a matter of having functions for interacting with files. To be useful to data scientists, Pandas also needs functions which support the most important data formats, such as:

Delimiter-separated files, e.g. CSV
Microsoft Excel files
HTML
XML
JSON

Delimiter-separated Values

Most people treat CSV files as a synonym for delimiter-separated values files. They overlook the fact that CSV is an acronym for "comma-separated values", and in many situations the separator is not a comma. Pandas also uses "csv" in contexts in which "dsv" would be more appropriate.

Delimiter-separated values (DSV) store two-dimensional arrays of data (for example strings) by separating the values in each row with delimiter characters defined for this purpose. This way of storing data is often used in combination with spreadsheet programs, which can read in and write out data as DSV. DSV is also used as a general data exchange format.

We call a text file a "delimited text file" if it contains text in DSV format.

For example, the file dollar_euro.txt is a delimited text file and uses tabs (\t) as delimiters.
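
To make the idea concrete, here is a minimal sketch that splits tab-delimited text into rows and fields with the csv module from Python's standard library (not Pandas). The inline string is an assumption: it imitates the first two lines of dollar_euro.txt, with values taken from the exchange-rate table in this chapter.

```python
import csv
import io

# Assumed excerpt in the layout of dollar_euro.txt: tab-separated fields,
# one record per line (the real file is not read here).
text = (
    "Year\tAverage\tMin USD/EUR\tMax USD/EUR\tWorking days\n"
    "2016\t0.901696\t0.864379\t0.959785\t247\n"
)

# csv.reader splits every line at the given delimiter character
rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
header, first_record = rows

print(header)
print(first_record)
```

Every field arrives as a string here; converting the numbers is left to the caller, which is one of the chores read_csv takes over.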

Live Python training

Enjoying this page? We offer live Python training courses covering the content of this site.

See our Python training courses

See our Machine Learning with Python training courses

Reading CSV and DSV Files

Older versions of Pandas offered two ways to read in CSV files (or, to be precise, DSV files):

DataFrame.from_csv
read_csv

There was no big difference between those two functions: they had different default values in some cases (from_csv, for instance, used the first column as the index and tried to parse dates by default), and read_csv has more parameters. We will focus on read_csv, because DataFrame.from_csv was kept only for reasons of backwards compatibility; it was deprecated in Pandas 0.21 and removed in Pandas 1.0.

import pandas as pd

exchange_rates = pd.read_csv("../data1/dollar_euro.txt",
                             sep="\t")
print(exchange_rates)

OUTPUT:

    Year   Average  Min USD/EUR  Max USD/EUR  Working days
0   2016  0.901696     0.864379     0.959785           247
1   2015  0.901896     0.830358     0.947688           256
2   2014  0.753941     0.716692     0.823655           255
3   2013  0.753234     0.723903     0.783208           255
4   2012  0.778848     0.743273     0.827198           256
5   2011  0.719219     0.671953     0.775855           257
6   2010  0.755883     0.686672     0.837381           258
7   2009  0.718968     0.661376     0.796495           256
8   2008  0.683499     0.625391     0.802568           256
9   2007  0.730754     0.672314     0.775615           255
10  2006  0.797153     0.750131     0.845594           255
11  2005  0.805097     0.740357     0.857118           257
12  2004  0.804828     0.733514     0.847314           259
13  2003  0.885766     0.791766     0.963670           255
14  2002  1.060945     0.953562     1.165773           255
15  2001  1.117587     1.047669     1.192748           255
16  2000  1.085899     0.962649     1.211827           255
17  1999  0.939475     0.848176     0.998502           261

As we can see, read_csv automatically used the first line as the names for the columns. It is possible to give other names to the columns. For this purpose, we set the parameter "header" to 0, which tells read_csv that the first line is a header line to be replaced, and we assign a list with the new column names to the parameter "names". Our list contains only four names for the five columns of the file. In this case Pandas uses the first column, i.e. the years, as the index and applies the names to the remaining four columns:

import pandas as pd

exchange_rates = pd.read_csv("../data1/dollar_euro.txt",
                             sep="\t",
                             header=0,
                             names=["average", "min", "max", "days"])
print(exchange_rates)

OUTPUT:

       average       min       max  days
2016  0.901696  0.864379  0.959785   247
2015  0.901896  0.830358  0.947688   256
2014  0.753941  0.716692  0.823655   255
2013  0.753234  0.723903  0.783208   255
2012  0.778848  0.743273  0.827198   256
2011  0.719219  0.671953  0.775855   257
2010  0.755883  0.686672  0.837381   258
2009  0.718968  0.661376  0.796495   256
2008  0.683499  0.625391  0.802568   256
2007  0.730754  0.672314  0.775615   255
2006  0.797153  0.750131  0.845594   255
2005  0.805097  0.740357  0.857118   257
2004  0.804828  0.733514  0.847314   259
2003  0.885766  0.791766  0.963670   255
2002  1.060945  0.953562  1.165773   255
2001  1.117587  1.047669  1.192748   255
2000  1.085899  0.962649  1.211827   255
1999  0.939475  0.848176  0.998502   261
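
The relation between the number of names and the number of columns is easy to miss. The following sketch, which reads from an in-memory string instead of the real file (an assumption made to keep it self-contained), contrasts a list with one name per column with a list that is one name short. The five names in the first call are our own choice, not part of the original example.

```python
import io
import pandas as pd

# Assumed two-row excerpt in the layout of dollar_euro.txt
text = (
    "Year\tAverage\tMin USD/EUR\tMax USD/EUR\tWorking days\n"
    "2016\t0.901696\t0.864379\t0.959785\t247\n"
    "2015\t0.901896\t0.830358\t0.947688\t256\n"
)

# One name per column: all five columns are data, the index is 0, 1, ...
five = pd.read_csv(io.StringIO(text), sep="\t", header=0,
                   names=["year", "average", "min", "max", "days"])

# One name fewer than columns: the first column becomes the index
four = pd.read_csv(io.StringIO(text), sep="\t", header=0,
                   names=["average", "min", "max", "days"])

print(five.columns.tolist(), five.index.tolist())
print(four.columns.tolist(), four.index.tolist())
```

If the years are meant to be the index, stating this explicitly with index_col=0 together with a full list of names is probably easier to read than relying on the shorter list.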

Exercise 1

The file "countries_population.csv" is a csv file containing the population numbers of all countries (July 2014). The delimiter of the file is a space, and commas are used to separate groups of thousands in the numbers. Read the file into a DataFrame. The method 'head(n)' of a DataFrame can be used to output only the first n rows.

Solution:

pop = pd.read_csv("../data1/countries_population.csv", 
                  header=None,
                  names=["Country", "Population"],
                  index_col=0,
                  quotechar="'", 
                  sep=" ", 
                  thousands=",")
print(pop.head(5))

OUTPUT:

                Population
Country                   
China           1355692576
India           1236344631
European Union   511434812
United States    318892103
Indonesia        253609643
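
The solution combines three parameters that are worth seeing in isolation. The sketch below uses an in-memory string whose layout is only an assumption about countries_population.csv: a space as delimiter, names containing spaces enclosed in single quotes, and commas as thousands separators. The figures are taken from the output above.

```python
import io
import pandas as pd

# Assumed layout of the file; the real file is not read here
text = (
    "China 1,355,692,576\n"
    "India 1,236,344,631\n"
    "'European Union' 511,434,812\n"
)

pop = pd.read_csv(io.StringIO(text),
                  header=None,                      # the data has no header line
                  names=["Country", "Population"],
                  index_col=0,                      # use the country as index
                  quotechar="'",                    # keeps 'European Union' together
                  sep=" ",
                  thousands=",")                    # "1,355,692,576" -> 1355692576

print(pop)
```

Without thousands=",", the population column would be read as strings, because "1,355,692,576" is not a valid number for the parser.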

Writing CSV Files

We can create csv (or dsv) files with the method "to_csv". Before we do this, we will prepare some data to output, which we will write to a file. We have two csv files with population data for various countries. countries_male_population.csv contains the figures of the male populations and countries_female_population.csv correspondingly the numbers for the female populations. We will create a new csv file with the sum:

column_names = ["Country"] + list(range(2002, 2013))
male_pop = pd.read_csv("../data1/countries_male_population.csv",
                  header=None,
                  index_col=0,
                  names=column_names)

female_pop = pd.read_csv("../data1/countries_female_population.csv",
                         header=None,
                         index_col=0,
                         names=column_names)


population = male_pop + female_pop

population

	2002	2003	2004	2005	2006	2007	2008	2009	2010	2011	2012
Country
Australia	19640979.0	19872646	20091504	20339759	20605488	21015042	21431781	21874920	22342398	22620554	22683573
Austria	8139310.0	8067289	8140122	8206524	8265925	8298923	8331930	8355260	8375290	8404252	8443018
Belgium	10309725.0	10355844	10396421	10445852	10511382	10584534	10666866	10753080	10839905	10366843	11035958
Canada	NaN	31361611	31372587	31989454	32299496	32649482	32927372	33327337	33334414	33927935	34492645
Czech Republic	10269726.0	10203269	10211455	10220577	10251079	10287189	10381130	10467542	10506813	10532770	10505445
Denmark	5368354.0	5383507	5397640	5411405	5427459	5447084	5475791	5511451	5534738	5560628	5580516
Finland	5194901.0	5206295	5219732	5236611	5255580	5276955	5300484	5326314	5351427	5375276	5401267
France	59337731.0	59630121	59900680	62518571	62998773	63392140	63753140	64366962	64716310	65129746	65394283
Germany	82440309.0	82536680	82531671	82500849	82437995	82314906	82217837	82002356	81802257	81751602	81843743
Greece	10988000.0	11006377	11040650	11082751	11125179	11171740	11213785	11260402	11305118	11309885	11290067
Hungary	10174853.0	10142362	10116742	10097549	10076581	10066158	10045401	10030975	10014324	9985722	9957731
Iceland	286575.0	288471	290570	293577	299891	307672	315459	319368	317630	318452	319575
Ireland	3882683.0	3963636	4027732	4109173	4209019	4239848	4401335	4450030	4467854	4569864	4582769
Italy	56993742.0	57321070	57888245	58462375	58751711	59131287	59619290	60045068	60340328	60626442	60820696
Japan	127291000.0	127435000	127620000	127687000	127767994	127770000	127771000	127692000	127510000	128057000	127799000
Korea	47639618.0	47925318	48082163	48138077	48297184	48456369	48606787	48746693	48874539	49779440	50004441
Luxembourg	444050.0	448300	451600	455000	469086	476187	483799	493500	502066	511840	524853
Mexico	101826249.0	103039964	104213503	103001871	103946866	104874282	105790725	106682518	107550697	108396211	115682867
Netherlands	16105285.0	16192572	16258032	16305526	16334210	16357992	16405399	16485787	16574989	16655799	16730348
New Zealand	3939130.0	4009200	4062500	4100570	4139470	4228280	4268880	4315840	4367740	4405150	4433100
Norway	4524066.0	4552252	4577457	4606363	4640219	4681134	4737171	4799252	4858199	4920305	4985870
Poland	38632453.0	38218531	38190608	38173835	38157055	38125479	38115641	38135876	38167329	38200037	38538447
Portugal	10335559.0	10407465	10474685	10529255	10569592	10599095	10617575	10627250	10637713	10636979	10542398
Slovak Republic	5378951.0	5379161	5380053	5384822	5389180	5393637	5400998	5412254	5424925	5435273	5404322
Spain	40409330.0	41550584	42345342	43038035	43758250	44474631	45283259	45828172	45989016	46152926	46818221
Sweden	8909128.0	8940788	8975670	9011392	9047752	9113257	9182927	9256347	9340682	9415570	9482855
Switzerland	7261210.0	7313853	7364148	7415102	7459128	7508739	7593494	7701856	7785806	7870134	7954662
Turkey	NaN	70171979	70689500	71607500	72519974	72519974	70586256	71517100	72561312	73722988	74724269
United Kingdom	58706905.0	59262057	59699828	60059858	60412870	60781346	61179260	61595094	62026962	62498612	63256154
United States	277244916.0	288774226	290810719	294442683	297308143	300184434	304846731	305127551	307756577	309989078	312232049

population.to_csv("../data1/countries_total_population.csv")
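
If to_csv is called without a file name, it returns the CSV text as a string, which is handy for checking what would be written. A minimal sketch with two countries from the table above; reading from a string instead of a file is an assumption made to keep the example self-contained.

```python
import io
import pandas as pd

population = pd.DataFrame(
    {2002: [19640979, 8139310], 2003: [19872646, 8067289]},
    index=pd.Index(["Australia", "Austria"], name="Country"),
)

# Without a path, to_csv returns the text instead of writing a file
csv_text = population.to_csv()
print(csv_text)

# Reading the text back: the first column becomes the index again.
# Note that the year labels come back as strings ("2002"), not integers.
restored = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(restored.columns.tolist())
```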

We want to create a new DataFrame with all the information, i.e. female, male and complete population. This means that we have to introduce a hierarchical index. Before we do it on our DataFrame, we will introduce this problem in a simple example:

import pandas as pd

shop1 = {"foo":{2010:23, 2011:25}, "bar":{2010:13, 2011:29}}
shop2 = {"foo":{2010:223, 2011:225}, "bar":{2010:213, 2011:229}}

shop1 = pd.DataFrame(shop1)
shop2 = pd.DataFrame(shop2)
both_shops = shop1 + shop2
print("Sales of shop1:\n", shop1)
print("\nSales of both shops\n", both_shops)

OUTPUT:

Sales of shop1:
       foo  bar
2010   23   13
2011   25   29

Sales of both shops
       foo  bar
2010  246  226
2011  250  258

shops = pd.concat([shop1, shop2], keys=["one", "two"])
shops

		foo	bar
one	2010	23	13
one	2011	25	29
two	2010	223	213
two	2011	225	229

We want to swap the hierarchical indices. For this we will use 'swaplevel'. It returns a new DataFrame, so we have to assign the result:

shops = shops.swaplevel()
shops.sort_index(inplace=True)
shops

		foo	bar
2010	one	23	13
2010	two	223	213
2011	one	25	29
2011	two	225	229
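
Swapping the levels pays off when we want to select by the level that has moved to the front. The sketch below rebuilds the two shops and keeps the result of swaplevel in a new variable (the name by_year is our own choice), because swaplevel returns a new DataFrame and leaves the original untouched.

```python
import pandas as pd

shop1 = pd.DataFrame({"foo": {2010: 23, 2011: 25}, "bar": {2010: 13, 2011: 29}})
shop2 = pd.DataFrame({"foo": {2010: 223, 2011: 225}, "bar": {2010: 213, 2011: 229}})
shops = pd.concat([shop1, shop2], keys=["one", "two"])

# swaplevel returns a new object; sort_index groups the rows by year
by_year = shops.swaplevel().sort_index()
print(by_year)

# With the year as outer level, one year can be selected directly
print(by_year.loc[2010])
```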

We will go back to our initial problem with the population figures. We will apply the same steps to those DataFrames:

pop_complete = pd.concat([population.T, 
                          male_pop.T,
                          female_pop.T], 
                          keys=["total", "male", "female"])
df = pop_complete.swaplevel()
df.sort_index(inplace=True)
df[["Austria", "Australia", "France"]]

	Country	Austria	Australia	France
2002	female	4179743.0	9887846.0	30510073.0
	male	3959567.0	9753133.0	28827658.0
	total	8139310.0	19640979.0	59337731.0
2003	female	4158169.0	9999199.0	30655533.0
	male	3909120.0	9873447.0	28974588.0
	total	8067289.0	19872646.0	59630121.0
2004	female	4190297.0	10100991.0	30789154.0
	male	3949825.0	9990513.0	29111526.0
	total	8140122.0	20091504.0	59900680.0
2005	female	4220228.0	10218321.0	32147490.0
	male	3986296.0	10121438.0	30371081.0
	total	8206524.0	20339759.0	62518571.0
2006	female	4246571.0	10348070.0	32390087.0
	male	4019354.0	10257418.0	30608686.0
	total	8265925.0	20605488.0	62998773.0
2007	female	4261752.0	10570420.0	32587979.0
	male	4037171.0	10444622.0	30804161.0
	total	8298923.0	21015042.0	63392140.0
2008	female	4277716.0	10770864.0	32770860.0
	male	4054214.0	10660917.0	30982280.0
	total	8331930.0	21431781.0	63753140.0
2009	female	4287213.0	10986535.0	33208315.0
	male	4068047.0	10888385.0	31158647.0
	total	8355260.0	21874920.0	64366962.0
2010	female	4296197.0	11218144.0	33384930.0
	male	4079093.0	11124254.0	31331380.0
	total	8375290.0	22342398.0	64716310.0
2011	female	4308915.0	11359807.0	33598633.0
	male	4095337.0	11260747.0	31531113.0
	total	8404252.0	22620554.0	65129746.0
2012	female	4324983.0	11402769.0	33723892.0
	male	4118035.0	11280804.0	31670391.0
	total	8443018.0	22683573.0	65394283.0

df.to_csv("../data1/countries_total_population.csv")
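
A file written from a DataFrame with a hierarchical index stores the index levels as the leading columns. To get the same structure back, read_csv can be told to use both columns as index via index_col. The following sketch uses the Austrian figures for 2002 and 2003 from the table above and an in-memory string instead of a file, which is an assumption for the sake of a self-contained example.

```python
import io
import pandas as pd

male = pd.DataFrame({"Austria": {2002: 3959567, 2003: 3909120}})
female = pd.DataFrame({"Austria": {2002: 4179743, 2003: 4158169}})
total = male + female

complete = pd.concat([total, male, female],
                     keys=["total", "male", "female"])
complete = complete.swaplevel().sort_index()

csv_text = complete.to_csv()

# index_col=[0, 1] rebuilds the two-level index (year, category)
restored = pd.read_csv(io.StringIO(csv_text), index_col=[0, 1])
print(restored)
```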

Exercise 2

Read in the dsv file (csv) bundeslaender.txt. Create a new file with the columns 'land', 'area', 'female', 'male', 'population' and 'density' (inhabitants per square kilometre).
Print out the rows where the area is greater than 30000 and the population is greater than 10000.
Print the rows where the density is greater than 300.

lands = pd.read_csv('../data1/bundeslaender.txt', sep=" ")
print(lands.columns.values)

OUTPUT:

['land' 'area' 'male' 'female']

# swap the columns of our DataFrame:
lands = lands.reindex(columns=['land', 'area', 'female', 'male'])
lands[:2]

	land	area	female	male
0	Baden-Württemberg	35751.65	5465	5271
1	Bayern	70551.57	6366	6103

lands.insert(loc=len(lands.columns), 
             column='population', 
             value=lands['female'] + lands['male'])

lands[:3]

	land	area	female	male	population
0	Baden-Württemberg	35751.65	5465	5271	10736
1	Bayern	70551.57	6366	6103	12469
2	Berlin	891.85	1736	1660	3396

lands.insert(loc=len(lands.columns), 
             column='density', 
             value=(lands['population'] * 1000 / lands['area']).round(0))

lands[:4]

	land	area	female	male	population	density
0	Baden-Württemberg	35751.65	5465	5271	10736	300.0
1	Bayern	70551.57	6366	6103	12469	177.0
2	Berlin	891.85	1736	1660	3396	3808.0
3	Brandenburg	29478.61	1293	1267	2560	87.0

print(lands.loc[(lands.area>30000) & (lands.population>10000)])

OUTPUT:

                  land      area  female  male  population  density
0    Baden-Württemberg  35751.65    5465  5271       10736    300.0
1               Bayern  70551.57    6366  6103       12469    177.0
9  Nordrhein-Westfalen  34085.29    9261  8797       18058    530.0
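
The last task of the exercise, printing the rows with a density greater than 300, works in the same way. The sketch below is self-contained: instead of reading bundeslaender.txt it rebuilds only the five rows that appear in the outputs above, and it uses assign as an alternative to insert.

```python
import pandas as pd

# Only the rows shown in the outputs above; the real file has more rows
lands = pd.DataFrame({
    "land": ["Baden-Württemberg", "Bayern", "Berlin",
             "Brandenburg", "Nordrhein-Westfalen"],
    "area": [35751.65, 70551.57, 891.85, 29478.61, 34085.29],
    "female": [5465, 6366, 1736, 1293, 9261],
    "male": [5271, 6103, 1660, 1267, 8797],
})

# The population figures are apparently given in thousands,
# hence the factor 1000 in the density
lands = lands.assign(population=lands["female"] + lands["male"])
lands = lands.assign(
    density=(lands["population"] * 1000 / lands["area"]).round(0))

print(lands.loc[lands.density > 300])
```

Baden-Württemberg is not selected, because its rounded density is exactly 300.0 and the condition asks for values greater than 300.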

Reading and Writing Excel Files

It is also possible to read and write Microsoft Excel files. The Pandas functionality for reading and writing Excel files relies on the modules 'xlrd' (for the old .xls format) and 'openpyxl' (for .xlsx files). These modules are not automatically installed with Pandas, so you may have to install them manually!

We will use a simple Excel document to demonstrate the reading capabilities of Pandas. The document sales.xls contains two sheets, one called 'week1' and the other one 'week2'.

An Excel file can be read in with the Pandas function "read_excel". This is demonstrated in the following example Python code:

with pd.ExcelFile("../data1/sales.xls") as excel_file:
    sheet = pd.read_excel(excel_file)
sheet

	Weekday	Sales
0	Monday	123432.980000
1	Tuesday	122198.650200
2	Wednesday	134418.515220
3	Thursday	131730.144916
4	Friday	128173.431003

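The document "sales.xls" contains two sheets, but we were only able to read in the first one with "read_excel". A complete Excel document, which can consist of an arbitrary number of sheets, can be read in completely like this. We open the file again, because the ExcelFile object from above is closed when its with block ends:

docu = {}
with pd.ExcelFile("../data1/sales.xls") as excel_file:
    # parse every sheet into its own DataFrame, keyed by sheet name
    for sheet_name in excel_file.sheet_names:
        docu[sheet_name] = excel_file.parse(sheet_name)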

for sheet_name in docu:
    print("\n" + sheet_name + ":\n", docu[sheet_name])

OUTPUT:

week1:
      Weekday          Sales
0     Monday  123432.980000
1    Tuesday  122198.650200
2  Wednesday  134418.515220
3   Thursday  131730.144916
4     Friday  128173.431003

week2:
      Weekday          Sales
0     Monday  223277.980000
1    Tuesday  234441.879000
2  Wednesday  246163.972950
3   Thursday  241240.693491
4     Friday  230143.621590

We will now calculate the average sales numbers of the two weeks:

average = docu["week1"].copy()
average["Sales"] = (docu["week1"]["Sales"] + docu["week2"]["Sales"]) / 2
print(average)

OUTPUT:

     Weekday          Sales
0     Monday  173355.480000
1    Tuesday  178320.264600
2  Wednesday  190291.244085
3   Thursday  186485.419203
4     Friday  179158.526297
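
Adding the two columns by hand works for two sheets. For an arbitrary number of sheets, one possible approach (our own suggestion, not part of the original solution) is to place the 'Sales' columns side by side with concat and take the row-wise mean. The sketch rebuilds the two sheets from the values printed above so that it runs without the Excel file.

```python
import pandas as pd

weekdays = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"]
docu = {
    "week1": pd.DataFrame({"Weekday": weekdays,
                           "Sales": [123432.98, 122198.6502, 134418.51522,
                                     131730.144916, 128173.431003]}),
    "week2": pd.DataFrame({"Weekday": weekdays,
                           "Sales": [223277.98, 234441.879, 246163.97295,
                                     241240.693491, 230143.62159]}),
}

# One column per sheet, aligned on the weekday
sales = pd.concat(
    [sheet.set_index("Weekday")["Sales"] for sheet in docu.values()],
    axis=1, keys=list(docu))

# Row-wise mean over all sheets
average = sales.mean(axis=1)
print(average)
```

Because the columns are aligned on the weekday rather than on the row position, this also works if the sheets list the days in a different order.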

We will save the DataFrame 'average' in a new document with 'week1' and 'week2' as additional sheets as well:

with pd.ExcelWriter('../data1/sales_average.xlsx') as writer:
    docu['week1'].to_excel(writer, sheet_name='week1')
    docu['week2'].to_excel(writer, sheet_name='week2')
    average.to_excel(writer, sheet_name='average')
    # no explicit save needed: the file is written when the with block ends
