python-course.eu

11. Reading and Writing Data Files: ndarrays

By Bernd Klein. Last modified: 21 Feb 2024.

NumPy offers many ways to read data from files and to write data to files. In this chapter we discuss the corresponding functions: savetxt and loadtxt, tofile and fromfile, save and load, and genfromtxt.


Saving textfiles with savetxt

The first two functions we will cover are savetxt and loadtxt.

In the following simple example, we define an array x and save it as a textfile with savetxt:

import numpy as np

x = np.array([[1, 2, 3], 
              [4, 5, 6],
              [7, 8, 9]], np.int32)

np.savetxt("../data/test.txt", x)

The file "test.txt" is a textfile and its content looks like this:

bernd@andromeda:~/Dropbox/notebooks/numpy$ more test.txt
1.000000000000000000e+00 2.000000000000000000e+00 3.000000000000000000e+00
4.000000000000000000e+00 5.000000000000000000e+00 6.000000000000000000e+00
7.000000000000000000e+00 8.000000000000000000e+00 9.000000000000000000e+00

Attention: The above output has been created on the Linux command prompt!

It is also possible to write the array in a specific format, for example with three decimal places, or as integers that are padded with leading blanks if they have fewer than four digits. For this purpose we assign a format string to the third parameter 'fmt'. We saw in our first example that the default delimiter is a blank. We can change this behaviour by assigning another string to the parameter 'delimiter'.

np.savetxt("../data/test2.txt", x, fmt="%2.3f", delimiter=",")

The newly created file looks like this:

bernd@andromeda:~/Dropbox/notebooks/numpy$ more test2.txt 
1.000,2.000,3.000
4.000,5.000,6.000
7.000,8.000,9.000
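
The integer format mentioned above can be sketched as follows. This is only a minimal illustration: it writes to an in-memory io.StringIO buffer instead of a file on disk (an assumption made here to keep the snippet self-contained), and it uses the format "%4d", which pads every integer with leading blanks to a width of four characters.

```python
import io
import numpy as np

x = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]], np.int32)

# savetxt accepts file-like objects as well as file names
buffer = io.StringIO()
np.savetxt(buffer, x, fmt="%4d")

text = buffer.getvalue()
print(text)
```

Every line of the result consists of three right-aligned fields of width four, separated by the default delimiter, a single blank.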

The complete syntax of savetxt looks like this:

savetxt(fname, X, fmt='%.18e', delimiter=' ', newline='\n', header='', footer='', comments='# ')

| Parameter | Meaning |
| --- | --- |
| X | array_like. Data to be saved to a text file. |
| fmt | str or sequence of strs, optional. A single format (%10.5f), a sequence of formats, or a multi-format string, e.g. 'Iteration %d -- %10.5f', in which case 'delimiter' is ignored. For complex 'X', the legal options for 'fmt' are: a) a single specifier, "fmt='%.4e'", resulting in numbers formatted like "' (%s+%sj)' % (fmt, fmt)"; b) a full string specifying every real and imaginary part, e.g. "' %.4e %+.4ej %.4e %+.4ej %.4e %+.4ej'" for 3 columns; c) a list of specifiers, one per column. In this case, the real and imaginary part must have separate specifiers, e.g. "['%.3e + %.3ej', '(%.15e%+.15ej)']" for 2 columns. |
| delimiter | A string used for separating the columns. |
| newline | A string (e.g. "\n", "\r\n" or ",\n") which ends a line instead of the default line ending. |
| header | A string that will be written at the beginning of the file. |
| footer | A string that will be written at the end of the file. |
| comments | A string that will be prepended to the 'header' and 'footer' strings, to mark them as comments. The default is '# '. |
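
The effect of header, footer and comments might be easier to see in a small sketch. The array, the column names "a;b" and the footer text are made up for this illustration, and the data is written to an in-memory io.StringIO buffer rather than to a file.

```python
import io
import numpy as np

x = np.array([[1, 2], [3, 4]])

buffer = io.StringIO()
np.savetxt(buffer, x, fmt="%d", delimiter=";",
           header="a;b", footer="end of data", comments="# ")

header_text = buffer.getvalue()
print(header_text)

# loadtxt treats lines starting with '#' as comments by default,
# so the header and footer lines are skipped when reading back
buffer.seek(0)
y = np.loadtxt(buffer, delimiter=";")
print(y)
```

Since header and footer are prefixed with the comments string, the file can be read back with loadtxt without any additional parameters apart from the delimiter.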

Loading Textfiles with loadtxt

We will now read in the file "test.txt", which we have written in the previous subchapter:

y = np.loadtxt("../data/test.txt")
print(y)

OUTPUT:

[[1. 2. 3.]
 [4. 5. 6.]
 [7. 8. 9.]]

The file "test2.txt" is comma-separated, so we have to set the delimiter accordingly:

y = np.loadtxt("../data/test2.txt", delimiter=",")
print(y)

OUTPUT:

[[1. 2. 3.]
 [4. 5. 6.]
 [7. 8. 9.]]

It's also possible to choose the columns by index:

y = np.loadtxt("../data/test2.txt", delimiter=",", usecols=(0,2))
print(y)

OUTPUT:

[[1. 3.]
 [4. 6.]
 [7. 9.]]

In our next example we will read in the file "times_and_temperatures.txt", which we have created in the chapter on Generators of our Python tutorial. Every line contains a time in the format "hh:mm:ss" and a random temperature between 10.0 and 25.0 degrees. We have to convert the time string into a float number. The time will be expressed in decimal minutes, i.e. the seconds become a fraction of a minute. We first define a function which converts "hh:mm:ss" into minutes:

def time2float_minutes(time):
    """ turning times into decimal minutes """
    if isinstance(time, bytes):
        time = time.decode()   # turn a byte string into a unicode string
    t = time.split(":")
    minutes = float(t[0])*60 + float(t[1]) + float(t[2]) * 0.05 / 3
    return minutes

You might have noticed that we check whether time is a byte string. The reason for this is the use of our function "time2float_minutes" in loadtxt in the following example. Depending on the NumPy version and on the encoding parameter of loadtxt, the function is called with byte strings instead of unicode strings. The keyword parameter converters of loadtxt takes a dictionary which maps a column index to a function that converts the string data of this column into a float. Because this string data can be a byte string, we turn it into a unicode string in our function:

y = np.loadtxt("../data/times_and_temperatures.txt", 
               converters={ 0: time2float_minutes})
print(y)

OUTPUT:

[[ 360.    20.1]
 [ 361.5   16.1]
 [ 363.    16.9]
 ...
 [1375.5   22.5]
 [1377.    11.1]
 [1378.5   15.2]]

We can rewrite the function time2float_minutes without decoding. This is possible because split and float can be applied to both byte strings and unicode strings. If we use split on a byte string, the argument of split has to be a byte string as well. That is why we assign the corresponding value to the variable splitter:

def time2float_minutes(time):
    """ turning times into decimal minutes """
    splitter = b":" if type(time) == bytes else ":"
    t = time.split(splitter)
    minutes = float(t[0])*60 + float(t[1]) + float(t[2]) * 0.05 / 3
    return minutes

for t in ["06:00:10", "06:27:45", "12:59:59"]:
    print(time2float_minutes(t))

OUTPUT:

360.1666666666667
387.75
779.9833333333333
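
The following sketch shows the whole pipeline in a self-contained form. The three sample lines are made up for this illustration and are read from an io.StringIO object instead of "times_and_temperatures.txt". The helper to_minutes is a hypothetical variant of the function above that divides the seconds by 60 directly.

```python
import io
import numpy as np

def to_minutes(time):
    """Convert 'hh:mm:ss' (str or bytes) into decimal minutes."""
    if isinstance(time, bytes):
        time = time.decode()
    hours, minutes, seconds = (float(part) for part in time.split(":"))
    return hours * 60 + minutes + seconds / 60

# made-up sample data in the same layout: time, blank, temperature
sample = io.StringIO("06:00:00 20.1\n06:01:30 16.1\n06:03:00 16.9\n")

# column 0 is converted by to_minutes, column 1 by the default float conversion
y = np.loadtxt(sample, converters={0: to_minutes})
print(y)
```

The first column of the resulting array contains the times in decimal minutes, e.g. 361.5 for "06:01:30".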

tofile

tofile is a method that writes the content of an array to a file, either in binary format, which is the default, or in text format.

A.tofile(fid, sep="", format="%s")

The data of the ndarray A is always written in 'C' order, regardless of the order of A.

The data file written by this method can be reloaded with the function fromfile().

| Parameter | Meaning |
| --- | --- |
| fid | Can be either an open file object, or a string containing a filename. |
| sep | The string 'sep' defines the separator between array items for text output. If it is empty (''), a binary file is written, equivalent to file.write(a.tobytes()). |
| format | Format string for text file output. Each entry in the array is formatted to text by first converting it to the closest Python type, and then using 'format' % item. |

Remark:

Information on endianness and precision is lost. Therefore it may not be a good idea to use the function to archive data or transport data between machines with different endianness. Some of these problems can be overcome by outputting the data as text files, at the expense of speed and file size.

dt = np.dtype([('time', [('min', int), ('sec', int)]),
               ('temp', float)])
x = np.zeros((1,), dtype=dt)
x['time']['min'] = 10
x['temp'] = 98.25
print(x)

with open("test6.txt", "bw") as fh:
    x.tofile(fh)

OUTPUT:

[((10, 0), 98.25)]
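
The difference between text and binary output can be sketched as follows. The file names "a.txt" and "a.bin" and the temporary directory are assumptions made for this illustration.

```python
import os
import tempfile
import numpy as np

a = np.array([[1, 2], [3, 4]], dtype=np.int32)

with tempfile.TemporaryDirectory() as tmp:
    text_path = os.path.join(tmp, "a.txt")
    bin_path = os.path.join(tmp, "a.bin")

    a.tofile(text_path, sep=",")   # text output, items separated by commas
    a.tofile(bin_path)             # binary output, because sep is empty

    with open(text_path) as fh:
        text = fh.read()
    binary_size = os.path.getsize(bin_path)

print(text)          # only the flat sequence of items is written
print(binary_size)   # 4 items * 4 bytes each
```

In both cases neither the shape nor the data type of the array is stored in the file. The reader has to know them in advance.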

fromfile

fromfile reads in data which has been written with the tofile method. It is possible to read binary data if the data type is known. It is also possible to parse simply formatted text files. The data from the file is turned into an array.

The general syntax looks like this:

numpy.fromfile(file, dtype=float, count=-1, sep='')

| Parameter | Meaning |
| --- | --- |
| file | 'file' can be either a file object or the name of the file to read. |
| dtype | Defines the data type of the array which will be constructed from the file data. For binary files, it is used to determine the size and byte-order of the items in the file. |
| count | Defines the number of items which will be read. -1 means all items will be read. |
| sep | The string 'sep' defines the separator between the items, if the file is a text file. If it is empty (''), the file will be treated as a binary file. A space (" ") in a separator matches zero or more whitespace characters. A separator consisting solely of spaces has to match at least one whitespace. |
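
The parameters sep and count can be tried out on a small text file. The following is a minimal sketch: the file name "numbers.txt" and its content are made up for this illustration.

```python
import os
import tempfile
import numpy as np

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "numbers.txt")
    with open(path, "w") as fh:
        fh.write("1.5 2.5 3.5 4.5")

    # a non-empty sep tells fromfile to parse the file as text
    all_items = np.fromfile(path, dtype=float, sep=" ")
    # count limits the number of items which are read
    first_two = np.fromfile(path, dtype=float, count=2, sep=" ")

print(all_items)
print(first_two)
```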

We can now read back the file "test6.txt", which we have written in the tofile section. The dtype has to be the same as the one used for writing:

with open("test6.txt", "rb") as fh:
    y = np.fromfile(fh, dtype=dt)
y

OUTPUT:

array([((10, 0), 98.25)],
      dtype=[('time', [('min', '<i8'), ('sec', '<i8')]), ('temp', '<f8')])

In the next example we use seek to skip the beginning of a binary file before we read the rest with fromfile:

import numpy as np
import os

# The default integer type is platform dependent (np.int was removed
# in NumPy 1.24), so we specify the item size explicitly:
data = np.arange(50, dtype=np.int32)
data.tofile("../data/test4.txt")

with open("../data/test4.txt", "rb") as fh:
    # skip the first 32 items: 32 items * 4 bytes = 128 bytes
    fh.seek(128, os.SEEK_SET)
    x = np.fromfile(fh, dtype=np.int32)
print(x)

OUTPUT:

[32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49]

Attention:

It can cause problems to use tofile and fromfile for data storage, because the binary files generated are not platform independent. There is no byte-order or data-type information saved by tofile. Data can be stored in the platform independent .npy format using save and load instead.
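
The byte-order problem can be made visible with a small experiment. In this sketch the byte order is set explicitly in the dtype strings ("<i4" for little-endian, ">i4" for big-endian) to simulate reading the file on a machine with a different endianness. The file name "data.bin" is an assumption for this illustration.

```python
import os
import tempfile
import numpy as np

data = np.arange(3, dtype="<i4")   # little-endian 32-bit integers

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.bin")
    data.tofile(path)

    same_order = np.fromfile(path, dtype="<i4")
    wrong_order = np.fromfile(path, dtype=">i4")   # bytes read as big-endian

print(same_order)
print(wrong_order)
```

fromfile raises no error in the second case. It silently returns wrong numbers, e.g. 16777216 instead of 1, because the four bytes of each item are interpreted in reverse order.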

Best Practice to Load and Save Data

The recommended way to store and load data with NumPy is to use save and load. In the following example we use a temporary file:

import numpy as np
from tempfile import TemporaryFile

outfile = TemporaryFile()

x = np.arange(10)
np.save(outfile, x)

outfile.seek(0) # Only needed here to simulate closing & reopening file
np.load(outfile)

OUTPUT:

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
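
In contrast to tofile, the .npy format also stores the data type and the shape of the array. The following sketch illustrates this with a named file in a temporary directory. The file name "x.npy" is an assumption for this example.

```python
import os
import tempfile
import numpy as np

x = np.arange(6, dtype=np.int16).reshape(2, 3)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "x.npy")
    np.save(path, x)
    y = np.load(path)   # no dtype or shape has to be passed

print(y.dtype, y.shape)
print(y)
```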

And yet another way: genfromtxt

There is yet another way to read tabular input from a file to create arrays: genfromtxt. As the name implies, the input file is supposed to be a text file. The text file can also be compressed. genfromtxt can process the compression formats gzip and bzip2. The type of compression is determined by the extension of the file, i.e. '.gz' for gzip and '.bz2' for bzip2.

genfromtxt is slower than loadtxt, but it is capable of coping with missing data. It processes the file data in two passes. First it converts each line of the file into a sequence of strings. Then it converts the strings into the requested data type. loadtxt, on the other hand, works in one go, which is the reason why it is faster.
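
How genfromtxt copes with missing data can be sketched with a tiny comma-separated example. The data string is made up for this illustration and is read from an io.StringIO object. The second value of the second line is missing.

```python
import io
import numpy as np

data = "1,2,3\n4,,6\n"

# by default a missing float value is turned into nan
with_nan = np.genfromtxt(io.StringIO(data), delimiter=",")
print(with_nan)

# filling_values replaces missing entries by the given value instead
filled = np.genfromtxt(io.StringIO(data), delimiter=",", filling_values=0)
print(filled)
```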

Example with genfromtxt:

If the dtype is set to None, as in the following example, the dtypes will be determined by the contents of each column, individually.

sales = np.genfromtxt("../data/shop_sales_figures.txt", 
                      encoding='utf8',
                      dtype=None)
sales

OUTPUT:

array([['Year', 'Frankfurt', 'Munich', 'Zurich'],
       ['2000', '1245.89', '2220.00', '1936.25'],
       ['2001', '1289.99', '2405.14', '2064.32'],
       ['2002', '1379.04', '1984.90', '1879.30'],
       ['2003', '1450.89', '2178.34', '2027.51'],
       ['2004', '1680.98', '2163.86', '2147.96'],
       ['2005', '1860.33', '2079.97', '2201.28'],
       ['2006', '2103.54', '2310.92', '2466.17'],
       ['2007', '2354.54', '2360.46', '2634.07'],
       ['2008', '2648.10', '2433.92', '2839.12'],
       ['2009', '2971.56', '2566.19', '3093.72'],
       ['2010', '3338.08', '2661.59', '3351.77'],
       ['2011', '3747.93', '2774.12', '3643.60'],
       ['2012', '4209.09', '2901.24', '3972.25'],
       ['2013', '4726.47', '3026.45', '4331.24'],
       ['2014', '5307.72', '3162.80', '4732.13']], dtype='<U9')

All the columns are turned into strings instead of numbers because the first line contains column names, i.e. strings and not numbers. We can skip the first line by using the skip_header parameter. The result is a structured array, in which the years are integers and all the other columns are floats.

sales = np.genfromtxt("../data/shop_sales_figures.txt", 
                      encoding='utf8',
                      skip_header=1,
                      dtype=None)
sales

OUTPUT:

array([(2000, 1245.89, 2220.  , 1936.25),
       (2001, 1289.99, 2405.14, 2064.32),
       (2002, 1379.04, 1984.9 , 1879.3 ),
       (2003, 1450.89, 2178.34, 2027.51),
       (2004, 1680.98, 2163.86, 2147.96),
       (2005, 1860.33, 2079.97, 2201.28),
       (2006, 2103.54, 2310.92, 2466.17),
       (2007, 2354.54, 2360.46, 2634.07),
       (2008, 2648.1 , 2433.92, 2839.12),
       (2009, 2971.56, 2566.19, 3093.72),
       (2010, 3338.08, 2661.59, 3351.77),
       (2011, 3747.93, 2774.12, 3643.6 ),
       (2012, 4209.09, 2901.24, 3972.25),
       (2013, 4726.47, 3026.45, 4331.24),
       (2014, 5307.72, 3162.8 , 4732.13)],
      dtype=[('f0', '<i8'), ('f1', '<f8'), ('f2', '<f8'), ('f3', '<f8')])
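
Instead of skipping the header line, it may be convenient to use it for naming the fields. The following sketch uses the names parameter of genfromtxt for this purpose. The three-line data sample is a shortened, made-up excerpt in the layout of the file above and is read from an io.StringIO object.

```python
import io
import numpy as np

data = ("Year Frankfurt Munich\n"
        "2000 1245.89 2220.00\n"
        "2001 1289.99 2405.14\n")

# names=True takes the field names from the first line
sales = np.genfromtxt(io.StringIO(data), names=True,
                      dtype=None, encoding="utf8")

print(sales.dtype.names)
print(sales["Munich"])
```

The columns can then be accessed by name, e.g. sales["Munich"], instead of the automatically generated names 'f0', 'f1' and so on.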

recfromcsv(fname, **kwargs)

This is not really another way to read in csv data. 'recfromcsv' is basically a shortcut for

np.genfromtxt(filename, delimiter=",", dtype=None)
