Section 2: Reading Data from Files
In this section, we assume that we have data stored in a text file that we
want to process in a program. The data could have been generated by
another program that we wrote or we could have obtained it from some other
source such as the Internet.
Python provides us with a way of reading data from a text file one line at
a time. Each line of data is provided to us as a str object. Data in a
text file can therefore be viewed as a (listof str), where each str in the
list represents an entire line of data in the file.
Before we can read data from a file, we must first open that file for
reading. In Python, we achieve this using the built-in open function.
When we are done with a file, we must remember to close it. Python makes
this particularly easy to do through the use of the with construct:
with open('myFile.txt', 'U') as f:
    ...f
This segment of code opens the file named 'myFile.txt' for reading and
assigns it to the variable f. When the block of code associated with the
with statement has finished executing, the file is automatically closed
for us. Notice that the open function takes two arguments. The first is
the name of the file; the second is a mode that specifies how the file is
to be used. The 'U' or universal mode indicates that the file is to be
opened for reading with universal support for end-of-line markers. This
is a technical issue that we will not consider in detail. Suffice it to
say that an end-of-line is represented differently on different systems.
When we specify universal mode, Python gracefully handles these different
representations. So, if we obtain a file from a colleague who uses
Windows, we can transfer it to an OS X or Linux system and have our Python
program read it without having to worry about the fact that end-of-line
markers on these systems are different.
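To make the behaviour described above concrete, here is a minimal sketch.
It assumes Python 3, where text mode already applies universal end-of-line
handling and the 'U' flag is no longer accepted (it was removed in Python
3.11), so the sketch opens the file with mode 'r' instead. The file
contents and the temporary location are made up for illustration.

```python
import os
import tempfile

# Hypothetical sample file, written with Windows-style (\r\n) line endings.
path = os.path.join(tempfile.mkdtemp(), 'myFile.txt')
with open(path, 'wb') as out:
    out.write(b'43.63\r\n23.51\r\n')

# Assumption: Python 3, where text mode translates \r\n, \r and \n
# to '\n' by default, so no 'U' flag is needed.
with open(path, 'r') as f:
    lines = f.readlines()
    open_inside = not f.closed   # the file is still open inside the block

closed_after = f.closed          # the with statement has closed it for us

print(lines)
print(open_inside, closed_after)
```

Each line arrives ending in a plain '\n' even though the file on disk
uses '\r\n', and the file is closed as soon as the with block finishes.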
Having opened a file for reading, the following provides a template for
reading data from the file:
for line in f:
    ...line
where line is of type str. The problem of processing data in a file is
now reduced to that of processing a string that represents data on a
single line in that file.
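As a rough illustration of the template, the sketch below fills in the
body of the loop by recording the type and contents of each line. The
file name, its contents and the variable seen are assumptions made for
the example; the open call assumes Python 3 text mode.

```python
import os
import tempfile

# Hypothetical two-line sample file.
path = os.path.join(tempfile.mkdtemp(), 'myFile.txt')
with open(path, 'w') as out:
    out.write('first line\nsecond line\n')

seen = []
with open(path, 'r') as f:
    for line in f:
        # line is a str and still carries its end-of-line character
        seen.append((type(line).__name__, line))

print(seen)
```

Note that each string keeps its trailing '\n', which matters whenever the
line is compared against or printed alongside other text.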
In the case where each line in the file contains a string that represents
a single numeric value, it is not hard to process an entire line of
data. Python has constructors for int and float that convert a string
representation of a numeric value to the corresponding type. So, for
example, float('34.256') converts the str object '34.256' to the float
whose value is 34.256.
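A few conversions of this kind are sketched below. The variable names are
ours; the point to notice is that both constructors ignore surrounding
whitespace, including the end-of-line character that a line read from a
file carries, and that text which is not a number is rejected.

```python
value = float('34.256')      # the float 34.256
count = int('6')             # the int 6
from_line = float('6\n')     # 6.0; the trailing end-of-line is ignored

# A string that does not represent a number raises ValueError.
try:
    float('six')
    bad_input_rejected = False
except ValueError:
    bad_input_rejected = True

print(value, count, from_line, bad_input_rejected)
```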
The function sum_file presented below computes the sum of the values
stored in a file where values are stored one per line. We assume that
there are no empty lines in the file or lines that contain any other form
of data. If we opened such a file with a text editor, we might therefore
see something like:
43.63
23.51
6
0.453
-12.443
def sum_file(f):
    """
    file -> Real
    Produces the sum of data stored in file f
    Requires: f has been opened for reading;
              one real number per line
    """
    sumData = 0.0
    for line in f:
        sumData = sumData + float(line)
    return sumData
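To see the function at work on the sample data shown above, here is a
small sketch. It repeats sum_file so that it runs on its own, writes the
five sample values to a temporary file (the file name is hypothetical)
and opens that file in Python 3 text mode.

```python
import os
import tempfile

def sum_file(f):
    """Same function as above, repeated so the sketch runs on its own."""
    sumData = 0.0
    for line in f:
        sumData = sumData + float(line)
    return sumData

# Hypothetical data file holding the sample values, one per line.
path = os.path.join(tempfile.mkdtemp(), 'myData.txt')
with open(path, 'w') as out:
    out.write('43.63\n23.51\n6\n0.453\n-12.443\n')

with open(path, 'r') as f:
    total = sum_file(f)

# Rounded for display, since float arithmetic is only approximate.
print(round(total, 3))
```

The result is approximately 61.15; the line containing 6 is converted to
the float 6.0 like any other value.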
Notice the requires clause in the documentation. The requires clause
lists conditions that must be satisfied if the purpose statement is to
hold true. First, the file must have already been opened for reading:
we don't make a call to open from this function, so the file must have
been opened prior to calling sum_file. Second, there must be exactly one
real number per line in the file. If the second condition does not hold,
the expression float(line) raises a ValueError, as the data in line
cannot be interpreted as a single float.
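The effect of breaking the second condition can be sketched as follows.
The helpers write_sample and try_sum are hypothetical and exist only for
this illustration; sum_file is repeated so that the sketch runs on its
own.

```python
import os
import tempfile

def sum_file(f):
    """Same function as above, repeated so the sketch runs on its own."""
    sumData = 0.0
    for line in f:
        sumData = sumData + float(line)
    return sumData

def write_sample(text):
    """Hypothetical helper: write text to a temporary file, return its path."""
    path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
    with open(path, 'w') as out:
        out.write(text)
    return path

def try_sum(path):
    """Hypothetical helper: return the sum, or 'ValueError' if one is raised."""
    with open(path, 'r') as f:
        try:
            return sum_file(f)
        except ValueError:
            return 'ValueError'

well_formed = try_sum(write_sample('1.0\n2.5\n'))
two_per_line = try_sum(write_sample('1.0 2.0\n3.0\n'))   # two values on a line
blank_line = try_sum(write_sample('1.0\n\n3.0\n'))       # an empty line

print(well_formed, two_per_line, blank_line)
```

Both malformed files stop the function with a ValueError on the first
offending line, which is why the requires clause rules them out.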
Testing functions that read from files
Notice that the function sum_file does not have any doctests. There are
two approaches we can take to testing functions that read from files.
First, we can generate a sample data file for the sole purpose of testing
that contains data for which we know the expected result. We can open
the file for reading and make a call to sum_file from the doctests. If
we take this approach, we have to be careful to always keep the sample
data file and the Python source file together, as the doctests now depend
on the sample data file.
The second approach is to design something that behaves like a file for
the purposes of testing our function. Note that the signature of
sum_file indicates that this function consumes something of type file.
Python uses a mechanism called duck typing to determine if something has
an appropriate interface: "If it looks like a duck and quacks like a
duck, it must be a duck." (ref: the Python glossary).
So if, in the context of the sum_file function, the object we pass as a
parameter looks like a file and behaves like a file, it must be a file.
So what must this object be if it is to look and behave like a file in
the context of the sum_file function? The only requirement imposed by
sum_file is that we can iterate over it and retrieve each line of text as
an object of type str.
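Because iteration over strings is the only requirement, even a plain list
of strings is enough of a "file" for sum_file. The sketch below, with
made-up data, illustrates the point; sum_file is repeated so that the
sketch runs on its own.

```python
def sum_file(f):
    """Same function as above, repeated so the sketch runs on its own."""
    sumData = 0.0
    for line in f:
        sumData = sumData + float(line)
    return sumData

# Not a file at all, but we can iterate over it and each item is a str,
# which is all that sum_file asks of its argument.
fake_file = ['1.0\n', '2.5\n', '5.4\n']
total = sum_file(fake_file)

print(round(total, 1))
```

This is only meant to show duck typing in action; an object that more
closely mimics a real file is usually a better choice for tests.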
The standard Python library provides a class of such objects in the io
module. A StringIO object is constructed from a string and supports all
the file operations used by the sum_file function. In other words, from
the point of view of the sum_file function, it looks like a file and
behaves like a file, so it's a file! Our sum_file function with tests
included is presented in Code Explorer 7.1.
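The Code Explorer listing is not reproduced in this text. As a rough
idea of what such tests might look like, here is a sketch under our own
assumptions: Python 3, test values of our choosing, and a call to round
so that the printed result does not depend on floating-point rounding.
The sketch relies on the enclosing parentheses, rather than a trailing
backslash, to continue the data string onto a second line. The last few
lines simply run the doctests and collect the outcome.

```python
import doctest
from io import StringIO

def sum_file(f):
    """
    file -> Real
    Produces the sum of data stored in file f
    Requires: f has been opened for reading;
              one real number per line

    >>> sum_file(StringIO(''))
    0.0
    >>> sum_file(StringIO('6\\n'))
    6.0
    >>> data = StringIO('1.0\\n2.5\\n5.4\\n3.0\\n'
    ...                 '2.5\\n0.5\\n')
    >>> round(sum_file(data), 1)
    14.9
    """
    sumData = 0.0
    for line in f:
        sumData = sumData + float(line)
    return sumData

# Run the examples in the docstring and summarise the outcome.
runner = doctest.DocTestRunner(verbose=False)
test_globals = {'sum_file': sum_file, 'StringIO': StringIO}
for test in doctest.DocTestFinder().find(sum_file, 'sum_file',
                                         globs=test_globals):
    runner.run(test)
results = runner.summarize(verbose=False)

print(results.failed, results.attempted)
```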
Some comments are in order.
- Note that the string that we pass to the constructor of the StringIO
  object represents the data in the file. The empty string represents an
  empty file. If there is more than one line of data in the file, we must
  explicitly include \n to indicate where each line of data ends. Also
  note that we must write \n as \\n so that the \n is not interpreted as
  an end-of-line within the docstring that holds the test.
- Note that the string that represents the data in the file is too long
  to fit on one line. We close the string, include a continuation marker
  \ at the end of the line and then continue the string on the next line.
  This segment of code:
  '1.0\\n2.5\\n5.4\\n3.0\\n' \
  '2.5\\n0.5\\n'
  is therefore interpreted as the single string:
  '1.0\\n2.5\\n5.4\\n3.0\\n2.5\\n0.5\\n'
- Suppose we wish to sum data stored in the file myData.txt. We make a
  call to our sum_file function from the Python shell and print the
  result on the screen as follows:
  >>> with open('myData.txt', 'U') as f:
  ...     print(sum_file(f))
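The first two comments above, on the continuation marker and on writing
\n as \\n, can be checked directly. The sketch below is ours: the
function escaped and the variable names are hypothetical, and the code is
ordinary Python source rather than a docstring, so outside the docstring
a single backslash is used.

```python
# Adjacent string literals are joined into one string; the trailing
# backslash continues the statement onto the next line.
joined = '1.0\n2.5\n5.4\n3.0\n' \
         '2.5\n0.5\n'
single = '1.0\n2.5\n5.4\n3.0\n2.5\n0.5\n'
same_string = (joined == single)

def escaped():
    """
    >>> len('a\\nb')
    3
    """

# Inside the docstring, \\n has become a backslash followed by the
# letter n, which is the text the doctest will later evaluate.
has_backslash_n = '\\n' in escaped.__doc__

print(same_string, has_backslash_n)
```

Had the docstring contained a single \n, the test line would have been
split in two before doctest ever saw it.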