Section 2: Reading Data from Files

In this section, we assume that we have data stored in a text file that we want to process in a program. The data could have been generated by another program that we wrote or we could have obtained it from some other source such as the Internet.

Python provides us with a way of reading data from a text file one line at a time. Each line of data is provided to us as a str object. Data in a text file can therefore be viewed as a (listof str), where each str in the list represents an entire line of data in the file.

Before we can read data from a file, we must first open that file for reading. In Python, we achieve this using the built-in open function. When we are done with a file, we must remember to close it. Python makes this particularly easy to do through the use of the with construct:

with open('myFile.txt', 'U') as f:
    ...f

This segment of code opens the file named 'myFile.txt' for reading and assigns it to the variable f. When the block of code associated with the with statement has finished executing, the file is automatically closed for us. Notice that the open function takes two arguments. The first is the name of the file; the second is a mode that specifies how the file is to be used. The 'U' or universal mode indicates that the file is to be opened for reading with universal support for end-of-line markers. This is a technical issue that we will not consider in detail. Suffice it to say that an end-of-line is represented differently on different systems. By specifying universal mode, we let Python handle these different representations gracefully. So, if we obtain a file from a colleague who uses Windows, we can transfer it to an OS X or Linux system and have our Python program read it without having to worry about the fact that end-of-line markers on these systems are different.
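To see the with construct in action, the following minimal sketch first writes a small throwaway file and then opens it for reading. It assumes Python 3, where text mode already handles the different end-of-line markers and the 'U' mode is no longer accepted (it was removed in Python 3.11), so the sketch passes 'r' instead. The temporary directory and the file contents are made up for illustration.

```python
import os
import tempfile

# Create a small throwaway file so the example is self-contained.
# The directory and the contents are assumptions made for this sketch.
path = os.path.join(tempfile.mkdtemp(), 'myFile.txt')
with open(path, 'w') as out:
    out.write('first line\nsecond line\n')

# Open for reading. In Python 3, text mode translates the different
# end-of-line markers to '\n' by default, so 'r' is used here.
with open(path, 'r') as f:
    was_open_inside = not f.closed   # the file is open inside the block
    contents = f.read()

closed_after = f.closed              # the with statement closed it for us

print(was_open_inside, closed_after)
```

After the block finishes, f.closed is True even though we never called close ourselves.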

Having opened a file for reading, the following provides a template for reading data from the file:

for line in f:
    ...line

where line is of type str. The problem of processing data in a file is now reduced to that of processing a string that represents data on a single line in that file.
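As a minimal sketch of this template, the code below collects every line of a small file into a list so that we can inspect what the loop hands us. It assumes Python 3; the file contents and the name lines_seen are made up for illustration.

```python
import os
import tempfile

# Throwaway file with three lines (contents are an assumption).
path = os.path.join(tempfile.mkdtemp(), 'myFile.txt')
with open(path, 'w') as out:
    out.write('alpha\nbeta\ngamma\n')

lines_seen = []
with open(path, 'r') as f:
    for line in f:
        # line is a str that still ends with its end-of-line marker
        lines_seen.append(line)

print(lines_seen)   # ['alpha\n', 'beta\n', 'gamma\n']
```

Note that each str still carries the '\n' that ended the line in the file.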

In the case where each line in the file contains a string that represents a single numeric value, it is not hard to process an entire line of data. Python has constructors for int and float that convert a string representation of a numeric value to the corresponding type. So, for example, float('34.256') converts the str object '34.256' to the float whose value is 34.256.
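A few conversions of this kind are sketched below. The variable names are made up; the last line illustrates that float also accepts surrounding whitespace, including the end-of-line marker that a line read from a file still carries.

```python
as_float = float('34.256')      # str -> float, value 34.256
as_int = int('42')              # str -> int, value 42
padded = float(' -12.443\n')    # surrounding whitespace is ignored

print(as_float, as_int, padded)
```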

The function sum_file presented below computes the sum of the values stored in a file where values are stored one per line. We assume that there are no empty lines in the file or lines that contain any other form of data. If we opened such a file with a text editor, we might therefore see something like:

43.63
23.51
60.453
-12.443

def sum_file(f):
    """
    file -> Real
    Produces the sum of data stored in file f
    Requires: f has been opened for reading;
              one real number per line
    """
    sumData = 0.0

    for line in f:
        sumData = sumData + float(line)

    return sumData

Notice the requires clause in the documentation. The requires clause lists certain conditions that must be satisfied if the purpose statement is to hold true. First, the file must already have been opened for reading: we don't make a call to open from this function, so the file must have been opened prior to calling sum_file. Second, there must be only one real number per line in the file. If the second condition does not hold, the expression float(line) will produce an error, as the data in line cannot be interpreted as a single float.
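The effect of violating the second condition can be sketched with a small helper. The function converts_to_float is hypothetical and exists only for this illustration; it reports whether float accepts a given line, relying on the fact that float raises a ValueError when it does not.

```python
def converts_to_float(line):
    """Hypothetical helper: True if float(line) succeeds, False otherwise."""
    try:
        float(line)
        return True
    except ValueError:
        return False


one_value = converts_to_float('23.51\n')        # one number on the line
two_values = converts_to_float('23.51 7.0\n')   # two numbers on the line
empty_line = converts_to_float('\n')            # nothing to convert

print(one_value, two_values, empty_line)   # True False False
```

A line holding two numbers and an empty line are both rejected, which is why the requires clause rules them out.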

Testing functions that read from files

Notice that the function sum_file does not have any doctests. There are two approaches we can take to testing functions that read from files.
First, we can generate a sample data file for the sole purpose of testing that contains data for which we know the expected result. We can open the file for reading and make a call to sum_file from the doctests. If we take this approach, we have to be careful to always keep the sample data file and the Python source file together, as the doctests now depend on the sample data file.
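A minimal sketch of the first approach follows. It assumes Python 3 and, so that it runs on its own, it writes the sample data file itself; in practice that file would be created once and kept alongside the source file. The file name sample_data.txt and its contents are made up, and sum_file is repeated with a shortened docstring.

```python
import os
import tempfile


def sum_file(f):
    """Produces the sum of data stored in file f, one real number per line."""
    sumData = 0.0
    for line in f:
        sumData = sumData + float(line)
    return sumData


# Hypothetical sample data file whose expected sum we know to be 4.0.
sample_path = os.path.join(tempfile.mkdtemp(), 'sample_data.txt')
with open(sample_path, 'w') as out:
    out.write('1.0\n2.5\n0.5\n')

# The test itself: open the sample file and compare with the known result.
with open(sample_path, 'r') as f:
    total = sum_file(f)

print(total == 4.0)   # True
```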

The second approach is to design something that behaves like a file for the purposes of testing our function. Note that the signature of sum_file indicates that this function consumes something of type file. Python uses a mechanism called duck typing to determine if something has an appropriate interface: "if it looks like a duck and quacks like a duck, it must be a duck" (ref: the Python glossary). So if, in the context of the sum_file function, the object we pass as a parameter looks like a file and behaves like a file, it must be a file. So what must this object be if it is to look and behave like a file in the context of the sum_file function? The only requirement imposed by sum_file is that we can iterate over it and retrieve each line of text as an object of type str.
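As a small illustration of that requirement, a plain list of strings can stand in for a file, since iterating over it also yields one str per line. This sketch repeats sum_file with a shortened docstring so that it runs on its own; the name fake_file and the list contents are made up.

```python
def sum_file(f):
    """Produces the sum of data stored in file f, one real number per line."""
    sumData = 0.0
    for line in f:
        sumData = sumData + float(line)
    return sumData


# Not a file at all, but iterating over it yields one str per "line".
fake_file = ['1.0\n', '2.5\n', '0.5\n']
total = sum_file(fake_file)

print(total)   # 4.0
```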

The standard Python library provides a class of such objects in the io module. A StringIO object is constructed from a string and supports all the file operations used by the sum_file function - in other words, from the point of view of the sum_file function, it looks like a file and behaves like a file, so it's a file! Our sum_file function with tests included is presented in
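One way the function and its tests might look is sketched here. It assumes Python 3 and the StringIO class from the io module; the test values are chosen for illustration and are not taken from the original listing.

```python
from io import StringIO


def sum_file(f):
    """
    file -> Real
    Produces the sum of data stored in file f
    Requires: f has been opened for reading;
              one real number per line

    >>> sum_file(StringIO(''))
    0.0
    >>> sum_file(StringIO('1.0\\n2.5\\n0.5\\n'))
    4.0
    """
    sumData = 0.0

    for line in f:
        sumData = sumData + float(line)

    return sumData


# The same checks, made directly rather than through doctest.
empty_total = sum_file(StringIO(''))
three_line_total = sum_file(StringIO('1.0\n2.5\n0.5\n'))
print(empty_total, three_line_total)   # 0.0 4.0
```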

Some comments are in order.

Note that the string that we pass to the constructor of the StringIO object represents the data in the file. The empty string represents an empty file. If there is more than one line of data in the file, we must explicitly include \n to indicate where each line of data ends. Also note that we must write \n as \\n so that the \n does not get interpreted as an end-of-line in the test itself.
Note that the string that represents the data in the file is too long to fit on one line. We close the string, include a continuation marker \ at the end of the line and then continue the string on the next line. This segment of code:

'1.0\\n2.5\\n5.4\\n3.0\\n' \
'2.5\\n0.5\\n'

is equivalent to:

'1.0\\n2.5\\n5.4\\n3.0\\n2.5\\n0.5\\n'
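To check that a string split across two lines with a continuation marker is the same as the string written on one line, we can compare the two directly. This sketch is ordinary code rather than a test inside a docstring, so a single backslash in \n is enough; the variable names are made up.

```python
# Two adjacent string literals joined across a continuation marker...
split = '1.0\n2.5\n5.4\n3.0\n' \
        '2.5\n0.5\n'

# ...and the same data written as a single literal.
joined = '1.0\n2.5\n5.4\n3.0\n2.5\n0.5\n'

print(split == joined)   # True
```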

Suppose we wish to sum data stored in the file myData.txt. We make a call to our sum_file function from the Python shell and print the result on the screen as follows:

>>> with open('myData.txt', 'U') as f:
...     print(sum_file(f))