Files in Python (Part I)
When we want to read or write a file (say, on your hard drive), we must first open it. Opening the file communicates with your operating system, which knows where the data for each file is stored. When you open a file, you are asking the operating system to find the file by name and make sure it exists.
>>> fhand = open('mbox.txt')
>>> print fhand
<open file 'mbox.txt', mode 'r' at 0x1005088b0>
If the open is successful, the operating system returns us a file handle. The file handle is not the actual data contained in the file, but instead it is a “handle” that we can use to read the data. You are given a handle if the requested file exists and you have the proper permissions to read the file. If the file does not exist, open will fail with a traceback and you will not get a handle to access the contents of the file:
>>> fhand = open('stuff.txt')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IOError: [Errno 2] No such file or directory: 'stuff.txt'
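One way to keep a missing file from ending the program with a traceback is to wrap the call to open in try and except. The sketch below is only an illustration: the helper name open_or_none and the file name are hypothetical, and print is called with parentheses so the code behaves the same under Python 2 and Python 3.

```python
def open_or_none(filename):
    """Return a file handle for filename, or None if it cannot be opened."""
    try:
        return open(filename)
    except IOError:
        # IOError is what Python 2 raises for "No such file or directory".
        # In Python 3 it is an alias of OSError, so this clause works there too.
        print('File cannot be opened: ' + filename)
        return None

# Hypothetical file name, assumed not to exist in the current directory.
fhand = open_or_none('no-such-file-for-this-example.txt')
```

With this approach the rest of the program can check whether fhand is None before trying to read from it.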
A text file can be thought of as a sequence of lines, much like a Python string can be thought of as a sequence of characters. To break the file into lines, there is a special character, called the newline character, that represents the “end of the line”. In Python, we write the newline character as \n (backslash-n) in string constants. Even though this looks like two characters, it is actually a single character. When we look at the variable by entering “stuff” in the interpreter, it shows us the \n in the string, but when we use print to show the string, we see the string broken into two lines by the newline character.
>>> stuff = 'Hello\nWorld!'
>>> stuff
'Hello\nWorld!'
>>> print stuff
Hello
World!
>>> stuff = 'X\nY'
>>> print stuff
X
Y
>>> len(stuff)
3
You can also see that the length of the string 'X\nY' is three characters, because the newline character is a single character. So when we look at the lines in a file, we need to imagine that there is a special invisible character, the newline, at the end of each line. It is the newline character that separates the characters in the file into lines.
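The same idea can be checked on a slightly longer string. The following is a small sketch, not part of the original session; the variable names are chosen for illustration, and print is called with parentheses so the code runs under both Python 2 and Python 3.

```python
text = 'X\nY\nZ'

# Each \n counts as a single character when measuring the string.
length = len(text)

# count() reports how many newline characters the string contains.
newlines = text.count('\n')

# split('\n') breaks the string at each newline, giving the separate lines.
lines = text.split('\n')

print(length)
print(newlines)
print(lines)
```

Three visible characters plus two newlines give a length of five, and splitting on the newline recovers the three one-character lines.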
While the file handle does not contain the data for the file, it is quite easy to construct a for loop to read through and count each of the lines in a file:
fhand = open('mbox.txt')
count = 0
for line in fhand:
    count = count + 1
print 'Line Count:', count
python open.py
Line Count: 132045
We can use the file handle as the sequence in our for loop. Our for loop simply counts the number of lines in the file, and the print statement after the loop prints the total. The rough translation of the for loop into English is, “for each line in the file represented by the file handle, add one to the count variable.”
The reason that the open function does not read the entire file is that the file might be quite large, with many gigabytes of data. The call to open takes the same amount of time regardless of the size of the file; it is the for loop that actually causes the data to be read. When the file is read using a for loop in this manner, Python takes care of splitting the data in the file into separate lines using the newline character. Python reads each line through the newline and includes the newline as the last character in the line variable for each iteration of the for loop.
Because the for loop reads the data one line at a time, it can efficiently read and count the lines in very large files without running out of main memory. The above program can count the lines in a file of any size using very little memory, since each line is read, counted, and then discarded.
If you know the file is relatively small compared to the size of your main memory, you can read the whole file into one string using the read method on the file handle:
>>> fhand = open('mbox-short.txt')
>>> inp = fhand.read()
>>> print len(inp)
94626
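To see both styles of reading side by side without needing mbox.txt, the sketch below first writes a tiny sample file and then reads it twice: once line by line and once with read. The file name and its contents are made up for this illustration, and the code is written to run under both Python 2 and Python 3.

```python
import os
import tempfile

# Create a small sample file so the example is self-contained.
# The name 'sample.txt' and the three lines are hypothetical.
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
fhand = open(path, 'w')
fhand.write('first line\nsecond line\nthird line\n')
fhand.close()

# Read line by line: each line keeps its trailing newline character.
fhand = open(path)
lines = []
for line in fhand:
    lines.append(line)
fhand.close()

# Read the whole file at once into a single string.
fhand = open(path)
inp = fhand.read()
fhand.close()

print(len(lines))       # number of lines seen by the for loop
print(repr(lines[0]))   # repr makes the trailing \n visible
print(len(inp))         # total characters, newlines included
```

The lengths of the individual lines add up to the length of the string returned by read, because both approaches see the same characters, newlines included.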