The last two posts looked at Python list comprehensions. [See Simple Tricks #2 and Simple Tricks #3] This time we look at file handling, one of the most common tasks programmers deal with, especially in scripting languages such as Python.

Python’s native file object, created with the built-in open function, is simple and easy, but here are some tricks that make file access even simpler and easier.

First, though, let’s consider how we use filenames. On most file systems, a file has a specific name, the filename, and a directory (or folder) path, the pathname, that specifies where it is on disk. On some systems, the pathname can begin with a drive or resource identifier. Together, the filename and pathname make up the full specification, the filespec, identifying the desired file.

Python, to a great extent in os.path, provides a rich set of functions for dealing with these aspects of a file specification. Most relevant here is the os.path.join function, which is the main tool for constructing the filespec. Canonically:

001| from os import path
002| 
003| filespec = path.join(pathname, subdir1, subdir2, filename)
004| 

The above fragment assumes that pathname, subdir1, subdir2, and filename are defined strings with relevant text. Note how the path.join function combines multiple parts into a single filespec string. It takes care of the separator character, usually a forward or backward slash, between parts of the file specification.
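To make the separator handling concrete, here is a minimal sketch. It imports ntpath and posixpath, the Windows and POSIX implementations behind os.path, directly so the result is the same on any machine. The directory and file names are made up for illustration.

```python
import ntpath      # the Windows implementation of os.path
import posixpath   # the POSIX (Linux, macOS) implementation of os.path

# Hypothetical parts of a file specification.
parts = ['proj', 'blog', 'foo.txt']

# Each flavor inserts its own separator between the parts.
win_spec = ntpath.join(r'C:\users\me', *parts)
nix_spec = posixpath.join('/home/me', *parts)

print(win_spec)
print(nix_spec)
```

In everyday code you would just use os.path, which picks the right flavor for the platform the script runs on.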

A general rule of thumb is to always make use of the path.join function to keep file path information separate from file name information. This gives you much more flexibility when (not if, when) file locations change.

It’s not uncommon to write a function that, given a filename, opens a file and either does something with it or returns data from that file. Typically, then, the function takes the file’s name as a parameter. I’ve found it useful to have the function take both the filename and the pathname. For one thing, this allows me to specify a default pathname. Here’s an example:

001| from os import path
002| 
003| BasePath = r'C:\users\me\proj'
004| 
005| def some_function_2 (*args, fname, fpath=BasePath):
006|     filespec = path.join(fpath,fname) if fpath else fname
007|     print(f'Filename: {filespec}')
008|     
009| 
010| args = [42, 21, 'caramel']
011| some_function_2(*args, fname='foo.txt')
012| some_function_2(*args, fname='foo.txt', fpath=None)
013| some_function_2(*args, fname='foo.txt', fpath=r'c:\users\me\proj\blog')
014| print()
015| some_function_2(*args, fname=r'temp\foo.txt')
016| some_function_2(*args, fname=r'temp\foo.txt', fpath=None)
017| print()
018| some_function_2(*args, fname=r'C:\users\me\proj\blog\foo.txt', fpath=None)
019| some_function_2(*args, fname=r'C:\users\me\proj\blog\foo.txt')
020| print()
021| 

Line #3 defines a pathname, which the function definition on line #5 uses as the default value for fpath. (However, it requires a value for fname.) We can imagine that the function takes other arguments, as represented by the *args parameter.

What’s important here is line #6, which creates the filespec. It uses the fpath variable unless the calling code has explicitly set it to None, in which case it uses the fname variable as the entire filespec. This allows for multiple ways to call this function depending on the desired effect.

When run, this prints:

Filename: C:\users\me\proj\foo.txt
Filename: foo.txt
Filename: c:\users\me\proj\blog\foo.txt

Filename: C:\users\me\proj\temp\foo.txt
Filename: temp\foo.txt

Filename: C:\users\me\proj\blog\foo.txt
Filename: C:\users\me\proj\blog\foo.txt

The first three show, respectively, using the default filepath, setting it to None, and supplying a different filepath. Setting fpath to None completely eliminates path information from the filespec.

The middle two show how path information can be part of the filename but are otherwise like the first two examples. The last two show what happens when fname carries full path information: because it is an absolute path, the path.join function discards the fpath value, so the result is the same whether fpath is None or the default. Even so, callers wanting to override the fpath default should set it to None.
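The override comes from path.join itself, not from anything in our function. Here is a small sketch of the three cases, using ntpath (the Windows flavor of os.path) so the Windows-style paths behave the same on any platform:

```python
import ntpath  # Windows flavor of os.path, usable on any platform

base = r'C:\users\me\proj'

# A bare filename is appended to the base path.
joined_plain = ntpath.join(base, 'foo.txt')

# A relative subpath is appended as well.
joined_sub = ntpath.join(base, r'temp\foo.txt')

# An absolute filespec (drive plus root) discards the base path.
joined_abs = ntpath.join(base, r'C:\users\me\proj\blog\foo.txt')

for spec in (joined_plain, joined_sub, joined_abs):
    print(spec)
```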

Here’s a slight variation on how we treat fname and fpath:

001| from os import path
002| 
003| def some_function_1 (*args, fname, fpath=None):
004|     filespec = path.join(fpath,fname) if fpath else fname
005|     print(f'Filename: {filespec}')
006|     
007| 
008| args = [42, 21, 'caramel']
009| some_function_1(*args, fname='foo.txt')
010| some_function_1(*args, fname='foo.txt', fpath=None)
011| some_function_1(*args, fname='foo.txt', fpath=r'c:\users\me\proj\blog')
012| print()
013| some_function_1(*args, fname=r'temp\foo.txt')
014| some_function_1(*args, fname=r'temp\foo.txt', fpath=r'c:\users\me')
015| print()
016| some_function_1(*args, fname=r'C:\users\me\blog\foo.txt')
017| some_function_1(*args, fname=r'C:\users\me\blog\foo.txt', fpath=r'D:\backup')
018| print()
019| 

The only difference is that, rather than give fpath a default pathname, we default it to None. This makes the default behavior to treat fname as the entire filespec, which is handy if your code usually has a full filespec rather than filename and pathname pairs.

When run, this prints:

Filename: foo.txt
Filename: foo.txt
Filename: c:\users\me\proj\blog\foo.txt

Filename: temp\foo.txt
Filename: c:\users\me\temp\foo.txt

Filename: C:\users\me\blog\foo.txt
Filename: C:\users\me\blog\foo.txt

Compare this to the first one to see how this second version of the function treats the inputs differently.

The first example is better when your code usually has just a filename and usually uses some default location for the files. That said, hardcoding any specific file or path information into your code has consequences, and they usually turn out to be bad ones eventually. At the very least, use a global variable that only needs to be changed in one location.
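One way to go a step beyond a global variable is to read the base path from the environment and fall back to a default. This is only a sketch: the variable name PROJ_BASE_PATH, the fallback location, and both helper functions are hypothetical and not part of the examples above.

```python
import os
from os import path

# Hypothetical environment variable name (an assumption for this sketch).
ENV_VAR = 'PROJ_BASE_PATH'

# Fallback used when the variable is not set (also just an illustration).
DEFAULT_BASE = path.join(path.expanduser('~'), 'proj')

def get_base_path():
    '''Return the base path from the environment, or the fallback.'''
    return os.environ.get(ENV_VAR, DEFAULT_BASE)

def make_filespec(fname):
    '''Join fname onto the configured base path.'''
    return path.join(get_base_path(), fname)

print(make_filespec('foo.txt'))
```

The location still lives in exactly one place, but changing it no longer means editing the code.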

Even though I almost always deal in separate filenames and pathnames, I generally use the second version to avoid having to specify a default pathname.


Before we look at tricks for dealing with files, let’s review the native capability:

001| from os import path
002| 
003| BasePath = r'C:\demo\hcc\python'
004| filename = 'the-dream.txt'
005| 
006| fn = path.join(BasePath, filename)
007| fp = open(fn, mode='r', encoding='utf8')
008| lines = fp.readlines()
009| fp.close()
010| 
011| print(f'read: {fn}')
012| print(f'lines: {len(lines)}')
013| print()
014| for line in lines:
015|     print(line, end='')
016| print()
017| 

This uses the path.join function to create the filespec (line #6) and then the open function (line #7) to create fp, an open file object.

Line #8 uses the readlines method to read all the lines of the file into the lines variable. This assumes the file is a typical text file composed of lines of text. Once the data has been read, line #9 calls the close method to release the file (something you should always do, even when just reading a file).

The rest, lines #11 through #16, just prints the file lines. Note that we set the end parameter to an empty string to prevent the print function from appending a newline, because each line already ends with one.

But this bare-bones example lacks a vital feature: error handling. In particular, if the readlines method fails, we never reach the close method to properly close the file. At the very least, we prefer to write it like this:

001| from os import path
002| 
003| BasePath = r'C:\demo\hcc\python'
004| filename = 'the-dream.txt'
005| 
006| fn = path.join(BasePath, filename)
007| fp = open(fn, mode='r', encoding='utf8')
008| try:
009|     lines = fp.readlines()
010| except Exception as e:
011|     print(e)
012|     raise
013| else:
014|     print(f'read: {fn}')
015|     print(f'lines: {len(lines)}')
016| finally:
017|     fp.close()
018| 
019| print()
020| for line in lines:
021|     print(line, end='')
022| print()
023| 

The try-except configuration ensures that errors reading the file are caught and handled. More importantly, the finally block guarantees we always call the close method. We can also leverage the else block for the status report, which gathers all the file-handling into the try structure. The remainder (lines #19-#22), as in the previous example, just prints the file lines.

In this example, we print the error and re-raise the exception, but what action we actually take depends on the overall context of the code.

Python offers a much shorter (and generally preferred) way to do largely the same thing:

001| from os import path
002| 
003| BasePath = r'C:\demo\hcc\python'
004| filename = 'the-dream.txt'
005| 
006| fn = path.join(BasePath, filename)
007| with open(fn, mode='r', encoding='utf8') as fp:
008|     data = fp.read()
009| 
010| print(f'read: {fn}')
011| print(f'chars: {len(data)}')
012| print()
013| print(data)
014| print()
015| 

But this version has no opportunity to react to the exception locally (the calling function needs to catch it and deal with it). And the status report, as in the first example, is no longer coupled so nicely with the file operation.
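If you do want to react locally, one option is to wrap the with statement in a try structure. Here is a minimal sketch. The read_text function is a hypothetical helper, and returning None on failure is just one possible policy. The demonstration uses a throwaway temporary file so it runs anywhere.

```python
import os
import tempfile

def read_text(filespec):
    '''Return the file's text, or None if it cannot be read.'''
    try:
        # The with statement still guarantees the file gets closed.
        with open(filespec, mode='r', encoding='utf8') as fp:
            data = fp.read()
    except OSError as e:
        # Covers a missing file, a permissions problem, and similar.
        print(f'failed: {e}')
        return None
    else:
        # The status report sits next to the file operation again.
        print(f'read: {filespec} ({len(data)} chars)')
        return data

# Demonstration with a throwaway file.
with tempfile.TemporaryDirectory() as tmp:
    fn = os.path.join(tmp, 'demo.txt')
    with open(fn, mode='w', encoding='utf8') as fp:
        fp.write('hello\n')
    good = read_text(fn)
    bad = read_text(os.path.join(tmp, 'missing.txt'))
```

This brings back the except and else blocks from the longer version while leaving the closing of the file to the with statement.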

As a general rule, though, this last example is the “Python way” of doing it.


I’ve long been a fan of object-oriented programming, so naturally my approach to file handling convenience involves defining some new classes. The first one is a base class that implements some fundamental file handling:

001| from os import path
002| from time import localtime
003| from datetime import datetime
004| 
005| class fileobj:
006|     '''File Object class.'''
007| 
008|     def __init__ (self, filename):
009|         '''New File Object instance.'''
010|         self.fullname = filename
011| 
012|         # Parse the full name into parts…
013|         t = path.split(self.fullname)
014|         self.pathname = t[0]
015|         self.filename = t[1]
016|         t = path.splitext(self.filename)
017|         self.name  = t[0]
018|         self.ext = t[1][1:]
019| 
020|         # Default file properties…
021|         self.fsize = 0
022|         self.created = None
023|         self.updated = None
024| 
025|         # If the file exists, get its properties…
026|         self.dflag = False
027|         self.fflag = False
028|         self.xflag = path.exists(self.fullname)
029|         if self.xflag:
030|             # Exists!…
031|             self.dflag = path.isdir(self.fullname)
032|             self.fflag = path.isfile(self.fullname)
033| 
034|             # And it’s a file!…
035|             if self.fflag:
036|                 # File Size…
037|                 self.fsize = path.getsize(self.fullname)
038|                 # File Create Date/Time…
039|                 t = localtime(path.getctime(self.fullname))
040|                 self.created = datetime(*t[0:6])
041|                 # File Updated Date/Time…
042|                 t = localtime(path.getmtime(self.fullname))
043|                 self.updated = datetime(*t[0:6])
044| 
045|     def __getattribute__ (self, name):
046|         '''Enable virtual attributes.'''
047|         if name == 'exists': return self.xflag
048|         if name == 'isfile': return self.fflag
049|         if name == 'isdir':  return self.dflag
050|         return super().__getattribute__(name)
051| 
052|     def __bool__ (self):
053|         '''Boolean value: True if file exists.'''
054|         return True if self.xflag else False
055| 
056|     def __eq__ (self, other): return (self.fullname == other.fullname)
057|     def __lt__ (self, other): return (self.fullname < other.fullname)
058|     def __gt__ (self, other): return (other < self)
059|     def __le__ (self, other): return not (other < self)
060|     def __ge__ (self, other): return not (self < other)
061|     def __ne__ (self, other): return not (self == other)
062| 
063|     def __str__ (self): return f'{self.fullname} ({self.fsize} bytes)'
064| 
065| fn = r"C:\demo\hcc\python\the-dream.txt"
066| fo = fileobj(fn)
067| 
068| print(f'{fo=!s}')
069| print(f'{fo.fullname=}')
070| print(f'{fo.pathname=}')
071| print(f'{fo.filename=}')
072| print(f'{fo.name=}')
073| print(f'{fo.ext=}')
074| print(f'{fo.fsize=}')
075| print(f'{fo.created=}')
076| print(f'{fo.updated=}')
077| print(f'{fo.exists=}')
078| print(f'{fo.isfile=}')
079| print(f'{fo.isdir=}')
080| 

There’s a fair amount packed into this! New instances, line #8, take a filename parameter assumed to be a filespec such as would be passed to the open function. Lines #10-#18 break the filespec into its parts and assign those to various instance attributes. Lines #21-#23 establish defaults for the size and file time attributes. Lines #26-#28 establish defaults for the directory and file flags and set the exists flag using the path.exists function. If the file does exist, lines #29-#43 set the file and directory flags and, if it is a regular file, the file size attribute and the two file times (created and updated).

Lines #45-#50 provide obvious aliases for the three flags by overriding the dunder getattribute method. Lines #52-#54 implement the dunder bool method and couple it to the file exists flag. Thus, a file instance object can be treated as a boolean that is True if the file exists. Lines #56-#61 implement the comparison operators so file instance objects can be sorted by their full filespec. Lastly, as we always should, line #63 implements the “to string” method dunder str.

Some of the above might be optional if your fileobj class is intended for a specific use. For instance, you might never need to sort instance objects by filespec, so you don’t need to implement the comparison operators. You could also go the other direction and make them more powerful by, if the names match, comparing sizes, and if those match, comparing file dates. You might also prefer to use the @property decorator to create read-only properties for the exists, isdir, and isfile attributes [for details see Python Descriptors, part 1 and part 2]. Or you might decide you don’t need the properties and will use the flag attributes if necessary. The code above should be taken only as a guide.
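As a rough sketch of the @property alternative, here is a cut-down stand-in class. The name fileinfo and the underscore-prefixed flag names are hypothetical. It also uses functools.total_ordering, a standard library decorator that derives the remaining comparison operators from dunder eq and dunder lt, which is a swap-in for writing all six by hand as the class above does.

```python
from functools import total_ordering
from os import path

@total_ordering
class fileinfo:
    '''Hypothetical cut-down stand-in for the fileobj class.'''

    def __init__(self, filename):
        self.fullname = filename
        self._xflag = path.exists(filename)
        self._fflag = path.isfile(filename)
        self._dflag = path.isdir(filename)

    # Read-only properties: no setter is defined, so assignment fails.
    @property
    def exists(self): return self._xflag

    @property
    def isfile(self): return self._fflag

    @property
    def isdir(self): return self._dflag

    # total_ordering fills in __le__, __gt__, and __ge__ from these two.
    def __eq__(self, other): return self.fullname == other.fullname
    def __lt__(self, other): return self.fullname < other.fullname

a = fileinfo('a.txt')
b = fileinfo('b.txt')
print(sorted([b, a])[0].fullname)
```

Trying to assign to one of the properties, for example a.exists = True, raises an AttributeError, which is the read-only behavior the getattribute approach does not give you.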

Lastly, lines #65-#79 test the class by creating an instance object and printing out its many attributes. When run, this prints:

fo=C:\demo\hcc\python\the-dream.txt (3604 bytes)
fo.fullname='C:\\demo\\hcc\\python\\the-dream.txt'
fo.pathname='C:\\demo\\hcc\\python'
fo.filename='the-dream.txt'
fo.name='the-dream'
fo.ext='txt'
fo.fsize=3604
fo.created=datetime.datetime(2024, 6, 7, 15, 52, 8)
fo.updated=datetime.datetime(2024, 6, 7, 15, 52, 28)
fo.exists=True
fo.isfile=True
fo.isdir=False

Note that we can use this class for some kinds of file operations. For example, we could use it, and the Python listdir function, to create a recursive directory listing function that included file sizes and dates. (Leave a comment to let me know if you’d be interested in seeing an example in a future Simple Trick post.)

To represent a physical file with contents, we can extend our base class with an abstract class that adds a key attribute and some useful methods:

001| from examples import fileobj
002| 
003| class filecontent (fileobj):
004|     '''Abstract File base class.'''
005| 
006|     def __init__ (self, filename):
007|         '''New File Object instance.'''
008|         super().__init__(filename)
009|         self.data = None
010| 
011|     def __len__ (self):
012|         '''File Object's length is the length of its data.'''
013|         return len(self.data)
014| 
015|     def __getitem__ (self, ix):
016|         '''Get a datum by index. (Assumes data is indexable!)'''
017|         return self.data[ix]
018| 
019|     def __setitem__ (self, ix, value):
020|         '''Set a datum by index. (Assumes data is indexable!)'''
021|         self.data[ix] = value
022| 
023|     def __iter__ (self):
024|         '''Iterate over the data. (Assumes data is iterable!)'''
025|         return iter(self.data)
026| 

New instances (lines #6-#9) still take a filespec, but we add the data attribute intended to hold the file’s data (line #9). We also implement the dunder len, getitem, setitem, and iter methods.

Because this is an abstract class, and the data attribute is set to None, none of these methods work (each raises a TypeError). They depend on the subclasses we’ll make for dealing with different kinds of files.
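To see the failure mode without touching the disk, here is a tiny sketch with a hypothetical stand-in class, holder, that mimics only the data handling of filecontent. With data set to None, each operation raises a TypeError. Once data holds a list, the same calls succeed.

```python
class holder:
    '''Hypothetical stand-in that mimics filecontent's data handling.'''

    def __init__(self):
        self.data = None

    def __len__(self): return len(self.data)
    def __getitem__(self, ix): return self.data[ix]
    def __iter__(self): return iter(self.data)

h = holder()

# Try each operation while data is still None and collect the errors.
errors = []
for action in (lambda: len(h), lambda: h[0], lambda: list(h)):
    try:
        action()
    except TypeError as e:
        errors.append(str(e))

print(f'{len(errors)} operations failed')

# Once data holds something indexable, the same calls succeed.
h.data = ['first line\n', 'second line\n']
print(len(h), repr(h[0]))
```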

Which is where we’ll pick up next time.


Link: Zip file containing all code fragments used in this post.