Last time we looked at dealing with files in Python, looked into filename handling, and left off after creating a couple of base classes to support general file operations.

This time we’ll extend those classes into some useful file utility classes for handling data from different types of files (for instance, binary, plain text, line-oriented text, and any type of structured file).

We’ll extend the filecontent base class from last time to create a class for opening a generic file. Since we don’t know the file type, we’ll open the file in binary and treat it as an array of bytes:

001| from examples import filecontent
002| 
003| class datafile (filecontent):
004|     '''Generic Data File class.'''
005| 
006|     def __init__ (self, filename):
007|         ''' New Data File instance.'''
008|         super().__init__(filename)
009|         self.data = None
010| 
011|     def load (self):
012|         '''Load data file. (Data will be bytes.)'''
013|         fp = open(self.fullname, mode='rb')
014|         try:
015|             self.data = fp.read()
016|         except Exception as e:
017|             print(e)
018|             raise
019|         else:
020|             print(f'read: {self.fullname}')
021|         finally:
022|             fp.close()
023| 
024|     def save (self):
025|         '''Save data file. (Data must be bytes.)'''
026|         fp = open(self.fullname, mode='wb')
027|         try:
028|             fp.write(self.data)
029|         except Exception as e:
030|             print(e)
031|             raise
032|         else:
033|             print(f'wrote: {self.fullname}')
034|         finally:
035|             fp.close()
036| 
037|     def __contains__ (self, item): return (item in self.data)
038| 
039| from os import path
040| 
041| BasePath = r'C:\demo\hcc\python'
042| filename = r'colors.png'
043| 
044| fn = path.join(BasePath, filename)
045| fo = datafile(fn)
046| fo.load()
047| 
048| print()
049| print(f'{fo.filename=}')
050| print(f'{len(fo)=}')
051| print(f'{fo[0:10]=}')
052| print(f'{fo[9]=}')
053| print(f'{0x0a in fo=}')
054| print(f'{0xff in fo=}')
055| for b in fo:
056|     print(f'{b=:02x}')
057|     if b == ord('\n'):
058|         break
059| print()
060| 

The dunder init method (lines #6-#9) delegates to the parent class and then adds a data attribute (line #9). Lines #11-#22 implement a load method, and lines #24-#35 implement a corresponding save method. Both assume the data attribute holds the file contents. Line #37 implements dunder contains so clients can use the in operator. Lastly, lines #39-#59 exercise the class with a binary data file.

Note that both the load and save methods open the file in ‘b’ (binary) mode, so the loaded data attribute will be bytes even if the file isn’t binary, and data to be saved must also consist of bytes. The upside is that datafile objects can open any type of file.

When run, this prints:

read: C:\demo\hcc\python\colors.png

fo.filename='colors.png'
len(fo)=1992
fo[0:10]=b'\x89PNG\r\n\x1a\n\x00\x00'
fo[9]=0
0x0a in fo=True
0xff in fo=True
b=89
b=50
b=4e
b=47
b=0d
b=0a
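
As an aside, the try/finally pattern used in load and save can also be written with a with statement, which closes the file automatically. The following is a minimal sketch of that alternative, not the class from this post; the function names load_bytes and save_bytes are made up for illustration, and a temporary file stands in for colors.png:

```python
import os
import tempfile

def load_bytes(fullname):
    """Return the whole file as bytes; the with block closes the file."""
    with open(fullname, mode='rb') as fp:
        return fp.read()

def save_bytes(fullname, data):
    """Write bytes to the file, replacing any existing content."""
    with open(fullname, mode='wb') as fp:
        fp.write(data)

# Round-trip a few bytes (the start of a PNG header) through a temp file.
demo_name = os.path.join(tempfile.mkdtemp(), 'demo.bin')
save_bytes(demo_name, b'\x89PNG\r\n\x1a\n')
loaded = load_bytes(demo_name)

print(loaded)
print(0x0a in loaded)   # membership tests on bytes use integer values
```

The behavior is the same as the explicit try/finally version, minus the progress messages.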

This is fine for binary files, such as images, but working with text files will typically involve the extra steps of converting the bytes to string when reading data and vice versa when writing it.
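
To make those extra steps concrete, here is a small sketch of the conversion in both directions. The raw variable is only a stand-in for a datafile's data attribute, and UTF-8 is assumed as the encoding:

```python
# Stand-in for bytes loaded in binary mode (assumed to be UTF-8 text).
raw = 'na\u00efve caf\u00e9\n'.encode('utf-8')

# Reading: bytes -> str
text = raw.decode('utf-8')

# Writing: str -> bytes
back = text.encode('utf-8')

print(type(raw).__name__, len(raw))     # length in bytes
print(type(text).__name__, len(text))   # length in characters
```

The two lengths differ because the accented characters take two bytes each in UTF-8.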

We can make life easy by defining an almost identical class, but one that reads text files:

001| from examples import filecontent
002| 
003| class textfile (filecontent):
004|     '''Generic Text File class.'''
005| 
006|     def __init__ (self, filename):
007|         ''' New Text File instance.'''
008|         super().__init__(filename)
009|         self.data = str()
010| 
011|     def load (self, encoding='utf-8'):
012|         '''Load text file. (Data will be a single string.)'''
013|         fp = open(self.fullname, mode='r', encoding=encoding)
014|         try:
015|             # Read string from file…
016|             self.data = fp.read()
017| 
018|             # Check for a Unicode BOM (safe for empty files)…
019|             if self.data.startswith('\ufeff'):
020|                 # Remove it if found…
021|                 self.data = self.data[1:]
022|         except Exception as e:
023|             print(e)
024|             raise
025|         else:
026|             print(f'read: {self.fullname}')
027|         finally:
028|             fp.close()
029| 
030|     def save (self, encoding='utf-8', use_bom=False):
031|         '''Save text file. (Data must be a single string.)'''
032|         fp = open(self.fullname, mode='w', encoding=encoding)
033|         try:
034|             # Optional Unicode BOM marker…
035|             if use_bom:
036|                 fp.write('\ufeff')
037| 
038|             # Write string to file…
039|             fp.write(self.data)
040|         except Exception as e:
041|             print(e)
042|             raise
043|         else:
044|             print(f'wrote: {self.fullname}')
045|         finally:
046|             fp.close()
047| 
048|     def __contains__ (self, item): return (item in self.data)
049| 
050| from os import path
051| 
052| BasePath = r'C:\demo\hcc\python'
053| filename = r'the-dream.txt'
054| 
055| fn = path.join(BasePath, filename)
056| fo = textfile(fn)
057| fo.load()
058| 
059| print()
060| print(f'{fo.filename=}')
061| print(f'{len(fo)=}')
062| print(f'{fo[0:60]=}')
063| print(f'{fo[100]=}')
064| print(f'{"house" in fo=}')
065| print(f'{"foobar" in fo=}')
066| nbr_of_spaces = 0
067| for ch in fo:
068|     if ch == ' ':
069|         nbr_of_spaces += 1
070| print(f'{nbr_of_spaces=}')
071| print()
072| 

This is very much like the datafile class, but here the data attribute is a (single) string. One notable change is the addition of an encoding parameter to the load and save methods. This defaults to ‘utf-8’ but can be changed for other file encodings. The method passes the value to the open function.

Associated with the file encoding awareness is the code on lines #18-#21 and lines #34-#36. The former checks for the Unicode BOM (byte order mark) and silently removes it from the string if found. The latter adds the BOM to the output if the use_bom parameter is True.

When run this prints:

read: C:\demo\hcc\python\the-dream.txt

fo.filename='the-dream.txt'
len(fo)=3543
fo[0:60]="I had what HAS to be the longest, strangest dream that I've "
fo[100]='e'
"house" in fo=True
"foobar" in fo=False
nbr_of_spaces=663
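
For UTF-8 files specifically, Python's standard 'utf-8-sig' codec offers another way to deal with the BOM: it skips a leading BOM when reading and writes one when writing. The sketch below only illustrates that behavior with a temporary file; it is an alternative to the manual check in the textfile class, not a replacement the post prescribes:

```python
import os
import tempfile

name = os.path.join(tempfile.mkdtemp(), 'bom-demo.txt')

# Writing with 'utf-8-sig' puts a BOM at the start of the file.
with open(name, mode='w', encoding='utf-8-sig') as fp:
    fp.write('hello')

# Reading with plain 'utf-8' leaves the BOM in the string as U+FEFF...
with open(name, mode='r', encoding='utf-8') as fp:
    with_bom = fp.read()

# ...while reading with 'utf-8-sig' strips it if it is present.
with open(name, mode='r', encoding='utf-8-sig') as fp:
    without_bom = fp.read()

print(repr(with_bom))
print(repr(without_bom))
```

The manual check in the class has the advantage of working the same way whatever encoding the caller passes in.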

These two classes can handle just about any file type, but text files (and binary files, for that matter) are often structured in a regular way we can use to load the file data into a more structured data object.

One simple example is loading line-oriented text files as a list of lines rather than a single string (which avoids the need to use str.splitlines on the textfile.data string). A more involved example is loading a JSON file as the dictionary object it defines.
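
As a quick illustration of the two ideas, here is what the standard library does with an in-memory string in each case. The sample strings are invented for the example:

```python
import json

# A single string, as a textfile-style class would hold it...
text = 'first line\nsecond line\n\nfourth line\n'

# ...split into a list of lines (line endings removed).
lines = text.splitlines()
print(lines)

# A JSON document parsed into the dictionary it defines.
doc = json.loads('{"version": 1, "head": {"title": "JSON Test File"}}')
print(doc['head']['title'])
```

The classes that follow do the same kind of work at load time, so client code starts with the structured object.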


Let’s start simple with a class that loads line-oriented text files as a list of strings:

001| from examples import filecontent
002| 
003| class linefile (filecontent):
004|     '''Line-oriented Text File Class.'''
005| 
006|     def __init__ (self, filename):
007|         ''' New Line File instance.'''
008|         super().__init__(filename)
009|         self.data = []
010| 
011|     def __call__ (self):
012|         '''Calling object returns lines as single string.'''
013|         return '\n'.join(self.data)
014| 
015|     def add (self, txt=''):
016|         '''Adds a new line to the end of the data.'''
017|         self.data.append(txt)
018| 
019|     def load (self, encoding='utf-8'):
020|         '''Load text file. (Data will be a list of strings.)'''
021|         fp = open(self.fullname, mode='r', encoding=encoding)
022|         try:
023|             # Read lines from file…
024|             self.data = [s.rstrip() for s in fp]
025| 
026|             # Check for a Unicode BOM…
027|             if 0 < len(self.data):
028|                 if 0 < len(self.data[0]):
029|                     if self.data[0][0] == '\ufeff':
030|                         self.data[0] = self.data[0][1:]
031|         except:
032|             raise
033|         else:
034|             print(f'read: {self.fullname}')
035|         finally:
036|             fp.close()
037| 
038|     def save (self, encoding='utf-8', use_bom=False):
039|         '''Save text file. (Data must be a list of strings.)'''
040|         fp = open(self.fullname, mode='w', encoding=encoding)
041|         try:
042|             # Optional Unicode BOM marker…
043|             if use_bom:
044|                 fp.write('\ufeff')
045| 
046|             # Write lines to file…
047|             for s in self.data:
048|                 fp.write(s)
049|                 fp.write('\n')
050|         except:
051|             raise
052|         else:
053|             print(f'wrote: {self.fullname}')
054|         finally:
055|             fp.close()
056| 
057| fn = r'C:\demo\hcc\python\the-dream.txt'
058| fo = linefile(fn)
059| fo.load()
060| 
061| print()
062| print(f'{fo.filename=}')
063| print(f'{len(fo)=}')
064| print(f'{fo[3:5]=}')
065| print()
066| for line in fo:
067|     print(line)
068|     if line.startswith('I dreamed'):
069|         break
070| print()
071| 

Now the data attribute is a list, initially empty (line #9). We’re dealing with text files, so we still want the encoding parameter for the load and save methods. We also keep the use_bom parameter for save. (Text file classes generally take these, so it won’t be remarked on again.) The mechanics of the load and save methods are slightly different to deal with a list of strings.

More noticeably, we’ve added two methods. The dunder call method is a convenience method that returns a single string with all the file lines joined by newline characters. The add method provides a convenient way to create a line-oriented file from scratch (see below for some examples of creating new file objects and using the save method).

When run, this prints:

read: C:\demo\hcc\python\the-dream.txt

fo.filename='the-dream.txt'
len(fo)=61
fo[3:5]=['probably as close to a nightmare as I ever come!!', '']

I had what HAS to be the longest, strangest dream that I've had in a long,
long time.  I usually have fun dreams, or weird dreams, sometimes
disturbing dreams - never nightmares.  This dream was very disturbing;
probably as close to a nightmare as I ever come!!

I dreamed I was visiting a film company on location in Cincinnati.  I don't

In all of these file classes, we can create a new instance, set the data attribute, and call the save method to create a new file (or overwrite an existing one). The only requirement is that the data attribute holds the right kind of data (bytes for datafile, a string for textfile, and a list of strings for linefile):

001| from os import path
002| from examples import linefile
003| 
004| BasePath = r'C:\demo\hcc\python'
005| filename = r'2b-or-not2b.txt'
006| 
007| fn = path.join(BasePath, filename)
008| fo = linefile(fn)
009| fo.add("To be, or not to be: that is the question:")
010| fo.add("Whether 'tis nobler in the mind to suffer")
011| fo.add("The slings and arrows of outrageous fortune,")
012| fo.add("Or to take arms against a sea of troubles,")
013| fo.add("And by opposing end them? To die: to sleep;")
014| fo.add("No more; and by a sleep to say we end")
015| fo.add("The heart-ache and the thousand natural shocks")
016| fo.add("That flesh is heir to, 'tis a consummation")
017| fo.add("Devoutly to be wish'd. To die, to sleep;")
018| fo.add("To sleep: perchance to dream: ay, there's the rub;")
019| fo.save(use_bom=True)
020| 

Note that the linefile class doesn’t expose all possible list methods, just the key ones: length, indexing, and iterating. For example, the remove and pop methods aren’t exposed, but clients can access them through the data attribute.
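
The sketch below shows the general idea of exposing only a few list operations while leaving the rest reachable through the data attribute. The linelist class is a hypothetical stand-in written for this illustration; it does no file handling, and the way it delegates length, indexing, and iteration is an assumption, since the filecontent base class isn't shown in this post:

```python
class linelist:
    """Hypothetical stand-in for a linefile-like object (no file I/O)."""

    def __init__(self):
        self.data = []

    # The few list operations exposed directly...
    def __len__(self): return len(self.data)
    def __getitem__(self, ix): return self.data[ix]
    def __iter__(self): return iter(self.data)

    # ...plus the two convenience methods described earlier.
    def __call__(self): return '\n'.join(self.data)
    def add(self, txt=''): self.data.append(txt)

fo = linelist()
for word in ('one', 'two', 'three'):
    fo.add(word)

# Anything not exposed is still available through the data attribute.
last = fo.data.pop()
fo.data.remove('one')

print(len(fo), list(fo), repr(fo()), last)
```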


Because Python has JSON-handling library routines, a class for JSON files is simpler than the linefile class:

001| from json import load as jsonload, dump as jsonsave
002| from examples import filecontent
003| 
004| class jsonfile (filecontent):
005|     '''JSON (Text) File class.'''
006| 
007|     def __init__ (self, filename):
008|         ''' New JSON File instance.'''
009|         super().__init__(filename)
010|         self.data = {}
011| 
012|     def load (self, encoding='utf-8'):
013|         '''Load JSON file. (Data will be a dictionary.)'''
014|         fp = open(self.fullname, mode='r', encoding=encoding)
015|         try:
016|             # Load a JSON file…
017|             self.data = jsonload(fp)
018|         except:
019|             raise
020|         else:
021|             print(f'read: {self.fullname}')
022|         finally:
023|             fp.close()
024| 
025|     def save (self, encoding='utf-8', indent=3):
026|         '''Save JSON file. (Data must be a dictionary.)'''
027|         fp = open(self.fullname, mode='w', encoding=encoding)
028|         try:
029|             # Ensure the version key exists…
030|             if 'version' not in self.data:
031|                 self.data['version'] = 1
032| 
033|             # Write a JSON file…
034|             jsonsave(self.data, fp, indent=indent)
035|         except:
036|             raise
037|         else:
038|             print(f'wrote: {self.fullname}')
039|         finally:
040|             fp.close()
041| 
042| fn = r'C:\demo\hcc\python\example.json'
043| fo = jsonfile(fn)
044| fo.load()
045| 
046| print(fo)
047| print()
048| print(f'{len(fo)=} (# of top-level dicts)')
049| print(f'Top-level dicts: {list(fo.data.keys())}')
050| print()
051| print(f'{fo["version"]=}')
052| print(f'{fo["head"]["title"]=}')
053| print()
054| 

Python does all the work! Note that the data attribute is a dict object now (line #10). The save method doesn’t use the use_bom parameter because the json.load and json.dump functions don’t allow easy access to it. If controlling the BOM is required, the best option is to take over the file read/write operation and use json.loads and json.dumps.

When run, this prints:

read: C:\demo\hcc\python\example.json
C:\demo\hcc\python\example.json (1040 bytes)

len(fo)=4 (# of top-level dicts)
Top-level dicts: ['version', 'head', 'body', 'tail']

fo["version"]=1
fo["head"]["title"]='JSON Test File'
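
If BOM control is needed, one way to take over the read/write step with json.loads and json.dumps might look like the sketch below. It assumes UTF-8 files and leans on the 'utf-8-sig' codec; the function names load_json and save_json are invented for the example and are not part of the jsonfile class:

```python
import json
import os
import tempfile

def load_json(fullname):
    """Read a UTF-8 JSON file, tolerating an optional leading BOM."""
    with open(fullname, mode='r', encoding='utf-8-sig') as fp:
        return json.loads(fp.read())

def save_json(fullname, data, use_bom=False, indent=3):
    """Write a UTF-8 JSON file, optionally starting with a BOM."""
    encoding = 'utf-8-sig' if use_bom else 'utf-8'
    with open(fullname, mode='w', encoding=encoding) as fp:
        fp.write(json.dumps(data, indent=indent))

# Round-trip a small dictionary through a temp file, with a BOM.
name = os.path.join(tempfile.mkdtemp(), 'bom-demo.json')
original = {'version': 1, 'head': {'title': 'JSON Test File'}}
save_json(name, original, use_bom=True)
restored = load_json(name)

# Peek at the first three raw bytes to confirm the BOM was written.
with open(name, mode='rb') as fp:
    first_bytes = fp.read(3)

print(restored == original, first_bytes)
```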

Note that in many use cases the client code for these file classes doesn’t need to know about the data attribute. The exception is creating a new file with code-supplied data. We can use methods that hide the data attribute (for instance, linefile.add), but otherwise we just write our output directly to the data attribute:

001| from examples import jsonfile
002| 
003| fn = r'C:\demo\hcc\python\output.json'
004| fo = jsonfile(fn)
005| 
006| fo.data = {'artist':{}, 'album':{}}
007| fo['artist']['firstname'] = "Joanne"
008| fo['artist']['midname'] = "Shaw"
009| fo['artist']['lastname'] = "Taylor"
010| fo['album']['name'] = "Diamonds in the Dirt"
011| fo['album']['date'] = 2010
012| fo['album']['length'] = [0, 43, 21]
013| fo['album']['songs'] = []
014| 
015| fo.save()
016| 

Note that the examples module used in the import statements here is assumed to contain whatever previously mentioned code is necessary for the import to succeed. Which means the contents must be a little fluid. For example, the jsonfile class requires filecontent (which requires fileobj), but the example above assumes the jsonfile class itself is defined in the examples module (which clearly it couldn’t be if its definition imports the module).

Alternatively, all the code here could go in one source code file with all imports of “examples” deleted. Just keep it in mind if you use any of these “as is”.


This last class loads various kinds of script files into a list that has been stripped of blank lines and lines that start with a specified comment string:

001| from examples import linefile
002| 
003| class srcfile (linefile):
004|     '''Source Code File class. Auto-skips blank lines and comments.'''
005|     def __init__ (self, filename, rems=None):
006|         super().__init__(filename)
007|         self.rems = rems if rems else [';', '#', '//', '--']
008|         self.src = []
009| 
010|     def load (self):
011|         '''Load Source Code file. (Code lines will be in a list.)'''
012|         # Delegate loading lines of text to parent…
013|         super().load()
014|         self.src = []  # Reset so a repeated load doesn't duplicate lines
015|         # Scan file lines for blank lines and comments…
016|         for line in self.data:
017|             # Ignore leading and trailing spaces…
018|             test = line.strip()
019| 
020|             # Skip blank lines…
021|             if len(test) == 0:
022|                 continue
023| 
024|             # Skip lines beginning with rem char…
025|             rem_flag = 0
026|             for rem in self.rems:
027|                 if test.startswith(rem):
028|                     rem_flag = 1
029|                     break
030|             if rem_flag:
031|                 continue
032| 
033|             # Add the line to the src code…
034|             self.src.append(line)
035| 
036|     def save (self):
037|         raise NotImplementedError()
038| 
039| fn = r'C:\demo\hcc\python\vim.txt'
040| fo = srcfile(fn, rems=['"'])
041| fo.load()
042| 
043| for ix,line in enumerate(fo.src):
044|     print(f'{ix+1:02d}: {line}')
045| print()
046| 

Note that the save method has been disabled. This class treats script files as read-only.

Lines #39-#45 illustrate a simple use case, reading a vim file. In such files, the comment character is the double-quote (hence line #40).
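
The filtering step in srcfile.load can be tried on its own, without a file. The sketch below pulls the same logic into a standalone function (strip_comments is a made-up name) and uses the fact that str.startswith accepts a tuple of prefixes, which replaces the inner loop and flag. The sample lines are invented:

```python
def strip_comments(lines, rems=(';', '#', '//', '--')):
    """Return lines that are neither blank nor start with a comment prefix."""
    prefixes = tuple(rems)   # startswith accepts a tuple, not a list
    src = []
    for line in lines:
        # Ignore leading and trailing spaces when testing...
        test = line.strip()
        # ...and skip blank lines and comment lines.
        if not test or test.startswith(prefixes):
            continue
        src.append(line)
    return src

sample = [
    '" a vim comment',
    '',
    'set number',
    '   " an indented comment',
    'syntax on',
]
result = strip_comments(sample, rems=['"'])
print(result)
```

Note that, like the class, this only recognizes comments that begin a line; a trailing comment after code is kept as part of the line.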

The srcfile class is just one example of format-specific file classes. Another easy one is the traditional INI file. A harder one is an XML file. Regardless, it’s helpful to encapsulate access to common file types. All these (minus the demonstration code) can be put into a module file so you can import whatever file classes you need in your work.
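
As a rough idea of how the INI case might start, the sketch below uses the standard configparser module to load an INI file into a plain dictionary of sections. It is only a starting point under stated assumptions: load_ini is a hypothetical helper rather than a class built on filecontent, the sample file content is invented, and all values come back as strings:

```python
import configparser
import os
import tempfile

def load_ini(fullname, encoding='utf-8'):
    """Load an INI file as {section: {key: value}} (values are strings)."""
    parser = configparser.ConfigParser()
    with open(fullname, mode='r', encoding=encoding) as fp:
        parser.read_file(fp)
    return {name: dict(parser[name]) for name in parser.sections()}

# Create a small INI file to read back.
name = os.path.join(tempfile.mkdtemp(), 'demo.ini')
with open(name, mode='w', encoding='utf-8') as fp:
    fp.write('[album]\nname = Diamonds in the Dirt\nyear = 2010\n')

settings = load_ini(name)
print(settings)
```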

That’s all for this time.


Link: Zip file containing all code fragments used in this post.