The political situation in the USA has dampened my mood and crashed my interest in this Simple Tricks series, but in hope of getting at least one post out this month (and on the very last day, no less), I’m going to revisit two topics I’ve written about before.
Back in Issue #8 of this series, I wrote about formatted output (to screen or file), especially using format strings (“f-strings”) and the format function and built-in method. In this issue, I’ll revisit the latter for a more sophisticated example. I also have some goodies for the str.translate method.
In the following, I’m going to assume readers are familiar with the topics covered in Simple Python Tricks #8, particularly with regard to f-strings and the built-in object.__format__ method. The simple example class here leverages both.
[See Format String Syntax for details about format specifications.]
WARNING: this class also uses a rightly dreaded regular expression trick, but sometimes — as you’ll see — a regex fits the application too well not to use.
The design requirement is a very basic Person class with attributes for first and last name. We’ll choose f-strings to enable a variety of formatted outputs. We want to control how many characters of the first and last name are displayed, and we want to be able to display optional characters after either. For example, if the name is “John Smith”, we want to be able to display:
- John Smith
- John
- Smith
- J. Smith
- JS
- J.S.
- J. S.
- JoSmith:
To accomplish this, we’ll use a format spec with the following syntax (spaces included here for clarity — spaces are not allowed except as characters in the <sep> and <end> fields):
<num> f <sep> <num> l <end>
The <num> fields before the literal ‘f’ (for first name) and literal ‘l’ (for last name) are required and determine how many characters we want of the first or last name. If we want to suppress the first or last name, set <num> to zero. If we want the entire name, set <num> to the asterisk (‘*’).
The <sep> and <end> fields are zero or more characters for, respectively, what separates the first and last name, and what follows the last name.
The format specs to accomplish the above list of outputs are:
- *f *l ⇒ John Smith
- *f0l ⇒ John
- 0f*l ⇒ Smith
- 1f. *l ⇒ J. Smith
- 1f1l ⇒ JS
- 1f.1l. ⇒ J.S.
- 1f. 1l. ⇒ J. S.
- 2f*l: ⇒ JoSmith:
It’s certainly possible to parse the format spec “by hand”, so to speak, but this situation almost screams for a regular expression to pick out the various fields. But as I wrote in my Regular Expressions post:
It has been said, “If you have a problem — and you solve it with a regular expression — now you have two problems.”
And, indeed, this is typically the case. The trick is to thoroughly document the use. The reason to use a regex here is that, at least in my view, parsing the format spec by hand leads to code at least as complicated — if not more so — as the regex.
Here is the regular expression that parses the above format syntax (again with spaces between elements for clarity):
^ (\d+|\*) f (.*) (\d+|\*) l (.*) $
There are eight elements in this regex. From left to right:
- ^ : the caret symbol anchors to the start of the text.
- (\d+|\*) : either one or more digits or the literal * symbol.
- f : the literal (lowercase) f character.
- (.*) : any number of characters (including none).
- (\d+|\*) : either one or more digits or the literal * symbol.
- l : the literal (lowercase) l character.
- (.*) : any number of characters (including none).
- $ : the dollar symbol anchors to the end of the text.
The fields enclosed with parentheses are captured by the regex mechanism for our use later in the code. If this regex succeeds in matching the entire format string from beginning to end, then the format string is valid. (If we parsed by hand, we’d have to explicitly check for the required f and l characters and their required numeric (or *) prefixes. The regex does this for us.)
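As a quick check of the description above, here is a minimal sketch that runs the pattern against a few specs and prints the captured fields. The helper name parse_spec and the LAZY variant are my own assumptions, not part of the original code. The last two lines show something worth knowing: because the <sep> group is greedy, a multi-digit count for the last name can lose its leading digits to <sep>. Making that one group lazy is one possible way around it.

```python
import re

# The pattern from the text, as a raw string so the backslashes reach re intact.
GREEDY = re.compile(r'^(\d+|\*)f(.*)(\d+|\*)l(.*)$')

# Assumption (not in the original): a lazy <sep> group, one possible variation.
LAZY = re.compile(r'^(\d+|\*)f(.*?)(\d+|\*)l(.*)$')

def parse_spec(spec, pattern=GREEDY):
    """Return the four captured fields, or None if the spec does not match."""
    mobj = pattern.match(spec)
    return mobj.groups() if mobj else None

print(parse_spec('*f *l'))          # ('*', ' ', '*', '')
print(parse_spec('1f. 1l.'))        # ('1', '. ', '1', '.')
print(parse_spec('fl'))             # None (the counts are required)
print(parse_spec('12f 34l'))        # ('12', ' 3', '4', '')  greedy <sep> takes the '3'
print(parse_spec('12f 34l', LAZY))  # ('12', ' ', '34', '')
```

For the single-digit and asterisk specs used in this post, both patterns give the same fields.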
Now let’s take a look at the code for the simple Person class (I mean, of course, that the class is simple):
001| from re import compile
002|
003| class Person:
004|     '''Trivial Person class but with formatting!'''
005|
006|     regx = compile(r'^(\d+|\*)f(.*)(\d+|\*)l(.*)$')
007|     #   beginning of string
008|     #   match group 0: digits or literal '*'
009|     #   a literal 'f'
010|     #   match group 1: characters
011|     #   match group 2: digits or literal '*'
012|     #   a literal 'l'
013|     #   match group 3: characters
014|     #   end of string
015|
016|     def __init__ (self, fname, lname):
017|         self.fname = fname
018|         self.lname = lname
019|
020|     def __repr__ (self):
021|         return f'{self.fname} {self.lname}'
022|
023|     def __format__ (self, fstr):
024|         '''Format template: <num>f<sep><num>l<end>'''
025|         if not fstr:
026|             return str(self)
027|
028|         # Attempt to match the format string…
029|         mobj = self.regx.match(fstr)
030|         if mobj is None:
031|             # Failed to match pattern…
032|             raise ValueError(f'Illegal format spec: "{fstr}"')
033|
034|         grps = mobj.groups()
035|         if len(grps) != 4:
036|             # Failed to find all four items we need…
037|             raise ValueError(f'Invalid format spec: "{fstr}"')
038|
039|         # Setup the output…
040|         fnx = len(self.fname) if grps[0]=='*' else int(grps[0])
041|         lnx = len(self.lname) if grps[2]=='*' else int(grps[2])
042|         sep = grps[1]
043|         end = grps[3]
044|
045|         # Return formatted string…
046|         return self.fname[:fnx]+sep+self.lname[:lnx]+end
047|
We start on line #1 by importing the compile function from the re (regular expression) module. We use that function on line #6 to create a class variable (regx) that we’ll use to parse any incoming format spec. Note the detailed explanation of each element of the regular expression (lines #7 to #14).
Our __init__ method (lines #16 to #18) takes two required positional parameters, fname and lname (first name and last name). It uses them to initialize the matching instance attributes.
Because we should always implement “toString”, we provide a __repr__ method (lines #20 and #21) that returns the string: "{fname} {lname}". (We implement __repr__ rather than __str__ so that we get that string in both str and repr contexts.)
Lastly, we implement the __format__ method (lines #23 to #46) to handle format strings with the syntax described above.
In that method, the first thing we do (lines #25 and #26) is check whether the format spec (fstr) is empty, in which case we default to the "{fname} {lname}" string, same as we get in an ordinary string context.
Next (line #29) we attempt a regex match on the (non-blank) format spec. If the match fails, it returns None. We check for this (line #30) and raise an exception (line #32) if so.
Now we know mobj is a valid match object (so the match succeeded), but I include a check for all expected captured elements and raise an exception if we didn’t get four (lines #34 to #37). NOTE: I haven’t checked how necessary this actually is. I think that a valid match means all groups must be present, but, as I say, I haven’t tested it. Consider it a reader’s exercise and get back to me.
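For anyone taking up that exercise, here is a small sketch of how groups() behaves. As far as I know, the tuple always has one entry per capture group in the pattern, and a group that takes no part in a match shows up as None rather than being left out. The second pattern (opt) is a made-up example to show that, not something from the class.

```python
import re

regx = re.compile(r'^(\d+|\*)f(.*)(\d+|\*)l(.*)$')

# The compiled pattern reports how many capture groups it defines.
print(regx.groups)                   # 4

# Empty <sep> and <end> fields still produce entries (empty strings).
print(regx.match('1f1l').groups())   # ('1', '', '1', '')

# Hypothetical pattern with an optional group: the unused group is None,
# so the tuple length still equals the number of groups in the pattern.
opt = re.compile(r'(a)(b)?')
print(opt.match('a').groups())       # ('a', None)
```

If that holds, the length check on lines #35 to #37 can never fire for this pattern, but it does no harm.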
In the process we get grps, a tuple of the four captured elements. All that remains is a bit of processing for the desired parts of the first and last names (lines #40 and #41) and to use whatever characters were captured for <sep> and <end> (lines #42 and #43).
At the end (line #46), we return the formatted name.
§
Now that we have our Person class, let’s test it:
002|
003| tests = [
004|     '*f *l',
005|     '*f0l',
006|     '0f*l',
007|     '1f. *l',
008|     '1f1l',
009|     '1f.1l.',
010|     '1f. 1l.',
011|     '2f*l:',
012| ]
013|
014| p = Person('John', 'Smith')
015|
016| for ix,f in enumerate(tests):
017|     print(f'{ix:d}: f-string = {f:7s} => {p:{f}}')
018|
The tests list (lines #3 to #12) contains the eight example format specs we’ll test (feel free to add more). These are the same eight format specs listed above, and we expect the same results also listed above.
Line #14 creates a new Person object (named “John Smith”).
Lines #16 and #17 comprise a loop that tests each of the eight format specs.
When run, this prints:
0: f-string = *f *l   => John Smith
1: f-string = *f0l    => John
2: f-string = 0f*l    => Smith
3: f-string = 1f. *l  => J. Smith
4: f-string = 1f1l    => JS
5: f-string = 1f.1l.  => J.S.
6: f-string = 1f. 1l. => J. S.
7: f-string = 2f*l:   => JoSmith:
It works!
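The same format spec should also work through the other entry points mentioned at the top of the post, the built-in format function and str.format, since both end up calling __format__. Here is a sketch using a condensed copy of the class (trimmed only to keep the example self-contained), plus a look at what an invalid spec does:

```python
import re

class Person:
    """Condensed copy of the class above, for a self-contained demo."""
    regx = re.compile(r'^(\d+|\*)f(.*)(\d+|\*)l(.*)$')

    def __init__(self, fname, lname):
        self.fname, self.lname = fname, lname

    def __repr__(self):
        return f'{self.fname} {self.lname}'

    def __format__(self, fstr):
        if not fstr:
            return str(self)
        mobj = self.regx.match(fstr)
        if mobj is None:
            raise ValueError(f'Illegal format spec: "{fstr}"')
        n1, sep, n2, end = mobj.groups()
        fnx = len(self.fname) if n1 == '*' else int(n1)
        lnx = len(self.lname) if n2 == '*' else int(n2)
        return self.fname[:fnx] + sep + self.lname[:lnx] + end

p = Person('John', 'Smith')
print(format(p, '1f. *l'))           # J. Smith
print('Dear {:*f *l},'.format(p))    # Dear John Smith,
print(f'{p}')                        # John Smith (empty spec)
try:
    format(p, 'bogus')
except ValueError as err:
    print(err)                       # Illegal format spec: "bogus"
```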
A few years ago, in Python String Translate, I explored the str.translate method. As above, I’ll assume the reader is familiar with the material covered there. Some of what’s included here was included there, but I also have a few new ones.
Let’s start with some basic definitions we’ll need:
002| EOL = ord('\n')
003| SPC = ord(' ')
004|
005| UCA,UCZ = ord('A'),ord('Z')
006| LCA,LCZ = ord('a'),ord('z')
007| DIG0,DIG9 = ord('0'),ord('9')
008|
Lines #1 to #3 define some common characters, and lines #5 to #7 define the “end points” of the uppercase letters (‘A’-‘Z’), the lowercase letters (‘a’-‘z’), and the digits (‘0’-‘9’).
Let’s put these to immediate use with a simple class for converting text to ROT13:
002|
003| class Rot13:
004|     """Convert text and numbers to ROT13 versions."""
005|
006|     def __getitem__ (self, key):
007|         """Return ROT13 versions of alnums."""
008|         if UCA <= key <= UCZ: return UCA+(((key-UCA)+13)%26)
009|         if LCA <= key <= LCZ: return LCA+(((key-LCA)+13)%26)
010|         if DIG0 <= key <= DIG9: return DIG0+(((key-DIG0)+5)%10)
011|         return key
012|
013| if __name__ == '__main__':
014|     txt = 'The quick brown fox jumped over the lazy dog.'
015|     num = 'Invoice Number 03483-25481'
016|
017|     r13 = Rot13()
018|
019|     out1 = txt.translate(r13)
020|     out2 = num.translate(r13)
021|
022|     print(f'txt = "{txt}"')
023|     print(f'txt > "{out1}"')
024|     print(f'out > "{out1.translate(r13)}"')
025|     print()
026|
027|     print(f'num = "{num}"')
028|     print(f'num > "{out2}"')
029|     print(f'out > "{out2.translate(r13)}"')
030|     print()
031|
I presented a similar class in the post three years ago, so it’s not new, but the code objects here are more suited for your toolkit.
When run, this prints:
txt = "The quick brown fox jumped over the lazy dog."
txt > "Gur dhvpx oebja sbk whzcrq bire gur ynml qbt."
out > "The quick brown fox jumped over the lazy dog."

num = "Invoice Number 03483-25481"
num > "Vaibvpr Ahzore 58938-70936"
out > "Invoice Number 03483-25481"
Note how the ROT13 process both encodes and decodes.
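If you’d rather have a fixed lookup table than a class that computes each character on demand, str.maketrans can build one up front. This is an alternative sketch, not the class above: the name ROT13_TABLE is my own, and the digit rotation (by five) is included only to match the Rot13 class. The standard library’s rot_13 codec is shown for comparison; it rotates letters but, as far as I know, leaves digits alone.

```python
import codecs
from string import ascii_lowercase as lc, ascii_uppercase as uc, digits

# Hypothetical name: a fixed table built once with str.maketrans.
ROT13_TABLE = str.maketrans(
    lc + uc + digits,
    lc[13:] + lc[:13] + uc[13:] + uc[:13] + digits[5:] + digits[:5],
)

txt = 'The quick brown fox jumped over the lazy dog.'
num = 'Invoice Number 03483-25481'

print(txt.translate(ROT13_TABLE))    # Gur dhvpx oebja sbk whzcrq bire gur ynml qbt.
print(num.translate(ROT13_TABLE))    # Vaibvpr Ahzore 58938-70936
print(codecs.encode(txt, 'rot_13'))  # same as the first line (letters only)
```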
I also presented the following two classes for converting regular text (with problematic characters) to, respectively, URLs and HTML:
002|
003| class CleanUrl (dict):
004|     """Translate for URL: Accept alnums (others to %## hex)."""
005|
006|     def __init__ (self):
007|         """Initialize instance with special chars."""
008|         super().__init__()
009|         self[SPC] = '+'
010|
011|     def __getitem__ (self, key):
012|         """Explicitly pass alnums. Delegate others."""
013|         if UCA <= key <= UCZ: return key
014|         if LCA <= key <= LCZ: return key
015|         if DIG0 <= key <= DIG9: return key
016|         return super().__getitem__(key)
017|
018|     def __missing__ (self, key):
019|         """Convert non-special chars to %## hex."""
020|         return '%%%02X' % key
021|
022|
023| class Text2Html (dict):
024|     """Convert &, <, >, " to HTML entities."""
025|
026|     def __init__ (self):
027|         """Initialize instance with special chars."""
028|         super().__init__()
029|         self[ord('&')] = '&amp;'
030|         self[ord('<')] = '&lt;'
031|         self[ord('>')] = '&gt;'
032|         self[ord('"')] = '&quot;'
033|
034|     def __missing__ (self, key):
035|         """Allow all other chars."""
036|         return key
037|
038|
039| if __name__ == '__main__':
040|     url = r'https://foo.bar.com/base/path/sub/path?a = 21&b = <4> & c=zap'
041|     htm = r'He said, <em>"Why not?"</em>, & this & that.'
042|
043|     txt2url = CleanUrl()
044|     txt2htm = Text2Html()
045|
046|     print(f'url = "{url}"')
047|     print(f'url > "{url.translate(txt2url)}"')
048|     print()
049|
050|     print(f'htm = "{htm}"')
051|     print(f'htm > "{htm.translate(txt2htm)}"')
052|     print()
053|
These should be self-explanatory. Both classes subclass the dict class and use the __init__ method to initialize the dictionary with the desired character maps. Because these maps are so simple, we set each character “by hand”.
In CleanUrl (lines #3 to #20), we use the __getitem__ method (lines #11 to #16) similarly to how we did in the Rot13 class, but rather than returning key if not matched, we call the dict superclass to handle it. This opens the potential for a client to add additional mappings. We use the __missing__ method (lines #18 to #20) to catch anything the superclass doesn’t match (in which case we return a coded hex representation of the character).
In Text2Html, we again use the __init__ method (lines #26 to #32) to create the map, and since this suffices for the characters that we’re interested in, we only need the __missing__ method (lines #34 to #36) to catch unmapped characters.
When run, this prints:
url = "https://foo.bar.com/base/path/sub/path?a = 21&b = <4> & c=zap"
url > "https%3A%2F%2Ffoo%2Ebar%2Ecom%2Fbase%2Fpath%2Fsub%2Fpath%3Fa+
%3D+21%26b+%3D+%3C4%3E+%26+c%3Dzap"
htm = "He said, <em>"Why not?"</em>, & this & that."
htm > "He said, &lt;em&gt;&quot;Why not?&quot;&lt;/em&gt;, &amp;
this &amp; that."
(Manually wrapped the too-long output lines.)
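For comparison, the standard library covers similar ground. The sketch below rebuilds the four Text2Html mappings as a plain str.maketrans table (HTML_TABLE is a name I made up) and checks the result against html.escape. The two agree for this input; html.escape also converts the single quote, so they would differ on text containing one. On the URL side, urllib.parse.quote_plus plays a similar role to CleanUrl but leaves a few punctuation characters such as the period unescaped, so its output is not identical.

```python
import html
from urllib.parse import quote_plus

# Hypothetical name: the same four mappings as Text2Html, as a plain table.
HTML_TABLE = str.maketrans({
    '&': '&amp;', '<': '&lt;', '>': '&gt;', '"': '&quot;',
})

htm = 'He said, <em>"Why not?"</em>, & this & that.'
out = htm.translate(HTML_TABLE)

print(out)
print(out == html.escape(htm))   # True for this input
print(quote_plus('a b.c/d'))     # a+b.c%2Fd  (note the period is kept)
```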
Here are two new classes, one for removing specified characters, and one for filtering only specified characters:
001| class CharStripper:
002|     """Strip characters in mask from string."""
003|
004|     def __init__ (self, mask):
005|         """Initialize with mask chars."""
006|         self.m = [ord(char) for char in mask]
007|
008|     def __getitem__ (self, key):
009|         """Strip chars in mask, allow all others."""
010|         return (None if key in self.m else key)
011|
012|
013| class CharFilter (dict):
014|     """Filter for characters in mask, otherwise strip."""
015|
016|     def __init__ (self, mask):
017|         """Initialize self with mask chars."""
018|         super().__init__()
019|         for char in mask:
020|             self[ord(char)] = ord(char)
021|
022|     def __missing__ (self, key):
023|         """Strip chars not in map."""
024|         return None
025|
026|
027| if __name__ == '__main__':
028|     txt = 'The quick brown fox jumped over the lazy dog.'
029|
030|     cs = CharStripper('aeiou y')
031|     cf = CharFilter('aeiouy')
032|
033|     print(f'txt = "{txt}"')
034|     print(f'txt > "{txt.translate(cs)}"')
035|     print(f'txt > "{txt.translate(cf)}"')
036|     print()
037|
Both classes require a string of characters to strip or filter — these classes are user-configurable.
The CharStripper class is standalone and initializes the m attribute (m for mask) with the ordinal numbers of the characters to strip. It uses the __getitem__ method to explicitly check for those values (remember that translate deals only with character ordinals). Returning None tells translate to skip the character.
The CharFilter class (lines #13 to #24) subclasses the dict class and uses a for loop to initialize the map. The __missing__ method returns None, so any characters not found in the map are skipped.
When run, this prints:
txt = "The quick brown fox jumped over the lazy dog."
txt > "Thqckbrwnfxjmpdvrthlzdg."
txt > "euiooueoeeayo"
The stripper removes all vowels (and spaces); the filter passes only the vowels.
One can create an explicit map by creating a list that includes outputs for all characters in ordinal order:
002|
003| CMAP = [None for _ in range(32)] \
004|      + ['_'] + ['$' for _ in range(15)] \
005|      + [DIG0+c for c in range(10)] + ['$' for _ in range(6)] \
006|      + ['$'] + [UCA+c for c in range(26)] + ['$' for _ in range(5)] \
007|      + ['$'] + [LCA+c for c in range(26)] + ['$' for _ in range(5)] \
008|      + ['?']
009|
010| if __name__ == '__main__':
011|     txt = 'The quick brown fox jumped over the lazy dog.'
012|
013|     print(f'txt = "{txt}"')
014|     print(f'txt > "{txt.translate(CMAP)}"')
015|     print()
016|
Doing this requires a knowledge of ASCII (or Unicode for a much bigger but more inclusive map). The CMAP list (lines #3 to #8) maps ASCII 0x00 to 0x80. The digits and alphas are mapped to themselves, the space character is mapped to ‘_’, any control characters (such as TAB or NEWLINE) are mapped to None (and thus filtered out), and any other characters in the range are mapped to ‘$’.
The behavior of translate with such a list is to pass characters with ordinal numbers beyond the list length (so all Unicode characters from U+0081 or greater are passed).
When run, this prints:
txt = "The quick brown fox jumped over the lazy dog."
txt > "The_quick_brown_fox_jumped_over_the_lazy_dog$"
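The pass-through behavior for ordinals beyond the end of the list is easy to confirm with a much shorter list. In this sketch the 33-entry table is a made-up example: it only covers the control characters and the space, so every letter, ASCII or not, falls off the end and comes through unchanged.

```python
# Hypothetical short table: drop ASCII controls, turn space into '_'.
table = [None] * 32 + ['_']

print('a b\tc\n'.translate(table))     # a_bc
print('naïve café'.translate(table))   # naïve_café
```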
Lastly, here’s a little reward for reading this far:
001| UpsideDown = {
002|     # punctuation !, &, ?
003|     33:0x00a1, 38:0x214b, 63:0x00bf,
004|
005|     # digits 0 to 9
006|     48:0x0030, 49:0x21c2, 50:0x218a, 51:0x218b, 52:0x07c8,
007|     53:0x100c, 54:0x0039, 55:0x1d613, 56:0x0038, 57:0x0036,
008|
009|     # upper case alphas A to Z
010|     65:0x2c6f, 66:0xa4ed, 67:0x0186, 68:0xa4f7, 69:0x018e,
011|     70:0x2132, 71:0x2141, 72:0x0048, 73:0x0049, 74:0xa4e9,
012|     75:0xa7b0, 76:0xa780, 77:0xa7fd, 78:0x004e, 79:0x004f,
013|     80:0x0500, 81:0xa779, 82:0xa4e4, 83:0x0053, 84:0xa7b1,
014|     85:0x0548, 86:0x0245, 87:0x10935, 88:0x0058, 89:0x2144,
015|     90:0x005a,
016|
017|     # lower case alphas a to z
018|     97:0x0250, 98:0x0071, 99:0x0254, 100:0x0070, 101:0x01dd,
019|     102:0x025f, 103:0x0253, 104:0x0265, 105:0x1d09, 106:0x017f,
020|     107:0x029e, 108:0xa781, 109:0x026f, 110:0x0075, 111:0x006f,
021|     112:0x0064, 113:0x0062, 114:0x0279, 115:0x0073, 116:0x0287,
022|     117:0x006e, 118:0x028c, 119:0x028d, 120:0x0078, 121:0x028e,
023|     122:0x007a,
024| }
025|
026| if __name__ == '__main__':
027|     txt = 'The quick brown fox jumped over the lazy dog.'
028|
029|     print(f'txt = "{txt}"')
030|     print(f'txt > "{txt.translate(UpsideDown)}"')
031|     print()
032|
We have an explicit dictionary mapping three punctuation characters (!, &, and ?), the ten digits, and the 52 upper- and lower-case alphas to Unicode characters that come close to being upside-down versions of those characters.
When run, this prints:
txt = "The quick brown fox jumped over the lazy dog."
txt > "Ʇɥǝ bnᴉɔʞ qɹoʍu ɟox ſnɯdǝp oʌǝɹ ʇɥǝ ꞁɐzʎ poɓ."
Which is fun but fairly useless.
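One small extension, offered as a sketch: the translation swaps each character in place, so the letters stay in their original order. If the aim is text that reads correctly when the page is physically rotated, the translated string presumably also needs reversing. The three-entry map here (mini) is a hypothetical subset of the table above, just enough for one word.

```python
# Hypothetical three-entry subset of the table above, just for the demo.
mini = {ord('d'): 0x0070, ord('o'): 0x006f, ord('g'): 0x0253}

word = 'dog'
flipped = word.translate(mini)   # per-character swap only
rotated = flipped[::-1]          # reversed, so it reads in order once turned around

print(flipped)
print(rotated)
```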
But these examples should give you some useful tools and/or ideas.
That’s all for now.
Link: Zip file containing all code fragments used in this post.
∅
This post is: Simple Python Tricks #15