Python str instances have many useful methods. I use strip, split, startswith, and others quite a lot, for instance. One method I haven’t had reason to use so far is translate. It takes a dictionary argument and uses it to map the existing string to a new string.
It’s flexible and useful, so it’s worth knowing how to use.
It’s a built-in method of the built-in str class, so using it doesn’t require importing anything (let alone installing anything). A typical use case is something like:
s_trans = s_orig.translate(table)
Where s_orig is the string to translate, table is a dictionary of character translations, and s_trans is the translated result string.
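As a quick illustration of that call, here is a minimal sketch. The sample string and the table contents are made up for the example; only the call shape comes from the line above.

```python
# Hypothetical sample data, chosen only to show the call shape.
s_orig = "hello world"
table = {ord("l"): "L", ord("o"): "0"}   # keys are code points, not characters

s_trans = s_orig.translate(table)
print(s_trans)  # heLL0 w0rLd
```

Note that the keys are built with ord, because translate looks characters up by their ordinal number.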
The key is the table parameter, which translate expects to be an object implementing the __getitem__ method. Canonically, it would be a dict object or something that subclasses dict, but it can also be any user-defined object that implements the __getitem__ method.
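To make the "any object with __getitem__" point concrete, here is a small sketch. The class name UpperVowels and its behavior are invented for illustration; the only thing it relies on is that translate calls __getitem__ with each character's ordinal number.

```python
class UpperVowels:
    """Hypothetical table object: uppercases lowercase ASCII vowels."""

    def __getitem__(self, key):
        ch = chr(key)              # key is the character's ordinal number
        if ch in "aeiou":
            return ch.upper()      # a string value replaces the character
        raise LookupError(key)     # "no entry": character passes through


result = "translate this".translate(UpperVowels())
print(result)  # trAnslAtE thIs
```

The object is not a dict and stores nothing; it just answers lookups.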
The translate method behaves roughly as if it was written like this:
def sim_translate(s_orig, t_map):  # function name is illustrative
    """Simulation of str.translate() method."""
    buf = []
    for char in str(s_orig):
        cnum = ord(char)
        try:
            cout = t_map[cnum]
        except LookupError:
            # No entry for this character: pass it through unchanged.
            cout = cnum
        if cout is None:
            # A None entry removes the character from the result.
            continue
        sout = cout if isinstance(cout, str) else chr(cout)
        buf.append(sout)
    return ''.join(buf)
It takes an original string (s_orig) and a translation map (t_map) and returns a new string that is a translation of the original. Note that it uses the ordinal number of each character to index the map. This allows using sequences (instead of dictionaries) as maps.
Note, however, that a string or list used this way must be longer than the ordinal value of the highest character to translate, because that ordinal is used as the index. Assuming one is translating strings with only the basic ASCII characters, this means a list with 128 items.
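Here is a minimal sketch of such a 128-item list. The two edited slots are arbitrary choices for the example; every other slot simply maps a code point to itself.

```python
# Identity table for basic ASCII: slot n holds code point n.
table = list(range(128))
table[ord(" ")] = "_"     # replace spaces with underscores
table[ord("-")] = None    # remove hyphens

out = "a-b c".translate(table)
print(out)  # ab_c
```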
§
That translate takes lists, even strings, along with the more usual dictionary objects might lead to an experiment something like this:
01| s = '''\
01| Samantha Ann Bear
01| 13924 Schulte Way
01| Apt. 2402
01| Forest City, CA 92508
01| [code:QRTD8945-2740-21]'''
02|
03| ts = [
04|     s.translate({}),
05|     s.translate({32:32}),
06|     s.translate({32:'_'}),
07|     s.translate({32:None}),
08|
09|     s.translate(''),
10|     s.translate('aeiou'),
11|
12|     s.translate([]),
13|     s.translate([0,1,2,3,4,5,6,7,8,9]),
14|
15|     s.translate(()),
16|     s.translate(('','a','b','c','d')),
17| ]
18|
19| for t in ts:
20|     print(t)
21|     print()
22|
23| print()
Which gives results that might at first surprise. In almost all cases, the translated string is the same as the source string. The two exceptions are the third and fourth translations (from lines #6 and #7 respectively). The first one converts all spaces to underlines, the second removes them entirely.
The others seem to have no effect. In fact, some of them actually would alter the string if it contained different text. The one from line #10 would alter control chars NUL and ^A through ^D to, respectively, ‘a’, ‘e’, ‘i’, ‘o’, and ‘u’ (which would be weird). The one from line #16 would again alter NUL and ^A through ^D, but to the respective values in the tuple: NUL to the empty string (which removes it) and ^A through ^D to ‘a’ through ‘d’ (which would make at least some sense).
The example on line #13 maps NUL and control chars ^A through ^I to themselves, so it ends up not altering the string even though a translation does occur if those characters are in the string. They’re just translated to themselves.
All examples taking empty containers (lines #4, #9, #12, #15) pass all characters in any string through unchanged. For this string, the examples on lines #12 and #13 are effectively the same. So is this:
s.translate(list(range(128)))
When it comes to sequences, translate tries to index the nth item of the sequence, where n is the ordinal number of the current character. For example, the ordinal number of an uppercase ‘A’ is 65, so translate looks up the item at index 65 (counting from zero, of course, so in fact the 66th in the list).
When translate indexes a sequence, there are three possible outcomes:
- Character’s ordinal number is not a valid index (it equals or exceeds the list size). The lookup fails because the list is too short. If the list is empty, all lookups fail this way. In this case, translate passes the character to the result string unchanged.
- The indexed list slot has the value None. In this case, translate skips the character and it does not appear in the result string.
- The indexed list slot has an integer or string value. In this case, translate replaces the original character with the string or with the character value of the integer.
When given a dictionary, translate uses the character’s ordinal number as a key to an item in the dictionary. There are, again, three possible outcomes:
- Character’s ordinal number is not a key in the dictionary. In this case, translate passes the character to the result unchanged. (Similar to not finding the indexed slot in a list.)
- The ordinal number is a key in the dictionary and has a value of None. As with a sequence, translate skips the character; it does not appear in the result string.
- The ordinal number is a key in the dictionary and has an integer or string value. Again, translate replaces the original character with the string or with the character value of the integer.
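The three outcomes are easy to see side by side. This is a small sketch with arbitrary table values (none of them come from the examples above): one sequence table and one dictionary table, each hitting all three cases.

```python
# Sequence table: slot n answers for code point n.
seq_table = ["x", None, 66]                  # covers code points 0, 1, 2 only
seq_out = "\x00\x01\x02A".translate(seq_table)
# \x00 -> 'x' (string value), \x01 -> removed (None),
# \x02 -> chr(66) == 'B' (integer value), 'A' (65) is past the end -> unchanged

# Dictionary table: keys are code points.
dict_table = {ord("a"): None, ord("b"): "BEE", ord("c"): 67}
dict_out = "abcd".translate(dict_table)
# 'a' removed, 'b' -> 'BEE', 'c' -> chr(67) == 'C', 'd' has no key -> unchanged

print(repr(seq_out))   # 'xBA'
print(repr(dict_out))  # 'BEECd'
```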
The behaviors are nearly identical, but dictionaries are simpler yet also more powerful. User-defined objects, or subclassed instances of existing dictionary classes, offer even more flexibility. Written with lists, the simple examples in lines #6 and #7 above aren’t as elegant:
s.translate([cn for cn in range(32)]+['_'])
s.translate([cn for cn in range(32)]+[None])
In both cases, the list needs to include (at least) enough slots for the desired character mapping to index to the ordinal number of the original character. In this case, that’s just the 32 control characters plus the space, so it isn’t too bad. The uppercase letters have ordinal values from 65 to 90, and the lowercase ones from 97 to 122, so sequences for translating just basic alpha characters have to go at least to 122.
The basic ASCII set is 128 characters, and early 8-bit character sets had 256. A list covering Unicode characters would require thousands of slots, or more than a million to cover every code point (depending on how inclusive it was). Plus, sequences are wasteful and redundant when most characters are passed through unchanged. In such cases, each slot must have the ordinal number that matches the index (or the character itself). The alternative is the value None, which removes the character from the result.
In contrast, dictionaries need only contain entries for the characters they’re altering. Characters without entries in the dictionary pass through unchanged. Removing a character requires an entry for that character with its value set to None.
In both cases, translate replaces the character it looks up in the table with the value from the table, which can be an integer value (the codepoint of a Unicode character), a string, or None. Integers are converted to characters; strings are emitted as is, which means translate can convert single characters to multi-character sequences.
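The tables in this post are all built by hand, but the standard library also has a helper, the static method str.maketrans, that produces the same kind of code-point-keyed dictionary. A brief sketch, with sample strings made up for the example:

```python
# Three-argument form: map a->x, b->y, c->z, and delete '!'.
pairs = str.maketrans("abc", "xyz", "!")
swapped = "a cab!".translate(pairs)
print(swapped)  # x zxy

# One-argument form: a dict whose keys may be single characters;
# maketrans converts them to code points. Values may be longer strings.
words = str.maketrans({"&": " and "})
spelled = "R&D".translate(words)
print(spelled)  # R and D
```

The result of maketrans is an ordinary dict, so everything said above about integer, string, and None values applies to it as well.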
§
In HTML, certain characters, chiefly the less-than (<), the greater-than (>), and the ampersand (&), need to be encoded as special sequences. The string translate method offers a convenient way to do that. The code fragment below illustrates two versions that accomplish the same thing:
03| html_string_map = {
04|     ord('&'): '&amp;',
05|     ord('<'): '&lt;',
06|     ord('>'): '&gt;',
07|     ord('"'): '&quot;',
08| }
09| s1 = s.translate(html_string_map)
10|
11| class html_string (dict):
12|     """Prepare string for HTML."""
13|     def __init__ (self):
14|         """Initialize instance with special chars."""
15|         super().__init__()
16|         self[ord('&')] = '&amp;'
17|         self[ord('<')] = '&lt;'
18|         self[ord('>')] = '&gt;'
19|         self[ord('"')] = '&quot;'
20|
21| cm = html_string()
22| s2 = s.translate(cm)
The first version, lines #3 through #9, uses a dict with the map. The second version, lines #11 through #22, uses a dict subclass with the same map. The second version is debatably more reusable but for the most part the two are roughly the same.
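As a sanity check on a map like this, it can be compared against the standard library's html.escape, which handles the same four characters (plus the single quote). This is only a sketch; the sample string is invented and deliberately contains no single quotes, so the two results should agree.

```python
import html

# Same four-entry map as the dict version, keyed by code point.
html_string_map = {
    ord("&"): "&amp;",
    ord("<"): "&lt;",
    ord(">"): "&gt;",
    ord('"'): "&quot;",
}

sample = '<a href="x?a=1&b=2">link</a>'      # hypothetical input
escaped = sample.translate(html_string_map)
print(escaped)
# &lt;a href=&quot;x?a=1&amp;b=2&quot;&gt;link&lt;/a&gt;

# html.escape also converts ' to &#x27;, so they match only without single quotes.
print(escaped == html.escape(sample))  # True
```

One nice property of the translate approach is that each character is visited once, so the ampersands introduced by the replacements are never re-escaped.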
Creating a subclass, however, is more powerful when we want to do more than map a handful of characters to other characters (or character sequences). It’s especially useful if we want to do a similar translation on a large set of characters.
§
For example, suppose we want to implement a ROT13 translator. That requires altering uppercase and lowercase letters as well as digits (but using ROT5). It should leave all other characters unchanged. The code below illustrates two nearly identical approaches:
UCA,UCZ = ord('A'),ord('Z')
LCA,LCZ = ord('a'),ord('z')
DIG0,DIG9 = ord('0'),ord('9')

s_in = 'Now is the time for a pretty good party!'

# Version 1: subclass of dict
class rot13_string (dict):
    """ROT13 encoder, str.translate style. (subclass)"""
    def __getitem__ (self, key):
        """Convert alphas and digits."""
        if UCA <= key <= UCZ: return UCA+(((key-UCA)+13)%26)
        if LCA <= key <= LCZ: return LCA+(((key-LCA)+13)%26)
        if DIG0 <= key <= DIG9: return DIG0+(((key-DIG0)+5)%10)
        # Anything else: let dict raise KeyError (character passes through).
        return super().__getitem__(key)

rot13 = rot13_string()
s_out = s_in.translate(rot13)
s_txt = s_out.translate(rot13)
print(s_in)
print(s_out)
print(s_txt)
print()

# Version 2: class implementing __getitem__
class rot13_object:
    """ROT13 encoder, str.translate style. (object)"""
    def __getitem__ (self, key):
        """Convert alphas and digits."""
        if UCA <= key <= UCZ: return UCA+(((key-UCA)+13)%26)
        if LCA <= key <= LCZ: return LCA+(((key-LCA)+13)%26)
        if DIG0 <= key <= DIG9: return DIG0+(((key-DIG0)+5)%10)
        # Anything else: translate the character to itself.
        return key

rot13 = rot13_object()
s_out = s_in.translate(rot13)
s_txt = s_out.translate(rot13)
print(s_in)
print(s_out)
print(s_txt)
print()
The difference is that the first version subclasses dict (and therefore inherits all its functionality) while the second version is a standalone class. Both implement __getitem__, with the first version overriding the dict method.
When run, the output looks like this:
Now is the time for a pretty good party!
Abj vf gur gvzr sbe n cerggl tbbq cnegl!
Now is the time for a pretty good party!

Now is the time for a pretty good party!
Abj vf gur gvzr sbe n cerggl tbbq cnegl!
Now is the time for a pretty good party!
What’s important in both cases is that the __getitem__ method handles a range of key values: in fact, all alphas and digits. For non-alpha, non-digit characters, the second version returns the character’s ordinal number, which effectively passes the character through unchanged (in reality, it is translated, but to itself). The first version, however, invokes the __getitem__ method of dict for non-alphanumeric characters. Since those characters have no keys in the dictionary, dict raises a KeyError, which is a subclass of LookupError.
The translate method treats a LookupError exception as an indicator to pass the current character unchanged to the result string. That’s why missing keys in a dictionary, or sequence indexes beyond the end of the sequence, result in non-translation of the character. Missing entries mean unchanged characters. That’s why empty lists and dictionaries cause translate to return the entire string unchanged.
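A small experiment makes the role of LookupError visible. Both classes below are hypothetical, written only for this sketch: one signals "no entry" with IndexError (a LookupError subclass, like KeyError), the other raises an unrelated exception, which translate does not swallow.

```python
class OnlyDigits:
    """Hypothetical table: removes digits, reports 'no entry' for the rest."""
    def __getitem__(self, key):
        if chr(key).isdigit():
            return None            # remove the digit
        raise IndexError(key)      # a LookupError subclass: pass through


class Broken:
    """Hypothetical table that fails with a non-lookup exception."""
    def __getitem__(self, key):
        raise ValueError("not a lookup failure")


out = "room 101".translate(OnlyDigits())
print(repr(out))  # 'room '

propagated = False
try:
    "abc".translate(Broken())
except ValueError:
    propagated = True              # translate did not treat this as 'missing'
print(propagated)  # True
```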
§
An advantage of subclassing dict is that we can use the __missing__ method, which dict invokes when a key isn’t found.
Suppose we want to translate strings to versions acceptable in a URL. For maximum security, we’ll pass alphas and digits unchanged, but convert anything else to the %## hex format:
SPC = ord(' ')
UCA,UCZ = ord('A'),ord('Z')
LCA,LCZ = ord('a'),ord('z')
DIG0,DIG9 = ord('0'),ord('9')

class url_string (dict):
    """Translate for URL: Accept alnums (others to %## hex)."""
    def __init__ (self):
        """Initialize instance with special chars."""
        super().__init__()
        self[SPC] = '+'

    def __getitem__ (self, key):
        """Explicitly pass alnums. Delegate others."""
        if UCA <= key <= UCZ: return key
        if LCA <= key <= LCZ: return key
        if DIG0 <= key <= DIG9: return key
        # dict looks up the key and calls __missing__ if it isn't there.
        return super().__getitem__(key)

    def __missing__ (self, key):
        """Convert non-special chars to %## hex."""
        return '%%%02X' % key
Note that we create an entry in the dictionary for the space character. Such entries apply if __getitem__ passes control to the dict superclass. If no key exists, then dict invokes the __missing__ method.
Note that we could do this as a standalone object. Then the __getitem__ method would just return the same %## hex string as the __missing__ method does here.
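A standalone version might look something like the sketch below. The class name url_object is my own invention, and the ASCII check is an assumption about what "alphas and digits" should mean here; treat it as one possible reading, not the definitive one.

```python
class url_object:
    """Hypothetical standalone counterpart to the dict-based URL translator."""
    def __getitem__(self, key):
        ch = chr(key)
        if ch.isascii() and ch.isalnum():
            return key                 # pass ASCII letters and digits
        if ch == " ":
            return "+"                 # the one special case
        return "%%%02X" % key          # everything else becomes %## hex


encoded = "Hello, World!".translate(url_object())
print(encoded)  # Hello%2C+World%21
```

This per-character formatting only produces valid escapes for code points below 256. For real URL work, urllib.parse.quote_plus is the standard-library tool, and it encodes non-ASCII text as UTF-8 bytes first.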
We could also, in either case, return None if we wanted to strip the non-alphanumeric characters from the result.
§
In some cases, the __missing__ method can do all the work and the __getitem__ method need not be overridden. Dictionary entries get looked up as usual, and the __missing__ method only comes into play on, ta da, missing entries.
Here’s a translator that converts uppercase letters to lowercase, passes lowercase letters and digits, converts spaces to underlines and control codes to <##> hex formats (with special handling for TAB and EOL), and strips all other characters:
TAB = ord('\t')
EOL = ord('\n')
SPC = ord(' ')
UCA,UCZ = ord('A'),ord('Z')
LCA,LCZ = ord('a'),ord('z')
DIG0,DIG9 = ord('0'),ord('9')

class basic_text_string (dict):
    """Convert UC=>LC; Pass digits; ^C=>hex; Else skip."""
    def __init__ (self):
        """Initialize instance with special chars."""
        super().__init__()
        self[TAB] = '<TAB>'
        self[EOL] = '<NL>'
        self[SPC] = '_'

    def __missing__ (self, key):
        """Handle everything else here."""
        if UCA <= key <= UCZ: return key+32   # uppercase to lowercase
        if LCA <= key <= LCZ: return key
        if DIG0 <= key <= DIG9: return key
        if 0 <= key < SPC:
            return ('<%02x>' % key)           # other control codes
        return None                           # strip everything else

s = '''\
Samantha Ann Bear
13924 Schulte Way
\tApt. 2402
Forest City, CA 92508
[code:QRTD8945-2740-21]
'''
cmap = basic_text_string()
print(s)
print(s.translate(cmap))
print()
It uses only the __missing__ method (and the __init__ method to set up the special mappings).
When run, it prints:
Samantha Ann Bear
13924 Schulte Way
    Apt. 2402
Forest City, CA 92508
[code:QRTD8945-2740-21]

samantha_ann_bear<NL>13924_schulte_way<NL><TAB>apt_2402<NL>
forest_city_ca_92508<NL>codeqrtd8945274021<NL>
(Note that the translated string is a single line. It’s shown wrapped here for clarity.)
§
One last example just for fun:
class telegraph_string (dict):
    """Pass spaces; LC=>UC; digits,TAB,EOL==>words."""
    def __init__ (self):
        """Initialize character translations."""
        super().__init__()
        self[SPC] = SPC
        self[TAB] = ' TAB '
        self[EOL] = ' || '
        self[ord('0')] = ' ZERO '
        self[ord('1')] = ' ONE '
        self[ord('2')] = ' TWO '
        self[ord('3')] = ' THREE '
        self[ord('4')] = ' FOUR '
        self[ord('5')] = ' FIVE '
        self[ord('6')] = ' SIX '
        self[ord('7')] = ' SEVEN '
        self[ord('8')] = ' EIGHT '
        self[ord('9')] = ' NINE '
        self[ord('.')] = ' STOP '
        self[ord(',')] = ' COMMA '
        self[ord(':')] = ' COLON '

    def __missing__ (self, key):
        if UCA <= key <= UCZ: return key
        if LCA <= key <= LCZ: return key-32   # lowercase to uppercase
        return None                           # strip everything else

cmap = telegraph_string()
print(s)
print(s.translate(cmap))
print()
It converts text to a “telegraph” format. It even translates digits to words, along with the period, comma, and colon. It strips anything else that isn’t alphanumeric or a space. (It does have special handling for TAB and EOL.)
The output, when run against the address above, looks like this:
Samantha Ann Bear
13924 Schulte Way
    Apt. 2402
Forest City, CA 92508
[code:QRTD8945-2740-21]

SAMANTHA ANN BEAR || ONE THREE NINE TWO FOUR SCHULTE WAY ||
TAB APT STOP TWO FOUR ZERO TWO || FOREST CITY COMMA CA
NINE TWO FIVE ZERO EIGHT || CODE COLON QRTD EIGHT NINE FOUR FIVE
TWO SEVEN FOUR ZERO TWO ONE ||
Ready for the telegraph office! (Again, the output is one line but here wrapped for clarity.)
As a user exercise, try making a translator to convert alpha characters to Morse Code!
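If you want a starting point, here is a deliberately incomplete sketch in the same style as the translators above. The class name morse_string, the choice of a slash between words, and the four-letter table are all my own assumptions; filling in the rest of the alphabet is still the exercise.

```python
class morse_string(dict):
    """Hypothetical starter: a few letters to Morse, everything else stripped."""
    def __init__(self):
        super().__init__()
        # Only four letters are filled in here; the rest is left to do.
        starter = {"E": ".", "O": "---", "S": "...", "T": "-"}
        for letter, code in starter.items():
            self[ord(letter)] = code + " "           # uppercase entry
            self[ord(letter.lower())] = code + " "   # lowercase entry
        self[ord(" ")] = "/ "                        # word separator

    def __missing__(self, key):
        return None                                  # strip anything unknown


morse = "SOS to set".translate(morse_string()).strip()
print(morse)  # ... --- ... / - --- / ... . -
```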
§ §
The string translate method does have some limitations. It’s stateless and only maps single chars. It can’t process sequences of characters (which would be necessary in, say, something like UTF-8). It has no memory or accumulation of information. It only uses each character’s ordinal number to look up a possible translation.
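A short sketch of the single-character limit, using a made-up sentence. A translate table can only have rules for individual characters, so a rule for the pair "ch" has to be done some other way; here I use str.replace, though a regular expression would work too.

```python
text = "chat with the coach"

# translate sees one character at a time, so a rule for 'c' hits every 'c'.
per_char = text.translate({ord("c"): "C"})
print(per_char)  # Chat with the CoaCh

# A rule for the two-character sequence 'ch' needs a different tool.
per_seq = text.replace("ch", "CH")
print(per_seq)   # CHat with the coaCH
```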
∅
This post is: Python String Translate