Python’s str instances have many useful methods. I use startswith, and several others, quite a lot. One method I haven’t had reason to use so far is translate. It takes a table argument (typically a dictionary) and uses it to map the existing string to a new string. It’s flexible and useful, so it’s worth knowing how to use.
It’s a built-in method of the built-in str class, so using it doesn’t require importing anything (let alone installing anything). A typical use case is something like:
s_trans = s_orig.translate(table)
Here s_orig is the string to translate, table is a dictionary of character translations, and s_trans is the translated result string.
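To make that concrete, here is a minimal sketch of the pattern. The phone number and the table contents are just sample data I made up for illustration:

```python
# Hypothetical sample data: strip the hyphens from a phone number.
s_orig = "555-867-5309"
table = {ord("-"): None}      # keys are code points; None means "delete"
s_trans = s_orig.translate(table)
print(s_trans)                # 5558675309
```

Note that the original string is untouched; translate always returns a new string.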
The key is the table parameter, which translate expects to be an object implementing the __getitem__ method. Canonically, it would be a dict object or something that subclasses dict. It can also be any user-defined object that implements the __getitem__ method.
The translate method behaves roughly as if it was written like this:
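What follows is my own pure-Python model of that behavior, not CPython’s actual implementation. The function name translate_sketch is made up; the parameter names s_orig and t_map are the ones used in the description:

```python
def translate_sketch(s_orig, t_map):
    """Rough pure-Python model of str.translate (a sketch, not CPython's code)."""
    out = []
    for ch in s_orig:
        try:
            value = t_map[ord(ch)]       # index the map by ordinal number
        except LookupError:              # KeyError or IndexError: no mapping
            out.append(ch)               # pass the character through unchanged
            continue
        if value is None:
            continue                     # None: drop the character
        if isinstance(value, int):
            out.append(chr(value))       # integer: treat it as a code point
        elif isinstance(value, str):
            out.append(value)            # string: emit it as is
        else:
            raise TypeError("character mapping must return integer, None or str")
    return "".join(out)


print(translate_sketch("a b c", {32: "_"}))   # a_b_c
```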
It takes an original string (s_orig) and a translation map (t_map) and returns a new string that is a translation of the original. Note that it uses the ordinal number of each character to index the map. This allows using sequences (instead of dictionaries) as maps.
Note, however, that a string or list must be longer than the ordinal value of the highest character to translate. Assuming one is translating strings with only the basic ASCII characters, this means a list with 128 items (indexes 0 through 127).
Because translate accepts lists and even strings, along with the more usual dictionary objects, one might be led to an experiment something like this:
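A minimal sketch of such an experiment, with a sample string and tables that I’m assuming for illustration. Each table gets a label instead of a line number, so the line numbers cited in the discussion won’t match this sketch:

```python
s = "Now is the time"   # assumed sample text

tables = {
    "empty dict":  {},
    "space to _":  {32: "_"},
    "space gone":  {32: None},
    "empty str":   "",
    "str aeiou":   "aeiou",
    "empty list":  [],
    "list 0..9":   list(range(10)),
    "empty tuple": (),
    "tuple a..e":  ("a", "b", "c", "d", "e"),
}

# Translate the same string with every table and collect the results.
results = {label: s.translate(table) for label, table in tables.items()}
for label, out in results.items():
    print(f"{label:12} {out!r}")
```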
The results might at first surprise. In almost all cases, the translated string is the same as the source string. The two exceptions are the third and fourth translations (from lines #6 and #7 respectively). The first one converts all spaces to underlines, the second removes them entirely.
The others seem to have no effect. In fact, some of them actually would alter the string if it contained different text. The first one, from line #10, would alter control chars NUL and ^A through ^D to, respectively, ‘a’, ‘e’, ‘i’, ‘o’, and ‘u’ (which would be weird). The second one, from line #16, would again alter NUL and ^A through ^D but to the respective values in the tuple, ‘a’ through ‘d’ (which would make at least some sense).
The example on line #13 maps NUL and control chars ^A through ^I to themselves, so it ends up not altering the string even though a translation does occur if those characters are in the string. They’re just translated to themselves.
All examples taking empty containers (lines #4, #9, #12, #15) translate all characters in any string to themselves. The examples on line #12 and #13 are effectively the same. So would this:
When it comes to sequences, translate tries to index the nth item of the sequence, where n is the ordinal number of the current character. For example, the ordinal number of an uppercase ‘A’ is 65, so translate looks up the item at index 65 (counting from zero, of course, so in fact the 66th item in the sequence).
When translate indexes a sequence, there are three possible outcomes:
- The character’s ordinal number is equal to or larger than the length of the sequence. The lookup fails because the sequence is too short. If the sequence is empty, all lookups fail this way. In this case, translate passes the character to the result string unchanged.
- The indexed slot has the value None. In this case, translate skips the character and it does not appear in the result string.
- The indexed slot has an integer or string value. In this case, translate replaces the original character with the string, or with the character whose code point is the integer.
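The three sequence outcomes can be seen in one small sketch. The 128-slot identity list and the sample string are my own choices for illustration:

```python
# Start with an identity list for basic ASCII, then change two slots.
table = list(range(128))
table[ord(" ")] = "_"      # string value: replace the space
table[ord("!")] = None     # None: delete the exclamation mark

s = "Hi there, caf\u00e9!"
result = s.translate(table)
print(result)              # Hi_there,_café
```

The ‘é’ has ordinal number 233, which is past the end of the list, so it passes through unchanged.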
When given a dictionary, translate uses the character’s ordinal number as a key to an item in the dictionary. There are, again, three possible outcomes:
- The character’s ordinal number is not a key in the dictionary. In this case, translate passes the character to the result unchanged. (Similar to not finding the indexed slot in a list.)
- The ordinal number is a key in the dictionary and has a value of None. As with a sequence, translate skips the character; it does not appear in the result string.
- The ordinal number is a key in the dictionary and has an integer or string value. Again, translate replaces the original character with the string, or with the character whose code point is the integer.
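The same three outcomes with a dictionary, again as a small sketch with made-up sample data:

```python
table = {
    ord(" "): "_",        # string value: replace
    ord("!"): None,       # None: delete
    ord("a"): ord("A"),   # integer value: replace with chr(65)
}
result = "a banana!".translate(table)
print(result)             # A_bAnAnA
```

The ‘b’ and ‘n’ have no entries, so they pass through unchanged.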
The behaviors are nearly identical, but dictionaries are simpler yet also more powerful. User-defined objects, or subclassed instances of existing dictionary classes, offer even more flexibility. The list equivalents of the simple examples in lines #6 and #7 above aren’t as elegant:
s.translate(list(range(32)) + ['_'])
s.translate(list(range(32)) + [None])
The first converts spaces to underlines and the second removes them. In both cases, the list needs enough slots to reach the ordinal number of the character being mapped. Here that means 33 slots: identity entries for the 32 control characters, plus the entry for the space at index 32, so it isn’t too bad. The uppercase letters have ordinal values from 65 to 90, and the lowercase ones from 97 to 122, so sequences for translating just basic alpha characters have to go at least to index 122.
The basic ASCII set is 128 characters, and early 8-bit character sets had 256. A list covering Unicode would require thousands of slots, or over a million to cover every possible code point (there are 0x110000 of them). Plus, sequences are wasteful and redundant when most characters are passed through unchanged. In such cases, each slot must have the ordinal number that matches the index (or the character itself). The alternative is the value None, which removes the character from the result.
In contrast, dictionaries need only contain entries for the characters they’re altering. Characters without entries in the dictionary pass through unchanged. Removing a character requires an entry for that character with its value set to None.
In both cases, translate replaces the character it finds in the table with the value from the table, which can be an integer (the code point of a Unicode character), a string, or None. Integers are converted to characters; strings are emitted as is (which means translate can convert single characters to multi-character sequences); None drops the character.
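A short sketch of the one-to-many case, using a few typographic characters I picked as sample data:

```python
table = {
    0x2026: "...",   # HORIZONTAL ELLIPSIS -> three periods
    0x00DF: "ss",    # LATIN SMALL LETTER SHARP S -> "ss"
    0x2019: 0x27,    # RIGHT SINGLE QUOTATION MARK -> apostrophe (integer value)
}
text = "It\u2019s on Hauptstra\u00dfe\u2026"
result = text.translate(table)
print(result)        # It's on Hauptstrasse...
```

The result is longer than the input, which is fine; translate builds a new string either way.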
In HTML, certain characters, chiefly the less-than (<), the greater-than (>), and the ampersand (&), need to be encoded as special sequences. The string translate method offers a convenient way to do that. The code fragment below illustrates two versions that accomplish the same thing:
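Here is a minimal sketch of the two versions. The names HTML_MAP, html_escape, and HtmlTable are my own inventions, and the line numbers cited in the discussion refer to the original listing rather than to this sketch:

```python
# Version 1: a plain dict keyed by code point.
HTML_MAP = {
    ord("<"): "&lt;",
    ord(">"): "&gt;",
    ord("&"): "&amp;",
}

def html_escape(text):
    """Escape the three special characters with the plain dict."""
    return text.translate(HTML_MAP)


# Version 2: a dict subclass that carries the same map.
class HtmlTable(dict):
    def __init__(self):
        super().__init__(HTML_MAP)

    def escape(self, text):
        """Escape text using this object as the translation table."""
        return text.translate(self)


sample = "if a < b && b > c:"
print(html_escape(sample))          # if a &lt; b &amp;&amp; b &gt; c:
print(HtmlTable().escape(sample))   # same output
```

Because translate makes a single pass over the input, the ampersands it emits are never escaped a second time. For real work, the standard library’s html.escape does this job (and handles quotes too).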
The first version, lines #3 through #9, uses a dict with the map. The second version, lines #11 through #22, uses a dict subclass with the same map. The second version is debatably more reusable but for the most part the two are roughly the same.
Creating a subclass, however, is more powerful when we want to do more than map a handful of characters to other characters (or character sequences). Subclasses are especially useful if we want to do a similar translation on a large set of characters.
For example, suppose we want to implement a ROT13 translator. That requires rotating uppercase and lowercase letters by 13 places. Here we’ll also rotate the digits (using ROT5 for those). It should leave all other characters unchanged. The code below illustrates two nearly identical approaches:
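A sketch of the two approaches as I’d write them. The helper rotate and the class names Rot13Dict and Rot13Plain are hypothetical names chosen for this illustration:

```python
def rotate(key):
    """Return the rotated code point, or None when key is not an ASCII letter or digit."""
    if 65 <= key <= 90:                    # A-Z: rotate by 13
        return (key - 65 + 13) % 26 + 65
    if 97 <= key <= 122:                   # a-z: rotate by 13
        return (key - 97 + 13) % 26 + 97
    if 48 <= key <= 57:                    # 0-9: rotate by 5
        return (key - 48 + 5) % 10 + 48
    return None


class Rot13Dict(dict):
    """Version 1: dict subclass; other keys fall through to dict's own lookup."""
    def __getitem__(self, key):
        rotated = rotate(key)
        if rotated is None:
            return super().__getitem__(key)   # dict is empty, so this raises KeyError
        return rotated


class Rot13Plain:
    """Version 2: standalone class; other characters map to themselves."""
    def __getitem__(self, key):
        rotated = rotate(key)
        return key if rotated is None else rotated


text = "Now is the time for a pretty good party!"
for table in (Rot13Dict(), Rot13Plain()):
    encoded = text.translate(table)
    decoded = encoded.translate(table)     # applying the rotation twice undoes it
    print(text, encoded, decoded, sep="\n")
```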
The difference is that the first version subclasses dict (and therefore inherits all its functionality) while the second version is a standalone class. Both implement __getitem__, the first version overriding the method it inherits from dict.
When run, the output looks like this:
Now is the time for a pretty good party!
Abj vf gur gvzr sbe n cerggl tbbq cnegl!
Now is the time for a pretty good party!
Now is the time for a pretty good party!
Abj vf gur gvzr sbe n cerggl tbbq cnegl!
Now is the time for a pretty good party!
What’s important in both cases is that the __getitem__ method handles a whole range of key values: all the letters and digits. For non-alpha, non-digit characters, the second version returns the character’s ordinal number, which effectively passes the character through unchanged (in reality, it is translated, but to itself). The first version, however, invokes the __getitem__ method of dict for non-alphanumeric characters. Since those characters have no keys in the dictionary, dict raises a KeyError.
The translate method treats a LookupError exception as a signal to pass the current character unchanged to the result string. Both KeyError (from a dictionary) and IndexError (from a sequence) are subclasses of LookupError. That’s why missing keys in a dictionary, or sequence indexes beyond the end of the sequence, result in non-translation of the character. Missing entries mean unchanged characters, which is also why empty lists and dictionaries cause translate to return the entire string unchanged.
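A small sketch of that rule with two hypothetical classes of my own: one raises LookupError for characters it doesn’t handle, and one raises a different exception to show that only lookup failures are swallowed:

```python
class OnlyVowels:
    """Uppercases lowercase vowels; raises LookupError for everything else."""
    def __getitem__(self, key):
        ch = chr(key)
        if ch in "aeiou":
            return ch.upper()
        raise LookupError(key)       # translate treats this as "leave unchanged"


class Broken:
    """Raises an exception that is not a LookupError."""
    def __getitem__(self, key):
        raise ValueError("not a lookup failure")


result = "translate me".translate(OnlyVowels())
print(result)                        # trAnslAtE mE

try:
    "abc".translate(Broken())
    propagated = False
except ValueError:
    propagated = True                # the ValueError reaches the caller
print(propagated)                    # True
```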
An advantage of subclassing dict is that we can use the __missing__ method, which dictionary objects invoke when a key isn’t found.
Suppose we want to translate strings to versions acceptable in a URL. For maximum security, we’ll pass alphas and digits unchanged, but convert anything else to the %## hex format:
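A sketch of how that might look. The class name UrlTable is made up, and two details are my assumptions rather than anything stated here: the space maps to ‘+’, and non-ASCII characters are written as the %## form of each of their UTF-8 bytes:

```python
class UrlTable(dict):
    def __init__(self):
        super().__init__()
        self[ord(" ")] = "+"              # assumed mapping for the space character

    def __getitem__(self, key):
        ch = chr(key)
        if ch.isascii() and ch.isalnum():
            return key                    # pass ASCII letters and digits unchanged
        # Ordinary dict lookup; dict falls back to __missing__ if the key is absent.
        return super().__getitem__(key)

    def __missing__(self, key):
        # Assumption: encode the character as UTF-8 and write each byte as %##.
        return "".join(f"%{byte:02X}" for byte in chr(key).encode("utf-8"))


result = "a b&c/\u00e9".translate(UrlTable())
print(result)                             # a+b%26c%2F%C3%A9
```

For production code, urllib.parse.quote in the standard library is the safer choice; this is only meant to show the mechanics.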
Note that we create an entry in the dictionary for the space character. Such entries apply if __getitem__ passes control to the dict superclass. If no key exists, then dict invokes the __missing__ method.
Note that we could do this as a standalone object. Then the __getitem__ method would just return the same %## hex string as the __missing__ method does here.
We could also, in either case, return None if we wanted to strip the non-alphanumeric characters from the result.
In some cases, the __missing__ method can do all the work and the __getitem__ method need not be overridden. Dictionary entries get looked up as usual, and the __missing__ method only comes into play on, ta da, missing entries.
Here’s a translator that passes alphanumeric characters, converts spaces to underlines and control codes to <##> hex formats (with special handling for TAB and EOL), and strips all other characters:
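A sketch of such a translator. The class name SlugTable is hypothetical, and the lowercasing is my inference from the sample output rather than something the description states:

```python
class SlugTable(dict):
    def __init__(self):
        super().__init__()
        self[ord(" ")] = "_"          # spaces become underlines
        self[ord("\t")] = "<TAB>"     # special handling for TAB
        self[ord("\n")] = "<NL>"      # special handling for EOL

    def __missing__(self, key):
        ch = chr(key)
        if ch.isalnum():
            return ch.lower()         # assumed: letters come out lowercased
        if key < 32:
            return f"<{key:02X}>"     # other control codes as <##> hex
        return None                   # strip everything else


address = (
    "Samantha Ann Bear\n"
    "13924 Schulte Way\n"
    "\tApt. 2402\n"
    "Forest City, CA 92508\n"
    "[code:QRTD8945-2740-21]\n"
)
result = address.translate(SlugTable())
print(result)
```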
It uses only the __missing__ method (plus the __init__ method to set up the special mappings).
When run, it prints:
Samantha Ann Bear
13924 Schulte Way
	Apt. 2402
Forest City, CA 92508
[code:QRTD8945-2740-21]

samantha_ann_bear<NL>13924_schulte_way<NL><TAB>apt_2402<NL>
forest_city_ca_92508<NL>codeqrtd8945274021<NL>
(Note that the translated string is a single line. It’s shown wrapped here for clarity.)
One last example just for fun:
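A sketch along these lines. The names TelegraphTable and telegraph are mine, the punctuation words are taken from the sample output, and the whitespace cleanup after translate is my own addition to keep the spacing tidy:

```python
DIGITS = "ZERO ONE TWO THREE FOUR FIVE SIX SEVEN EIGHT NINE".split()


class TelegraphTable(dict):
    def __init__(self):
        super().__init__()
        self[ord(" ")] = " "
        self[ord("\t")] = " TAB "
        self[ord("\n")] = " || "
        self[ord(".")] = " STOP "
        self[ord(",")] = " COMMA "
        self[ord(":")] = " COLON "
        for n, word in enumerate(DIGITS):
            self[ord("0") + n] = f" {word} "   # digits become words

    def __missing__(self, key):
        ch = chr(key)
        return ch.upper() if ch.isalpha() else None   # strip anything else


def telegraph(text):
    """Translate, then collapse the runs of spaces the table entries leave behind."""
    return " ".join(text.translate(TelegraphTable()).split())


print(telegraph("Apt. 2402\n"))   # APT STOP TWO FOUR ZERO TWO ||
```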
It converts text to a “telegraph” format. It even translates digits to words, and it spells out some punctuation (the output below shows STOP, COMMA, and COLON). It strips anything else that isn’t alphanumeric or a space. (It does have special handling for TAB and EOL.)
The output, when run against the address above, looks like this:
Samantha Ann Bear
13924 Schulte Way
	Apt. 2402
Forest City, CA 92508
[code:QRTD8945-2740-21]

SAMANTHA ANN BEAR || ONE THREE NINE TWO FOUR SCHULTE WAY ||
TAB APT STOP TWO FOUR ZERO TWO ||
FOREST CITY COMMA CA NINE TWO FIVE ZERO EIGHT ||
CODE COLON QRTD EIGHT NINE FOUR FIVE TWO SEVEN FOUR ZERO TWO ONE ||
Ready for the telegraph office! (Again, the output is one line but here wrapped for clarity.)
As a user exercise, try making a translator to convert alpha characters to Morse Code!
The translate method does have some limitations. It’s stateless and only maps single characters. It can’t process sequences of characters (which would be necessary for, say, decoding something like UTF-8). It has no memory or accumulation of information. It only uses each character’s ordinal number to look up a possible translation.