Python String Translate

Python str instances have many useful methods. I use strip, split, startswith, and others quite a lot, for instance. One method I haven’t had reason to use so far is translate. It takes a translation table (typically a dictionary) and uses it to map each character of the existing string into a new string.

It’s flexible and useful, so it’s worth knowing how to use.

It’s a built-in method of the built-in str class, so using it doesn’t require importing anything (let alone installing anything). A typical use case is something like:

s_trans = s_orig.translate(table)

Where s_orig is the string to translate, table is a dictionary of character translations, and s_trans is the translated result string.

The key is the table parameter, which translate expects to be an object implementing the __getitem__ method. Canonically, that’s a dict object or something that subclasses dict. It can also be a sequence or any user-defined object that implements the __getitem__ method.

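The translate method behaves roughly as if it were written like this:

def translate_string(s_orig, t_map=None):
    """Simulation of the str.translate() method."""
    if t_map is None:
        t_map = {}  # an empty table leaves every character unchanged
    buf = []
    for char in str(s_orig):
        cnum = ord(char)
        try:
            cout = t_map[cnum]
        except LookupError:
            # Missing key or out-of-range index: pass the character through.
            cout = cnum
        if cout is None:
            continue  # a value of None deletes the character
        sout = cout if isinstance(cout, str) else chr(cout)
        buf.append(sout)
    return ''.join(buf)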



It takes an original string (s_orig) and a translation map (t_map) and returns a new string that is a translation of the original. Note that it uses the ordinal number of each character to index the map. This allows using sequences (instead of dictionaries) as maps.

Note, however, that a string or list used this way must be longer than the ordinal value of the highest character to translate. Assuming we’re translating strings with only the basic ASCII characters, this means a list with 128 items.

That translate takes lists, even strings, along with the more usual dictionary objects might lead to an experiment something like this:

s = '''Samantha Ann Bear
13924 Schulte Way
Apt. 2402
Forest City, CA 92508
[code:QRTD8945-2740-21]'''

ts = [
    s.translate({}),
    s.translate({32:32}),
    s.translate({32:'_'}),
    s.translate({32:None}),

    s.translate(''),
    s.translate('aeiou'),

    s.translate([]),
    s.translate([0,1,2,3,4,5,6,7,8,9]),

    s.translate(()),
    s.translate(('','a','b','c','d')),
]

for t in ts:
    print(t)
    print()

print()



Which gives results that might at first surprise. In almost all cases, the translated string is the same as the source string. The two exceptions are the third and fourth translations, the ones using {32:'_'} and {32:None}. The first converts all spaces to underlines; the second removes them entirely.

The others seem to have no effect. In fact, some of them actually would alter the string if it contained different text. The one using the string 'aeiou' would alter control chars NUL and ^A through ^D to, respectively, ‘a’, ‘e’, ‘i’, ‘o’, and ‘u’ (which would be weird). The one using the five-item tuple would remove NUL (it maps to the empty string) and alter ^A through ^D to ‘a’ through ‘d’ (which would make at least some sense).

The example with the ten-item list maps NUL and control chars ^A through ^I to themselves, so it ends up not altering the string even though a translation does occur if those characters are in the string. They’re just translated to themselves.

All examples taking empty containers ({}, '', [], and ()) pass every character in any string through unchanged. The empty-list and ten-item-list examples are effectively the same. So would this:

s.translate(list(range(128)))

When it comes to sequences, translate tries to index the nth item of the sequence, where n is the ordinal number of the current character. For example, the ordinal number of an uppercase ‘A’ is 65, so translate looks up the 65th item (counting from zero, of course, so in fact the 66th in the list).

When translate indexes a sequence, there are three possible outcomes:

Character’s ordinal number is equal to or larger than the list length. The lookup fails because the list is too short. If the list is empty, all lookups fail this way. In this case, translate passes the character to the result string unchanged.
The indexed list slot has the value None. In this case, translate skips the character and it does not appear in the result string.
The indexed list slot has an integer or string value. In this case, translate replaces the original character with the string or character value of the integer.
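
Here’s a minimal sketch of those three outcomes in one go. The 98-item list and the sample text are made up for illustration; they aren’t from the examples above.

```python
# A list table that maps ordinals 0..97 to themselves.
table = list(range(98))
table[ord('a')] = None   # slot holds None: 'a' (97) is dropped
table[ord(' ')] = '_'    # slot holds a string: ' ' (32) becomes '_'
# 'b' (98) and 'c' (99) index past the end of the list,
# so translate passes them through unchanged.

result = "a b c".translate(table)
print(result)  # _b_c
```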

When given a dictionary, translate uses the character’s ordinal number as a key to an item in the dictionary. There are, again, three possible outcomes:

Character’s ordinal number is not a key in the dictionary. In this case, translate passes the character to the result unchanged. (Similar to not finding the indexed slot in a list.)
The ordinal number is a key in the dictionary and has a value of None. As with a sequence, translate skips the character; it does not appear in the result string.
The ordinal number is a key in the dictionary and has an integer or string value. Again, translate replaces the original character with the string or character value of the integer.
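
The same three outcomes with a dictionary, again as a small sketch with made-up sample text:

```python
table = {
    ord('a'): None,      # value None: the character is removed
    ord('b'): 'BB',      # string value: emitted as is (may be multi-character)
    ord('c'): ord('C'),  # integer value: converted to the character 'C'
}
# 'd' has no key in the dictionary, so it passes through unchanged.

result = "abcd".translate(table)
print(result)  # BBCd
```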

The behaviors are nearly identical, but dictionaries are simpler yet also more powerful. User-defined objects, or subclassed instances of existing dictionary classes, offer even more flexibility. Written with sequences, the simple space-to-underline and space-removal examples above aren’t as elegant:

s.translate(list(range(32))+['_'])
s.translate(list(range(32))+[None])

In both cases, the list needs to include (at least) enough slots for the desired character mapping to index to the ordinal number of the original character. In this case, that’s just the 32 control characters plus the slot for the space, so it isn’t too bad. The uppercase letters have ordinal values from 65 to 90, and the lowercase ones from 97 to 122, so sequences for translating just basic alpha characters have to go at least to 122.

The basic ASCII set is 128 characters, and early 8-bit character sets had 256. A list covering Unicode would require thousands of slots, and up to 1,114,112 (code points 0 through 0x10FFFF) to cover all of it. Plus, sequences are wasteful and redundant when most characters are passed through unchanged. In such cases, each slot must have the ordinal number that matches the index (or the character itself). The alternative is the value None, which removes the character from the result.

In contrast, dictionaries need only contain entries for the characters they’re altering. Characters without entries in the dictionary pass through unchanged. Removing a character requires an entry for that character with its value set to None.

In both cases, translate replaces the character it finds in the table with the value from the table, which can be an integer (the codepoint of a Unicode character), a string, or None. Integers are converted to characters; strings are emitted as is. (Which means translate can convert single characters to multi-character sequences.)
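
Not covered above, but worth knowing: the built-in str.maketrans static method builds this kind of dictionary, so the ord() calls need not be written by hand. A short sketch (the sample strings are made up for illustration):

```python
# One-argument form: a dict keyed by single characters (or ordinals).
table1 = str.maketrans({' ': '_', '!': None})

# Three-argument form: map each character in the first string to the one at
# the same position in the second; delete every character in the third.
table2 = str.maketrans('abc', 'xyz', '!')

r1 = "Hi there!".translate(table1)
r2 = "a cab!".translate(table2)
print(table1)  # {32: '_', 33: None}
print(r1)      # Hi_there
print(r2)      # x zxy
```

Either way, the result is an ordinary dict keyed by ordinal numbers, exactly what translate expects.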

In HTML, certain characters, chiefly the less-than (<), the greater-than (>), and the ampersand (&), need to be encoded as special sequences. The string translate method offers a convenient way to do that. The code fragment below illustrates two versions that accomplish the same thing:

s = '<p>"<strong>Hello</strong>, <em>World</em>!" (&U2!)</p>'

html_string_map = {
    ord('&'): '&amp;',
    ord('<'): '&lt;',
    ord('>'): '&gt;',
    ord('"'): '&quot;',
}
s1 = s.translate(html_string_map)

class html_string(dict):
    """Prepare string for HTML."""
    def __init__(self):
        """Initialize instance with special chars."""
        super().__init__()
        self[ord('&')] = '&amp;'
        self[ord('<')] = '&lt;'
        self[ord('>')] = '&gt;'
        self[ord('"')] = '&quot;'

cm = html_string()
s2 = s.translate(cm)



The first version (html_string_map) uses a plain dict with the map. The second version (the html_string class) uses a dict subclass with the same map. The second version is debatably more reusable, but for the most part the two are roughly the same.

Creating a subclass, however, is more powerful when we want to do more than map a handful of characters to other characters (or character sequences). Subclasses are especially useful if we want to do a similar translation on a large set of characters.

For example, suppose we want to implement a ROT13 translator. That requires altering uppercase and lowercase letters as well as digits (but using ROT5). It should leave all other characters unchanged. The code below illustrates two nearly identical approaches:

UCA,UCZ = ord('A'),ord('Z')
LCA,LCZ = ord('a'),ord('z')
DIG0,DIG9 = ord('0'),ord('9')

s_in = 'Now is the time for a pretty good party!'

# Version 1: subclass of dict
class rot13_string(dict):
    """ROT13 encoder, str.translate style. (subclass)"""
    def __getitem__(self, key):
        """Convert alphas and digits."""
        if UCA <= key <= UCZ: return UCA+(((key-UCA)+13)%26)
        if LCA <= key <= LCZ: return LCA+(((key-LCA)+13)%26)
        if DIG0 <= key <= DIG9: return DIG0+(((key-DIG0)+5)%10)
        # Anything else: let dict raise KeyError (character is unchanged).
        return super().__getitem__(key)

rot13 = rot13_string()
s_out = s_in.translate(rot13)
s_txt = s_out.translate(rot13)
print(s_in)
print(s_out)
print(s_txt)
print()

# Version 2: class implementing __getitem__
class rot13_object:
    """ROT13 encoder, str.translate style. (object)"""
    def __getitem__(self, key):
        """Convert alphas and digits."""
        if UCA <= key <= UCZ: return UCA+(((key-UCA)+13)%26)
        if LCA <= key <= LCZ: return LCA+(((key-LCA)+13)%26)
        if DIG0 <= key <= DIG9: return DIG0+(((key-DIG0)+5)%10)
        # Anything else: map the character to itself.
        return key

rot13 = rot13_object()
s_out = s_in.translate(rot13)
s_txt = s_out.translate(rot13)
print(s_in)
print(s_out)
print(s_txt)
print()



The difference is that the first version subclasses dict (and therefore inherits all its functionality) while the second version is a standalone class. Both implement __getitem__ — the first version overriding the dict method.

When run, the output looks like this:

Now is the time for a pretty good party!
Abj vf gur gvzr sbe n cerggl tbbq cnegl!
Now is the time for a pretty good party!

Now is the time for a pretty good party!
Abj vf gur gvzr sbe n cerggl tbbq cnegl!
Now is the time for a pretty good party!

What’s important in both cases is that the __getitem__ method handles a range of key values: all alphas and digits. For non-alpha, non-digit characters, the second version returns the character’s ordinal number, which effectively passes the character through unchanged (in reality, it is translated, but to itself). The first version, however, invokes the __getitem__ method of dict for non-alphanumeric characters. Since those characters have no keys in the dictionary, dict raises a KeyError, which is a subclass of LookupError.

The translate method treats a LookupError exception as an indicator to pass the current character unchanged to the result string. That’s why missing keys in a dictionary (KeyError), or sequence indexes past the end of the sequence (IndexError), result in non-translation of the character. Missing entries mean unchanged characters. That’s why empty lists and dictionaries cause translate to return the entire string unchanged.
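
A quick way to see this is with two throwaway table objects. Both class names are hypothetical, and the behavior shown is what I’d expect from CPython: a LookupError subclass means “leave it alone,” while any other exception propagates to the caller.

```python
class Passthrough:
    """Hypothetical table: raises a LookupError subclass for every key."""
    def __getitem__(self, key):
        raise IndexError(key)  # KeyError would behave the same way

class Broken:
    """Hypothetical table: raises an unrelated exception for every key."""
    def __getitem__(self, key):
        raise ValueError(key)

unchanged = "abc".translate(Passthrough())
print(unchanged)  # abc

try:
    "abc".translate(Broken())
    propagated = False
except ValueError:
    propagated = True
print(propagated)  # True
```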

An advantage of subclassing dict is that we can use the __missing__ method, which dict subclasses invoke when a key isn’t found.

Suppose we want to translate strings to versions acceptable in a URL. For maximum security, we’ll pass alphas and digits unchanged, turn spaces into plus signs, and convert anything else to the %## hex format:

SPC = ord(' ')
UCA,UCZ = ord('A'),ord('Z')
LCA,LCZ = ord('a'),ord('z')
DIG0,DIG9 = ord('0'),ord('9')

class url_string(dict):
    """Translate for URL: Accept alnums (others to %## hex)."""
    def __init__(self):
        """Initialize instance with special chars."""
        super().__init__()
        self[SPC] = '+'

    def __getitem__(self, key):
        """Explicitly pass alnums. Delegate others."""
        if UCA <= key <= UCZ: return key
        if LCA <= key <= LCZ: return key
        if DIG0 <= key <= DIG9: return key
        return super().__getitem__(key)

    def __missing__(self, key):
        """Convert non-special chars to %## hex."""
        return '%%%02X' % key



Note that we create an entry in the dictionary for the space character. Such entries apply if __getitem__ passes control to the dict superclass. If no key exists, then dict invokes the __missing__ method.

Note that we could do this as a standalone object. Then the __getitem__ method would just return the same %## hex string as the __missing__ method does here.

We could also, in either case, return None if we wanted to strip the non-alphanumeric characters from the result.
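
As a rough sketch of that standalone variant (the UrlObject name is hypothetical, and it leans on str.isascii, which needs Python 3.7 or later):

```python
class UrlObject:
    """Hypothetical standalone take on the url_string idea (no dict)."""
    def __getitem__(self, key):
        char = chr(key)
        if char == ' ':
            return '+'
        # ASCII letters and digits map to themselves.
        if char.isascii() and char.isalnum():
            return key
        # Everything else becomes a %## hex escape.
        return '%%%02X' % key

result = "Hello World!".translate(UrlObject())
print(result)  # Hello+World%21
```

Note that neither version produces proper URL encoding for characters above 127, since that requires encoding to UTF-8 bytes first. For real URLs, the standard library’s urllib.parse.quote_plus is the safer choice.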

In some cases, the __missing__ method can do all the work and the __getitem__ method need not be overridden. Dictionary entries get looked up as usual, and the __missing__ method only comes into play on, ta da, missing entries.

Here’s a translator that passes digits and lowercase letters, converts uppercase letters to lowercase, converts spaces to underlines and control codes to <##> hex formats (with special handling for TAB and EOL), and strips all other characters:

TAB = ord('\t')
EOL = ord('\n')
SPC = ord(' ')
UCA,UCZ = ord('A'),ord('Z')
LCA,LCZ = ord('a'),ord('z')
DIG0,DIG9 = ord('0'),ord('9')

class basic_text_string(dict):
    """Convert UC=>LC; Pass digits; ^C=>hex; Else skip."""
    def __init__(self):
        """Initialize instance with special chars."""
        super().__init__()
        self[TAB] = '<TAB>'
        self[EOL] = '<NL>'
        self[SPC] = '_'

    def __missing__(self, key):
        """Handle everything else here."""
        if UCA <= key <= UCZ: return key+32
        if LCA <= key <= LCZ: return key
        if DIG0 <= key <= DIG9: return key
        if 0 <= key < SPC:
            return ('<%02x>' % key)
        return None

s = '''\
Samantha Ann Bear
13924 Schulte Way
\tApt. 2402
Forest City, CA 92508
[code:QRTD8945-2740-21]
'''
cmap = basic_text_string()
print(s)
print(s.translate(cmap))
print()



This version uses only the __missing__ method (and the __init__ method to set up the special mappings).

When run, it prints:

Samantha Ann Bear
13924 Schulte Way
Apt. 2402
Forest City, CA 92508
[code:QRTD8945-2740-21]

samantha_ann_bear<NL>13924_schulte_way<NL><TAB>apt_2402<NL>
forest_city_ca_92508<NL>codeqrtd8945274021<NL>

(Note that the translated string is a single line. It’s shown wrapped here for clarity.)

One last example just for fun:

# Uses the SPC, TAB, EOL, UCA/UCZ, and LCA/LCZ constants and the
# string s from the previous listing.
class telegraph_string(dict):
    """Pass spaces; LC=>UC; digits,TAB,EOL==>words."""
    def __init__(self):
        """Initialize character translations."""
        super().__init__()
        self[SPC] = SPC
        self[TAB] = ' TAB '
        self[EOL] = ' || '
        self[ord('0')] = ' ZERO '
        self[ord('1')] = ' ONE '
        self[ord('2')] = ' TWO '
        self[ord('3')] = ' THREE '
        self[ord('4')] = ' FOUR '
        self[ord('5')] = ' FIVE '
        self[ord('6')] = ' SIX '
        self[ord('7')] = ' SEVEN '
        self[ord('8')] = ' EIGHT '
        self[ord('9')] = ' NINE '
        self[ord('.')] = ' STOP '
        self[ord(',')] = ' COMMA '
        self[ord(':')] = ' COLON '

    def __missing__(self, key):
        """Pass UC; convert LC to UC; strip everything else."""
        if UCA <= key <= UCZ: return key
        if LCA <= key <= LCZ: return key-32
        return None

cmap = telegraph_string()
print(s)
print(s.translate(cmap))
print()



It converts text to a “telegraph” format. It even translates digits to words, along with the period, comma, and colon. It strips anything else that isn’t alphabetic or a space. (It does have special handling for TAB and EOL.)

The output, when run against the address above, looks like this:

Samantha Ann Bear
13924 Schulte Way
	Apt. 2402
Forest City, CA 92508
[code:QRTD8945-2740-21]

SAMANTHA ANN BEAR ||
ONE THREE NINE TWO FOUR SCHULTE WAY ||
TAB APT STOP TWO FOUR ZERO TWO ||
FOREST CITY COMMA CA NINE TWO FIVE ZERO EIGHT ||
CODE COLON QRTD EIGHT NINE FOUR FIVE TWO SEVEN FOUR ZERO TWO ONE ||

Ready for the telegraph office! (Again, the output is one line but here wrapped for clarity.)

As a user exercise, try making a translator to convert alpha characters to Morse Code!

§ §

The string translate method does have some limitations. It’s stateless and only maps single chars. It can’t process sequences of characters (which would be necessary in, say, something like UTF-8). It has no memory or accumulation of information. It only uses each character’s ordinal number to look up a possible translation.
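
A small sketch of the single-character limitation, with a made-up sample string. The table can only react to ‘-’ and ‘>’ individually, so it can’t tell an arrow from a lone hyphen, whereas str.replace works on whole substrings:

```python
text = "a-b -> c"

# translate looks at each character on its own: every '-' is removed,
# including the one in "a-b" that isn't part of an arrow.
per_char = text.translate({ord('-'): None, ord('>'): '→'})

# str.replace matches the two-character sequence as a unit.
as_unit = text.replace('->', '→')

print(per_char)  # ab → c
print(as_unit)   # a-b → c
```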

∅

