Recently I thought to myself, “Hey self,… Although I’ve never looked into them, I know Python has tools for parsing Python code. I wonder if they might make generating syntax-highlighted HTML pages of Python scripts pretty easy?”

I found the help page for the tokenize module, and the introduction says it’s “useful for implementing ‘pretty-printers,’ including colorizers for on-screen displays.” That sounds like a strong yes.

In fact, in a couple of hours I had a working version 1.0 that’s useful enough to put into production making webpages of some of my Python scripts. I already had a py2html module I wrote a while back, but it used its own home-grown tokenizer. This new version uses Python tools for that, so it’s much cleaner and almost certainly better.

§ §

To jump right in, let’s tokenize a Python script:

from os import path
from tokenize import tokenize
def parse_python (fname, fpath=BasePath):
    '''Tokenize Python Source File.'''
    print('Python: %s' % fname)
    fn = path.join(fpath, '%s.py' % fname)
    fp = open(fn, mode='rb')
    try:
        toks = list(tokenize(fp.readline))
    except:
        raise
    else:
        print('read: %s' % fn)
    finally:
        fp.close()
    return toks
#

One key is to open the file in binary mode. The tokenizer is expecting bytes.

This drove me crazy until I RTFM: “The tokenize() generator requires one argument, readline, which must be a callable object […]. Each call to the function should return one line of input as bytes.” Oh, okay. That last sentence didn’t sink in the first time I read it. Works quite well when you do it right.

I think I was a little thrown by passing in readline rather than the file object itself. I suppose that does give you the option of passing something that acts like a binary readline but doesn’t have that name. When the text-mode file was giving me errors, I wasn’t sure whether I was even passing the function correctly.
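
If it helps to see that requirement in isolation, here’s a minimal sketch (separate from the function above) that feeds tokenize a readline not attached to a disk file at all; an in-memory bytes buffer works just as well:

import io
from tokenize import tokenize

# Any callable that returns one line of bytes per call will do...
source = b'x = 1\nprint(x)\n'
for tok in tokenize(io.BytesIO(source).readline):
    print(tok)
#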

Call parse_python with a Python script name (no “.py”) and an optional path name (if file isn’t in BasePath). The function opens the file (in binary mode!) and has tokenize do the rest. It returns the list of tokens generated. Each token has the structure:

  1. TokenType (type) — a number indicating what kind of token this is.
  2. TokenString (string) — a string of the actual token text.
  3. RowColStart (start) — a pair of numbers: the row & column where the token begins.
  4. RowColEnd (end) — a pair of numbers: the row & column where the token ends.
  5. LineText (line) — the text of the line the token is on.
  6. (exact_type) — same as type, but resolves generic OP tokens to their specific operator types.

The names in parentheses are the names Python assigns to the named tuple it uses for tokens. You can refer to token fields by these names or by index numbers. (Or, as you’ll see, you can skip the need to do either entirely.)
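
For example, with a hypothetical example.py sitting in BasePath, you could poke at the result like this:

toks = parse_python('example')
tok = toks[1]                  # toks[0] is the ENCODING token
print(tok.type, tok.string)    # access fields by name...
print(tok[0], tok[1])          # ...or by index number
print(tok.exact_type)          # type, with operators resolved
#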

§

We can add a simple function to list the tokens to an output file:

from token import tok_name
def write_output_listing (toks, fname, fpath=BasePath, lines=0):
    '''Write Listing Output File.'''
    cstr = lambda x: ' '.join(['%02x' % ord(ch) for ch in x])
    outfmt = '%4d[%3d:%3d|%3d:%3d] %-9s %-24s%s'
    print('LIST: %s' % fname)
    # Open output file...
    fn = path.join(fpath, '%s.list' % fname)
    fp = open(fn, mode='w', encoding='utf-8')
    try:
        # For each token...
        for ix,tok in enumerate(toks):
            # Unpack token...
            tt,ts,rc0,rc1,ln = tok
            # Translate type number to type string...
            ttn = tok_name[tt].lower()
            # Special handling for newline tokens...
            ts = cstr(ts) if ttn in ['nl','newline'] else ts
            # Optional full line output...
            ls = ('| %s' % ln.rstrip()) if lines else ''
            # Print the output...
            args = (ix+1,rc0[0],rc0[1],rc1[0],rc1[1],ttn,ts,ls)
            print(outfmt % args, file=fp)
        print(file=fp)
    except:
        raise
    else:
        print('wrote: %s' % fn)
    finally:
        fp.close()
#

As before, the function takes a script name (no extension) along with an optional pathname override. Its first argument is the list of tokens to write out, and the optional lines flag adds the full source line to each entry.

As you see, another way to deal with the token tuple is to unpack it into separate variables. Most of the cruft in the for-loop is to pretty-print the tokens. In particular note how we import tok_name so we can convert TokenType numbers to TokenTypeName strings.
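
Putting the two together is just a couple of calls (again with a hypothetical example.py in BasePath):

toks = parse_python('example')
write_output_listing(toks, 'example', lines=1)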

Just taking the project this far is fun. We can see how Python views our scripts. That can be instructive. In a bit we’ll see that we can go the other way and turn a list of tokens into a Python script.

§

But first let’s see how to generate HTML. Any color display has a protocol for how to tell it what color to use. HTML has an obsolete way we’ll ignore (using <FONT> tags), and an approved way — CSS styles. There are two options. In both cases we enclose each text token in <SPAN> tags. Option one uses the style attribute to explicitly set the style of each token:

<SPAN style="color:#0000ff;">import</SPAN>

Option two uses the class attribute to divide tokens into classes to which a <STYLE> section assigns global values:

<SPAN class="keyword">import</SPAN>

Since the tokenize function already divides tokens into classes, option two is a natural fit. We can just use the TokenTypeName as the style class name. That’s a nice unity.

I think it makes cleaner HTML, too. The flip side might be that it does require a <STYLE> section or a <LINK> to a CSS stylesheet. These decisions all depend on your environment and what you’re doing.
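
For what it’s worth, the external-stylesheet route is just one extra line in the page’s <head> (pytokens.css being a made-up name):

<link rel="stylesheet" type="text/css" href="pytokens.css">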

Here’s the function that creates the HTML file:

def write_output_html (toks, fname, fpath=BasePath):
    '''Write HTML Output File.'''
    span = '<span class="%s">%s</span>'
    print('HTML: %s' % fname)
    fn = path.join(fpath, '%s.html' % fname)
    fp = open(fn, mode='w', encoding='utf-8')
    try:
        fp.write('<html>\n')
        fp.write('<head>\n')
        fp.write('<title>%s</title>\n' % fname)
        fp.write('<style type="text/css">\n')
        fp.write('.keyword {color:#0000ff;font-weight:bold;}\n')
        fp.write('.function {color:#0033cc;}\n')
        fp.write('.op {color:#000099;}\n')
        fp.write('.string {color:#009900;}\n')
        fp.write('.number {color:#ff0000;font-weight:bold;}\n')
        fp.write('.comment {color:#999999;font-style:italic;}\n')
        fp.write('</style>\n')
        fp.write('</head>\n')
        fp.write('<body>\n')
        fp.write('<h1>%s</h1>\n' % fname)
        fp.write('<hr>\n')
        fp.write('<div style="font-family:monospace;">\n')
        ccursor = 0 # Column Cursor
        for tt,ts,rc0,rc1,ln in toks:
            ttn = tok_name[tt].lower()
            # Ignore certain token types...
            if ttn in ['encoding', 'indent', 'dedent']:
                continue
            # Handle newline tokens...
            if ttn in ['newline', 'nl']:
                fp.write('<br>\n')
                ccursor = 0
                continue
            # Check for special names...
            if ttn == 'name':
                if ts in Keywords:
                    ttn  = 'keyword'
                elif ts in Functions:
                    ttn = 'function'
            # If the token begins after the cursor...
            if ccursor < rc0[1]:
                # Write spaces to compensate...
                fp.write('&nbsp;'*(rc0[1]-ccursor))
            # Special handling for HTML output...
            ts = ts.replace('&','&amp;')
            ts = ts.replace('<','&lt;')
            ts = ts.replace('>','&gt;')
            # Emit the token...
            fp.write(span % (ttn, ts))
            # Update the cursor...
            ccursor = rc1[1]
        fp.write('</div>\n')
        fp.write('<hr>\n')
        fp.write('</body>')
        fp.write('</html>')
        fp.write('\n')
    except:
        raise
    else:
        print('wrote: %s' % fn)
    finally:
        fp.close()
#

Yep, it’s kinda long. To make it this short I left out the two global data items Keywords and Functions. They’re just list objects filled with strings. The lists are, per their names, Python’s keywords and built-in function names. The tokenizer sees all names as equal, so we check to see if a name is a keyword or built-in Python function and color those specially.
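
They aren’t shown here, but one way to build them, a sketch rather than necessarily how mine are defined, is to lean on the standard library:

import keyword
import builtins

# Python's reserved words ('def', 'import', 'return', ...)
Keywords = keyword.kwlist
# Built-in names ('print', 'len', 'range', ... plus types and exceptions)
Functions = [name for name in dir(builtins) if not name.startswith('_')]
#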

Since the tokenizer gives us start and end columns, we can use those to place the tokens within the line. The easiest way is to maintain a cursor that tracks the current column. If a token goes in a column beyond the cursor, we insert spaces to compensate. We use the HTML special no-break space to avoid HTML’s whitespace compression.

Lastly, we need to make sure we translate any output characters that are special in HTML into their HTML-safe equivalents. Note that we write the many supporting lines of HTML the tedious way, with individual write statements. There are more creative ways.
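
One such way, sketched here rather than what the function above actually does, is to stuff the boilerplate into a single template string and write it in one shot (only a couple of the style rules are shown):

HtmlHead = '''<html>
<head>
<title>%(title)s</title>
<style type="text/css">
.keyword {color:#0000ff;font-weight:bold;}
.comment {color:#999999;font-style:italic;}
</style>
</head>
<body>
<h1>%(title)s</h1>
<hr>
<div style="font-family:monospace;">
'''
fp.write(HtmlHead % {'title': fname})
#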

§

There is also an untokenize function that turns a list of tokens into Python source code. It’s a bit tedious, and the source code doesn’t look that great (IMO), but it’s possible to do.

Essentially, just pass untokenize the same list tokenize returns. You only need the first two fields, TokenType and TokenString. Any others are ignored.

Here’s an example:

from token import tok_name
from tokenize import untokenize

# Create a reverse lookup from token-name to token-id...
rlu = {name: num for num, name in tok_name.items()}
# Create a token-maker function...
Token = lambda tt,ts: (rlu[tt], ts)
toks = []
toks.append(Token('ENCODING', 'utf-8'))
toks.append(Token('STRING', '"""Fibonacci Function."""'))
toks.append(Token('NEWLINE', '\r\n'))

toks.append(Token('NAME', 'from'))
toks.append(Token('NAME', 'sys'))
toks.append(Token('NAME', 'import'))
toks.append(Token('NAME', 'argv'))
toks.append(Token('NEWLINE', '\r\n'))
toks.append(Token('NL', '\r\n'))

toks.append(Token('NAME', 'def'))
toks.append(Token('NAME', 'fib'))
toks.append(Token('OP', '('))
toks.append(Token('NAME', 'n'))
toks.append(Token('OP', ')'))
toks.append(Token('OP', ':'))
toks.append(Token('NEWLINE', '\r\n'))

toks.append(Token('INDENT', ' '*4))
toks.append(Token('STRING', '"""Get Nth Fib#. RECURSIVE!"""'))
toks.append(Token('NEWLINE', '\r\n'))

toks.append(Token('NAME', 'if'))
toks.append(Token('NAME', 'n'))
toks.append(Token('OP', '<'))
toks.append(Token('NUMBER', '2'))
toks.append(Token('OP', ':'))
toks.append(Token('NEWLINE', '\r\n'))

toks.append(Token('INDENT', ' '*8))
toks.append(Token('NAME', 'return'))
toks.append(Token('NAME', 'n'))
toks.append(Token('NEWLINE', '\r\n'))

toks.append(Token('DEDENT', ''))
toks.append(Token('NAME', 'return'))
toks.append(Token('NAME', 'fib'))
toks.append(Token('OP', '('))
toks.append(Token('NAME', 'n'))
toks.append(Token('OP', '-'))
toks.append(Token('NUMBER', '1'))
toks.append(Token('OP', ')'))
toks.append(Token('OP', '+'))
toks.append(Token('NAME', 'fib'))
toks.append(Token('OP', '('))
toks.append(Token('NAME', 'n'))
toks.append(Token('OP', '-'))
toks.append(Token('NUMBER', '2'))
toks.append(Token('OP', ')'))
toks.append(Token('NEWLINE', '\r\n'))
toks.append(Token('NL', '\r\n'))

toks.append(Token('DEDENT', ''))
toks.append(Token('ENDMARKER', ''))
bs = untokenize(toks)
ps = bs.decode('utf-8')
print(ps)
#

I used blank lines to set off the source code lines. Note that the INDENT token requires the number of spaces (or whatevers) to indent. Note also that untokenize returns a byte string that needs to be converted to a string. (Or it could be written to a file in binary mode.)
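
Dumping the bytes straight to disk would be just (fib.py being a made-up name):

with open('fib.py', mode='wb') as out:
    out.write(bs)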

When run it prints:

"""Fibonacci Function."""
from sys import argv 
def fib (n ):
    """Get Nth Fib#. RECURSIVE!"""
    if n <2 :
        return n 
    return fib (n -1 )+fib (n -2 )

Not the prettiest code — lots of spurious spaces (and none after the minus signs) — but it’s valid code. And while the spacing may change, tokenize and untokenize do guarantee the round trip: the generated source always tokenizes back to the same token types and strings.
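
A quick way to convince yourself of that is a round trip over a small snippet; this sketch uses an in-memory buffer rather than a file:

import io
from tokenize import tokenize, untokenize

src = b'total = 0\nfor n in range(5):\n    total += n\n'
toks = list(tokenize(io.BytesIO(src).readline))
src2 = untokenize(toks)
# The regenerated source tokenizes back to the same token stream...
toks2 = list(tokenize(io.BytesIO(src2).readline))
same = [(t.type, t.string) for t in toks] == [(t.type, t.string) for t in toks2]
print(same)    # True
#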

§ §

I can’t imagine using tokens to generate Python source, but maybe I’ll run into a use case someday. Being able to parse source into tokens is a nice tool for the kit, though. Very handy for syntax-highlighting!

Ø