Recently I thought to myself, “Hey self,… Although I’ve never looked into them, I know Python has tools for parsing Python code. I wonder if they might make generating syntax-highlighted HTML pages of Python scripts pretty easy?”
I found the help page for the tokenize module, and the introduction says it’s “useful for implementing ‘pretty-printers,’ including colorizers for on-screen displays.” That sounds like a strong yes.
In fact, in a couple of hours I had a working version 1.0 that’s useful enough to put into production making webpages of some of my Python scripts. I already had a py2html module I wrote a while back, but it used its own home-grown tokenizer. This new version uses Python tools for that, so it’s much cleaner and almost certainly better.
§ §
To jump right in, let’s tokenize a Python script:
from os import path
from tokenize import tokenize

# BasePath is a module-level default directory for the scripts.
def parse_python (fname, fpath=BasePath):
    '''Tokenize Python Source File.'''
    print('Python: %s' % fname)
    fn = path.join(fpath, '%s.py' % fname)
    fp = open(fn, mode='rb')
    try:
        toks = list(tokenize(fp.readline))
    except:
        raise
    else:
        print('read: %s' % fn)
    finally:
        fp.close()
    return toks
#
One key is to open the file in binary mode. The tokenizer is expecting bytes.
This drove me crazy until I RTFM: “The tokenize() generator requires one argument, readline, which must be a callable object […]. Each call to the function should return one line of input as bytes.” Oh, okay. That last sentence didn’t sink in the first time I read it. Works quite well when you do it right.
I think I was a little thrown by passing in readline rather than the file object. I suppose that does give you the option of passing something that acts like a binary readline but doesn’t have that name. When the text file was giving me errors, I wasn’t sure if I was even passing the function correctly.
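If you want to see the bytes requirement in isolation, here’s a minimal sketch that feeds tokenize an in-memory io.BytesIO instead of a file; the one-line source is just a stand-in:

import io
from tokenize import tokenize

# tokenize() wants a callable that returns bytes, so the readline of
# a BytesIO works just like a file opened in 'rb' mode.
source = b"x = 1  # a tiny script\n"
for tok in tokenize(io.BytesIO(source).readline):
    print(tok)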
Call parse_python with a Python script name (no “.py”) and an optional path name (if the file isn’t in BasePath). The function opens the file (in binary mode!) and has tokenize do the rest. It returns the list of tokens generated. Each token has the structure:
- TokenType (type) — a number indicating what kind of token this is.
- TokenString (string) — a string of the actual token text.
- RowColStart (start) — a pair of numbers, the row & column where the token begins.
- RowColEnd (end) — a pair of numbers, the row & column where the token ends.
- LineText (line) — the text of the line the token is on.
- (exact_type) — same as type, but resolves operator tokens to their exact types.
The names in parentheses are the names Python assigns to the named tuple it uses for tokens. You can refer to token fields with these names or with index numbers. (Or, as you’ll see, you can skip the need to do either entirely.)
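For example (assuming toks is the list returned by parse_python for some hypothetical script), the same field can be reached several ways:

toks = parse_python('example')      # 'example' is a hypothetical script name
first = toks[0]
print(first.type, first.string)     # by named-tuple attribute
print(first[0], first[1])           # by index number
tt, ts, rc0, rc1, ln = first        # or unpack the whole tuple at once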
§
We can add a simple function to list the tokens to an output file:
from token import tok_name

def write_output_listing (toks, fname, fpath=BasePath, lines=0):
    '''Write Listing Output File.'''
    cstr = lambda x: ' '.join(['%02x' % ord(ch) for ch in x])
    outfmt = '%4d[%3d:%3d|%3d:%3d] %-9s %-24s%s'
    print('LIST: %s' % fname)
    # Open output file...
    fn = path.join(fpath, '%s.list' % fname)
    fp = open(fn, mode='w', encoding='utf-8')
    try:
        # For each token...
        for ix,tok in enumerate(toks):
            # Unpack token...
            tt,ts,rc0,rc1,ln = tok
            # Translate type number to type string...
            ttn = tok_name[tt].lower()
            # Special handling for newline tokens...
            ts = cstr(ts) if ttn in ['nl','newline'] else ts
            # Optional full line output...
            ls = ('| %s' % ln.rstrip()) if lines else ''
            # Print the output...
            args = (ix+1,rc0[0],rc0[1],rc1[0],rc1[1],ttn,ts,ls)
            print(outfmt % args, file=fp)
        print(file=fp)
    except:
        raise
    else:
        print('wrote: %s' % fn)
    finally:
        fp.close()
#
As before, the function takes a script name (no extension) along with an optional pathname override. It also takes a list of tokens to save to file.
As you see, another way to deal with the token tuple is to unpack it into separate variables. Most of the cruft in the for-loop is to pretty-print the tokens. In particular note how we import tok_name so we can convert TokenType numbers to TokenTypeName strings.
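Putting the two functions together might look like this (the script name here is hypothetical):

# Tokenize 'myscript.py' (found in BasePath) and write 'myscript.list'
# next to it, including the full source line for each token.
toks = parse_python('myscript')
write_output_listing(toks, 'myscript', lines=1)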
Just taking the project this far is fun. We can see how Python views our scripts. That can be instructive. In a bit we’ll see that we can go the other way and turn a list of tokens into a Python script.
§
But first let’s see how to generate HTML. Any color display has a protocol for how to tell it what color to use. HTML has an obsolete way we’ll ignore (using <FONT> tags), and an approved way — CSS styles. There are two options. In both cases we enclose each text token in <SPAN> tags. Option one uses the style attribute to explicitly set the style of each token:
<SPAN style="color:#0000ff;">import</SPAN>
Option two uses the class attribute to divide tokens into classes to which a <STYLE> section assigns global values:
<SPAN class="keyword">import</SPAN>
Since the tokenize function already divides tokens into classes, option two is a natural fit. We can just use the TokenTypeName as the style class name. That’s a nice unity.
I think it makes cleaner HTML, too. The flip side might be that it does require a <STYLE> section or a <LINK> to a CSS stylesheet. These decisions all depend on your environment and what you’re doing.
Here’s the function that creates the HTML file:
def write_output_html (toks, fname, fpath=BasePath):
    '''Write HTML Output File.'''
    span = '<span class="%s">%s</span>'
    print('HTML: %s' % fname)
    fn = path.join(fpath, '%s.html' % fname)
    fp = open(fn, mode='w', encoding='utf-8')
    try:
        fp.write('<html>\n')
        fp.write('<head>\n')
        fp.write('<title>%s</title>\n' % fname)
        fp.write('<style type="text/css">\n')
        fp.write('.keyword {color:#0000ff;font-weight:bold;}\n')
        fp.write('.function {color:#0033cc;}\n')
        fp.write('.op {color:#000099;}\n')
        fp.write('.string {color:#009900;}\n')
        fp.write('.number {color:#ff0000;font-weight:bold;}\n')
        fp.write('.comment {color:#999999;font-style:italic;}\n')
        fp.write('</style>\n')
        fp.write('</head>\n')
        fp.write('<body>\n')
        fp.write('<h1>%s</h1>\n' % fname)
        fp.write('<hr>\n')
        fp.write('<div style="font-family:monospace;">\n')
        ccursor = 0   # Column Cursor
        for tt,ts,rc0,rc1,ln in toks:
            ttn = tok_name[tt].lower()
            # Ignore certain token types...
            if ttn in ['encoding', 'indent', 'dedent']:
                continue
            # Handle newline tokens...
            if ttn in ['newline', 'nl']:
                fp.write('<br>\n')
                ccursor = 0
                continue
            # Check for special names...
            if ttn == 'name':
                if ts in Keywords:
                    ttn = 'keyword'
                elif ts in Functions:
                    ttn = 'function'
            # If the token begins after the cursor...
            if ccursor < rc0[1]:
                # Write no-break spaces to compensate...
                fp.write('&nbsp;'*(rc0[1]-ccursor))
            # Special handling for HTML output...
            ts = ts.replace('&','&amp;')
            ts = ts.replace('<','&lt;')
            ts = ts.replace('>','&gt;')
            # Emit the token...
            fp.write(span % (ttn.lower(),ts))
            # Update the cursor...
            ccursor = rc1[1]
        fp.write('</div>\n')
        fp.write('<hr>\n')
        fp.write('</body>')
        fp.write('</html>')
        fp.write('\n')
    except:
        raise
    else:
        print('wrote: %s' % fn)
    finally:
        fp.close()
#
Yep, it’s kinda long. To make it this short I left out the two global data items Keywords and Functions. They’re just list objects filled with strings. The lists are, per their names, Python’s keywords and built-in function names. The tokenizer sees all names as equal, so we check to see if a name is a keyword or built-in Python function and color those specially.
Since the tokenizer gives us start and end columns, we can use those to place the tokens within the line. The easiest way is to maintain a cursor that tracks the current column. If a token goes in a column beyond the cursor, we insert spaces to compensate. We use the HTML special no-break space to avoid HTML’s whitespace compression.
Lastly, we need to make sure we translate any output characters that are special in HTML to their HTML-safe equivalents. Note that we write the many supporting lines of HTML the tedious way, with individual write statements. There are more creative ways.
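The standard library can also handle the escaping; html.escape covers the ampersand and angle brackets in one call (with quote=False it matches the three replaces above):

from html import escape

token_text = '<lambda> & friends'
print(escape(token_text, quote=False))   # -> &lt;lambda&gt; &amp; friends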
§
There is also an untokenize function that turns a list of tokens into Python source code. It’s a bit tedious, and the source code doesn’t look that great (IMO), but it’s possible to do.
Essentially, just pass untokenize the same list tokenize returns. You only need the first two fields, TokenType and TokenString. Any others are ignored.
Here’s an example:
from tokenize import untokenize

# Create a reverse lookup for token-name to token-id...
rlu = dict([(valu,name) for name,valu in tok_name.items()])
# Create a token-maker function...
Token = lambda tt,ts: (rlu[tt], ts)

toks = []
toks.append(Token('ENCODING', 'utf-8'))
toks.append(Token('STRING', '"""Fibonacci Function."""'))
toks.append(Token('NEWLINE', '\r\n'))

toks.append(Token('NAME', 'from'))
toks.append(Token('NAME', 'sys'))
toks.append(Token('NAME', 'import'))
toks.append(Token('NAME', 'argv'))
toks.append(Token('NEWLINE', '\r\n'))
toks.append(Token('NL', '\r\n'))

toks.append(Token('NAME', 'def'))
toks.append(Token('NAME', 'fib'))
toks.append(Token('OP', '('))
toks.append(Token('NAME', 'n'))
toks.append(Token('OP', ')'))
toks.append(Token('OP', ':'))
toks.append(Token('NEWLINE', '\r\n'))

toks.append(Token('INDENT', ' '*4))
toks.append(Token('STRING', '"""Get Nth Fib#. RECURSIVE!"""'))
toks.append(Token('NEWLINE', '\r\n'))

toks.append(Token('NAME', 'if'))
toks.append(Token('NAME', 'n'))
toks.append(Token('OP', '<'))
toks.append(Token('NUMBER', '2'))
toks.append(Token('OP', ':'))
toks.append(Token('NEWLINE', '\r\n'))

toks.append(Token('INDENT', ' '*8))
toks.append(Token('NAME', 'return'))
toks.append(Token('NAME', 'n'))
toks.append(Token('NEWLINE', '\r\n'))

toks.append(Token('DEDENT', ''))
toks.append(Token('NAME', 'return'))
toks.append(Token('NAME', 'fib'))
toks.append(Token('OP', '('))
toks.append(Token('NAME', 'n'))
toks.append(Token('OP', '-'))
toks.append(Token('NUMBER', '1'))
toks.append(Token('OP', ')'))
toks.append(Token('OP', '+'))
toks.append(Token('NAME', 'fib'))
toks.append(Token('OP', '('))
toks.append(Token('NAME', 'n'))
toks.append(Token('OP', '-'))
toks.append(Token('NUMBER', '2'))
toks.append(Token('OP', ')'))
toks.append(Token('NEWLINE', '\r\n'))
toks.append(Token('NL', '\r\n'))

toks.append(Token('DEDENT', ''))
toks.append(Token('ENDMARKER', ''))

bs = untokenize(toks)
ps = bs.decode('utf-8')
print(ps)
#
I used blank lines to set off the source code lines. Note that the INDENT token requires the number of spaces (or whatevers) to indent. Note also that untokenize returns a bytes object that needs to be decoded to a string. (Or it could be written to a file in binary mode.)
When run it prints:
"""Fibonacci Function.""" from sys import argv def fib (n ): """Get Nth Fib#. RECURSIVE!""" if n <2 : return n return fib (n -1 )+fib (n -2 )
Not the prettiest code — lots of spurious spaces (and none after the minus signs) — but it’s valid code. The tokenize and untokenize pair guarantee the round trip in terms of tokens: the output tokenizes back to the same token types and strings, though the spacing between tokens may change.
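That claim is easy to spot-check: tokenize a snippet, keep only the (type, string) pairs, untokenize, and tokenize the result again. A small sketch:

import io
from tokenize import tokenize, untokenize

src = b"def f(n):\n    return n - 1\n"
# Reduce each token to its (type, string) pair...
toks1 = [(t.type, t.string) for t in tokenize(io.BytesIO(src).readline)]
# Rebuild the source (spacing changes) and tokenize it again...
src2 = untokenize(toks1)
toks2 = [(t.type, t.string) for t in tokenize(io.BytesIO(src2).readline)]
print(toks1 == toks2)   # should print True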
§ §
I can’t imagine using tokens to generate Python source, but maybe I’ll run into a use case someday. Being able to parse source into tokens is a nice tool for the kit, though. Very handy for syntax-highlighting!
Ø
For one thing, now I can include syntax highlighted comments!
'''
fib series: 0 1 1 2 3 5 8…

definition: fib(n) = fib(n-1) + fib(n-2)
exceptions: fib(0) = 0; fib(1) = 1
'''
def fib (n, lst=None):
    if not lst:
        return fib(n, [0, 1])
    if len(lst) < 2:
        lst.append(lst[0]+1)
    ix = len(lst) - 1
    c = lst[ix] + lst[ix-1]
    lst.append(c)
    if n <= len(lst):
        return lst
    return fib(n, lst)

ns = fib(42)[1:]
print(ns)
print()
And if I ever have to start using the Block Editor, I may be able to import HTML blocks, which would be good, because I understand the Block Editor doesn’t have source code blocks.