Last time I introduced state engines and state tables. I showed parts of a simple implementation of one in Python. It parsed the language introduced in Little Programming Languages. This post continues that, so be sure you’ve read that first article.
I got as far as the state table implementing the process, and that’s where this post picks up. I’ll also get into the
SourceReader class that does the heavy lifting.
StateTable listing shown last time, here again are the conditions for the
[ (IsWS , 'alpha' , None ), (IsChar(';') , 'comment' , None ), (IsChar(')') , 'alpha' , 'ExitDescr' ), (IsName , 'name1' , 'StartName1' ), (IsAny , 'ERROR' , 'Error' ) ]
The second and third fields of each item are just strings. The former is the name of the next state, the latter is the name of a task to perform (or
None if no task).
The first field of each item is a function the engine calls and passes the current character. The function determines if the character matches some criteria and returns true or false.
IsChar() function is special in that it is a function returning a function. At run time, it turns into a function that tests for the character cited.
Here are the (
lambda) functions defined for character matching:
IsAny = lambda c: True IsNL = lambda c: c == '\n' IsSPC = lambda c: c in [' ', ',', '\t'] IsWS = lambda c: c in [' ', ',', '\t', '\f', '\r', '\n'] IsNbr = lambda c: c.isdigit() IsName = lambda c: c.isalpha() or c in ['_', '.', '$'] IsWord = lambda c: c.isalnum() or c in ['_', '-', '.', '/'] IsChar = lambda test_char: lambda c: c == test_char
Note how the
IsChar definition is a double
lambda since it’s a function that returns a function.
Now the inner loop of the state engine (shown last time) should make more sense (if they didn’t before). Here are the relevant code lines:
ss = StateTable[state] # Get conditions for this state for test,nxt,ftn in ss: # Loop thru conditions if test(c): # If the character matches... if ftn: # . If there's a task... ftn() # . . Do the task. state = nxt # . Set next state break # . Don't check other conditions.
(For simplicity, I abbreviated the way the engine calls the task function.)
As a passing note, this state engine does have a potential design flaw (or feature in some situations). It assumes it always finds a matching condition.
If it doesn’t, it just exits the
for loop normally instead of due to a
break. In this case, the state remains the same and there is no task. Effectively, the input token is ignored. That can be a feature. Don’t provide matches for inputs you want to ignore. Or allow the input stream to have tokens the process is blind to (allows for extensions or user data).
Alternately, if each state provides a final “match all” condition, the engine always does find a match. (Plus the option of being able to ignore tokens in some states.)
However, the failure to account for a state can cause problems in other situations, so be aware of that property of this algorithm (which is intended for demonstration purposes only, anyway).
All that’s needed for a fix is a flag set when a condition matches and the loop exits with a
break. After the loop the flag not set indicates a fail to match. The engine can throw an exception if that’s what the process needs. (It can be important that every condition is specifically accounted for, no general matches or defaults.)
Now I can start to show you the
class SourceReader (object): def __init__ (self, source_text): self.src = source_text self.reset() def reset (self): self._obj = ('Source', '-', 'input', ) # object (output) self._buf =  self.cp = 0 # character pointer self.name1 = '' # object name self.name2 = '' # action name self.number = '' def __str__ (self): return self.src
A new instance takes a
source_text string (the text to parse) and calls reset(). That clears the output object buffer (
_buf) and resets the text character pointer (
cp). It also clears some accumulator buffers used to build output.
str() method is redefined to return the source text. (It’s almost always sensible to redefine an object’s
toString method — or whatever a given language calls it. It’s a good productivity and robustness measure.)
The next two methods provide client access to the text stream. More importantly, the
nc() method (next character) advances the character pointer as the engine reads the text. The
cc() method returns the current character.
def nc (self): if self.cp < len(self.src): self.cp += 1 return self.cc() def cc (self): return self.src[self.cp:1+self.cp]
Both methods return an empty string to indicate EOT (End Of Text).
The goal of the process is to create a list of compiled objects representing the input. The
_obj instance member contains a top-level default object. All compiled objects are part of this top-level object’s list. Some objects are nested.
The two filter objects end up on the top-level object’s list of objects. Both also have nested objects within them.
Two methods in
SourceReader support object construction:
def append (self, obj): self._obj.append(obj) # add obj to current object def make_reference (self): s = self._object_name() # construct a proper name t = ('String', s) # create a String object self.append(t) # append to current object self.name1 = '' # clear buffers self.name2 = '' def make_object (self): n1 = self.name1 if self.name2 else '' n2 = self.name2 if self.name2 else self.name1 t = ('Object', n1,n2, ) # creat an Object s = self._object_name() # construct a proper name self.append(t) # append to current object self._buf.append(self._obj) # buffer current object self._obj = t # new object is now current self.name1 = '' # clear buffers self.name2 = '' def exit_object (self): self._obj = self._buf.pop() # restore pushed object
append method adds the passed object to the object list of the current object.
A reference is a naked name (such as
final in the code snippet above). In other words, a variable or constant.
An object is something that looked like
An important key to all of this is the set of builder methods that buffer characters to build strings. They tend to look a lot alike, and they’re usually very simple, so here’s a small sample:
def StartName1 (self): self.name1 = self.cc() self.name2 = '' def AddToName1 (self): self.name1 += self.cc() def StartName2 (self): self.name2 = self.cc() def AddToName2 (self): self.name2 += self.cc() def EndReference (self): s = self.make_reference()</pre>
Starting a token just means starting a buffer collecting characters for that token. This breaks into a first-character (which starts the process with a fresh buffer) and all others, which are appended.
And end task uses whatever has been accumulated in the buffer.
StateTable shown in the last post should make more sense. All that remains is to explore the overall operation. That’s where I’ll pick up next time.