Tags

, , , , , ,

Last time I introduced state engines and state tables. I showed parts of a simple implementation of one in Python. It parsed the language introduced in Little Programming Languages. This post continues that, so be sure you’ve read that first article.

I got as far as the state table implementing the process, and that’s where this post picks up. I’ll also get into the SourceReader class that does the heavy lifting.

From the StateTable listing shown last time, here again are the conditions for the alpha state:

[   (IsWS        , 'alpha'   , None         ),
    (IsChar(';') , 'comment' , None         ),
    (IsChar(')') , 'alpha'   , 'ExitDescr'  ),
    (IsName      , 'name1'   , 'StartName1' ),
    (IsAny       , 'ERROR'   , 'Error'      )
]

The second and third fields of each item are just strings. The former is the name of the next state, the latter is the name of a task to perform (or None if no task).

The first field of each item is a function the engine calls and passes the current character. The function determines if the character matches some criteria and returns true or false.

The IsChar() function is special in that it is a function returning a function. At run time, it turns into a function that tests for the character cited.

Here are the (lambda) functions defined for character matching:

IsAny  = lambda c: True
IsNL   = lambda c: c == '\n'
IsSPC  = lambda c: c in [' ', ',', '\t']
IsWS   = lambda c: c in [' ', ',', '\t', '\f', '\r', '\n']
IsNbr  = lambda c: c.isdigit()
IsName = lambda c: c.isalpha() or c in ['_', '.', '$']
IsWord = lambda c: c.isalnum() or c in ['_', '-', '.', '/']
IsChar = lambda test_char: lambda c: c == test_char

Note how the IsChar definition is a double lambda since it’s a function that returns a function.

§

Now the inner loop of the state engine (shown last time) should make more sense (if they didn’t before). Here are the relevant code lines:

ss = StateTable[state]    # Get conditions for this state
for test,nxt,ftn in ss:   # Loop thru conditions
    if test(c):           # If the character matches...
        if ftn:           # . If there's a task...
            ftn()         # . . Do the task.
        state = nxt       # . Set next state
        break             # . Don't check other conditions.

(For simplicity, I abbreviated the way the engine calls the task function.)

As a passing note, this state engine does have a potential design flaw (or feature in some situations). It assumes it always finds a matching condition.

If it doesn’t, it just exits the for loop normally instead of due to a break. In this case, the state remains the same and there is no task. Effectively, the input token is ignored. That can be a feature. Don’t provide matches for inputs you want to ignore. Or allow the input stream to have tokens the process is blind to (allows for extensions or user data).

Alternately, if each state provides a final “match all” condition, the engine always does find a match. (Plus the option of being able to ignore tokens in some states.)

However, the failure to account for a state can cause problems in other situations, so be aware of that property of this algorithm (which is intended for demonstration purposes only, anyway).

All that’s needed for a fix is a flag set when a condition matches and the loop exits with a break. After the loop the flag not set indicates a fail to match. The engine can throw an exception if that’s what the process needs. (It can be important that every condition is specifically accounted for, no general matches or defaults.)

§

Now I can start to show you the SourceReader class:

class SourceReader (object):
    def __init__ (self, source_text):
        self.src = source_text
        self.reset()

    def reset (self):
        self._obj = ('Source', '-', 'input', []) # object (output)
        self._buf = []
        self.cp = 0         # character pointer
        self.name1 = ''     # object name
        self.name2 = ''     # action name
        self.number = ''

    def __str__ (self):
        return self.src

A new instance takes a source_text string (the text to parse) and calls reset(). That clears the output object buffer (_buf) and resets the text character pointer (cp). It also clears some accumulator buffers used to build output.

The class str() method is redefined to return the source text. (It’s almost always sensible to redefine an object’s toString method — or whatever a given language calls it. It’s a good productivity and robustness measure.)

The next two methods provide client access to the text stream. More importantly, the nc() method (next character) advances the character pointer as the engine reads the text. The cc() method returns the current character.

    def nc (self):
        if self.cp < len(self.src):
            self.cp += 1
        return self.cc()

    def cc (self):
        return self.src[self.cp:1+self.cp]

Both methods return an empty string to indicate EOT (End Of Text).

The goal of the process is to create a list of compiled objects representing the input. The _obj instance member contains a top-level default object. All compiled objects are part of this top-level object’s list. Some objects are nested.

For example:

filter(GameType(regular)) filter(GameStatus(final))

The two filter objects end up on the top-level object’s list of objects. Both also have nested objects within them.

Two methods in SourceReader support object construction:

    def append (self, obj):
        self._obj[3].append(obj)     # add obj to current object

    def make_reference (self):
        s = self._object_name()      # construct a proper name
        t = ('String', s)            # create a String object
        self.append(t)               # append to current object
        self.name1 = ''              # clear buffers
        self.name2 = ''

    def make_object (self):
        n1 = self.name1 if self.name2 else ''
        n2 = self.name2 if self.name2 else self.name1
        t = ('Object', n1,n2, [])    # creat an Object
        s = self._object_name()      # construct a proper name
        self.append(t)               # append to current object
        self._buf.append(self._obj)  # buffer current object
        self._obj = t                # new object is now current
        self.name1 = ''              # clear buffers
        self.name2 = ''

    def exit_object (self):
        self._obj = self._buf.pop()  # restore pushed object

The append method adds the passed object to the object list of the current object.

A reference is a naked name (such as regular or final in the code snippet above). In other words, a variable or constant.

An object is something that looked like [name:]keyword(description)

An important key to all of this is the set of builder methods that buffer characters to build strings. They tend to look a lot alike, and they’re usually very simple, so here’s a small sample:

    def StartName1 (self):
        self.name1 = self.cc()
        self.name2 = ''

    def AddToName1 (self):
        self.name1 += self.cc()

    def StartName2 (self):
        self.name2 = self.cc()

    def AddToName2 (self):
        self.name2 += self.cc()

    def EndReference (self):
        s = self.make_reference()</pre>

Starting a token just means starting a buffer collecting characters for that token. This breaks into a first-character (which starts the process with a fresh buffer) and all others, which are appended.

And end task uses whatever has been accumulated in the buffer.

Now the StateTable shown in the last post should make more sense. All that remains is to explore the overall operation. That’s where I’ll pick up next time.