Last time I introduced a general Definition Language (DL) I created for defining structured information. The end goal was an extension of DL, called Data Definition Language (DDL), for defining memory and file formats. The idea was to give tools that examine such data a way to produce more knowledgeable output than a raw hex dump.

I mentioned that DL has been on my mind lately, and as it turns out I spent the day yesterday writing a DL parser in Python.

Here’s the BNF for DL; it’s very simple:

{definition} := {tag} {name} {content} "END" {tag}
             := {tag} {name} {content} ";"

{content} := NULL
          := {attribute}{content}
          := {definition}{content}

{attribute} := {tag} ":" {name}

{comment} := "--" {any-text} EOL

A {definition} is a nested object (much like an XML element) that can contain {attribute}s and more {definition}s. Unlike XML, there is no text content; any such content must go in an {attribute}.

Here’s a simple DL file:

module "My DL Demo"
    version: 1.0
    author: "Wyrd Smythe"

    section Header
        datetime created date:2021/11/11 time:16:40:42 ;
        datetime updated date:2021/11/11 time:19:58:11 ;
        email: thewyrdsymthe@foomail.net
        encoding: UTF-8
    end section

    section Body
        access: public
        section Signature
            type: byte
            count: 8
        end section
        section Parameters
            -- more definitions
        end section
        section ControlData
            -- more definitions
        end section
    end section
    --
    -- ....etc.....
    --
end module

Note that DL is just a format. Concrete use requires defining the {definition} and {attribute} tags meaningfully.

Such a simple format seems like it would be easy to parse, and it is.

§ §

To begin, we’ll break the text into a list of tokens for a recursive descent parser. We’ll strip comments in the lexer, so the parser will only need to concern itself with three types of objects: a {definition}, an {attribute} and the content of a {definition}. In fact there are only two types; the third is just a list of some mix of the first two.
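To make that concrete, take two lines from the demo file above:

author: "Wyrd Smythe"
encoding: UTF-8

The lexer turns those into four tokens. An attribute name keeps its trailing colon, and a quoted string, spaces and all, comes through as a single token:

author:
"Wyrd Smythe"
encoding:
UTF-8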

The first item is the class that defines an InputToken. The lexer creates and returns a list of these; the parser consumes that list.

class InputToken (object):
    '''DL File Input Token class.'''
    def __init__ (self, char, char_index, char_line):
        self.word = char
        self.cpos = char_index
        self.lnbr = char_line

    def add (self, char): self.word += char

    def __bool__ (self): return True
    def __call__ (self): return self.word
    def __len__ (self): return len(self.word)
    def __iter__ (self): return iter(self.word)
    def __getitem__ (self, ix): return self.word[ix]

    def __str__ (self):
        t = (self.lnbr, self.cpos, self.word)
        s = '[%4d,%6d] %s'
        return s%t
#

Very simple object: word is one or more characters; cpos is the character position in the file; lnbr is its line number. The add method appends characters to the word variable.

Of course, we always define __str__ so that we can pretty print an instance.
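As a quick illustration of how the lexer drives it (the values here are made up for the example):

tok = InputToken('v', 120, 3)   # lexer sees 'v', starts a new token
tok.add('e')                    # ...and keeps appending characters
tok.add('r')
print(tok)                      # prints: [   3,   120] ver
#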

The lexer function, dl_parser, looks basically like this:

def dl_parser (srcfile):
    '''DL Parser. Reads "srcfile" and returns InputTokens.'''
    attribute_char = ':'    # attribute ends with
    terminal_char = ';'     # alternate END {tag} character
    comment_char = '-'      # comments begin with two
    cdata_sot = '<<<'       # CDATA begins with
    cdata_eot = '>>>'       # CDATA ends with
    EOL = '\n'              # End Of Line character

    tokens = []             # buffer for InputTokens
    curr_token = None       # current InputToken
    line_nbr = 1            # current line number
    comment_flag = 0        # in-comment state flag
    string_flag = ''        # in-string state flag
    cdata_flag = 0          # in-CDATA state flag

    # Read the source file...
    src = textfile(srcfile)
    src.load()

    # Parse the source file...
    for ix,ch in enumerate(src):
        #
        #
        # ... the DL parser guts ...
        #
        #
        # Otherwise, start a new token...
        curr_token = InputToken(ch, ix, line_nbr)
    # Return list of InputTokens...
    return tokens
#

We define some constants and variables, loop over the characters in the source file, and return the list of tokens. Note that the last line of the loop, if we make it there, uses the current character to start a new token. As you’ll see, we only make it to the end of the loop if the character isn’t handled in any other way.

(The textfile class is my little set of file objects. It can easily be replaced with the standard Python open function.)
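If you don’t have the textfile class, a minimal stand-in is easy; this sketch (the read_chars name is mine, not part of the original code) assumes all the lexer needs is the file’s characters in order:

def read_chars (srcfile):
    '''Return the file's contents as one string of characters.'''
    with open(srcfile, encoding='utf-8') as f:
        return f.read()
#

With that, the two textfile lines in dl_parser become src = read_chars(srcfile), and the enumerate loop works unchanged.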

§

The guts of the parser, the body of the for loop, divides into three sections. The first one includes the line counter and string handlers. (Note that these three sections have two levels of indent I’m not showing here. They’re part of that for loop.)

# If we see an end of line char, bump the line number...
if ch == EOL:
    line_nbr += 1

# Special handling if in STRING mode...
if string_flag:
    curr_token.add(ch)
    if curr_token[-1] == string_flag:
        string_flag = ''
        tokens.append(curr_token)
        curr_token = None
        Log.trace('string-exit: %s' % ch)
    continue

# Special handling if in CDATA mode...
if cdata_flag:
    curr_token.add(ch)
    if curr_token().endswith(cdata_eot):
        cdata_flag = 0
        tokens.append(curr_token)
        curr_token = None
    continue
#

The string handlers will make more sense once we see the last section. The middle section handles (and removes) comments. (A comment begins with two hyphen characters and continues to the end of line.) Note that the comment_flag == 1 branch is also what lets a lone hyphen survive inside a token: the '-' in UTF-8 from the demo file, for instance, isn’t emitted immediately; when the next character turns out not to be another hyphen, the flag is reset and both characters are added to the token.

# If we're in comment mode,...
if comment_flag == 2:
    # If we've found the end of line, exit comment mode...
    if ch == EOL:
        comment_flag = 0
    # Otherwise stay in comment mode...
    continue

# If we've seen one comment char,...
if comment_flag == 1:
    # Another one puts us in comment mode...
    if ch == comment_char:
        comment_flag = 2
        continue
    # Otherwise reset the flag and emit the buffered char...
    comment_flag = 0
    if not curr_token:
        # Start a new token if necessary...
        curr_token = InputToken('', ix, line_nbr)
    curr_token.add(comment_char)
    curr_token.add(ch)
    continue

# If we're not in comment mode at all,...
if comment_flag == 0:
    # Seeing a comment char puts us in comment mode...
    if ch == comment_char:
        # Set the flag and next char...
        comment_flag = 1
        continue
#

The last section does some serious character handling. It recognizes whitespace (which will end any token we’re building), strings, and other special characters.

# Whitespace ends any token we're building...
if ch.isspace():
    # If we're building a token, add it to the list...
    if curr_token:
        tokens.append(curr_token)
        curr_token = None
    # Next char...
    continue

# Special handling for special END char...
if ch == terminal_char:
    # If we're building a token, add it to the list...
    if curr_token:
        tokens.append(curr_token)
        curr_token = None
    # Create special END token...
    tok = InputToken(terminal_char, ix, line_nbr)
    tokens.append(tok)
    continue

# If we're building a token...
if curr_token:
    # Add character to token...
    curr_token.add(ch)
    # Special handling for start of strings...
    if curr_token[0] in ['"', "'"]:
        string_flag = curr_token[0]
        continue
    # Special handling for start of CDATA...
    if curr_token().startswith(cdata_sot):
        cdata_flag = 1
        continue
    # Special handling for attribute names...
    if ch == attribute_char:
        tokens.append(curr_token)
        curr_token = None
        continue
    # In any event, next char...
    continue
#

Now the first section should make more sense. In the two string-handling states we consume characters immediately until we exit the string. That’s why a quoted name like "My DL Demo" ends up as a single token, spaces and all; while we’re in string mode, the whitespace handling never gets a look at those characters.

You’ll note I make heavy use of the continue statement. There are purists who dislike any kind of “jump” in the code. A more common dislike along these lines is for multiple return statements in a function. The purist approach says there can be only one. I am not a purist, and I think multiple returns are more readable. Likewise the continue and break statements. Love them.

§

Now that we have a nice list of InputTokens, we can pass it to a recursive descent parser. As noted, we only need three functions, one for each BNF definition: {definition}, {attribute}, and {content}.

The first one, dl_definition, parses a {definition}, and it’s the first one called, because a {definition} is always the root object of the DL tree structure.

def dl_definition (toks):
    # Tag, Name...
    tok1 = toks.pop(0)
    tok2 = toks.pop(0)
    define_obj = dlDefinition(tok1(), tok2())

    # Content...
    dl_content(toks, define_obj)

    # END keyword...
    tok3 = toks.pop(0)
    if tok3() == ';':
        Log.trace('definition: %s' % define_obj)
        return define_obj
    if tok3().upper() != 'END':
        raise SyntaxError(E_NO_END_KEY % (tok3(), tok3.lnbr))

    # End Tag...
    tok4 = toks.pop(0)
    if tok4().upper() != tok1().upper():
        raise SyntaxError(E_NO_END_TAG % (tok4(), tok4.lnbr))

    # Return the definition object...
    Log.trace('definition: %s' % define_obj)
    return define_obj
#

All three functions take a list of tokens. That list is consumed as the parsing progresses. Note the call to dl_content; it’s where the recursion begins. For the demo file, dl_definition for the module calls dl_content, which handles the version and author attributes and then calls dl_definition again for each section, and so on down the tree.

Speaking of which, here’s that function:

def dl_content (toks, define_obj):
    # Peek at the first token...
    while 0 < len(toks):
        tok = toks[0]
        # If it's the end of a def, there's no more content...
        if tok().upper() in ['END', ';']:
            Log.trace('content: END')
            return

        # Handle Attributes...
        if tok().endswith(':'):
            dl_attribute(toks, define_obj)
            continue

        # It's a nested definition...
        nested_def = dl_definition(toks)
        define_obj.children.append(nested_def)
#

Which just leaves dl_attribute:

def dl_attribute (toks, define_obj):
    # Name, Value...
    tok1 = toks.pop(0)
    tok2 = toks.pop(0)

    # Add property...
    define_obj.props[tok1()] = tok2()
#

And that’s all there is to it. Note that the dl_content and dl_attribute functions also take a define_obj argument.

The output is the root dlDefinition object, and it contains the nested content of more dlDefinition objects. Each object has its attributes and children. These are defined:

class dlDefinition (object):
    '''DL Definition class.'''
    def __init__ (self, tag, name):
        self.tag = tag
        self.name = name
        self.props = {}
        self.children = []

    def __str__ (self):
        return f'{self.tag} {self.name}'
#

That’s all there is to the code. Using it is as simple as:

from os import path

datafile = path.join(BasePath, FileName)
tokens = dl_parser(datafile)
define_obj = dl_definition(tokens)
#

The define_obj is the root {definition} object. QED!
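To see what actually came back, a small recursive dump function does the trick. This one is just a sketch of mine, not part of the parser; it leans on the props and children members of dlDefinition:

def dl_dump (define_obj, indent=0):
    '''Recursively print a dlDefinition tree.'''
    pad = ' ' * indent
    print(pad + str(define_obj))               # tag and name
    for key,val in define_obj.props.items():   # attributes
        print('%s    %s %s' % (pad, key, val))
    for child in define_obj.children:          # nested definitions
        dl_dump(child, indent+4)
#

Calling dl_dump(define_obj) on the demo file prints the module, its attributes, and each nested section in turn.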

§ §

Any questions?

To take this further I’ll need to define DDL precisely enough for the next stage, converting the object tree to a data definition that can be applied to some input stream of bytes.

I would like to see this mostly idle dream finally completed, if for no other reason than that it’s been in the back of my mind a very long time. I get distracted easily these days, and just as quickly bored, so we’ll see how that goes.

Ø