Tags
computer code, language design, little programming language, Python code, recursive descent parser
Last time I introduced a general Definition Language (DL) I created for defining structured information. The end goal was an extension of DL, called Data Definition Language (DDL), intended for defining memory and file formats. The idea was that tools examining such data could produce more knowledgeable output than a raw hex dump.
I mentioned that DL has been on my mind lately, and as it turns out I spent the day yesterday writing a DL parser in Python.
Here’s the BNF for DL; it’s very simple:
{definition} := {tag} {name} {content} "END" {tag}
             := {tag} {name} {content} ";"
{content}    := NULL
             := {attribute} {content}
             := {definition} {content}
{attribute}  := {tag} ":" {name}
{comment}    := "--" {any-text} EOL
A {definition} is a nested object (much like an XML element) that can contain {attribute}s and more {definition}s. Unlike XML, there is no text content; any such content must go in an {attribute}.
Here’s a simple DL file:
module "My DL Demo"
version: 1.0
author: "Wyrd Smythe"
section Header
datetime created date:2021/11/11 time:16:40:42 ;
datetime updated date:2021/11/11 time:19:58:11 ;
email: thewyrdsymthe@foomail.net
encoding: UTF-8
end section
section Body
access: public
section Signature
type: byte
count: 8
end section
section Parameters
-- more definitions
end section
section ControlData
-- more definitions
end section
end section
--
-- ....etc.....
--
end module
Note that DL is just a format. Concrete use requires defining the {definition} and {attribute} tags meaningfully.
Such a simple format seems like it would be easy to parse, and it is.
§ §
To begin, we’ll break the text into a list of tokens for a recursive descent parser. We’ll strip comments in the lexer, so the parser will only need to concern itself with three types of objects: a {definition}, an {attribute} and the content of a {definition}. In fact there are only two types; the third is just a list of some mix of the first two.
The first item is the class that defines an InputToken. The lexer creates and returns a list of these; the parser takes that list as its input.
class InputToken:
    '''DL File Input Token class.'''
    def __init__ (self, char, char_index, char_line):
        self.word = char
        self.cpos = char_index
        self.lnbr = char_line

    def add (self, char): self.word += char

    def __bool__ (self): return True
    def __call__ (self): return self.word
    def __len__ (self): return len(self.word)
    def __iter__ (self): return iter(self.word)
    def __getitem__ (self, ix): return self.word[ix]

    def __str__ (self):
        t = (self.lnbr, self.cpos, self.word)
        s = '[%4d,%6d] %s'
        return s % t
Very simple object: word is one or more characters; cpos is the character position in the file; lnbr is its line number. The add method appends characters to the word variable.
Of course, we always define __str__ so that we can pretty print an instance.
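For example, building up a token a character at a time and printing it looks like this (a quick sketch using only the methods above):

tok = InputToken('m', 0, 1)       # first character, char position 0, line 1
tok.add('o')                      # append more characters as they arrive
tok.add('d')
print(tok)                        # [   1,     0] mod
print(tok(), len(tok), tok[0])    # mod 3 m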
The lexer function, dl_parser, looks basically like this:
def dl_parser (srcfile):
    '''DL Parser. Reads "srcfile" and returns InputTokens.'''
    attribute_char = ':'    # attribute ends with
    terminal_char = ';'     # alternate END {tag} character
    comment_char = '-'      # comments begin with two
    cdata_sot = '<<<'       # CDATA begins with
    cdata_eot = '>>>'       # CDATA ends with
    EOL = '\n'              # End Of Line character

    tokens = []             # buffer for InputTokens
    curr_token = None       # current InputToken
    line_nbr = 1            # current line number
    comment_flag = 0        # in-comment state flag
    string_flag = ''        # in-string state flag
    cdata_flag = 0          # in-CDATA state flag

    # Read the source file...
    src = textfile(srcfile)
    src.load()

    # Parse the source file...
    for ix,ch in enumerate(src):
        #
        # ... the DL parser guts ...
        #
        # Otherwise, start a new token...
        curr_token = InputToken(ch, ix, line_nbr)

    # Return list of InputTokens...
    return tokens
The function defines some constants and variables, loops over the characters in the source file, and then returns the list of tokens. Note that the last line of the loop, if we make it there, uses the current character to start a new token. As you’ll see, we only make it to the end of the loop if the character isn’t handled in any other way.
(The textfile class is my little set of file objects. It can easily be replaced with the standard Python open function.)
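If you want to try the code without it, here’s a minimal stand-in based only on how dl_parser uses the object (my sketch; the real textfile class does more):

class textfile:
    '''Minimal stand-in: load a text file and iterate it by character.'''
    def __init__ (self, fname):
        self.fname = fname
        self.text = ''
    def load (self):
        with open(self.fname, 'r', encoding='utf-8') as f:
            self.text = f.read()
    def __iter__ (self):
        return iter(self.text)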
§
The guts of the parser, the body of the for loop, divides into three sections. The first one includes the line counter and string handlers. (Note that these three sections have two levels of indent I’m not showing here. They’re part of that for loop.)
# Count lines...
if ch == EOL:
    line_nbr += 1

# Special handling if in STRING mode...
if string_flag:
    curr_token.add(ch)
    if curr_token[-1] == string_flag:
        string_flag = ''
        tokens.append(curr_token)
        curr_token = None
        Log.trace('string-exit: %s' % ch)
    continue

# Special handling if in CDATA mode...
if cdata_flag:
    curr_token.add(ch)
    if curr_token().endswith(cdata_eot):
        cdata_flag = 0
        tokens.append(curr_token)
        curr_token = None
    continue
The string handlers will make more sense once we see the last section. The middle section handles (and removes) comments. (A comment begins with two hyphen characters and continues to the end of line.)
# If we're fully in comment mode...
if comment_flag == 2:
    # If we've found the end of line, exit comment mode...
    if ch == EOL:
        comment_flag = 0
    # Otherwise stay in comment mode...
    continue

# If we've seen one comment char,...
if comment_flag == 1:
    # Another one puts us in comment mode...
    if ch == comment_char:
        comment_flag = 2
        continue
    # Otherwise reset the flag and emit the buffered char...
    comment_flag = 0
    if not curr_token:
        # Start a new token if necessary...
        curr_token = InputToken('', ix, line_nbr)
    curr_token.add(comment_char)
    curr_token.add(ch)
    continue

# If we're not in comment mode at all,...
if comment_flag == 0:
    # Seeing a comment char puts us in comment mode...
    if ch == comment_char:
        # Set the flag and next char...
        comment_flag = 1
        continue
The last section does some serious character handling. It recognizes whitespace (which will end any token we’re building), strings, and other special characters.
# Whitespace ends the token we're building (if any)...
if ch.isspace():
    # If we're building a token, add it to the list...
    if curr_token:
        tokens.append(curr_token)
        curr_token = None
    # Next char...
    continue

# Special handling for special END char...
if ch == terminal_char:
    # If we're building a token, add it to the list...
    if curr_token:
        tokens.append(curr_token)
        curr_token = None
    # Create special END token...
    tok = InputToken(terminal_char, ix, line_nbr)
    tokens.append(tok)
    continue

# If we're building a token...
if curr_token:
    # Add character to token...
    curr_token.add(ch)
    # Special handling for start of strings...
    if curr_token[0] in ['"', "'"]:
        string_flag = curr_token[0]
        continue
    # Special handling for start of CDATA...
    if curr_token().startswith(cdata_sot):
        cdata_flag = 1
        continue
    # Special handling for attribute names...
    if ch == attribute_char:
        tokens.append(curr_token)
        curr_token = None
        continue
    # In any event, next char...
    continue
Now the first section should make more sense. In the two quoted-text states (string and CDATA) we consume characters unconditionally until we reach the closing delimiter.
You’ll note I make heavy use of the continue statement. There are purists who dislike any kind of “jump” in the code. A more common dislike along these lines is for multiple return statements in a function. The purist approach says there can be only one. I am not a purist, and I think multiple returns are more readable. Likewise the continue and break statements. Love them.
§
Now that we have a nice list of InputToken we can pass them to a recursive descent parser. As noted, we only need three functions, one for each BNF definition: {definition}, {attribute}, or {content}.
The first one, dl_definition, parses a {definition}, and it’s the first one called, because a {definition} is always the root object of the DL tree structure.
def dl_definition (toks):
    # Tag, Name...
    tok1 = toks.pop(0)
    tok2 = toks.pop(0)
    define_obj = dlDefinition(tok1(), tok2())

    # Content...
    dl_content(toks, define_obj)

    # END keyword...
    tok3 = toks.pop(0)
    if tok3() == ';':
        Log.trace('definition: %s' % define_obj)
        return define_obj
    if tok3().upper() != 'END':
        raise SyntaxError(E_NO_END_KEY % (tok3(), tok3.lnbr))

    # End Tag...
    tok4 = toks.pop(0)
    if tok4().upper() != tok1().upper():
        raise SyntaxError(E_NO_END_TAG % (tok4(), tok4.lnbr))

    # Return the definition object...
    Log.trace('definition: %s' % define_obj)
    return define_obj
All three functions take a list of tokens. That list is consumed as the parsing progresses. Note the call to dl_content; it’s where the recursion begins.
Speaking of which, here’s that function:
def dl_content (toks, define_obj):
    # Peek at the first token...
    while 0 < len(toks):
        tok = toks[0]
        # If it's the end of a def, there's no more content...
        if tok().upper() in ['END', ';']:
            Log.trace('content: END')
            return

        # Handle Attributes...
        if tok().endswith(':'):
            dl_attribute(toks, define_obj)
            continue

        # It's a nested definition...
        nested_def = dl_definition(toks)
        define_obj.children.append(nested_def)
Which just leaves dl_attribute:
def dl_attribute (toks, define_obj):
    # Name, Value...
    tok1 = toks.pop(0)
    tok2 = toks.pop(0)

    # Add property...
    define_obj.props[tok1()] = tok2()
009|
And that’s all there is to it. Note that the dl_content and dl_attribute functions also take a define_obj argument.
The output is the root dlDefinition object, which contains the nested content as more dlDefinition objects. Each object holds its attributes and children. The class is defined as:
class dlDefinition:
    '''DL Definition class.'''
    def __init__ (self, tag, name):
        self.tag = tag
        self.name = name
        self.props = {}
        self.children = []

    def __str__ (self):
        return f'{self.tag} {self.name}'
That’s all there is to the code. Using it is as simple as:
tokens = dl_parser(datafile)
define_obj = dl_definition(tokens)
The define_obj is the root {definition} object. QED!
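If you want to see what you got, a quick recursive dump of the tree does the trick. (This dl_dump helper is just an illustration for this post, not part of the parser.)

def dl_dump (define_obj, depth=0):
    '''Print a dlDefinition, its props, and its children, indented by depth.'''
    pad = '    ' * depth
    print('%s%s %s' % (pad, define_obj.tag, define_obj.name))
    for key, value in define_obj.props.items():
        print('%s    %s %s' % (pad, key, value))
    for child in define_obj.children:
        dl_dump(child, depth + 1)

dl_dump(define_obj)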
§ §
Any questions?
To take this further I’ll need to define DDL precisely enough for the next stage, converting the object tree to a data definition that can be applied to some input stream of bytes.
I would like to see this mostly idle dream finally completed, if for no other reason than that it’s been in the back of my mind a very long time. I get distracted easily these days, and just as quickly bored, so we’ll see how that goes.
Ø
Happiness is when you’re working on your parser and all the errors that pop up turn out to be syntax problems in the DL file, not a bug in the parser. The parser is working well enough to catch syntax errors in the test files. Sweet!!
BTW, here are the definitions for the error messages:
E_UNEXPECTED = 'Unexpected Token: %s [line %d]'
E_NO_END_KEY = 'Expected END keyword; got: %s [line %d]'
E_NO_END_TAG = 'Expected closing tag; got: %s [line %d]'
Parsing the sample DL file in the post made me realize time strings need to be quoted, because otherwise the colon makes them look like an attribute. I’ve been thinking datetime strings should be a datatype anyway; some language I recall treated #string# as a datetime string, recognizing the pound sign as a datetime string quote.
Currently all data values, whether numeric, string, or date, are just stored as strings, but one of the next steps is converting numeric and datetime strings to their appropriate datatypes. True/False should likewise be treated as boolean values.
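Something like this rough sketch would do for that conversion step (a guess at it, not code from the parser; the date format matches the sample file above):

from datetime import datetime

def convert_value (text):
    '''Try to interpret a stored string as a richer datatype.'''
    if text.upper() in ('TRUE', 'FALSE'):
        return text.upper() == 'TRUE'
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            pass
    try:
        return datetime.strptime(text, '%Y/%m/%d')
    except ValueError:
        return text    # no conversion applies; leave it a string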
And I still need to sit down and define a DDL. (I thought I had once, but I can’t find those notes.)