Tags
computer code, language design, little programming language, Python code, recursive descent parser
Last time I introduced a general Definition Language (DL) I created for defining structured information. The end goal was an extension of DL, called Data Definition Language (DDL), for describing memory and file formats. The idea was that tools examining such data could use a DDL description to produce more knowledgeable output than a raw hex dump.
I mentioned that DL has been on my mind lately, and as it turns out I spent the day yesterday writing a DL parser in Python.
Here’s the BNF for DL; it’s very simple:
{definition} := {tag} {name} {content} "END" {tag}
             := {tag} {name} {content} ";"
{content}    := NULL
             := {attribute} {content}
             := {definition} {content}
{attribute}  := {tag} ":" {name}
{comment}    := "--" {any-text} EOL
A {definition} is a nested object (much like an XML element) that can contain {attribute}s and more {definition}s. Unlike XML, there is no text content; any such content must go in an {attribute}.
Here’s a simple DL file:
module "My DL Demo" version: 1.0 author: "Wyrd Smythe" section Header datetime created date:2021/11/11 time:16:40:42 ; datetime updated date:2021/11/11 time:19:58:11 ; email: thewyrdsymthe@foomail.net encoding: UTF-8 end section section Body access: public section Signature type: byte count: 8 end section section Parameters -- more definitions end section section ControlData -- more definitions end section end section -- -- ....etc..... -- end module
Note that DL is just a format. Concrete use requires defining the {definition} and {attribute} tags meaningfully.
Such a simple format seems like it would be easy to parse, and it is.
§ §
To begin, we’ll break the text into a list of tokens for a recursive descent parser. We’ll strip comments in the lexer, so the parser will only need to concern itself with three types of objects: a {definition}, an {attribute} and the content of a {definition}. In fact there are only two types; the third is just a list of some mix of the first two.
The first item is the class that defines an InputToken. The lexer creates and returns a list of these; the parser takes that list as its input.
class InputToken (object):
    '''DL File Input Token class.'''
    def __init__ (self, char, char_index, char_line):
        self.word = char
        self.cpos = char_index
        self.lnbr = char_line
    def add (self, char):
        self.word += char
    def __bool__ (self):
        return True
    def __call__ (self):
        return self.word
    def __len__ (self):
        return len(self.word)
    def __iter__ (self):
        return iter(self.word)
    def __getitem__ (self, ix):
        return self.word[ix]
    def __str__ (self):
        t = (self.lnbr, self.cpos, self.word)
        s = '[%4d,%6d] %s'
        return s%t
#
Very simple object: word is one or more characters; cpos is the character position in the file; lnbr is its line number. The add method appends characters to the word variable.
Of course, we always define __str__ so that we can pretty-print an instance.
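As a quick illustration of how those dunder methods get used (this snippet is my own, not from the actual lexer, which adds one character at a time):

tok = InputToken('v', 42, 3)   # one character to start
tok.add('ersion:')             # the real lexer adds a char at a time
print(tok())                   # calling the token returns its word: version:
print(len(tok), tok[-1])       # 8 :
print(tok)                     # [   3,    42] version: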
The lexer function, dl_parser, looks basically like this:
def dl_parser (srcfile):
    '''DL Parser. Reads "srcfile" and returns InputTokens.'''
    attribute_char = ':'   # attribute ends with
    terminal_char = ';'    # alternate END {tag} character
    comment_char = '-'     # comments begin with two
    cdata_sot = '<<<'      # CDATA begins with
    cdata_eot = '>>>'      # CDATA ends with
    EOL = '\n'             # End Of Line character
    tokens = []            # buffer for InputTokens
    curr_token = None      # current InputToken
    line_nbr = 1           # current line number
    comment_flag = 0       # in-comment state flag
    string_flag = ''       # in-string state flag
    cdata_flag = 0         # in-CDATA state flag
    # Read the source file...
    src = textfile(srcfile)
    src.load()
    # Parse the source file...
    for ix,ch in enumerate(src):
        #
        #
        # ... the DL parser guts ...
        #
        #
        # Otherwise, start a new token...
        curr_token = InputToken(ch, ix, line_nbr)
    # Return list of InputTokens...
    return tokens
#
Some constants and variables are defined; we loop over the characters in the source file and then return the list of tokens. Note that the last line of the loop, if we make it there, uses the current character to start a new token. As you’ll see, we only make it to the end of the loop if the character isn’t handled in any other way.
(The textfile class is my little set of file objects. It can easily be replaced with the standard Python open function.)
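For instance, a minimal stand-in using open might look like this (load_source is my own name, not part of the post's code):

def load_source (srcfile):
    '''Read the whole file as one string of characters.'''
    with open(srcfile, 'r', encoding='utf-8') as f:
        return f.read()

# Inside dl_parser, replace the two textfile lines with:
#     src = load_source(srcfile)
# The 'for ix,ch in enumerate(src)' loop then works unchanged, since
# enumerate() over a string yields (index, character) pairs.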
§
The guts of the parser, the body of the for loop, divide into three sections. The first one includes the line counter and string handlers. (Note that these three sections have two levels of indent I’m not showing here. They’re part of that for loop.)
# If we see an end of line char, bump the line number...
if ch == EOL:
    line_nbr += 1
# Special handling if in STRING mode...
if string_flag:
    curr_token.add(ch)
    if curr_token[-1] == string_flag:
        string_flag = ''
        tokens.append(curr_token)
        curr_token = None
        Log.trace('string-exit: %s' % ch)
    continue
# Special handling if in CDATA mode...
if cdata_flag:
    curr_token.add(ch)
    if curr_token().endswith(cdata_eot):
        cdata_flag = 0
        tokens.append(curr_token)
        curr_token = None
    continue
#
The string handlers will make more sense once we see the last section. The middle section handles (and removes) comments. (A comment begins with two hyphen characters and continues to the end of line.)
# If we're in comment mode,...
if comment_flag == 2:
    # If we've found the end of line, exit comment mode...
    if ch == EOL:
        comment_flag = 0
    # Otherwise stay in comment mode...
    continue
# If we've seen one comment char,...
if comment_flag == 1:
    # Another one puts us in comment mode...
    if ch == comment_char:
        comment_flag = 2
        continue
    # Otherwise reset the flag and emit the buffered char...
    comment_flag = 0
    if not curr_token:
        # Start a new token if necessary...
        curr_token = InputToken('', ix, line_nbr)
    curr_token.add(comment_char)
    curr_token.add(ch)
    continue
# If we're not in comment mode at all,...
if comment_flag == 0:
    # Seeing a comment char puts us in comment mode...
    if ch == comment_char:
        # Set the flag and next char...
        comment_flag = 1
        continue
#
The last section does some serious character handling. It recognizes whitespace (which will end any token we’re building), strings, and other special characters.
# Whitespace ends any token we're building...
if ch.isspace():
    # If we're building a token, add it to the list...
    if curr_token:
        tokens.append(curr_token)
        curr_token = None
    # Next char...
    continue
# Special handling for special END char...
if ch == terminal_char:
    # If we're building a token, add it to the list...
    if curr_token:
        tokens.append(curr_token)
        curr_token = None
    # Create special END token...
    tok = InputToken(terminal_char, ix, line_nbr)
    tokens.append(tok)
    continue
# If we're building a token...
if curr_token:
    # Add character to token...
    curr_token.add(ch)
    # Special handling for start of strings...
    if curr_token[0] in ['"', "'"]:
        string_flag = curr_token[0]
        continue
    # Special handling for start of CDATA...
    if curr_token().startswith(cdata_sot):
        cdata_flag = 1
        continue
    # Special handling for attribute names...
    if ch == attribute_char:
        tokens.append(curr_token)
        curr_token = None
        continue
    # In any event, next char...
    continue
#
Now the first section should make more sense. In the two string-handling states we consume characters immediately until we exit the string.
You’ll note I make heavy use of the continue statement. There are purists who dislike any kind of “jump” in the code. A more common dislike along these lines is for multiple return statements in a function; the purist approach says there can be only one. I am not a purist, and I think multiple returns are more readable. Likewise the continue and break statements. Love them.
§
Now that we have a nice list of InputTokens, we can pass it to a recursive descent parser. As noted, we only need three functions, one for each BNF production: {definition}, {attribute}, and {content}.
The first, dl_definition, parses a {definition}. It’s also the first one called, because a {definition} is always the root object of the DL tree structure.
def dl_definition (toks):
    # Tag, Name...
    tok1 = toks.pop(0)
    tok2 = toks.pop(0)
    define_obj = dlDefinition(tok1(), tok2())
    # Content...
    dl_content(toks, define_obj)
    # END keyword...
    tok3 = toks.pop(0)
    if tok3() == ';':
        Log.trace('definition: %s' % define_obj)
        return define_obj
    if tok3().upper() != 'END':
        raise SyntaxError(E_NO_END_KEY % (tok3(), tok3.lnbr))
    # End Tag...
    tok4 = toks.pop(0)
    if tok4().upper() != tok1().upper():
        raise SyntaxError(E_NO_END_TAG % (tok4(), tok4.lnbr))
    # Return the definition object...
    Log.trace('definition: %s' % define_obj)
    return define_obj
#
All three functions take a list of tokens. That list is consumed as the parsing progresses. Note the call to dl_content; it’s where the recursion begins.
Speaking of which, here’s that function:
def dl_content (toks, define_obj):
    # Peek at the first token...
    while 0 < len(toks):
        tok = toks[0]
        # If it's the end of a def, there's no more content...
        if tok().upper() in ['END', ';']:
            Log.trace('content: END')
            return
        # Handle Attributes...
        if tok().endswith(':'):
            dl_attribute(toks, define_obj)
            continue
        # It's a nested definition...
        nested_def = dl_definition(toks)
        define_obj.children.append(nested_def)
#
Which just leaves dl_attribute:
def dl_attribute (toks, define_obj):
    # Name, Value...
    tok1 = toks.pop(0)
    tok2 = toks.pop(0)
    # Add property...
    define_obj.props[tok1()] = tok2()
#
And that’s all there is to it. Note that the dl_content and dl_attribute functions also take a define_obj argument.
The output is the root dlDefinition object, and it contains the nested content of more dlDefinition objects. Each object holds its attributes and children. The class is defined like so:
class dlDefinition (object):
    '''DL Definition class.'''
    def __init__ (self, tag, name):
        self.tag = tag
        self.name = name
        self.props = {}
        self.children = []
    def __str__ (self):
        return f'{self.tag} {self.name}'
#
That’s all there is to the code. Using it is as simple as:
datafile = path.join(BasePath, FileName)
tokens = dl_parser(datafile)
define_obj = dl_definition(tokens)
#
The define_obj is the root {definition} object. QED!
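To see the whole tree, a minimal sketch of a walker might look like this (dump_tree is my own hypothetical helper, not part of the post’s code):

def dump_tree (d, depth=0):
    '''Recursively print a dlDefinition and its contents.'''
    pad = '    ' * depth
    print(pad + str(d))                  # uses dlDefinition.__str__
    for key, val in d.props.items():     # attributes first...
        print('%s    %s %s' % (pad, key, val))
    for child in d.children:             # ...then nested definitions
        dump_tree(child, depth + 1)

dump_tree(define_obj)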
§ §
Any questions?
To take this further I’ll need to define DDL precisely enough for the next stage, converting the object tree to a data definition that can be applied to some input stream of bytes.
I would like to see this mostly idle dream finally completed, if for no other reason than that it’s been in the back of my mind a very long time. I get distracted easily these days, and just as quickly bored, so we’ll see how that goes.
Ø
Happiness is when you’re working on your parser and all the errors that pop up turn out to be syntax problems in the DL file, not a bug in the parser. The parser is working well enough to catch the syntax errors in the test files. Sweet!!
BTW, here are the definitions for the error messages:
E_UNEXPECTED = 'Unexpected Token: %s [line %d]'
E_NO_END_KEY = 'Expected END keyword; got: %s [line %d]'
E_NO_END_TAG = 'Expected closing tag; got: %s [line %d]'
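As an illustration of how those read when raised (the tag name and line number here are made up):

# Illustration only -- a mismatched closing tag at line 14 would report:
print(E_NO_END_TAG % ('module', 14))
# Expected closing tag; got: module [line 14]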
Looking at the parser output for the sample DL file in the post made me realize time strings need to be quoted, because otherwise the colon makes them look like an attribute. I’ve been thinking datetime strings should be a datatype anyway, and there was some language I recall that treated #string# as a datetime string, recognizing the pound sign as a datetime string quote.
Currently all values, whether numeric, string, or date, are just stored as strings, but one of the next steps is converting numeric and datetime strings to their appropriate datatypes. True/False should likewise be treated as a boolean value.
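A rough sketch of that conversion step might look like the function below; the name coerce_value and the exact rules are my assumptions, not the post’s implementation:

from datetime import datetime

def coerce_value (s):
    '''Convert a raw DL value string to a more specific datatype.'''
    if s.lower() in ('true', 'false'):
        return s.lower() == 'true'
    try:
        return int(s)
    except ValueError:
        pass
    try:
        return float(s)
    except ValueError:
        pass
    try:
        return datetime.strptime(s, '%Y/%m/%d')   # e.g. 2021/11/11
    except ValueError:
        pass
    return s   # anything else stays a string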
And I still need to sit down and define a DDL. (I thought I had once, but I can’t find those notes.)