Tags
computer code, language design, little programming language, Python code, recursive descent parser
Last time I introduced a general Definition Language (DL) I created for defining structured information. The end goal was an extension of DL, called Data Definition Language (DDL), intended for defining memory and file formats. The idea was that tools examining such data could produce more knowledgeable output than a raw hex dump.
I mentioned that DL has been on my mind lately, and as it turns out I spent the day yesterday writing a DL parser in Python.
Here’s the BNF for DL; it’s very simple:
{definition} := {tag} {name} {content} "END" {tag}
             := {tag} {name} {content} ";"
{content}    := NULL
             := {attribute} {content}
             := {definition} {content}
{attribute}  := {tag} ":" {name}
{comment}    := "--" {any-text} EOL
A {definition} is a nested object (much like an XML element) that can contain {attribute}s and more {definition}s. Unlike XML, there is no text content; any such content must go in an {attribute}.
Here’s a simple DL file:
module "My DL Demo"
version: 1.0
author: "Wyrd Smythe"
section Header
datetime created date:2021/11/11 time:16:40:42 ;
datetime updated date:2021/11/11 time:19:58:11 ;
email: thewyrdsymthe@foomail.net
encoding: UTF-8
end section
section Body
access: public
section Signature
type: byte
count: 8
end section
section Parameters
-- more definitions
end section
section ControlData
-- more definitions
end section
end section
--
-- ....etc.....
--
end module
Note that DL is just a format. Concrete use requires defining the {definition} and {attribute} tags meaningfully.
Such a simple format seems like it would be easy to parse, and it is.
§ §
To begin, we’ll break the text into a list of tokens for a recursive descent parser. We’ll strip comments in the lexer, so the parser will only need to concern itself with three types of objects: a {definition}, an {attribute} and the content of a {definition}. In fact there are only two types; the third is just a list of some mix of the first two.
The first item is the class that defines an InputToken. The lexer creates and returns a list of these; the parser takes that list as its input.
class InputToken:
    '''DL File Input Token class.'''
    def __init__ (self, char, char_index, char_line):
        self.word = char
        self.cpos = char_index
        self.lnbr = char_line

    def add (self, char): self.word += char

    def __bool__ (self): return True
    def __call__ (self): return self.word
    def __len__ (self): return len(self.word)
    def __iter__ (self): return iter(self.word)
    def __getitem__ (self, ix): return self.word[ix]

    def __str__ (self):
        t = (self.lnbr, self.cpos, self.word)
        s = '[%4d,%6d] %s'
        return s % t
Very simple object: word is one or more characters; cpos is the character position in the file; lnbr is its line number. The add method appends characters to the word variable.
Of course, we always define __str__ so that we can pretty print an instance.
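For example, building up a token a character at a time and printing it looks like this (a quick sketch using only the methods above):

tok = InputToken('m', 0, 1)       # first character, char position 0, line 1
tok.add('o')                      # append more characters as they arrive
tok.add('d')
print(tok)                        # [   1,     0] mod
print(tok(), len(tok), tok[0])    # mod 3 m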
The lexer function, dl_parser, looks basically like this:
def dl_parser (srcfile):
    '''DL Parser. Reads "srcfile" and returns InputTokens.'''
    attribute_char = ':'    # attribute ends with
    terminal_char = ';'     # alternate END {tag} character
    comment_char = '-'      # comments begin with two
    cdata_sot = '<<<'       # CDATA begins with
    cdata_eot = '>>>'       # CDATA ends with
    EOL = '\n'              # End Of Line character

    tokens = []             # buffer for InputTokens
    curr_token = None       # current InputToken
    line_nbr = 1            # current line number
    comment_flag = 0        # in-comment state flag
    string_flag = ''        # in-string state flag
    cdata_flag = 0          # in-CDATA state flag

    # Read the source file...
    src = textfile(srcfile)
    src.load()

    # Parse the source file...
    for ix,ch in enumerate(src):
        #
        # ... the DL parser guts ...
        #
        # Otherwise, start a new token...
        curr_token = InputToken(ch, ix, line_nbr)

    # Return list of InputTokens...
    return tokens
The function defines some constants and variables, loops over the characters in the source file, and then returns the list of tokens. Note that the last line of the loop, if we make it there, uses the current character to start a new token. As you’ll see, we only make it to the end of the loop if the character isn’t handled in any other way.
(The textfile class is my little set of file objects. It can easily be replaced with the standard Python open function.)
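If you want to try the code without it, here’s a minimal stand-in based only on how dl_parser uses the object (my sketch; the real textfile class does more):

class textfile:
    '''Minimal stand-in: load a text file and iterate it by character.'''
    def __init__ (self, fname):
        self.fname = fname
        self.text = ''
    def load (self):
        with open(self.fname, 'r', encoding='utf-8') as f:
            self.text = f.read()
    def __iter__ (self):
        return iter(self.text)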
§
The guts of the parser, the body of the for loop, divides into three sections. The first one includes the line counter and string handlers. (Note that these three sections have two levels of indent I’m not showing here. They’re part of that for loop.)
# Count lines...
if ch == EOL:
    line_nbr += 1

# Special handling if in STRING mode...
if string_flag:
    curr_token.add(ch)
    if curr_token[-1] == string_flag:
        string_flag = ''
        tokens.append(curr_token)
        curr_token = None
        Log.trace('string-exit: %s' % ch)
    continue

# Special handling if in CDATA mode...
if cdata_flag:
    curr_token.add(ch)
    if curr_token().endswith(cdata_eot):
        cdata_flag = 0
        tokens.append(curr_token)
        curr_token = None
    continue
The string handlers will make more sense once we see the last section. The middle section handles (and removes) comments. (A comment begins with two hyphen characters and continues to the end of line.)
# If we're fully in comment mode...
if comment_flag == 2:
    # If we've found the end of line, exit comment mode...
    if ch == EOL:
        comment_flag = 0
    # Otherwise stay in comment mode...
    continue

# If we've seen one comment char,...
if comment_flag == 1:
    # Another one puts us in comment mode...
    if ch == comment_char:
        comment_flag = 2
        continue
    # Otherwise reset the flag and emit the buffered char...
    comment_flag = 0
    if not curr_token:
        # Start a new token if necessary...
        curr_token = InputToken('', ix, line_nbr)
    curr_token.add(comment_char)
    curr_token.add(ch)
    continue

# If we're not in comment mode at all,...
if comment_flag == 0:
    # Seeing a comment char puts us in comment mode...
    if ch == comment_char:
        # Set the flag and next char...
        comment_flag = 1
        continue
The last section does some serious character handling. It recognizes whitespace (which will end any token we’re building), strings, and other special characters.
# Whitespace ends the token we're building (if any)...
if ch.isspace():
    # If we're building a token, add it to the list...
    if curr_token:
        tokens.append(curr_token)
        curr_token = None
    # Next char...
    continue

# Special handling for special END char...
if ch == terminal_char:
    # If we're building a token, add it to the list...
    if curr_token:
        tokens.append(curr_token)
        curr_token = None
    # Create special END token...
    tok = InputToken(terminal_char, ix, line_nbr)
    tokens.append(tok)
    continue

# If we're building a token...
if curr_token:
    # Add character to token...
    curr_token.add(ch)
    # Special handling for start of strings...
    if curr_token[0] in ['"', "'"]:
        string_flag = curr_token[0]
        continue
    # Special handling for start of CDATA...
    if curr_token().startswith(cdata_sot):
        cdata_flag = 1
        continue
    # Special handling for attribute names...
    if ch == attribute_char:
        tokens.append(curr_token)
        curr_token = None
        continue
    # In any event, next char...
    continue
Now the first section should make more sense. In the two quoted-text states (string and CDATA) we consume characters unconditionally until we reach the closing delimiter.
You’ll note I make heavy use of the continue statement. There are purists who dislike any kind of “jump” in the code. A more common dislike along these lines is for multiple return statements in a function. The purist approach says there can be only one. I am not a purist, and I think multiple returns are more readable. Likewise the continue and break statements. Love them.
§
Now that we have a nice list of InputToken we can pass them to a recursive descent parser. As noted, we only need three functions, one for each BNF definition: {definition}, {attribute}, or {content}.
The first one, dl_definition, parses a {definition}, and it’s the first one called, because a {definition} is always the root object of the DL tree structure.
def dl_definition (toks):
    # Tag, Name...
    tok1 = toks.pop(0)
    tok2 = toks.pop(0)
    define_obj = dlDefinition(tok1(), tok2())

    # Content...
    dl_content(toks, define_obj)

    # END keyword...
    tok3 = toks.pop(0)
    if tok3() == ';':
        Log.trace('definition: %s' % define_obj)
        return define_obj
    if tok3().upper() != 'END':
        raise SyntaxError(E_NO_END_KEY % (tok3(), tok3.lnbr))

    # End Tag...
    tok4 = toks.pop(0)
    if tok4().upper() != tok1().upper():
        raise SyntaxError(E_NO_END_TAG % (tok4(), tok4.lnbr))

    # Return the definition object...
    Log.trace('definition: %s' % define_obj)
    return define_obj
All three functions take a list of tokens. That list is consumed as the parsing progresses. Note the call to dl_content; it’s where the recursion begins.
Speaking of which, here’s that function:
def dl_content (toks, define_obj):
    # Peek at the first token...
    while 0 < len(toks):
        tok = toks[0]
        # If it's the end of a def, there's no more content...
        if tok().upper() in ['END', ';']:
            Log.trace('content: END')
            return

        # Handle Attributes...
        if tok().endswith(':'):
            dl_attribute(toks, define_obj)
            continue

        # It's a nested definition...
        nested_def = dl_definition(toks)
        define_obj.children.append(nested_def)
Which just leaves dl_attribute:
def dl_attribute (toks, define_obj):
    # Name, Value...
    tok1 = toks.pop(0)
    tok2 = toks.pop(0)

    # Add property...
    define_obj.props[tok1()] = tok2()
009|
And that’s all there is to it. Note that the dl_content and dl_attribute functions also take a define_obj argument.
The output is the root dlDefinition object, which contains the nested content as more dlDefinition objects. Each object holds its attributes and children. The class is defined as:
class dlDefinition:
    '''DL Definition class.'''
    def __init__ (self, tag, name):
        self.tag = tag
        self.name = name
        self.props = {}
        self.children = []

    def __str__ (self):
        return f'{self.tag} {self.name}'
That’s all there is to the code. Using it is as simple as:
tokens = dl_parser(datafile)
define_obj = dl_definition(tokens)
The define_obj is the root {definition} object. QED!
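If you want to see what you got, a quick recursive dump of the tree does the trick. (This dl_dump helper is just an illustration for this post, not part of the parser.)

def dl_dump (define_obj, depth=0):
    '''Print a dlDefinition, its props, and its children, indented by depth.'''
    pad = '    ' * depth
    print('%s%s %s' % (pad, define_obj.tag, define_obj.name))
    for key, value in define_obj.props.items():
        print('%s    %s %s' % (pad, key, value))
    for child in define_obj.children:
        dl_dump(child, depth + 1)

dl_dump(define_obj)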
§ §
Any questions?
To take this further I’ll need to define DDL precisely enough for the next stage, converting the object tree to a data definition that can be applied to some input stream of bytes.
I would like to see this mostly idle dream finally completed, if for no other reason than that it’s been in the back of my mind a very long time. I get distracted easily these days, and just as quickly bored, so we’ll see how that goes.
Ø
Happiness is when you’re working on your parser and all the errors that pop up turn out to be syntax problems in the DL file, not a bug in the parser. The parser is working well enough to catch syntax errors in the test files. Sweet!!
BTW, here are the definitions for the error messages:
E_UNEXPECTED = 'Unexpected Token: %s [line %d]'
E_NO_END_KEY = 'Expected END keyword; got: %s [line %d]'
E_NO_END_TAG = 'Expected closing tag; got: %s [line %d]'
Parsing the sample DL file in the post made me realize time strings need to be quoted, because otherwise the colon makes them look like an attribute. I’ve been thinking datetime strings should be a datatype anyway; some language I recall treated #string# as a datetime string, recognizing the pound sign as a datetime string quote.
Currently all data values, whether numeric, string, or date, are just stored as strings, but one of the next steps is converting numeric and datetime strings to their appropriate datatypes. True/False should likewise be treated as boolean values.
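Something like this rough sketch would do for that conversion step (a guess at it, not code from the parser; the date format matches the sample file above):

from datetime import datetime

def convert_value (text):
    '''Try to interpret a stored string as a richer datatype.'''
    if text.upper() in ('TRUE', 'FALSE'):
        return text.upper() == 'TRUE'
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            pass
    try:
        return datetime.strptime(text, '%Y/%m/%d')
    except ValueError:
        return text    # no conversion applies; leave it a string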
And I still need to sit down and define a DDL. (I thought I had once, but I can’t find those notes.)