-
Notifications
You must be signed in to change notification settings - Fork 36
FeatureModularGrammar
The grammar file format should allow referencing other grammar files.
Many artificial languages (or formats) are composed of several distinct formats. The grammar file format itself is a case in point, since it embeds support for regular expressions. Currently these types of grammars are poorly supported by Grammatica, usually forcing the sub-grammar to be matched by a single regular expression token.
In other cases it might be a matter of practicallity to split out self-containing grammar parts into separate grammar files (and thereby analyzers) to reduce complexity. Also, modular grammars would allow further object-oriented improvements to the generated parsers. Implementing modular grammars and tokenizers would also drastically, if not completely, remove the need for tokenizer modes.
Here is an example of how a modular grammar could be used in grammar.grammar
:
...
REGEXP = <<\<\<([^\\\>]|(\\.)|(\>[^\>]))*\>\>>>
...
TokenValue = QUOTED_STRING
| REGEXP ;
...
If the grammar file format allowed modular grammars, this might be expressed like this instead:
...
TokenValue = QUOTED_STRING
| TokenValueRegExp ;
TokenValueRegExp = "<<" %grammar regexp% ">>" ;
...
Here is a list of issues to consider when implementing modular grammars:
-
Input Stream Handling - Thanks to the new
ReaderBuffer
class (in Grammatica 1.5) it would be easy to transfer the current input stream state and position across tokenizers, without data copying or other issues. -
EOF Handling - Currently a
Tokenizer
only returns the EOF token (i.e.null
) if the input stream is actually exhausted. Otherwise it will throw aParseException
indicating unexpected input. It seems that the parser should pass a flag to the tokenizer when embedded and the next token could potentially be an EOF. - Token Conflicts - If the embedded grammar would allow the "end symbol" from the parent grammar as a valid token, this would result in a conflict between which grammar to use for the next token. Ideally, the embedded grammar shouldn't have to be changed when embedded, which indicates that the parent grammar should have precedence. This issue is similar to the EOF handling issue and might be resolved by providing a reference to a method returning a boolean if the current buffer position might indicate a valid EOF.
-
Parser Look-Ahead - Currently, the
Parser
class provides built-in support for token look-ahead, using a internal list of the upcoming tokens. Unfortunately, this means that the tokens in question have already been read from the tokenizer and the buffer position has been updated. So, in order to support modular grammars, it seems that a look-ahead of zero (0) tokens would be required when switching to an embedded grammar.
There are also a number of options available for how this is supposed to work. The simplest option (examplified above), would be to simply include the whole grammar in the parent grammar. A more advanced proposal would be to allow specifying the start production in the embedded grammar, optionally reducing the embedded grammar to only the tokens and productions needed.
It has been suggested that Grammatica should support including grammar files, allowing large grammars to be split across several files. Like this, for example:
%import myname regexp.grammar%
...
TokenValueRegExp = "<<" myname.RegExp ">>" ;
This would potentially allow a similar benefits in reducing the grammar file complexities, but it wouldn't at all adress the issues with complexity in the Analyzer
classes. This latter complexity is probably the more important to reduce, since the analyzer source code files are already much larger than the source grammar files. With the addition of actual analyzer source code, they quite frequently become very large an hard to maintain. Also, changes are probably more frequent in the analyzers than in the actual grammars.
For these reasons, it is suggested that Grammatica does not implement a grammar inclusion mechanism, instead favoring a solution with fully modular grammars (separating the analyzers).