19 April 2017

Wrangling Metadata the Right Way

If you're new to Regular Expressions or Grammars (at least as they're used in Perl 6), then I'd suggest starting with the 1st part of a full tutorial on Perl 6 grammars. Now to the heart of the matter!

I'm rewriting the ANTLR -> Perl 6 grammar translator, and the code was simply too old to really be refactored - it predated the Great List Refactor, ask any old Perl 6 hand about that. This does require some familiarity with how grammars and Abstract Syntax Trees are related, so please bear with me.
Grammar rules (rules that tell Perl 6 how you want to match text) look something like:
rule parserRuleSpec
        {
        <ID>
        ':'
        <parserAltList>
        ';'
        }
Here an <ID> is simply the name of a parser rule, and <parserAltList> is its body, consisting of a series of "alternations". Like YACC, Marpa or ANTLR, you can tell Perl 6 to run a block of code when it successfully parses a <parserRuleSpec> in the target string. Let's assume that our target looks like:
integer : \d+ ;
Even if you don't know the exact details, it shouldn't be hard to guess that <ID> matches 'integer', and <parserAltList> matches '\d+'. If you don't ask Perl 6 to do anything to this, the 'integer' and '\d+' bits tend to sort of get lost in what can look like a confusing blizzard of hashes and arrays, it certainly was confusing to me. But we can help make sense of all of this. When this particular rule succeeds, we can stop right there and do something useful with that data. It'd be nice to get back a neat little hash, with some consistently-named keys, so we can do something like this:
method parserRuleSpec( $/ )              # rule parserRuleSpec
        {                                #         {
        make {                           #
             name =>                     #
                 $/<ID>.Str,             #         <ID>
                 # ':'                   #         ':' 
             body =>                     #
                 $/<parserAltList>.Str   #         <parserAltList>
                 # ';'                   #         ';'
             }                           #
        }                                #         }
[If this doesn't render neatly, my apologies, but I wanted to make it clear what I've done.] I've added a parserRuleSpec() method that will be run whenever the parser completely and correctly matches a parserRuleSpec rule. Whenever I want to inspect a parserRuleSpec rule, I don't have to dig around inside Match objects and look for strings, all I need to do is something like 'say $/<parserRuleSpec>.ast.gist' and now I'll get a nice simple representation of a parserRuleSpec as a hash with 'name' and 'body' keys:
say $/<parserRuleSpec>.ast.gist;

# {name => 'integer', body => '\d+'}
So our potentially-gnarly parserRuleSpec is now tamed, and resides in a nice neat data structure. And if you note, it's a pretty mechanical process. Copy the original rule, change 'rule' to 'method', wrap with 'make { }', and give names to the <ID> etc. tags. In fact, you could probably automate it, but that's another article for another time :) My point here is that you don't always want this kind of routinely-generated data structure.

If you have a rule like 'integer : \d+ ;', then it's not unreasonable to assume that other rules will follow, like 'float : \d+ "." \d+ ;'. Eventually, you'll want to collect these into some other data structure, maybe like this:
my %rules;
for $/<parserRules> -> $rule
        {
        %rules{ $rule.<name> } = $rule.<body>;
        }
say %rules.gist;

# {integer => '\d+', float => '\d+ "." \d+'}
Or you could even write it more compactly:
my %rules;
%rules{ $_.<name> } = $_.<body> for $/<parserRules>;
say %rules.gist;

# {integer => '\d+', float => '\d+ "." \d+'}
There are of course other ways to write this, but there's another way altogether to avoid the complications. When I started out on this current project (early last year, as it happens) I was of the opinion that the AST return blocks should contain all of the information they needed, so that I never had to "go back" for more data. During the rewrite I realized that's not necessarily the case, and I can make things more flexible in this case by simply not returning the name as part of the AST, and letting the outer layer do what it wants with the actual name. So, this code now looks like: (skipping the original rule declaration)
method parserRuleSpec( $/ )
        {
        make $/<parserAltList>.Str
        }

my %rules;
for $/<parserRules> -> $rule
        {
        %rules{ $rule<ID>>.Str } = $rule.ast;
        }
say %rules.gist;

# {integer => '\d+', float => '\d+ "." \d+'}

A bit of philosophy, if I may

There are two subtle points to be made about this. The first is that I'm separating the metadata (just the name, but ANTLR decorates rules with other things) from the meat of the rule, so that the user doesn't have to wade through as much detail initially. When the user wants to look at the rule, a simple 'say $rule.ast.gist' will give them an idea of what the rule contains, without a blizzard of metadata like actions and exceptions.

Second, and more subtle, as the author of the code I don't have to care about what's in a rule, I just have to grab the .ast and return it. Say, for instance, I got halfway through coding these rules up, and realized that alongside the 'content' I needed to pass an action parameter as well, so
my %rules;
for $/<parserRules> -> $rule
        {
        %rules{ $rule.<name> } = $rule.<body>;
        }
say %rules.gist;

# {integer => '\d+', float => '\d+ "." \d+'}
now had to have an 'action' hash key added:
my %rules;
for $/<parserRules> -> $rule
        {
        %rules{ $rule.<name> } = # If only this were $rule.ast...
            {
            body => $rule.<body>,
            action => rule.<action>
            };
        }
say %rules.gist;

# {integer => {body => '\d+', action => '$int-count++'}, float => {body => '\d+ "." \d+', action => '$float-count++'}}
Now, every time I want to add a new argument alongside 'body', I have to add it in at least two places - the original method that generated the .ast data structure, and the place in the driver code where I capture the data and store it in a higher-level .ast data structure. Actually I've forgotten about the test code (sigh, yes, again) so make that three places. And the documentation makes at least four. Those last two don't count as much because no matter what you do you've got to update test data, and if your test suite didn't catch the change, it really shouldn't be called a test suite unless you're doing some very specific black-box testing.


In conclusion

After this long writeup I'm going to go back to the ANTLR4 test bed with an eye towards eliminating this redundancy. The idea of capturing everything I needed in a grammar layer in a single pass appealed to me when I was first starting to write this, and I imagine the idea of "one grammar rule returns one AST hash" has some appeal. It's of course not as simple, especially since grammar rules can match optional things, match X or Y or Z but not all, and so on, but knowing that everything you need is in a single hash is handy when you're starting out, but I'm now seeing this as a counterexample to the KISS (Keep It Simple, you Silly thing) principle, and am going to check in the current ANTLR layer and refactor with this in mind before doing more work.
Gentle Reader, if you've gotten this far, thank you. I do read all the comments that I get, at least eventually. I'm also @DrForr on Twitter, 'Jeff Goff' on Facebook and LinkedIn. Thank you for your time, and I hope it was worth your while.

No comments:

Post a Comment