Tech Talk

ANTLR - ANother Language Recognizer

By John Merrells

Every couple of years or so I bump onto an engineering problem most suitably solved by a simple lexical analyzer and parser. Typical examples being a configuration file processor, or a URL processor. Each time, I dust off my old lex and yacc manuals and try to remember how a DFSA differs from an NDFSA and how a shift-reduce error differs from a reduce-reduce error. Invariably I give up and revert to the tried and true method of just hacking up some code myself. Surely there must be a better way?

There is, but first a brief introduction to some language processing terms.

A lexical analyzer, also known as a lexer or scanner, simply transforms an input stream of characters into an output stream of tokens. For example, a lexer might be programmed by a set of rules to transform a stream such as (10+20) into the token stream LP N PLUS N RP. Lexers specialize in grouping sequences of characters into larger constructs named tokens.

A parser takes a stream of tokens from a lexer and groups them into higher order constructions called productions. Following on from our example above, a parser might be programmed by a set of rules to recognize the token stream LP N PLUS N RP as the production named expression.

ANTLR, or ANother Language Recognizer, is a combined lexer and parser generator, with a number of features that make it much simpler and nicer to use than the traditional combination of lex and yacc, or the GNU alternative of flex and bison.

 

That’s enough terms for now, lets get down to some code. We’ll work through the canonical calculator example. What’s the result of the expression (1+1)? Well, first we’ll need a lexer.

// calc.g
options
{
  language="Cpp";
}
class CalcLexer extends Lexer;
public
  INTEGER: (DIGIT)+;
  LPAREN: '(';
  RPAREN: ')';
  PLUS: '+';
  protected
  DIGIT: '0'..'9';

The definition code for the lexer is comprised of four sections; options, class derivation, public tokens and protected tokens. The options section contains parameters for the ANTLR tool. In this case the ‘language’ statement selects the generation of C++ code. ANTLR can also generate Java and Sather code, but for this article we’ll use C++. The class statement declares that the subsequent sections are defining a lexer, and allows the programmer to name their scanner. The next two sections provide the token definitions. The case of the token names is significant. The tool knows to treat identifiers that begin with an uppercase letter as tokens. In our example lexer, a DIGIT is defined to be any character in the range ‘0’ through to ‘9’, and an INTEGER is a sequence of DIGITS. The definitions use the standard regular expression syntax. Here the ‘+’ in (DIGIT)+ means a sequence of one or more. Other important regular expression operators to keep in mind are (x)? which means x is optional, x | y which means x or y, and (x)* which means a sequence of zero or many x’s.

We need a parser to consume the token stream and recognize productions so that we can evaluate our expressions. The rules used to describe the parser are called production rules and collectively describe a grammar. The grammar for our calculator, in ANTLR notation, looks like this.

class CalcParser extends Parser;
options{
  buildAST=true;
}
expr: LPAREN INTEGER (PLUS INTEGER)* RPAREN;

These three sections of code can appear either before or after the lexer definition. The first section again declares the start of the rules that describe the parser. The options section can contain many statements, but for our simple demonstration we need only to turn on the automatic construction of an abstract syntax tree (AST). The AST is a tree of nodes that is built to represent the productions as they are recognized within the token stream. And the third section provides a definition of the expr production rule. This rule states that an expression begins with a left parenthesis followed by an integer and then followed by zero or more integer additions, followed by a closing right parenthesis.

The lexer and parser source code is generated by executing the following command line. (You’ll have to install ANTLR, the Java SDK, and have set up your class path appropriately.)

> java antlr.Tool calc.g

The files resulting from this execution will be the lexer code: CalcLexer.cpp, CalcLexer.hpp, CalcLexerTokenTypes.hpp; and the parser code: CalcParser.cpp, CalcParser.hpp, CalcParserTokenTypes.hpp. One of the nice things about ANTLR, over lex, flex, yacc, and bision, is that the code it generates is implemented as nested case statements, rather than tables of numbers. This makes the code much simpler to understand when implementing a complex grammar or whilst debugging.

Now we need some main() code to drive our lexer and parser.

//main.cpp
#include <iostream>
#include "CalcLexer.hpp"
#include "CalcParser.hpp"
using namespace std;
using namespace antlr;
void main()
{
  try {
    CalcLexer lexer(cin);
    CalcParser parser(lexer);
    parser.expr();
    RefAST t = parser.getAST();
    cout << t->toStringList() << endl;
  } catch(exception& e) {
    cerr << "exception: " << e.what() << endl;
  }
}

A stream of characters it taken from the standard input stream, passed into the lexer, from which a token stream is passed into the parser, from which productions are recognized, and an abstract syntax tree generated.

Compile main.cpp and the tool generated code, and link with the ANTLR C++ library. (You’ll need to compile this library yourself, but instructions are provided.)

Here’s a sample session showing a sample input string, and the resultant AST.

[c:\temp\test]test
(1+1)
( 1 + 1 )
[c:\temp\test]test
(1+2+3)
( 1 + 2 + 3 )
[c:\temp\test] test
(1*1)

exception: unexpected char: *

The next stage is clearly for us to actually calculate the resultant value of the expression. A common approach is to place actions with the production rules so that as the production is recognized the result of the expression in evaluated. This solution is cumbersome for large languages as the recognition and evaluation code is intermingled. We’d prefer to separate these issues, just as we separated the concerns of lexical analysis and parsing. Typically, for a complex language, the actions that adorn the grammar do no more than build the abstract syntax tree. But, ANTLR will do this for us.

The default AST being built isn’t quite optimal. In the output above the tree includes nodes that represent the parenthesis. These are just there for syntactic sugar. Also, ,the root node in the tree is also the left parenthesis. We’d much rather it was the plus, as this is the operation we’re trying to perform. The expression production should be changed thus:

expr: LPAREN! INTEGER (PLUS^ INTEGER)* RPAREN!;

Note the addition of the ‘!’ and ‘^’ notations. The ‘!’ indicates that for AST generation the token should be ignored, and the ‘^’ after the PLUS token signifies that the PLUS is to be the root of the tree. Our interactive session now produces this:

[c:\temp\test]test
(1)
1
[c:\temp\test]test
(1+1)
( + 1 1 )
[c:\temp\test]test
(1+2+3)
( + ( + 1 2 ) 3 )

The AST is flattened for printing. Each pair of parenthesis marks the children of the preceding node.

To actually calculate the result of the expression we must build a walker to walk the AST tree implementing the evaluation actions. Again, ANTLR will write this code for us. Here’s the specification of our walker.

class CalcTreeWalker extends TreeParser;
expr returns [float r]
{
  float a,b;
  r=0;
}
:#(PLUS a=expr b=expr){r=a+b;}
|i:INTEGER{r=atof(i->getText().c_str());}
;

This introduces quite a bit of new syntax, but the underlying structure is still the same. The walker takes an abstract syntax tree and recognizes productions within that tree. The rule above describes a production called expr, which has a return value of type float. The statements between braces, {…}, are regular C++ code, which I won’t drill into. The #(PLUS …) construction describes an AST node for the walker to match with. A PLUS node is expected to have an expression on the left and an expression on the right. The value returned from the left and right hand sides is assigned to the temporary variables a and b.

Our main code must be enhanced to set the walker off down the AST.

CalcLexer lexer(cin);
CalcParser parser(lexer);
parser.expr();
RefAST t = parser.getAS);
CalcTreeWalker walker;
float r = walker.expr(t);
cout << t->toStringList() << " = " << r << endl;

And, our test session will now produce the answer to the (1+1) question.

[c:\temp\test]test
(1)
1 = 1
[c:\temp\test]test
(1+1)
( + 1 1 ) = 2
[c:\temp\test]test
(1+2+3)
( + ( + 1 2 ) 3 ) = 6

This is just a taste of some of the powerful features provided by ANTLR. A heartily recommended tool for both pleasure and profit.

John Merrells
merrells@acm.org

References:

[1] www.antlr.org - The home ANTLR.

Editorial

By Reg. Charney

What can I do?

We are in a war and I have asked myself what I can do to help win it. Like you, I have thought about giving blood, gave money, and hung out the flag. But, also like many of you, I have felt it was not enough. I want to be actively involved.

In a war, there are the troops in the front lines, but there are equally important folks back home. They run the infrastructure that empowers the troops. In this new war, we don’t need more bullets or guns. We need to strengthen our infrastructures to sustain ourselves through the coming difficult times. Our enemies are going to take advantage of our openness, freedom of movement, free speech, and our highly complex social structures. In fighting the terrorists, we can not sacrifice those things that we hold most dear. While things will need to change, prudence says that we need to do this carefully and with considerable thought and consensus. Doing any of the following will also buffer us and make us more resilient.

You can make an active difference in this war by volunteering. You can also join the National Guard, if you qualify. Here are some other active steps that you and I can take. Find out about various emergency management teams in your area, then join. See what FEMA is doing (www.fema.gov). They have a large number of preparedness courses you can take.

Is it just hatred?

As a friend of mine said, “Hatred is not a primary emotion—no one happily accepts death without a good reason.” We have a problem that blinds us to what motivates the terrorists. It is my contention that these people are so frustrated and angry about their poverty, the injustices they feel inflicted on them and their people, and their apparent helplessness that the only way they are able to make a difference is to attack a perceived enemy. That this will not solve their problems is not an issue for them. It gives a focus to their lives.

Book Review

By Allan Kelly

UML Distilled, 2nd edition

by Martin Fowler & Scott Kendall, Addison Wesley, ISBN 0-201-65783-X

Remarkably, this update to Fowler’s popular UML book has managed to keep it’s page count to under 200 pages! For me this book says everything that I need to know about UML, OK, I’m not a hard-core UML user but I need to know what it’s all about and I need to know where to go I suddenly find I need to know more.

The chatty style makes for an easy read - itself a pleasant change from so many methodology books! Nor does Fowler make the mistake of so many methodologists in demanding things are just so. He frequently tells you he does it this way but others do it that way. He knows that methodologies are not prescriptions to be followed exactly but techniques to be integrated into different environments.

He also restricts his subject matter to UML, deliberately shying away from it’s sibling “Rational Unified Process” (RUP), and has actually reduced the non-UML content like refactoring – although a sceptic may say this was to encourage you to buy his book on refactoring!

Fowler himself is increasingly associated with the Extreme Programming movement. This book doesn’t advocate XP as such but does advocate many of the same techniques and as such proves that UML and design has a place in XP. This will come as a surprise to those who regard XP as a licence to hack.

If you are looking for an introduction to UML, or just want to know what Use cases are all about this book is recommended. Even if you don’t expect to touch UML in your daily programming life, read this book to remind yourself of how many programmers approach projects. You’ll find something here to refresh your view on your and others approach to development.

Trends

By Ali Çehreli

H-1B holders don’t need to leave.

I was an H-1B visa holder for six years before I received my green card. It was very stressful waiting in that insecure position H1-B put me in. Would I lose my job and have to leave the country? How much time did I have to find another H-1B job?

Obviously no one would know the answer to the first question but the answer to the second one was readily available as one of the three I have heard so far: 10 days, 1 month, 2 months. Apparently none of these were true. I invite you to read a related article at www.wired.com to see how "the visa holder can stay in the country until the date on the entry ticket in the passport." [1]

One reason why H-1B holders may choose to stay in the country for some more time is that the high-tech job market is showing signs of improvement. Not very significantly but 15% of the data items we collect have actually increased in the past month.

The trends has not changed much. Windows 2000 and Linux remain to be the most trendy among the platform jobs (No surprises in the technology jobs either; ASIC technology is still the leader. It has been the most trendy since March, and the only one with positive trend since July. In addition to being more trendy, ASIC jobs are also the highest posted (Figure 2).

Silicon Valley seems to be the right place to be. According to Figure 3, being in the Silicon Valley--especially as a software engineer--is better compared to the rest of the US.

References

[1] "New Life for Fired H-1B Workers" by Swaroopa Iyengar on www.wired.com. You can use "10-day deadline" (including the double quotes) as the search text to find the article.