|
By John Merrells
Every couple of years or so I bump onto an engineering problem
most suitably solved by a simple lexical analyzer and parser.
Typical examples being a configuration file processor, or a URL
processor. Each time, I dust off my old lex and yacc manuals and try
to remember how a DFSA differs from an NDFSA and how a shift-reduce
error differs from a reduce-reduce error. Invariably I give up and
revert to the tried and true method of just hacking up some code
myself. Surely there must be a better way?
There is, but first a brief introduction to some language
processing terms.
A lexical analyzer, also known as a lexer or scanner, simply
transforms an input stream of characters into an output stream of
tokens. For example, a lexer might be programmed by a set of rules
to transform a stream such as (10+20) into the token stream LP N
PLUS N RP. Lexers specialize in grouping sequences of characters
into larger constructs named tokens.
A parser takes a stream of tokens from a lexer and groups them
into higher order constructions called productions. Following on
from our example above, a parser might be programmed by a set of
rules to recognize the token stream LP N PLUS N RP as the production
named expression.
ANTLR, or ANother Language Recognizer, is a combined lexer and
parser generator, with a number of features that make it much
simpler and nicer to use than the traditional combination of lex and
yacc, or the GNU alternative of flex and bison.
That’s enough terms for now, lets get down to some code. We’ll
work through the canonical calculator example. What’s the result
of the expression (1+1)? Well, first we’ll need a lexer.
// calc.g
options
{
language="Cpp";
}
class CalcLexer extends Lexer;
public
INTEGER: (DIGIT)+;
LPAREN: '(';
RPAREN: ')';
PLUS: '+';
protected
DIGIT: '0'..'9';
The definition code for the lexer is comprised of four sections;
options, class derivation, public tokens and protected tokens. The
options section contains parameters for the ANTLR tool. In this case
the ‘language’ statement selects the generation of C++ code.
ANTLR can also generate Java and Sather code, but for this article
we’ll use C++. The class statement declares that the subsequent
sections are defining a lexer, and allows the programmer to name
their scanner. The next two sections provide the token definitions.
The case of the token names is significant. The tool knows to treat
identifiers that begin with an uppercase letter as tokens. In our
example lexer, a DIGIT is defined to be any character in the range
‘0’ through to ‘9’, and an INTEGER is a sequence of DIGITS.
The definitions use the standard regular expression syntax. Here the
‘+’ in (DIGIT)+ means a sequence of one or more. Other important
regular expression operators to keep in mind are (x)? which means x
is optional, x | y which means x or y, and (x)* which means a
sequence of zero or many x’s.
We need a parser to consume the token stream and recognize
productions so that we can evaluate our expressions. The rules used
to describe the parser are called production rules and collectively
describe a grammar. The grammar for our calculator, in ANTLR
notation, looks like this.
class CalcParser extends Parser;
options{
buildAST=true;
}
expr: LPAREN INTEGER (PLUS INTEGER)* RPAREN;
These three sections of code can appear either before or after
the lexer definition. The first section again declares the start of
the rules that describe the parser. The options section can contain
many statements, but for our simple demonstration we need only to
turn on the automatic construction of an abstract syntax tree (AST).
The AST is a tree of nodes that is built to represent the
productions as they are recognized within the token stream. And the
third section provides a definition of the expr production
rule. This rule states that an expression begins with a left
parenthesis followed by an integer and then followed by zero or more
integer additions, followed by a closing right parenthesis.
The lexer and parser source code is generated by executing the
following command line. (You’ll have to install ANTLR, the Java
SDK, and have set up your class path appropriately.)
> java antlr.Tool calc.g
The files resulting from this execution will be the lexer code:
CalcLexer.cpp, CalcLexer.hpp, CalcLexerTokenTypes.hpp; and the
parser code: CalcParser.cpp, CalcParser.hpp,
CalcParserTokenTypes.hpp. One of the nice things about ANTLR, over
lex, flex, yacc, and bision, is that the code it generates is
implemented as nested case statements, rather than tables of
numbers. This makes the code much simpler to understand when
implementing a complex grammar or whilst debugging.
Now we need some main() code to drive our lexer and parser.
//main.cpp
#include <iostream>
#include "CalcLexer.hpp"
#include "CalcParser.hpp"
using namespace std;
using namespace antlr;
void main()
{
try {
CalcLexer lexer(cin);
CalcParser parser(lexer);
parser.expr();
RefAST t = parser.getAST();
cout << t->toStringList() <<
endl;
} catch(exception& e) {
cerr << "exception: " <<
e.what() << endl;
}
}
A stream of characters it taken from the standard input stream,
passed into the lexer, from which a token stream is passed into the
parser, from which productions are recognized, and an abstract
syntax tree generated.
Compile main.cpp and the tool generated code, and link with the
ANTLR C++ library. (You’ll need to compile this library yourself,
but instructions are provided.)
Here’s a sample session showing a sample input string, and the
resultant AST.
[c:\temp\test]test
(1+1)
( 1 + 1 )
[c:\temp\test]test
(1+2+3)
( 1 + 2 + 3 )
[c:\temp\test] test
(1*1)
exception: unexpected char: *
The next stage is clearly for us to actually calculate the
resultant value of the expression. A common approach is to place
actions with the production rules so that as the production is
recognized the result of the expression in evaluated. This solution
is cumbersome for large languages as the recognition and evaluation
code is intermingled. We’d prefer to separate these issues, just
as we separated the concerns of lexical analysis and parsing.
Typically, for a complex language, the actions that adorn the
grammar do no more than build the abstract syntax tree. But, ANTLR
will do this for us.
The default AST being built isn’t quite optimal. In the output
above the tree includes nodes that represent the parenthesis. These
are just there for syntactic sugar. Also, ,the root node in the tree
is also the left parenthesis. We’d much rather it was the plus, as
this is the operation we’re trying to perform. The expression
production should be changed thus:
expr: LPAREN! INTEGER (PLUS^ INTEGER)*
RPAREN!;
Note the addition of the ‘!’ and ‘^’ notations. The ‘!’
indicates that for AST generation the token should be ignored, and
the ‘^’ after the PLUS token signifies that the PLUS is to be
the root of the tree. Our interactive session now produces this:
[c:\temp\test]test
(1)
1
[c:\temp\test]test
(1+1)
( + 1 1 )
[c:\temp\test]test
(1+2+3)
( + ( + 1 2 ) 3 )
The AST is flattened for printing. Each pair of parenthesis marks
the children of the preceding node.
To actually calculate the result of the expression we must build
a walker to walk the AST tree implementing the evaluation actions.
Again, ANTLR will write this code for us. Here’s the specification
of our walker.
class CalcTreeWalker extends
TreeParser;
expr returns [float r]
{
float a,b;
r=0;
}
:#(PLUS a=expr b=expr){r=a+b;}
|i:INTEGER{r=atof(i->getText().c_str());}
;
This introduces quite a bit of new syntax, but the underlying
structure is still the same. The walker takes an abstract syntax
tree and recognizes productions within that tree. The rule above
describes a production called expr, which has a return value
of type float. The statements between braces, {…}, are regular C++
code, which I won’t drill into. The #(PLUS …) construction
describes an AST node for the walker to match with. A PLUS node is
expected to have an expression on the left and an expression on the
right. The value returned from the left and right hand sides is
assigned to the temporary variables a and b.
Our main code must be enhanced to set the walker off down the
AST.
CalcLexer lexer(cin);
CalcParser parser(lexer);
parser.expr();
RefAST t = parser.getAS);
CalcTreeWalker walker;
float r = walker.expr(t);
cout << t->toStringList() << " = "
<< r << endl;
And, our test session will now produce the answer to the (1+1)
question.
[c:\temp\test]test
(1)
1 = 1
[c:\temp\test]test
(1+1)
( + 1 1 ) = 2
[c:\temp\test]test
(1+2+3)
( + ( + 1 2 ) 3 ) = 6
This is just a taste of some of the powerful features provided by
ANTLR. A heartily recommended tool for both pleasure and profit.
John Merrells
merrells@acm.org
References:
[1] www.antlr.org - The home ANTLR.
By Reg. Charney
We are in a war and I have asked myself what I can do to help win
it. Like you, I have thought about giving blood, gave money, and
hung out the flag. But, also like many of you, I have felt it was
not enough. I want to be actively involved.
In a war, there are the troops in the front lines, but there are
equally important folks back home. They run the infrastructure that
empowers the troops. In this new war, we don’t need more bullets
or guns. We need to strengthen our infrastructures to sustain
ourselves through the coming difficult times. Our enemies are going
to take advantage of our openness, freedom of movement, free speech,
and our highly complex social structures. In fighting the
terrorists, we can not sacrifice those things that we hold most
dear. While things will need to change, prudence says that we need
to do this carefully and with considerable thought and consensus.
Doing any of the following will also buffer us and make us more
resilient.
You can make an active difference in this war by volunteering.
You can also join the National Guard, if you qualify. Here are some
other active steps that you and I can take. Find out about various
emergency management teams in your area, then join. See what FEMA is
doing (www.fema.gov). They have a large number of preparedness
courses you can take.
As a friend of mine said, “Hatred is not a primary emotion—no
one happily accepts death without a good reason.” We have a
problem that blinds us to what motivates the terrorists. It is my
contention that these people are so frustrated and angry about their
poverty, the injustices they feel inflicted on them and their
people, and their apparent helplessness that the only way they are
able to make a difference is to attack a perceived enemy. That this
will not solve their problems is not an issue for them. It gives a
focus to their lives.
By Allan Kelly
UML
Distilled, 2nd edition
by Martin Fowler & Scott Kendall, Addison Wesley, ISBN
0-201-65783-X
Remarkably, this update to Fowler’s popular UML book has
managed to keep it’s page count to under 200 pages! For me this
book says everything that I need to know about UML, OK, I’m not a
hard-core UML user but I need to know what it’s all about and I
need to know where to go I suddenly find I need to know more.
The chatty style makes for an easy read - itself a pleasant
change from so many methodology books! Nor does Fowler make the
mistake of so many methodologists in demanding things are just so.
He frequently tells you he does it this way but others do it that
way. He knows that methodologies are not prescriptions to be
followed exactly but techniques to be integrated into different
environments.
He also restricts his subject matter to UML, deliberately shying
away from it’s sibling “Rational Unified Process” (RUP), and
has actually reduced the non-UML content like refactoring –
although a sceptic may say this was to encourage you to buy his book
on refactoring!
Fowler himself is increasingly associated with the Extreme
Programming movement. This book doesn’t advocate XP as such but
does advocate many of the same techniques and as such proves that
UML and design has a place in XP. This will come as a surprise to
those who regard XP as a licence to hack.
If you are looking for an introduction to UML, or just want to
know what Use cases are all about this book is recommended.
Even if you don’t expect to touch UML in your daily programming
life, read this book to remind yourself of how many programmers
approach projects. You’ll find something here to refresh your view
on your and others approach to development.
By Ali Çehreli
H-1B holders don’t
need to leave.
I was an H-1B visa holder for six years before I received my
green card. It was very stressful waiting in that insecure position
H1-B put me in. Would I lose my job and have to leave the country?
How much time did I have to find another H-1B job?
Obviously no one would know the answer to the first question but
the answer to the second one was readily available as one of the
three I have heard so far: 10 days, 1 month, 2 months. Apparently
none of these were true. I invite you to read a related article at www.wired.com
to see how "the visa holder can stay in the country until the
date on the entry ticket in the passport." [1]
One reason why H-1B holders may choose to stay in the country for
some more time is that the high-tech job market is showing signs of
improvement. Not very significantly but 15% of the data items we
collect have actually increased in the past month.

The trends has not changed much. Windows 2000 and Linux remain to
be the most trendy among the platform jobs (No surprises in the
technology jobs either; ASIC technology is still the leader. It has
been the most trendy since March, and the only one with positive
trend since July. In addition to being more trendy, ASIC jobs are
also the highest posted (Figure 2).

Silicon Valley seems to be the right place to be. According to
Figure 3, being in the Silicon Valley--especially as a software
engineer--is better compared to the rest of the US.

References
[1] "New Life for Fired H-1B Workers" by Swaroopa
Iyengar on www.wired.com. You can
use "10-day deadline" (including the double quotes) as the
search text to find the article.
|