|
By Jesus Monroy, Jr.
(jessemonroy@email.com )
What is Lex?
Following up a bit from the previous article. We mentioned that
regular expressions are used in a variety of things. However, much
more interesting is that many tools, like perl and the C compiler,
use Lex. They use it to break up the code and transform it into the
binary executables. Lex, short for lexical analyzer, forms the
center of many tools. In addition, Lex is an excellent tool for
creating configuration files. That is, if you’ve decide that your
own project has too many options for the command line, or demands so
much information it’s best stored in a configuration file, then
look at Lex.
So what is Lex? Rather than come up with my own description, we
will start with a slightly modified version of the man page for
flex(1).
Lex is a tool for generating scanners: programs which recognized
lexical patterns in text.
In other words, Lex is a program for creating programs or
routines that scan text. It is a specialize engine for automatic
code creation. The programs (or routines) which use this tool are
called scanners(2). How does it do this? Let us continue:
Lex reads the given input files(3) for a description of the
scanner to generate. The description is in the form of pairs. The
pairs used to generate the scanner are: regular expressions and C
code(4). Lex, when run with this description, generates a C source
file. This C file is compiled and linked to produce an executable.
When the executable is run, it analyzes its input for occurrences
of the regular expressions. Whenever it finds one, it executes the
corresponding C code.
The Lex Description File
From this point on, if you are not familar with regular
expressions, you'll have trouble. If your a regular ACCent reader,
you can read May’s article on “The Basics of RE.”
The lex description file has three (3) sections. From the
classical descriptions it looks like this:
...definitions section...
%%
...rules section...
%%
...user subroutines...
The above illustration is a representation of what the file looks
like. The only things actually used in the description file are the
double percent symbols (%%). The
double percent symbols are delimiters, or separators, for each of
the sections.
For the purpose of this article, we will skip most of the
definitions section with the exception of C code inclusion. We will
instead concentrate on the rules section. The user section is where
we do most of the work -- based on the what we define in the rules
section.
C Code Inclusion
Before we start on the most important part (the rules section),
let use discuss C code inclusion. The first two (2) sections
(definitions and rules) can have C code included.
To include the code it must be encapsulated with a pair of unique
symbols. They are %{ and %}.
The %{ must be at the start of the
code. The code must end with %}.
Those symbols MUST be the first characters on the line. There can be
no spaces, no tabs or any other characters before these symbols.
Between these two symbols any C code can be included.
Typically, #include files and
global variables are put in the definitions section. C code can also
be included in the rules section, however its exact placement is
said to vary from implementation to implementation.
In addition to the %{ and %}
symbols in the Rules Section, any line that starts with a
space or tab is include in the Lex generated C source file. We'll
see more of that in the next section.
The Rules Section
Let's start with an example of how a definition is layed out.
hello { hello_cnt++; }
As we can see, there appear to be two (2) columns in this
example. In the left hand column we have the regular expression. And
in the right hand column has what looks like C code in the
traditional curly braces.
The way it works is – in the right hand column we have regular
expressions. In the above example, we use the word ‘hello’.
However, we could have just as easily used one of many regular
expressions that we might use in perl, grep or awk. Between the
regular expression (in the first column) and the stuff in the curly
braces (in the second column) we have spaces and/or tabs. These
spaces and/or tabs work as delimiters (or seperators).
In the second (2nd) column, we have C code in between curly
braces. The code is what is run immediately after the regular
expression is matched.
A Working Example
For the purpose of this article I have not included the entire
program or lex description file. That associated files can be
downloaded from the ACCU++ website.
Here is a snippet of the Lex description file, after removing
redundant parts.
|
M_B40x25 |
{ good_words++; return M_B40x25; } |
|
M_VESA_FULL_1280 |
{ good_words++; return M_VESA_FULL_1280;} |
|
[ \t]+ |
{ /* throw away whites spaces */ } |
|
[a-z0-9].* |
bad_words++; |
|
\n |
new_lines++; |
|
<<EOF>> |
{ eof = 1; return 0; } |
Table 1: Simple configuration
file example
First let me appoligize for this quantum leap. However, if you'll
follow along a bit, we'll see it's not too difficult.
Again we appear to have two (2) columns. In the left hand column
are mostly regular expressions. In the right hand colum is C code.
The area in between is spaces or tabs.
For the most part the regular expressions look like definitions
from a header file; in fact they are. In the lower part of our
listing we see some familiar grouping sets for RE. One of the
groupings binds to spaces or tabs; at least one. The other grouping
binds to lower letters of the alphabet or numbers; at least one
(note the period). The last real RE is a ‘new line’ character
(\n). The last line is not an RE. It is a special token (or symbol)
used by Lex to tell us when it has really reach the end of the file.
On the right hand column we some variables (mostly counting) and
return statements.
The variable good_words counts
the numbers of words we have identified. The variable bad_words
counts the numbers of words we don't recognize. The variable new_lines
counts the total number of lines we have scanned.
As for the return statements, that will become clear in just a
second.
The Main loop
As you might have guess, the description section is one giant
switch() statement. Our code is just code in each of the case
statements. As such, the return we used in the right hand columns
return to some outer loop. From the example code, here is a snippet.
while (!
eof ) {
vid.vi_mode = yylex();
if (! eof) {
retval = ioctl(console, CONS_MODEINFO,
&vid);
switch (retval) {
case 0: /* This means okay!
*/
VidInfoArray[
vid.vi_mode ].available = 1;
VidInfoArray[
vid.vi_mode ].height = vid.vi_height ;
VidInfoArray[
vid.vi_mode ].width = vid.vi_width ;
VidInfoArray[
vid.vi_mode ].depth = vid.vi_depth ;
break;
default:
/* okay whatever
happened it’s an error and we ignore it. */
}
}
} |
As you can see, we have a while() loop. This while() loop is our
outer loop. The giant switch() statement, where our Lex C code
resides, has been placed in a function called yylex(). In the outer
loop, you'll see a call to yylex(), our black box(5) function.
It should now be somewhat clear that the return statements in our
Lex description file returns some sort of numeric value.
That value, stored as an int in the struct vid, is used when we
call ioctl(). ioctl(), in turn, is used to retrive information about
the videos' color depth, pixel height and width.
Finishing up, as you'll note every successful call to ioctl()
stored some information. That information, in turn produces
configuration information for the /etc/XF86Config file. In short,
this entire exercise is just one example on how to use a
configuration file to produce a configuration file. This may seem a
bit contorted, but given the complex nature of configuring an X11R6,
you might find this a blessing.
That's it for now. If time permits, next month we'll see how to
extend Lex with some things I’ve neglected to mention.
References
- 1. Flex is an Open Source Lex version.
- 2. Technically, they are called lexical scanners.
- 3. Or its standard input if no file names are given.
- 4. The technical name for this pair combination is rules.
- 5. The term refers to any object in which the inner workings
are not known, but given a know input, the output is known (or
can be relied on).
By Reg. Charney
Controlled
by Microsoft
I am very concerned about Microsoft’s upcoming Windows XP
operating system’s invasion of our privacy and the user upset it
will cause. The new operating system will be closely tied to the
machine on which it is installed. To do that, Microsoft has taken
upon themselves the right to investigate your machine. In
registering your machine they will use this information to compute
some form of key that you will need to install the software.
However, I, for one, resent the fact that I need to tell the vendor
how and where I am going to use the product I purchased—and for
the legal types, please don’t give me any crap about license
versus purchase—I know the legal difference, but I use the product
the same way whether I purchase it or license it. I use it legally,
but I do change my configuration on a regular basis and I often
change disk drives and reconfigure partitions. Under all these
conditions, I would need to phone/contact Microsoft to get an
updated license. Oh, did I tell you? You are only allowed 10 changes
to your configuration before everything closes down. I ran into this
type of problem with Lotus and their 123 product. I had disk crashes
and corrupted system diskettes. I ended up screaming on the phone at
their poor tech support people and looking at all the alternative
software I could find. I ended up using QuattroPro—a truly
excellent package.
Fight Back
I have also made a commitment to myself and those I deal with:
Even though I have a Universal Subscription to MSDN, I will not
install or use any of the XP line of products, neither their O/S nor
their Office products. Starting this weekend, I am going to convert
my machines to using Linux and/or Open Source software like Sun’s
StarOffice or ApplixWare or AbiWord.
Freedom, FREEdom, FREEDOM!!!
By Reg. Charney
Some Silver
Clouds—Sort of
Job openings have continued to decline as you and almost every
pundit knows. The job-related figures in the IT graphs to the left
don’t do much to lighten the gloom. The total job openings
nationwide and in Silicon Valley continue to decline. This is
reflected in each of the categories that we track. In this issue of
ACCent, we have shown the following categories: Language skills,
Platform-based skills; and Technology-based skills.
January 2000 is taken as our base line. That is, whatever the
number of job openings the category had in January 2000 is taken as
1.00. Since then, the total number of IT job openings has decreased
about 63% nationwide and 78% in Silicon Valley. Software engineering
job openings nationwide have decreased 66% and in Silicon Valley by
83%.
Job openings related to language skills have dropped
precipitously. C/C++ and Java have dropped to 18% and 12%
respectively of what they were in January 2000. HTML language skills
are only 1% of their former demand. By platform, the slowdown shows
up in decreases for all platform-related skill openings. Most
platforms suffered a decline in demand of 85±5%. However, demand
for Windows 2000 and Linux each decreased by less than 50% of what
it was.
Along with the dot-com bust, the number of internet-related job
openings have also decreased significantly. They are only 4% of what
they were at the start of last year. In this technology category,
only non-internet related job openings have remained steady at their
January 2000 levels.



|