Tech Talk

An Easy Lesson in Lex

By Jesus Monroy, Jr. (jessemonroy@email.com )

What is Lex?

Following up a bit from the previous article. We mentioned that regular expressions are used in a variety of things. However, much more interesting is that many tools, like perl and the C compiler, use Lex. They use it to break up the code and transform it into the binary executables. Lex, short for lexical analyzer, forms the center of many tools. In addition, Lex is an excellent tool for creating configuration files. That is, if you’ve decide that your own project has too many options for the command line, or demands so much information it’s best stored in a configuration file, then look at Lex.

So what is Lex? Rather than come up with my own description, we will start with a slightly modified version of the man page for flex(1).

Lex is a tool for generating scanners: programs which recognized lexical patterns in text.

In other words, Lex is a program for creating programs or routines that scan text. It is a specialize engine for automatic code creation. The programs (or routines) which use this tool are called scanners(2). How does it do this? Let us continue:

Lex reads the given input files(3) for a description of the scanner to generate. The description is in the form of pairs. The pairs used to generate the scanner are: regular expressions and C code(4). Lex, when run with this description, generates a C source file. This C file is compiled and linked to produce an executable.

When the executable is run, it analyzes its input for occurrences of the regular expressions. Whenever it finds one, it executes the corresponding C code.

The Lex Description File

From this point on, if you are not familar with regular expressions, you'll have trouble. If your a regular ACCent reader, you can read May’s article on “The Basics of RE.”

The lex description file has three (3) sections. From the classical descriptions it looks like this:

...definitions section...

%%

...rules section...

%%

...user subroutines...

The above illustration is a representation of what the file looks like. The only things actually used in the description file are the double percent symbols (%%). The double percent symbols are delimiters, or separators, for each of the sections.

For the purpose of this article, we will skip most of the definitions section with the exception of C code inclusion. We will instead concentrate on the rules section. The user section is where we do most of the work -- based on the what we define in the rules section.

C Code Inclusion

Before we start on the most important part (the rules section), let use discuss C code inclusion. The first two (2) sections (definitions and rules) can have C code included.

To include the code it must be encapsulated with a pair of unique symbols. They are %{ and %}. The %{ must be at the start of the code. The code must end with %}. Those symbols MUST be the first characters on the line. There can be no spaces, no tabs or any other characters before these symbols. Between these two symbols any C code can be included.

Typically, #include files and global variables are put in the definitions section. C code can also be included in the rules section, however its exact placement is said to vary from implementation to implementation.

In addition to the %{ and %} symbols in the Rules Section, any line that starts with a space or tab is include in the Lex generated C source file. We'll see more of that in the next section.

The Rules Section

Let's start with an example of how a definition is layed out.

hello { hello_cnt++; }

As we can see, there appear to be two (2) columns in this example. In the left hand column we have the regular expression. And in the right hand column has what looks like C code in the traditional curly braces.

The way it works is – in the right hand column we have regular expressions. In the above example, we use the word ‘hello’. However, we could have just as easily used one of many regular expressions that we might use in perl, grep or awk. Between the regular expression (in the first column) and the stuff in the curly braces (in the second column) we have spaces and/or tabs. These spaces and/or tabs work as delimiters (or seperators).

In the second (2nd) column, we have C code in between curly braces. The code is what is run immediately after the regular expression is matched.

A Working Example

For the purpose of this article I have not included the entire program or lex description file. That associated files can be downloaded from the ACCU++ website.

Here is a snippet of the Lex description file, after removing redundant parts.

M_B40x25

{ good_words++; return M_B40x25; }

M_VESA_FULL_1280

{ good_words++; return M_VESA_FULL_1280;}

[ \t]+

{ /* throw away whites spaces */ }

[a-z0-9].*

bad_words++;

\n

new_lines++;

<<EOF>>

{ eof = 1; return 0; }

Table 1: Simple configuration file example

First let me appoligize for this quantum leap. However, if you'll follow along a bit, we'll see it's not too difficult.

Again we appear to have two (2) columns. In the left hand column are mostly regular expressions. In the right hand colum is C code. The area in between is spaces or tabs.

For the most part the regular expressions look like definitions from a header file; in fact they are. In the lower part of our listing we see some familiar grouping sets for RE. One of the groupings binds to spaces or tabs; at least one. The other grouping binds to lower letters of the alphabet or numbers; at least one (note the period). The last real RE is a ‘new line’ character (\n). The last line is not an RE. It is a special token (or symbol) used by Lex to tell us when it has really reach the end of the file.

On the right hand column we some variables (mostly counting) and return statements.

The variable good_words counts the numbers of words we have identified. The variable bad_words counts the numbers of words we don't recognize. The variable new_lines counts the total number of lines we have scanned.

As for the return statements, that will become clear in just a second.

The Main loop

As you might have guess, the description section is one giant switch() statement. Our code is just code in each of the case statements. As such, the return we used in the right hand columns return to some outer loop. From the example code, here is a snippet.

while (! eof ) {
  vid.vi_mode = yylex();
  if (! eof) {
    retval = ioctl(console, CONS_MODEINFO, &vid);
    switch (retval) {
      case 0: /* This means okay! */
        VidInfoArray[ vid.vi_mode ].available = 1;
        VidInfoArray[ vid.vi_mode ].height = vid.vi_height ;
        VidInfoArray[ vid.vi_mode ].width = vid.vi_width ;
        VidInfoArray[ vid.vi_mode ].depth = vid.vi_depth ;
        break;
      default:
        /* okay whatever happened it’s an error and we ignore it. */
    }
  }
}

As you can see, we have a while() loop. This while() loop is our outer loop. The giant switch() statement, where our Lex C code resides, has been placed in a function called yylex(). In the outer loop, you'll see a call to yylex(), our black box(5) function.

It should now be somewhat clear that the return statements in our Lex description file returns some sort of numeric value.

That value, stored as an int in the struct vid, is used when we call ioctl(). ioctl(), in turn, is used to retrive information about the videos' color depth, pixel height and width.

Finishing up, as you'll note every successful call to ioctl() stored some information. That information, in turn produces configuration information for the /etc/XF86Config file. In short, this entire exercise is just one example on how to use a configuration file to produce a configuration file. This may seem a bit contorted, but given the complex nature of configuring an X11R6, you might find this a blessing.

That's it for now. If time permits, next month we'll see how to extend Lex with some things I’ve neglected to mention.

References

  1. 1. Flex is an Open Source Lex version.
  2. 2. Technically, they are called lexical scanners.
  3. 3. Or its standard input if no file names are given.
  4. 4. The technical name for this pair combination is rules.
  5. 5. The term refers to any object in which the inner workings are not known, but given a know input, the output is known (or can be relied on).

Editorial

By Reg. Charney

Controlled by Microsoft

I am very concerned about Microsoft’s upcoming Windows XP operating system’s invasion of our privacy and the user upset it will cause. The new operating system will be closely tied to the machine on which it is installed. To do that, Microsoft has taken upon themselves the right to investigate your machine. In registering your machine they will use this information to compute some form of key that you will need to install the software. However, I, for one, resent the fact that I need to tell the vendor how and where I am going to use the product I purchased—and for the legal types, please don’t give me any crap about license versus purchase—I know the legal difference, but I use the product the same way whether I purchase it or license it. I use it legally, but I do change my configuration on a regular basis and I often change disk drives and reconfigure partitions. Under all these conditions, I would need to phone/contact Microsoft to get an updated license. Oh, did I tell you? You are only allowed 10 changes to your configuration before everything closes down. I ran into this type of problem with Lotus and their 123 product. I had disk crashes and corrupted system diskettes. I ended up screaming on the phone at their poor tech support people and looking at all the alternative software I could find. I ended up using QuattroPro—a truly excellent package.

Fight Back

I have also made a commitment to myself and those I deal with: Even though I have a Universal Subscription to MSDN, I will not install or use any of the XP line of products, neither their O/S nor their Office products. Starting this weekend, I am going to convert my machines to using Linux and/or Open Source software like Sun’s StarOffice or ApplixWare or AbiWord.

Freedom, FREEdom, FREEDOM!!!

Trends

By Reg. Charney

Some Silver Clouds—Sort of

Job openings have continued to decline as you and almost every pundit knows. The job-related figures in the IT graphs to the left don’t do much to lighten the gloom. The total job openings nationwide and in Silicon Valley continue to decline. This is reflected in each of the categories that we track. In this issue of ACCent, we have shown the following categories: Language skills, Platform-based skills; and Technology-based skills.

January 2000 is taken as our base line. That is, whatever the number of job openings the category had in January 2000 is taken as 1.00. Since then, the total number of IT job openings has decreased about 63% nationwide and 78% in Silicon Valley. Software engineering job openings nationwide have decreased 66% and in Silicon Valley by 83%.

Job openings related to language skills have dropped precipitously. C/C++ and Java have dropped to 18% and 12% respectively of what they were in January 2000. HTML language skills are only 1% of their former demand. By platform, the slowdown shows up in decreases for all platform-related skill openings. Most platforms suffered a decline in demand of 85±5%. However, demand for Windows 2000 and Linux each decreased by less than 50% of what it was.

Along with the dot-com bust, the number of internet-related job openings have also decreased significantly. They are only 4% of what they were at the start of last year. In this technology category, only non-internet related job openings have remained steady at their January 2000 levels.