Tech Talk

The Basics of RE (Regular Expressions) — Part 1

By Jesus Monroy, Jr.

RE is a method familiar to the UNIX world for searching, parsing and manipulating text. Of benefit to non-UNIX programmers, subsets of RE are used for some generalized, but powerful applications, like SQL. On the UNIX platform, RE can be seen in applications like vi, sed, awk, grep, lex, yacc and perl.

While we can do the same thing with C or C++, RE provides a generic method to do the same job. As such, RE forms its own language, syntax and form of expression (no pun intended).

To assist us in understanding RE, I will use two articles, with this article having two sections. The first section will show how RE uses ‘plain text’ as a RE. The second section will show how RE uses certain ‘plain text’ characters as ‘symbols’ to create RE templates. These RE templates, in turn, are said to bind, or match, to the pattern(s) expressed by the template. In the second article, I will explain more complex RE, and some special cases of RE.

Plain Text Matching

This first part of RE is rather easy, but at times easy forget it. Plainly, a ‘plain text’ RE can match any string, or part of a string. Here is an example using a snippet of the /etc/passwd file (shown in Listing #1) as a reference to which we apply a grep expression like the following:

root:*:0:0:JR:/root:/bin/csh
toor:*:0:0:Superuser:/root:
daemon:*:1:1:Owner of many processes:/root:/sbin/nologin
operator:*:2:5:System &:/:/sbin/nologin
bin:*:3:7:Binaries Commands and Source,,,:/:/sbin/nologin
tty:*:4:65533:Tty Sandbox:/:/sbin/nologin

Listing #1: Sample /etc/passwd file

# grep root /etc/passwd

The result is:

root:*:0:0:JR:/root:/bin/csh
toor:*:0:0:Superuser:/root:

In this example, we use grep to look for the string ‘root’ in the file (/etc/passwd). Comparing the output to the original snippet of passwd, you’ll see that many lines from the original file were NOT printed. But we did get all the lines containing the word ‘root’. So ‘plain test’ RE is that simple.

It did not matter that ‘root’ was in the middle of a longer string of characters, grep just found any line containing ‘root’ and printed it. One side effect of our grep example is that the RE was case-sensitive. This is usually the case, although options can change this.

Meta Character Matching

Now let's start by looking at the five (5) symbols that provide the foundation for RE in the UNIX world. They are shown in Table #1.

.  period
^  carot
$  dollar sign
{}  curly braces
[]  square brackets

Table #1: Meta characters

These symbols are meta [1]-characters, or place holders in templates.

That is, unless explicitly stated otherwise, each of the symbols IS NOT used for it's traditional meaning. In the case of the period, it is not a period, but something else.

What the period *is* used for is a symbol. A symbol to indicate a single character — any character. What this means when you write a period within an RE construct [2] is that the application will interpret the period to mean that you are looking for any character. In RE parlance, we would say “the period” ‘binds’ to a single character. The term ‘bind’ may seem a bit confusing, but if we consider that regular expressions are used in multiple ways with multiple applications; it's a bit clearer.

To wit, ^ and $ mean to ‘bind’ to the start a line and the end a line, respectively.

Using the previous snippet from the file (/etc/passwd) we write:

# grep ^root /etc/passwd

this would produce:

root:*:0:0:JR:/root:/bin/csh

What we wrote in the grep RE is that we would like to scan the file (/etc/passwd/) and print all the lines that begin with the word ‘root’.

By the same token, we could change ‘^root’ to ‘csh$’ and grep would output all the lines that end with the word ‘csh.

This feature is helpful in that many “text databases” use the first word or last word as a unique key. By doing this, a line can act as a record in a text database.

Databases of this type are often used in light-weight systems, such as embedded systems. Sometimes “text databases” of this type are referred to as flat files [3].

Next, {} (curly braces) and [] (square brackets) provide us with a way to describe a variable number of characters or a variable set of characters, respectively.

a{1-3}

which binds to one, two or three a's

a{5}

which binds to exactly five (5) a's

a{1-}

which binds to one or more a's

[a-c]

binds to one character - either a, b, or c

[aeiou]

binds to any vowel

[a-z,A-Z]

binds to the alphabet, in lower or upper case

Table #2: Examples

Table #2 show some examples using some of these meta-characters.

You might be wondering just what to do if you needed one of these special symbols to be part of what your RE. In that case, we use an ‘escape character’. In UNIX, the character traditionally used for this is the back slash ‘\’. Unfortunately the rules governing what happens when using a back slash can be confusing and difficult to remember. So rather than go through all the rules and variations, I will show some common examples in Table #3.

\.

we want a period, not the symbolic use

\$

we want a dollar sign

\[

we want the square bracket; we are NOT defining a set

\\

we want a backslash character

\\\\

we want two (2) backslash characters

Table #3: Escaped Characters

Lastly, if you'd like to try any of these examples, I would recommend using grep. Trying them with anything else would require a delimiter that I won't mention until the next article.

So that's it for now.

The next article will have more complex examples of RE.

Notes

[1] “Meta”, one definition is ‘one level of description higher than the present’. However, one common problem with the word ‘meta’ is that it is often used to describe a thing. Meta is a concept, not a thing.
[2] “Construct”, as we use it, is a noun. In this case, it is part of a larger thing, like a framework.
[3] “Flat files” are also called as ASCII text files.

Editorial

By Reg. Charney

So much to say, so little to say it with

To all my local readers, I owe an apology. A number of factors conspired to delay the publication of this month’s newsletter.

I was working on an article on heterogeneous collection classes when that “last” test blew up in my face and I realized that all the work was for not. I felt like a rat caught in a maze, no matter which way I turned, there were more technical problems that could not be overcome in the time left. So I did the only sane thing, I gave up and decided that I would rest before trying again. The week’s hiatus allowed me to refresh myself and discover several other interesting problems. I hope to write about them in coming issues.

However, that does bring up a few good points. Dealing with frustration means giving yourself a break—the world will go on. Second, more articles by my readers would ease the load and might even be more interesting ;-).

Oh, the article on heterogeneous collections will appear, but first, most C++ compilers will need to deal with template functions like the following.

class A { };

static int iG; // used later

template<class T> void f(T*)
{ iG = 1; }

template<class T>
struct B {
  void g(T*)
  {
    &f<T>; // get fcn addr
  }
};

int main()
{
  B<A> b;
  A a;

  b.g(&a);

  return iG;
}

An ambiguity error is often reported taking the address of f<T> in struct B. Do you know why?

Book Review

By Allan Kelly

The C++ Programming Language, Special Edition by Bjarne Stroustrup, Addison-Wesley, 2000, ISBN 0-201-70073-5

In calling this book the Special Edition the publishers have brought Microsoft style version numbering to books. However, calling this book the fourth edition would be misleading as there are no substantial changes. Perhaps a better name would have been third edition++.

Apart from multiple corrections the only noticeable changes to the text are the addition of two appendices (both downloadable for free from Stroustrup’s web site) dealing with locales and exception safety, plus an enlarged index - given the poor quality of the index in the second and third editions one hopes this will be an improvement, time will tell.

The appendix on exception safety is a useful contribution to the collective skills upgrade occurring in the C++ community. Only in the last couple of years, since the standard was complete, has exception safety been fully understood, hence the attention the subject is being given in books, magazines and by conference speakers.

The C++ Programming Language is not, and never has been, a good book to learn C++ from. However, it is the definitive reference to the language – notwithstanding the official ISO document which is decidedly less readable. Although primarily a reference book part IV develops Stroustrup’s own thinking on software development and as such these chapters should be read in their entirety as they explain the thinking behind the language.

If you already own the third edition it is hard to justify the $60 for a minor upgrade. But no professional C++ programmer can afford to be without either the third edition or this special edition.

Trends

By Reg. Charney

Same Old, Same Old

This month’s Trends column is like many of those from recent months. The job market continues to contract and the number of new jobs continues to drop. Companies are still hiring, but are becoming very picky. They are only hiring what they need now, based on their more profitable product lines. The preemptive hiring of the past is gone.

Figure #1 shows the continued decline in the last 12 months of the job opening market. Again, the decline is not precipitous.

Figure #2 shows the continued strength of non-internet related jobs. These are jobs that do not depend on Internet related technologies like Java, HTML, XML, etc. This represents IT used for core business purposes.

The only real surprise is shown in Figure #3. The demand for those skilled in writing device drivers for Windows 2000 has grown over the last couple of months. Windows 2000 presence in the market place has finally made itself felt in the job market. Interesting enough, device drivers is one of the first places that a new hardware or software platform makes itself known.

Conclusions

The job market and the stock market are intertwined. There is a feedback loop here. The trick is to see both at the same time. Trends in the job market often foreshadows trends in the stock market. However, changes in the stock market often affect the job market. The Valley is a classic example of this exemplified. The weak earnings and depressed stock market indicates that the job market in almost all sectors will be tight and might even tighten further.