|
By Jesus Monroy, Jr.
RE is a method familiar to the UNIX world for searching, parsing
and manipulating text. Of benefit to non-UNIX programmers, subsets
of RE are used for some generalized, but powerful applications, like
SQL. On the UNIX platform, RE can be seen in applications like vi,
sed, awk, grep, lex, yacc and perl.
While we can do the same thing with C or C++, RE provides a
generic method to do the same job. As such, RE forms its own
language, syntax and form of expression (no pun intended).
To assist us in understanding RE, I will use two articles, with
this article having two sections. The first section will show how RE
uses ‘plain text’ as a RE. The second section will show how RE
uses certain ‘plain text’ characters as ‘symbols’ to create
RE templates. These RE templates, in turn, are said to bind, or
match, to the pattern(s) expressed by the template. In the second
article, I will explain more complex RE, and some special cases of
RE.
Plain Text Matching
This first part of RE is rather easy, but at times easy forget
it. Plainly, a ‘plain text’ RE can match any string, or part of
a string. Here is an example using a snippet of the /etc/passwd
file (shown in Listing #1) as a reference to which we
apply a grep expression like
the following:
root:*:0:0:JR:/root:/bin/csh
toor:*:0:0:Superuser:/root:
daemon:*:1:1:Owner of many processes:/root:/sbin/nologin
operator:*:2:5:System &:/:/sbin/nologin
bin:*:3:7:Binaries Commands and Source,,,:/:/sbin/nologin
tty:*:4:65533:Tty Sandbox:/:/sbin/nologin |
Listing #1: Sample /etc/passwd file
# grep root /etc/passwd
The result is:
root:*:0:0:JR:/root:/bin/csh
toor:*:0:0:Superuser:/root:
In this example, we use grep to
look for the string ‘root’
in the file (/etc/passwd).
Comparing the output to the original snippet of passwd,
you’ll see that many lines from the original file were NOT
printed. But we did get all the lines containing the word ‘root’.
So ‘plain test’ RE is that simple.
It did not matter that ‘root’
was in the middle of a longer string of characters, grep
just found any line containing ‘root’
and printed it. One side effect of our grep
example is that the RE was case-sensitive. This is
usually the case, although options can change this.
Meta Character Matching
Now let's start by looking at the five (5) symbols that provide
the foundation for RE in the UNIX world. They are shown in Table #1.
| . |
period |
| ^ |
carot |
| $ |
dollar
sign |
| {} |
curly
braces |
| [] |
square
brackets |
Table #1: Meta characters
These symbols are meta [1]-characters, or place holders in
templates.
That is, unless explicitly stated otherwise, each of the symbols
IS NOT used for it's traditional meaning. In the case of the period,
it is not a period, but something else.
What the period *is* used for is a symbol. A symbol to indicate a
single character — any character. What this means when you write a
period within an RE construct [2] is that the
application will interpret the period to mean that you are looking
for any character. In RE parlance, we would say “the period” ‘binds’
to a single character. The term ‘bind’ may seem a bit confusing,
but if we consider that regular expressions are used in multiple
ways with multiple applications; it's a bit clearer.
To wit, ^ and $
mean to ‘bind’ to the start a line and the end a line,
respectively.
Using the previous snippet from the file (/etc/passwd)
we write:
# grep ^root /etc/passwd
this would produce:
root:*:0:0:JR:/root:/bin/csh
What we wrote in the grep RE
is that we would like to scan the file (/etc/passwd/)
and print all the lines that begin with the word ‘root’.
By the same token, we could change ‘^root’
to ‘csh$’ and grep
would output all the lines that end with the word ‘csh’.
This feature is helpful in that many “text databases” use the
first word or last word as a unique key. By doing this, a line can
act as a record in a text database.
Databases of this type are often used in light-weight systems,
such as embedded systems. Sometimes “text databases” of this
type are referred to as flat files [3].
Next, {} (curly braces)
and [] (square brackets)
provide us with a way to describe a variable number of characters or
a variable set of characters, respectively.
|
a{1-3} |
which binds to one, two or three a's |
|
a{5} |
which binds to exactly five (5) a's |
|
a{1-} |
which binds to one or more a's |
|
[a-c] |
binds to one character - either a,
b,
or c |
|
[aeiou] |
binds to any vowel |
|
[a-z,A-Z] |
binds to the alphabet, in lower or upper case |
Table #2: Examples
Table #2 show some examples using some of these meta-characters.
You might be wondering just what to do if you needed one of these
special symbols to be part of what your RE. In that case, we use an
‘escape character’. In UNIX, the character traditionally used
for this is the back slash ‘\’.
Unfortunately the rules governing what happens when using a back
slash can be confusing and difficult to remember. So rather than go
through all the rules and variations, I will show some common
examples in Table #3.
|
\. |
we want a period, not the symbolic use |
|
\$ |
we want a dollar sign |
|
\[ |
we want the square bracket; we are NOT defining a set |
|
\\ |
we want a backslash character |
|
\\\\ |
we want two (2) backslash characters |
Table #3: Escaped Characters
Lastly, if you'd like to try any of these examples, I would
recommend using grep. Trying
them with anything else would require a delimiter that I won't
mention until the next article.
So that's it for now.
The next article will have more complex examples of RE.
Notes
[1] “Meta”, one
definition is ‘one level of description higher than the present’.
However, one common problem with the word ‘meta’ is that it is
often used to describe a thing. Meta is a concept, not a thing.
[2] “Construct”, as we use it, is a noun. In this case, it is
part of a larger thing, like a framework.
[3] “Flat files” are also called as ASCII text files.
By Reg. Charney
To all my local readers, I owe an apology. A number of factors
conspired to delay the publication of this month’s newsletter.
I was working on an article on heterogeneous collection classes
when that “last” test blew up in my face and I realized that all
the work was for not. I felt like a rat caught in a maze, no matter
which way I turned, there were more technical problems that could
not be overcome in the time left. So I did the only sane thing, I
gave up and decided that I would rest before trying again. The week’s
hiatus allowed me to refresh myself and discover several other
interesting problems. I hope to write about them in coming issues.
However, that does bring up a few good points. Dealing with
frustration means giving yourself a break—the world will go on.
Second, more articles by my readers would ease the load and might
even be more interesting ;-).
Oh, the article on heterogeneous collections will appear, but
first, most C++ compilers will need to deal with template functions
like the following.
| class
A { };
static int iG; // used later
template<class T> void f(T*)
{ iG = 1; }
template<class T>
struct B {
void g(T*)
{
&f<T>; // get fcn addr
}
};
int main()
{
B<A> b;
A a;
b.g(&a);
return iG;
} |
An ambiguity error is often reported taking the address of f<T>
in struct B. Do
you know why?
By Allan Kelly
The
C++ Programming Language, Special Edition
by Bjarne Stroustrup, Addison-Wesley, 2000, ISBN 0-201-70073-5
In calling this book the Special Edition the publishers
have brought Microsoft style version numbering to books. However,
calling this book the fourth edition would be misleading as
there are no substantial changes. Perhaps a better name would have
been third edition++.
Apart from multiple corrections the only noticeable changes to
the text are the addition of two appendices (both downloadable for
free from Stroustrup’s web site) dealing with locales and
exception safety, plus an enlarged index - given the poor quality of
the index in the second and third editions one hopes this will be an
improvement, time will tell.
The appendix on exception safety is a useful contribution to the
collective skills upgrade occurring in the C++ community. Only in
the last couple of years, since the standard was complete, has
exception safety been fully understood, hence the attention the
subject is being given in books, magazines and by conference
speakers.
The C++ Programming Language is not, and never has been, a
good book to learn C++ from. However, it is the definitive reference
to the language – notwithstanding the official ISO document which
is decidedly less readable. Although primarily a reference book part
IV develops Stroustrup’s own thinking on software development and
as such these chapters should be read in their entirety as they
explain the thinking behind the language.
If you already own the third edition it is hard to justify the
$60 for a minor upgrade. But no professional C++ programmer can
afford to be without either the third edition or this special
edition.
By Reg. Charney
Same Old, Same Old
This month’s Trends column is like many of those from recent
months. The job market continues to contract and the number of new
jobs continues to drop. Companies are still hiring, but are becoming
very picky. They are only hiring what they need now, based on their
more profitable product lines. The preemptive hiring of the past is
gone.

Figure #1 shows the continued decline in the last 12 months of
the job opening market. Again, the decline is not precipitous.

Figure #2 shows the continued strength of non-internet related
jobs. These are jobs that do not depend on Internet related
technologies like Java, HTML, XML, etc. This represents IT used for
core business purposes.

The only real surprise is shown in Figure #3. The demand for
those skilled in writing device drivers for Windows 2000 has grown
over the last couple of months. Windows 2000 presence in the market
place has finally made itself felt in the job market. Interesting
enough, device drivers is one of the first places that a new
hardware or software platform makes itself known.
Conclusions
The job market and the stock market are intertwined. There is a
feedback loop here. The trick is to see both at the same time.
Trends in the job market often foreshadows trends in the stock
market. However, changes in the stock market often affect the job
market. The Valley is a classic example of this exemplified. The
weak earnings and depressed stock market indicates that the job
market in almost all sectors will be tight and might even tighten
further.
|