Tech Talk

The Basics of RE (Regular Expressions) — Part 2

By Jesus Monroy, Jr.

Quick Review

In the last article we examined some simple examples of how to use RE. We will quickly review part one.

the

bind to the single word ‘the’.

a

bind to the single letter ‘a’.

aaa

bind to three ‘a’ characters in a row.

.

bind to any single character.

^

bind to the beginning of line

$

bind to the end of the line

a{1-3}

bind to ‘a’, ‘aa’ or ‘aaa’

[a-z]

bind to any lower case letter

[^np]

bind to any lower case letter, except ‘n’ and ‘p’

\.

bind to the period ‘.’ character

\\

bind to backslash ‘\’ character

Table #1: Meta-characters

One quick note, most programs expect an RE to be on a single line. That is, if looking for three ‘a’ (a{3}) in a row, most programs expect to find

this is an example of aaa

not

this is a broken example
of aa
a

The later example can be constructed as a RE – but not by all programs, and only when explicitly told to do so. We won't cover that here.

Return Value

One thing we have not discussed up until now is if we could detect whether a match occurred. In some cases, we might want to know if the RE did not bind. Specifically, Perl, Awk, Lex and Yacc are languages which can take action base on the returned value of a RE. In the next section, we will see examples of how this works.

The Logic of More Complex REs

Some of the meta-characters we did not considered in the first article include:

?

question mark

Zero or one

*

star (asterisk)

Zero, one or more

+

plus sign

One or more

Table #2: Meta-modifiers

These meta-characters are modifiers on other meta-characters. If we use the ‘.’ as an example, then we might better see how they are used. To refresh our memory, the period ‘.’ is a meta-character that binds to any character.

So, if we use the modifiers in conjunction with the period we get:

.?

bind to zero or one characters

.*

bind to zero, one or more characters

.+

bind to one or more characters

Example #1: Meta-modifiers

In the first example, the ‘.?’, usually phrased as 'zero or one' occurrence, will always succeed, or return a value of true. Similarly so, ‘.*’ will always return true. However, when we use ‘.+’, this means that we must bind to with at least one (1) character or the value return is false. As a side note, a single period without a modifier ‘.’ will return true if once it finds one character. If it finds no characters, it returns false.

Another noteworthy item is the greediness of the star (*) character. When the star is used in conjunction with the ‘.’ symbol, it grabs all the characters. This might matter when trying to parse data fields, like CSV (comma separated values). In just a bit we will see how this matters, and one way to handle this.

Character Set Exclusion

One special case worth noting is the meta-characters that defines a character exclusion set. It is written like this:

[^n] bind to any lower case alphabetic character, except ‘n’

This useful when we want to grab a string encased in a pair of delimiters. Here are some common examples:

,[^,],

grab comma delimited

"[^"]*"

grab inside double quotes

'[^']+'

grab inside single quotes

^[^\s]*

grab string without spaces

Example #2: Meta-modifiers

In the first example, grab anything in between commas. In the second example, we grab anything inside a pair of double quotes. Both are commonly used for CSV (comma separated values) formats. The third example is similar to the second, except we use single quotes and it only returns true if there is something in between the quotes. The fourth example grabs the first word on the beginning of a line.

I should now continue with the point mentioned earlier about the star ‘*’ being greedy when it binds. If you'll note, in the last set of examples I wrote

"[^"]*"

not

".*"

The reason is, in the later case, the star (*) becomes greedy and grabs everything after the first single quote ("), even the end-of-line character. To avoid this, we wrote ("[^"]*").

To explain this a bit further, let's look at the character set exclusion rule, [^"]. This tells us to we want to exclude the double quote (") character from our set. Implicitly, this says “include everything else”. As such, if we include the portion with the star ([^"]*), the RE will bind to all characters, except the double quote ("). It will then continue until it finds the next double quote or the end-of-line. If it finds the end-of-line, it will fail (or return false).

If we then include a look at the original RE, we see that it will bind when it finds the first double quote ("), then stop when it finds the next double quote ("). Also, unless it finds both double quotes, the RE will return false.

Lastly, with regards to using quoting characters, some programs ‘interpret’ the meaning of characters within the quotes. For example, when using Lex, all meta-characters lose their meta meaning when inside double quotes, except for “C escape sequences”.

Grouping

One other special case worth mentioning are the grouping symbols. Their meta-characters are:

|

pipe character

()

round braces

They can be used by themselves, but are often used in combination to form more complex statements. Here is an example with the pipe character ‘|’.

cow|pig|sheep

In this example, the RE will bind with any of the three words. That is, this RE example will return true if either cow, pig or sheep is found on the line.

As for the round braces, we will give the fairly complex example:

([0-9]+)|([0-9]*\.[0-9]+)

We can see the RE breaks into 2 parts. On the left is [0-9]+. On the right is [0-9]*\.[0-9]+

The left binds to one or more numerical digits. The right binds to a decimal value that may or may not start with a numerical digit and trails with one or more numerical digits. For example:

.8
25.273684
0678.900
0,0

Left to Right Precedent Ordering

Lastly, there is precedent ordering. With the exception of *grouping*, RE does not have precedent ordering; it works from left to right. That is, if you build a slightly more complex RE, the binding starts from the left and works to the right. Here is a quick example:

A (quick)? fox jumped
(up|down)\.

Some legal results are:

A fox jumped up.
A fox jumped down.
A quick fox jumped up.
A quick fox jumped down.

That's it. I hope you found this short tutorial and examples useful. In the next article, I hope to show RE in action with Lex, the lexical analyzer.

Editorial

By Reg. Charney

Linux Development Directions

Just read a terrific article on the latest Linux kernel hackers summit by Chris DiBona. It was fun to read and gave a good flavor of where Linux development is heading. You can find the full article at http://noframes.linuxjournal.com/articles/buzz/0048.html. Areas discussed at the conference included:

This is a conference of the best and brightest of the Open Source software movement. Read Chris’s article.

C++ Object-Oriented Preprocessor

Through sheer good fortune, I came across an article in the IEEE Proc. on Software, Vol. 147, No 2, April 2000. In it, Willink and Muchnick describe an Object-Oriented Preprocessor fit for C++. I plan to do a synopsis of their proposal in a future issue, but if you can’t wait, their paper is available at, www.computing.surrey.ac.uk/research/dsrg/fog/FogIee.pdf. That paper summarizes the much more substantial work in www.computing.surrey.ac.uk/research/dsrg/fog/v/FogThesis.pdf.

Inline Functions

By Allan Kelly

C++ inline functions may optimise code: inline function saves you the overhead of a function call, that is: loading parameters onto the stack, making a jump, popping parameters off the stack, and then do it all in reverse when you return.

Modern CPUs have pipelines that a function call can disrupt when called, and again, when it returns. By inlining we potentially allow the CPU to further improve execution.

Not only this, but a compiler may be able to optimise code which it has been inlined in a way it cannot if there is a function call simply because the optimisation becomes visible, e.g. optimising register allocations. In the case of empty functions – fairly typical of constructors – making the function inline may allow the compiler to remove the call altogether.

Sometimes an inline function may save time and space. Making a function call requires instruction codes to make the call both on the part of the caller and callee. On occasions, the number of bytes required for this may actually be greater than the number of bytes in the function body itself.

With these kinds of optimisations available, we might wonder why we don’t just make all functions inline. In my opinion there are downsides that out weigh the advantages.

For starters, inline is only a hint to the compiler. The compile can ignore it if the function is too big or too complex. Compilers differ in what they will and will not inline.

Other reasons I avoid using inline are:

Inlines may improve performance. However, they will always make your code less flexible.

Trends

By Reg. Charney

An Aside

I collect my statistics on a monthly basis. I have done that for May 2001, but the news is once again, unexciting, to say the least.. The general trend is downwards, but there are no real surprises and I don’t want to make news out of nothing. Thus, I looked for facts that I thought you might be interested in. Hence the accompanying table of statistics from the SourceForge home page (www.sourceforge.org).

Some of you may know about the SourceForge, but many of you may not. The SourceForge is a sophisticated means of configuration management and publication. It offers CVS, mailing lists, bug tracking, message boards/forums, task management, site hosting, permanent file archival, full backups, and total web-based administration.

It is hosted by VA Linux and made available for free to anyone with an Open Source project. You can even get a copy for your own firm’s internal use.

I chose to write about it this month because of the number of ongoing Open Source projects that may be of interest to you. Also I wanted to show the strength of the Open Source movement. On this one site alone, there are almost 20,000 projects and over 160,000 users. There are also thousands of games available, many ported from other operating systems.

Since its free, go visit and maybe you will be interested enough to want to actively participate.

SourceForge Statistics

Hosted Projects: 19,709 Registered Users: 160,177

Top Project Downloads

(7,619) The Freenet Project
(6,909) AbiWord
(6,348) Open Source Napster Server
(5,694) MiKTeX
(5,339) Back Orifice 2000
(4,744) Tux Racer
(4,541) WinPenguins
(3,238) Audacity
(3,177) VirtualDub
(2,860) jEdit

Highest Ranked Users

1 - (5.0041) Tim Perdue
2 - (4.8605) James Byers
3 - (4.7353) Luke Ehresman
4 - (4.2937) Uriah Welcome
5 - (4.1639) Nathan Ehresman
6 - (3.9460) Brandon Wiley
7 - (3.5279) Peter Hutnick
8 - (3.5092) Paul Joseph Thompson
9 - (3.4479) Eric Warmenhoven
10 - (3.3777) Theodore Ts'o

Most Active This Week

(100.0000% ) SourceForge
( 99.9793% ) MiKTeX
( 99.9585% ) phpAdsNew
( 99.9378% ) phpGroupWare
( 99.9170% ) Gaim
( 99.8963% ) Gnucleus
( 99.8756% ) Miranda ICQ Client
( 99.8548% ) EtherApe
( 99.8341% ) Python
( 99.8134% ) DocBook Open Repository
( 99.7926% ) SquirrelMail
( 99.7719% ) HSQL Database Engine
( 99.7511% ) Firewall Builder
( 99.7304% ) FreeCraft real-time strategy game engine
( 99.7097% ) RadeonTweaker
( 99.6889% ) jboss.org
( 99.6682% ) Crystal Space 3D Engine
( 99.6474% ) jEdit
( 99.6267% ) net-snmp
( 99.6060% ) Enlightenment