|
Tech Talk
The Basics of RE
Editorial
Linux Development
Directions
C++ OO
Preprocessor
Inline Functions Trends
An Aside
SourceForge Statistics
|
By Jesus Monroy, Jr.
Quick Review
In the last article we examined some simple examples of how to
use RE. We will quickly review part one.
|
the |
bind to the single word ‘the’. |
|
a |
bind to the single letter ‘a’. |
|
aaa |
bind to three ‘a’ characters in a row. |
|
. |
bind to any single character. |
|
^ |
bind to the beginning of line |
|
$ |
bind to the end of the line |
|
a{1-3} |
bind to ‘a’, ‘aa’ or ‘aaa’ |
|
[a-z] |
bind to any lower case letter |
|
[^np] |
bind to any lower case letter, except ‘n’ and ‘p’ |
|
\. |
bind to the period ‘.’ character |
|
\\ |
bind to backslash ‘\’ character |
Table #1: Meta-characters
One quick note, most programs expect an RE to be on a single
line. That is, if looking for three ‘a’ (a{3}) in a row, most
programs expect to find
this is an example of aaa
not
this is a broken example
of aa
a
The later example can be constructed as a RE – but not by all
programs, and only when explicitly told to do so. We won't cover
that here.
Return Value
One thing we have not discussed up until now is if we could
detect whether a match occurred. In some cases, we might want to
know if the RE did not bind. Specifically, Perl, Awk, Lex and Yacc
are languages which can take action base on the returned value of a
RE. In the next section, we will see examples of how this works.
The Logic of More Complex REs
Some of the meta-characters we did not considered in the first
article include:
|
? |
question mark |
Zero or one |
|
* |
star (asterisk) |
Zero, one or more |
|
+ |
plus sign |
One or more |
Table #2: Meta-modifiers
These meta-characters are modifiers on other meta-characters. If
we use the ‘.’ as an example, then we might better see how they
are used. To refresh our memory, the period ‘.’ is a
meta-character that binds to any character.
So, if we use the modifiers in conjunction with the period we
get:
|
.? |
bind to zero or one characters |
|
.* |
bind to zero, one or more characters |
|
.+ |
bind to one or more characters |
Example #1: Meta-modifiers
In the first example, the ‘.?’, usually phrased as 'zero or
one' occurrence, will always succeed, or return a value of true.
Similarly so, ‘.*’ will always return true. However, when we use
‘.+’, this means that we must bind to with at least one (1)
character or the value return is false. As a side note, a single
period without a modifier ‘.’ will return true if once it finds
one character. If it finds no characters, it returns false.
Another noteworthy item is the greediness of the star (*)
character. When the star is used in conjunction with the ‘.’
symbol, it grabs all the characters. This might matter when trying
to parse data fields, like CSV (comma separated values). In just a
bit we will see how this matters, and one way to handle this.
Character Set Exclusion
One special case worth noting is the meta-characters that defines
a character exclusion set. It is written like this:
[^n]
bind to any lower case alphabetic character, except ‘n’
This useful when we want to grab a string encased in a pair of
delimiters. Here are some common examples:
|
,[^,], |
grab comma delimited |
|
"[^"]*" |
grab inside double quotes |
|
'[^']+' |
grab inside single quotes |
|
^[^\s]* |
grab string without spaces |
Example #2: Meta-modifiers
In the first example, grab anything in between commas. In the
second example, we grab anything inside a pair of double quotes.
Both are commonly used for CSV (comma separated values) formats. The
third example is similar to the second, except we use single quotes
and it only returns true if there is something in between the
quotes. The fourth example grabs the first word on the beginning of
a line.
I should now continue with the point mentioned earlier about the
star ‘*’ being greedy when it binds. If you'll note, in the last
set of examples I wrote
"[^"]*"
not
".*"
The reason is, in the later case, the star (*) becomes greedy and
grabs everything after the first single quote ("), even the
end-of-line character. To avoid this, we wrote
("[^"]*").
To explain this a bit further, let's look at the character set
exclusion rule, [^"]. This tells us to we want to exclude the
double quote (") character from our set. Implicitly, this says
“include everything else”. As such, if we include the portion
with the star ([^"]*), the RE will bind to all characters,
except the double quote ("). It will then continue until it
finds the next double quote or the end-of-line. If it finds the
end-of-line, it will fail (or return false).
If we then include a look at the original RE, we see that it will
bind when it finds the first double quote ("), then stop when
it finds the next double quote ("). Also, unless it finds both
double quotes, the RE will return false.
Lastly, with regards to using quoting characters, some programs
‘interpret’ the meaning of characters within the quotes. For
example, when using Lex, all meta-characters lose their meta meaning
when inside double quotes, except for “C escape sequences”.
Grouping
One other special case worth mentioning are the grouping symbols.
Their meta-characters are:
|
| |
pipe character |
|
() |
round braces |
They can be used by themselves, but are often used in combination
to form more complex statements. Here is an example with the pipe
character ‘|’.
cow|pig|sheep
In this example, the RE will bind with any of the three words.
That is, this RE example will return true if either cow, pig or
sheep is found on the line.
As for the round braces, we will give the fairly complex example:
([0-9]+)|([0-9]*\.[0-9]+)
We can see the RE breaks into 2 parts. On the left is [0-9]+. On
the right is [0-9]*\.[0-9]+
The left binds to one or more numerical digits. The right binds
to a decimal value that may or may not start with a numerical digit
and trails with one or more numerical digits. For example:
.8
25.273684
0678.900
0,0
Left to Right Precedent Ordering
Lastly, there is precedent ordering. With the exception of
*grouping*, RE does not have precedent ordering; it works from left
to right. That is, if you build a slightly more complex RE, the
binding starts from the left and works to the right. Here is a quick
example:
A (quick)? fox jumped
(up|down)\.
Some legal results are:
A fox jumped up.
A fox jumped down.
A quick fox jumped up.
A quick fox jumped down.
That's it. I hope you found this short tutorial and examples
useful. In the next article, I hope to show RE in action with Lex,
the lexical analyzer.
By Reg. Charney
Just read a terrific article on the latest Linux kernel hackers
summit by Chris DiBona. It was fun to read and gave a good flavor of
where Linux development is heading. You can find the full article at
http://noframes.linuxjournal.com/articles/buzz/0048.html. Areas
discussed at the conference included:
- Requirements for a High Performance Database.
- SCTP (Streaming Control Transport Protocol) Linux Kernel
Implementation.
- Advanced Filesystem Integration (XFS).
- Hot Plug
- Linux 2.6 Memory Management
- Security Enhanced Linux
- Asynchronous I/O
- Linux and Power Management
- Larry McVoy on BitKeeper
This is a conference of the best and brightest of the Open Source
software movement. Read Chris’s article.
Through sheer good fortune, I came across an article in the IEEE
Proc. on Software, Vol. 147, No 2, April 2000. In it, Willink and
Muchnick describe an Object-Oriented Preprocessor fit for C++. I
plan to do a synopsis of their proposal in a future issue, but if
you can’t wait, their paper is available at, www.computing.surrey.ac.uk/research/dsrg/fog/FogIee.pdf.
That paper summarizes the much more substantial work in www.computing.surrey.ac.uk/research/dsrg/fog/v/FogThesis.pdf.
By Allan Kelly
C++ inline functions may optimise code: inline function saves you
the overhead of a function call, that is: loading parameters onto
the stack, making a jump, popping parameters off the stack, and then
do it all in reverse when you return.
Modern CPUs have pipelines that a function call can disrupt when
called, and again, when it returns. By inlining we potentially allow
the CPU to further improve execution.
Not only this, but a compiler may be able to optimise code which
it has been inlined in a way it cannot if there is a function call
simply because the optimisation becomes visible, e.g. optimising
register allocations. In the case of empty functions – fairly
typical of constructors – making the function inline may allow the
compiler to remove the call altogether.
Sometimes an inline function may save time and space. Making a
function call requires instruction codes to make the call both on
the part of the caller and callee. On occasions, the number of bytes
required for this may actually be greater than the number of bytes
in the function body itself.
With these kinds of optimisations available, we might wonder why
we don’t just make all functions inline. In my opinion there are
downsides that out weigh the advantages.
For starters, inline is only a hint to the compiler. The compile
can ignore it if the function is too big or too complex. Compilers
differ in what they will and will not inline.
Other reasons I avoid using inline are:
- Code bloat: inline functions usually make your code bigger
because you duplicate code for each function invocation.
- Increased CPU cache misses: if you increase a function’s
size with inline functions, it may outgrow the cache, and fewer
big functions may be held in cache at once.
- Coupling between headers and implementation increases. A
header with interface and implementation is more likely to
change than one with interface alone. Changes in a combined
header file will cause a bigger rebuild because other files
depend on your header files. By contrast, if the code is in the
implementation file, only that file needs to be recompiled. On
large systems the difference can be hours.
- By giving the implementation in the header file you are less
likely to be able to forward declare classes, so you have to
include another header file – increasing coupling and making
the code less flexible and more prone to ripple effects –
again increasing build times.
- Perhaps more significant is what a lot of inline functions
says about your design. Inlining is typically used for get and
set functions. Classes with lots of such functions aren’t
really hiding anything, they are just containers for values that
begs the question – just how object-oriented is your design?
- Finding code in source files can be time consuming. If you
have lots of inline functions you need to look in .h files as
well. This may seem trivial, but on a large system with several
thousand files the time adds up.
Inlines may improve performance. However, they will always make
your code less flexible.
By Reg. Charney
An Aside
I collect my statistics on a monthly basis. I have done that for
May 2001, but the news is once again, unexciting, to say the least..
The general trend is downwards, but there are no real surprises and
I don’t want to make news out of nothing. Thus, I looked for facts
that I thought you might be interested in. Hence the accompanying
table of statistics from the SourceForge home page (www.sourceforge.org).
Some of you may know about the SourceForge, but many of you may
not. The SourceForge is a sophisticated means of configuration
management and publication. It offers CVS, mailing lists, bug
tracking, message boards/forums, task management, site hosting,
permanent file archival, full backups, and total web-based
administration.
It is hosted by VA Linux and made available for free to anyone
with an Open Source project. You can even get a copy for your own
firm’s internal use.
I chose to write about it this month because of the number of
ongoing Open Source projects that may be of interest to you. Also I
wanted to show the strength of the Open Source movement. On this one
site alone, there are almost 20,000 projects and over 160,000 users.
There are also thousands of games available, many ported from other
operating systems.
Since its free, go visit and maybe you will be interested enough
to want to actively participate.
SourceForge Statistics
Hosted Projects: 19,709 Registered Users: 160,177
| Top
Project Downloads
(7,619) The Freenet Project
(6,909) AbiWord
(6,348) Open Source Napster Server
(5,694) MiKTeX
(5,339) Back Orifice 2000
(4,744) Tux Racer
(4,541) WinPenguins
(3,238) Audacity
(3,177) VirtualDub
(2,860) jEdit
Highest Ranked Users
1 - (5.0041) Tim Perdue
2 - (4.8605) James Byers
3 - (4.7353) Luke Ehresman
4 - (4.2937) Uriah Welcome
5 - (4.1639) Nathan Ehresman
6 - (3.9460) Brandon Wiley
7 - (3.5279) Peter Hutnick
8 - (3.5092) Paul Joseph Thompson
9 - (3.4479) Eric Warmenhoven
10 - (3.3777) Theodore Ts'o |
Most
Active This Week
(100.0000% ) SourceForge
( 99.9793% ) MiKTeX
( 99.9585% ) phpAdsNew
( 99.9378% ) phpGroupWare
( 99.9170% ) Gaim
( 99.8963% ) Gnucleus
( 99.8756% ) Miranda ICQ Client
( 99.8548% ) EtherApe
( 99.8341% ) Python
( 99.8134% ) DocBook Open Repository
( 99.7926% ) SquirrelMail
( 99.7719% ) HSQL Database Engine
( 99.7511% ) Firewall Builder
( 99.7304% ) FreeCraft real-time strategy game engine
( 99.7097% ) RadeonTweaker
( 99.6889% ) jboss.org
( 99.6682% ) Crystal Space 3D Engine
( 99.6474% ) jEdit
( 99.6267% ) net-snmp
( 99.6060% ) Enlightenment |
|