Tech Talk

Sizing a character—Part 2

By Allan Kelly

Last month I introduced character encodings, Unicode, and the UTF-x and UCS-y families (UTF-16, for example, is UCS Transformation Format-16, where UCS stands for Universal Multiple-Octet Coded Character Set). This month I want to pick up where I left off and talk about characters and platforms.

Although UTF-8 encoding may be more compact, it can also be more difficult to work with because the number of bytes needed to represent a character varies from one character to the next. When data is compressed the size difference will probably disappear, as all text data is heavily redundant. Indeed, a typical wide string in English on a Solaris-Sparc machine (UTF-32) will have three bytes of zero for every one non-zero byte.
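To make the size trade-off concrete, here is a minimal sketch that computes how many bytes a string of code points needs in UTF-8 and in UTF-32. The helper names (utf8_length, utf8_size, utf32_size) are made up for illustration, and the code assumes a compiler with char32_t support; it does not validate surrogates or do any actual encoding.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: bytes UTF-8 needs for a single Unicode code point.
// Surrogate code points are not rejected here; this only sizes the result.
std::size_t utf8_length(char32_t cp) {
    assert(cp <= 0x10FFFF);        // outside the Unicode range
    if (cp < 0x80)    return 1;    // ASCII
    if (cp < 0x800)   return 2;    // e.g. accented Latin letters
    if (cp < 0x10000) return 3;    // rest of the Basic Multilingual Plane
    return 4;                      // supplementary planes
}

// Total UTF-8 size of a string held as code points.
std::size_t utf8_size(const std::u32string& text) {
    std::size_t bytes = 0;
    for (char32_t cp : text) bytes += utf8_length(cp);
    return bytes;
}

// UTF-32 always spends four bytes per code point.
std::size_t utf32_size(const std::u32string& text) {
    return text.size() * 4;
}
```

For the five-letter English word "Hello", utf8_size gives 5 and utf32_size gives 20, so 15 of the 20 UTF-32 bytes are zero: three zero bytes for every non-zero one.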

However, wide character schemes suffer from endian issues. On a little-endian Intel machine running NT a wide letter ‘A’ will be encoded as 0x4100, while on a big-endian Sparc Solaris machine the same letter ‘A’ is encoded as 0x00000041. Technically the encoding used by NT is UTF-16LE while the Solaris encoding is UTF-32BE. For an NT application this may not be a problem, as NT is Intel-only these days. But while you may decide to ignore Solaris on Intel, remember that some Linux boxes will be big-endian and some little-endian.

This becomes a particular problem when you want to exchange data between machines: you must ensure that the same encoding and endianness are used on all machines.
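One common way to get this right is to fix the byte order of the exchange format and build the bytes with shifts, rather than copying a wchar_t straight out of memory. The sketch below is an assumption about how you might do that, not a prescribed API; the names Bytes, put_utf16le, put_utf32be and get_utf32be are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Bytes = std::vector<std::uint8_t>;

// Append one 16-bit code unit, least significant byte first (UTF-16LE order).
void put_utf16le(Bytes& out, std::uint16_t unit) {
    out.push_back(static_cast<std::uint8_t>(unit & 0xFF));
    out.push_back(static_cast<std::uint8_t>(unit >> 8));
}

// Append one 32-bit code unit, most significant byte first (UTF-32BE order).
void put_utf32be(Bytes& out, std::uint32_t unit) {
    for (int shift = 24; shift >= 0; shift -= 8)
        out.push_back(static_cast<std::uint8_t>((unit >> shift) & 0xFF));
}

// Read a UTF-32BE code unit back from four bytes starting at pos.
// The result is the same whatever the byte order of the host machine.
std::uint32_t get_utf32be(const Bytes& in, std::size_t pos) {
    assert(pos + 4 <= in.size());
    return (static_cast<std::uint32_t>(in[pos])     << 24) |
           (static_cast<std::uint32_t>(in[pos + 1]) << 16) |
           (static_cast<std::uint32_t>(in[pos + 2]) <<  8) |
            static_cast<std::uint32_t>(in[pos + 3]);
}
```

Because the functions work on values rather than on raw memory, the letter ‘A’ comes out as the bytes 41 00 from put_utf16le and 00 00 00 41 from put_utf32be on any host, Intel or Sparc.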

There is another big difference in how different machines handle different characters and this comes from the API a machine uses. NT programmers can use the Win32 API in either narrow or wide mode, and the filing system supports Unicode filenames.

However, on Solaris and other Unices, the API is more limited and only supports narrow characters. Actually, even in the NT world you may be limited to narrow characters because there are really two APIs: the Win32 API which has wide support, and the C API which, like Unix, demands narrow characters.

In fact, Microsoft extends the C API with a host of _underscored and prefixed functions - there are no fewer than four versions of strlen (strlen, wcslen, _mbslen and _mbstrlen). Needless to say, if you are looking for portable code you have to make some decisions as to what you are going to support.

Luckily, the C++ standard does bring some sense to this situation but at the cost of increasing the amount you have to learn. The first thing you notice is that most string and character handling code actually takes the character type as a template parameter.
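As a rough illustration of what the template parameter buys you, the sketch below writes one length function and one counting function that work for both narrow and wide characters. The names length_of and count_of are hypothetical; std::char_traits and std::basic_string are the standard library pieces doing the work.

```cpp
#include <cstddef>
#include <string>

// One function in place of separate strlen and wcslen calls: the character
// type is deduced, and char_traits supplies the matching length routine.
template <typename CharT>
std::size_t length_of(const CharT* s) {
    return std::char_traits<CharT>::length(s);
}

// Count occurrences of one character in any basic_string specialisation
// (std::string, std::wstring, and so on).
template <typename CharT>
std::size_t count_of(const std::basic_string<CharT>& text, CharT wanted) {
    std::size_t n = 0;
    for (CharT c : text)
        if (c == wanted) ++n;
    return n;
}
```

Note that both functions count code units, not characters as a user sees them: length_of applied to the two UTF-8 bytes of an accented letter returns 2.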

The next thing to notice is the locale mechanism whereby the program knows what language to speak, how to format dates, currencies and so on.

Locale has come to mean different things in different environments: C provides a locale API tied to the current process, so a Solaris process can choose to change its locale but the whole process has just one locale. NT extends this to individual threads which can change their locale, so within one process you can have multiple locales.

In C++, locale is not an OS feature but a language feature. You can create locale objects and use them to manipulate characters; in effect, a single thread can have multiple locales!
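A small sketch shows two locale objects living side by side in one thread. To avoid depending on which named locales the operating system has installed, it builds its second locale from the classic "C" locale plus a custom facet; the class ContinentalPunct and the functions continental_locale, format_number and to_upper are invented for this example.

```cpp
#include <locale>
#include <sstream>
#include <string>

// Hypothetical facet: '.' between groups of three digits, ',' as decimal point.
class ContinentalPunct : public std::numpunct<char> {
protected:
    char do_thousands_sep() const override { return '.'; }
    char do_decimal_point() const override { return ','; }
    std::string do_grouping() const override { return "\3"; }
};

// A locale object: the classic locale with the number punctuation replaced.
// The locale takes ownership of the facet and deletes it when no longer used.
std::locale continental_locale() {
    return std::locale(std::locale::classic(), new ContinentalPunct);
}

// Format a number under whichever locale the caller passes in.
std::string format_number(long value, const std::locale& loc) {
    std::ostringstream out;
    out.imbue(loc);   // affects this stream only, not the process or thread
    out << value;
    return out.str();
}

// Character manipulation also takes the locale as an ordinary argument.
char to_upper(char c, const std::locale& loc) {
    return std::toupper(c, loc);
}
```

Calling format_number(1234567, std::locale::classic()) yields "1234567", while format_number(1234567, continental_locale()) yields "1.234.567", with both calls made from the same thread and no global state changed.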

I said at the start of last month’s piece that this was intended to be an introduction to the topic. I don’t claim to have all the answers, nor indeed the answers for your project. However, there are a few snippets of advice I can offer:

Editorial

By Reg. Charney

Plans for the Future

I almost started out by saying what we were not going to do in this new year. That is, we were not going to whine, not going to lament the job situation, not going to belabor Microsoft’s continued crushing of opportunity and innovation, and not going to bemoan other perceived ills. We are done with that: 2001 was a bad year by most measurements. We’re putting all that behind us.

I also realized that the ACCU and the local Silicon Valley chapter had a lot planned for 2002. In fact, we have more planned for this year than we did for 2001.

First, the ACCU is going to become more organized. We plan to have a small conference here in the Valley, place and time yet to be determined. Also, we should be in a position to offer a series of quick one-day weekend courses on various subjects in which many of our members are expert.

Second, we also plan to have some great speakers. As mentioned on the front page, Bjarne Stroustrup, the inventor of C++, will be speaking to us on February 12th. We also expect other significant speakers later in the year.

I am also pleased that more of our members are becoming involved in running the chapter and in contributing to this newsletter. As for the newsletter, we are going to seek advertisers more actively. I believe that we are the only newsletter of this kind in the Valley. In fact, I believe that we are also one of the oldest newsletters in the Valley. We are now entering our third year of publication.

Databases and Tools

I have been looking at open source databases and tools, like report writers, forms designers, and SQL generators. While I will report more fully in a later issue, a GUI front end for PostgreSQL called PgAccess (http://ns.flex.ro/pgaccess) has really impressed me. I mention it now because a few people have asked me about such a tool.

Book Reviews

Software Craftsmanship by Pete McBreen, Addison-Wesley, ISBN 0-201-73386-2

I like concise books. I read this one over a weekend and a few days of commuting on the 101 express bus to Palo Alto. McBreen comments on the ofttimes ineffective software development process. He explores the term “software engineering” and its common practice: the waterfall model/cycle, the resulting team/corporate structure and its shortcomings—leading to expensive software, sometimes buggy and late. The author gives ample references, both to classic works and online material.

According to McBreen, “software engineering” has tried to apply the lessons from the industrialization of physical production to software development. Labor is divided between groups of people, and after the analysts and designers have figured out how to structure the solution, hordes of (often) average coders implement the resulting specifications. Often this solution ends up being legacy software because the maintainers don’t have the big picture and resist change for fear of breaking something.

The problem is that the engineering process was developed almost 30 years ago to solve large-scale multi-year projects and the world has changed since then. A lot of development is now done using small teams and short product cycles.

McBreen proposes that we start viewing the development process differently. In his view, software and the people who develop it are capital. Developing software is as much a social process and a learning process as it is a technical process. It is a craft (science and art combined). To get zero-defect, useful, timely applications, we should use small teams of software craftsmen, journeymen and apprentices. This analogy comes from the traditional world of craftsmanship (blacksmiths in particular), where craftsmen want to do quality work, stand behind it, and be recognized for it. They stake their reputation on their work and as such focus on quality and timeliness.

Software craftsmen have learned the intricacies of software development, including analysis and design, and take on journeymen, who participate and learn from the master. Journeymen in turn take on apprentices to train as their successors. Once a small team of such masters, journeymen and apprentices is built and has delivered an application, it stays together to keep the application alive and to make sure it evolves and continues to be valuable. The team spreads its knowledge about the whole system to every member, minimizing reliance on a single team member. Such a team will consistently produce great software applications because each member strives to improve skill and reputation.

McBreen uses eXtreme Programming and Open Source projects to support his view. His mission is to return the focus to the people who develop software and to put formal processes where they belong. He includes tips on how to pursue software craftsmanship in a company, but not nearly enough. Another book on the subject would be most welcome, Pete! The book is a true pleasure to read. Developers will be left longing to work on a team of craftsmen. Managers will gain insight into how to build great teams of developers. My hope is that this book will start a new wave of approaches to software development so that we can put the fun back where it belongs: into our everyday jobs.

—Oluf Nissen

The Unified Modeling Language User Guide by Booch et al., Addison-Wesley, ISBN 0-201-57168-4

I give this book on UML a pass. It is not as crisp as Fowler and Scott's UML Distilled, nor as witty as Booch's earlier Object-Oriented Analysis and Design. I found it difficult to look up concepts and to follow the numerous cross references. It took a lot of time to read, even to look up short subjects. It would best serve an intermediate reader, and it covers a great deal of territory.

I started reading the book with two objectives: 1) find out what the dashed lines mean in an object diagram, and 2) find out how to present software architecture in a top-down, general-to-particular manner. The first objective was attained when I found that a dashed line is a “dependency” (p. 160), that a dependency means many things, including “creation”, “trace”, “refinement” and “bind” (p. 61), and that it can be understood as the old Booch “using” relationship (pp. 53, 137). One object reaches to another object with a dashed line and “uses” it.

The second objective was harder, since I was looking for a decomposition into systems and subsystems, and this does not appear in the book until a great deal of work is done. Chapters 12 (Packages) and 31 (Systems and Models) give much of the answer. I didn’t find it very intuitive to put subsystem decomposition so late in the modeling process, so I changed things around and decided to draw object diagrams with “systems” in place of classes. I could then model system interactions with sequence diagrams of “systems”. This worked much better.

At this point I became aware of a third objective: finding out how to move from UML diagram to code. This is no small matter. Perhaps it is because code is essentially procedural, as the central processor runs one instruction at a time. I have not come to a satisfactory solution for this problem. However, it is possible to translate sequence diagrams easily into pseudo-code. One can write useful pseudo-code at the system level and the detail level. A little coordination between diagram and pseudo-code, and one can have a satisfactory, tenth view (p. 24) into the software design.

I have been critical so far, but I stand in awe of Booch, Rumbaugh, and Jacobson. I will point out several of the many fine sections in the book: the discussion of components in Chapter 25; the exception hierarchy on p. 285; the explanation that use-case scenarios drive the UML model (p. 33), and the supporting Figure 2-20 on p. 31; and the exposition of activity diagrams in Chapter 19, which update the old flowchart methodology. There is much to use.

—Daniel Bonbright

Trends

By Ali Çehreli

No Windows XP yet

Last month I promised to include Windows XP jobs in this month's charts. But the numbers are still too small: only 6 jobs were posted in October, and 3 each in November and December. It looks like we'll have to wait a while longer for XP numbers to be distinguishable from those of other Windows platforms.

Nothing has changed since last month. Once again, Linux and Windows 2000 among the platforms, and ASIC among the technologies, have been the trendiest. All three are becoming less trendy, though.

The only promising aspect of this month's data is the drop in the drop (Figures 1 and 2). Both figures indicate that we are at least at a local minimum.

As a proud Silicon Valleyite, I always like to talk about my firsthand experiences and some local hearsay. Last month, I wrote about the company I worked for laying off employees, most of them H-1B holders. The good news is that all but two of them found jobs within a few weeks. Similarly, all of the laid-off employees of an optical networking startup found jobs in a very short time. Some people say that we are in a circulation period.