Tech Talk

Sizing a character—Part 1

By Allan Kelly

How many characters can you fit into 1KB? Sounds a bit like “how many angels can you fit on the head of a pin?” Well, it seems you could spend equally long arguing either question, but the former is much more important for software engineers.

You’ve probably already guessed that I’m going to talk about internationalization, but I don’t want to talk about the why and how of it. I just want to look at characters. This is a massive topic in its own right and all I can hope to do is give you a starting point.

Most programs make do with an 8-bit character set built on ASCII; indeed, for many 7 bits is quite ample, but it’s more convenient to work in 8-bit bytes. However, for internationalization 256 characters is simply not enough.

This is where Unicode becomes interesting. The Unicode standard defines names and numbers for thousands of characters and, very conveniently, the first 128 characters map directly to the ASCII standard. Most Windows NT programmers are probably fairly familiar with Unicode and the wchar_t that specifies a 16-bit character.

However, this is simply one encoding of Unicode, known as UTF-16 (UCS Transformation Format, where UCS is Universal multiple-octet coded Character Set, for those that want to know). Solaris support for Unicode is provided with a 32-bit wchar_t and the UTF-32 encoding. There is also an 8-bit encoding for Unicode known as UTF-8. Using the UTF-8 encoding scheme, each character uses between 1 and 6 bytes (the current definition of UTF-8, RFC 3629, restricts this to at most 4). Hence, such characters are often called multi-byte characters. The scheme uses a continuation principle whereby some characters are encoded in just one byte, while other characters are represented by a lead byte followed by one or more continuation bytes. The traditional ASCII characters 0-127 use the low 7 bits of a single byte, so if you are representing an English text, the ASCII and UTF-8 encodings are the same.
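As a rough illustration of the variable-width idea, the following Python sketch counts the bytes a single character needs in each of the three encodings. The helper name encoded_sizes and the sample characters are my own choices for illustration, not anything taken from a particular platform's API.

```python
# Sketch: how many bytes one character needs in each Unicode encoding.
# encoded_sizes is a hypothetical helper introduced only for this example.

def encoded_sizes(ch):
    """Return (UTF-8, UTF-16, UTF-32) byte counts for a single character."""
    return (
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-le")),  # "-le" avoids the 2-byte byte-order mark
        len(ch.encode("utf-32-le")),  # likewise, no 4-byte byte-order mark
    )

# A, e-acute, euro sign, and a character beyond the first 65,536 code points
samples = ["A", "\u00e9", "\u20ac", "\U00010348"]
for ch in samples:
    u8, u16, u32 = encoded_sizes(ch)
    print(f"U+{ord(ch):04X}: UTF-8 {u8}, UTF-16 {u16}, UTF-32 {u32} bytes")

# Plain English text is byte-for-byte identical in ASCII and UTF-8.
same = "Sizing a character".encode("ascii") == "Sizing a character".encode("utf-8")
print("ASCII and UTF-8 bytes identical:", same)
```

The ASCII letter takes one byte in UTF-8, while the other samples take two, three and four bytes respectively; UTF-32 is always four bytes per character.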

Any character assigned a place in the Unicode standard will be the same in UTF-8, UTF-16 and UTF-32 even if the bit pattern looks different. Of course, if you encode a message using UTF-8 and then try to interpret it as UTF-16 you are not going to see the original message.
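A small Python sketch can make this concrete. It assumes nothing beyond the standard codecs; the sample strings are arbitrary choices for illustration.

```python
# Sketch: the same characters have different byte patterns in each encoding,
# and decoding with the wrong encoding does not give back the original text.

text = "na\u00efve"  # "naive" written with a diaeresis on the i

utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-le")
utf32 = text.encode("utf-32-le")

# Three different byte sequences...
print(len(utf8), utf8)
print(len(utf16), utf16)
print(len(utf32), utf32)

# ...which all decode back to the same characters when the encoding matches.
round_trips = [
    utf8.decode("utf-8"),
    utf16.decode("utf-16-le"),
    utf32.decode("utf-32-le"),
]
print(all(t == text for t in round_trips))

# Interpreting UTF-8 bytes as UTF-16 pairs up unrelated bytes.
message = "Hello!"
garbled = message.encode("utf-8").decode("utf-16-le")
print(repr(garbled))
```

The six UTF-8 bytes of "Hello!" are read as three 16-bit units, so the result is three unrelated characters rather than the original message.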

Characters encoded as ASCII or UTF-8 are said to be narrow – in C/C++ they are referenced with the char data type. Conversely, a wide character (UTF-16 or UTF-32) uses more than one byte but all characters are the same size – for these C/C++ uses the wchar_t data type.

So the answer to the original question depends on which encoding you are using: wchar_t on NT gives 512 characters per KB while on Solaris you have just 256; with a multi-byte encoding, on any platform, the answer depends on the text.
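The arithmetic can be sketched in Python. The function chars_that_fit is a hypothetical helper written for this example; it repeats a sample text and counts how many whole characters fit into a 1024-byte budget under a given encoding.

```python
# Sketch: how many characters fit into 1 KB under different encodings.
# chars_that_fit is a hypothetical helper, not a library function.
import itertools

def chars_that_fit(text, encoding, budget=1024):
    """Count whole characters of repeated `text` that fit in `budget` bytes."""
    if not text:
        raise ValueError("text must not be empty")
    used = 0
    count = 0
    for ch in itertools.cycle(text):
        size = len(ch.encode(encoding))
        if used + size > budget:
            return count
        used += size
        count += 1

print(chars_that_fit("a", "utf-16-le"))      # 16-bit wide characters
print(chars_that_fit("a", "utf-32-le"))      # 32-bit wide characters
print(chars_that_fit("a", "utf-8"))          # ASCII-only text in UTF-8
print(chars_that_fit("\u20ac", "utf-8"))     # euro signs, 3 bytes each in UTF-8
```

For text made only of ASCII letters, UTF-8 fits 1024 characters; for a run of euro signs it fits only 341, which is the sense in which the multi-byte answer depends on the text.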

Now this is not the end of the matter: not all multi-byte character sets are Unicode UTF-8.

Microsoft traditionally supports multi-byte character sets which are not UTF-8. The two schemes are similar but the Microsoft one is actually called Double Byte Character Set (although documentation usually refers to it as multi-byte) and it represents characters using either one or two bytes. This scheme is closely linked to code pages and unfortunately, this means that the same numeric sequence could map to different strings using different code pages.
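To see the code page problem in practice, here is a short Python sketch using the cp1252 (Western European), cp1251 (Cyrillic) and cp932 (Japanese, double byte) codecs that ship with Python. The byte values are picked purely for illustration.

```python
# Sketch: identical bytes decode to different text under different code pages.

one_byte = bytes([0xE9])
western = one_byte.decode("cp1252")   # Western European code page
cyrillic = one_byte.decode("cp1251")  # Cyrillic code page
print(repr(western), repr(cyrillic))

# In a double byte code page, some byte values act as lead bytes and
# combine with the following byte to form a single character.
two_bytes = bytes([0x93, 0xFA])
japanese = two_bytes.decode("cp932")     # one character from two bytes
as_western = two_bytes.decode("cp1252")  # two unrelated single-byte characters
print(len(japanese), len(as_western))
```

Without knowing which code page produced the bytes, there is no way to tell which of these readings was intended.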

Now I said “traditionally” because this scheme is employed for the MS-DOS, 16-bit Windows and Windows 9x families, where there is no Unicode support – not even UTF-8. NT-based systems, including XP, use UTF-16 and support UTF-8 and double byte character sets. Because UTF-8 is used on the web, some support has been added to Internet Explorer.

Around the time that the Unicode Consortium was defining its character set, the International Organization for Standardization (ISO) set about the same task. Luckily the two joined forces and ISO UCS (Universal Character Set – ISO standard 10646) is the Unicode standard by another name. However, name differences persist, so UCS-2 corresponds closely to UTF-16 and UCS-4 to UTF-32.

That is a very, very brief introduction to the subject of character sets. Next month I’ll continue the subject.

Editorial

By Reg. Charney

The infection known as Windows XP

Microsoft has now launched the latest version of Windows. From a marketing view, it has landed with a thud. It hasn’t generated any real excitement. By contrast, all technical reports say it is pretty good. Based on my experience with its predecessor, Windows 2000, Windows XP has reached an acceptable level of reliability for most purposes.

However, I won’t let it on any of my machines. Because of its licensing and activation requirements, I can only treat it as a virus infection. It is a big piece of software that will self-destruct in 30 days if you don’t get permission from Big Brother to use it — even though you have paid through the nose for the privilege of letting Microsoft control your machine and software for the foreseeable future. Realize that if you don’t give up your right to privacy, all your data, third party software, and time invested in setting up the system will be lost. Tell me how that is different from the most destructive of the computer viruses out there?

Besides the activation requirement, there is also the pernicious Passport facility that is administered by Microsoft. It is meant to be a convenience: once you store your name, address, contact information, bank account details, credit card number and expiry dates, medical data, merchant information, etc. on Microsoft’s central system, you can transact business on the net very easily. All it costs is your identity and privacy. Keep in mind that even if Microsoft does not use or sell your name, your profile boxes you in. For example, given your bank account activity and home address, mortgage companies may determine that you are a bad risk. Given your collection of medical contacts, you may be judged a high risk for medical insurance. If you believe this is fantasy, you should know that this is how market research firms characterize market segments and how banks used to do “redlining” that limited mortgages based on where you lived.

Book Review

By Allan Kelly

Two books I’ve been dipping into

I’ve recently been dipping into two books of essays: one is a well-known modern classic, Constantine On Peopleware; the other is a new book of old essays, Software Fundamentals: Collected Papers of David Parnas.

Anyone seriously involved with software development should make time and space for each. Both are easy to dip into when you have 10 (Constantine) or 20 (Parnas) minutes to spare.

Another thing they have in common is that both authors draw from experience outside the software industry. This reminds us that we in the software industry are not living in a bubble: creativity, teamwork, project management and engineering are all subjects which have a long history and from which we can draw.

When we speak of using engineering principles in our software, what are these principles? I can’t buy a book on engineering; the field is too big. There are books on mechanical engineering, process engineering, chemical engineering, marine engineering and many more. What are the principles we are dealing with in software engineering?

Parnas’ essays come from nearly 30 years of work. Some introduce concepts that are now taken for granted but in 1970 were radical! It is also worth noting how little changes: “On the design and development of program families” was written in 1976, but 25 years later we are still struggling to do this.

Neither author ignores the wider world. Constantine uses his experience as a family therapist to deliver insights on team dynamics. Parnas’ writing on SDI is as relevant today as when it was written.

Trends

By Ali Çehreli

Back to decline

Last month was positively different; this month is not. The drop in the job market had been slowing down for a number of months, a possible indication of a comeback. Last month's figures were encouraging because some of the jobs we monitor had actually shown improving numbers. Unfortunately the drop is back. This is more in sync with the now pronounced recession that the US economy is in.

There is not much change in the trends: Windows 2000 and Linux jobs among the platforms, and ASIC jobs among the technologies, continue to be the most trendy. We have started to monitor Windows XP jobs as well, but the figures are still too small. I will include Windows XP jobs next month with three data points.

The bigger drop in ASIC jobs compared to the other technologies is notable (Figure 1). As evidence, the small ASIC company I work for went through a layoff just two weeks ago. Similar to the ASIC jobs, Windows 2000 device driver jobs have experienced a bigger drop (Figure 2).

Everyone believes that the economy will recover but no one knows when. In this environment, many companies switch to a survival mode where they first reduce headcount to cut expenses. Some engineers are lucky to find jobs at consultancies, but only if they bring their own business. Temporary foreign workers are the worst affected. They need to wait for the permanent residents to fill the already scarce positions. When to make the call and return to their home countries is a tough decision they have to make.