|
By Allan Kelly
How many characters can you fit into 1Kb? Sounds a bit like “how
many angles can you fit on the head of a pin?” Well, it seems you
could spend equally long arguing either question but the former is
much more important for software engineers.
You’ve probably already guessed that I’m going to talk about internationalization,
but I don’t want to talk about the why and how of it. I just want
to look at characters. This is a massive topic in its own right and
all I can hope to do is give you a starting point.
Most programs make do with an 8-bit character set built on ASCII;
indeed, for many 7 bits is quite ample, but it’s more convenient to
work in 8-bit bytes. However, for internationalization 256 characters
is simply not enough.
This is where Unicode becomes interesting. The Unicode standard
defines names and numbers for thousands of characters and, very
conveniently, the first 128 characters map directly to the ASCII
standard. Most Windows NT programmers are probably fairly familiar
with Unicode and the wchar_t type that specifies a 16-bit character.
However, this is simply one encoding of Unicode, known as UTF-16 (UCS
Transformation Format, where UCS is Universal multiple-octet coded
Character Set, for those that want to know). Solaris support for
Unicode is provided with a 32-bit wchar_t and the UTF-32 encoding.
There is also an 8-bit encoding for Unicode known as UTF-8. Using the
UTF-8 encoding scheme, each character uses between 1 and 4 bytes (the
original definition allowed up to 6). Hence, such characters are often
called multi-byte characters. The scheme uses a continuation principle
whereby some characters are encoded in just one byte, while other
characters are represented by a lead byte followed by one or more
continuation bytes. The traditional ASCII characters 0-127 use the
first 7 bits of the first byte, so if you are representing an English
text, the ASCII and UTF-8 encodings are the same.
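To make the continuation principle concrete, here is a minimal sketch
of a UTF-8 encoder. The function name utf8_encode is my own invention
for illustration, not a standard library call, and the sketch assumes
the current four-byte limit, so the five- and six-byte forms of the
original scheme are left out.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Encode one Unicode code point as UTF-8 (1 to 4 bytes).
// Hypothetical helper for illustration; not part of any standard library.
std::string utf8_encode(std::uint32_t cp)
{
    assert(cp <= 0x10FFFF);              // highest Unicode code point
    assert(cp < 0xD800 || cp > 0xDFFF);  // surrogates are not characters

    std::string out;
    if (cp < 0x80) {                     // 0xxxxxxx: plain ASCII, one byte
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {             // 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {           // 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                             // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

Calling utf8_encode(0x41) gives the single ASCII byte for 'A', while
utf8_encode(0x20AC), the euro sign, gives the three bytes E2 82 AC:
one lead byte followed by two continuation bytes.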
Any character assigned a place in the Unicode standard will be
the same in UTF-8, UTF-16 and UTF-32 even if the bit pattern looks
different. Of course, if you encode a message using UTF-8 and then
try to interpret it as UTF-16 you are not going to see the original
message.
Characters encoded as ASCII or UTF-8 are said to be narrow; in C/C++
they are referenced with the char data type. Conversely, a wide
character (UTF-16 or UTF-32) uses more than one byte but all
characters are the same size; for these C/C++ uses the wchar_t data
type.
So the answer to the original question depends on which characters
you are encoding: a 16-bit wchar_t on NT gives 512 characters per Kb
(1024 bytes at 2 bytes each), while on Solaris, with a 32-bit
wchar_t, you have just 256; with a multi-byte encoding, on any
platform, the answer depends on the text.
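The arithmetic can be sketched in a few lines. Both function names
below are my own, chosen for illustration; char16_t and char32_t stand
in for the 16-bit and 32-bit wchar_t of the two platforms, on the
assumption that they occupy 2 and 4 bytes as they do on common
systems. The second function counts characters in UTF-8 text by
skipping continuation bytes, and it assumes the input is well formed.

```cpp
#include <cstddef>
#include <string>

// Number of fixed-width characters of type CharT that fit in 1Kb.
template <typename CharT>
constexpr std::size_t chars_per_kb()
{
    return 1024 / sizeof(CharT);
}

// Number of characters in a well-formed UTF-8 string: every byte that
// is not a continuation byte (bit pattern 10xxxxxx) starts a character.
std::size_t utf8_length(const std::string& s)
{
    std::size_t count = 0;
    for (unsigned char byte : s) {
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For fixed-width characters the answer is known at compile time; for
UTF-8 you have to look at the text, which is exactly why the answer
depends on what you are storing.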
Now this is not the end of the matter: not all multi-byte
character sets are Unicode UTF-8.
Microsoft traditionally supports multi-byte character sets which are
not UTF-8. The two schemes are similar but the Microsoft one is
actually called Double Byte Character Set (although documentation
usually refers to it as multi-byte) and it represents characters
using either one or two bytes. This scheme is closely linked to code
pages and, unfortunately, this means that the same numeric sequence
could map to different strings using different code pages.
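The code page problem is easiest to see with single-byte code pages
rather than a full double-byte set. The sketch below is a toy, not a
real conversion routine: decode_byte and the CodePage enum are names
I have made up, and beyond the ASCII range the table covers just one
byte value, 0xE9, as three well-known code pages define it.

```cpp
#include <cstdint>

// Three real single-byte code pages; the enum itself is hypothetical.
enum class CodePage { cp437, cp1251, cp1252 };

// Return the Unicode code point that a byte denotes under a code page.
// Toy table: only ASCII and the byte 0xE9 are covered; anything else
// returns 0 to mean "not tabulated in this sketch".
std::uint32_t decode_byte(unsigned char byte, CodePage page)
{
    if (byte < 0x80) {
        return byte;                       // ASCII agrees in all three
    }
    if (byte != 0xE9) {
        return 0;
    }
    switch (page) {
    case CodePage::cp437:  return 0x0398;  // Greek capital theta
    case CodePage::cp1251: return 0x0439;  // Cyrillic small short i
    case CodePage::cp1252: return 0x00E9;  // Latin small e with acute
    }
    return 0;
}
```

The same byte comes back as three different characters, so a file
written under one code page and read under another silently changes
its text.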
Now I said traditionally because this scheme is employed for the
MS-DOS, 16-bit Windows and Windows 9x families, where there is no
Unicode support, not even UTF-8. NT-based systems, including XP,
use UTF-16 and support UTF-8 and double-byte character sets. Because
UTF-8 is used on the web, some support has been added to Internet
Explorer.
Around the time that the Unicode Consortium was defining its
character set, the International Organization for Standardization
(ISO) set about the same task. Luckily the two joined forces and ISO
UCS (Universal Character Set, ISO standard 10646) is the Unicode
standard by another name. However, name differences persist, so UCS-2
is, roughly, UTF-16 and UCS-4 is UTF-32.
That is a very brief introduction to the subject of character sets;
next month I’ll continue the subject.
By Reg. Charney
Microsoft has now launched the latest version of Windows.
From a marketing view, it has landed with a thud. It hasn’t
generated any real excitement. By contrast, all technical reports
say it is pretty good. Based on my experience with its predecessor,
Windows 2000, Windows XP has reached an acceptable level of
reliability for most purposes.
However, I won’t let it on any of my machines. Because of its
licensing and activation requirements, I can only treat it as a
virus infection. It is a big piece of software that will
self-destruct in 30 days if you don’t get permission from Big
Brother to use it — even though you have paid through the nose for
the privilege of letting Microsoft control your machine and software
for the foreseeable future. Realize that if you don’t give up your
right to privacy, all your data, third party software, and time
invested in setting up the system will be lost. Tell me how that is
different from the most destructive of the computer viruses out
there?
Besides the activation requirement, there is also the pernicious
Passport facility that is administered by Microsoft. It is meant to
be a convenience — once you store your name, address, contact
information, bank account details, credit card number and expiry
dates, medical data, merchant information, etc. on Microsoft’s
central system, you can transact business on the net very easily.
All it costs is your identity and privacy. Keep in mind that even if
Microsoft does not use or sell your name, your profile boxes you in.
For example, given your bank account activity and home address,
mortgage companies may determine that you are a bad risk. Given your
collection of medical contacts, you may be a high risk for medical
insurance. If you believe this is fantasy, you should know that this
is how market research firms characterize market segments and how
banks used to do “redlining” that limited mortgages based on
where you lived.
By Allan Kelly
I’ve recently been dipping into two books of essays. One is a
well-known modern classic, Constantine On Peopleware; the
other is a new book of old essays, Software Fundamentals :
Collected Papers of David Parnas.
Anyone seriously involved with software development should make
time and space for each. Both are easy to dip into when you have 10
(Constantine) or 20 (Parnas) minutes to spare.
Also in common is that both authors draw from experience outside
the software industry. This reminds us that we in the software
industry are not living in a bubble: creativity, team work, project
management, engineering are all subjects which have a long history
and from which we can draw.
When we speak of using engineering principles in our
software, what are these principles? I can’t buy a book on
engineering; the field is too big. There are books on mechanical
engineering, process engineering, chemical engineering, marine
engineering and many more. What are the principles we are dealing
with in software engineering?
Parnas’ essays come from nearly 30 years of work. Some
introduce concepts now taken for granted but which in 1970 were radical!
It is also worth noting how little changes: On the design and
development of program families was written in 1976, but 25
years later we are still struggling with doing this.
Neither author ignores the wider world. Constantine uses his
experience as a family therapist to deliver insights on team
dynamics. Parnas’ writing on SDI is as relevant today as when it was written.
References:
- Constantine on Peopleware : Larry Constantine, Simon &
Schuster 1995 ISBN: 0-13-331976-8
- Software Fundamentals : Collected Papers of David L. Parnas :
Edited by Daniel M. Hoffman and David M. Weiss, Addison-Wesley
2001 ISBN: 0-201-70369-6
By Ali Çehreli
Last month was positively different; this month is not. The drop
in the job market has been slowing down for a number of months as a
possible indication of a comeback. Last month's figures were
encouraging because some of the jobs we monitor had actually shown
improving numbers. Unfortunately, the drop is back. This is more in
sync with the now pronounced recession that the US economy is in.
There is not much change in the trends: Windows 2000 and Linux
jobs among the platforms, and ASIC jobs among the technologies,
continue to be the most trendy. We have started to monitor Windows XP
jobs as well, but the figures are still too small. I will include
Windows XP jobs next month, when there are three data points.
The bigger drop in the ASIC jobs compared to the other
technologies is notable (Figure 1). As evidence, the small ASIC
company I work for went through a layoff just two weeks ago.
Similar to the ASIC jobs, Windows 2000 device driver jobs have
experienced a bigger drop (Figure 2).


Everyone believes that the economy will recover but no one knows
when. In this environment, many companies switch to a survival mode
where they first reduce headcount to cut expenses. Some engineers
are lucky to find jobs at consultancies, but only if they bring
their own business. Temporary foreign workers are the worst
affected. They need to wait for the permanent residents to fill the
already scarce positions. When to make the call and return to their
home countries is a tough decision they have to make.
|