Nick wrote:
> Brief history. Apologies for both egg-sucking and
> gross-oversimplifications.
Just to add some more specific details to the mix...
Originally, telegraphs (which we on this newsgroup are all old enough
to still recall) used one of a few different varieties of Morse code.
The design of Morse code, with its array of different lengths of
symbols, and with some letters beginning with symbols that correspond
to whole other letters,[1] makes it relatively difficult for primitive
electromechanical devices to decode it. (Trained humans, on the other
hand, have the right sort of computational equipment to just read it
off without even thinking about it.) So when increasing volumes made
it necessary to replace human telegraph transcribers with mechanical
"teleprinters", a simpler code was needed. The earliest common code
was five symbols -- we'd now say "bits", but in data-communications
jargon they were "mark" and "space" rather than "one" and "zero" --
and was invented by Emile Baudot in 1870; it's now known as Baudot
code or "International Telegraph Alphabet No. 1". A couple of
variants of this code were developed, and the variant known as ITA2 is
still in limited use on radio today.
A five-bit code can of course only represent 32 characters, which is
not enough to have a complete set of letters and digits, never mind
multiple cases, accents, punctuation marks, etc. In Baudot, probably
influenced by the design of mechanical typewriters, two "shift codes"
are defined, LTRS and FIGS, which effectively expand the character
set to 62 characters. (That's still not enough to have both upper and
lower case!)
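
To make the shift-code mechanism concrete, here is a minimal decoding
sketch in C. The handful of table entries is illustrative only, not a
faithful ITA2 chart; the two shift codes carry the values usually
quoted for ITA2 (11111 and 11011), and the point is the persistent
shift state that LTRS and FIGS toggle:

    #include <stdio.h>

    /* Illustrative only: a shift-state decoder for a 5-bit code.
       The tables below are NOT a faithful ITA2 chart, just enough
       entries to show how the LTRS/FIGS mechanism works. */

    #define LTRS 0x1F               /* "letters shift" */
    #define FIGS 0x1B               /* "figures shift" */

    static const char letters[32] = {
        [0x03] = 'A', [0x0E] = 'C', [0x01] = 'E',
    };
    static const char figures[32] = {
        [0x03] = '-', [0x0E] = ':', [0x01] = '3',
    };

    static void decode(const unsigned char *codes, int n)
    {
        const char *table = letters;        /* start in letters case */
        int i;

        for (i = 0; i < n; i++) {
            unsigned char c = codes[i] & 0x1F;
            if (c == LTRS)
                table = letters;            /* persists until changed */
            else if (c == FIGS)
                table = figures;
            else if (table[c])
                putchar(table[c]);
        }
        putchar('\n');
    }

    int main(void)
    {
        /* A, C, E, then FIGS: the code that meant E now means 3 */
        unsigned char msg[] = { 0x03, 0x0E, 0x01, FIGS, 0x01 };
        decode(msg, sizeof msg);            /* prints "ACE3" */
        return 0;
    }

A dropped or garbled shift code therefore mangles everything up to the
next one -- one reason shift-state codes are painful for machines to
process.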
By the late 1950s, this situation was clearly unsatisfactory: many
users wanted mixed-case messages, and the system of shift codes made
it unnecessarily complicated for early computers to process text. The
European Computer Manufacturers' Association introduced a six-bit code
in 1960, known as ECMA 6, and IBM had its own six-bit code as well,
but the American Standards Association's X3 committee, which was
sponsored by the Business Equipment Manufacturers' Association,
preferred a seven-bit code. (They had existing prior art in FIELDATA
code on the Univac 1100, and Western Electric's Teletype
Corp. subsidiary was already making teleprinters using an
AT&T-proprietary seven-bit character set.)
The ultimate standard, ASA X3.4-1963, was the first version of ASCII.
It was very similar to the final version of ASCII, but had a few
differences that were to prove important later on; most notably, it
had a full set of (teletype-network-inspired) control characters,
including a "newline" code distinct from both "carriage return" and
"line feed". The Multics project, which began at MIT Project MAC[2]
in 1964, was based from the start on the 1963 standard, and chose to
use the newline character as its single-character line terminator in
text files. (The terminal interface would translate this to CR/LF
with appropriate delays when talking to a device that required it,
like a Teletype machine.)
In 1968, ASCII was revised, changing some characters (notably, ^, _,
and | took their current meanings), and eliminating some control
characters, including the newline, which was combined with the
line-feed character. The Multics team -- spread over three different
organizations, MIT, Honeywell, and AT&T -- had a "flag day" to convert
to the new character set, including the change of newline character.
AT&T then dropped out of the Multics project, but Ken Thompson
continued the Multics tradition of the single-character newline when
he went on to create UNIX on the (much smaller) departmental PDP-7.
ASCII was also designated as International Telegraph Alphabet No. 5.
The 1968 version of ASCII also gave users the option of two different
meanings for the ' and ` characters, again with far-reaching
consequences: ' was the apostrophe, but could also be either an acute
accent or a right single quotation mark; ` could be either a grave
accent or a left single quotation mark. Two important computer
typesetting systems, AT&T's troff and Donald Knuth's TeX, chose the
quotation-mark meaning. When John Warnock and Chuck Geschke created
PostScript, they naturally used the quotation-mark interpretation as
well, which is why the PostScript StandardEncoding puts quotation
marks at those positions, although by that time the quotation-mark
meaning was already officially deprecated. The X Window System
includes bitmap versions of the seven standard PostScript fonts,
contributed by Adobe, which until the late 1990s reflected this choice
of interpretation. (This made it much easier to read shell scripts
and TeX source files than it is today.)
In the early 1970s, the ISO got involved and adopted its own version
of ASCII as ISO 646. ECMA also ratified the seven-bit ISO standard as
a revision of its own ECMA 6. Because ASCII did not include
characters (particularly accented characters) required for many
non-English languages, many of the punctuation characters were
designated for replacement by national characters upon adoption by
national standards bodies; these included #, $, @, [, \, ], ^, `, {,
|, }, and ~. The ASCII assignments, with the exception of $, were
used by the "International Reference Version" -- the dollar sign was
replaced with a little-used "international currency symbol" that looks
like a
lozenge. In 1983, the final version of ASCII ceded the field to the
ISO 646 framework and character assignments (including the use of '
and ` as accents rather than quotation marks). ISO 646 survived into
the early 1990s, resulting in a horrible botch in the ISO C standard
known as "trigraphs" (a mechanism to allow three characters from the
invariant set of ISO 646 to be used to represent one of the missing
punctuation characters, all but one of which are fundamental to C's
syntax).
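
For the curious, the nine trigraphs are ??=, ??(, ??), ??<, ??>, ??/,
??', ??!, and ??-, standing for #, [, ], {, }, \, ^, |, and ~
respectively. A small sketch (note that modern compilers tend not to
translate trigraphs by default; GCC, for instance, wants -trigraphs or
a strict -std=c89-style mode):

    /* "hello, world", with trigraphs substituting for the
       national-use punctuation. Trigraphs are replaced very early
       in translation, even inside string literals. */

    ??=include <stdio.h>

    int main(void)
    ??<
        printf("hello, world??/n");     /* becomes "hello, world\n" */
        return 0;
    ??>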
Meanwhile, it had become clear that eight bits was going to be the
standard character size for all computer and communications devices.
Two new ISO standards paved the way: ISO 2022 defined a set of escape
sequences (introduced by the "escape" control character) which allowed
users to change the characters represented by different segments of
the eight-bit range; this was necessary to support hitherto-ignored
Asian character repertoires such as the three Japanese character
systems. ISO 2022 was not widely adopted outside the Far East, where
it still remains in limited use (particularly in Japan) today. ISO
8859, on the other hand, defined a whole family of eight-bit character
sets, which included additional typesetting characters like the
non-breaking space, and most importantly improved coverage of accented
characters, allowing for the retirement of national ISO 646 variants
and a richer universal set of punctuation marks (including all of the
ones required for C programming).
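
To see the ISO 2022 escape-sequence machinery in action, here is a
sketch using POSIX iconv (assuming the local iconv -- GNU libc's, for
example -- provides an ISO-2022-JP converter): it converts a two-kanji
UTF-8 string and dumps the resulting bytes, in which the ESC $ B and
ESC ( B shift sequences are plainly visible.

    #include <stdio.h>
    #include <string.h>
    #include <iconv.h>

    int main(void)
    {
        /* UTF-8 for U+65E5 U+672C ("Japan") */
        char in[] = "\xE6\x97\xA5\xE6\x9C\xAC";
        char out[64];
        char *inp = in, *outp = out, *p;
        size_t inleft = strlen(in), outleft = sizeof out;

        iconv_t cd = iconv_open("ISO-2022-JP", "UTF-8");
        if (cd == (iconv_t)-1) {
            perror("iconv_open");
            return 1;
        }
        if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1) {
            perror("iconv");
            return 1;
        }
        /* flush: emits the escape back to the initial (ASCII) state */
        iconv(cd, NULL, NULL, &outp, &outleft);
        iconv_close(cd);

        for (p = out; p < outp; p++)
            printf("%02X ", (unsigned char)*p);
        printf("\n");   /* expect: 1B 24 42 ... 1B 28 42 */
        return 0;
    }

(On systems where iconv lives in a separate library rather than in
libc, add -liconv when linking.)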
But eight bits still wasn't enough to cover all European languages in
a single character set, never mind Far Eastern languages, so IT
companies joined together to take the next step, a 16-bit character
set called Unicode. With sixteen bits and some creative effort,
nearly all of the world's languages could be accommodated in the 65,000
characters of Unicode. In order to make East Asian languages fit,
however, the Unicode designers did something called "Han unification",
which gave the same character code to similar-looking Chinese-derived
characters in Chinese, Japanese, and pre-reform Korean writing
systems. This greatly offended the Japanese, who announced that they
would not be adopting any new character set which did not properly
represent Japanese characters as distinct from their Chinese
counterparts.[3] The Japanese objections made it impossible for the
ISO committee which was developing a new character-set standard to
adopt the Unicode work unchanged. As a compromise, the new ISO
standard -- ISO 10646 -- was defined as a 32-bit superset of the
16-bit Unicode standard, with the first 65,536 characters defined as
the "Basic Multilingual Plane"; however, no separate series of
Japanese characters (kanji) was ever added. (There is a very large
block of additional "Unified CJK" characters in plane 2.)
Of course, with a 16- or 32-bit character standard, it's necessary to
define an eight-bit (or in some cases seven-bit) representation,
which will perforce either have persistent shift states, like Baudot,
or variable-length characters. The most popular encoding is UTF-8,
which takes the latter choice; it also has the properties that plain
ASCII text is encoded unchanged, and that when UTF-8 text is misread
as ISO 8859-1, the characters from the upper half of ISO 8859-1 show
up as short, recognizable two-character sequences (usually beginning
with an accented capital A).
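
Here is a minimal sketch of that variable-length scheme (my own
illustration, skipping validation of surrogates and out-of-range
values): the lead byte's top bits announce how many bytes follow, and
every continuation byte has the form 10xxxxxx.

    #include <stdio.h>

    /* Illustrative UTF-8 encoder: split one code point into 1-4
       bytes. No validation -- it's just a sketch. */
    static int utf8_encode(unsigned long cp, unsigned char out[4])
    {
        if (cp < 0x80) {                /* ASCII passes through */
            out[0] = cp;
            return 1;
        } else if (cp < 0x800) {        /* 110xxxxx 10xxxxxx */
            out[0] = 0xC0 | (cp >> 6);
            out[1] = 0x80 | (cp & 0x3F);
            return 2;
        } else if (cp < 0x10000) {      /* 1110xxxx + 2 continuations */
            out[0] = 0xE0 | (cp >> 12);
            out[1] = 0x80 | ((cp >> 6) & 0x3F);
            out[2] = 0x80 | (cp & 0x3F);
            return 3;
        } else {                        /* 11110xxx + 3 continuations */
            out[0] = 0xF0 | (cp >> 18);
            out[1] = 0x80 | ((cp >> 12) & 0x3F);
            out[2] = 0x80 | ((cp >> 6) & 0x3F);
            out[3] = 0x80 | (cp & 0x3F);
            return 4;
        }
    }

    int main(void)
    {
        unsigned char buf[4];
        int i, n = utf8_encode(0xE9, buf);  /* U+00E9, e with acute */

        for (i = 0; i < n; i++)
            printf("%02X ", buf[i]);
        printf("\n");                       /* prints: C3 A9 */
        return 0;
    }

Viewed as ISO 8859-1, those two bytes read as a capital A with a tilde
followed by a copyright sign -- hence the accented capital A that
gives mangled UTF-8 away.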
How did it get to be 5:30 already? I should have gone to bed three
hours ago!
-GAWollman
[1] To use the technical term, it's not a "prefix code". For example,
the letter S (Morse "dit dit dit") can also be read as the string EEE
(one E is a Morse "dit"); a human receiver can distinguish them by
learning the remote sender's characteristic inter-letter and
inter-word pauses, but that was too difficult for early
electromechanical equipment.
[2] Full disclosure: my current employer, modulo two name changes.
[3] This is not completely ridiculous: the unified characters are
identical in all of their character properties, so nothing in the
text itself tells software whether a unified Han character should be
presented with Chinese or Japanese glyphs; multilingual
Chinese/Japanese (or
traditional Korean/Japanese) texts would require extra markup to
display the correct characters.
--
Garrett A. Wollman | What intellectual phenomenon can be older, or more oft
***@bimajority.org| repeated, than the story of a large research program
Opinions not shared by| that impaled itself upon a false central assumption
my employers. | accepted by all practitioners? - S.J. Gould, 1993