Query on UTF-32 encodings for letters

Robert Dewar

2005-01-11 14:28:29 UTC

Ada 2005 requires full support for all planes of UTF-32
encoding, including the use of letters in identifiers,
including also proper upper lower case equivalence.

All this information is obtainable from the 10646 standard,
but it is non-trivial to generate the predicates Is_Letter,
and the function To_Lower.

I wondered if anyone knew of GPL'ed code that did either of
these two functions before I reinvent the wheel :-)

Robert Dewar

Paul Koning

2005-01-11 14:56:18 UTC

Robert> Ada 2005 requires full support for all planes of UTF-32
Robert> encoding, including the use of letters in identifiers,
Robert> including also proper upper lower case equivalence.

Robert> All this information is obtainable from the 10646 standard,
Robert> but it is non-trivial to generate the predicates Is_Letter,
Robert> and the function To_Lower.

Robert> I wondered if anyone knew of GPL'ed code that did either of
Robert> these two functions before I reinvent the wheel :-)

Libiconv and libidn are relevant -- both GNU projects
(http://www.gnu.org/software/libiconv/ and
http://www.gnu.org/software/libidn/). There's also the Stringprep
spec, RFC 3454.

paul

Neil Booth

2005-01-12 00:17:57 UTC

Robert Dewar wrote:-

Post by Robert Dewar
Ada 2005 requires full support for all planes of UTF-32
encoding, including the use of letters in identifiers,
including also proper upper lower case equivalence.
All this information is obtainable from the 10646 standard,
but it is non-trivial to generate the predicates Is_Letter,
and the function To_Lower.
I wondered if anyone knew of GPL'ed code that did either of
these two functions before I reinvent the wheel :-)

You should have a look at Tom Lord's hackerlab; it's GPLed
and part of the GNU project I think. At a glance it seems to
have a very clean unicode implementation.

Neil.

Robert Dewar

2005-01-12 01:15:11 UTC

Post by Neil Booth
You should have a look at Tom Lord's hackerlab; it's GPLed
and part of the GNU project I think. At a glance it seems to
have a very clean unicode implementation.

Can you tell me where to glance, when I went to Tom's web site
there were many broken links, and I could not find Unicode stuff.

Neil Booth

2005-01-12 01:55:39 UTC

Robert Dewar wrote:-

Post by Robert Dewar

Post by Neil Booth
You should have a look at Tom Lord's hackerlab; it's GPLed
and part of the GNU project I think. At a glance it seems to
have a very clean unicode implementation.

Can you tell me where to glance, when I went to Tom's web site
there were many broken links, and I could not find Unicode stuff.

Yeah, that's annoying.

Download arch 1.3; it's in the distribution.

ftp://ftp.gnu.org/gnu/gnu-arch/tla-1.3.tar.gz

and look at the directories hackerlab/uni*.

Neil.

Geoffrey Keating

2005-01-13 00:46:53 UTC

Post by Robert Dewar
Ada 2005 requires full support for all planes of UTF-32
encoding, including the use of letters in identifiers,
including also proper upper lower case equivalence.
All this information is obtainable from the 10646 standard,
but it is non-trivial to generate the predicates Is_Letter,
and the function To_Lower.

You might consider glibc, or possibly simply use iswalpha() and towlower().

Robert Dewar

2005-01-15 05:26:28 UTC

Post by Geoffrey Keating

Post by Robert Dewar
Ada 2005 requires full support for all planes of UTF-32
encoding, including the use of letters in identifiers,
including also proper upper lower case equivalence.

You might consider glibc, or possibly simply use iswalpha() and towlower().

Well I really don't understand the implementation of iswalpha. For
example, it yields false for "FEMININE ORDINAL INDICATOR" (16#AA#)
even though the definition in the database is:

00AA;FEMININE ORDINAL INDICATOR;Ll;0;L;<super> 0061;;;;N;;;;;

Here the Ll shows that this is a lower case letter, at least that's the way
I understand the database, and thus it is allowed in Ada identifiers.
MICRO SIGN is a similar example.

At first, it looked to me like it was just testing LETTER in the
name of the symbol, but that is disproved by:

LIGATURE YIDDISH DOUBLE VAV (16#05F0#)
where the database entry is

05F0;HEBREW LIGATURE YIDDISH DOUBLE VAV;Lo;0;R;;;;;N;HEBREW LETTER DOUBLE VAV;;;;

The Lo here indicates "Letter, other", so this should also be considered
a letter and iswalpha returns True in this case.

Looks to me like I have to spin my own here :-(

or I could just use these functions and decide that discrepancies are not
that critical in these obscure cases :-)
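
For readers following along: the third field of a UnicodeData.txt record is the general category, so a category-driven Is_Letter can be sketched in a few lines. The sketch below is Python rather than Ada, uses only the two records quoted in this message, and assumes that "letter" means any category beginning with L; the exact set Ada 2005 accepts is whatever the AI specifies. The names parse_record and is_letter are made up for illustration.

```python
# Sketch: classify code points by the general category field of
# UnicodeData.txt. Only the two records quoted in the message are used.
RECORDS = """\
00AA;FEMININE ORDINAL INDICATOR;Ll;0;L;<super> 0061;;;;N;;;;;
05F0;HEBREW LIGATURE YIDDISH DOUBLE VAV;Lo;0;R;;;;;N;HEBREW LETTER DOUBLE VAV;;;;
"""

def parse_record(line):
    """Return (code point, name, general category) for one record."""
    fields = line.split(";")
    return int(fields[0], 16), fields[1], fields[2]

# Map each code point to its two-letter general category.
CATEGORY = {}
for line in RECORDS.splitlines():
    code, _name, category = parse_record(line)
    CATEGORY[code] = category

def is_letter(code):
    """Assumption: a letter is any code point whose category starts with 'L'."""
    # Code points missing from the table are treated as unassigned ("Cn").
    return CATEGORY.get(code, "Cn").startswith("L")

print(is_letter(0xAA), is_letter(0x05F0))
```

With the full database loaded instead of two records, the same lookup is locale independent by construction, which is the property being asked for here.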

Tom Tromey

2005-01-15 06:36:26 UTC

Robert> Looks to me like I have to spin my own here :-(

For gcj we have a perl script (gcc/java/gen-table.pl) that reads the
unicode database and writes out C code that is used by the lexer. We
did it by hand because we wanted to ensure that our use matched what
is in the current java spec. I don't know what Ada needs, but this
approach has worked well for us. We use something similar for libgcj;
it is more complicated, though, since the library needs more
information than the compiler.

Tom
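
The generator approach described above can be sketched roughly as follows. This is not gen-table.pl; it is a hypothetical Python outline that assumes the set of letter code points has already been extracted from the database, collapses it into (first, last) ranges, and looks a code point up by binary search. The helper names to_ranges and in_ranges are invented for this sketch.

```python
import bisect

def to_ranges(code_points):
    """Collapse a collection of code points into sorted (first, last) ranges."""
    ranges = []
    for cp in sorted(set(code_points)):
        if ranges and cp == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], cp)   # extend the current run
        else:
            ranges.append((cp, cp))            # start a new run
    return ranges

def in_ranges(ranges, cp):
    """Binary search for the last range whose first element is <= cp."""
    i = bisect.bisect_right(ranges, (cp, float("inf"))) - 1
    return i >= 0 and ranges[i][0] <= cp <= ranges[i][1]

# Illustrative input: ASCII letters plus the two Latin-1 examples from the
# thread (FEMININE ORDINAL INDICATOR and MICRO SIGN).
letters = list(range(0x41, 0x5B)) + list(range(0x61, 0x7B)) + [0xAA, 0xB5]
LETTER_RANGES = to_ranges(letters)

# A real generator would write these out as source text for the compiler.
for first, last in LETTER_RANGES:
    print(f"(16#{first:X}#, 16#{last:X}#),")
```

Range tables stay small because letters tend to be allocated in contiguous blocks, which is presumably why the tables quoted in the Ada AI are manageable.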

Robert Dewar

2005-01-15 11:38:03 UTC

Post by Tom Tromey
Robert> Looks to me like I have to spin my own here :-(
For gcj we have a perl script (gcc/java/gen-table.pl) that reads the
unicode database and writes out C code that is used by the lexer. We
did it by hand because we wanted to ensure that our use matched what
is in the current java spec. I don't know what Ada needs, but this
approach has worked well for us. We use something similar for libgcj;
it is more complicated, though, since the library needs more
information than the compiler.

Yes, generating the tables is easy (they are in fact quoted in
the relevant Ada AI, as evidence that the tables are not too
large, though the case conversion one is pretty huge -- that's
not one that C and Java have to worry about). I just wondered
if there might be something on the shelf before I put something
there myself.

Thanks for the comment. I gather that for Java, iswalpha is
also not quite right. We will probably end up with our own
stuff for Ada as well.

Tom Tromey

2005-01-16 03:00:23 UTC

Robert> though the case conversion one is pretty huge -- that's
Robert> not one that C and Java have to worry about

FWIW, libgcj does need this, since String and Character both have case
conversion methods. This is mostly table-driven, using tables derived
from the Unicode tables via a converter program, but there are some
special cases in String for weird things like esset and dotless "i" in
the Turkish locale. See libjava/scripts and
libjava/java/lang/{*String*,*Character*}.

Tom

Robert Dewar

2005-01-16 03:14:20 UTC

Post by Tom Tromey
Robert> though the case conversion one is pretty huge -- that's
Robert> not one that C and Java have to worry about
FWIW, libgcj does need this, since String and Character both have case
conversion methods. This is mostly table-driven, using tables derived
from the Unicode tables via a converter program, but there are some
special cases in String for weird things like esset and dotless "i" in
the Turkish locale. See libjava/scripts and
libjava/java/lang/{*String*,*Character*}.
Tom

Well it is possible that libgcj does exactly the right thing for
Ada over all planes, but on the other hand, it is not that difficult
to do exactly what is required for Ada, and we definitely do not need
the fold to lower case (which is where problems occur), and Ada does
NOT allow special casing of esset.

Joseph S. Myers

2005-01-16 20:43:09 UTC

Post by Robert Dewar
Well I really don't understand the implementation of iswalpha. For
example, it yields false for "FEMININE ORDINAL INDICATOR" (16#AA#)

glibc's iswalpha works for me, provided the program has called setlocale
before iswalpha and is running under a suitable locale whose definition
copies the i18n file's LC_CTYPE data (e.g. en_GB.UTF-8, not C / POSIX).

Post by Robert Dewar
At first, it looked to me like it was just testing LETTER in the

That is one thing gen-unicode-ctype.c looks at in addition to the
character class. To quote from CVS glibc, localedata/gen-unicode-ctype.c,

  return (unicode_attributes[ch].name != NULL
          && ((unicode_attributes[ch].category[0] == 'L'
               /* Theppitak Karoonboonyanan <***@links.nectec.or.th> says
                  <U0E2F>, <U0E46> should belong to is_punct. */
               && (ch != 0x0E2F) && (ch != 0x0E46))
              /* Theppitak Karoonboonyanan <***@links.nectec.or.th> says
                 <U0E31>, <U0E34>..<U0E3A>, <U0E47>..<U0E4E> are is_alpha. */
              || (ch == 0x0E31)
              || (ch >= 0x0E34 && ch <= 0x0E3A)
              || (ch >= 0x0E47 && ch <= 0x0E4E)
              /* Avoid warning for <U0345>. */
              || (ch == 0x0345)
              /* Avoid warnings for <U2160>..<U217F>. */
              || (unicode_attributes[ch].category[0] == 'N'
                  && unicode_attributes[ch].category[1] == 'l')
              /* Avoid warnings for <U24B6>..<U24E9>. */
              || (unicode_attributes[ch].category[0] == 'S'
                  && unicode_attributes[ch].category[1] == 'o'
                  && strstr (unicode_attributes[ch].name, " LETTER ")
                     != NULL)
              /* Consider all the non-ASCII digits as alphabetic.
                 ISO C 99 forbids us to have them in category "digit",
                 but we want iswalnum to return true on them. */
              || (unicode_attributes[ch].category[0] == 'N'
                  && unicode_attributes[ch].category[1] == 'd'
                  && !(ch >= 0x0030 && ch <= 0x0039))));

If what you require is a specific definition in terms of (maybe a specific
version of) the Unicode Character database rather than something
locale-dependent and so system-dependent, then indeed the system library
may be unsuitable.
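
To try the quoted expression on individual records, here is a hand transliteration into Python. It is only a sketch: the function name glibc_style_is_alpha is invented, the name and category arguments are supplied by hand instead of coming from a parsed database, and glibc's runtime iswalpha consults locale tables generated from this logic, not the logic itself.

```python
def glibc_style_is_alpha(ch, name, category):
    """Transliteration of the quoted predicate.

    ch is the code point, name is the character name (None if unassigned),
    category is the two-letter general category.
    """
    if name is None:
        return False
    return (
        # Letters, except two Thai characters treated as punctuation.
        (category.startswith("L") and ch not in (0x0E2F, 0x0E46))
        # Thai characters explicitly listed as alphabetic.
        or ch == 0x0E31
        or 0x0E34 <= ch <= 0x0E3A
        or 0x0E47 <= ch <= 0x0E4E
        or ch == 0x0345
        # Letter numbers such as Roman numerals.
        or category == "Nl"
        # "Symbol, other" characters whose name contains " LETTER ".
        or (category == "So" and " LETTER " in name)
        # Non-ASCII decimal digits count as alphabetic.
        or (category == "Nd" and not (0x30 <= ch <= 0x39))
    )

# The two characters discussed earlier in the thread.
print(glibc_style_is_alpha(0x00AA, "FEMININE ORDINAL INDICATOR", "Ll"))
print(glibc_style_is_alpha(0x05F0, "HEBREW LIGATURE YIDDISH DOUBLE VAV", "Lo"))
```

By this rule U+00AA is alphabetic, which is consistent with the observation above that the False result comes from running under the C / POSIX locale, not from the generator.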

--
Joseph S. Myers http://www.srcf.ucam.org/~jsm28/gcc/
***@polyomino.org.uk (personal mail)
***@codesourcery.com (CodeSourcery mail)
***@gcc.gnu.org (Bugzilla assignments and CCs)

Robert Dewar

2005-01-16 21:16:43 UTC

Post by Joseph S. Myers
If what you require is a specific definition in terms of (maybe a specific
version of) the Unicode Character database rather than something
locale-dependent and so system-dependent, then indeed the system library
may be unsuitable.

Right, that's what I am thinking, it is really better to have an implementation
that is locale independent, given the standard is locale independent.

Not that this is very critical I must say, I really wonder if there will be
people using wide wide characters in identifiers and getting upset if case
folding is not "correct". We will see :-)

Joseph S. Myers

2005-01-16 19:52:49 UTC

Post by Robert Dewar
Ada 2005 requires full support for all planes of UTF-32
encoding, including the use of letters in identifiers,
including also proper upper lower case equivalence.
All this information is obtainable from the 10646 standard,
but it is non-trivial to generate the predicates Is_Letter,
and the function To_Lower.

Proper case folding and caseless matching are locale-dependent. Case
conversion can also depend on context in a word as well as on locale. In
Unicode there is titlecase as well as uppercase and lowercase. I presume
there is in fact a more precise specification, with appropriate normative
references, of what exactly is required and whether there is to be
locale-dependence, at compile time or at runtime.

Although the Unicode Character Database includes various tables for case
mapping, including context and locale dependent mapping, I'm not sure
whether these are normative or informative; section 4.2 of the Unicode
Standard version 4.0 refers to them as normative, while section 5.18 says
that case itself is normative but the mappings are informative: but the
whole of chapter 5 is not normative.

--
Joseph S. Myers http://www.srcf.ucam.org/~jsm28/gcc/
***@polyomino.org.uk (personal mail)
***@codesourcery.com (CodeSourcery mail)
***@gcc.gnu.org (Bugzilla assignments and CCs)

Robert Dewar

2005-01-16 20:34:18 UTC

Post by Joseph S. Myers

Post by Robert Dewar
Ada 2005 requires full support for all planes of UTF-32
encoding, including the use of letters in identifiers,
including also proper upper lower case equivalence.
All this information is obtainable from the 10646 standard,
but it is non-trivial to generate the predicates Is_Letter,
and the function To_Lower.

Proper case folding and caseless matching are locale-dependent.

That's not true for the Ada 2005 rules, which are locale independent
and driven only by the 10646 database.

Post by Joseph S. Myers
Case conversion can also depend on context in a word as well as on locale. In
Unicode there is titlecase as well as uppercase and lowercase.

title case is allowed in Ada 2005 identifiers.

The full documentation for what the Ada 2005 AI requires can be found in

www.ada-auth.org/cgi-bin/cvsweb.cgi/AIs/AI-00285.TXT?rev=1.22

Post by Joseph S. Myers
I presume
there is in fact a more precise specification, with appropriate normative
references, of what exactly is required and whether there is to be
locale-dependence, at compile time or at runtime.

Indeed, the quoted AI is the precise specification

Post by Joseph S. Myers
Although the Unicode Character Database includes various tables for case
mapping, including context and locale dependent mapping, I'm not sure
whether these are normative or informative; section 4.2 of the Unicode
Standard version 4.0 refers to them as normative, while section 5.18 says
that case itself is normative but the mappings are informative: but the
whole of chapter 5 is not normative.

Well the Ada rules as stated are indeed normative and are based on the
unicode categorization. But Ada does not follow all the Unicode
recommendations. In particular, it does not mandate Normalization
Form KC, and instead follows the C# style of only rigorously
defining the effect of programs which are already in this
normalization form. Furthermore, Ada decided not to use
ISO/IEC TR 10176 which would be the assumed approach. The
reasons for this are discussed in the AI.

Anyway, it seems not too hard to write specific Is_Letter and
Fold_To_Upper_Case following the rules in this AI.

At this stage, I have pretty much concluded that I should spin my own
version of these routines to exactly match the Ada spec.

Thanks Joseph for your comments!

(this character stuff is a bottomless pit :-)
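
On the Normalization Form KC point in this message: a tool could check whether source text is already in that form before relying on the defined behaviour. The sketch below uses Python's standard unicodedata module; the helper name is_nfkc is invented, and the results reflect the Unicode version bundled with the Python interpreter, which is an assumption and not necessarily the 10646 edition the AI refers to.

```python
import unicodedata

def is_nfkc(text):
    """True if text is unchanged by Normalization Form KC."""
    return unicodedata.normalize("NFKC", text) == text

# Plain ASCII identifiers are already normalized.
print(is_nfkc("Counter"))

# FEMININE ORDINAL INDICATOR carries a <super> compatibility mapping to "a",
# so text containing it is not in Form KC.
print(is_nfkc("\u00aa"))

# LATIN SMALL LIGATURE FI is replaced by the two letters "f" and "i".
print(unicodedata.normalize("NFKC", "\ufb01le"))
```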

Paul Koning

2005-01-17 15:42:49 UTC

Post by Joseph S. Myers
Proper case folding and caseless matching are locale-dependent.

Robert> That's not true for the Ada 2005 rules, which are locale
Robert> independent and driven only by the 10646 database.

Then that simply means that Ada has either created a locale of its
own, or adopted one specific locale to be the one it uses.
Anglocentrism at work, perhaps?

Robert> (this character stuff is a bottomless pit :-)

It sure is.

paul

Robert Dewar

2005-01-17 18:44:40 UTC

Post by Paul Koning

Post by Joseph S. Myers
Proper case folding and caseless matching are locale-dependent.

Robert> That's not true for the Ada 2005 rules, which are locale
Robert> independent and driven only by the 10646 database.
Then that simply means that Ada has either created a locale of its
own, or adopted one specific locale to be the one it uses.
Anglocentrism at work, perhaps?

I don't think that is the case, with the full 10646 database,
every character in the database is properly categorized, and
the whole point of Wide_Wide_Character in Ada is to match the
10646 standard exactly. That is what ISO mandates, so it is
hardly a matter of Anglocentrism (note that any reference to
Unicode as a standard *is* Anglocentric :-) We are driven by
ISO 10646, not Unicode. Luckily these are essentially
completely aligned at this stage.

Note that in 10646, there is a lot of distinction between
different national characters. For instance, the Greek upper
case alpha is typographically identical to latin upper case
A, but they occupy distinct code positions. That means that
the folding rule for every character is part of the
non-locale dependent database.

Post by Paul Koning
Robert> (this character stuff is a bottomless pit :-)
It sure is.
paul

Paul Koning

2005-01-17 19:01:08 UTC

Post by Paul Koning

Post by Joseph S. Myers
Proper case folding and caseless matching are locale-dependent.

Robert> That's not true for the Ada 2005 rules, which are locale
Robert> independent and driven only by the 10646 database.

Post by Paul Koning
Then that simply means that Ada has either created a locale of its
own, or adopted one specific locale to be the one it uses.
Anglocentrism at work, perhaps?

Robert> I don't think that is the case, with the full 10646 database,
Robert> every character in the database is properly categorized, and
Robert> the whole point of Wide_Wide_Character in Ada is to match the
Robert> 10646 standard exactly. That is what ISO mandates, so it is
Robert> hardly a matter of Anglocentrism (note that any reference to
Robert> Unicode as a standard *is* Anglocentric :-) We are driven by
Robert> ISO 10646, not Unicode. Luckily these are essentially
Robert> completely aligned at this stage.

Robert> Note that in 10646, there is a lot of distinction between
Robert> different national characters. For instance, the Greek upper
Robert> case alpha is typographically identical to latin upper case
Robert> A, but they occupy distinct code positions. That means that
Robert> the folding rule for every character is part of the
Robert> non-locale dependent database.

But that is nowhere near sufficient. The issue is that case folding
rules are different for different languages/locales that use the SAME
character set. For example, there are a whole bunch of different
folding rules for Latin-1.

If 10646 defines a single set of rules, then it's part of the problem,
not part of the solution.

paul

Robert Dewar

2005-01-17 19:09:44 UTC

Post by Paul Koning
But that is nowhere near sufficient. The issue is that case folding
rules are different for different languages/locales that use the SAME
character set. For example, there are a whole bunch of different
folding rules for Latin-1.

Well in practice the folding rules for Latin-1 have been part of the
standard for ten years, so they are not about to change.

It would be interesting to know an example of what you state above.
Certainly people have been using Latin-1 to write Ada in countries
all over the world, and no one has ever found the folding rules
for identifiers to be in any way inconvenient.

There was a point in the discussion early on when JDI wanted upper
case E and lower case E-acute to match in identif

The decision in Ada is that you do not want the meaning of a program
or its legality to change in a locale dependent way. This is really
a fundamental starting point. Note that this is a radically different
issue from folding at run-time in a manner that makes sense to an
application program.

Post by Paul Koning
If 10646 defines a single set of rules, then it's part of the problem,
not part of the solution.

Well the 10646 definition provides a framework from which an acceptable
locale-independent set of folding rules can be obtained. Note that acceptable
here means acceptable to at least the ISO P-members. Indeed when it comes
to such issues in the Ada standard, this is an area where the non-english
speaking member countries take the lead.

Robert Dewar

2005-01-17 19:13:48 UTC

Robert Dewar wrote:

(try again, message got sent prematurely before)

Post by Paul Koning
But that is nowhere near sufficient. The issue is that case folding
rules are different for different languages/locales that use the SAME
character set. For example, there are a whole bunch of different
folding rules for Latin-1.

Well in practice the folding rules for Latin-1 have been part of the
standard for ten years, so they are not about to change.

It would be interesting to know an example of what you state above.
Certainly people have been using Latin-1 to write Ada in countries
all over the world, and no one has ever found the folding rules
for identifiers to be in any way inconvenient.

There was a point in the discussion early on when JDI wanted upper
case E and lower case E-acute to match in identifiers (many French
folks have the illusion that upper case letters do not have accents,
this comes from typewriter days). However, this kind of matching is
very definitely language dependent (an interesting test is whether you can
cross the letters in a crossword puzzle: in French crossword puzzles,
E-acute and E can cross, but of course A and A-with-circle in Swedish
do not cross, since they are quite different letters).

The decision in Ada is that you do not want the meaning of a program
or its legality to change in a locale dependent way. This is really
a fundamental starting point and I don't think there is anyone from
any country that would think otherwise.

Note that this is a radically different
issue from folding at run-time in a manner that makes sense to an
application program.

Post by Paul Koning
If 10646 defines a single set of rules, then it's part of the problem,
not part of the solution.

Well the 10646 definition provides a framework from which an acceptable
locale-independent set of folding rules can be obtained. Note that acceptable
here means acceptable to at least the ISO P-members. Indeed when it comes
to such issues in the Ada standard, this is an area where the non-english
speaking member countries take the lead.

Mind you, my own feeling would have been to abandon case insensitive
matching for non-Latin1 letters, but that *was* considered to be
an unacceptably anglo-centric point of view, and the Japanese in
particular were insistent on this point.

Paul Koning

2005-01-17 19:22:42 UTC

Post by Paul Koning
But that is nowhere near sufficient. The issue is that case
folding rules are different for different languages/locales that
use the SAME character set. For example, there are a whole bunch
of different folding rules for Latin-1.

Robert> Well in practice the folding rules for Latin-1 have been part
Robert> of the standard for ten years, so they are not about to
Robert> change.

Robert> It would be interesting to know an example of what you state
Robert> above.

Uppercase letters aren't accented in France, but they are in Quebec.
(That doesn't affect folding to lowercase, of course, but it does
affect case-insensitive equality).

An example that affects folding to lowercase: I folds to i-without-dot
in Turkish. Those aren't in Latin-1, but they are in the Latin
section of 10646.

Robert> The decision in Ada is that you do not want the meaning of a
Robert> program or its legality to change in a locale dependent
Robert> way. This is really a fundamental starting point. Note that
Robert> this is a radically different issue from folding at run-time
Robert> in a manner that makes sense to an application program.

Ok, fair enough, I was thinking more of the runtime case in my
comments.

paul

Robert Dewar

2005-01-17 19:35:20 UTC

Post by Paul Koning
Uppercase letters aren't accented in France

That is a (very commonly held) myth. Even many French people think this, but
it is wholly false. The true situation is that in classical typography,
upper case letters were always accented. Then typewriters came along and
it became customary to omit the accents. So widespread did this custom
become that many French schools taught that this was the preferred rule.
However, formally typeset material continued to use accents on upper
case letters. But this was never official usage. In fact I had a friend
Pascal Cleve (there is an accent grave over the first e), whose father
was denied some government benefit on the grounds that his name was
spelled wrong in his passport (without the accent). He bounced back
and forth between govt departments until finally the passport department
got the first typewriter in France that could put accents on upper case
letters.

I learned about this first from Alfred Strohmeier, working together on
the Ada 9X CRG (character rapporteur group -- we had banned all
discussion of characters from the main language group, so I chaired
the CRG to which all such discussions were consigned).

When I told Jean Ichbiah about this, he was adamant that I was wrong.
Luckily we were at his home which has an extensive French library, so
I sent him off to look for a typeset example of a missing accent.
After excursions through many examples (e.g. Journal Ecole ... with
accent acute on the E of course), he could not find ONE example to
back his point of view, and we found dozens that confirmed this.

Post by Paul Koning
Uppercase letters aren't accented in France, but they are in Quebec.
(That doesn't affect folding to lowercase, of course, but it does
affect case-insensitive equality).

No it jolly well does not :-)
Not for identifiers at least.

Post by Paul Koning
An example that affects folding to lowercase: I folds to i-without-dot
in Turkish. Those aren't in Latin-1, but they are in the Latin
section of 10646.

Yes, but for Ada, we can consider identifier matching to be only in the
mode of folding to upper case, which takes care of the dotless i since
this folds to upper case I.

I know about this latter case, and I deal with the French accent case
above, do you know of any other cases?

Post by Paul Koning
Ok, fair enough, I was thinking more of the runtime case in my
comments.

At runtime, it seems that there may be many conventions, and indeed
it is up to the programmer to follow rules appropriate to the
particular application domain.

What makes Ada different is the requirement for absolutely defined
legality rules about what is allowed in identifiers and when they
compare equal.

Paul Koning

2005-01-17 19:44:10 UTC

Post by Paul Koning
Uppercase letters aren't accented in France

Robert> That is a (very commonly held) myth.

Interesting. Learn something new every day.

Post by Paul Koning
An example that affects folding to lowercase: I folds to
i-without-dot in Turkish. Those aren't in Latin-1, but they are
in the Latin section of 10646.

Robert> Yes, but for Ada, we can consider identifier matching to be
Robert> only in the mode of folding to upper case, which takes care
Robert> of the dotless i since this folds to upper case I.

Then take i, which upcases to I with dot. Turkish has i with and
without dot, and the dot is preserved when you change case (in either
direction).

Would you map eszet (in German) to ss? Or to sz? Or neither? Modern
usage does the former; 1930-ish usage the latter.

paul

Georg Bauhaus

2005-01-17 21:06:31 UTC

Post by Paul Koning
Then take i, which upcases to I with dot. Turkish has i with and
without dot, and the dot is preserved when you change case (in either
direction).

And AFAICT, the dot can be quite important, because when spoken,
the difference between ı and i can mean quite different things,
much like the distinction between "year" and "your".

Post by Paul Koning
Would you map eszet (in German) to ss? Or to sz? Or neither? Modern
usage does the former; 1930-ish usage the latter.

Not very often even in the 30s.

Some more things into the pit: Almost never was there
an s followed by a z representing a sharp s in German.
You can go back to the middle ages (1100 or so) and find some
interesting spellings. But then you could also argue that
we should consider matching p with b and d with th
(as in English). See da consonant? :-)

There have been some debates about ß, e.g. when
Switzerland discussed the issue in the 1960s. Technically,
it's not an eszet, and the Unicode database doesn't say
otherwise.

00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S

In the 1930s printers (at least science) used mostly what is now used
again as official spelling: two s for a sharp s (now: when the preceding
vowel is short). Swiss printers always use two s, which is
one of the reasons why you will hardly ever find ß in Wirth's
writings.

In books around 1900 you can see the origin of sharp s,
long s followed by small s:
Most typographers and experts from related
professions will explain that sharp s has its origin
in this combination: a (then) normal s, long shape, same as you
can find in older English texts, followed
by a "Schluss-S" (final s, "normal" shape, ending a word.
Exceptional details omitted.)

Connect the upper end of the long s to the upper
end of the small s and you get sharp s. It's a ligature. (I will
omit the story about how handwriting has created the notion
of an "eszet".) This explains why "Straße" matches "STRASSE".
"STRAßE" is kind of silly computerese. (Straße is German (de_DE)
for street, so I think it is a common name in computer programs.)

For a nice view, see
http://www.queries.de/selbst/typografie.html

Georg

Robert Dewar

2005-01-17 21:45:50 UTC

Post by Paul Koning
Then take i, which upcases to I with dot. Turkish has i with and
without dot, and the dot is preserved when you change case (in either
direction).

Yes, and that's fine, both lower case i with dot and lower case i
without dot fold upper case to capital I (without dot), and so all three
are equivalent in identifiers.

There is no upper case I with dot, so I have no idea what you mean by
saying the dot is preserved. The three characters in question are:

0049;LATIN CAPITAL LETTER I;Lu;0;L;;;;;N;;;;0069;
0069;LATIN SMALL LETTER I;Ll;0;L;;;;;N;;;0049;;0049
0131;LATIN SMALL LETTER DOTLESS I;Ll;0;L;;;;;N;;;0049;;
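
The equivalence described here can be checked mechanically from the three records. The following Python sketch takes field 12 of each record (the simple uppercase mapping) and compares identifiers after folding to upper case. The names fold_upper and same_identifier are invented for illustration, and the table covers only the three characters quoted, so every other character is left unchanged.

```python
RECORDS = """\
0049;LATIN CAPITAL LETTER I;Lu;0;L;;;;;N;;;;0069;
0069;LATIN SMALL LETTER I;Ll;0;L;;;;;N;;;0049;;0049
0131;LATIN SMALL LETTER DOTLESS I;Ll;0;L;;;;;N;;;0049;;
"""

# Field 12 of a UnicodeData.txt record is the simple uppercase mapping;
# it is empty when the character has none.
UPPER = {}
for line in RECORDS.splitlines():
    fields = line.split(";")
    if fields[12]:
        UPPER[int(fields[0], 16)] = int(fields[12], 16)

def fold_upper(identifier):
    """Replace each character by its simple uppercase mapping, if it has one."""
    return "".join(chr(UPPER.get(ord(c), ord(c))) for c in identifier)

def same_identifier(a, b):
    """Two identifiers match if they are equal after upper case folding."""
    return fold_upper(a) == fold_upper(b)

# i, dotless i and I all land on capital I.
print(same_identifier("i", "\u0131"), same_identifier("\u0131", "I"))
```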

Post by Paul Koning
Would you map eszet (in German) to ss? Or to sz? Or neither? Modern
usage does the former; 1930-ish usage the latter.

The specific decision for Ada (all documented in the AI), is not to do
anything special for eszet, so the answer is neither. Quoting from the AI:

We notice that there are cases not covered by this simple correspondence.
For example, German "SS" corresponds to two lowercase sequences. One
is the string "ss", and the other is the es-zett character. We feel that
such complicated cases should be untouched in this time frame, waiting for
the future standardization of appropriate ISO/IEC standards or technical
reports.

Georg Bauhaus

2005-01-17 21:58:39 UTC

Post by Robert Dewar
There is no upper case I with dot, so I have no idea what you mean by

Unless I'm missing something,

0130; F; 0069 0307; # LATIN CAPITAL LETTER I WITH DOT ABOVE

Robert Dewar

2005-01-17 22:06:48 UTC

Post by Georg Bauhaus

Post by Robert Dewar
There is no upper case I with dot, so I have no idea what you mean by

Unless I'm missing something,
0130; F; 0069 0307; # LATIN CAPITAL LETTER I WITH DOT ABOVE

Yes, you are right, interesting, I missed this. Well the Ada
decision is to fold the dotless I to a capital I without dot.
So that's what we will do! I hope Turkish programmers using
Ada will not get completely confused :-)

It's a very deliberate decision too, since it is inconsistent

Post by Georg Bauhaus
(16#119#, 16#119#, -1), -- LATIN SMALL LETTER E WITH OGONEK .. LATIN SMALL LETTER E WITH OGONEK
(16#11B#, 16#11B#, -1), -- LATIN SMALL LETTER E WITH CARON .. LATIN SMALL LETTER E WITH CARON
(16#11D#, 16#11D#, -1), -- LATIN SMALL LETTER G WITH CIRCUMFLEX .. LATIN SMALL LETTER G WITH CIRCUMFLEX
(16#11F#, 16#11F#, -1), -- LATIN SMALL LETTER G WITH BREVE .. LATIN SMALL LETTER G WITH BREVE
(16#121#, 16#121#, -1), -- LATIN SMALL LETTER G WITH DOT ABOVE .. LATIN SMALL LETTER G WITH DOT ABOVE
(16#123#, 16#123#, -1), -- LATIN SMALL LETTER G WITH CEDILLA .. LATIN SMALL LETTER G WITH CEDILLA
(16#125#, 16#125#, -1), -- LATIN SMALL LETTER H WITH CIRCUMFLEX .. LATIN SMALL LETTER H WITH CIRCUMFLEX
(16#127#, 16#127#, -1), -- LATIN SMALL LETTER H WITH STROKE .. LATIN SMALL LETTER H WITH STROKE
(16#129#, 16#129#, -1), -- LATIN SMALL LETTER I WITH TILDE .. LATIN SMALL LETTER I WITH TILDE
(16#12B#, 16#12B#, -1), -- LATIN SMALL LETTER I WITH MACRON .. LATIN SMALL LETTER I WITH MACRON
(16#12D#, 16#12D#, -1), -- LATIN SMALL LETTER I WITH BREVE .. LATIN SMALL LETTER I WITH BREVE
(16#12F#, 16#12F#, -1), -- LATIN SMALL LETTER I WITH OGONEK .. LATIN SMALL LETTER I WITH OGONEK
(16#131#, 16#131#, -232), -- LATIN SMALL LETTER DOTLESS I .. LATIN SMALL LETTER DOTLESS I
(16#133#, 16#133#, -1), -- LATIN SMALL LIGATURE IJ .. LATIN SMALL LIGATURE IJ
(16#135#, 16#135#, -1), -- LATIN SMALL LETTER J WITH CIRCUMFLEX .. LATIN SMALL LETTER J WITH CIRCUMFLEX
(16#137#, 16#137#, -1), -- LATIN SMALL LETTER K WITH CEDILLA .. LATIN SMALL LETTER K WITH CEDILLA
(16#13A#, 16#13A#, -1), -- LATIN SMALL LETTER L WITH ACUTE .. LATIN SMALL LETTER L WITH ACUTE
(16#13C#, 16#13C#, -1), -- LATIN SMALL LETTER L WITH CEDILLA .. LATIN SMALL LETTER L WITH CEDILLA
(16#13E#, 16#13E#, -1), -- LATIN SMALL LETTER L WITH CARON .. LATIN SMALL LETTER L WITH CARON
(16#140#, 16#140#, -1), -- LATIN SMALL LETTER L WITH MIDDLE DOT .. LATIN SMALL LETTER L WITH MIDDLE DOT
(16#142#, 16#142#, -1), -- LATIN SMALL LETTER L WITH STROKE .. LATIN SMALL LETTER L WITH STROKE

Bah! Makes me think even more that the whole business of extending case
insensitive letters to wide characters was a mistake. Oh well, I have got
it all implemented now :-)

Georg Bauhaus

2005-01-17 22:46:22 UTC

Post by Robert Dewar
I hope Turkish programmers using
Ada will not get completely confused :-)

There are quite a few Turkish people around here. I can
confirm that they know how to work around missing dotless i and
dot-less I when hardware or software don't offer one.

Post by Robert Dewar
It's a very deliberate decision too, since it is inconsistent

I just noticed that this character is treated specially in
CaseFolding.txt. Seems like a good catalyst character if you want
a complexity reaction :)

0130; F; 0069 0307; # LATIN CAPITAL LETTER I WITH DOT ABOVE
0130; T; 0069; # LATIN CAPITAL LETTER I WITH DOT ABOVE

# The status field is:
# C: common case folding, common mappings shared by both simple and full mappings.
# F: full case folding, mappings that cause strings to grow in length. Multiple characters are separated by spaces.
# S: simple case folding, mappings to single characters where different from F.
# T: special case for uppercase I and dotted uppercase I
# - For non-Turkic languages, this mapping is normally not used.
# - For Turkic languages (tr, az), this mapping can be used instead of the normal mapping for these characters.
# Note that the Turkic mappings do not maintain canonical equivalence without additional processing.
# See the discussions of case mapping in the Unicode Standard for more information.

Post by Robert Dewar
Bah! Makes me think even more that the whole business of extending case
insensitive letters to wide characters was a mistake. Oh well, I have got
it all implemented now :-)

I for one am looking forward to portable international Ada source code.

Georg

Robert Dewar

2005-01-18 02:39:47 UTC

Post by Georg Bauhaus
I for one am looking forward to portable international Ada source code.

Yes, of course, but I still argue against trying to do this extended
case folding.

Why? Because it seems wrong in any case in Ada programs not to spell
the same identifier in a uniform way throughout the program. That can
never help the reader, and the reader is always favored over the writer
in Ada land. Indeed the standard options used to compile GNAT itself
enforce this usage throughout the compiler and run time.

That does not mean that I prefer the approach of C. I think it is also
confusing to users to have random use of say FirstNode and firstNode
in the same program -- yes I realize that there are some stylized uses
that may be helpful, but the potential for confusion is high, and indeed
most C programmers adopt coding conventions that prevent this kind of
usage in any case.

So ideally programs do not depend on either case sensitivity or
case insensitivity in practice. This means that it is perfectly fine
to have either regime in practice if people follow what I consider
to be good style as described above.

Basically what this says is that the only reason for insisting
on case folding is to allow people to write programs that I don't
think should be written in the first place.

For sure you can get portable international Ada source code
without this folding rule.

Furthermore, I worry about tinkering in the future. Suppose someone
discovers a clear error in the folding tables. Do we then modify all
Ada compilers, and make some existing Ada programs illegal? Suppose
someone in Turkey gets real interested in Ada, and then kicks up a
fuss complaining about the equivalence of i and i-dot. Again, do we
change the language? I think KISS would have been a better idea here.

Same thing for use of non-letters. Who cares? If it's bad style to use
some wide wide character in an identifier, let coding standards take
care of it. Don't insist on all compilers having big tables and slow
search procedures (or giant tables with fast search procedures) to
figure out what is and what is not a letter.

Oh well, these must be weak arguments, they did not persuade the ARG :-)

Now back to work implementing an efficient binary search
routine to search the rather large case folding table which
has 418 entries!
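
For readers wondering what such a search looks like, here is a small Python sketch of a binary search over (first, last, offset) ranges. The four entries are copied from the table excerpt quoted earlier in the thread; the layout of the real table, the `FOLD_TABLE` and `fold_upper` names, and the use of Python's `bisect` are assumptions made for illustration, not the actual GNAT routine.

```python
import bisect

# (first, last, offset) entries from the quoted excerpt; a real table
# would hold several hundred such ranges, sorted by first code point.
FOLD_TABLE = [
    (0x12F, 0x12F, -1),    # LATIN SMALL LETTER I WITH OGONEK
    (0x131, 0x131, -232),  # LATIN SMALL LETTER DOTLESS I
    (0x133, 0x133, -1),    # LATIN SMALL LIGATURE IJ
    (0x135, 0x135, -1),    # LATIN SMALL LETTER J WITH CIRCUMFLEX
]
_STARTS = [first for first, _, _ in FOLD_TABLE]


def fold_upper(code):
    """Return the folded code point, or the input if no range covers it."""
    # Index of the last range whose first code point is <= code.
    i = bisect.bisect_right(_STARTS, code) - 1
    if i >= 0:
        first, last, offset = FOLD_TABLE[i]
        if first <= code <= last:
            return code + offset
    return code
```

The -232 entry is the dotless i case discussed above: 16#131# minus 232 is 16#49#, LATIN CAPITAL LETTER I. Code points that fall between ranges are returned unchanged.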

Marcin Dalecki

2005-01-18 03:32:55 UTC

Post by Robert Dewar

Post by Georg Bauhaus
(16#142#, 16#142#, -1), -- LATIN SMALL LETTER L WITH STROKE .. LATIN SMALL LETTER L WITH STROKE

Bah! Makes me think even more that the whole business of extending case
insensitive letters to wide characters was a mistake. Oh well, I have got
it all implemented now :-)

It was an idiocy to call it by the proper name. Like the idiocy of case
insensitive filesystems where one couldn't suddenly sometimes read discs
recorded on different locales of the very same OS.

Florian Weimer

2005-01-17 22:03:26 UTC

Post by Robert Dewar

Post by Paul Koning
Then take i, which upcases to I with dot. Turkish has i with and
without dot, and the dot is preserved when you change case (in either
direction).

Yes, and that's fine, both lower case i with dot and lower case i
without dot fold upper case to capital I (without dot), and so all three
are equivalent in identifiers.

No, this is not the way Turkish case conversion works. Turkish has a
rule LATIN SMALL LETTER I -> LATIN CAPITAL LETTER I WITH DOT ABOVE
(U+0130).

Robert Dewar

2005-01-17 22:10:46 UTC

Post by Florian Weimer

Post by Robert Dewar
Yes, and that's fine, both lower case i with dot and lower case i
without dot fold upper case to capital I (without dot), and so all three
are equivalent in identifiers.

No, this is not the way Turkish case conversion works. Turkish has a
rule LATIN SMALL LETTER I -> LATIN CAPITAL LETTER I WITH DOT ABOVE
(U+0130).

Maybe not, but I am implementing Ada, and not Turkish :-)
And the Ada rules map as I quoted. Ours not to reason why ....

I guess the point is that latin small letter i must map to latin
capital letter i (with no dot) in Ada, because obviously that's
reasonable and we cannot have case conversion in identifiers be
locale dependent. When it comes to the dotless I, it would indeed
be bizarre to map it to a dotted capital I, so they end up being
mapped the same. Makes sense, given the requirement that case
conversion (or more basically program legality) be locale
independent.

Paul Koning

2005-01-17 22:05:02 UTC

Post by Paul Koning
Then take i, which upcases to I with dot. Turkish has i with and
without dot, and the dot is preserved when you change case (in
either direction).

Robert> Yes, and that's fine, both lower case i with dot and lower
Robert> case i without dot fold upper case to capital I (without
Robert> dot), and so all three are equivalent in identifiers.

That's wrong for Turkish.

Robert> There is no upper case I with dot, so I have no idea what you
Robert> mean by saying the dot is preserved. The three characters in
Robert> question are:

Robert> 0049;LATIN CAPITAL LETTER I;Lu;0;L;;;;;N;;;;0069;
Robert> 0069;LATIN SMALL LETTER I;Ll;0;L;;;;;N;;;0049;;0049
Robert> 0131;LATIN SMALL LETTER DOTLESS I;Ll;0;L;;;;;N;;;0049;;

There certainly is such a thing as uppercase I with a dot, that's a
standard part of Turkish. For an example, see
http://www.turkishembassy.org/consularservices/duyurular.htm, second
section ("Ingiltere Vizeleri:" but with a dot on the first letter).

I see in the character list this entry:

I WITH DOT ABOVE, LATIN CAPITAL LETTER 0130

That sounds like the one.

paul

Robert Dewar

2005-01-17 22:16:54 UTC

Post by Paul Koning

Post by Paul Koning
Then take i, which upcases to I with dot. Turkish has i with and
without dot, and the dot is preserved when you change case (in
either direction).

Robert> Yes, and that's fine, both lower case i with dot and lower
Robert> case i without dot fold upper case to capital I (without
Robert> dot), and so all three are equivalent in identifiers.
That's wrong for Turkish.

This does indeed show that case conversion is locale dependent.
But case equivalence in Ada identifiers cannot be locale dependent.
So Ada is wrong for Turkish, and there is no practical way to make
it right. Of course there can be a local character set available
for Turkish Ada programmers. GNAT already implements several
localized identifier character sets:

1  ISO 8859-1 (Latin-1) identifiers
2  ISO 8859-2 (Latin-2) letters allowed in identifiers
3  ISO 8859-3 (Latin-3) letters allowed in identifiers
4  ISO 8859-4 (Latin-4) letters allowed in identifiers
5  ISO 8859-5 (Cyrillic) letters allowed in identifiers
9  ISO 8859-15 (Latin-9) letters allowed in identifiers
p  IBM PC letters (code page 437) allowed in identifiers
8  IBM PC letters (code page 850) allowed in identifiers
f  Full upper-half codes allowed in identifiers
n  No upper-half codes allowed in identifiers
w  Wide-character codes (that is, codes greater than 255)
   allowed in identifiers

But the point is that the standard rules cannot be locale
dependent, so some choice has to be made. Basically there
were two approaches:

1. Don't allow any case equivalence for wide characters used in
identifiers (this is the way the -gnatiw switch in GNAT Ada 95
mode works).

2. Allow best-possible case mapping, understanding that it will
be not quite right in some cases.

I would have chosen 1 (as I said, this is what I did choose :-)
The Ada Committee (in all its wisdom) has chosen approach 2.
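
To make the difference between the two approaches concrete, here is a hedged Python sketch. It uses Python's built-in `str.upper` as a stand-in for the Ada folding tables, which is an assumption: the tables in the AI may differ in detail. The function names are invented for this example.

```python
def fold_narrow_only(name):
    """Approach 1: fold only characters in the 8-bit range (codes <= 255)."""
    out = []
    for ch in name:
        up = ch.upper()
        # Assumption: fold only when both the character and its upper-case
        # form are single 8-bit codes; everything else is left untouched.
        if ord(ch) <= 255 and len(up) == 1 and ord(up) <= 255:
            out.append(up)
        else:
            out.append(ch)
    return "".join(out)


def fold_best_effort(name):
    """Approach 2: fold every character with a one-character upper case."""
    out = []
    for ch in name:
        up = ch.upper()
        out.append(up if len(up) == 1 else ch)
    return "".join(out)


def equivalent(a, b, fold):
    """Two identifiers are the same if their folded spellings match."""
    return fold(a) == fold(b)


lower_name = "\u0142\u00f3d\u017a"   # l-with-stroke, o-acute, d, z-acute
upper_name = "\u0141\u00d3D\u0179"   # the same letters in upper case
```

Under approach 1 the two spellings are distinct identifiers, because the wide characters U+0142 and U+017A are not folded. Under approach 2 they name the same entity.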

Paul Koning

2005-01-17 22:31:15 UTC

Robert> Yes, and that's fine, both lower case i with dot and lower
Robert> case i without dot fold upper case to capital I (without
Robert> dot), and so all three are equivalent in identifiers.

Post by Paul Koning
That's wrong for Turkish.

Robert> This does indeed show that case conversion is locale
Robert> dependent. But case equivalence in Ada identifiers cannot be
Robert> locale dependent. So Ada is wrong for Turkish, and there is
Robert> no practical way to make it right.

Agreed. If the requirement is to do case folding independent of
locale, then it follows that you must pick one, and be wrong for the
others that use the same characters but apply different rules. And
that's why I said "anglocentric" -- though the more accurate adjective
would be "eurocentric" given what you described.

paul

Georg Bauhaus

2005-01-17 22:51:21 UTC

Paul Koning wrote:

[I and dots]

Post by Paul Koning
though the more accurate adjective
would be "eurocentric" given what you described.

<ot>Uhm. Turkey is to become a member of the EU... ;-)</>

Georg

Marcin Dalecki

2005-01-18 03:36:04 UTC

Post by Georg Bauhaus
[I and dots]

Post by Paul Koning
though the more accurate adjective
would be "eurocentric" given what you described.

<ot>Uhm. Turkey is to become a member of the EU... ;-)</>

<OT>
Never!
The EU dissects itself before this can happen. Which is more likely.
</OT>

Robert Dewar

2005-01-18 05:00:29 UTC

Post by Marcin Dalecki
<OT>
Never!
The EU dissects itself before this can happen. Which is more likely.
</OT>

Chuckle chuckle, now this thread is REALLY going to start wandering
if there are more flame throwers like this one :-)

Robert Dewar

2005-01-17 23:21:46 UTC

Post by Paul Koning
Agreed. If the requirement is to do case folding independent of
locale, then it follows that you must pick one, and be wrong for the
others that use the same characters but apply different rules. And
that's why I said "anglocentric" -- though the more accurate adjective
would be "eurocentric" given what you described.

Well for the dotted I case, it is not so much eurocentrism per se
that is at work, rather it is the issue of upwards compatibility.
(in any case Turkey is presumably part of Europe given that its
candidacy for the EU has been allowed to proceed past the first
stage :-)

Andreas Schwab

2005-01-17 22:22:51 UTC

Post by Robert Dewar
There is no upper case I with dot

There is. It's U0130, LATIN CAPITAL LETTER I WITH DOT ABOVE.

Post by Robert Dewar
0049;LATIN CAPITAL LETTER I;Lu;0;L;;;;;N;;;;0069;
0069;LATIN SMALL LETTER I;Ll;0;L;;;;;N;;;0049;;0049
0131;LATIN SMALL LETTER DOTLESS I;Ll;0;L;;;;;N;;;0049;;

In the tr_TR locale, U0069 is mapped to U0130, not U0049; and U0049 is
mapped to U0131, not U0069.
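
A minimal Python sketch of the difference, for illustration only. Python's `str.upper` and `str.lower` apply the locale-independent default mappings, so the Turkish pairs are swapped in by hand first; `upper_tr` and `lower_tr` are hypothetical helper names, not library functions.

```python
DOTTED_CAPITAL_I = "\u0130"   # LATIN CAPITAL LETTER I WITH DOT ABOVE
DOTLESS_SMALL_I = "\u0131"    # LATIN SMALL LETTER DOTLESS I


def upper_tr(text):
    """Upper-case with the Turkish pairing: i -> U+0130, U+0131 -> I."""
    text = text.replace("i", DOTTED_CAPITAL_I)
    text = text.replace(DOTLESS_SMALL_I, "I")
    return text.upper()


def lower_tr(text):
    """Lower-case with the Turkish pairing: I -> U+0131, U+0130 -> i."""
    text = text.replace("I", DOTLESS_SMALL_I)
    text = text.replace(DOTTED_CAPITAL_I, "i")
    return text.lower()
```

The default mapping sends both "i" and U+0131 to "I", which is the behaviour the Ada rule relies on. The Turkish functions keep the dotted and dotless letters apart in both directions.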

Andreas.


Kevin Puetz

2005-01-18 06:37:39 UTC

Post by Robert Dewar

Post by Paul Koning
Then take i, which upcases to I with dot. Turkish has i with and
without dot, and the dot is preserved when you change case (in either
direction).

Yes, and that's fine, both lower case i with dot and lower case i
without dot fold upper case to capital I (without dot), and so all three
are equivalent in identifiers.
There is no upper case I with dot, so I have no idea what you mean by
0049;LATIN CAPITAL LETTER I;Lu;0;L;;;;;N;;;;0069;
0069;LATIN SMALL LETTER I;Ll;0;L;;;;;N;;;0049;;0049
0131;LATIN SMALL LETTER DOTLESS I;Ll;0;L;;;;;N;;;0049;;

Many others have already corrected you on the existence of İ, so I won't
bother. But I will point out the existence of a very handy program (it's
packaged as native in Debian, so presumably there's no other upstream
source), available from
http://ftp.debian.org/debian/pool/main/u/unicode/unicode_0.4.6.tar.gz

It does all kinds of simple property lookups and has saved me more time when
dealing with unicode issues...
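
This is not the Debian tool mentioned above, but similar simple lookups can be sketched with Python's standard `unicodedata` module. The `describe` helper is a made-up name, and the case mappings shown are Python's default (locale-independent) ones.

```python
import unicodedata


def describe(ch):
    """Collect a few basic properties of a single character."""
    return {
        "code": f"U+{ord(ch):04X}",
        "name": unicodedata.name(ch, "<unnamed>"),
        "category": unicodedata.category(ch),
        # Default case mappings; these can expand to several code points.
        "upper": [f"U+{ord(c):04X}" for c in ch.upper()],
        "lower": [f"U+{ord(c):04X}" for c in ch.lower()],
    }


for ch in "Ii\u0130\u0131":
    print(describe(ch))
```

For U+0130 this reports the name LATIN CAPITAL LETTER I WITH DOT ABOVE and a lower-case mapping of U+0069 U+0307, matching the full case folding entry quoted elsewhere in the thread.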

Robert Dewar

2005-01-18 12:50:25 UTC

Post by Kevin Puetz
It does all kinds of simple property lookups and has saved me more time when
dealing with unicode issues...

Actually I already have all the tables I need (they are in fact included
in the AI that I referenced). The issue was whether there is some other
source of routines that would have the exact semantics I need. It is
pretty clear from the discussion that the answer is no, and that I should
write Ada specific routines, which I have already done.

Marc Espie

2005-01-27 21:17:42 UTC

Post by Robert Dewar
That is a (very commonly held) myth. Even many French people think this, but
it is wholly false. The true situation is that in classical typography,
upper case letters were always accented. Then typewriters came along and
it became customary to omit the accents. So widespread did this custom
become that many French schools taught that this was the preferred rule.
However, formally typeset material continued to use accents on upper
case letters. But this was never official usage. In fact I had a friend
Pascal Cleve (there is an accent grave over the first e), whose father
was denied some government benefit on the grounds that his name was
spelled wrong in his passport (without the accent). He bounced back
and forth between govt departments until finally the passport department
got the first typewriter in France that could put accents on upper case
letters.

To corroborate your point, there is one fairly known set of printed
typographic rules, _les regles de l'imprimerie nationale_ (accents
omitted) that does mention this, among other things. Omitting accents
over uppercase letters is in fact a spelling mistake. One fairly
common thanks to cheap typography and uneducated people, but still
a mistake.

It's likely we are waging a lost battle though, since quite a few people
have now seen more documents without accents over uppercase letters than
correct documents, and like lemmings, they will go on typesetting stuff
the wrong way---usually with Word and Microsoft's comic sans typeface
for greater effect...

Robert Dewar

2005-01-30 05:27:49 UTC

Post by Marc Espie
To corroborate your point, there is one fairly known set of printed
typographic rules, _les regles de l'imprimerie nationale_ (accents
omitted) that does mention this, among other things. Omitting accents
over uppercase letters is in fact a spelling mistake. One fairly
common thanks to cheap typography and uneducated people

Actually it is more likely to be educated people who have this
misconception since typically French schools used to teach that
this was proper style (omitting accents on upper case letters).
I do not know if this is still the case.

One interesting discussion (to bring it a bit back to topic)
is that if French programmers expect case folding, they tend
to expect e-acute to be folded to capital E without an acute
accent. We had a furious argument about this during the Ada 95
design. Jean wanted to apply the crossword criterion. That says
that if in a crossword puzzle you can cross two letters, then
they should be regarded as equivalent in identifiers. So he
argued that e-acute and e should be considered equivalent.

The main (and effective) argument against this was that case
folding of this kind is locale dependent, so we could only do
approximate case folding anyway. I have come to think that it is
a mistake for the Ada standard to mandate locale independent
case equivalence for wide characters in identifiers, but it
looks like it's too late to change people's minds on this.

Oh well, not too bad, it's not that hard to implement (you
will see the updated widechar unit checked in very soon that
supports the case equivalence stuff), and in practice I
think Ada programmers will follow the excellent style rule
of spelling a given identifier consistently anyway.

To me, case equivalence for identifiers in a language is
not about being able to spell a given identifier as Ada
in one place and ADA in another place, but rather it is
about preventing a program from having distinct identifiers
Ada and ADA, which makes programs hard to talk about.
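
As a toy illustration of that view (not how any Ada compiler is actually structured), a symbol table can key its entries on the folded spelling, so that a second declaration differing only in case is rejected, while still remembering the declared spelling for style checks. `SymbolTable` and its methods are hypothetical names, and `str.upper` stands in for the real folding rule.

```python
class SymbolTable:
    """Toy symbol table keyed on the folded spelling of each identifier."""

    def __init__(self):
        self._declared = {}  # folded name -> spelling used at declaration

    def declare(self, name):
        key = name.upper()
        if key in self._declared:
            raise ValueError(
                f"{name!r} conflicts with {self._declared[key]!r}")
        self._declared[key] = name

    def spelled_consistently(self, name):
        """True if a reference uses the same spelling as the declaration."""
        return self._declared.get(name.upper()) == name


table = SymbolTable()
table.declare("Ada")
try:
    table.declare("ADA")      # differs only in case: rejected
    conflict = False
except ValueError:
    conflict = True
```

The `spelled_consistently` check corresponds to the style rule of spelling a given identifier the same way everywhere.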

Post by Marc Espie
It's likely we are waging a lost battle though, since quite a few people
have now seen more documents without accents over uppercase letters than
correct documents, and like lemmings, they will go on typesetting stuff
the wrong way---usually with Word and Microsoft's comic sans typeface
for greater effect...

Well stuff that is "typeset" using WORD is hardly serious. My
experience is that formally published material still always
correctly uses the accents.

Of course you may think that formally published material will
disappear and be replaced by junk on the net. But if that happens
we will lose more elements of style than just the upper case
accents in French :-)

Gabriel Dos Reis

2005-01-30 01:14:22 UTC

Robert Dewar <***@adacore.com> writes:

| Marc Espie wrote:
|
| > To corroborate your point, there is one fairly known set of printed
| > typographic rules, _les regles de l'imprimerie nationale_ (accents
| > omitted) that does mention this, among other things. Omitting accents
| > over uppercase letters is in fact a spelling mistake. One fairly
| > common thanks to cheap typography and uneducated people
|
| Actually it is more likely to be educated people who have this
| misconception since typically French schools used to teach that
| this was proper style (omitting accents on upper case letters).

which French schools?

-- Gaby

Robert Dewar

2005-01-30 06:23:10 UTC

Post by Gabriel Dos Reis
|
| > To corroborate your point, there is one fairly known set of printed
| > typographic rules, _les regles de l'imprimerie nationale_ (accents
| > omitted) that does mention this, among other things. Omitting accents
| > over uppercase letters is in fact a spelling mistake. One fairly
| > common thanks to cheap typography and uneducated people
|
| Actually it is more likely to be educated people who have this
| misconception since typically French schools used to teach that
| this was proper style (omitting accents on upper case letters).
which French schools?

From what I understand, a generation ago, pretty much all French
schools taught this rule. As I say, I do not know if this is still
the case. I have met a bunch of (definitely well educated) French
folks who went to the best French schools, and they all report that
they were taught this rule. One person (forget who) even showed me
the rule in an old text book. But this is a small erratic sample :-)

Gabriel Dos Reis

2005-01-30 01:57:26 UTC

Robert Dewar <***@adacore.com> writes:

| Gabriel Dos Reis wrote:
| > Robert Dewar <***@adacore.com> writes:
| > | Marc Espie wrote:
| > |
| > | > To corroborate your point, there is one fairly known set of printed
| > | > typographic rules, _les regles de l'imprimerie nationale_ (accents
| > | > omitted) that does mention this, among other things. Omitting accents
| > | > over uppercase letters is in fact a spelling mistake. One fairly
| > | > common thanks to cheap typography and uneducated people
| > |
| > | Actually it is more likely to be educated people who have this
| > | misconception since typically French schools used to teach that
| > | this was proper style (omitting accents on upper case letters).
| > which French schools?
|
| From what I understand, a generation ago, pretty much all French
| schools taught this rule. As I say, I do not know if this is still
| the case. I have met a bunch of (definitely well educated) French
| folks who went to the best French schools, and they all report that
| they were taught this rule. One person (forget who) even showed me
| the rule in an old text book. But this is a small erratic sample :-)

Probably. I'm not sure but I think Marc is a generation ahead of me so
he might give an answer there -- but definitely, educated people I've
met don't omit the accent (my generation, or a generation ahead of me).
The FAQ for news:fr.lettres.langue.francaise reports the omission as
having to do with keyboard defects (historically typewriter). It does
not say that it was a rule taught in [typical or best] French schools.

Which is why I would appreciate pointers/evidence for those
"[typically] French schools used to teach" that. But if you don't have
any, it is not a big deal.

-- Gaby

Eric Botcazou

2005-01-30 07:11:07 UTC

Post by Gabriel Dos Reis
Which is why I would appreciate pointers/evidence for those
"[typically] French schools used to teach" that. But if you don't have
any, it is not a big deal.

Well, mine IIRC (I can give the address privately :-).

However the context was slightly different: I was never taught to use
typographic upper case letters, like 'E'. But I was taught to start every
sentence with calligraphic letters (the 'H' especially was funny to write)
and these ones didn't have accents IIRC.

--
Eric Botcazou

Marcin Dalecki

2005-01-18 03:14:43 UTC

Post by Robert Dewar
It would be interesting to know an example of what you state above.
Certainly people have been using Latin-1 to write Ada in countries
all over the world, and no one has ever found the folding rules
for identifiers to be in any way inconvenient.

You must be joking, or you don't speak any language other than English
yourself. Let me assure you that the most common exercise in i18n is to
avoid automatic case dependencies wherever possible. The most common
example of the idiocy of case insensitivity is file system names. This
quickly leads, for example, into situations where you:

1. Can't provide efficient hashing mechanisms for item lookup.
2. Can't even read the contents on a system with a different locale.
3. Can't change the locale at will.

And believe it or not, it is more common in this world to speak multiple
languages than to speak only a single language!

The most common mistake is to think that systems are either Latin-1 or
something else. But in reality the most common case is that you want to:

1. Change locale on the fly. (Yes, the whole LC_ALL and family is
literally nearly useless...)
2. Use multiple locales at the same time.

Maybe people from outside the USA are not such whiners, so you don't
hear them complain frequently. But if you are an English-only speaker,
please just don't bother even thinking about localization. My experience
shows that every single localization system devised by such a person was
doing more harm than good. ISO for Cyrillic is a nice example of such a
thing. Literally nobody to whom it matters is using it, because some
illiterate imbecile managed to provide an 8 bit encoding for this
alphabet which is in fact not in proper order and which isn't even
complete.

Robert Dewar

2005-01-18 04:58:26 UTC

Post by Marcin Dalecki

Post by Robert Dewar
It would be interesting to know an example of what you state above.
Certainly people have been using Latin-1 to write Ada in countries
all over the world, and no one has ever found the folding rules
for identifiers to be in any way inconvenient.

1. Can't provide efficient hashing mechanisms for item lookup.
2. Can't even read the contents on a system with a different locale.
3. Can't change the locale at will.

I am talking specifically about the issue of Ada folding rules with
Latin-1 and ten years experience with them. You seem to be talking
about different issues entirely. Note that I talked about the folding
rules above. I think you did not read carefully, and thought I said
that people were happy to write in Latin-1. That of course is false.
That is why GNAT has always provided -gnatiw to allow full 16-bit
characters in identifiers, and it is why the new standard has
mandated this, and extended it to all planes of 10646.

Post by Marcin Dalecki
The most common mistake is to think that systems are either Latin-1 or
1. Change locale on the fly. (Yes, the whole LC_ALL and family is
literally nearly useless...)
2. Use multiple locales at the same time.

This has nothing whatever to do with folding of identifiers in Ada.
You are using the opportunity for general discussions of
internationalization. Fine, but it is not relevant to this thread
which is very specifically about folding Ada identifiers.

Post by Marcin Dalecki
ISO for Cyrillic is a nice example of such a thing. Literally nobody
to whom it matters is using it, because some illiterate imbecile
managed to provide an 8 bit encoding for this alphabet which is in
fact not in proper order and which isn't even complete.

Well perhaps true in your world, but in fact the Cyrillic ISO
table for Ada identifiers was submitted by a GNAT user, so your
claim is most certainly false in the world we are talking about
here. Order is of course totally irrelevant for identifiers, and
apparently it is complete enough to be useful to Ada programmers
in the real world :-)

You really seem to have been set off on some flame war here and
I don't think it is relevant, since we are not talking about the
general issue, but about Ada identifiers. If you think you have
a useful contribution to make *on that subject*, by all means
read the ai (I gave the reference), and submit comments to
ada comment. The ARG will be happy to take them into account.

Marcin Dalecki

2005-01-18 05:45:18 UTC

Post by Robert Dewar
I am talking specifically about the issue of Ada folding rules with
Latin-1 and ten years experience with them. You seem to be talking
about different issues entirely. Note that I talked about the folding
rules above. I think you did not read carefully, and thought I said
that people were happy to write in Latin-1. That of course is false.
That is why GNAT has always provided -gnatiw to allow full 16-bit
characters in identifiers, and it is why the new standard has
mandated this, and extended it to all planes of 10646.

Look, the problem isn't the fact that somebody wishes to support
international encodings for symbols in code. This may even be helpful to
some. However I still stand by the opinion that declaring them case
insensitive is some kind of weird idiocy which can only be the result of
some polit-bureau group. What are the good reasons to make them such in
the first place?

Robert Dewar

2005-01-18 06:20:37 UTC

Post by Marcin Dalecki
Look, the problem isn't the fact that somebody wishes to support
international encodings for symbols in code. This may even be helpful to
some. However I still stand by the opinion that declaring them case
insensitive is some kind of weird idiocy which can only be the result of
some polit-bureau group. What are the good reasons to make them such in
the first place?

OK, but the fight over whether Ada should have case insensitive identifier
names was discussed and decided 20 years ago, with almost no controversy.
Pretty much everyone agreed this was the way to go. There are many good
reasons, which are not worth rehashing here, since this thread is not
about replaying that old chestnut! And if you start off thinking it is
weird idiocy, and have not bothered to read up on the issue, I think
it unlikely to be fruitful to discuss it in any case!

Marcin Dalecki

2005-01-18 07:54:04 UTC

Post by Robert Dewar

Post by Marcin Dalecki
Look, the problem isn't the fact that somebody wishes to support
international encodings for symbols in code. This may even be helpful to
some. However I still stand by the opinion that declaring them case
insensitive is some kind of weird idiocy which can only be the result of
some polit-bureau group. What are the good reasons to make them such in
the first place?

OK, but the fight over whether Ada should have case insensitive identifier
names was discussed and decided 20 years ago, with almost no controversy.
Pretty much everyone agreed this was the way to go. There are many good
reasons, which are not worth rehashing here, since this thread is not
about replaying that old chestnut! And if you start off thinking it is
weird idiocy, and have not bothered to read up on the issue, I think
it unlikely to be fruitful to discuss it in any case!

Reading up on the issue or not is not the point here. Simply because
I'm a person who uses up to 4 different languages with different
character set encodings (ASCII, Latin-1, Latin-2, KOI8-R, KOI8-U) on
a regular basis, I have already suffered enough from such "good" "i18n"
ideas. Bah... Even in the simple case where I had to read some code
commented in German, I nearly always for some reason had to adjust to a
different weird pseudo-standard encoding that I didn't have support for
on the system I had to use it on.

Based on this experience I just think that you are indeed wasting your
time on extending such a facility. There simply isn't such a thing as a
well defined equivalence relation R that holds if and only if two
strings x, y are equal disregarding case: x R y. There are too many
external variables such a relation depends on. There isn't even such a
basic thing as a standard way to tell which encoding a file is written
in. When looking at Cyrillic, for example, it nearly always turns out to
be a guessing game... ALT, KOI8-R, ISO-8859-5, KOI8-U, CP1259?
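
The encoding guessing game can be reproduced in a few lines of Python. This sketch encodes one Russian word in KOI8-R and decodes the same bytes under three of the candidate encodings. Treating "ALT" as code page 866 is an assumption, and `decode_all` is a name invented for the example.

```python
# Python codec names; "cp866" is assumed to be what "ALT" refers to.
CANDIDATES = ["koi8_r", "iso8859_5", "cp866"]


def decode_all(data, encodings=CANDIDATES):
    """Decode the same bytes under each candidate encoding."""
    return {enc: data.decode(enc, errors="replace") for enc in encodings}


word = "\u043f\u0440\u0438\u0432\u0435\u0442"   # Russian for "hello"
raw = word.encode("koi8_r")
readings = decode_all(raw)

for enc, text in readings.items():
    print(f"{enc:10} {text}")
```

Every decode succeeds without an error, and each yields a different string, so nothing in the bytes themselves says which reading is the intended one.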

Robert Dewar

2005-01-18 12:52:39 UTC

Based on this experience I just think that you are indeed wasting your time
on extending such a facility.

You should at least read the AI in question, because that does provide
a well defined definition of folding as required by the Ada standard.
(or more accurately as will be required when it is issued). If you want
to pursue the issue at the Ada standard level, this is not the forum
for that.

Joe Buck

2005-01-18 19:14:55 UTC

Post by Robert Dewar
OK, but the fight over whether Ada should have case insensitive identifier
names was discussed and decided 20 years ago, with almost no controversy.
Pretty much everyone agreed this was the way to go.

20 years ago, only the US DoD and its contractors had any interest in Ada,
and they did all of their business in English. Case insensitivity is
trivial and uncontroversial if you're only dealing with English and ASCII.

Robert Dewar

2005-01-18 20:38:11 UTC

Post by Joe Buck

Post by Robert Dewar
OK, but the fight over whether Ada should have case insensitive identifier
names was discussed and decided 20 years ago, with almost no controversy.
Pretty much everyone agreed this was the way to go.

20 years ago, only the US DoD and its contractors had any interest in Ada,
and they did all of their business in English.

That's quite inaccurate. There was significant international involvement;
remember that Ada was an international standard. Only someone quite unaware
of Ada history would make such a statement. There was significant interest
in Ada from the start from many other countries. The Ada business is not,
and never was, exclusively defense oriented, though of course that is a
significant segment (probably something like half the current business).

It is true that at the time (1980) the use of >8 bit character codes was
not seen as important (and that included delegations from Japan, China etc,
which concluded at the time that everything would be encoded as 8-bit
octets anyway).

In 1995, wide-character was added, with the notion that a single plane
approach was appropriate at that time, but there was quite a deliberate
decision NOT to require the use of wide characters in identifiers.

Now in 2005, the decision is to require wide characters in identifiers.
This reasonably corresponds to general thinking in this area :-)

Post by Joe Buck
Case insensitivity is
trivial and uncontroversial if you're only dealing with English and ASCII.

Actually the experience is that Latin-1 also works quite smoothly. As I said
earlier, no one has ever raised queries about the decisions made for Ada 1995,
which seem to have worked well.

Extending case insensitivity outside Latin-1 to the multi-plane world is tricky
indeed, but actually I think the ARG did a pretty good job of making the right
decisions (once you decide that you do want to do this extension, which is something
several delegations, including Japan, insisted on). Have you read the full
text of the AI?
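
To give a feel for why the extension is tricky, here is a minimal sketch using Python's str.casefold, which applies Unicode full case folding. This is not the folding rule defined in the AI (its text is not reproduced in this thread); the helper name same_ignoring_case and the sample strings are made up for illustration.

```python
def same_ignoring_case(a: str, b: str) -> bool:
    """Compare two strings after Unicode full case folding."""
    return a.casefold() == b.casefold()

# Latin-1 letters fold one-to-one.
print(same_ignoring_case("ÉCOLE", "école"))

# German sharp s folds to "ss", so folding can change the string length.
print(same_ignoring_case("Straße", "STRASSE"))

# Greek final sigma and medial sigma fold to the same letter.
print(same_ignoring_case("ΟΔΟΣ", "οδος"))

# A pair outside the basic plane: Deseret long I, U+10400 and U+10428.
print(same_ignoring_case("\U00010400", "\U00010428"))

# Turkish dotless i is not folded to "i" by this locale-independent rule,
# so "I" and "ı" compare as different here.
print(same_ignoring_case("I", "ı"))
```

The first four comparisons succeed and the last one fails, which is the kind of language-dependent case that any single folding rule has to take a position on.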

Florian Weimer

2005-01-17 20:28:20 UTC

Post by Robert Dewar
I don't think that is the case, with the full 10646 database,
every character in the database is properly categorized,

Really? This has to be a new development. AFAIK, characters in the
"10646 database" do not have properties. Character properties are a
Unicode thing.
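
For illustration, this is roughly what such properties look like when read from the Unicode Character Database, here through Python's standard unicodedata module. It is only a sketch: the names is_letter and to_lower are hypothetical, and treating "general category starts with L" as the letter test is an assumption, not the definition the Ada AI uses.

```python
import unicodedata


def is_letter(ch: str) -> bool:
    """Hypothetical predicate: true when the Unicode general category is L*."""
    return unicodedata.category(ch).startswith("L")


def to_lower(ch: str) -> str:
    """Hypothetical helper: Python's built-in Unicode lowercase mapping."""
    return ch.lower()


# Sample code points: Latin, Cyrillic, a digit, an underscore, and a
# capital letter from outside the basic plane (Deseret, U+10400).
for ch in ("A", "я", "1", "_", "\U00010400"):
    lowered = " ".join(f"U+{ord(c):04X}" for c in to_lower(ch))
    print(f"U+{ord(ch):04X} category={unicodedata.category(ch)} "
          f"letter={is_letter(ch)} lower={lowered}")
```

The lowercase result is printed as a list of code points because a lowercase mapping is not guaranteed to be a single character for every input.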
