Case-insensitive comparison of std::strings

"Ganesh" <***@gmail.com> wrote in message news:***@posting.google.com...
|
| Given that case-insensitive comparison is such a common operation,
| shouldn't it be made available within C++ standard library instead of
| leaving it to the programmers to re-write such commonly used
| functionality?

Yes. Keep an eye out for the next version of Boost, which will have a string library.

br

Thorsten

[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]

Roshan

2004-07-30 09:27:09 UTC

Post by Thorsten Ottosen
| Given that case-insensitive comparison is such a common operation,
| shouldn't it be made available within C++ standard library instead of
| leaving it to the programmers to re-write such commonly used
| functionality?

std::string may not, but std::basic_string<> should allow for that, I think.
In a sense there are multiple ways to sort strings: ASCII, EBCDIC, lexicographical...

You need to define a custom character trait that implements the correct compare() function.
It _may_ be pretty easy to write a custom char trait that inherits from char_traits<char> and simply
overrides the compare() function. Then use your my_char_trait as follows:

typedef std::basic_string<char, my_char_trait> insensitive_string;
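
A minimal sketch of such a trait class, under stated assumptions: the name my_char_trait comes from the post, while fold() and check_insensitive_string() are hypothetical helpers. It folds each char with std::toupper, so it is only meaningful for single-byte characters in the current C locale. Besides compare(), it also replaces eq(), lt() and find(), which a traits class needs to keep consistent with compare().

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical traits class: characters are compared after folding to upper case.
struct my_char_trait : std::char_traits<char> {
    // Fold one char; the cast avoids passing a negative value to std::toupper.
    static int fold(char c) {
        return std::toupper(static_cast<unsigned char>(c));
    }
    static bool eq(char a, char b) { return fold(a) == fold(b); }
    static bool lt(char a, char b) { return fold(a) < fold(b); }

    // Three-way comparison of the first n characters.
    static int compare(const char* a, const char* b, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i) {
            if (lt(a[i], b[i])) return -1;
            if (lt(b[i], a[i])) return 1;
        }
        return 0;
    }
    // First position in [s, s + n) that matches c, ignoring case.
    static const char* find(const char* s, std::size_t n, char c) {
        for (std::size_t i = 0; i < n; ++i)
            if (eq(s[i], c)) return s + i;
        return nullptr;
    }
};

typedef std::basic_string<char, my_char_trait> insensitive_string;

// Usage example.
void check_insensitive_string() {
    insensitive_string a("Hello"), b("hELLO");
    assert(a == b);
    assert(a.find('L') == 2u);
}
```

Note that insensitive_string is a distinct type from std::string: the two do not convert implicitly, and std::cout's operator<< does not accept it directly (c_str() works).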

-Roshan

Thomas Maeder

2004-07-29 21:13:41 UTC

Post by Ganesh
It is a surprise to most of the "common" C++ programmers to learn that
std::string provides no simple way of doing case-insensitive
http://www.freshsources.com/bjarne/ALLISON.HTM
http://www.josuttis.com/libbook/string/icstring.hpp.html
Given that case-insensitive comparison is such a common operation,
shouldn't it be made available within C++ standard library instead of
leaving it to the programmers to re-write such commonly used
functionality?

Please give an *exact* specification of what you understand by
"case insensitive comparison of std::strings". Take into consideration
that in German, "MASSE" and "Masse" should only compare equal if they both
mean "mass", but not if they mean "measures".

Oh, and that's only in some countries, such as Germany and Austria. Here in
Switzerland, they should always compare equal.

Julie

2004-07-30 09:35:19 UTC

Please give an *exact* specification of what you understand by
"case insensitive comparison of std::strings". Take into consideration
that in German, "MASSE" and "Masse" should only compare equal if they both
mean "mass", but not if they mean "measures".
Oh, and that's only in some countries, such as Germany and Austria. Here in
Switzerland, they should always compare equal.

a - lower case 'A'
A - upper case 'A'

case-insensitive comparison - a == A

Remember, this operates on _characters_, not words. It doesn't matter if MASSE
and Masse are considered different words in different countries -- in that
case, you wouldn't do a case insensitive comparison. For those countries where
case doesn't determine the word, case insensitive comparisons would be
appropriate.

All of this, however, is at the *option* of the programmer. Right now, there
isn't an intrinsic way to compare std::string in a case-insensitive way.
Having that capability would be beneficial, boost offerings aside and
localities aside.
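
A character-by-character comparison of this kind can be sketched as follows. This is only an illustration: iequals() and check_iequals() are hypothetical names, and the locale parameter defaults to the classic "C" locale, in which only the ASCII letters have case.

```cpp
#include <algorithm>
#include <cassert>
#include <locale>
#include <string>

// Hypothetical helper: true if both strings have the same length and every
// pair of characters matches after std::tolower in the given locale.
bool iequals(const std::string& a, const std::string& b,
             const std::locale& loc = std::locale::classic()) {
    return a.size() == b.size() &&
           std::equal(a.begin(), a.end(), b.begin(),
                      [&loc](char x, char y) {
                          return std::tolower(x, loc) == std::tolower(y, loc);
                      });
}

// Usage example.
void check_iequals() {
    assert(iequals("std::string", "STD::String"));
    assert(!iequals("Masse", "Mass"));
}
```

Because the comparison is strictly one char against one char, a pair such as "Maße" and "MASSE" can never match here, whatever locale is passed in.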

Francis Glassborow

2004-07-30 15:27:42 UTC

In article <***@nospam.com>, Julie <***@nospam.com>
writes

Post by Julie
a - lower case 'A'
A - upper case 'A'
case-insensitive comparison - a == A
Remember, this operates on _characters_, not words.

Fine, but how should accented lower case letters compare to unaccented
uppercase ones? Please note that we use accents and other diacriticals
in British English but generally only on lowercase letters.

The idea that there is a (natural) language universal concept of case
sensitivity is simplistic. For example, how should we handle the German
double s represented by a glyph that looks like beta?

Case, along with collation order, is not a property of letters but of a
specific use of a natural language. We should not give some elevated
status to (US) English other than that it already has by effectively
being the default C and C++ locale. And in those contexts, case
sensitivity reigns.

--
Francis Glassborow ACCU
Author of 'You Can Do It!' see http://www.spellen.org/youcandoit
For project ideas and contributions: http://www.spellen.org/youcandoit/projects

Julie

2004-07-31 03:14:20 UTC

Post by Julie
a - lower case 'A'
A - upper case 'A'
case-insensitive comparison - a == A
Remember, this operates on _characters_, not words.

Fine, but how should accented lower case letter compare to unaccented
uppercase ones?

You tell me!

How are accented letters compared in real life situations in a locale? Apply
that model to the language and create specific comparators that operate on the
current locale.

Post by Francis Glassborow
The idea that there is a (natural) language universal concept of case
sensitivity is simplistic. For example, how should we handle the German
double s represented by a glyph that looks like beta.

Please provide more on how case sensitivity is simplistic? Do German keyboards
not have a shift key? or does the shift key not operate on the QWERTY portion
of the keyboard?

Case is all very well defined, per the keyboard -- use that model.

Post by Francis Glassborow
Case, along with collation order is not a property of letters but of a
specific use of a natural language. We should not give some elevated
status to (US) English other than that it already has by effectively
being the default C and C++ locale. And in those contexts, case
sensitivity reigns.

I really don't know what your status comment has to do w/ anything pertaining
to this topic.

Nobody is advocating changes to the current behavior, simply adding to it to
provide support for (presumably locale-specific) case-insensitive comparisons.
If case-(in)sensitivity doesn't apply to a particular locale, then it isn't
provided. For those where it does, it is provided, available, and usable at
the discretion of the programmer.

Thomas Maeder

2004-07-31 13:51:32 UTC

Post by Francis Glassborow
Fine, but how should accented lower case letter compare to unaccented
uppercase ones?

You tell me!

I'm not Francis, but if you ask me to tell you, then I think that the idea
that there is an (even locale specific) general correct way of treating
strings case-insensitively is wrong.

Post by Julie
How are accented letters compared in real life situations in a locale? Apply
that model to the language and create specific comparators that operate on
the current locale.

Fine. You have just contradicted yourself. :-)

In real life, "ss" and "SS" are compared in a context dependent way. You can
only do it correctly if you know the meaning of the word that they are part
of. And this is just an example.

Case-insensitivity on a character by character basis simply doesn't make any
sense in the scope of std::string.

Please provide more on how case sensitivity is simplistic? Do German
keyboards not have a shift key? or does the shift key not operate on the
QWERTY portion of the keyboard?
Case is all very well defined, per the keyboard -- use that model.

This is utter nonsense.

First, there are different varieties of German keyboards (and I think they all
are "QWERTZ", not "QWERTY"), with different behavior wrt case.

Second, the idea that an ISO Standard should be based on some keyboard layout
is really adventurous.

Post by Julie
Nobody is advocating changes to the current behavior, simply adding to it to
provide support for (presumably locale-specific) case-insensitive comparisons.

[Repeating myself:] Locale specificity is not sufficient, understanding of
the text is required.

Ray Lischner

2004-07-31 18:45:50 UTC

Post by Thomas Maeder
Locale specificity is not sufficient, understanding of
the text is required.

I'm curious. How does word processing software perform a
case-insensitive search in German? I guess they detect incorrect
matches, and it is up to the user to decide what to do with the
results. Or do they try to interpret the text to find only correct
matches?

--
Ray Lischner, author of C++ in a Nutshell
http://www.tempest-sw.com/cpp

Thomas Maeder

2004-08-01 11:15:41 UTC

Post by Thomas Maeder
Locale specificity is not sufficient, understanding of
the text is required.

I'm curious. How does word processing software perform a
case-insensitive search in German? I guess they detect incorrect
matches, and it is up to the user to decide what to do with the
results.

FWIW, I just created a Word document, entered "Maße", set the language of the
entire document to "German (Germany)" and did a (what Word calls)
case-insensitive search for "MASSE" (and "MASZE", to err on the safe side);
Word didn't find it. Same result for "MASSE" in the text and "Maße" in
the search argument.

Post by Ray Lischner
Or do they try to interpret the text to find only correct matches?

I wouldn't know of any software that could do this anywhere near correctly.

Julie

2004-08-02 15:14:51 UTC

Post by Thomas Maeder
Locale specificity is not sufficient, understanding of
the text is required.

I'm curious. How does word processing software perform a
case-insensitive search in German? I guess they detect incorrect
matches, and it is up to the user to decide what to do with the
results.

What happens if you enter in "Maße" and then Format/Change Case/lowercase?

Thomas Maeder

2004-08-02 22:48:47 UTC

[I have the feeling that this is getting off-topic.]

Post by Julie
What happens if you enter in "Maße" and then Format/Change Case/lowercase?

MAßE, which seems very wrong to me.

llewelly

2004-08-03 11:37:32 UTC

Post by Thomas Maeder
Locale specificity is not sufficient, understanding of
the text is required.

I'm curious. How does word processing software perform a
case-insensitive search in German? I guess they detect incorrect
matches, and it is up to the user to decide what to do with the
results.

My question is: Do you think a German-speaker who was an ordinary
computer user, would find this behavior an unpleasant surprise?

Gerhard Menzl

2004-08-04 12:40:50 UTC

Thomas Maeder

2004-08-04 12:49:17 UTC

FWIW, I just created a Word document, entered "Maße", set the language
of the entire document to "German (Germany)" and did a (what Word calls)
case-insensitive search for "MASSE" (and "MASZE", to err on the safe
side); Word didn't find it. Same result for "MASSE" in the text and
"Maße" in the search argument.

My question is: Do you think a German-speaker who was an ordinary
computer user, would find this behavior an unpleasant surprise?

No. The problem can't be correctly solved, so I'm not surprised by anything
here.

What I am surprised by is that the moderators let all this happen in this
newsgroup. :-)

k***@gabi-soft.fr

2004-08-04 14:04:51 UTC

Locale specificity is not sufficient, understanding of the text
is required.

I'm curious. How does word processing software perform a
case-insensitive search in German? I guess they detect incorrect
matches, and it is up to the user to decide what to do with the
results.

FWIW, I just created a Word document, entered "Maße", set the
language of the entire document to "German (Germany)" and did a
(what Word calls) case-insensitive search for "MASSE" (and "MASZE",
to err on the safe side); Word didn't find it. Same result for
"MASSE" in the text and "Maße" in the search argument.

My question is: Do you think a German-speaker who was an ordinary
computer user, would find this behavior an unpleasant surprise?

Perhaps:-). This is getting a bit away from C++ (and I'm probably not a
typical computer user, so take my comments with a bit of salt), but...

There are two contexts where the case issue comes up when dealing in
normal text: converting, and searching. Given the word "Maße" in normal
text, I would expect converting it to caps to give "MASSE"; if a
program claims to support case conversion, and doesn't do this, it is,
IMHO, broken. I don't expect the reverse to be true -- it may be
because I am computer aware, and realize the limitations, but I wouldn't
expect a program, asked to convert "MASSE" to lower case, to be able to
tell whether the results should be "Maße" or "Masse"; for that matter, I
would be very impressed if the program realized that it was dealing with
a word which, even in lower case, must start with a capital letter.
Similarly, it would never occur to me to do a case insensitive search
for "MASSE"; I would expect, however, that a case insensitive search for
"maße" or "Maße" match MASSE.

The C++ library has all of the necessary functions for most reasonable
uses, see the std::collate facet, or std::locale::operator(), for
example. Logically, they ARE part of the locale section of the library,
since they very much depend on the locale. Regretfully (although I
don't see any reasonable alternative), the standard doesn't require any
locales except "C" to be present, and text is case sensitive in the C
locale, so you have no guarantee of being able to do a case insensitive
comparison. IMHO, it wouldn't be too much for the standard to require
at least one language/country specific locale to be furnished, although
in the absence of a standard for naming such locales, I'm not sure how
much this would help. From a quality of implementation point of view,
I think a minimum would include an international locale (based on
English, since that is the international language) and a locale for the
country in which the compiler is being sold -- for countries like
Belgium, Canada and Switzerland, this means in fact several locales.
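
As a rough illustration of the two library facilities named above, the sketch below calls std::collate<char>::compare through a hypothetical wrapper collate_compare() and uses std::locale::operator() as a sort predicate. It sticks to the classic "C" locale because that is the only one guaranteed to exist; a named locale such as "de_DE" is an assumption about the platform, and constructing it throws std::runtime_error when it is missing.

```cpp
#include <algorithm>
#include <cassert>
#include <locale>
#include <string>
#include <vector>

// Hypothetical wrapper: three-way comparison using the collate facet of loc.
int collate_compare(const std::string& a, const std::string& b,
                    const std::locale& loc) {
    const std::collate<char>& coll = std::use_facet<std::collate<char> >(loc);
    return coll.compare(a.data(), a.data() + a.size(),
                        b.data(), b.data() + b.size());
}

// Usage example.
void check_collate() {
    const std::locale& c_locale = std::locale::classic();

    // In the "C" locale, collation is case sensitive.
    assert(collate_compare("MASSE", "masse", c_locale) != 0);
    assert(collate_compare("masse", "masse", c_locale) == 0);

    // std::locale::operator() is a "less than" built on the same facet,
    // so a locale object can be passed directly as a sort predicate.
    std::vector<std::string> words = {"masse", "MASSE", "Masse"};
    std::sort(words.begin(), words.end(), c_locale);
    assert(words.front() == "MASSE");
}
```

Whether two strings that differ only in case compare equal depends entirely on the locale passed in; nothing in the interface itself promises case insensitivity.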

I also see a need for OS specific locales, e.g. "POSIX" or "WINDOWS".
(The Posix standard requires it for Posix conformant systems.) Thus, in
"POSIX", the collate facet is case sensitive; in "WINDOWS", it is not. Here, too,
it would seem acceptable, at least to me, that the standard require such
a locale; possibly even that it give it a fixed name (e.g. "SYSTEM").

As a passing thought, I wonder what rules Windows uses for its case
insensitive filename comparison. In French, for example, 'i' == 'I',
but this would definitely not be the case in Turkish, where you should
have 'i' == '\u0130' and '\u0131' == 'I'. I suppose that the obvious
solution is just to ignore all accents, with 'i' == 'I' == '\u0130' ==
'\u0131', but this will lead to ambiguous names in Turkish, and probably
some other languages as well. And of course, "Maße", "MASSE" and
"MASZE" must compare equal as well, or the system will be quite
counter-intuitive in most German speaking areas (but not Switzerland).
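
The Turkish pairing can be made concrete with a small sketch. The function to_lower_sketch() and its turkish flag are hypothetical, and only the ASCII letters plus the two special code points are handled; a real implementation would take the rule from the locale.

```cpp
#include <cassert>

// Hypothetical per-character lower-casing for ASCII letters plus the
// dotted/dotless i. `turkish` selects the Turkish pairing of the letters.
char32_t to_lower_sketch(char32_t c, bool turkish) {
    if (turkish) {
        if (c == U'I')      return U'\u0131';  // I -> dotless i (U+0131)
        if (c == U'\u0130') return U'i';       // dotted capital I (U+0130) -> i
    }
    if (c >= U'A' && c <= U'Z') return static_cast<char32_t>(c + 0x20);
    return c;  // everything else is left as-is in this sketch
}

// Usage example: 'i' and 'I' match in French, but not in Turkish.
void check_turkish_i() {
    assert(to_lower_sketch(U'I', false) == to_lower_sketch(U'i', false));
    assert(to_lower_sketch(U'I', true) != to_lower_sketch(U'i', true));
    assert(to_lower_sketch(U'\u0130', true) == to_lower_sketch(U'i', true));
}
```

The same pair of strings therefore compares equal or unequal depending on a flag that the characters themselves do not carry.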

--
James Kanze GABI Software http://www.gabi-soft.fr
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

Niklas Matthies

2004-08-04 12:10:21 UTC

Post by Thomas Maeder
Locale specificity is not sufficient, understanding of
the text is required.

I'm curious. How does word processing software perform a
case-insensitive search in German? I guess they detect incorrect
matches, and it is up to the user to decide what to do with the
results.

FWIW, I just created a Word document, entered "Maße", set the
language of the entire document to "German (Germany)" and did a
(what Word calls) case-insensitive search for "MASSE" (and "MASZE",
to err on the safe side); Word didn't find it. Same result for
"MASSE" in the text and "Maße" in the search argument.

While not a word processor, my online banking application converts
"Maße" to "MASSE" in the "reason for transfer" field, and the search
function also correctly finds the transfer containing "MASSE" when
searching for "maße". (Actually I tested this with "Straße".)
Google also finds "ss"/"ae"/etc. when searching for "ß"/"ä"/etc.
and vice versa, as well as http://dict.leo.org/ and many other
dictionaries.

I would say that this is pretty much expected behavior from German-
aware software, despite resulting in "incorrect" matches. IMHO such
matches are in the same category as those you would get with any
homographic word (e.g. "record").

-- Niklas Matthies

Thomas Maeder

2004-08-04 13:57:19 UTC

Post by Niklas Matthies
While not a word processor, my online banking application converts
"Maße" to "MASSE" in the "reason for transfer" field, and the search
function also correctly finds the transfer containing "MASSE" when
searching for "maße". (Actually I tested this with "Straße".)
Google also finds "ss"/"ae"/etc. when searching for "ß"/"ä"/etc.
and vice versa, as well as http://dict.leo.org/ and many other
dictionaries.

Converting "Maße" to "MASSE" is ok, but converting the other way round
is presumptuous, as is telling that the two mean the same thing.

Unless you are in a well-defined context that is a small subset of the
context of a language, or a locale. Such as, as you tell me, your
application, or, as another example, Internet host names.

"Straße" is a different case because there is no word "Strasse" in
German German.

[And I don't see how "Maße" can be a reason for transfer, but that may
be me :-)]

Post by Niklas Matthies
I would say that this is pretty much expected behavior from German-
aware software, despite resulting in "incorrect" matches. IMHO such
matches are in the same category as those you would get with any
homographic word (e.g. "record").

If you can live with false positives, that's ok.

But functionality that delivers false positives should not be
standardized.

Niklas Matthies

2004-08-05 03:19:49 UTC

Converting "Maße" to "MASSE" is ok, but converting the other way round
is presumptuous, as is telling that the two mean the same thing.

But subsequent application of the search function does exactly that.
When searching for "Masse", the search function cannot tell whether
you mean to include the "Masse" transcription of "Maße" or not.

Or, incidentally, when searching for "Thomas Maeder" it cannot tell
whether you mean to (also) get matches for "Thomas Mäder" or not.

German-language strings like "Maeder" are inherently ambiguous in
general, since you can't tell whether this may mean "Mäder" because
of restricted input capabilities (say a US keyboard without an input
method for non-US characters) or a restricted character set (as is the
case with bank transfers), or whether this is really meant to be
"Maeder" and only "Maeder".

If you can live with false positives, that's ok.
But functionality that delivers false positives should not be
standardized.

My point is that searching for "record" (case-insensitive or not) can
also result in such semantic false positives.

-- Niklas Matthies

James Kanze

2004-08-08 22:04:20 UTC

Thomas Maeder <***@glue.ch> writes:

|> Niklas Matthies <usenet-***@nmhq.net> writes:

|> > While not a word processor, my online banking application converts
|> > "Maße" to "MASSE" in the "reason for transfer" field, and the
|> > search function also correctly finds the transfer containing
|> > "MASSE" when searching for "maße". (Actually I tested this with
|> > "Straße".) Google also finds "ss"/"ae"/etc. when searching for
|> > "ß"/"ä"/etc. and vice versa, as well as http://dict.leo.org/ and
|> > many other dictionaries.

|> Converting "Maße" to "MASSE" is ok, but converting the other way
|> round is presumptuous, as is telling that the two mean the same
|> thing.

All of the applications I know are either case sensitive, or treat
everything as upper case, for historical reasons. One important
application I don't know is what Windows does with filenames, but MS-DOS
also treated them as upper case, so I suspect that this is also the
case. Thus, if you have a file named Maße.txt, and you try and create
one named MASSE.TXT, the system should refuse (or replace the original
file, depending on the context).

Similarly, for a generalized case insensitive text search, I would convert to
upper case, and treat all accented characters as being equal to the
unaccented version. There would be some false positives, but this is
generally preferable to missing something, or requiring the user to make
several searches or to use some complex regular expression to find what
he is looking for.
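
One way to picture such a search is to reduce both the text and the search term to a common key before comparing. The sketch below is an assumption-laden illustration: search_key() is a hypothetical name, and its table covers only a handful of Latin-1 letters.

```cpp
#include <cassert>
#include <string>

// Hypothetical search key: upper case, accents dropped, sharp s expanded
// to "SS". Only a few Latin-1 letters are covered.
std::u32string search_key(const std::u32string& text) {
    std::u32string key;
    for (char32_t c : text) {
        switch (c) {
            case U'\u00DF': key += U"SS"; break;                 // sharp s
            case U'\u00E4': case U'\u00C4': key += U'A'; break;  // a umlaut
            case U'\u00F6': case U'\u00D6': key += U'O'; break;  // o umlaut
            case U'\u00FC': case U'\u00DC': key += U'U'; break;  // u umlaut
            case U'\u00E9': case U'\u00C9': key += U'E'; break;  // e acute
            default:
                key += (c >= U'a' && c <= U'z')
                           ? static_cast<char32_t>(c - 0x20)
                           : c;
        }
    }
    return key;
}

// Usage example: "Maße" and "masse" produce the same key.
void check_search_key() {
    assert(search_key(U"Ma\u00DFe") == U"MASSE");
    assert(search_key(U"masse") == U"MASSE");
}
```

Two strings match when their keys are equal, which is exactly where the false positives come from. A table that transcribes the umlauts as "AE", "OE" and "UE" would be an equally defensible choice with a different set of false positives.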

|> Unless you are in a well-defined context that is a small subset of
|> the context of a language, or a locale. Such as, as you tell me,
|> your application, or, as another example, Internet host names.

Internet domain names are a special case -- only seven bit ASCII is
allowed, so no ambiguities are possible. But unless you are writing
protocol level software (e.g. a new implementation of DNS), you should
probably not play with it.

|> "Straße" is a different case because there is no word "Strasse" in
|> German German.

But there is in Swiss German.

And to add an additional complication, when I learned German, there was
no word "dass" in German, just "daß". Today, it is the reverse.

|> [And I don't see how "Maße" can be a reason for transfer, but that
|> may be me :-)]

I suspect that he just experimented with the two lines of free text
allowed in a standard bank transfer. It doesn't have to be reasonable,
as long as both parties understand why the transfer is being made.

|> > I would say that this is pretty much expected behavior from
|> > German-aware software, despite resulting in "incorrect" matches.
|> > IMHO such matches are in the same category as those you would get
|> > with any homographic word (e.g. "record").

|> If you can live with false positives, that's ok.

|> But functionality that delivers false positives should not be
|> standardized.

The problem is what the program is being used for. IMHO:

- there is a locale specific function, std::collate<>::compare, which
is standardized, and would seem to fit the bill, and

- it probably wouldn't be a bad idea to add a requirement for a locale
for comparing system specific filenames -- Posix requires a locale
named "POSIX", but as far as I know, Windows doesn't require
anything, and there really should be a portable name that one could
use.

For other cases of case insensitive comparison... Who knows what is
needed?

--
James Kanze
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France +33 (0)1 30 23 00 34

Niklas Matthies

2004-08-09 09:25:30 UTC

One important application I don't know is what Windows does with
filenames, but MS-DOS also treated them as upper case, so I suspect
that this is also the case. Thus, if you have a file named
Maße.txt, and you try and create one named MASSE.TXT, the system
should refuse (or replace the original file, depending on the
context).

No, because the case-insensitivity of the filesystem needs to be
locale-independent. Otherwise two files with different names with
regard to locale X suddenly have the same name when seen from
locale Y. What Windows seems to do is a character-by-character
translation for those character codes that have a "canonical" 1:1
case mapping.

-- Niklas Matthies

k***@gabi-soft.fr

2004-08-10 18:52:39 UTC

Post by Niklas Matthies

No, because the case-insensitivity of the filesystem needs to be
locale-independent.

That's part of my point. There is no such thing as a locale independent
case-insensitivity. From a human point of view, all interactions with
the computer (commands, filenames, etc.) should be case insensitive.
From a practical point of view, this leads to the problem of locale
dependency. There is a trade-off which has to be made, and it isn't
always obvious.

Post by Niklas Matthies
Otherwise two files with different names with regard to locale X
suddenly have the same name when seen from locale Y.

Quite. And a lot depends on the use of the machine. On a "personal"
computer, there should be no problem using my "personal" locale; on a
shared computer, or a computer accessing a shared file system, this
becomes more problematic, and the system probably has to impose a locale
for itself, e.g.: locale "POSIX" or locale "WINDOWS".

As I've mentioned in another post, it might be worthwhile for the
standard to require such a locale, under a standardized name, e.g. "OS",
or "System", or some such.

Post by Niklas Matthies
What Windows seems to do is a character-by-character translation for
those character codes that have a "canonical" 1:1 case mapping.

Which defines a particular locale. It is NOT the "C" locale; in the "C"
locale, character comparisons (strcoll, std::collate<char>::compare,
etc.) are case sensitive.

--
James Kanze GABI Software http://www.gabi-soft.fr
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

llewelly

2004-08-20 00:23:58 UTC

Post by Julie
a - lower case 'A'
A - upper case 'A'
case-insensitive comparison - a == A
Remember, this operates on _characters_, not words.

[snip]

The solution is simple - if a character with more than one 'obvious'
or no obvious case conversion is encountered, the function calls
std::terminate() .

Are you appalled? I know I am. But without such a function, nearly
every program uses some in-house function which does some variant
of case-insensitive comparison, and, when faced with the
situations you describe, silently does the wrong thing.

Alf P. Steinbach

2004-08-20 10:06:38 UTC

Post by Julie
a - lower case 'A'
A - upper case 'A'
case-insensitive comparison - a == A
Remember, this operates on _characters_, not words.

[snip]
The solution is simple - if a character with more than one 'obvious'
or no obvious case conversion is encountered, the function calls
std::terminate() .

I'm appalled.

The solution is simple, as I've described in another posting in this thread:
if the character code, regardless of locale issues and such, defines a
unique uppercase version of the lowercase accented letter, use that (e.g.
accented uppercase); if not, let the character be as-is -- a to_upper()
convenience function is for convenience, not for $50.000 word processing
with tens or hundreds of MiB natural language parser & KBS at bottom.

Incidentally I believe this approach, except perhaps the "ignore locale"
bit, reflects current practice, which is a Good Thing to standardize.
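
A minimal sketch of that rule, assuming the character code is Latin-1 (ISO 8859-1) held in a char32_t; to_upper_simple() is a hypothetical name and the ranges are specific to that character set.

```cpp
#include <cassert>

// Hypothetical to_upper() for Latin-1 code points: converts only where the
// character set itself offers a single upper-case counterpart.
char32_t to_upper_simple(char32_t c) {
    if (c >= U'a' && c <= U'z') return static_cast<char32_t>(c - 0x20);
    // U+00E0..U+00FE are lower-case letters whose capitals sit 0x20 below,
    // except U+00F7, which is the division sign.
    if (c >= 0xE0 && c <= 0xFE && c != 0xF7)
        return static_cast<char32_t>(c - 0x20);
    // Everything else is returned as-is, including U+00DF (sharp s),
    // which has no single-character capital in Latin-1.
    return c;
}

// Usage example.
void check_to_upper_simple() {
    assert(to_upper_simple(U'a') == U'A');
    assert(to_upper_simple(U'\u00E9') == U'\u00C9');  // e acute -> E acute
    assert(to_upper_simple(U'\u00DF') == U'\u00DF');  // sharp s unchanged
}
```

The function never fails and never guesses; the price is that "Maße" upper-cases to "MAßE" rather than "MASSE".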

Post by llewelly
Are you appalled?

See above.

--
A: Because it messes up the order in which people normally read text.
Q: Why is it such a bad thing?
A: Top-posting.
Q: What is the most annoying thing on usenet and in e-mail?

Francis Glassborow

2004-08-20 14:41:09 UTC

Post by Alf P. Steinbach
if the character code, regardless of locale issues and such, defines a
unique uppercase version of the lowercase accented letter, use that (e.g.
accented uppercase); if not, let the character be as-is -- a to_upper()
convenience function is for convenience, not for $50.000 word processing
with tens or hundreds of MiB natural language parser & KBS at bottom.

Many programmers assume that c==to_upper(to_lower(c)) and
c == to_lower(to_upper(c)) are universally true. It seems that this
assumption might be false.

--
Francis Glassborow ACCU
Author of 'You Can Do It!' see http://www.spellen.org/youcandoit
For project ideas and contributions: http://www.spellen.org/youcandoit/projects

Kai-Uwe Bux

2004-08-21 03:24:55 UTC

Post by Francis Glassborow

Post by Alf P. Steinbach
The solution is simple, as I've described in another posting in this
thread: if the character code, regardless of locale issues and such,
defines a unique uppercase version of the lowercase accented letter, use
that (e.g.
accented uppercase); if not, let the character be as-is -- a to_upper()
convenience function is for convenience, not for $50.000 word processing
with tens or hundreds of MiB natural language parser & KBS at bottom.

Many programmers assume that c==to_upper(to_lower(c)) and
c == to_lower(to_upper(c)) are universally true. It seems that this
assumption might be false.

I would hope that no programmer assumes

c == to_upper( to_lower( c ) )

to be true for c == 'd'. Probably, you meant:

to_upper( c ) == to_upper( to_lower( c ) )

Now, that is something, that I think *should* be true for all values of c.
Do you know an instance, where it fails?

Best

Kai-Uwe Bux

Francis Glassborow

2004-08-21 16:07:46 UTC

Post by Kai-Uwe Bux
I would hope that no programmer assumes
c == to_upper( to_lower( c ) )
to be true for c == 'd'. Probably, you meant:
to_upper( c ) == to_upper( to_lower( c ) )

Yes, that is what I should have written.

Post by Kai-Uwe Bux
Now, that is something, that I think *should* be true for all values of c.
Do you know an instance, where it fails?

I think it easier to give the counter example for:

to_lower(c) == to_lower(to_upper(c));

and consider locales in which lower case accented letters are converted
to upper case un-accented ones.
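
The counter example can be written out with a pair of toy conversion functions. Both functions are hypothetical and model a locale in which capitals are written without accents; only the e acute and the ASCII letters are handled.

```cpp
#include <cassert>

// Hypothetical locale rule: the lower-case e acute (U+00E9) upper-cases
// to a plain 'E'.
char32_t to_upper_noaccent(char32_t c) {
    if (c == U'\u00E9') return U'E';
    if (c >= U'a' && c <= U'z') return static_cast<char32_t>(c - 0x20);
    return c;
}

// Matching lower-casing: a plain 'E' can only go back to a plain 'e'.
char32_t to_lower_noaccent(char32_t c) {
    if (c >= U'A' && c <= U'Z') return static_cast<char32_t>(c + 0x20);
    return c;
}

// Usage example: the round trip through upper case loses the accent.
void check_noaccent_round_trip() {
    const char32_t c = U'\u00E9';
    assert(to_lower_noaccent(c) == U'\u00E9');
    assert(to_lower_noaccent(to_upper_noaccent(c)) == U'e');
    assert(to_lower_noaccent(c) != to_lower_noaccent(to_upper_noaccent(c)));
}
```

For this c, to_upper(c) == to_upper(to_lower(c)) still holds, since both sides yield 'E'; it is the identity with to_lower on the outside that breaks.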

Kai-Uwe Bux

2004-08-22 03:09:07 UTC

Post by Francis Glassborow

Post by Kai-Uwe Bux
I would hope that no programmer assumes
c == to_upper( to_lower( c ) )
to be true for c == 'd'. Probably, you meant:
to_upper( c ) == to_upper( to_lower( c ) )

Yes, that is what I should have written.

Post by Kai-Uwe Bux
Now, that is something, that I think *should* be true for all values of c.
Do you know an instance, where it fails?

to_lower(c) == to_lower(to_upper(c));
and consider locales in which lower case accented letters are converted
to upper case un-accented ones.

Thanks, that is convincing. I am forced to reconsider.

Best

Kai-Uwe Bux


Niklas Matthies

2004-08-22 23:06:52 UTC

Post by Kai-Uwe Bux
to_upper( c ) == to_upper( to_lower( c ) )
Now, that is something, that I think *should* be true for all values
of c. Do you know an instance, where it fails?

It can fail when titlecase characters (such as U+01F2) are considered
to constitute uppercase and to_upper() returns them unchanged.
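
A small sketch of the titlecase case, restricted to the three DZ digraph code points. The functions dz_to_upper and dz_to_lower are hypothetical and only model the behaviour described above (titlecase returned unchanged by to_upper); they are not any library's actual mapping.

```cpp
#include <cassert>

// Hypothetical mapping for the DZ digraph only, for illustration:
// U+01F1 is the upper case form, U+01F2 the title case form,
// U+01F3 the lower case form.
char32_t dz_to_upper(char32_t c) {
    if (c == U'\u01F3') return U'\u01F1';  // lower -> upper
    return c;                              // title case U+01F2 is left unchanged
}

char32_t dz_to_lower(char32_t c) {
    if (c == U'\u01F1' || c == U'\u01F2') return U'\u01F3';
    return c;
}

// For c = U+01F2, to_upper(c) stays U+01F2 while to_upper(to_lower(c))
// becomes U+01F1, so the proposed identity fails.
void dz_demo() {
    const char32_t c = U'\u01F2';
    assert(dz_to_upper(c) == U'\u01F2');
    assert(dz_to_upper(dz_to_lower(c)) == U'\u01F1');
    assert(dz_to_upper(c) != dz_to_upper(dz_to_lower(c)));
}
```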

-- Niklas Matthies


k***@gabi-soft.fr

2004-08-20 14:43:21 UTC

Post by Julie
a - lower case 'A'
A - upper case 'A'
case-insensitive comparison - a == A
Remember, this operates on _characters_, not words.

[snip]
The solution is simple - if a character with more than one 'obvious'
or no obvious case conversion is encountered, the function calls
std::terminate() .
Are you appalled? I know I am. But without such a function, nearly
every program uses some in-house function which does some variant
of case-insensitive comparison, and, when faced with the
situations you describe, silently does the wrong thing.

Now that's an interesting point of view. I'm intrigued.

Basically, your argument is that practically every program uses some
broken version in house, so we should ensconce a specific broken version
in the standard. There is definitely some precedent (think of gets), and
at least in that case, we know where we stand.

I also like your suggestion for handling the awkward cases:-).
Seriously. There ARE contexts where case insensitivity makes sense, but
the only ones I can think of are when the character set is limited to
straight ASCII.

--
James Kanze GABI Software http://www.gabi-soft.fr
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34


Alf P. Steinbach

2004-08-21 04:15:34 UTC

Post by k***@gabi-soft.fr
There ARE contexts where case insensitivity makes sense, but
the only ones I can think of are when the character set is limited to
straight ASCII.

Most simple text searching operations can involve case insensitivity.

File names, process names, etc.

Usually my own case insensitive searches (as a computer user) require
at least 8259-1, since ASCII doesn't have the Norwegian æøåÆØÅ, or UCS2,
since many commonly used characters such as m-dash and euro are not in the
basic Latin-1 set.

It's no big deal to support this limited functionality, but the idea that
software simply shouldn't work if it cannot support all potential cases
is not that far-fetched -- because there's much actual software that
behaves that way!

For example, many of Microsoft's C++ development tools have traditionally
only worked 100% in Seattle/Redmond; in Visual Studio 7.1 (the latest
offering when disregarding beta of new version) the "front page", so to
speak, has three tabs called "Projects", "Online Resources" and "My
Profile", and "Online Resources" either works or does not work at all, that is,
no result whatsoever, not even gibberish, depending on the locale settings
of the machine and some mysterious factor that nobody's identified so far.
Presumably it doesn't call std::terminate and recover from that but instead
just throws an exception, when you don't have the right character code,
locale, keyboard and so on. What a great idea!

k***@gabi-soft.fr

2004-08-23 22:29:33 UTC

There ARE contexts where case insensitivity makes sense, but the
only ones I can think of are when the character set is limited to
straight ASCII.

Most simple text searching operations can involve case insensitivity.

Most simple text searching operations should involve case
insensitivity. And do it correctly -- "Maße" matches "MASSE" (or
"MASZE").

Obviously, a simple, character based toupper or tolower function won't
help here.

Post by Alf P. Steinbach
File names, process names, etc.

Most of the cases I've seen of these limit the character sets.
Extremely. The one exception is Windows, and I've not had the chance to
see what semantics they assign. Will the filename "Maße" match a file
created with the name of "MASSE"? What happens with 'i', whose upper
case equivalent depends on the language? Etc., etc. Or does Windows
just ignore the accents, so that "parlé" and "parle" are the same
filename. (I think that that is what I would do.)

Note that when all possible characters are allowed, even pure caps can
cause problems. For example, can you tell the difference between "AB"
and "\u0391\u0392" -- I don't know of any font where they are
distinguishable.
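
To make the point concrete, here is a short sketch comparing the two strings as sequences of code points. The names latin_ab and greek_ab are just illustrative; std::u32string is used here only so that one element corresponds to one code point.

```cpp
#include <cassert>
#include <string>

// Latin capital letters A and B (U+0041, U+0042).
const std::u32string latin_ab = U"AB";
// Greek capital letters Alpha and Beta (U+0391, U+0392).
const std::u32string greek_ab = U"\u0391\u0392";

// The two strings render alike in most fonts but share no code point,
// so any code-point based comparison treats them as different.
void lookalike_demo() {
    assert(latin_ab != greek_ab);
    assert(latin_ab[0] == 0x41 && greek_ab[0] == 0x391);
}
```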

Post by Alf P. Steinbach
Usually my own case insensitive searches (as a computer user) require
at least 8259-1, since ASCII doesn't have the Norwegian æøåÆØÅ, or
UCS2, since many commonly used characters such as m-dash and euro are
not in the basic Latin-1 set.

I think that there is a typo somewhere in that paragraph, since Latin-1
is ISO 8859-1 (and I don't think that there is such a thing as ISO 8259,
although it seems frequently referenced). But note that you have already
introduced locale dependencies. In Norwegian (I think -- at least in
Danish and Swedish), letters like ø or æ ARE distinct letters, with
their own place in the alphabet. Not all languages treat accented
letters this way, and of course, as I've already mentioned, in Turkish,
'I' is NOT the upper case form of 'i' -- they are two distinct letters.
(Normally, Turkish would use Latin-3, where the capital of 'i' has the
code 0xA9. In Unicode, it would be \u0130.)

Post by Alf P. Steinbach
It's no big deal to support this limited functionality, but the idea
that software simply shouldn't work if it cannot support all potential
cases is not that far-fetched -- because there's much actual software
that behaves that way!

How true:-).

In fact, I'm not so much against the idea of standardizing something, as
I am against standardizing it now, when we don't yet know what the
correct solution is (assuming there is one).

Post by Alf P. Steinbach
For example, many of Microsoft's C++ development tools have
traditionally only worked 100% in Seattle/Redmond;

And those from Sun in California:-). And so on. So we standardize bad
practices, on the grounds that they are wide-spread?

--
James Kanze GABI Software http://www.gabi-soft.fr
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34


Alf P. Steinbach

2004-08-24 06:38:48 UTC

There ARE contexts where case insensitivity makes sense, but the
only ones I can think of are when the character set is limited to
straight ASCII.

Most simple text searching operations can involve case insensitivity.

Most simple text searching operations should involve case
insensitivity. And do it correctly -- "Maße" matches "MASSE" (or
"MASZE").
Obviously, a simple, character based toupper or tolower function won't
help here.

Obviously it does help, because that's what I & a zillion computer users use
today and find very helpful... ;-)

When it's _simple_ enough that the user can understand it fully, the user
can supply the intelligence that it seems you'd like it to have, and as of
2004 any intelligent design places the requirement of intelligence on the user.

When it's complex & intelligent enough to handle most such cases it won't be
simple enough to understand (so the user cannot then know and work around
limitations), furthermore it probably won't be there at all...

Julie

2004-08-24 06:39:53 UTC

There ARE contexts where case insensitivity makes sense, but the
only ones I can think of are when the character set is limited to
straight ASCII.

Most simple text searching operations can involve case insensitivity.

Most simple text searching operations should involve case
insensitivity. And do it correctly -- "Maße" matches "MASSE" (or
"MASZE").

Sorry, but I'm going to have to interject here --

Let's back up for a sec -- the things that we are dealing with are called
'strings' right -- and that is just a shortened term for an 'array of
characters'. We aren't dealing with some type called 'word', so what happens
at the word level has absolutely no relevance.

If a character in a character set has an upper case and lower case equivalent,
then that is used in case transformation. If it doesn't, then it isn't
transformed. Plain, simple, and quite easy to understand.

Forget this MASSE word. Speak strictly on the ß character:

What is the upper case _character_ of ß?

What is the lower case _character_ of ß?

If the answer is that there isn't an upper or lower case, then:

ß == toupper('ß') && ß == tolower('ß')

*regardless* of adjacent characters (that may be interpreted as a 'word').


Anthony Williams

2004-08-24 19:47:30 UTC

There ARE contexts where case insensitivity makes sense, but the
only ones I can think of are when the character set is limited to
straight ASCII.

Most simple text searching operations can involve case insensitivity.

Most simple text searching operations should involve case
insensitivity. And do it correctly -- "Maße" matches "MASSE" (or
"MASZE").

There are cases where the upper or lower case equivalents are not a single
character. There are also cases where the transformation is not reversible,
since there are multiple lower case characters with the same upper case
character.

There are also cases where the upper or lower case equivalent depends on the
current locale, and/or the context of the rest of the word/sentence ---
e.g. lower case sigma is different at the end of a word than in the middle, and
the upper case character for 'i' depends on the language.

If you mean to disregard all these cases, then you *can* define a simplified
toupper/tolower, where characters without a *simple* translation are left
as-is. Whether this is then useful is another question.
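
As a rough sketch of such a simplified conversion: the functions below (simple_upper and simple_upper_copy, both names invented for this example) assume an ASCII-compatible execution character set and deliberately touch only 'a'..'z'. Every other byte, including the bytes of a UTF-8 encoded ß, is left as-is.

```cpp
#include <cassert>
#include <string>

// Assumes an ASCII-compatible execution character set.
// Only 'a'..'z' have a "simple" translation here; all else is unchanged.
char simple_upper(char c) {
    return (c >= 'a' && c <= 'z') ? static_cast<char>(c - 'a' + 'A') : c;
}

std::string simple_upper_copy(std::string s) {
    for (char& c : s) c = simple_upper(c);
    return s;
}

void simple_upper_demo() {
    assert(simple_upper_copy("Masse") == "MASSE");
    // "Ma\xC3\x9F" "e" is "Maße" encoded as UTF-8; the two ß bytes survive,
    // so the result is not "MASSE".
    assert(simple_upper_copy("Ma\xC3\x9F" "e") == "MA\xC3\x9F" "E");
}
```

Whether the second result is acceptable is exactly the "is this then useful" question.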

Post by Julie
What is the upper case _character_ of ß?

_two_ characters --- SS

Post by Julie
What is the lower case _character_ of ß?

ß is lower case.

Anthony

--
Anthony Williams
Senior Software Engineer, Beran Instruments Ltd.


James Hopkin

2004-08-24 22:35:02 UTC

Post by Julie
What is the upper case _character_ of ß?
What is the lower case _character_ of ß?

The answers to your questions are upper case: SS, lower case: ß

At least, I believe that's most common, if not obligatory.

Similarly, ö is often capitalised as OE.

That being the case in German, I can well believe there are similar
situations in other alphabetic languages, where there isn't a
one-to-one mapping between lower- and upper-case.

James


k***@gabi-soft.fr

2004-08-24 22:41:32 UTC

Post by k***@gabi-soft.fr
There ARE contexts where case insensitivity makes sense, but
the only ones I can think of are when the character set is
limited to straight ASCII.

Most simple text searching operations can involve case
insensitivity.

Most simple text searching operations should involve case
insensitivity. And do it correctly -- "Maße" matches "MASSE" (or
"MASZE").

Actually, in C++, it's just a standard term for an array of small
integers. C++ doesn't have a character type.

Post by Julie
We aren't dealing with some type called 'word', so what happens at
the word level has absolutely no relevance.

The question is: are we or are we not dealing with text? If we are
dealing with text, then we treat it as text. If we aren't, then what do
upper and lower case mean?

Post by Julie
If a character in a character set has an upper case and lower case
equivalent, then that is used in case transformation. If it doesn't,
then it isn't transformed. Plain, simple, and quite easy to understand.

For a programmer. For a user who does a search on "Maße", and doesn't
find "MASSE", it's impossible to understand.

Post by Julie
What is the upper case _character_ of ß?

"SS". Or "SZ", in some contexts, but I suspect that you could get away
with "SS".

Post by Julie
What is the lower case _character_ of ß?

"ß"

Post by Julie
If the answer is that there isn't an upper or lower case,

If the answer is that the alphabet in question doesn't have case, then
there is no problem. The problem is that 'ß' is lower case, and that
its upper case equivalent requires two characters.

(Actually, it's more subtle than that. At least one character set has a
character 'SS', a single character that when typeset looks exactly like
two S's. In that character set, toupper( 'ß' ) works.)

Post by Julie
ß == toupper('ß') && ß == tolower('ß')
*regardless* of adjacent characters (that may be interpreted as a 'word').

It has nothing to do with words. It has to do with the fact that the
upper case variant of one letter might require more than one letter.
There's also the fact that the upper case equivalent is definitely locale
specific -- different locales have different rules.

--
James Kanze GABI Software http://www.gabi-soft.fr
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34


Daniel Krügler (ne Spangenberg)

2004-07-30 15:34:54 UTC

Hello Julie.

Please give an *exact* specification of what you understand by
"case insensitive comparison of std::strings". Take into consideration
that in German, "MASSE" and "Masse" should only compare equal if they both
mean "mass", but not if they mean "measures".
Oh, and that's only in some countries, such as Germany and Austria. Here in
Switzerland, they should always compare equal.

a - lower case 'A'
A - upper case 'A'
case-insensitive comparison - a == A
Remember, this operates on _characters_, not words. It doesn't matter if MASSE
and Masse are considered different words in different countries -- in that
case, you wouldn't do a case insensitive comparison. For those countries where
case doesn't determine the word, case insensitive comparisons would be
appropriate.
All of this, however, is at the *option* of the programmer. Right now, there
isn't an intrinsic way to compare std::string in a case-insensitive way.
Having that capability would be beneficial, boost offerings aside and
localities aside.

I don't think that you can ignore locales for any proper case-insensitive
comparison which acts on general strings (and not on specially constrained
strings, which might be limited to some special code set). I can say that
because I once made the same error (and I actually **should** have known
better due to my national origin...).
Consider languages (e.g. German) which don't have a unique
character-by-character mapping (e.g. sz, which is the character ß in my
code page, and maps to ss). Additionally there exist circumstances where an
umlaut can validly be compared by a two-character representation
(e.g. ü -> ue). So I don't think that the C++ standard should provide any
half-baked solution.

If you are limited to special character codes, you can write a quite
general solution by writing a special char_traits class and using this
traits class in the std::basic_string<> class template. Have a look at

http://www.gotw.ca/gotw/029.htm

to see what I mean in detail.
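
For readers without the link at hand, the traits approach can be sketched roughly as follows. This is a minimal sketch in the spirit of that article, not a copy of it: the names ci_char_traits and ci_string are chosen here, and the code relies on std::toupper, so it only behaves sensibly for single-byte characters in the current C locale.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Traits class that compares characters after upper-casing them.
struct ci_char_traits : std::char_traits<char> {
    // Go through unsigned char so that negative char values are safe
    // to pass to std::toupper.
    static int up(char c) {
        return std::toupper(static_cast<unsigned char>(c));
    }
    static bool eq(char a, char b) { return up(a) == up(b); }
    static bool lt(char a, char b) { return up(a) < up(b); }

    static int compare(const char* a, const char* b, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i) {
            if (lt(a[i], b[i])) return -1;
            if (lt(b[i], a[i])) return 1;
        }
        return 0;
    }
    static const char* find(const char* s, std::size_t n, char c) {
        for (std::size_t i = 0; i < n; ++i)
            if (eq(s[i], c)) return s + i;
        return nullptr;
    }
};

typedef std::basic_string<char, ci_char_traits> ci_string;

void ci_string_demo() {
    assert(ci_string("Hello") == ci_string("hELLO"));
    assert(ci_string("Hello World").find("WORLD") == 6);
}
```

Note that ci_string is a different type from std::string, so the two do not mix freely (no implicit conversion, no shared stream operators); that is one cost of this design.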

Greetings from Bremen,

Daniel


Julie

2004-07-31 03:13:50 UTC

Post by Daniel Krügler (ne Spangenberg)
I don't think that you can ignore locales for any proper case-insensitive
comparison which acts on general strings (and not on specially constrained
strings, which might be limited to some special code set). I can say that
because I once made the same error (and I actually **should** have known
better due to my national origin...).

You are absolutely correct, locales must be taken into consideration, if and
when case-insensitive comparators are provided. My previous comment about
'locales aside' was merely to discuss the value of case-insensitive
comparisons, excluding specifics, but may have been a little too unrestricted to
really convey my comments.

Post by Daniel Krügler (ne Spangenberg)
Consider languages (e.g. German) which don't have a unique
character-by-character mapping (e.g. sz, which is the character ß in my
code page, and maps to ss). Additionally there exist circumstances where an
umlaut can validly be compared by a two-character representation
(e.g. ü -> ue). So I don't think that the C++ standard should provide any
half-baked solution.

Absolutely. A locale-specific case-insensitive comparator may be far from
trivial to implement. In cases where it can't be implemented due to
locale-specific context issues, then that comparator is simply not available.
In those locales where it can be implemented, then it is provided. I don't
consider this half-baked, simply providing what _can_ be provided, rather than
an 'all or nothing' approach.


Allan W

2004-08-02 22:53:00 UTC

Post by Julie
Absolutely. A locale-specific case-insensitive comparator may be far from
trivial to implement. In cases where it can't be implemented due to
locale-specific context issues, then that comparator is simply not
available. In those locales where it can be implemented, then it is
provided. I don't consider this half-baked, simply providing what
_can_ be provided, rather than an 'all or nothing' approach.

I hope the problems with this approach are apparent.

If the standard says that such a comparator *MAY* be made available
by an implementation, this implies that it might *NOT* be available.
Which means that your program can't assume that it exists on all
compliant platforms. Which means that your portable program can't
use it.

The workaround would be to have the standard specify a preprocessor
symbol that says if the comparator is available or not. Then your
program could use the library version if it is available, otherwise
it could roll its own...

But if you're able to roll your own for the cases where it's needed,
why can't you just roll your own 100% of the time? It's actually
LESS work to do this (because you don't have to muck around with
preprocessor directives).


Julie

2004-08-04 10:03:55 UTC

Post by Allan W

I hope the problems with this approach are apparent.

<snip>

Well, to be honest, none of this discussion relating to case
conversion/comparison for some languages is all that clear. The explanations
have been weak and far from enlightening, and my character translation
experience is pretty much limited to ASCII where upper/lower case is well
defined as far as I'm concerned.

Presumably there are more than just a few out there that need case insensitive
comparisons? What do they do?

- Write their own std::string comparator?

- Use a platform/compiler specific case-insensitive string comparator function
such as stricmp?

- Use some third-party library/Boost?

- ???


k***@gabi-soft.fr

2004-08-05 14:16:51 UTC

Post by Allan W

Post by Julie
Absolutely. A locale-specific case-insensitive comparator may be
far from trivial to implement. In cases where it can't be
implemented due to locale-specific context issues, then that
comparator is simply not available. In those locales where it can
be implemented, then it is provided. I don't consider this
half-baked, simply providing what _can_ be provided, rather than
an 'all or nothing' approach.

I hope the problems with this approach are apparent.

The explanations have largely been based on examples, I think. There's
no real theory behind it -- natural language conventions don't follow
rigorous mathematical rules which can be logically explained. The only
important thing to note is that case conversion is not necessarily a
bijection, and the case insensitive comparison isn't a well defined
operation.

Post by Julie
Presumably there are more than just a few out there that need case
insensitive comparisons? What do they do?
- Write their own std::string comparator?
- Use a platform/compiler specific case-insensitive string
comparator function such as stricmp?
- Use some third-party library/Boost?
- ???

The first thing I always do is define what I want. I think the main
point of many of us posting here is that the expression "case
insensitive comparison" is not an adequate specification to begin
anything; it leaves a lot of questions unanswered. So the first thing
is to actually define what the application needs. The needs of a Pascal
compiler (which uses a very limited set of input characters) are
different from those of a database of German book titles. Once I define
what is actually needed, I then see if anything existing will do the
job. If it will, I use it. If it won't, I write what is needed.

For more information, you might want to look at some of the Unicode
technical reports (http://www.unicode.org/unicode/reports/index.html);
UTS 10 (http://www.unicode.org/unicode/reports/tr10/) is particularly
relevant. In fact, if your concerns are collating or comparing text in
a natural language (including English), I would consider it necessary
reading -- even in English, you would want "naïve" == "NAIVE".

For artificial languages (e.g. Pascal, SQL, Windows filenames), the
problem is usually much simpler, and a simple one to one mapping of
lower case characters to upper case characters is often sufficient.
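
For that simpler case, a one-to-one mapping is enough. The sketch below assumes the input is restricted to ASCII (as keywords and identifiers of such languages typically are); ascii_lower and ascii_iequals are names made up for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <string>

// Assumes the strings contain only ASCII; other bytes compare exactly.
inline char ascii_lower(char c) {
    return (c >= 'A' && c <= 'Z') ? static_cast<char>(c - 'A' + 'a') : c;
}

// Equal if the lengths match and every pair of characters matches
// after the one-to-one ASCII case mapping.
bool ascii_iequals(const std::string& a, const std::string& b) {
    return a.size() == b.size() &&
           std::equal(a.begin(), a.end(), b.begin(),
                      [](char x, char y) {
                          return ascii_lower(x) == ascii_lower(y);
                      });
}

void ascii_iequals_demo() {
    assert(ascii_iequals("SELECT", "select"));
    assert(!ascii_iequals("begin", "begins"));
}
```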

--
James Kanze GABI Software http://www.gabi-soft.fr
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34


k***@gabi-soft.fr

2004-08-05 03:23:35 UTC

Post by Allan W

Post by Julie
Absolutely. A locale-specific case-insensitive comparator may be far
from trivial to implement. In cases where it can't be implemented
due to locale-specific context issues, then that comparator is
simply not available. In those locales where it can be implemented,
then it is provided. I don't consider this half-baked, simply
providing what _can_ be provided, rather than an 'all or nothing'
approach.

I hope the problems with this approach are apparent.

They are:-).

Post by Allan W
If the standard says that such a comparator *MAY* be made available by
an implementation, this implies that it might *NOT* be available.
Which means that your program can't assume that it exists on all
compliant platforms. Which means that your portable program can't use
it.

I agree, but that IS what the standard says. Furthermore, it says that
if it is available, you don't know the name of it, and if you try and
use it, and it isn't available, or you get the wrong name, you get a
run-time exception (and not a compiler error).

Personally, I find it an awkward situation, and it has really caused me
problems. (It caused even more problems because one compiler wasn't
conforming -- if the service wasn't available, it just did something else,
rather than tell me.)

Anyway, with the correct locales installed under Solaris, something
like:
std::sort( v1, v2, std::locale( "de_DE" ) ) ;
should work. (I can't test it, because someone removed all of the
locales on my machine.) The problem is, although exactly the same
functionality is available under Windows (again, perhaps dependent on
the installation of some particular software), the string constant is
probably different -- worse, I have no idea what it should be.
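
A hedged sketch of how one might live with that today: try the named locale, and fall back to the classic "C" locale when constructing it throws std::runtime_error. The helper names locale_or_classic and sort_with_locale are invented for this example, the locale name passed in is whatever your platform happens to use, and falling back silently is only one possible policy.

```cpp
#include <algorithm>
#include <cassert>
#include <locale>
#include <stdexcept>
#include <string>
#include <vector>

// Returns the named locale if the runtime provides it, else the classic
// "C" locale. Availability is only known at run time, as discussed above.
std::locale locale_or_classic(const char* name) {
    try {
        return std::locale(name);
    } catch (const std::runtime_error&) {
        return std::locale::classic();
    }
}

// std::locale is itself usable as a comparison object for strings:
// its operator() compares via the locale's collate facet.
void sort_with_locale(std::vector<std::string>& v, const std::locale& loc) {
    std::sort(v.begin(), v.end(), loc);
}

void locale_sort_demo() {
    std::vector<std::string> v = {"b", "a", "C"};
    const std::locale loc = locale_or_classic("de_DE");
    sort_with_locale(v, loc);
    assert(std::is_sorted(v.begin(), v.end(), loc));
}
```

Note that this gives locale-aware ordering, not case insensitivity as such; how case is weighted depends entirely on the locale's collation rules.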

Post by Allan W
The workaround would be to have the standard specify a preprocessor
symbol that says if the comparator is available or not.

The problem is that, at least in the Unix world (but I think that the
situation is similar under Windows), whether the functionality is
available depends on what is or is not installed on the machine where
the code is run, and not on the machine where it is compiled. Ideally,
supposing the Posix naming convention, one would like to see all
combinations of language and country available; practically, the demand
for something like "eu_AL" (Basque, as used in Albania) is small enough
that I'm sure it will never be supported. And how could an
implementation pretend to support "zh_CN" (Chinese) for std::string?

In sum, the current situation is totally unacceptable, but I'm not sure
what is both acceptable and reasonably possible. So until I can propose
a workable alternative, I'm living with it.

Post by Allan W
Then your program could use the library version if it is available,
otherwise it could roll its own...
But if you're able to roll your own for the cases where it's needed,
why can't you just roll your own 100% of the time? It's actually LESS
work to do this (because you don't have to muck around with
preprocessor directives).

The problem is that most programmers can't roll their own. The whole
point (well, one major point) of having locales is that the programmer
doesn't know all of the rules for all of the locales which he will have
to support.

--
James Kanze GABI Software http://www.gabi-soft.fr
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34


Vinayak Raghuvamshi

2004-07-29 21:20:31 UTC

Post by Ganesh
It is a surprise to most of the "common" C++ programmers to learn that
std::string provides no simple way of doing case-insensitive
comparison.

Well, isn't everything case sensitive in C++? so why surprised at
strings being treated in case sensitive manner? :-)

STL is kind of saying "hey, strings and everything else are case
sensitive in C++, but you can replace any of my methods with your own
in a pluggable manner...". I think it is fair enough...

Simple way of doing case-insensitive comparison?

stricmp(dest.c_str(),src.c_str());

Sorry, I know my response doesn't help much, and I wish I had a better
answer....

-Vinayak

Post by Ganesh
http://www.freshsources.com/bjarne/ALLISON.HTM
http://www.josuttis.com/libbook/string/icstring.hpp.html
Given that case-insensitive comparison is such a common operation,
shouldn't it be made available within C++ standard library instead of
leaving it to the programmers to re-write such commonly used
functionality?


Julie

2004-07-30 09:34:49 UTC

Post by Ganesh
It is a surprise to most of the "common" C++ programmers to learn that
std::string provides no simple way of doing case-insensitive
comparison.

Well, isn't everything case sensitive in C++? so why surprised at
strings being treated in case sensitive manner? :-)
STL is kind of saying "hey, strings and everything else are case
sensitive in C++, but you can replace any of my methods with your own
in a pluggable manner...". I think it is fair enough...
Simple way of doing case-insensitive comparison?
stricmp(dest.c_str(),src.c_str());
Sorry, I know my response doesn't help much, and I wish I had a better
answer....

Yes, bad answer.

C++, the *language* is case sensitive, but strings/character arrays typically
represent real-world words, which are by default, not case sensitive. Mixing
the two completely separate notions is flawed.


Vinayak Raghuvamshi

2004-07-30 16:01:50 UTC

Post by Ganesh
It is a surprise to most of the "common" C++ programmers to learn that
std::string provides no simple way of doing case-insensitive
comparison.

Well, isn't everything case sensitive in C++? so why surprised at
strings being treated in case sensitive manner? :-)
STL is kind of saying "hey, strings and everything else are case
sensitive in C++, but you can replace any of my methods with your own
in a pluggable manner...". I think it is fair enough...
Simple way of doing case-insensitive comparison?
stricmp(dest.c_str(),src.c_str());
Sorry, I know my response doesn't help much, and I wish I had a better
answer....

Depends on what your notion of "real-world words" is...
The file systems of Most OSes are case sensitive.
Usernames/Passwords Used by Almost All systems are case sensitive.

As a developer, it actually helps to work in an environment that keeps
reminding you that the whole world is not case in-sensitive.

I agree that I could not provide a good "solution" to the original
poster, but nevertheless, I dO BElieVE THat MoSt rEAl-woRLd
apPlIcAtIoNS arE caSe sEnSItIve....

-Vinayak


Jeff Flinn

2004-07-31 03:05:01 UTC

Post by Julie
C++, the *language* is case sensitive, but strings/character arrays typically
represent real-world words, which are by default, not case sensitive. Mixing
the two completely separate notions is flawed.

Depends on what your notion of "real-world words" is...
The file systems of Most OSes are case sensitive.

Most OSes does not equate to most systems in use.

Jeff F


Julie

2004-07-31 03:45:42 UTC

Post by Vinayak Raghuvamshi
As a developer, it actually helps to work in an environment that keeps
reminding you that the whole world is not case in-sensitive.

You may want to get out of your cubicle and look around. Computers aside, look
around.

Are you confused when you read "MILK" on a carton rather than the more common
"Milk"? Presumably no.

Do you talk to others as:

"cap H - hello period cap W what's new?" Again, presumably no.

Post by Vinayak Raghuvamshi
I agree that I could not provide a good "solution" to the original
poster, but nevertheless, I dO BElieVE THat MoSt rEAl-woRLd
apPlIcAtIoNS arE caSe sEnSItIve....

If your statement were correct, then your eXaMPle wouldn't make any sense,
strictly because of case.

Look around, most real-world situations are case insensitive.

Finally, all hyperbole aside, what is the problem with _providing_ an intrinsic
mechanism to do case-insensitive comparisons?


Vinayak Raghuvamshi

2004-08-03 11:38:22 UTC

Post by Julie
You may want to get out of your cubicle and look around. Computers aside, look
around.

I just did. And I did not find any std::strings "out there.." :-)

Post by Julie
Are you confused when you read "MILK" on a carton rather than the more common
"Milk"? Presumably no.

Well no. But I do consider it a bit odd when I see a sentence typed as
plEase dRink mILk, rELax aND gEt A lIFE.....

As someone rightly said in a reply to your other comments, case
sensitivity or insensitivity depends on the context.

Post by Julie
"cap H - hello period cap W what's new?" Again, presumably no.

Well no. And I do not use a std::string to "talk" to others. I am sure
you write "cap H - hello period cap W what's new?", though...

Post by Vinayak Raghuvamshi
I agree that I could not provide a good "solution" to the original
poster, but nevertheless, I dO BElieVE THat MoSt rEAl-woRLd
apPlIcAtIoNS arE caSe sEnSItIve....

If your statement were correct, then your eXaMPle wouldn't make any sense,
strictly because of case.

My example was meant to emphasize that case DOES make sense even for
normal, everyday sentences. It was also an effort at some humor....
:-)

Post by Julie
Finally, all hyperbole aside, what is the problem with _providing_ an intrinsic
mechanism to do case-insensitive comparisons?

I never said that the STL should not provide one. But I just don't see
anything outrageous about the fact that it doesn't. The STL provides a core
set of features that can be infinitely expanded. There are libraries
like Boost that are built around and over the STL that you can use to get
these features if you do not want to build them on your own....

I just don't see any reason to get emotional about the fact that the STL
does not provide case-insensitive strings. The world is case sensitive
or insensitive depending on where you look. Again, as someone rightly
pointed out, the prime factor is the context...

Anyways, I guess we have beaten the problem to death and we could as
well have implemented a case insensitive string compare by providing
our own char traits in a fraction of the time that we spent typing out
all these case sensitive messages.. :-)
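
For reference, the char-traits approach mentioned above can be sketched roughly as follows. This is only an illustration in the spirit of the well-known ci_string idiom, not code from this thread: the names ci_char_traits, ci_string and ci_string_example are made up here, C++11 is assumed, and folding with std::toupper assumes a single-byte character set where case mapping is one-to-one.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical traits class: compares characters after folding them with
// std::toupper in the current C locale. Only one-to-one mappings are handled.
struct ci_char_traits : public std::char_traits<char> {
    static unsigned char fold(char c) {
        return static_cast<unsigned char>(
            std::toupper(static_cast<unsigned char>(c)));
    }
    static bool eq(char a, char b) { return fold(a) == fold(b); }
    static bool lt(char a, char b) { return fold(a) < fold(b); }
    static int compare(const char* s1, const char* s2, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i) {
            if (lt(s1[i], s2[i])) return -1;
            if (lt(s2[i], s1[i])) return 1;
        }
        return 0;
    }
    static const char* find(const char* s, std::size_t n, char c) {
        for (std::size_t i = 0; i < n; ++i) {
            if (eq(s[i], c)) return s + i;
        }
        return nullptr;
    }
};

// Hypothetical alias for a string type whose comparisons ignore case.
typedef std::basic_string<char, ci_char_traits> ci_string;

// Small usage check: equality, ordering and find all go through the traits.
inline void ci_string_example() {
    ci_string a("Hello");
    ci_string b("hELLO");
    assert(a == b);
    assert(a.find('L') == 2);
    assert(ci_string("abc") < ci_string("ABD"));
}
```

Note that ci_string is a distinct type from std::string, so the two do not mix in comparisons or stream output without a conversion, which is one practical drawback of this approach.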

Relax, and Peace....

-Vinayak


Andrea Griffini

2004-08-05 10:53:40 UTC


Alf P. Steinbach

2004-08-05 14:15:13 UTC

Post by Andrea Griffini
Imagine asking for a glass of water. And imagine
that the bartender starts discussing ad infinitum
exactly what "a glass" means (it's clear that
in various countries the average glass size is quite
different, by several percentage points!!... and don't
expect that big/small will be enough to get out of
that) and exactly what you mean by water (and this
can't be simply gas/no gas... because you surely
know that there are a jillion different types of
water that do not taste exactly the same).
Hey!!... maybe it's easier if you fill out a form
about what kind of water you're looking for, giving
the chemical properties and the temperature you would
like it to be (and of course this can't be just
"cold" or not... as it's clear that "cold" is both
subjective and context dependent).
Now let me guess what your reaction would be...
Probably the reaction would be just leaving the pub
babbling something like "geesh, you're crazy" or,
if you have a gun and are really really thirsty,
stuffing your gun up the nose of the bartender and
saying in a warm calm voice "now I'll count to ten...".
In my opinion a newbie reading this discussion about
converting a string to uppercase or removing trailing
spaces from a string will have the strong temptation
to just leave the language. And who can blame him
or her for that? In my opinion common sense left
this dark area of C++ a long time ago.

Applause!

But also, there is a difference in that the standard library is
more like the organization that provides tap water to the city,
and exact standards must be defined and guaranteed.

Common sense is to choose a sensible, practical set of standards
and focus on the guarantee/delivery bit; but as you've noted
discussions tend to instead focus on choosing the most impractical
and unusable but in some academic sense "perfect" set of standards
while using the fact that such perfection cannot be guaranteed or
even generally achieved as argument to not provide anything at all.

For what it's worth, I think the practical set of standards should
be character code oriented (forget about locales and all that stuff),
which is essentially what Julie suggested before getting bogged down
in demands for definitions of "glass", "water", "temperature" etc.

If the character code provides a unique uppercase character, then
that's it (regardless of idiosyncrasies of English, German or for
that matter Norwegian); otherwise, leave the character as-is. This
means that tolower(toupper(s)) == tolower(s) does not hold in general.
And that's very very very OK, because that's how it Really Is (TM).

k***@gabi-soft.fr

2004-08-06 15:13:49 UTC

Post by Vinayak Raghuvamshi
I never said that the STL should not provide one. But I just don't see
anything outrageous about the fact that it doesn't. The STL provides a
core set of features that can be infinitely expanded. There are
libraries like Boost that are built around and over the STL that you can
use to get these features if you do not want to build them on your
own....

Post by Andrea Griffini
This thread really reminds me of the one about the ability to trim
trailing or leading spaces from a string. No. The standard library does
not provide that "exotic" feature either, and you must code one
yourself.
If you go looking back at that thread you'll find a lot of explanations:
- pointless
- not well defined
- locale dependent
- not the job of std::string
- immoral
- uncool
Now do this experiment...
Imagine asking for a glass of water. And imagine that the bartender
starts discussing ad infinitum exactly what "a glass" means (it's
clear that in various countries the average glass size is
quite different, by several percentage points!!... and don't expect
that big/small will be enough to get out of that) and exactly what you
mean by water (and this can't be simply gas/no gas... because you
surely know that there are a jillion different types of water that
do not taste exactly the same). Hey!!... maybe it's easier if
you fill out a form about what kind of water you're looking for, giving
the chemical properties and the temperature you would like it to be
(and of course this can't be just "cold" or not... as it's clear that
"cold" is both subjective and context dependent).
Now let me guess what your reaction would be...
Probably the reaction would be just leaving the pub babbling something
like "geesh, you're crazy" or, if you have a gun and are really really
thirsty, stuffing your gun up the nose of the bartender and saying
in a warm calm voice "now I'll count to ten...".

Now that's an interesting example. Because in France, at least, if you
ask for water in a restaurant, the first thing the waiter is likely to
do is ask you what kind. I don't see how it could be otherwise, since
both sparkling and flat water are widespread. In Germany or in Italy, he
will automatically bring you a bottle of the house brand mineral water
(always with gas). Whereas in America, it will be tap water with lots
of ice.

In sum, no reasonable person would expect a simple solution for an
incomplete question.

Post by Andrea Griffini
In my opinion a newbie reading this discussion about converting a
string to uppercase or removing trailing spaces from a string will
have the strong temptation to just leave the language. And who can
blame him or her for that? In my opinion common sense left this
dark area of C++ a long time ago.

Is it a lack of common sense to want to know what the function should do
before trying to find it? The C++ standard DOES have a function for
case insensitive comparison of strings: std::collate::compare.
Obviously, it's a template function (since it has to deal with char and
wchar_t); obviously, it is in the locale section (since, like water,
what one intuitively expects from "case" depends on local conventions).
And just as obviously, the user can supply additional versions for
himself, since this is definitely a case where one size doesn't fit all.

(Removing trailing spaces is a different issue -- it is locale
dependent, but other than that, I don't see any real problems.)
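
To illustrate the remark about trailing spaces, here is a minimal sketch of a locale-aware right trim. The names trim_right and trim_right_example are hypothetical (nothing like them exists in the standard library discussed here); the only standard pieces used are std::isspace from <locale> and std::string::erase, and the default argument assumes that the global locale is the one wanted.

```cpp
#include <cassert>
#include <locale>
#include <string>

// Hypothetical helper: removes trailing characters that the given locale
// classifies as whitespace. Defaults to a copy of the global locale.
inline std::string trim_right(std::string s,
                              const std::locale& loc = std::locale()) {
    std::string::size_type end = s.size();
    while (end > 0 && std::isspace(s[end - 1], loc)) {
        --end;
    }
    s.erase(end);  // drops everything from position end onwards
    return s;
}

// Small usage check with the classic "C" locale passed explicitly.
inline void trim_right_example() {
    assert(trim_right("abc \t\n", std::locale::classic()) == "abc");
    assert(trim_right("a b", std::locale::classic()) == "a b");
}
```

What counts as a space is decided entirely by the locale's ctype facet, which is the locale dependence referred to above.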

--
James Kanze GABI Software http://www.gabi-soft.fr
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34


Andrea Griffini

2004-08-08 01:45:39 UTC

Post by k***@gabi-soft.fr
In sum, no reasonable person would expect a simple solution for an
incomplete question.

Every question is incomplete; it all boils down to where you want
to stop adding details. Common sense means you stop asking for
details at least before whoever is talking to you is tempted to
punch you in the nose.

If someone asks you "where is north", I wonder if you are
going to say "do you mean geographic north, magnetic north
or geomagnetic north?". You can really go on forever... even
after establishing which of the three major north poles one is
concerned with, the question is far from being "complete".
For example, supposing the subject is magnetic north, a
question could be "Are you asking what direction a compass would
be pointing in (hence considering local magnetic field modifications),
or what geodesic line would pass through here and the north magnetic
pole, supposing the earth is a sphere (so you're interested
in where the pole is)?".

But my guess is that your nose would be already bleeding by then.

Post by k***@gabi-soft.fr
Is it a lack of common sense to want to know what the function should do
before trying to find it?

Lack of common sense is the absence of "s.upper()" or "upper(s)"
working on std::string by default. It would of course have been
OK to be able to handle the complexity needed for Chinese ... but
ONLY if that wasn't going to get in the way where it's not needed.

To me it's evident (and Francis confirmed) that the problem is
the "committee effect", which required avoiding the assumption that
American English should be the "default". Or that anything
was going to be the default, because that would have been
"unfair" to the others.

The situation closely reminds me of the TIFF file format
situation... where, because it would have been "unfair" to
choose between big-endian and little-endian, the totally nonsensical
solution is that the file starts with a byte-order marker that tells
whether the rest will use the little-endian or big-endian representation.
With the net result that now BOTH little-endian and big-endian
architectures have added complexity when reading those files,
and that writing portable code handling TIFF files is harder
because you'll have BOTH the compile-time endianness problem
AND the run-time endianness problem.
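
As a rough illustration of the run-time half of that problem: a TIFF header begins with the two bytes "II" (little-endian) or "MM" (big-endian), and every multi-byte field after that has to be decoded accordingly. The sketch below assumes C++11; the names ByteOrder, tiff_byte_order, read_u16 and tiff_example are invented for this example and are not part of any TIFF library.

```cpp
#include <cassert>
#include <cstdint>

enum class ByteOrder { Little, Big, Unknown };

// Inspects the first two header bytes: "II" means little-endian,
// "MM" means big-endian. Anything else is reported as Unknown.
inline ByteOrder tiff_byte_order(const unsigned char* header) {
    if (header[0] == 'I' && header[1] == 'I') return ByteOrder::Little;
    if (header[0] == 'M' && header[1] == 'M') return ByteOrder::Big;
    return ByteOrder::Unknown;
}

// Decodes a 16-bit value byte by byte, so the result does not depend on
// the endianness of the machine running the code. Callers are expected
// to reject ByteOrder::Unknown first; it is decoded as big-endian here.
inline std::uint16_t read_u16(const unsigned char* p, ByteOrder order) {
    return order == ByteOrder::Little
               ? static_cast<std::uint16_t>(p[0] | (p[1] << 8))
               : static_cast<std::uint16_t>((p[0] << 8) | p[1]);
}

// Small usage check: the magic number 42 follows the byte-order mark.
inline void tiff_example() {
    const unsigned char le[] = {'I', 'I', 42, 0};
    const unsigned char be[] = {'M', 'M', 0, 42};
    assert(read_u16(le + 2, tiff_byte_order(le)) == 42);
    assert(read_u16(be + 2, tiff_byte_order(be)) == 42);
}
```

Decoding byte by byte sidesteps the compile-time question of host endianness, but the run-time branch on the file's byte order remains for every field.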

IMO drawing straws would have been a better solution. By far.

Post by k***@gabi-soft.fr
The C++ standard DOES have a function for case insensitive
comparison of strings: std::collate::compare.

But no s.upper() or upper(s) ... because that would be

- pointless
- not well defined
- locale dependent
- immoral
- uncool
- having it working for American English would be unfair
to languages where it's an unsolvable problem (IIUC
for German not even a dictionary could be enough... but
a syntactic analysis or even a semantic analysis of the
meaning of the text is required).

Post by k***@gabi-soft.fr
And just as obviously, the user can supply additional versions for
himself, since this is definitly a case where one size doesn't fit all.

I don't need it solved in the general case. I can solve it
to any extent I want if I have to. And I'm not forced to put
my solution in the frameset of the standard library.

Let me add that I probably wouldn't. Reading Herb Sutter's
Exceptional C++, Items 2 and 3, made it clear to me that I'll
stay as far as possible from that. My job is solving problems
using C++ as a tool, not fighting with C++ for the fun of it.

Lack of common sense is providing complex solutions (or
complex infrastructure where you should put your complex
solution) for complex cases, ignoring providing reasonable
simple solutions for simple cases.

Post by k***@gabi-soft.fr
(Removing trailing spaces is a different issue -- it is locale
dependant, but other than that, I don't see any real problems.)

But where are the trim functions in the standard library ?

Anyway I don't think that anything I may say would convince
you that there's lack of common sense in what C++ proposes.
If you can't see why the following is ludicrous

if ( std::use_facet< std::collate< char > >( std::locale() )
         .compare( s1.data(), s1.data() + s1.size(),
                   s2.data(), s2.data() + s2.size() ) == 0 ) ...

probably no amount of explanation would be enough.

Andrea


Bo Persson

2004-08-08 21:50:09 UTC

Post by Andrea Griffini

Post by k***@gabi-soft.fr
Is it a lack of common sense to want to know what the function should do
before trying to find it?

It would work fine for Chinese (in a way), because they don't even have
the concept of case.

Post by Andrea Griffini
To me it's evident (and Francis confirmed) that the problem is
the "committee effect", which required avoiding the assumption that
American English should be the "default". Or that anything
was going to be the default, because that would have been
"unfair" to the others.

To me at least, it seems utterly silly to have an ISO standard demand
functions that only work for US English. (Yes, I know about the C
library!)

Post by Andrea Griffini

Post by k***@gabi-soft.fr
The C++ standard DOES have a function for case insensitive
comparison of strings: std::collate::compare.

But no s.upper() or upper(s) ... because that would be
- pointless

This is the closest. We already have a bunch of totally useless
character classification functions in the C library. Why add more of
those to the C++ library?

Post by Andrea Griffini
- not well defined
- locale dependent
- immoral
- uncool
- having it working for American English would be unfair
to languages where it's an unsolvable problem (IIUC
for German not even a dictionary could be enough... but
a syntactic analysis or even a semantic analysis of the
meaning of the text is required).

This is an ISO standard. Why add US-only functions to that?

Perhaps the ANSI version of the standard could add those?

Bo Persson


James Kanze

2004-08-08 22:13:53 UTC

Andrea Griffini <***@tin.it> writes:

|> On 6 Aug 2004 11:13:49 -0400, ***@gabi-soft.fr wrote:

|> >In sum, no reasonable person would expect a simple solution for an
|> >incomplete question.

|> Every question is incomplete; it all boils down where you want to
|> stop adding details. Common sense is trying to stop asking details
|> at least before who is talking to you is tempted to hit your nose
|> with a punch.

|> If someone asks you "where is north" now I wonder if you are going
|> to say "do you mean geographical north, magnetical north or
|> geomagnetical north ?". You can really go forever... even
|> establishing which of the three major north pole one is concerned
|> about the question is far from being "complete". For example
|> supposing the subject is the magnetical north a question could be
|> "Are you asking what direction would a compass be pointing to (hence
|> considering local magnetic field modifications), or what geodesic
|> line would pass from here and the north magnetic pole supposing the
|> earth being a sphere (so you're interested in where the pole is) ?".

|> But my guess is that your nose would be already bleeding by then.

It's not quite equivalent. Everywhere I've ever been, if someone speaks
of "north", they mean geographical north. In most of the places I've
actually worked, however, there really are ambiguities concerning case
insensitive comparisons, and e.g. 'é' and 'E' are not considered equal
when comparing, say, filenames, but are when comparing other things.

For better or worse, just saying you want a case insensitive comparison
is NOT a sufficient specification to do anything about in French or
German. It is in English, and I think in Italian as well (although even
there, one might expect stricmp( "vertù", "VERTU" ) to report equality).

And it is a real fact that a significant number of users of C++ are not
working in English speaking environments.

|> >Is it a lack of common sense to want to know what the function
|> >should do before trying to find it?

|> Lack of common sense is the missing of "s.upper()" or "upper(s)"
|> working on std::string by default. It would have been of course ok
|> being able to handle complexity needed for chinese ... but ONLY if
|> that wasn't going to annoy where it's not needed.

I would argue that something like s.upper() or toUpper(s) would be a
good idea. I would also argue, however, that the actual signature
should be something like:

std::string::upper( std::locale const& = std::locale() ) ;

I do agree that there are many contexts where it is clear. I have
nothing against reasonable defaults.
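
A free-function version of that signature can be sketched with the existing std::ctype facet. The names to_upper and to_upper_example are hypothetical, and the sketch inherits the limitation raised elsewhere in this thread: std::ctype<char>::toupper maps one character to one character, so it cannot handle cases such as the German "ß".

```cpp
#include <cassert>
#include <locale>
#include <string>

// Hypothetical helper mirroring the suggested signature: the locale
// defaults to a copy of the global locale but can be overridden.
inline std::string to_upper(std::string s,
                            const std::locale& loc = std::locale()) {
    if (!s.empty()) {
        const std::ctype<char>& ct = std::use_facet<std::ctype<char> >(loc);
        // Converts the range [begin, end) in place, one character at a time.
        ct.toupper(&s[0], &s[0] + s.size());
    }
    return s;
}

// Small usage check with the classic "C" locale passed explicitly.
inline void to_upper_example() {
    assert(to_upper("http", std::locale::classic()) == "HTTP");
    assert(to_upper("Hello, World", std::locale::classic()) == "HELLO, WORLD");
}
```

The default argument gives the "reasonable default" asked for, while callers who care can still pass a specific locale.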

|> To me it's evident (and Francis confirmed) that the prolem is the
|> "committee effect" that required to avoid assuming that american
|> english should be the "default". Or that anything was going to be
|> the default because that would have been "unfair" for the others.

There is a political problem with a "default" of American English, at
least when the default can't be overridden. In this case, it would seem
to me that there is a good solution, which allows overriding, or even
setting the default to something else.

[...]
|> >The C++ standard DOES have a function for case insensitive
|> >comparison of strings: std::collate::compare.

|> But no s.upper() or upper(s) ... because that would be

|> - pointless
|> - not well defined
|> - locale dependent
|> - immoral
|> - uncool
|> - having it working for american english would be unfair
|> for languages where it's an unsolvable problem (IIUC
|> for german not even a dictionary could be enough... but
|> a syntax analysis or even an semantical analysis of the
|> meaning of the text is required).

More likely because despite the name, std::string really has very little
to do with text. It's just a glorified container for small integers. Or
whatever -- the standard says you can have std::basic_string<double>
(although it core dumps with g++ on Solaris).

I'll admit that I'd find even a limited toupper more use than
basic_string<double>. Precisely because of all the problems we've been
talking about -- you can't implement it using a character by character
translation, so it has to work on strings. IMHO, it must be locale
specific, but that's not really a problem.

On the other hand, it doesn't require any cool template
meta-programming, so I guess that's a good reason not to have it.

|> >And just as obviously, the user can supply additional versions for
|> >himself, since this is definitly a case where one size doesn't fit
|> >all.

|> I don't need it solved in the general case. I can solve it to any
|> extent I want if I have to. And I'm not forced to put my solution in
|> the frameset of the standard library.

|> Let me add that I probably wouldn't. Reading Herb Sutter's
|> exceptional C++ items 2 and 3 made clear for me that I'll stay as
|> far as possible from that. My job is solving problems using C++ as a
|> tool, not fighting with C++ for the fun of it.

Sounds like we have similar problems:-). My customers pay me for
working code, not for stress testing compilers.

Maybe the only difference is that I've really had to deal with "case
insensitive" look-ups involving "Maße":-). I'll admit that I'm very
sensitized to the problem. (And a quick glance at the thread shows that
almost all of the people asking for a more precise specification work or
have worked in German speaking areas. Probably not by chance.)

|> Lack of common sense is providing complex solutions (or complex
|> infrastructure where you should put your complex solution) for
|> complex cases, ignoring providing reasonable simple solutions for
|> simple cases.

Would you be talking about locale, by any chance?

|> >(Removing trailing spaces is a different issue -- it is locale
|> >dependant, but other than that, I don't see any real problems.)

|> But where are the trim functions in the standard library ?

Where is any support for text? Where is a true character type?

Where is networking? Where is a GUI?

|> Anyway I don't think that anything I may say would convince you that
|> there's lack of common sense in what C++ proposes.

If "convince" implies my changing my opinion, no. Because I'm already
convinced of it for a number of things: all of locale, or the
templatization of iostream or string, for example.

Still, it's the only standard we've got, and we can (and have to) live
with it. It could be worse.

|> If you can't see why the following is ludicrous

|> if ( std::use_facet< std::collate< char > >( std::locale() )
|> .compare( s1.data(), s1.data() + s1.size(),
|> s2.data(), s2.data() + s2.size() ) == 0 ) ...

|> probably no amount of explanation would be enough.

What's wrong with a simple wrapper?
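
Such a wrapper might look like the sketch below. The names collate_compare and collate_compare_example are made up for this illustration. One caveat worth stating plainly: what std::collate::compare treats as equal depends entirely on the locale supplied; in the classic "C" locale it is a plain character-by-character comparison, so "HTTP" and "http" do not compare equal there.

```cpp
#include <cassert>
#include <locale>
#include <string>

// Hypothetical wrapper around std::collate<char>::compare.
// Returns a negative value, zero, or a positive value, like strcmp.
inline int collate_compare(const std::string& a, const std::string& b,
                           const std::locale& loc = std::locale()) {
    const std::collate<char>& coll = std::use_facet<std::collate<char> >(loc);
    return coll.compare(a.data(), a.data() + a.size(),
                        b.data(), b.data() + b.size());
}

// Small usage check in the classic "C" locale.
inline void collate_compare_example() {
    const std::locale& c = std::locale::classic();
    assert(collate_compare("abc", "abc", c) == 0);
    assert(collate_compare("abc", "abd", c) < 0);
}
```

With this in place, the call site shrinks to something like `if (collate_compare(s1, s2) == 0)`.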

And to tell the truth: we're complaining about a lack of proper support
for text in C++. Did you, or anyone else, make a proposal? I know I
didn't, and the committee can't standardize something that hasn't even
been proposed.

Matt Austern

2004-08-14 03:04:56 UTC

Post by James Kanze
And to tell the truth: we're complaining about a lack of proper support
for text in C++. Did you, or anyone else, make a proposal? I know I
didn't, and the committee can't standardize something that hasn't even
been proposed.

As LWG chair: I would love to see a proposal for better text handling
in C++, especially if it involved better treatment of i18n issues. I
think we have many of the primitive pieces you'd need to write that
proposal, but they aren't put together in a usable way.

One of the special problems with these sorts of issues is that
careful handling of i18n, like careful handling of numerics, takes
domain expertise. There aren't many i18n experts on the C++ committee.


Andrea Griffini

2004-08-14 10:39:35 UTC

Post by Matt Austern
One of the special problems with these sorts of issues is that
careful handling of i18n, like careful handling of numerics, takes
domain expertise. There aren't many i18n experts on the C++ committee.

That is only half of the story... I think that correctness and
careful handling is just one of the dimensions of this problem.
Another, IMO very important, one is usability.

For example, my impression is that the whole C++ I/O subsystem
dismissed usability, and now we have joke-looking code snippets
just to write out a number with three decimal digits or to
get the integer value of a string.

I think that C++ took a few steps beyond C on I/O,
but these are IMO steps in the wrong direction (i.e. LESS
dynamic code).

Reading the introduction to streams in TC++PL, I remember the
fear of what was about to come (starting by saying that
I/O is difficult and no library will please everyone is like
starting a joke by saying that humour is a complex thing, and
not everyone will like the joke).

IMO my fear was justified.

And now I'm trembling in terror waiting for what will be the
result of more work on i18n.

How many people will work on the issue? I've read somewhere
that the combined IQ of a committee can be easily computed
by starting from 100 and subtracting 5 for every participant :-)

Andrea


k***@gabi-soft.fr

2004-08-16 19:54:07 UTC

Post by Andrea Griffini

That is only half of the story... I think that correctness and
careful handling is just one of the dimensions of this problem. Another,
IMO very important, one is usability.

You wouldn't be thinking of <locale> now, would you?

I actually think that part of the problem is due to premature
standardization. For political reasons, the standard must have support
for internationalization. Even though we are at a state where not only
do we not know a good, general solution, we don't really even know how
to specify the problem. Thus, for example, it is obvious that word order
changes between languages -- the Open Systems have implemented support
for this in their versions of printf. What they have implemented has
always sufficed for the type of messages I print -- logs, error
messages, and the like. But suppose you are generating messages that
should appear to come from a human being, always grammatically correct,
and that things like "n error(s) found" won't do the trick. The
classical solution in the Anglo-Saxon community is something like:

printf( "%d error%s found\n", errorCount, errorCount == 1 ? "" : "s" ) ;

Now, you'll need more than just getting a translated text string to fix
that in Italian. More generally, one might write something like:

"%d %s found\n", errorCount, errorCount == 1 ? "error" : "errors"

-- even in English, the original fails if we are counting feet, rather
than errors. Except that, of course, in many languages, "found" will
also change forms ("trovata"/"trovate", "trouvée"/"trouvées"...).

And of course, some languages have a dual, so you need a different form
if errorCount is 2 as well. And someone once told me that in Russian,
you use the singular behind numbers like twenty-one or thirty-one, which
end in "one".
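
To make the point about number-dependent forms concrete, here is a small sketch of a per-language "plural category" function. Everything in it is an assumption for illustration: the names Plural, english_plural, russian_plural and plural_example are invented, C++11 is assumed, and the Russian rule is the commonly cited one (as found in typical message-catalogue plural tables), not something established in this thread.

```cpp
#include <cassert>

// Hypothetical plural categories; real languages may need more or fewer.
enum class Plural { One, Few, Many };

// English: singular only for exactly 1.
inline Plural english_plural(unsigned long n) {
    return n == 1 ? Plural::One : Plural::Many;
}

// Commonly cited Russian rule (an assumption here): numbers ending in 1
// but not in 11 take the singular; numbers ending in 2 to 4 but not in
// 12 to 14 take a second form; everything else takes a third.
inline Plural russian_plural(unsigned long n) {
    const unsigned long d = n % 10;
    const unsigned long h = n % 100;
    if (d == 1 && h != 11) return Plural::One;
    if (d >= 2 && d <= 4 && !(h >= 12 && h <= 14)) return Plural::Few;
    return Plural::Many;
}

// Small usage check: 21 is "singular" under the Russian rule only.
inline void plural_example() {
    assert(english_plural(21) == Plural::Many);
    assert(russian_plural(21) == Plural::One);
    assert(russian_plural(11) == Plural::Many);
}
```

A message catalogue would then need one translated string per category, which is exactly the kind of requirement a simple "%d error%s" format cannot express.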

So the question is: what do we need to support this kind of thing?

And until we've defined the problem in a general way, I find it very
difficult to come up with a solution. I've implemented a number of
different solutions, in different applications, but each time, I
implemented a solution to the subset of the problem which our
application was concerned with (which has always permitted things like
"%d error(s) found\n").

Post by Andrea Griffini
For example, my impression is that the whole C++ I/O subsystem
dismissed usability, and now we have joke-looking code snippets just to
write out a number with three decimal digits or to get the integer
value of a string.

I don't know. The C++ I/O subsystem has several very important
improvements over that in C: much has been said about type safety (which
IMHO is very important) and extensibility, but let's not forget the
separation of the formatting from the sinking and sourcing of bytes as
well.

[...]

Post by Andrea Griffini
How many people will work on the issue? I've read somewhere that the
combined IQ of a committee can be easily computed by starting from 100
and subtracting 5 for every participant :-)

The IQ of a crowd is the lowest IQ of the people in the crowd, divided
by the number of people in the crowd. But I don't think that the
committee is really a crowd. And sometimes, one person, working alone,
can make a pretty big mess too.

--
James Kanze GABI Software http://www.gabi-soft.fr
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34


Eugene Gershnik

2004-08-08 01:59:28 UTC

Post by k***@gabi-soft.fr
The C++ standard DOES have a function
for case insensitive comparison of strings: std::collate::compare.

Let's compare it to other languages/dialects for a simple task: a condition
based on an internet protocol name, which is always English and locale
independent. I am perfectly aware that the snippets below are not equivalent,
but this is beside the point, which is how much work a simple and
frequent task should take.

(Disclaimer: Code typed without compiling)

<popular language #1>

String protocol = "HTTP";

if (protocol.compareToIgnoreCase("http") == 0)
{
...
}

</popular language #1>

<popular language #2>

string protocol = "HTTP";

if (String.Compare(protocol, "http", true) == 0)
{
...
}

</popular language #2>

<what C++ programmers usually do>

const string protocol = "HTTP";

if (_stricmp(protocol.c_str(), "http") == 0)
{
...
}

</what C++ programmers usually do>

<standard C++>

const string protocol = "HTTP";

const char * const protocol_begin = protocol.c_str();
const char * const protocol_end = protocol_begin + protocol.length();
const char test_begin[] = "http";
const char * const test_end = test_begin + sizeof(test_begin) - 1;
if (use_facet<collate<char> >(locale::classic()).compare(
        protocol_begin,
        protocol_end,
        test_begin,
        test_end) == 0)
{
    ...
}

</standard C++>

--
Eugene
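
For the narrow case described above (ASCII protocol names, no locale involved), a portable alternative to _stricmp can be sketched in a few lines. The names ascii_lower, ascii_iequals and ascii_iequals_example are hypothetical, C++11 is assumed, and the range test on 'A'..'Z' assumes an ASCII-compatible execution character set.

```cpp
#include <algorithm>
#include <cassert>
#include <string>

// Folds only the 26 ASCII capital letters; every other char is unchanged.
inline char ascii_lower(char c) {
    return (c >= 'A' && c <= 'Z') ? static_cast<char>(c - 'A' + 'a') : c;
}

// Hypothetical helper: true if both strings have the same length and
// match character by character after ASCII-only case folding.
inline bool ascii_iequals(const std::string& a, const std::string& b) {
    return a.size() == b.size() &&
           std::equal(a.begin(), a.end(), b.begin(),
                      [](char x, char y) {
                          return ascii_lower(x) == ascii_lower(y);
                      });
}

// Small usage check based on the protocol-name example.
inline void ascii_iequals_example() {
    const std::string protocol = "HTTP";
    assert(ascii_iequals(protocol, "http"));
    assert(!ascii_iequals(protocol, "https"));
}
```

Because it ignores locales entirely, this is deliberately not a general text comparison; it only fits identifiers that are defined to be ASCII.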


James Kanze

2004-08-08 22:22:05 UTC

"Eugene Gershnik" <***@hotmail.com> writes:

|> ***@gabi-soft.fr wrote:
|> > The C++ standard DOES have a function for case insensitive
|> > comparison of strings: std::collate::compare.

|> Let's compare it to other languages/dialects for a simple task:
|> condition based on an internet protocol name which is always English
|> and locale independent. I am perfectly aware that the snippets below
|> are not equivalent but this is besides the point which is how much
|> work should a simple and frequent task take.

Is it really that frequent to compare the name of a protocol in a URL?
More frequent than, say, looking up a person's name?

(I'm not saying that it shouldn't be easier. But I don't think that the
language should "prefer" this particular application, either.)

[...]
|> <standard C++>

|> const string protocol = "HTTP";

|> const char * const protocol_begin = protocol.c_str();
|> const char * const protocol_end = protocol_begin + protocol.length();
|> const char test_begin[] = "http";
|> const char * const test_end = test_begin + sizeof(test_begin) - 1;
|> if (use_facet<collate<char> >(locale::classic()).compare(
|> protocol_begin,
|> protocol_end,
|> test_begin,
|> test_end) == 0)
|> {
|> ...
|> }

|> </standard C++>

That's progress :-). Like:

std::cout << std::setprecision( 4 )
<< std::setw( 8 )
<< std::fixed
<< someDouble ;

instead of:

printf( "%8.4f", someDouble ) ;

:-).
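
For anyone who wants to check that the two forms really format a value the same way, here is a small sketch that captures both results as strings. The function names format_stream, format_printf and format_example are invented for this comparison, std::snprintf assumes C++11, and the stream version assumes the default "C" locale is in effect.

```cpp
#include <cassert>
#include <cstdio>
#include <iomanip>
#include <sstream>
#include <string>

// iostream version: precision 4, width 8, fixed notation.
inline std::string format_stream(double v) {
    std::ostringstream out;
    out << std::setprecision(4) << std::setw(8) << std::fixed << v;
    return out.str();
}

// printf-family version of the same format, written into a local buffer.
// snprintf truncates rather than overflowing if the value is very large.
inline std::string format_printf(double v) {
    char buf[64];
    std::snprintf(buf, sizeof buf, "%8.4f", v);
    return buf;
}

// Small usage check: both yield two spaces followed by 3.1416.
inline void format_example() {
    assert(format_stream(3.14159265) == "  3.1416");
    assert(format_printf(3.14159265) == "  3.1416");
}
```

Note that the width set by std::setw applies only to the next insertion, whereas precision and the fixed flag stay set on the stream.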

Don't worry. One of these days, you'll be paid by the line, and you'll
appreciate it.

Eugene Gershnik

2004-08-09 20:13:53 UTC

Post by James Kanze

Post by k***@gabi-soft.fr
The C++ standard DOES have a function for case insensitive
comparison of strings: std::collate::compare.

condition based on an internet protocol name which is always
English and locale independent. I am perfectly aware that the
snippets below are not equivalent, but this is beside the point,
which is how much work a simple and frequent task should take.

Is it really that frequent to compare the name of a protocol in a URL?
More frequent than, say, looking up a person's name?
(I'm not saying that it shouldn't be easier. But I don't think that
the language should "prefer" this particular application, either.)

It is quite frequent in the area I work in, and besides, network protocols are
not the only example. File formats, hardware protocols and other "inside the
computer" stuff are almost exclusively US English. I suspect that everybody
who writes software in these areas will find the arguments "ad locale" not
very convincing. I also realize that people who write other kinds of
software will have a different opinion.
Ideally, I think a standard library should cater to the needs of both groups.
The current library is too small and uncompetitive compared with the stuff
other languages come with.
The software I have to write today is an order of magnitude more complex than
it was 10 years ago. My managers simply cannot afford to spend time
implementing functionality like to_upper, socket, thread etc. every time it
is required. I can either use 3rd party libraries for that or reuse company
specific libraries. The first approach is too costly or unreliable (from the
managers' point of view) and the second usually ends in disaster, given the
fact that most in-house library designers are not exactly Andrei
Alexandrescu, or you, or any of this forum's regulars. The end result is that
within the time, budget and other organizational constraints I have no
choice but to use one of the "popular languages".
Now, before the whole world jumps on me explaining why the arguments above
are BS, let me say that I know that pretty well. I also tend to agree with
your often expressed opinion that "popular language #1" is unsuitable for
large scale software development. Despite all that, I simply cannot sell C++
to my managers when items like "write/license library X for portable text
manipulation" are on any C++ project plan.

Post by James Kanze
std::cout << std::setprecision( 4 )
<< std::setw( 8 )
<< std::fixed
<< someDouble ;
printf( "%8.4f", someDouble ) ;
:-).

At the risk of wandering too far off-topic, even the next version of the Java
language will finally include a C-compatible printf. I find it very encouraging
to see how an old but simple and elegant design still beats all new inventions.

Post by James Kanze
Don't worry. One of these days, you'll be paid by the line, and
you'll appreciate it.

I'll just stop using templates then. A few copies of std::map for each type
will make an early retirement possible ;-)

--
Eugene

k***@gabi-soft.fr

2004-08-10 18:42:46 UTC

Post by Eugene Gershnik

Post by James Kanze

Post by k***@gabi-soft.fr
The C++ standard DOES have a function for case insensitive
comparison of strings: std::collate::compare.

Is it really that frequent to compare the name of a protocol in a
URL? More frequent than, say, looking up a person's name?
(I'm not saying that it shouldn't be easier. But I don't think that
the language should "prefer" this particular application, either.)

It is quite frequent in the area I work

And not very frequent in the areas I work in.

Post by Eugene Gershnik
and besides network protocols are not the only example. File formats,
hardware protocols and other "inside the computer" stuff are almost
exclusively US English.

Most of the network protocols today use case sensitive UTF-8. DNS is a
bit of an exception. All of the stuff "inside the computer" on the
machines I work on is also case sensitive, and more or less (human)
language independent. It's been a long time since I've had to deal with
7 bit ASCII, and for most of what I see, text is text, and the programs
just consider it a sequence of arbitrary bytes; there are typically a
couple of characters reserved for separating things, and that is it.

Post by Eugene Gershnik
I suspect that everybody who writes software in these areas will find
the arguments "ad locale" not very convincing.

I don't know. I've done a lot of networking programming, and I find that
the rules are very variable. Generally speaking, case insensitivity only
works when you limit the character set, typically to seven bit ASCII. In
the more recent protocols I've had to deal with, everything was case
sensitive (and UTF-8) precisely to avoid this sort of problem.

Post by Eugene Gershnik
I also realize that people who write other kinds of software will have
a different opinion. Ideally I think a standard library should cater
to the needs of both groups.

I think that most of the necessary framework is there, in locale. There
are some problems: the collate facet definitely needs additional
interface functions to handle standard strings, and in the ctype facet,
toupper and tolower really should work on strings, returning new values
(of not necessarily the same length). And arguably, the entire locale
interface should be redesigned to make it usable. But the idea that the
comparisons and conversions should be locale specific is a major step in
the right direction.

After that, the question is what locales should be supported?

Post by Eugene Gershnik
The current library is too small and uncompetitive compared with the
stuff other languages come with.

True. Note that the Java function String.toUpperCase uses a locale, and
maps "ß" to "SS". Globally, I rather like the idea of a:
std::string::toupper( std::locale const& = std::locale() ) ;
function.
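
With the locale library as it stands, the nearest thing to that member function is a free function over the ctype facet. The sketch below is only an illustration: to_upper is a hypothetical name, and because ctype<char>::toupper maps one char to one char, it cannot produce length-changing results such as "ß" to "SS".

```cpp
#include <locale>
#include <string>

// Hypothetical helper: returns an upper-cased copy of `text`, using the
// ctype<char> facet of `loc`. The mapping is strictly char-for-char, so
// the result always has the same length as the input.
std::string to_upper(std::string text, std::locale const& loc = std::locale())
{
    if (!text.empty()) {
        std::ctype<char> const& facet =
            std::use_facet<std::ctype<char> >(loc);
        // The range overload converts [low, high) in place.
        facet.toupper(&text[0], &text[0] + text.size());
    }
    return text;
}
```

Calling to_upper(s) uses the current global locale, while to_upper(s, std::locale::classic()) gives the plain "C" behaviour regardless of the environment.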

Post by Eugene Gershnik
Software I have to write today is an order of magnitude more complex than
it used to be 10 years ago.

Totally agreed. Ten years ago, my string class had a (naïve) toUpper
function. Nobody demanded the complexity of locale-dependent
conversions. Today, the applications I write generally do need it (if
they need toUpper at all -- most of the time, we use case sensitivity to
avoid the problem completely).

Post by Eugene Gershnik
My managers simply cannot afford to spend time implementing
functionality like to_upper, socket, thread etc. every time it is
required.

I agree that as it stands, C++ is unusable without a certain number of
additional third party libraries. And that getting these libraries to
work together is not always easy -- most of the ones we use were
initially written before the standard was adopted, and use their own
private string class, for example.

Post by Eugene Gershnik
I can either use 3rd party libraries for that or reuse company
specific libraries. The first approach is too costly or unreliable (from
the managers' point of view) and the second usually ends in disaster,
given that most in-house library designers are not exactly
Andrei Alexandrescu, or you, or any of this forum's regulars. The end
result is that within the time, budget and other organizational
constraints I have no choice but to use one of the "popular
languages".

I know the problem. And I agree that it is a problem. A real problem.

Post by Eugene Gershnik
Now, before the whole world jumps on me explaining why the arguments
above are BS let me say that I know that pretty well. I also tend to
agree with your often expressed opinion that "popular language #1" is
unsuitable for large scale software development. Despite all that I
simply cannot sell C++ to my managers when items like "write/license
library X for portable text manipulation" are on any C++ project plan.

Been there. Done that. I know what you mean.

Post by Eugene Gershnik

Post by James Kanze
std::cout << std::setprecision( 4 )
<< std::setw( 8 )
<< std::fixed
<< someDouble ;
printf( "%8.4f", someDouble ) ;
:-).

At the risk of wandering too far off-topic, even the next version of the
Java language will finally include a C-compatible printf. I find it very
encouraging to see how an old but simple and elegant design still
beats all new inventions.

Actually, if you need formatted text, say in a table, nothing beats a
Cobol PIC clause, or most Basic's PRINT USING:-). For printf style
formatting in C++, however, see GB_Format, at my site
(www.gabi-soft.fr). Implemented for the reasons you discussed earlier: I
needed it, and it wasn't available elsewhere. (Actually, I only needed a
small subset. I went on and implemented 100% of printf formatting because
it was a challenge. Especially getting the '*' specifiers for length and
precision to work:-).)

Post by Eugene Gershnik

Post by James Kanze
Don't worry. One of these days, you'll be paid by the line, and
you'll appreciate it.

I'll just stop using templates then. A few copies of std::map for each
type will make an early retirement possible ;-)

Just preprocess, and save the results:-). Or write the program in
Cobol:-).

Still, one of the reasons I'm in demand, and have no real problem
finding a job, even in a depressed market, is that I do know how to
design and write all this stuff which is already part of most other
languages. So don't knock it:-). (Note, however, that even in more
complete languages, there are always things that are missing. My one
large Java project largely involved writing threading, networking and
GUI primitives. And finding the bugs, and their corresponding
work-arounds, in the standard library:-).)

--
James Kanze GABI Software http://www.gabi-soft.fr
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

Zev_K

2004-07-30 09:49:42 UTC

Post by Ganesh
It is a surprise to most of the "common" C++ programmers to learn that
std::string provides no simple way of doing case-insensitive
comparison.

Well, isn't everything case sensitive in C++? so why surprised at
strings being treated in case sensitive manner? :-)
STL is kind of saying "hey, strings and everything else are case
sensitive in C++, but you can replace any of my methods with your own
in a pluggable manner...". I think it is fair enough...
Simple way of doing case-insensitive comparison?
stricmp(dest.c_str(),src.c_str());
Sorry, I know my response doesn't help much, and I wish I had a better
answer....
-Vinayak

When I need to do case insensitive comparisons using strings, and I
don't want to have to resort to C methods, I usually do something to
the effect of:
s1.toLower()==s2.toLower()

However, in most cases, it just pays to store everything as either
upper or lower case, making everything simpler.
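
std::string has no toLower() member, so the line above is shorthand. A minimal sketch of the same idea with standard facilities might look like this; the names lower_char, to_lower and equal_ignoring_case are invented for the example, and the approach assumes single-byte characters in the current C locale.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Lower-case one char. The cast to unsigned char avoids undefined
// behaviour when plain char is signed and holds a negative value.
char lower_char(char c)
{
    return static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
}

// Hypothetical stand-in for the toLower() used above: returns a
// lower-cased copy and leaves the argument untouched.
std::string to_lower(std::string text)
{
    std::transform(text.begin(), text.end(), text.begin(), lower_char);
    return text;
}

// Compares two strings after lower-casing copies of both.
bool equal_ignoring_case(std::string const& s1, std::string const& s2)
{
    return to_lower(s1) == to_lower(s2);
}
```

This makes two temporary copies per comparison, which is one more reason to normalise the case once when the data is stored.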

tom_usenet

2004-07-31 03:01:41 UTC

Post by Ganesh
It is a surprise to most of the "common" C++ programmers to learn that
std::string provides no simple way of doing case-insensitive
comparison.

Well, isn't everything case sensitive in C++? so why surprised at
strings being treated in case sensitive manner? :-)
STL is kind of saying "hey, strings and everything else are case
sensitive in C++, but you can replace any of my methods with your own
in a pluggable manner...". I think it is fair enough...
Simple way of doing case-insensitive comparison?
stricmp(dest.c_str(),src.c_str());
Sorry, I know my response doesn't help much, and I wish I had a better
answer....

stricmp is a non-standard function - you can't use it in portable
code.
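
A portable replacement is short to write. The following is only a sketch under stated assumptions: compare_ignoring_case is a made-up name, it relies on std::tolower from <cctype>, and it therefore handles single-byte characters in the current C locale only.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Three-way comparison in the style of stricmp: negative if lhs sorts
// first, zero if the strings match ignoring case, positive otherwise.
int compare_ignoring_case(std::string const& lhs, std::string const& rhs)
{
    std::size_t const common =
        lhs.size() < rhs.size() ? lhs.size() : rhs.size();
    for (std::size_t i = 0; i != common; ++i) {
        // Cast to unsigned char first: passing a negative char value
        // to std::tolower is undefined behaviour.
        int const a = std::tolower(static_cast<unsigned char>(lhs[i]));
        int const b = std::tolower(static_cast<unsigned char>(rhs[i]));
        if (a != b) {
            return a < b ? -1 : 1;
        }
    }
    // All shared characters match, so the shorter string sorts first.
    if (lhs.size() == rhs.size()) {
        return 0;
    }
    return lhs.size() < rhs.size() ? -1 : 1;
}
```

Unlike the lower-case-and-compare approach, this walks both strings in place and makes no copies.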

Tom

k***@gabi-soft.fr

2004-08-05 03:22:45 UTC

Post by Ganesh
It is a surprise to most of the "common" C++ programmers to learn
that std::string provides no simple way of doing case-insensitive
comparison.

Well, isn't everything case sensitive in C++? so why surprised at
strings being treated in case sensitive manner? :-)

I appreciate the smiley.

Post by Vinayak Raghuvamshi
STL is kind of saying "hey, strings and everything else are case
sensitive in C++, but you can replace any of my methods with your own
in a pluggable manner...". I think it is fair enough...
Simple way of doing case-insensitive comparison?
stricmp(dest.c_str(),src.c_str());

Which just moves the problem. Now you have to write a function stricmp.

What's wrong with something like:

std::map< std::string, MyClass, std::locale >
myMap( std::locale( "de_DE" ) ) ;

? Or whatever, according to what you are doing. For a simple
comparison,

if ( std::use_facet< std::collate< char > >( std::locale() )
.compare( s1.data(), s1.data() + s1.size(),
s2.data(), s2.data() + s2.size() ) == 0 ) ...

should do the trick. Although one does wonder why the interface uses
char const*, and not std::string. I'd definitely consider wrapping this
one in a global function.
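
Such a wrapper could be sketched as follows. The names collate_compare and collated_map are hypothetical. Note that what the comparison ignores depends entirely on the locale passed in: in the classic "C" locale, collate<char> is a plain character-by-character comparison and is case sensitive.

```cpp
#include <locale>
#include <map>
#include <string>

// Hypothetical wrapper: three-way comparison of two std::strings using
// the collate<char> facet of `loc`. Returns -1, 0 or 1.
int collate_compare(std::string const& s1, std::string const& s2,
                    std::locale const& loc = std::locale())
{
    std::collate<char> const& facet =
        std::use_facet<std::collate<char> >(loc);
    return facet.compare(s1.data(), s1.data() + s1.size(),
                         s2.data(), s2.data() + s2.size());
}

// std::locale provides an operator() that orders two strings with its
// collate facet, so a locale object can serve as the map's comparator.
typedef std::map<std::string, int, std::locale> collated_map;
```

A collated_map is constructed with the locale to use, for example collated_map m(std::locale::classic()); named locales such as "de_DE" are only available if the platform provides them.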

--
James Kanze GABI Software http://www.gabi-soft.fr
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

John Dibling

2004-08-03 11:22:14 UTC

Post by Ganesh
It is a surprise to most of the "common" C++ programmers to learn that
std::string provides no simple way of doing case-insensitive
comparison.

In fact, when you really sit down and look at the standard library as
a whole, you will find that there is a great deal that is "missing."
You are right, there is no built-in way to do a CI compare of
std::strings. But there is also no std::string version of sprintf(),
and I would argue that of all the string-related functions in the CRT,
sprintf() is (one of) the most-commonly used.
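
As an illustration of what such a function might look like, here is a minimal sketch. The name format_string is invented, and the code assumes a C++11 library for std::vsnprintf and va_copy; it inherits printf's lack of type safety.

```cpp
#include <cstdarg>
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical sprintf-like function that returns a std::string.
// Returns an empty string if the format cannot be processed.
std::string format_string(char const* format, ...)
{
    std::va_list args;
    va_start(args, format);
    std::va_list args_copy;
    va_copy(args_copy, args);   // the list is consumed by each call

    // First pass: ask how many characters are needed.
    int const length = std::vsnprintf(nullptr, 0, format, args);
    va_end(args);
    if (length < 0) {
        va_end(args_copy);
        return std::string();
    }

    // Second pass: write into a buffer with room for the terminator.
    std::vector<char> buffer(static_cast<std::size_t>(length) + 1);
    std::vsnprintf(buffer.data(), buffer.size(), format, args_copy);
    va_end(args_copy);
    return std::string(buffer.data(), static_cast<std::size_t>(length));
}
```

The two-pass approach avoids guessing a buffer size, at the cost of formatting the arguments twice.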

The library of "missing" functions goes far beyond sprintf(), and
even beyond string-related functions. For example, find() is provided
to find an element which compares equal to another element using
operator==. If operator== doesn't work for you, you can define what
it means to "be the same" yourself in a functor, and use find_if()
instead of find(). That is, there are non-predicated and predicated
versions of find(). But there is no predicated version of copy(),
transform() or for_each(). It didn't occur to me for a long time that
there might be predicated versions of these algorithms. But when I
did realize it, and wrote them all myself in an STL extensions
library, they became invaluable.
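
A predicated copy of the kind described takes only a few lines. This is a sketch, not the code from that extensions library: copy_if_sketch, is_even and evens_of are names made up for the example. (A std::copy_if with this shape was later added to the standard library in C++11.)

```cpp
#include <iterator>
#include <vector>

// Copies the elements of [first, last) for which pred returns true to
// the range starting at out, and returns the end of the output range.
template <typename InputIt, typename OutputIt, typename Predicate>
OutputIt copy_if_sketch(InputIt first, InputIt last,
                        OutputIt out, Predicate pred)
{
    for (; first != last; ++first) {
        if (pred(*first)) {
            *out = *first;
            ++out;
        }
    }
    return out;
}

// Example predicate.
bool is_even(int value)
{
    return value % 2 == 0;
}

// Example use: collect the even elements of a vector.
std::vector<int> evens_of(std::vector<int> const& values)
{
    std::vector<int> result;
    copy_if_sketch(values.begin(), values.end(),
                   std::back_inserter(result), is_even);
    return result;
}
```

The same pattern extends to predicated versions of transform() and for_each(): test the predicate inside the loop and apply the operation only when it holds.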

There is also no copy_backward_if(), and even if there were, there is
also no bidirectional_back_insert_iterator to use with it. The list
goes on...

BTW - Scott Meyers covers CI compares of std::strings in "Effective
STL," item 35.

- John Dibling
***@yahoo.com

Steven T. Hatton

2004-08-22 23:04:48 UTC