Discussion:
Unicode regex negated case-insensitivity in 5.14.0-RC1
George Greer
2011-04-28 20:48:36 UTC
Attempting to run $WORK's data filter/ETL on 5.14.0-RC1, which currently
runs on 5.10.0 in production. The module versions are different
between the two Perl versions but the script itself is the same.

(string scrambled to protect the innocent but still tickle behavior)

DB<73> $x = "X-Xoqp-SDR-FpCqar4-Duooery-Faad-laeC_cCesspfpads:";

DB<74> x $x =~ /^[^\x00-\x1f\x7f-\xff :]+:/
0 1
DB<75> x $x =~ /^[^\x00-\x1f\x7f-\xff :]+:/i
0 1
DB<76> utf8::upgrade($x)

DB<77> x $x =~ /^[^\x00-\x1f\x7f-\xff :]+:/
0 1
DB<78> x $x =~ /^[^\x00-\x1f\x7f-\xff :]+:/i
empty array

Script version:
$x = "X-Xoqp-SDR-FpCqar4-Duooery-Faad-laeC_cCesspfpads:";
print $x =~ /^[^\x00-\x1f\x7f-\xff :]+:/ ? 1 : 0;
print $x =~ /^[^\x00-\x1f\x7f-\xff :]+:/i ? 1 : 0;
utf8::upgrade($x);
print $x =~ /^[^\x00-\x1f\x7f-\xff :]+:/ ? 1 : 0;
print $x =~ /^[^\x00-\x1f\x7f-\xff :]+:/i ? 1 : 0;
print "\n";

5.14.0-RC1 (tarball)
1110
5.12.3 (Fedora's perl-5.12.3-143.fc14.x86_64)
1111
5.10.0 (tarball)
1111

Regex is from Mail::Header testing for bad RFC822 field names through
MIME-tools.

- - - 8< - - - 8< - - -
our $FIELD_NAME = '[^\x00-\x1f\x7f-\xff :]+:';
...
defined $ctag && $ctag =~ /^($FIELD_NAME|From )/oi
or croak "Bad RFC822 field name '$tag'\n";
- - - 8< - - - 8< - - -

Using /aa does seem to fix the regex:

1
1
1 /
0 /i
1 /a
0 /ia
1 /aa
1 /iaa

No special 5.14 features used by the script (since it is 5.10 compatible).
--
George Greer
Tom Christiansen
2011-04-28 20:59:46 UTC
It's even weirder than that. Given:

$\ = "\n";
my $x = "X-Xoqp-SDR-FpCqar4-Duooery-Faad-laeC_cCesspfpads:";
print $x =~ /^[^\x00-\x1f\x7f-\xff :]+:/ || 0;
print $x =~ /^[^\x00-\x1f\x7f-\xff :]+:/i || 0;
utf8::upgrade($x);
print $x =~ /^[^\x00-\x1f\x7f-\xff :]+:/ || 0;
print $x =~ /^[^\x00-\x1f\x7f-\xff :]+:/i || 0;

Here are the results:

% perl5.12.3 /tmp/bl
1
1
1
1

% perl5.12.3 -M5.012 /tmp/bl
1
1
1
1

% blead /tmp/bl
1
1
1
0

% blead -M5.012 /tmp/bl
1
0
1
0

So with (full) Unicode strings, it's yet again different still.

With apologies to Philip K Dick :), this is a Karl-Thing, I think.

--tom
Tom Christiansen
2011-04-28 21:01:53 UTC
Of course, the 0101 is because it's already utf8-upgraded
once unicode_strings are operant. Forgot to mention that,
but it should be obvious.

--tom
Karl Williamson
2011-04-28 22:14:29 UTC
Post by Tom Christiansen
Of course, the 0101 is because it's already utf8-upgraded
once unicode_strings are operant. Forgot to mention that,
but it should be obvious.
--tom
I'm looking into it.
George Greer
2011-04-28 22:45:39 UTC
Post by Tom Christiansen
With apologies to Philip K Dick :), this is a Karl-Thing, I think.
Yes, and I'm thankful for all the work Karl has done and has taken on.
Targeting Unicode with the regex engine is definitely a task that will be
useful to me and my work's development, and it looks pretty fraught with
unexpected edge cases. I certainly appreciate him doing all of it.
--
George Greer
Karl Williamson
2011-04-28 23:27:12 UTC
Post by Tom Christiansen
$\ = "\n";
my $x = "X-Xoqp-SDR-FpCqar4-Duooery-Faad-laeC_cCesspfpads:";
print $x =~ /^[^\x00-\x1f\x7f-\xff :]+:/ || 0;
print $x =~ /^[^\x00-\x1f\x7f-\xff :]+:/i || 0;
utf8::upgrade($x);
print $x =~ /^[^\x00-\x1f\x7f-\xff :]+:/ || 0;
print $x =~ /^[^\x00-\x1f\x7f-\xff :]+:/i || 0;
% perl5.12.3 /tmp/bl
1
1
1
1
% perl5.12.3 -M5.012 /tmp/bl
1
1
1
1
% blead /tmp/bl
1
1
1
0
% blead -M5.012 /tmp/bl
1
0
1
0
So with (full) Unicode strings, it's yet again different still.
With apologies to Philip K Dick :), this is a Karl-Thing, I think.
--tom
Fortunately for my ego, the problem isn't in my code.

Unfortunately for the project (and perhaps to my ego), the problem is
much deeper; it is an issue with multi-character folds. The reason this
doesn't match in 5.14 when full Unicode semantics is on (with or without
utf8ness) is that in 5.14 for the first time, multi-char folds work.

At least they work as designed. Perhaps there is a better design that
wouldn't have the gotcha this gives. I don't know, and am open to
suggestions. But Unicode is proposing to stop recommending that regular
expression engines accept them. This proposal stemmed, at least in
part, from my pointing out issues to them about the feature. But their
proposal wasn't mounted until around the feature freeze time of 5.14,
after I had coded to what I thought were the correct specifications; and
the comment period for the proposal is still going -- it ends this
weekend. If you'd like to comment, see the document at
http://unicode.org/reports/tr18/proposed.html
After comments are over, they have to be evaluated, and will be
presented to May's meeting of the Unicode Technical Committee, and who
knows what will happen then.

Let me quote from part of the motivation for the changes, found at
http://unicode.org/review/pri179

"There are a number of examples where the results would be
counter-intuitive for typical users of regular expressions."

I think by "typical users" they mean anyone who doesn't have the mind of
a CPU. :)

Anyway, what's going on here is that the regex appears to have been
designed to match the graphic ASCII characters except the colon. But it
is written so as to match the complement of the complement of those
characters, with case-insensitivity thrown in. That means it is
supposed to not match our old friend the German sharp ss, "ß". But that
means it is not supposed to match the case fold of ß because we have /i
matching. And that means it isn't supposed to match the sequence 'ss',
which is the case fold of ß. And that means the match fails at the
point in the above string where there is 'ss' in a row.
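
To make the fold in question concrete, here is a minimal sketch, assuming Perl 5.16 or later for fc() (which did not exist when this thread was written). It only shows that the full case fold of U+00DF is the two-character sequence "ss" and that the folded test string contains that sequence; it makes no claim about how any particular Perl release treats the inverted class.

```perl
use v5.16;    # assumption: Perl 5.16+, which enables fc() and unicode_strings

my $subject = "X-Xoqp-SDR-FpCqar4-Duooery-Faad-laeC_cCesspfpads:";

# Full case fold of U+00DF LATIN SMALL LETTER SHARP S.
my $fold = fc("\xDF");

# Offset at which the folded subject contains that fold (-1 if absent).
my $offset = index(fc($subject), $fold);

say "fold of U+00DF: $fold (length ", length($fold), ")";
say "fold sequence found in subject at offset $offset";
```

Under the reading Karl describes, that offset is where an inverted class excluding ß would stop matching.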

That is counter-intuitive to me, but it is correct with the implemented
regex rules, and it seems to me to be correct according to what the
current Unicode TR18 says. Is there disagreement?

What to do? I think this is a 5.14 blocker. And I'm thankful George
found it now and not later. I wish I had a really good idea of how to
proceed. My proposal, unless a better idea surfaces, is to just disable
multi-character folds in regex matching for 5.14, which is the direction
that Unicode appears to be moving. Multi-character folding worked
somewhat in earlier releases, but was extremely buggy, and could not be
relied on. Thus there are some backward compatibility issues, but we
might have to do that anyway if Unicode proceeds as expected.
It's actually quite easy to change the code to do this, as almost
everything is driven off a mktables generated table. We just have to
change a line or two in mktables to ignore the multi-char folds.
There's also a line or two in regcomp, as the fold for ß has a special
case there (for performance, to avoid having to look at the tables for
non-utf8 patterns). There's plenty of code that just won't get
exercised, which could be #ifdef'd out, but that's not really necessary.
The bigger amount of work is fixing the .t's.
Karl Williamson
2011-04-28 23:32:32 UTC
Post by Karl Williamson
The bigger amount of work is fixing the .t's.
And fixing the .pod's
Tom Christiansen
2011-04-28 23:39:52 UTC
Post by Karl Williamson
And fixing the .pod's
:)

I have to think about this a bit.
Complementing multichar folds is a weird area.

I think Perl doing full case-mapping is a feature. And
sure it will still do it with functions.

I know that Java only does full CM on its equiv of lc/uc/etc.
It does only simple CM with its regexes.

If Perl doesn't do full case mapping, does that mean we won't
be able to match "\xDF" =~ /ss/i and vice versa anymore?

I notice the pattern was forbidding the 128-255 code points,
which also pulled ANGSTROM SIGN and such. But that one is
ok since we have no builtin NFD matching to wrack our brains over.

I think I agree this should be a release blocker until we've
thought it through.

--tom
Karl Williamson
2011-04-29 00:04:43 UTC
Post by Tom Christiansen
Post by Karl Williamson
And fixing the .pod's
:)
I have to think about this a bit.
Complementing multichar folds is a weird area.
I thought about disabling it only for complementing, but think that may
be a can of worms.
Post by Tom Christiansen
I think Perl doing full case-mapping is a feature. And
sure it will still do it with functions.
Yes.
Post by Tom Christiansen
I know that Java only does full CM on its equiv of lc/uc/etc.
It does only simple CM with its regexes.
So we would be following Java's paradigm.
Post by Tom Christiansen
If Perl doesn't do full case mapping, does that mean we won't
be able to match "\xDF" =~ /ss/i and vice versa anymore?
Yes.


BTW, in the comment period that is about to end, they are effectively
saying that /\p{ASCII_Hex_Digit}/i match outside ASCII, which I think is
a security issue.
Karl Williamson
2011-04-29 00:13:12 UTC
Post by Karl Williamson
Post by Tom Christiansen
Post by Karl Williamson
And fixing the .pod's
:)
I have to think about this a bit.
Complementing multichar folds is a weird area.
I thought about disabling it only for complementing, but think that may
be a can of worms.
Post by Tom Christiansen
I think Perl doing full case-mapping is a feature. And
sure it will still do it with functions.
Yes.
Post by Tom Christiansen
I know that Java only does full CM on its equiv of lc/uc/etc.
It does only simple CM with its regexes.
So we would be following Java's paradigm.
Post by Tom Christiansen
If Perl doesn't do full case mapping, does that mean we won't
be able to match "\xDF" =~ /ss/i and vice versa anymore?
Yes.
One more thought. We could add something in 5.16 to enable multi-char
matching. A regex modifier or pragma. I think it's too late for 5.14
to do something like that.
Post by Karl Williamson
BTW, in the comment period that is about to end, they are effectively
saying that /\p{ASCII_Hex_Digit}/i match outside ASCII, which I think is
a security issue.
Tom Christiansen
2011-04-29 00:24:56 UTC
Post by Karl Williamson
One more thought. We could add something in 5.16 to enable multi-char
matching. A regex modifier or pragma. I think it's too late for 5.14
to do something like that.
That's exactly what I have been thinking about. I'm
sure Java would just add some more flags like Pattern.MULTICHAR_FOLDS
or Pattern.COMPLEMENTED_MULTICHAR_FOLDS. They've settled on
Pattern.UNICODE_CHARACTER_CLASSES -- or (?U) -- for RL1.2a
compliance, so they don't mind adding long things.

I really do feel that this *is* correct behavior:

% blead -le 'print "stuff" =~ /^[^\x{FB00}-\x{FB06}]+$/ || 0'
1
% blead -le 'print "stuff" =~ /^[^\x{FB00}-\x{FB06}]+$/i || 0'
0

% blead -E 'say "bless" =~ /^[^\xE6\xDF]+$/i || 0'
0
% blead -E 'say "bless" =~ /^[^\xE6\xDF]+$/ || 0'
1

Wouldn't backing out multichar folds for 5.14 introduce a regression?

--tom
Tom Christiansen
2011-04-29 00:32:50 UTC
Post by Tom Christiansen
Wouldn't backing out multichar folds for 5.14 introduce a regression?
Specifically, it would break things like this, which already worked:

% perl5.12.0 -E 'say "\x{FB00}" =~ /ff/i || 0'
1
...
% perl5.12.3 -E 'say "\x{FB00}" =~ /ff/i || 0'
1

--tom
Karl Williamson
2011-04-29 01:47:44 UTC
Post by Tom Christiansen
Post by Tom Christiansen
Wouldn't backing out multichar folds for 5.14 introduce a regression?
% perl5.12.0 -E 'say "\x{FB00}" =~ /ff/i || 0'
1
...
% perl5.12.3 -E 'say "\x{FB00}" =~ /ff/i || 0'
1
--tom
Yes it would. My point was that appears to be where Unicode is headed.
But there are no guarantees that that is where they'll end up.

A middle position would be to disable them only in bracketed character
classes. I think that the most astonishment stems from those, when they
are inverted. This is where it was most buggy pre-5.14. There were
cases where it worked; but mostly it didn't. And most of the cases
where it worked were when the class got optimized into an EXACTF node.
We'd have to worry about what to do with that situation now. My
position would be that we wouldn't do that optimization if the result
would match multiple characters.

To state more clearly, I guess I'm now putting forth the idea that the
least worst case for 5.14 is that we say that a bracketed character
class can only match a single input character. Most people expect that
anyway, and it would have the fewest regressions. Almost all
regressions would be of the form that /[ß]/i would no longer mean the
same thing as /ß/i.

The idea scares me of allowing a non-inverted class match multiple char
folds vs an inverted one
Tom Christiansen
2011-04-29 02:23:38 UTC
Post by Karl Williamson
Yes it would. My point was that appears to be where Unicode is headed.
But there are no guarantees that that is where they'll end up.
I'm not entirely sure that they should, either.
Post by Karl Williamson
A middle position would be to disable them only in bracketed character
classes. I think that the most astonishment stems from those, when they
are inverted. This is where it was most buggy pre-5.14. There were
cases where it worked; but mostly it didn't. And most of the cases
where it worked were when the class got optimized into an EXACTF node.
We'd have to worry about what to do with that situation now. My
position would be that we wouldn't do that optimization if the result
would match multiple characters.
To state more clearly, I guess I'm now putting forth the idea that the
least worst case for 5.14 is that we say that a bracketed character
class can only match a single input character. Most people expect that
anyway, and it would have the fewest regressions. Almost all
regressions would be of the form that /[ß]/i would no longer mean the
same thing as /ß/i.
The idea scares me of allowing a non-inverted class match multiple char
folds vs an inverted one
I have always been bugged by the idea that a bracketed character class
could ever match more than a single code point. It's like /./ suddenly
matching more than one, but you're not in grapheme mode. Character classes
seem to be inherent singletons.

It's because of this that we can't do certain kinds of lookbehinds anymore:

% blead -E 'say "psst" =~ /(?<=[\x80-\xFF])t/ || 0'
0

% blead -E 'say "psst" =~ /(?<=[\x80-\xFF])t/i || 0'
Variable length lookbehind not implemented in regex m/(?<=[\x80-\xFF])t/ at -e line 1.
Exit 255

% blead -E 'say "psst" =~ /(?<=[^\x80-\xFF])t/iaa || 0'
1

And it's not the character class that's doing it, either; it's this:

% blead -E 'say "psst" =~ /(?<=\xDF)t/ || 0'
0

% blead -E 'say "psst" =~ /(?<=\xDF)t/i || 0'
Variable length lookbehind not implemented in regex m/(?<=\xdf)t/ at -e line 1.
Exit 255

% blead -E 'say "psst" =~ /(?<=\xDF)t/iaa || 0'
0

So this is already a weirdness even without bringing a
character class into it, whether it's inverted or not.

That one at least has a possible fix. You turn something like

(?<=\R)

into

(?:(?<=\r\n)|(?<=\v))

just as you turn

/(?<=\xDF)/i

into

/(?:(?i:(?<=ss))|(?-i:(?<=\xDF)))/i
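
As a quick check of the first rewrite (the \R case), here is a small sketch. It assumes nothing beyond Perl 5.10 for \v; the variable name $after_linebreak is only illustrative. Each alternative is a fixed-length lookbehind, so the engine never has to handle a variable-length one.

```perl
use strict;
use warnings;

# Fixed-length alternatives standing in for the variable-length (?<=\R):
# either a CRLF pair or a single vertical-whitespace character precedes
# the current position.
my $after_linebreak = qr/(?:(?<=\r\n)|(?<=\v))/;

for my $text ("a\r\nb", "a\nb", "ab") {
    my $hit = $text =~ /${after_linebreak}b/ ? 1 : 0;

    # Make the control characters visible in the report.
    (my $shown = $text) =~ s/\r/\\r/g;
    $shown =~ s/\n/\\n/g;
    print "$shown: $hit\n";
}
```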

Perhaps I've strayed a bit from the matter at hand; I could certainly live
with no multichar folds in charclasses (positive or negative alike), but
multichar folds are still a bit of a curiosity, charitably put. Even so,
I do not think we can dare consider breaking:

% perl5.8.1 -le 'print "\x{FB00}" =~ /ff/i || 0'
1

And I don't know why the Unicode folks might want (us) to at this point.

--tom
Nicholas Clark
2011-04-29 12:13:21 UTC
Post by Karl Williamson
Post by Tom Christiansen
Post by Tom Christiansen
Wouldn't backing out multichar folds for 5.14 introduce a regression?
% perl5.12.0 -E 'say "\x{FB00}" =~ /ff/i || 0'
1
...
% perl5.12.3 -E 'say "\x{FB00}" =~ /ff/i || 0'
1
--tom
Yes it would. My point was that appears to be where Unicode is headed.
But there are no guarantees that that is where they'll end up.
A middle position would be to disable them only in bracketed character
classes. I think that the most astonishment stems from those, when they
are inverted. This is where it was most buggy pre-5.14. There were
This stuff is confusing.

If I understand correctly, this is pretty much the simplest test case:

$ perl5.8.9 -lwe '$_ = "ss"; utf8::upgrade($_); print "$] ", /\A[^\xDF]+\z/i ? "Y" : "N";'
5.008009 Y
$ perl -lwe '$_ = "ss"; utf8::upgrade($_); print "$] ", /\A[^\xDF]+\z/i ? "Y" : "N";'
5.012003 Y
$ ./perl -Ilib -lwe '$_ = "ss"; utf8::upgrade($_); print "$] ", /\A[^\xDF]+\z/i ? "Y" : "N";'
5.014000 N


and that what happens is that [^\xDF] is processed all in one, not as a
sequence:

a range
an inverted range
in a case insensitive match


so it's not implemented as a human *might* think, in terms of

* process the ranges inside the [^...] construction to make a list of code
points (in my case that's one code point, U+00DF)
* [^...] means invert the list (in my case, that's several million code points)
* now match the inverted list against the input string
* oh yes, do that insensitively


but all in one step, given that (if I'm understanding this correctly)

/[\xDF]/i is equivalent to /ss/i

so

$_ =~ /[\xDF]/i implies $_ =~ /ss/i


and hence that

$_ =~ /[^\xDF]/i implies $_ !~ /ss/i

and it's that last jump that is really catching everyone out.


Or am I getting this subtly wrong?


Whichever way, it does feel that this spec currently, where inversion happens
at the point of insensitive matching, has emergent behaviour which makes it
dangerous and counterintuitive to the point of uselessness to almost anyone in
the real world.
Post by Karl Williamson
To state more clearly, I guess I'm now putting forth the idea that the
least worst case for 5.14 is that we say that a bracketed character
class can only match a single input character. Most people expect that
anyway, and it would have the fewest regressions. Almost all
regressions would be of the form that /[ß]/i would no longer mean the
same thing as /ß/i.
Expressing it like that troubles me too, as the way my mental model works,
a (non-inverted) character range of one in my head is the same as a literal.
Post by Karl Williamson
The idea scares me of allowing a non-inverted class match multiple char
folds vs an inverted one
Could you give examples of what you mean by this? I'm not quite sure I'm
understanding it correctly. (And then again, I'm not sure if my head will
cope if I did understand it)

I think what I'm finding really confusing is that the current spec and
implementation means that the inversion and folding all happen together,
such that this:

$ ./perl -Ilib -lwe '$_ = "ss"; utf8::upgrade($_); print "$] ", /\A[^\xDF]+\z/i ? "Y" : "N";'
5.014000 N

*doesn't* mean build a match built up progressively:

1: U+00DF
2: everything but the list in step 1
3: one or more of the set of step 2
4: match insensitively with step 3, anchored
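
The four steps above can be sketched as follows, purely as a model of the "one class, one character" reading and not of how the regex engine is implemented. The helper naive_inverted_class_match is hypothetical, and the sketch assumes Perl 5.16 or later for fc().

```perl
use v5.16;    # assumption: Perl 5.16+, for fc() and unicode_strings

# Hypothetical helper (not part of Perl): models /\A[^...]+\z/i where the
# class is only ever compared against one input character at a time.
# $excluded is an array reference of the characters listed in the class.
sub naive_inverted_class_match {
    my ($string, $excluded) = @_;

    # Step 1: the listed characters, keyed by their full case fold.
    # For U+00DF the key is the two-character string "ss".
    my %listed = map { fc($_) => 1 } @$excluded;

    # Steps 2 to 4: one or more characters, each compared on its own and
    # case-insensitively, none of which may be in the list.
    return 0 unless length $string;
    for my $ch (split //, $string) {
        return 0 if $listed{ fc($ch) };
    }
    return 1;
}

say naive_inverted_class_match("ss",   ["\xDF"]) ? "Y" : "N";
say naive_inverted_class_match("\xDF", ["\xDF"]) ? "Y" : "N";
```

Because a single "s" never equals the two-character fold "ss", this model accepts "ss" while still rejecting ß itself, which is the behaviour the progressive reading predicts.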


I am (so far) lacking the imagination to spot how to blow holes in that plan.

Nicholas Clark
Tom Christiansen
2011-04-29 13:07:05 UTC
Post by Nicholas Clark
This stuff is confusing.
Ya think? :)
Post by Nicholas Clark
but all in one step, given that (if I'm understanding this correctly)
/[\xDF]/i is equivalent to /ss/i
so
$_ =~ /[\xDF]/i implies $_ =~ /ss/i
and hence that
$_ =~ /[^\xDF]/i implies $_ !~ /ss/i
and it's that last jump that is really catching everyone out.
Or am I getting this subtly wrong?
No, I think that's exactly it.
Post by Nicholas Clark
Whichever way, it does feel that this spec currently, where inversion happens
at the point of insensitive matching, has emergent behaviour which makes it
dangerous and counterintuitive to the point of uselessness to almost anyone in
the real world.
Back in simple ASCII days, one could implement a bracketed character class
very efficiently. One way to do it is as a 128-bit mask where each bit
represents whether that charclass includes the corresponding code point.
To complement the class, you could just complement the bitmask. (In
practice, one used a 256-bit mask for the full range of byte values.) I
think similar logic was also once used for tr.
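
A minimal sketch of that bitmask idea, using Perl's vec() on a 32-byte string as the 256-bit mask. The helper names class_mask, complement_mask and in_class are made up for illustration; this is not how perl's own regex engine stores classes.

```perl
use strict;
use warnings;

# Build a 256-bit mask with one bit set per listed byte value.
sub class_mask {
    my (@chars) = @_;
    my $mask = "\0" x 32;    # 256 bits, all clear
    vec($mask, ord($_), 1) = 1 for @chars;
    return $mask;
}

# Complementing the class is just flipping every bit.
sub complement_mask {
    my ($mask) = @_;
    my $inverted = "\0" x 32;
    vec($inverted, $_, 1) = vec($mask, $_, 1) ? 0 : 1 for 0 .. 255;
    return $inverted;
}

# Membership is a single bit lookup.
sub in_class {
    my ($mask, $char) = @_;
    return vec($mask, ord($char), 1);
}

my $abc     = class_mask(qw(a b c));     # models [abc]
my $not_abc = complement_mask($abc);     # models [^abc]

print "a in [abc]:  ", in_class($abc, "a"), "\n";
print "d in [abc]:  ", in_class($abc, "d"), "\n";
print "a in [^abc]: ", in_class($not_abc, "a"), "\n";
print "d in [^abc]: ", in_class($not_abc, "d"), "\n";
```

Every member here is exactly one byte, so the complement is well defined; nothing in this representation can express a member that is two characters long.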

It seems that this is the mental model we still have of bracketed
character classes. Logically, the complemented [^\xDF] is all (single)
code points except for that one, just as the complemented [^abc] is all
code points apart from those three. In a repertoire with over a million
characters, you can't have a bitmask for each of those, so you have to
implement it differently. And that's where the trouble arises.

We look *for* the particular thing, since looking for all the others
directly is too hard. We use extra logic to decide whether it's a
complemented case. *And* we use casefolding rules when there's a /i
involved. It's case-folding that leads to the multichar situation.

I'd like to think about how people would use this stuff *in practice*. The
problem is that in practice, the \xDF case isn't too common, so we don't
have many examples to go by. However, there are quite a lot of Greek code
points where this arises. This is due to their weird lowercase Mark,
U+0345 COMBINING GREEK YPOGEGRAMMENI, which is both \p{Lower} and \p{Mn}.

lower: ᾲ στο διάολο
lower: \x{1FB2} \x{3C3}\x{3C4}\x{3BF} \x{3B4}\x{3B9}\x{3AC}\x{3BF}\x{3BB}\x{3BF}

title: Ὰͅ Στο Διάολο
title: \x{1FBA}\x{345} \x{3A3}\x{3C4}\x{3BF} \x{394}\x{3B9}\x{3AC}\x{3BF}\x{3BB}\x{3BF}

upper: ᾺΙ ΣΤΟ ΔΙΆΟΛΟ
upper: \x{1FBA}\x{399} \x{3A3}\x{3A4}\x{39F} \x{394}\x{399}\x{386}\x{39F}\x{39B}\x{39F}

That's because U+1FB2 goes to U+1FBA U+0399 for uppercase, but
it goes to U+1FBA U+0345 in titlecase.
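
These mappings can be inspected with the case functions directly. A small sketch, assuming a Perl recent enough for "use v5.16" (chosen here only to switch on strict and unicode_strings); the code_points helper is hypothetical and exists just to keep the output in ASCII.

```perl
use v5.16;

my $alpha = "\x{1FB2}";   # GREEK SMALL LETTER ALPHA WITH VARIA AND YPOGEGRAMMENI

# Hypothetical helper: render each code point of a string as U+XXXX.
sub code_points {
    my ($string) = @_;
    return join " ", map { sprintf("U+%04X", ord($_)) } split(//, $string);
}

say "lower: ", code_points(lc $alpha);
say "title: ", code_points(ucfirst $alpha);
say "upper: ", code_points(uc $alpha);
```

The title and upper lines each show two code points for one input character, which is the multi-character mapping under discussion.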

I am quite sure that someone would want to use /^\x{1FB2}/i and
have it catch all three cases, of

The lowercase
"\N{GREEK SMALL LETTER ALPHA WITH VARIA AND YPOGEGRAMMENI}"
becomes this two-codepoint sequence in uppercase:
"\N{GREEK CAPITAL LETTER ALPHA WITH VARIA}\N{GREEK CAPITAL LETTER IOTA}"
but becomes this two-codepoint sequence in uppercase:
"\N{GREEK CAPITAL LETTER ALPHA WITH VARIA}\N{COMBINING GREEK YPOGEGRAMMENI}"

But what I don't know is whether they are expecting /^[^\x{1FB2}]/i
to rule all three of those *out*, that is, be like !/^[\x{1FB2}]/i.

The trouble is that I suspect they might.
Post by Nicholas Clark
Expressing it like that troubles me too, as the way my mental model works,
a (non-inverted) character range of one in my head is the same as a literal.
Exactly. We're doing a different sort of set-logic in our heads
than necessarily seems to follow here. But I look at the Greek
case and wonder what real users are going to expect here.

--tom
Tom Christiansen
2011-04-29 13:10:01 UTC
This should have been:

The lowercase
"\N{GREEK SMALL LETTER ALPHA WITH VARIA AND YPOGEGRAMMENI}"
becomes this two-codepoint sequence in uppercase:
"\N{GREEK CAPITAL LETTER ALPHA WITH VARIA}\N{GREEK CAPITAL LETTER IOTA}"
but becomes this two-codepoint sequence in TITLECASE not uppercase:
"\N{GREEK CAPITAL LETTER ALPHA WITH VARIA}\N{COMBINING GREEK YPOGEGRAMMENI}"

--tom
Nicholas Clark
2011-04-29 13:26:22 UTC
Post by Tom Christiansen
But what I don't know is whether they are expecting /^[^\x{1FB2}]/i
to rule all three of those *out*, that is, be like !/^[\x{1FB2}]/i.
The trouble is that I suspect they might.
Thanks,

That's certainly one part that I'm missing in the bigger picture.

Nicholas Clark
Tom Christiansen
2011-04-29 14:04:40 UTC
Post by Nicholas Clark
Post by Tom Christiansen
But what I don't know is whether they are expecting /^[^\x{1FB2}]/i
to rule all three of those *out*, that is, be like !/^[\x{1FB2}]/i.
The trouble is that I suspect they might.
Thanks,
That's certainly one part that I'm missing in the bigger picture.
I suppose if we'd any Greek speakers on the list, they'd've chimed
in by now. Pity, that.

However this sorts out, I feel we're developing insights here that
would be extremely helpful to the Unicode Consortium, especially
given Karl's reminder of the imminent deadline this weekend for feedback
on something that could very well affect us all.

I hope to have something to say to them by the end of the day.
It would be great if others did, too.

--tom
Tom Christiansen
2011-04-29 14:25:44 UTC
I now think there's also something else important going on here.

It's that regexes written in a bytewise mentality do not
automatically translate well into the Unicode world.

George's pattern was this:

[^\x00-\x1f\x7f-\xff :]

That is assuming that the complement of the "high-bit bytes" is the
low-bit bytes, that is, ASCII. But in Unicode, this is not at all true.
Using [^\x80-\xFF] to get nothing but ASCII (and nothing more) works
*only if* you're in ASCII mode; if you're in Unicode mode, it gets the
full repertoire of a million-plus code points with those 128 of them
subtracted from that huge set.

The cardinality of the ASCII set is quite different from that
of the Unicode set. Complements pull in much much much more now.

That's why I suggested /aa.

I think these incorrect cardinality assumptions underlie these
troubles. It may be we've been skating on thin ice for some time
now and just hadn't realized it yet.

Similarly, the complement of ASCII via [^\x00-\x7F] is not high-bit
bytes at all. It's a huge number of characters.
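
The cardinality point can be seen with a single non-Latin-1 character. A hedged sketch, assuming Perl 5.14 or later for the /aa modifier; the code points chosen (U+263A and U+212A) are just convenient examples and the variable names are illustrative.

```perl
use strict;
use warnings;

my $smiley = "\x{263A}";   # WHITE SMILING FACE, outside both ASCII and Latin-1

# The complement of the high-bit byte range is not "ASCII" once the
# subject can hold arbitrary code points.
my $in_complement = $smiley =~ /\A[^\x80-\xFF]\z/ ? 1 : 0;

# Naming the wanted set positively avoids the surprise.
my $in_ascii_range = $smiley =~ /\A[\x00-\x7F]\z/ ? 1 : 0;
my $in_ascii_prop  = $smiley =~ /\A\p{ASCII}\z/   ? 1 : 0;

# /aa keeps ASCII and non-ASCII characters from folding to each other:
# KELVIN SIGN (U+212A) folds to "k" under plain /i but not under /iaa.
my $kelvin       = "\x{212A}";
my $kelvin_i     = $kelvin =~ /\Ak\z/i   ? 1 : 0;
my $kelvin_iaa   = $kelvin =~ /\Ak\z/iaa ? 1 : 0;

print "U+263A in [^\\x80-\\xFF]: $in_complement\n";
print "U+263A in [\\x00-\\x7F]:  $in_ascii_range\n";
print "U+263A in \\p{ASCII}:    $in_ascii_prop\n";
print "U+212A =~ /k/i:         $kelvin_i\n";
print "U+212A =~ /k/iaa:       $kelvin_iaa\n";
```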

--tom
Karl Williamson
2011-04-29 15:46:16 UTC
Post by Tom Christiansen
I'd like to think about how people would use this stuff *in practice*. The
problem is that in practice, the \xDF case isn't too common, so we don't
have many examples to go by.
Actually, I remember reading the opposite, that ß was the most common
and important of the multi-char folds. I believe that the reason it
exists is simply for mathematical completeness.

I am not a German speaker but we do have some on this list. My
understanding is that ß is already lower case, there is no lower case
equivalent to it, but the upper case of ß is 'SS'. The case fold of
'SS' is 'ss', and therefore by extension so should be the case fold of
ß. But this is in some sense contrary to real German, where there are
minimal pair words that differ only by ß and ss and mean different
things. I think maße and masse is an example, or is it müße and müsse
(I don't know).

As an English speaker, I would use /i to try to get the same word in all
its possible capitalizations. (People have pointed out that that isn't
really possible in English either due to homonyms and acronyms, but it's
something I and others do expect, nonetheless.) The Unicode practice of
assuming transitivity where it really doesn't happen in the native
language leads to the case fold of ß being 'ss', when in fact I don't
think it is called for in the language. I asked Steffen this question
on IRC some months ago.
Post by Tom Christiansen
there are quite a lot of Greek code
points where this arises. This is due to their weird lowercase Mark,
U+0345 COMBINING GREEK YPOGEGRAMMENI, which is both \p{Lower} and \p{Mn}.
lower: ᾲ στο διάολο
lower: \x{1FB2} \x{3C3}\x{3C4}\x{3BF} \x{3B4}\x{3B9}\x{3AC}\x{3BF}\x{3BB}\x{3BF}
title: Ὰͅ Στο Διάολο
title: \x{1FBA}\x{345} \x{3A3}\x{3C4}\x{3BF} \x{394}\x{3B9}\x{3AC}\x{3BF}\x{3BB}\x{3BF}
upper: ᾺΙ ΣΤΟ ΔΙΆΟΛΟ
upper: \x{1FBA}\x{399} \x{3A3}\x{3A4}\x{39F} \x{394}\x{399}\x{386}\x{39F}\x{39B}\x{39F}
That's because U+1FB2 goes to U+1FBA U+0399 for uppercase, but
it goes to U+1FBA U+0345 in titlecase.
I am quite sure that someone would want to use /^\x{1FB2}/i and
have it catch all three cases, of
The lowercase
"\N{GREEK SMALL LETTER ALPHA WITH VARIA AND YPOGEGRAMMENI}"
becomes this two-codepoint sequence in uppercase:
"\N{GREEK CAPITAL LETTER ALPHA WITH VARIA}\N{GREEK CAPITAL
LETTER IOTA}"
but becomes this two-codepoint sequence in TITLECASE not uppercase:
"\N{GREEK CAPITAL LETTER ALPHA WITH VARIA}\N{COMBINING GREEK
YPOGEGRAMMENI}"
Post by Tom Christiansen
But what I don't know is whether they are expecting /^[^\x{1FB2}]/i
to rule all three of those *out*, that is, be like !/^[\x{1FB2}]/i.
I'm including in the cc list the 2 other people who read this list that
I know are at least somewhat familiar with Greek (modern or ancient),
and with Tom's correction to his original.
Tom Christiansen
2011-04-29 16:07:58 UTC
Post by Karl Williamson
I'm including in the cc list the 2 other people who read this list that
I know are at least somewhat familiar with Greek (modern or ancient),
and with Tom's correction to his original.
Good, I've pinged a few people I know who have a bit of Greek, too.
My long message boils down to what *should* happen with things like:

ᾲ στο διάολο (lowercase)
Ὰͅ Στο Διάολο (titlecase)
ᾺΙ ΣΤΟ ΔΙΆΟΛΟ (uppercase)

when you are doing a case-insensitive match with /ᾲ/.

Right now, /^ᾲ/i will match all three lines, and similarly
both of !/^ᾲ/i and /^[^ᾲ]/i will rule all three of those out.

The pressing question is whether this is what someone familiar
with Greek would *expect* to happen? Is the current behavior
described above what they would expect, or isn't it?

--tom
Tom Christiansen
2011-04-29 16:30:23 UTC
Here's some feedback from somebody who has some Greek.

--tom

In Ancient Greek the beginnings of sentences are NOT capitalized,
maybe even for this reason. The beginning of an entire chapter might
be capitalized, and proper names are capitalized, but proper names
don't begin with really odd things as a rule. And things written in
all uppercase usually just lose their breathings and accents, though
you are right about the iota subscript becoming a "real" letter. I've
never seen a "title case" used in classical Greek. I don't think you
are SUPPOSED to use an iota subscript with a capital letter like you do
in your title case example. Just looked, and the name of the book of
Hebrews has no breathing (h sound) or accent and is written in ALL
CAPS.

As to what modern Greek speakers expect, I have no clue. I think they
don't have any written accents anyhow. Just looked at a Greek
newspaper--no accents, no breathings, and most important NO TITLE CASE.
Headlines only have their first letter capitalized.
lowercase: ᾲ στο διάολο
titlecase: Aι στο διάολο
uppercase: AΙ ΣΤΟ ΔΙAΟΛΟ
Karl Williamson
2011-04-29 17:18:26 UTC
Here's another proposal to consider instead:

Don't have the default in 'use 5.1[24]' be /u matching.

I actually think the default should be /aa matching. But the idea of
making it be /a matching led to earlier vociferous resistance.

I don't know how George's example came up; somehow the source string got
to be in utf8, I presume. But the regex was written with an ASCII
mind-set, not considering the possibility of Unicode semantics. George,
how did it get to be utf8?

Natural language is complicated, full of quirks and exceptions that
don't lead to being representable by relatively simple mathematical
models. But that is what Unicode has attempted. I do get the sense
that they are somewhat ivory-towered, more concerned with their document
than real-world applications, in spite of Unicode by its nature, being
very real-worldly. And it turns out that they haven't gotten
case-insensitive matching worked out correctly yet, a fact that I was
unaware of when I started.

I think it was a big mistake for Perl to long ago decide that existing
programs could straddle the non/Unicode world without problems. We are
seeing here how the consequences of an ASCII mind-set causes problems
when extended to the full Unicode set.

The belief back then was nothing would change unless Unicode was called
for, and Unicode was called for only when a string had to be encoded in
utf8. There is some merit to that, but we have found lots of problems
with it over the years, which gave rise to the term "The Unicode Bug".

Unicode is a big, scary, not-well understood, and it turns out, buggy,
world. I view it as unconscionable for us to expose naive innocents to
it without their having signed a consent agreement that explains the
risks. It is NOT just a transparent extension to ASCII. There are, for
example, little-understood security issues, with potential large
economic, and worse consequences. I now believe that to get Unicode
semantics, one should have to explicitly say one wants it, say by an
explicit 'use feature unicode_strings', which is not part of a feature
bundle.

Making /aa be the default also solves the issue for when something gets
encoded as utf8, but really shouldn't be treated as full Unicode.
George Greer
2011-04-29 18:09:40 UTC
Post by Karl Williamson
Don't have the default in 'use 5.1[24]' be /u matching.
I actually think the default should be /aa matching. But the idea of making
it be /a matching led to earlier vociferous resistance.
I don't know how George's example came up; somehow the source string got to
be in utf8, I presume. But the regex was written with an ASCII mind-set, not
considering the possibility of Unicode semantics. George, how did it get to
be utf8?
The strings for the MIME::Entity::->build() call were literal strings with
some variables interpolated that came from XML files. The XML files
themselves are declared with UTF-8 encoding although in practice it is
almost always ASCII.

Effectively:

perl -MMIME::Entity -le '
$a="ss"; utf8::upgrade($a); # Pretend from XML.
print MIME::Entity::->build("X-$a" => "X", "Data" => "X");
'
==>
Bad RFC822 field name 'X-SS' at /home/ggreer/bin/perlbrew/perls/perl-5.14.0-RC1/lib/site_perl/5.14.0/MIME/Entity.pm line 679
Post by Karl Williamson
Making /aa be the default also solves the issue for when something gets
encoded as utf8, but really shouldn't be treated as full Unicode.
The module's regex attempted to validate ASCII-ness but now it is a wild
Unicode world. Using /i indiscriminately didn't help either.
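
A minimal sketch of the /aa workaround mentioned earlier in the thread, applied to a field-name check of this shape. The sub name is_field_name is hypothetical and is not part of Mail::Header or MIME-tools; the /aa modifier itself needs perl 5.14 or later.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the module's field-name check.
# /aa stops /i from folding across the ASCII/non-ASCII boundary,
# so \xDF inside the negated class can no longer stand for "ss".
my $FIELD_NAME = qr/^[^\x00-\x1f\x7f-\xff :]+:/iaa;

sub is_field_name {
    my ($tag) = @_;
    return $tag =~ $FIELD_NAME ? 1 : 0;
}

my $tag = "X-ss:";
utf8::upgrade($tag);    # pretend the value came from a UTF-8 XML file

print is_field_name($tag), "\n";          # upgraded ASCII-only name
print is_field_name("Bad Name:"), "\n";   # contains a space, so rejected
```

The character class holds no letters, so /i only matters to it through folding; /aa keeps the flag for anything else in the pattern while taking the class out of the multi-character fold business.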

However there's a lot of mixing going on now that we are stuck with. If
someone does:

perl -ne 'print if /[\x80-\xff]/' */*.xml

Are they looking for high-bit bytes or just a range of Unicode characters?
--
George Greer
Tom Christiansen
2011-04-29 19:59:22 UTC
Post by Karl Williamson
Don't have the default in 'use 5.1[24]' be /u matching.
I actually think the default should be /aa matching. But the idea of
making it be /a matching led to earlier vociferous resistance.
Wouldn't that break *all* old code that relies on
charclass abbreviations working on Unicode?

% perl5.8.0 -le 'print "\x{1FB2}" =~ /\w/ || 0'
1

% perl5.10.0 -le 'print "\x{1FB2}" =~ /\w/ || 0'
1

% perl5.12.3 -le 'print "\x{1FB2}" =~ /\w/ || 0'
1

% blead -le 'print "\x{1FB2}" =~ /\w/ || 0'
1

% blead -Mre=/a -le 'print "\x{1FB2}" =~ /\w/ || 0'
0

% blead -Mre=/aa -le 'print "\x{1FB2}" =~ /\w/ || 0'
0

Given that we've done the RL1.2a thing with \w et al for
something like ten years now, I don't believe making
/a or /aa the default is possible.

I believe there had also been a Rule-1 dictum on this that
we must *not* go back to the ASCII-only world.

I guess you might try for Rule 2, but I wouldn't hold my breath.

--tom
Karl Williamson
2011-04-29 23:38:53 UTC
Post by Tom Christiansen
Post by Karl Williamson
Don't have the default in 'use 5.1[24]' be /u matching.
I actually think the default should be /aa matching. But the idea of
making it be /a matching led to earlier vociferous resistance.
Wouldn't that break *all* old code that relies on
charclass abbreviations working on Unicode?
% perl5.8.0 -le 'print "\x{1FB2}" =~ /\w/ || 0'
1
% perl5.10.0 -le 'print "\x{1FB2}" =~ /\w/ || 0'
1
% perl5.12.3 -le 'print "\x{1FB2}" =~ /\w/ || 0'
1
% blead -le 'print "\x{1FB2}" =~ /\w/ || 0'
1
% blead -Mre=/a -le 'print "\x{1FB2}" =~ /\w/ || 0'
0
% blead -Mre=/aa -le 'print "\x{1FB2}" =~ /\w/ || 0'
0
Given that we've done the RL1.2a thing with \w et al for
something like ten years now, I don't believe making
/a or /aa the default is possible.
I believe there had also been a Rule-1 dictum on this that
we must *not* go back to the ASCII-only world.
I believe that making something the default does not contravene that
invocation of Rule 1. The proposal then was that \d, \w, and \s be
ASCII only. This merely changes the default.

I believe we may be at a point where we can't please both sides. The
default should be to favor least astonishment.
Tom Christiansen
2011-04-29 23:53:05 UTC
Post by Karl Williamson
I believe that making something the default does not contravene that
invocation of Rule 1. The proposal then was that \d, \w, and \s be
ASCII only. This merely changes the default.
I believe we may be at a point where we can't please both sides. The
default should be to favor least astonishment.
I haven't checked 5.7, but certainly since 5.8.0 these all return "Yes":

% perl5.8.0 -wle 'print lc("\x{1FB2}") =~ /^\w+$/i ? "Yes" : "No"'
% perl5.8.0 -wle 'print ucfirst("\x{1FB2}") =~ /^\w+$/i ? "Yes" : "No"'
% perl5.8.0 -wle 'print uc("\x{1FB2}") =~ /^\w+$/i ? "Yes" : "No"'

% perl5.8.0 -wle 'print lc("\x{1FB2}") =~ /^\x{1FB2}$/i ? "Yes" : "No"'
% perl5.8.0 -wle 'print ucfirst("\x{1FB2}") =~ /^\x{1FB2}$/i ? "Yes" : "No"'
% perl5.8.0 -wle 'print uc("\x{1FB2}") =~ /^\x{1FB2}$/i ? "Yes" : "No"'

Where those are respectively:

LC: "\x{1FB2}"
TC: "\x{1FBA}\x{345}"
UC: "\x{1FBA}\x{399}"
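
For anyone who wants to check those three mappings locally, here is a small sketch that dumps the code points Perl's own lc, ucfirst, and uc produce. The helper name codepoints is made up for illustration.

```perl
use strict;
use warnings;

# Render a string as space-separated U+XXXX code points.
sub codepoints {
    my ($str) = @_;
    return join ' ', map { sprintf 'U+%04X', ord } split //, $str;
}

my $alpha = "\x{1FB2}";   # GREEK SMALL LETTER ALPHA WITH VARIA AND YPOGEGRAMMENI

printf "LC: %s\n", codepoints(lc $alpha);
printf "TC: %s\n", codepoints(ucfirst $alpha);
printf "UC: %s\n", codepoints(uc $alpha);
```

The titlecase and uppercase results are each two code points long, which is why a single-character pattern has to rely on multi-character folding to match them under /i.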

We cannot know how much code is out there that relies on this sort of thing
working. Given that it *has* worked for going on ten years now, there is
almost certainly a ton of it. So I cannot see how one can countenance a
release whose default breaks those. That really cannot be what you're
suggesting, can it? :(

--tom
Karl Williamson
2011-04-30 15:57:01 UTC
I decided to step back and write an executive summary of the issue to
clarify it for people. In doing so, I found myself coming to a
different conclusion than before. So here it is:

Unicode has two properties which are used by Perl in /i matching. One
of these, called "Simple Case Folding", maps a character to a single
other character. The simple case fold of 'A' is 'a'. The other, called
"Full Case Folding", maps a single character to two to three characters.
About 10% of all the defined case folds are of the Full variety.
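
The two properties can be inspected side by side with the core Unicode::UCD module. This is only a sketch: it assumes a Perl recent enough that casefold() returns the 'simple' and 'full' keys (later than the releases discussed in this thread), and the helper name folds_for is invented for the example.

```perl
use strict;
use warnings;
use Unicode::UCD qw(casefold);

# Look up the case-folding record for a code point and return its
# simple and full folds as space-separated hex code points.
sub folds_for {
    my ($cp) = @_;
    my $cf = casefold($cp) or return ('', '');   # undef when no fold is defined
    return ($cf->{simple}, $cf->{full});
}

for my $cp (0x41, 0xDF, 0x1FB2) {
    my ($simple, $full) = folds_for($cp);
    printf "U+%04X  simple=[%s]  full=[%s]\n", $cp, $simple, $full;
}
```

For 'A' both entries should name the same single code point, while for ß (U+00DF) the simple entry should be empty and the full entry should list two code points, which is the case George's regex tripped over.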

The Unicode standard does not place requirements on regular expression
matching. However, they have a document of recommendations, perhaps
viewed as a statement of best practices, called Technical Standard #18
(confusingly abbreviated TR18), that Perl has treated as standard.
Because of various issues inherent to multi-character folding, Unicode
is proposing to stop recommending full case fold matching and to
recommend only single case fold matching. A comment period on this
proposal is just ending, and it will be reviewed by Unicode's Technical
Committee in a couple of weeks. Its ultimate disposition, or even the
timing are not currently known.

Perl has long purported to support full case folding. If the current
Unicode proposal is adopted, Perl will be out of compliance, but again,
TR18 is a set of recommendations, and not the official standard. Perl's
implementation has been EXTREMELY buggy, especially with [bracketed
character classes]. 5.14 fixes many, but not all of these bugs (as
Nicholas just discovered).

George has found a case in real-life customer data where the application
of full case folding (in an inverted [class]) leads to counter-intuitive
results. Because of the bugs in earlier Perl releases, 5.14 is the
first release where this shows up. Cases like this are part of the
reason that Unicode is proposing to recommend only simple case folding.

I believe there isn't any solution that doesn't break some extant
programs. Some programs rely on full case folding; some rely on it not
working.

It has been my impression that Unicode doesn't even consider making
changes like the one proposed unless there is really really good reason
to. That means that they have now realized that full case folding is
inherently very problematic. This leads me to believe that it should
therefore not be enabled by default. If Unicode admits to no longer
believing in it, we should take heed. (Even if their proposal is changed
to retain full case folding, they will have to say that it is problematic.)

But there are programs that currently rely on it. I therefore think
there needs to be a new one-line construct to enable it, something like
'use re "full_folding"'. It is awfully late in the development cycle to
be introducing something like this, but I don't see any alternative. I
can have it implemented in a few (2-3) days; I can see us flailing
around for that long just discussing other alternatives.
Tom Christiansen
2011-04-30 16:17:15 UTC
Post by Karl Williamson
I decided to step back and write an executive summary of the issue to
clarify it for people.
But there are programs that currently rely on it.
I think we've come up against the issue of how many programs rely on the
old working behavior versus how many rely on the old buggy behavior.

And I bet we can't know how many there are in either camp, let alone
try to balance them.
Post by Karl Williamson
I therefore think there needs to be a new one-line construct to
enable it, something like 'use re "full_folding"'. It is
awfully late in the development cycle to be introducing
something like this,
I agree.
Post by Karl Williamson
but I don't see any alternative.
Isn't the alternative shipping 5.14.0 as is, then taking our time to
introduce new flags for 5.14.1? I don't like rushing out changes after
the horse has left the gate, so to speak. How horrible would that be?

How else would we get the time to consider this fully?
Post by Karl Williamson
I can have it implemented in a few (2-3) days; I can see us
flailing around for that long just discussing other
alternatives.
Perhaps there may be a couple possible alternatives:

use re "full_folding";
no re "full_folding";

use re "simple_folding";
no re "simple_folding";

I momentarily considered whether this should be something like a
unicode_case flag, but decided against it. Although supports
such, with a warning in the docs that there may be a performance
penalty, ICU does not, since everything is supposed to be Unicode
there anyway.

Thanks, Karl. Have you submitted feedback with both pros and cons?
I think today is the last day.

--tom
Tom Christiansen
2011-04-30 16:25:30 UTC
I momentarily considered whether [THERE] should be something like a
unicode_case flag, but decided against it. Although [JAVA] supports
such, with a warning in the docs that there may be a performance
penalty, ICU does not, since everything is supposed to be Unicode
there anyway.
--tom
Jesse Vincent
2011-04-30 16:26:33 UTC
Post by Tom Christiansen
Isn't the alternative shipping 5.14.0 as is, then taking our time to
introduce new flags for 5.14.1? I don't like rushing out changes after
the horse has left the gate, so to speak. How horrible would that be?
We'd be introducing the new flag for 5.16.0, not 5.14.1. But yes, that
is the other option I've just put to Karl on IRC.
Jesse Vincent
2011-04-30 16:37:09 UTC
Post by Jesse Vincent
We'd be introducing the new flag for 5.16.0, not 5.14.1. But yes, that
is the other option I've just put to Karl on IRC.
Ok, that sounds safer.
What then is left to do for an RC2 on 5.14.0?
At this point, write out a comprehensive explanation of this breakage
and appropriate workarounds for the perldelta.
And are there any concrete plans regarding 5.14.1?
One month after we ship 5.14.0 with fixes for the catastrophic screwups
we don't catch before 5.14.0.

Best,
Jesse
thanks,
--tom
--
Karl Williamson
2011-04-30 18:41:27 UTC
I really don't like the idea of breaking code like George found, for a
number of reasons.

So another tack to take is to make 5.14 no worse than 5.12.

The real problem area is in bracketed character classes. I think the
5.14 behavior is probably acceptable for everything else.

So, let's just talk about the classes. There was code for
multi-character folds in these classes when I started out, but I believe
none of it worked. I put a fix into 5.10.1, I believe, that got some of
it working. But not much. There were no further changes until 5.14.

So the situation for multi-character folds in classes in 5.12
was that they almost entirely didn't work. It would be hard to get back
to the exact state of what worked and didn't work then, but I think the
least worst solution then for 5.14 is to just disable multi-char folds
in bracketed character classes.

Almost all the cases where they did work in 5.12 were when the class got
optimized into a straight sequence.

So I made essentially this proposal a couple of days ago, and Tom seemed
to think this was an ok idea, but Nicholas was concerned that it would
mean that /ß/i would no longer mean /[ß]/i.

So here is a revised proposal. Instead of documenting that we don't
plan to accept multi-char folds in classes, we say that we don't yet
accept them unless the class is optimized away, and that this is a
continuation of the 5.12 behavior, except that a few instances where
they did work in 5.12 will no longer. This would be the only code that
breaks, and I doubt that there is very much code like this at all. It
certainly is a lot smaller amount of affected code than any other proposal.

I don't believe any multi-char fold in an inverted class worked in 5.12.
Yet another option would be to not allow just those in 5.14.

I could have either of these submitted within an hour.
Karl Williamson
2011-04-30 18:48:57 UTC
Post by Karl Williamson
I really don't like the idea of breaking code like George found, for a
number of reasons.
So another tack to take is to make 5.14 no worse than 5.12.
The real problem area is in bracketed character classes. I think the
5.14 behavior is probably acceptable for everything else.
So, let's just talk about the classes. There was code for
multi-character folds in these classes when I started out, but I believe
none of it worked. I put a fix into 5.10.1, I believe, that got some of
it working. But not much. There were no further changes until 5.14.
So the situation for multi-character folds in classes in 5.12
was that they almost entirely didn't work. It would be hard to get back
to the exact state of what worked and didn't work then, but I think the
least worst solution then for 5.14 is to just disable multi-char folds
in bracketed character classes.
Almost all the cases where they did work in 5.12 were when the class got
optimized into a straight sequence.
So I made essentially this proposal a couple of days ago, and Tom seemed
to think this was an ok idea, but Nicholas was concerned that it would
mean that /ß/i would no longer mean /[ß]/i.
So here is a revised proposal. Instead of documenting that we don't plan
to accept multi-char folds in classes, we say that we don't yet accept
them unless the class is optimized away, and that this is a continuation
of the 5.12 behavior, except that a few instances where they did work in
5.12 will no longer. This would be the only code that breaks, and I
doubt that there is very much code like this at all. It certainly is a
lot smaller amount of affected code than any other proposal.
I don't believe any multi-char fold in an inverted class worked in 5.12.
Yet another option would be to not allow just those in 5.14.
I could have either of these submitted within an hour.
(Not including documentation changes)
Tom Christiansen
2011-04-30 19:12:15 UTC
Isn't 0xDF and "SS" *the* big problem? I don't think the others are
troublesome, are they? What about not generating multichar folds in
charclasses that contain nothing over 255? Or would that be resurrecting
the Unicode Bug?

--tom
Karl Williamson
2011-04-30 19:43:52 UTC
Post by Tom Christiansen
Isn't 0xDF and "SS" *the* big problem? I don't think the others are
troublesome, are they? What about not generating multichar folds in
charclasses that contain nothing over 255? Or would that be resurrecting
the Unicode Bug?
--tom
I think you're right that all or nearly all existing code that's going
to get broken will be over ß and ss.

The Unicode Bug is about utf8 vs non-utf8 encoding having different
semantics, so no, this wouldn't be resurrecting it. But it is kind of
like the Unicode bug, where addition of a new character to the class
would suddenly change the behavior of the class for non-obvious and not
really related reasons.

I would prefer a more uniform approach of what I've said before, or we
just exclude this one code point always for 5.14. But I think your
approach is much better than releasing 5.14 as-is.

(And BTW, in 5.16 I think it would be something like "use re folding X"
where X is one of "simple" "full" "nfd", nfkd, etc.)
Karl Williamson
2011-04-30 21:20:08 UTC
Post by Karl Williamson
Isn't 0xDF and "SS" *the* big problem? I don't think the others are
troublesome, are they? What about not generating multichar folds in
charclasses that contain nothing over 255? Or would that be resurrecting
the Unicode Bug?
--tom
I think you're right that all or nearly all existing code that's going
to get broken will be over ß and ss.
The Unicode Bug is about utf8 vs non-utf8 encoding having different
semantics, so no, this wouldn't be resurrecting it. But it is kind of
like the Unicode bug, where addition of a new character to the class
would suddenly change the behavior of the class for non-obvious and not
really related reasons.
I would prefer a more uniform approach of what I've said before, or we
just exclude this one code point always for 5.14. But I think your
approach is much better than releasing 5.14 as-is.
(And BTW, in 5.16 I think it would be something like "use re folding X"
where X is one of "simple" "full" "nfd", nfkd, etc.)
In thinking about this some more, given the bug that Nicholas found that
affects all multi-character folds, not just \xdf, in character classes,
I think it would be best to just not offer any of them in 5.14.
Tom Christiansen
2011-04-30 21:31:23 UTC
Post by Karl Williamson
In thinking about this some more, given the bug that Nicholas found that
affects all multi-character folds, not just \xdf, in character classes,
I think it would be best to just not offer any of them in 5.14.
You mean undo something that's been there since 5.8?

% perl5.8.0 -le 'print "\x{1FB2}" =~ /\x{1FB2}/i || 0'
1
% perl5.8.0 -le 'print ucfirst("\x{1FB2}") =~ /\x{1FB2}/i || 0'
1
% perl5.8.0 -le 'print uc("\x{1FB2}") =~ /\x{1FB2}/i || 0'
1

I suppose that, after a great deal of consideration and testing,
there might be a way to eventually withdraw them for 5.16. (I
also don't think we should, but I'll leave that for then.)

But I don't see how it can be done between an RC1 and an RC2.
It seems completely against what I understand freezes to mean.

Or did you mean something else?

--tom
Nicholas Clark
2011-04-30 22:00:28 UTC
Post by Tom Christiansen
Post by Karl Williamson
In thinking about this some more, given the bug that Nicholas found that
affects all multi-character folds, not just \xdf, in character classes,
^^^^^^^^^^^^^^^^^^^^
Post by Tom Christiansen
Post by Karl Williamson
I think it would be best to just not offer any of them in 5.14.
You mean undo something that's been there since 5.8?
% perl5.8.0 -le 'print "\x{1FB2}" =~ /\x{1FB2}/i || 0'
1
% perl5.8.0 -le 'print ucfirst("\x{1FB2}") =~ /\x{1FB2}/i || 0'
1
% perl5.8.0 -le 'print uc("\x{1FB2}") =~ /\x{1FB2}/i || 0'
1
Or did you mean something else?
I think you missed the "in character classes" part of Karl's thought.
Your examples don't use []


I'm still not sure *what* I think.

But *if* a class consisting of a single character is always equivalent to a
literal of that character (ie /[a]/ is /a/, /[ß]/ is /ß/, /[ß]/i is /ß/i,
etc), one of the things I'm not sure about is whether it's better to say "no
multi character folds in character classes" or "no multi character folds in
character classes, except classes consisting of exactly one character". I
think (I think) that it's useful to maintain that explicit correspondence,
as (IIRC) Yves worked to get the engine to optimise /[a]/ to /a/ and /[.]/ to
/\./, as it was a common idiom in some circles to use regexp character class
syntax as an alternative to backslash quoting.
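
The idiom in question, as a small sketch: a one-character class used in place of backslash quoting. This only shows that the spellings accept and reject the same strings; it says nothing about how the engine compiles them. The sub name dot_match_results is made up for the example.

```perl
use strict;
use warnings;

# Try three ways of writing "a literal dot between a and b" and
# return the three match results as 1/0 flags.
sub dot_match_results {
    my ($s) = @_;
    my $class  = $s =~ /^a[.]b$/    ? 1 : 0;   # class used as quoting
    my $escape = $s =~ /^a\.b$/     ? 1 : 0;   # backslash quoting
    my $quoted = $s =~ /^a\Q.\Eb$/  ? 1 : 0;   # \Q...\E quoting
    return ($class, $escape, $quoted);
}

for my $s ('a.b', 'axb') {
    my ($class, $escape, $quoted) = dot_match_results($s);
    print "$s: class=$class escape=$escape quoted=$quoted\n";
}
```

If /[ß]/i and /ß/i are to stay interchangeable in the same way, the single-character class has to keep whatever folding the bare literal gets.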

The downside, obviously, is that (for starters) it's more complex to explain.


Digression:

Because as a general rule, rightly or wrongly on my part, I feel that it's
unfortunate if two or more different syntax choices for the same action
produce notably different performance because they trigger different
runtime implementations, where both

a: one is unambiguously always slower than the other
b: it would be possible for the compile time implementation to automatically
select the faster implementation, whichever syntax was used


because that way

a: all existing code goes faster without change
b: it kills dead style arguments based on "but this one is more efficient"
letting people pick style based on clarity (or their opinions of clarity)


(eg reverse sort ...; is now internally optimised to tell sort to sort in
reverse, so no slower than sort {$b cmp $a} ...; but usually somewhat clearer)


Nicholas Clark
Tom Christiansen
2011-04-30 22:30:39 UTC
Post by Nicholas Clark
Post by Tom Christiansen
Post by Karl Williamson
In thinking about this some more, given the bug that Nicholas found that
affects all multi-character folds, not just \xdf, in character classes,
^^^^^^^^^^^^^^^^^^^^
Post by Tom Christiansen
Or did you mean something else?
I think you missed the "in character classes" part of Karl's thought.
Your examples don't use []
You're right. I did. Thanks, Nick.
Post by Nicholas Clark
I'm still not sure *what* I think.
But *if* a class consisting of a single character is always equivalent to a
literal of that character (ie /[a]/ is /a/, /[ß]/ is /ß/, /[ß]/i is /ß/i,
etc), one of the things I'm not sure about is whether it's better to say "no
multi character folds in character classes" or "no multi character folds in
character classes, except classes consisting of exactly one character". I
think (I think) that it's useful to maintain that explicit correspondence,
as (IIRC) Yves worked to get the engine to optimise /[a]/ to /a/ and /[.]/ to
/\./, as it was a common idiom in some circles to use regexp character class
syntax as an alternative to backslash quoting.
I remember that. There was also the way a*a*a*a*a*[b] worked
differently than a*a*a*a*a*b worked.

I agree that /[b]/ and /b/ should really always be the same.

Withdrawing multichar folds in charclasses would certainly be
less of an issue than withdrawing them altogether.

I don't know how we'll ever know what something *would* do
without doing it. And I would like to have a workaround
at ready for anyone who gets affected in a negative way,
no matter what gets done.

--tom
Karl Williamson
2011-05-01 17:02:26 UTC
I wrote out a list of the concrete proposals (hopefully I got them all)
for moving ahead with this, with the advantages and disadvantages that I
see. I put it in an attachment so it can be easily edited with new
ideas and corrections.

Keep in mind that Nicholas has found a bug in multi-char folds in
character classes that is, roughly, if the character class includes both
a multi-char fold and the first letter of that fold, then it won't ever
match the fold; it will get stuck at matching that first letter.

I am not comfortable shipping 5.14 as-is. I think we need to do
something. If it is deemed ok for the churn, then I think option 4 is
the best; otherwise options 3b or 3c would be my recommendation.
Karl Williamson
2011-05-01 17:09:12 UTC
The previous email had a file-encoding issue on my system. The
attachment was saved in Latin1. Here is the same email, with the
attachment in utf8

I am not comfortable shipping 5.14 as-is. I think we need to do
something. If it is deemed ok for the churn, then I think option 4 is
the best; otherwise options 3b or 3c would be my recommendation.
Tom Christiansen
2011-05-01 17:21:03 UTC
Post by Karl Williamson
The previous email had a file-encoding issue on my system. The
attachment was saved in Latin1.
It worked fine for me.
Post by Karl Williamson
Here is the same email, with the attachment in utf8
Um, howso? Here are the MIME instructions from the first piece:

Content-Type: text/plain;
name="options"
Content-Transfer-Encoding: base64
Content-Disposition: attachment;
filename="options"

whereas here they are from second piece:

Content-Type: text/plain;
name="options"
Content-Transfer-Encoding: base64
Content-Disposition: attachment;
filename="options"

I see no difference at all.

When I send things like that, I send them this way:

Content-Description: the greek_mfold.t file (in UTF-8)
Content-Disposition: inline
Content-Type: text/plain; charset="UTF-8"; filename="greek_mfold.t"; name="greek_mfold.t"
Content-ID: <***@chthon.perl.com>
Content-Transfer-Encoding: quoted-printable

That way the charset is plainly [:)] specified. Your anonymous
base64 binaries don't. Although I don't know why for you they
got double-encoded into Latin1. You're using:

User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.2.14) Gecko/20110223 Thunderbird/3.1.8

So that must be doing something sneaky.
Post by Karl Williamson
I am not comfortable shipping 5.14 as-is. I think we need to do
something. If it is deemed ok for the churn, then I think option 4 is
the best; otherwise options 3b or 3c would be my recommendation.
You've done a customarily thorough job in delineating the problem-space
for us, Karl. Thank you very very much for that.

As for anything else, I would like to take some time this morning to
weigh the various pros and cons of each individual option.

Central to all this is the surprises that can occur when patterns
implicitly written for byte data are extended to Unicode character data.
I have a hunch this is only one clear example of this issue, and that
there are others, too.

I'll get back to you.

thanks again,

--tom
Karl Williamson
2011-05-01 17:16:36 UTC
On 05/01/2011 11:09 AM, Karl Williamson wrote:
It still doesn't look right on my system. If you have weirdness on
yours, the single non-ASCII character is \xDF, LATIN SMALL
LETTER SHARP S
Tom Christiansen
2011-05-01 17:21:54 UTC
Post by Karl Williamson
It still doesn't look right on my system. If you have weirdness on
yours, the single non-ASCII character is \xDF, LATIN SMALL
LETTER SHARP S
I think it's a browser-qua-mailer bug. Try composing it the way I do.

--tom
Tom Christiansen
2011-05-02 01:50:02 UTC
Karl came up with this list of possible directions:

1) Ship 5.14 as-is
2) Exclude just the ß=>ss multi-char fold in character classes in
some manner:
2a) Exclude ß in just an inverted class.
2b) Exclude ß in just an inverted class unless there is an
explicit code point above 255.
2c) Exclude ß in any class unless there is a code point
above 255.
2d) Exclude ß in any class always.
3) Exclude all multi-char folds in classes in some manner
3a) in just inverted classes
3b) in all classes
3c) in all classes except those that optimize to EXACTF nodes
4) Use simple folding in classes and non-classes unless specify
new pragma 'use re qw(folding full)'
5) Change regex default to /aa

Summary of a long conversation:

We don't need to have the same long-term goals as short-term ones.

We cannot know how, when, or even whether The Unicode Consortium
is going to change their minds about UTS#18, so that cannot be
a factor in any short-term measure.

Perhaps it would be possible or desirable to emit some sort of warning, and
if so, when. Maybe that could accompany some of the hairier choices above.
That would directly address the problem of things silently behaving
differently, weirdly, or unexpectedly. That might make more of the
possible short-term measures above more acceptable, even option 1.

We've had almost nine years of providing at least some measure of full case
folding (thank you, Unicode Consortium; pity you hadn't thought of this back
then), so we cannot just yank it altogether. Even changing the defaults
should be only that, still allowing for the old behavior if possible.
Someday we might want to think more about a way to say to use the same
behavior as a particular release used. [I noted that Unicode::Collate
already does this.]

Probably the best long-term goal is to allow the user to use a
pragma to specify whether they want simple or full case folding.
That begs the question of which one should be the default, and
it is far from obvious or certain that that default should be
simple case folding. Delay choosing the default till the future.

We must never make /aa the default because that sacrifices the
future for the sake of the past. Perl has supported Unicode
in regex for a long time. That would be throwing that all away,
and is unacceptable as a silent default.

It may not be a good idea to forever think of character classes and
dot and such as representing single code points only, because if
there's ever a "grapheme mode", they will not.

I just got done talking to Larry for an hour, and those were the points
we went over as I remember them. Everything is heavily paraphrased and
summarized mostly in my own words, although some echoes of Larry are
still there.

Here, though, are my own thoughts.

It's important that 5.14 get out the door soon, but not more important than
that it get out right. We can't do anything major in the timeframe needed
to get 5.14 out the door, which rules out choice #4 before 5.16. I really
hope everyone rules out #5 for good, at least as the default, because that
runs counter to the decision to follow the RL1.2a path which was long ago
decided. Most of the middle-of-the-road solutions seem ok. I'm now a bit
queasy about the #2 set, although maybe with a warning some of them would
be less bad; dunno. So I think I might go for #3c if pressed right now.

But I'd like to discuss and consider what warnings might be possible, or
whether they actually wouldn't be, especially for cases #1, #2, or #3.
Maybe that would alter the balance and suggest a different short-term
solution that would allow us to get 5.14 out the door.

--tom
Karl Williamson
2011-05-02 03:30:46 UTC
Permalink
Post by Tom Christiansen
1) Ship 5.14 as-is
2) Exclude just the ß=>ss multi-char fold in character classes in
2a) Exclude ß in just an inverted class.
2b) Exclude ß in just an inverted class unless there is an
explicit code point above 255.
2c) Exclude ß in any class unless there is a code point
above 255.
2d) Exclude ß in any class always.
3) Exclude all multi-char folds in classes in some manner
3a) in just inverted classes
3b) in all classes
3c) in all classes except those that optimize to EXACTF nodes
4) Use simple folding in classes and non-classes unless specify
new pragma 'use re qw(folding full)'
5) Change regex default to /aa
We don't need to have the same long-term goals as short-term ones.
We cannot know how, when, or even whether The Unicode Consortium
is going to change their minds about UTS#18, so that cannot be
a factor in any short-term measure
I disagree here. It has been my impression that Unicode will not admit
error if they can possibly avoid it. That they are even contemplating
this is significant, and so is valid for us to take into consideration.
Post by Tom Christiansen
Perhaps it would be possible or desirable to emit some sort of warning, and
if so, when. Maybe that could accompany some of the hairier choices above.
That would directly address the problem of things silently behaving
differently, weirdly, or unexpectedly. That might make more of the
possible short-term measures above more acceptable, even option 1.
I think you must mean, not option 1 which is to make absolutely no
changes, but a new option 6) which adds a warning instead of the other
things that have been discussed.
Tom Christiansen
2011-05-02 11:23:47 UTC
Permalink
Post by Karl Williamson
Post by Tom Christiansen
We cannot know how, when, or even whether The Unicode Consortium
is going to change their minds about UTS#18, so that cannot be
a factor in any short-term measure
I disagree here. It has been my impression that Unicode will not admit
error if they can possibly avoid it. That they are even contemplating
this is significant, and so is valid for us to take into consideration.
I meant that since we cannot *in the short-term* know just what
they're going to do, nor when they're going to do whatever that
might be, we cannot possibly do something now about a
condition that we will not know until then.

Basically, there's no guarantee of what they'll say nor when
they'll say it.
Post by Karl Williamson
Post by Tom Christiansen
Perhaps it would be possible or desirable to emit some sort of warning, and
if so, when. Maybe that could accompany some of the hairier choices above.
That would directly address the problem of things silently behaving
differently, weirdly, or unexpectedly. That might make more of the
possible short-term measures above more acceptable, even option 1.
I think you must mean, not option 1 which is to make absolutely no
changes, but a new option 6) which adds a warning instead of the other
things that have been discussed.
Well, yes. I think it's possible that any of options #1, #2, #3 *might*
become more palatable if some sort of warning could be concocted to alert
the user to these strange circumstances.

It may be in the medium-term that casing will become more like
normalization, in that it will need to be explicitly and manually
arranged for by the user of the regex. Just as we must now do

NFD($s) =~ /pattern/
NFC($s) =~ /pattern/
NFKD($s) =~ /pattern/
NFKC($s) =~ /pattern/

ourselves, we may end up having to do

lc($s) =~ /pattern/
uc($s) =~ /pattern/

instead of

$s =~ /pattern/i

This would be unfortunate in various ways, but I begin to wonder
whether it may be unavoidable.
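
For anyone who has not done the preprocessing by hand, here is a minimal sketch of what that looks like today. It assumes the core Unicode::Normalize module; the helper name match_nfd and the sample strings are invented for illustration only.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hypothetical helper: decompose the subject first, then match it against
# a pattern that is itself written in decomposed form.
sub match_nfd {
    my ($s, $re) = @_;
    return NFD($s) =~ $re ? 1 : 0;
}

my $re = qr/\Acafe\x{301}\z/;    # "e" followed by COMBINING ACUTE ACCENT

print match_nfd("caf\x{E9}",   $re), "\n";    # precomposed input
print match_nfd("cafe\x{301}", $re), "\n";    # already decomposed input

# Manual case mapping instead of /i: map the subject, then write the
# pattern in a single case.
print lc("STRASSE") =~ /\Astrasse\z/ ? 1 : 0, "\n";
```

The point is only that the caller, not the regex engine, decides which form the subject is in before matching.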

And we still don't have a good

UCA1($s) =~ /pattern/
UCA2($s) =~ /pattern/
UCA3($s) =~ /pattern/
UCA4($s) =~ /pattern/

--tom
Tom Christiansen
2011-05-02 11:38:31 UTC
Permalink
Post by Karl Williamson
I disagree here. It has been my impression that Unicode will not admit
error if they can possibly avoid it. That they are even contemplating
this is significant, and so is valid for us to take into consideration.
Don't most committees who put out large tree-killing standards
documents usually work that way?

They are backing off the canonical-matching idea, having been
shown that it cannot possibly work as specified.

It is hard to know how to match a base-letter plus a diacritic
when the diacritic is not guaranteed to immediately follow the
base when in canonical mode, because that leaves you with
discontiguous matches, and the engine isn't made in a way that
lets it go back (and forth) to pick up holes in what it has
somehow skipped over in its matches.

The whole idea of a grapheme mode leaves a lot of unanswered
questions. I begin to think that perhaps we should delay
multichar folds until such time as we have a better understanding
of what grapheme mode should really mean. I am increasingly
less certain that this is a functionality in actual active use.
I just hate to undo something we've been doing, bugs and all,
for many years.

--tom
Aristotle Pagaltzis
2011-05-02 12:48:33 UTC
Permalink
Post by Tom Christiansen
The whole idea of a grapheme mode leaves a lot of unanswered
questions.
It does, doesn’t it.
Post by Tom Christiansen
I begin to think that perhaps we should delay multichar folds
until such time as we have a better understanding of what
grapheme mode should really mean. I am increasingly less
certain that this is a functionality in actual active use.
I just hate to undo something we've been doing, bugs and all,
for many years.
Maybe it’d be another option to do multi-char folding only for
ligatures? The expectations are fairly straightforward for those,
whereas they are unclear for all else, and the current behaviour
surprising. Perl could continue to support what has worked and
was unproblematic in that way.

I haven’t thought this through.

Regards,
--
Aristotle Pagaltzis // <http://plasmasturm.org/>
Abigail
2011-05-02 13:23:12 UTC
Permalink
Post by Aristotle Pagaltzis
Post by Tom Christiansen
The whole idea of a grapheme mode leaves a lot of unanswered
questions.
It does, doesn’t it.
Post by Tom Christiansen
I begin to think that perhaps we should delay multichar folds
until such time as we have a better understanding of what
grapheme mode should really mean. I am increasingly less
certain that this is a functionality in actual active use.
I just hate to undo something we've been doing, bugs and all,
for many years.
Maybe it’d be another option to do multi-char folding only for
ligatures? The expectations are fairly straightforward for those,
whereas they are unclear for all else, and the current behaviour
surprising. Perl could continue to support what has worked and
was unproblematic in that way.
Actually, I find all multi-char folding surprising. The fact that
/ff/i can match, where /f+/i or /(f)(f)/i doesn't is something I
find very hard to accept.
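
To try this split on whatever perl is at hand, a small probe along these lines may help. It is only a sketch: the helper name report is invented here, and the results differ between releases, so no output is shown. Later perlre documentation does note that a multi-character fold is not matched when the pattern characters are quantified or split between groupings.

```perl
use strict;
use warnings;

my $lig = "\x{FB00}";    # LATIN SMALL LIGATURE FF, which case-folds to "ff"

# Report whether a pattern matches the single ligature character.
sub report {
    my ($label, $re) = @_;
    my $hit = $lig =~ $re ? 1 : 0;
    print "$label\t$hit\n";
    return $hit;
}

report('/ff/i',     qr/ff/i);
report('/f+/i',     qr/f+/i);
report('/(f)(f)/i', qr/(f)(f)/i);
```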


Abigail
Aristotle Pagaltzis
2011-05-02 14:09:12 UTC
Permalink
Post by Abigail
Post by Aristotle Pagaltzis
Post by Tom Christiansen
The whole idea of a grapheme mode leaves a lot of
unanswered questions.
It does, doesn’t it.
Post by Tom Christiansen
I begin to think that perhaps we should delay multichar
folds until such time as we have a better understanding of
what grapheme mode should really mean. I am increasingly
less certain that this is a functionality in actual active
use. I just hate to undo something we've been doing, bugs
and all, for many years.
Maybe it’d be another option to do multi-char folding only
for ligatures? The expectations are fairly straightforward
for those, whereas they are unclear for all else, and the
current behaviour surprising. Perl could continue to support
what has worked and was unproblematic in that way.
Actually, I find all multi-char folding surprising. The fact
that /ff/i can match, where /f+/i or /(f)(f)/i doesn't is
something I find very hard to accept.
For /f+/i I’d just call bug, honestly. I see no surprise in it
matching “ff”. But /(f)(f)/i is a problem…

Maybe the only sane approach is indeed an explicit normalisation
step (or some kind of flagging mechanism) with which the user
must specify their choice of how such cases are to be treated,
if they want multi-char folding.

Regards,
--
Aristotle Pagaltzis // <http://plasmasturm.org/>
Jesse Vincent
2011-05-02 20:44:27 UTC
Permalink
Post by Tom Christiansen
5) Change regex default to /aa
We must never make /aa the default because that sacrifices the
future for the sake of the past. Perl has supported Unicode
in regex for a long time. That would be throwing that all away,
and is unacceptable as a silent default.
Proposal 5 is not currently on the table. I agree that it feels like a
bad direction for us to head.

...one option down. Four more to wrestle with.

-j

--
Father Chrysostomos
2011-05-01 20:05:28 UTC
Permalink
The previous email had a file-encoding issue on my system. The attachment was saved in Latin1. Here is the same email, with the attachment in utf8
I am not comfortable shipping 5.14 as-is. I think we need to do something. If it is deemed ok for the churn, then I think option 4 is the best; otherwise options 3b or 3c would be my recommendation.
...
3b) in all classes
BUT: /ß/i and /[ß]/i don't have the same meaning (nor any other
multi-char fold) though because of bugs, this can't be relied on in
earlier Perl releases, so this isn't introducing many regressions)
I think this is the best option.
Aristotle Pagaltzis
2011-05-02 12:34:12 UTC
Permalink
Post by Father Chrysostomos
3b) in all classes
BUT: /ß/i and /[ß]/i don't have the same meaning (nor any
other multi-char fold) though because of bugs, this can't be
relied on in earlier Perl releases, so this isn't introducing
many regressions)
I think this is the best option.
So do I.

I agree with Tom that it could be a good idea to emit a warning
somewhere in there.

Regards,
--
Aristotle Pagaltzis // <http://plasmasturm.org/>
Jesse Vincent
2011-05-02 20:54:51 UTC
Permalink
Post by Karl Williamson
I am not comfortable shipping 5.14 as-is. I think we need to do
something. If it is deemed ok for the churn, then I think option 4
is the best; otherwise options 3b or 3c would be my recommendation.
3) Exclude all multi-char folds in classes in some manner
These all involve about the same amount of work, which was about an
hour. In spite of many people not realizing that character classes can
match more than a single code point, Perl has done this to some extent
in previous releases. However, this mechanism has been unreliable, with
some constructs with some characters working, but most, including the
simple ones, did not; so there actually is relatively little code
affected.
3a) in just inverted classes
Inverted classes are where the real non-obvious cases lie.
BUT: it neglects that Unicode is finding that the whole mechanism has
issues. And many people will not expect a character class to match more
than one character.
Based on the available evidence, 3a) sounds like the safest path for
5.14. I'd like to know how hard it would be to implement (and how much
churn it would cause).

Additionally, I'd like to hear from others about why this would be an
awful choice that would badly hurt Perl 5.14.

Thanks!

-Jesse
Karl Williamson
2011-05-02 21:51:45 UTC
Permalink
Post by Jesse Vincent
Post by Karl Williamson
I am not comfortable shipping 5.14 as-is. I think we need to do
something. If it is deemed ok for the churn, then I think option 4
is the best; otherwise options 3b or 3c would be my recommendation.
3) Exclude all multi-char folds in classes in some manner
These all involve about the same amount of work, which was about an
hour. In spite of many people not realizing that character classes can
match more than a single code point, Perl has done this to some extent
in previous releases. However, this mechanism has been unreliable, with
some constructs with some characters working, but most, including the
simple ones, did not; so there actually is relatively little code
affected.
3a) in just inverted classes
Inverted classes are where the real non-obvious cases lie.
BUT: it neglects that Unicode is finding that the whole mechanism has
issues. And many people will not expect a character class to match more
than one character.
Based on the available evidence, 3a) sounds like the safest path for
5.14. I'd like to know how hard it would be to implement (and how much
churn it would cause).
Additionally, I'd like to hear from others about why this would be an
awful choice that would badly hurt Perl 5.14.
I'm comfortable with any choice but #1. And, I had already coded most
of #3b, taking less than an hour; the change to #3a is tiny.

(Today, I coded up #4, "Use simple folding in classes and non-classes
unless specify new pragma 'use re qw(folding full)'".)

What's left to do is some .t work and doc changes.

Karl Williamson
2011-04-30 23:28:17 UTC
Permalink
Post by Nicholas Clark
Post by Tom Christiansen
Post by Karl Williamson
In thinking about this some more, given the bug that Nicholas found that
affects all multi-character folds, not just \xdf, in character classes,
^^^^^^^^^^^^^^^^^^^^
Post by Tom Christiansen
Post by Karl Williamson
I think it would be best to just not offer any of them in 5.14.
You mean undo something that's been there since 5.8?
% perl5.8.0 -le 'print "\x{1FB2}" =~ /\x{1FB2}/i || 0'
1
% perl5.8.0 -le 'print ucfirst("\x{1FB2}") =~ /\x{1FB2}/i || 0'
1
% perl5.8.0 -le 'print uc("\x{1FB2}") =~ /\x{1FB2}/i || 0'
1
Or did you mean something else?
I think you missed the "in character classes" part of Karl's thought.
Your examples don't use []
Precisely. So those examples would not be broken. Only bracketed
character classes in regular expressions. You said a couple days ago
that "I have always been bugged by the idea that a bracketed character
class could ever match more than a single code point. It's like /./
suddenly matching more than one, but you're not in grapheme mode.
Character classes seem to be inherent singletons." So it appeared that
you agreed with me.

Multi-char folds in bracketed character classes did not work in 5.10,
and I presume earlier, though Yves was surprised at the time that they
didn't. I'm the one who filed a trouble ticket on the issue, and put in
what turned out to be a very partial fix for 5.10.1. They still don't
work right in 5.14, given the flaw that Nicholas found. (That could be
fixed for some dot release.)
Post by Nicholas Clark
I'm still not sure *what* I think.
But *if* a class consisting of a single character is always equivalent to a
literal of that character (ie /[a]/ is /a/, /[ß]/ is /ß/, /[ß]/i is /ß/i,
etc), one of the things I'm not sure about is whether it's better to say "no
multi character folds in character classes" or "no multi character folds in
character classes, except classes consisting of exactly one character". I
think (I think) that it's useful to maintain that explicit correspondence,
as (IIRC) Yves worked to get the engine to optimise /[a]/ to /a/ and /[.]/ to
/\./, as it was a common idiom in some circles to use regexp character class
syntax as an alternative to backslash quoting.
The downside, obviously, is that (for starters) it's more complex to explain.
I just realized that this is mostly a red herring. I think it was me
who brought it up, and I apologize. Only Latin1 code points have ever
been optimized this way. The only Latin1 code point that has a multi
character fold is ß. In 5.12, a /[ß]/i was optimized into an EXACTF
node. But this is one of the tricky folds, which fails Ilya's optimizer
tests. Thus almost certainly /[ß]/i would not work in 5.12. Therefore
we are not introducing a regression if we don't have it work in 5.14.

The other multi-char cases of single-characters in classes are
non-Latin1 and have never been optimized. Thus, they didn't work in
5.10 (and I presume earlier), and only under rare circumstances through
5.12, and I don't know what those circumstances are now.

% perl5.12.2 -E 'say "fi" =~ /[\N{U+FB01}]/i || 0'
0

So we aren't introducing much of any regressions if we don't have these
work in 5.14. So the single code point vs multiple code point issue is
not an issue.
Post by Nicholas Clark
Because as a general rule, rightly or wrongly on my part
I think it is rightly.


, I feel that it's
Post by Nicholas Clark
unfortunate if two or more different syntax choices for the same action
produce notably different performance because they trigger different
runtime implementations, where both
a: one is unambiguously always slower than the other
b: it would be possible for the compile time implementation to automatically
select the faster implementation, whichever syntax was used
because that way
a: all existing code goes faster without change
b: it kills dead style arguments based on "but this one is more efficient"
letting people pick style based on clarity (or their opinions of clarity)
(eg reverse sort ...; is now internally optimised to tell sort to sort in
reverse, so no slower than sort {$b cmp $a} ...; but usually somewhat clearer)
Nicholas Clark
Another digression: in 5.14, I added the optimization that classes of
the form [Bb] with exactly two Latin1 code points where the two are
folds of each other get optimized into EXACTFish nodes. This isn't the
case for [Kk] because of the Kelvin sign being part of the fold equation.
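
A short sketch of why [Kk] and /k/i are not interchangeable, assuming a perl new enough to accept the /aa modifier (5.14 or later); the helper name hit is made up for this example.

```perl
use strict;
use warnings;

my $kelvin = "\x{212A}";    # KELVIN SIGN

# Returns 1 if $s matches $re, else 0.
sub hit { my ($s, $re) = @_; return $s =~ $re ? 1 : 0 }

print hit($kelvin, qr/\A[Kk]\z/), "\n";    # two-member class written by hand
print hit($kelvin, qr/\Ak\z/i),   "\n";    # /i with Unicode folding
print hit($kelvin, qr/\Ak\z/iaa), "\n";    # /aa forbids ASCII/non-ASCII folds
print lc($kelvin) eq "k" ? 1 : 0, "\n";    # the Kelvin sign lowercases to "k"
```

The hand-written class has only two members, while /k/i also accepts the Kelvin sign, which is why the two cannot be treated as the same node.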
Tom Christiansen
2011-04-30 23:31:26 UTC
Permalink
Post by Karl Williamson
Precisely. So those examples would not be broken. Only bracketed
character classes in regular expressions. You said a couple days ago
that "I have always been bugged by the idea that a bracketed character
class could ever match more than a single code point. It's like /./
suddenly matching more than one, but you're not in grapheme mode.
Character classes seem to be inherent singletons." So it appeared that
you agreed with me.
I do. (I think.)

--tom
Tom Christiansen
2011-04-30 23:37:53 UTC
Permalink
Here's a slightly more extensive version of the tester I wrote,
which all pass under RC0. Perhaps you could use this sort of
thing for other .t files, although you will have to change the
tests if you end up making them fail. But at least this way
we will keep aware of the issue. The RC0 output is appended
in an unused DATA portion.

--tom
Tom Christiansen
2011-04-30 16:30:30 UTC
Permalink
Post by Jesse Vincent
We'd be introducing the new flag for 5.16.0, not 5.14.1. But yes, that
is the other option I've just put to Karl on IRC.
Ok, that sounds safer.

What then is left to do for an RC2 on 5.14.0? And are there any
concrete plans regarding 5.14.1?

thanks,

--tom
Jan Dubois
2011-04-29 17:56:21 UTC
Permalink
Post by Karl Williamson
But this is in some sense contrary to real German, where there are
minimal pair words that differ only by ß and ss and mean different
things. I think maße and masse is an example, or is it müße and müsse
(I don't know).
It is actually Maßen (small amounts) and Massen (big quantities). I also
learned that prior to the 1996 spelling reform you were supposed to
capitalize Maßen as MASZEN to avoid confusion with MASSEN. Despite
having lived in (northern) Germany for 30 years prior to that reform, I
wasn't aware of this rule until the time when "they" announced that it no
longer applied. :)

Cheers,
-Jan
Father Chrysostomos
2011-05-01 20:04:00 UTC
Permalink
Post by Tom Christiansen
I'd like to think about how people would use this stuff *in practice*. The
problem is that in practice, the \xDF case isn't too common, so we don't
have many examples to go by.
Actually, I remember reading the opposite, that ß was the most common and important of the multi-char folds. I believe that the reason it exists is simply for mathematical completeness.
I am not a German speaker but we do have some on this list. My understanding is that ß is already lower case, there is no lower case equivalent to it, but the upper case of ß is 'SS'. The case fold of 'SS' is 'ss', and therefore by extension so should be the case fold of ß. But this is in some sense contrary to real German, where there are minimal pair words that differ only by ß and ss and mean different things. I think maße and masse is an example, or is it müße and müsse (I don't know).
As an English speaker, I would use /i to try to get the same word in all its possible capitalizations. (People have pointed out that that isn't really possible in English either due to homonyms and acronyms, but it's something I and others do expect, nonetheless.) The Unicode practice of assuming transitivity where it really doesn't happen in the native language leads to the case fold of ß being 'ss', when in fact I don't think it is called for in the language. I asked Steffen this question on IRC some months ago.
Post by Tom Christiansen
there are quite a lot of Greek code
points where this arises. This is due to their weird lowercase Mark,
U+0345 COMBINING GREEK YPOGEGRAMMENI, which is both \p{Lower} and \p{Mn}.
It’s the letter iota, which is written as a subscript when it’s not pronounced.
Post by Tom Christiansen
lower: ᾲ στο διάολο
lower: \x{1FB2} \x{3C3}\x{3C4}\x{3BF} \x{3B4}\x{3B9}\x{3AC}\x{3BF}\x{3BB}\x{3BF}
title: Ὰͅ Στο Διάολο
title: \x{1FBA}\x{345} \x{3A3}\x{3C4}\x{3BF} \x{394}\x{3B9}\x{3AC}\x{3BF}\x{3BB}\x{3BF}
upper: ᾺΙ ΣΤΟ ΔΙΆΟΛΟ
upper: \x{1FBA}\x{399} \x{3A3}\x{3A4}\x{39F} \x{394}\x{399}\x{386}\x{39F}\x{39B}\x{39F}
That's because U+1FB2 goes to U+1FBA U+0399 for uppercase, but
it goes to U+1FBA U+0345 in titlecase.
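
The three case mappings of U+1FB2 can be checked directly with the built-in functions. This is a minimal sketch; the helper name escapes is invented here and just renders code points as \x{...} escapes.

```perl
use strict;
use warnings;

my $lower = "\x{1FB2}";   # GREEK SMALL LETTER ALPHA WITH VARIA AND YPOGEGRAMMENI

# Render a string as \x{...} escapes so the code points are visible.
sub escapes {
    my ($s) = @_;
    return join '', map { sprintf '\x{%04X}', ord($_) } split //, $s;
}

print 'lower: ', escapes(lc($lower)),      "\n";
print 'title: ', escapes(ucfirst($lower)), "\n";
print 'upper: ', escapes(uc($lower)),      "\n";
```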
I am quite sure that someone would want to use /^\x{1FB2}/i and
have it catch all three cases, of
The lowercase
"\N{GREEK SMALL LETTER ALPHA WITH VARIA AND YPOGEGRAMMENI}"
"\N{GREEK CAPITAL LETTER ALPHA WITH VARIA}\N{GREEK CAPITAL LETTER IOTA}"
"\N{GREEK CAPITAL LETTER ALPHA WITH VARIA}\N{COMBINING GREEK YPOGEGRAMMENI}"
Accents and breathing marks are almost always dropped in all caps. I’ve only seen one publication, from the nineteenth century, that included them. So Unicode’s case-folding rules for Greek have little practical use.

Some Western academic publications of classical texts use a capital iota for the hypogegrammene in titles. But most of the time, it either remains an hypogegrammene or becomes a lowercase iota (that’s right: the lowercase iota is the ‘capital’ hypogegrammene), depending on the publisher’s choice.
Post by Tom Christiansen
But what I don't know is whether they are expecting /^[^\x{1FB2}]/i
to rule all three of those *out*, that is, be like !/^[\x{1FB2}]/i.
I would never write such a regular expression except by mistake (and I *do* often write regular expressions for Greek text). If I did do it by mistake, I would expect it to match everything except \x{1fb2}.

(Funny that ᾲ is used in the examples, as I’ve never seen it used in real text. ᾴ and ᾷ are fairly common, though.)
Nicholas Clark
2011-04-29 13:20:39 UTC
Permalink
Post by Nicholas Clark
I think what I'm finding really confusing is that the current spec and
implementation means that the inversion and folding all happen together.
$ ./perl -Ilib -lwe '$_ = "ss"; utf8::upgrade($_); print "$] ", /\A[^\xDF]+\z/i ? "Y" : "N";'
5.014000 N
1: U+00DF
2: everything but the list in step 1
3: one or more of the set of step 2
4: match insensitively with step 3, anchored
I am (so far) lacking the imagination to spot how to blow holes in that plan.
Right. If I understand it correctly. If there are only 4 valid code points,
A C G T, then these transcriptions hold:

a single item in a range can be expressed as a literal

/[A]/ /A/

a range can be expressed as an alternation of literals

/[AC]/ /(?:A|C)/


and one would like to think that this one does:

that the inversion of a range can be expressed as an alternation of (a lot)
of literals:

/[^A]/ /(?:C|G|T)/


which would mean that

/[\xDF]/ /\xDF/
/[A\xDF]/ /(?:A|\xDF)/

and that

/[^\xDF]/ /(?:\x0|\x1|\x2| ... |\xDE|\xE0| ... )/


(a long list, but a bounded list)

and that

/#[^\xDF]#/ /#(?:\x0|\x1|\x2| ... |\xDE|\xE0| ... )#/


and so that would mean this:


/#[^\xDF]#/ /#(?:\x0|\x1|\x2| ... |\xDE|\xE0| ... )#/

and hence this:

/#[^\xDF]#/i /#(?:\x0|\x1|\x2| ... |\xDE|\xE0| ... )#/i


*But* it doesn't

*And* that the way that the Unicode consortium wants this to work, there's
actually no regexp construction of literal atoms, alternation and grouping,
to express it.

Because in a case insensitive match they currently treat [^\xDF] as being
something like (?![\xDF]) only it's not a negative look ahead *assertion*,
instead it's a negative match construction that consumes.

We don't have general negative match primitives, do we?


Where did I make a mistake?

Nicholas Clark
Karl Williamson
2011-04-29 18:21:21 UTC
Permalink
[Sorry if this is a duplicate, seems my mail server was blacklisted.]
Not a duplicate
I hope we can all agree on this much. Programmers expect negation in a
character class to be a convenience to allow them to enumerate the few
rather than the many, nothing more.
At first blush, I agree with that, and hope it provides a way forward.
Post by Nicholas Clark
[...]
/#[^\xDF]#/i /#(?:\x0|\x1|\x2| ... |\xDE|\xE0| ... )#/i
*But* it doesn't
[...]
Where did I make a mistake?
Nicholas Clark
You're on the right track, and it sounds good at first, but let's set
Unicode aside for the moment and consider this regex (inspired by The
/[^a-z]/i
/(?:\x0|\x1|\x2| ... |\^|_|`|\{|\||\}| ... )/i
Clearly, this is NOT what the programmer wants, because the characters
without "a" through "z" include the characters from "A" through "Z", and
if these are then matched with case-insensitivity, this expansion then
matches EVERY character, including the letters "a" through "z" which
were explicitly excluded from the class.
/[^A-Za-z]/
Exactly. And, there were moments in development when the code didn't
work this way, and caused all sorts of failures.
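
The difference is easy to reproduce with ASCII alone. Below is a sketch that builds the naive expansion (every ASCII character outside a-z as an alternation, with /i applied afterwards) and compares it with the real negated class; the variable names are made up for the example.

```perl
use strict;
use warnings;

# Naive expansion: enumerate every ASCII character outside a-z,
# then apply case-insensitivity to the resulting alternation.
my @outside = grep { !/[a-z]/ } map { chr($_) } 0 .. 0x7F;
my $alt     = join '|', map { quotemeta($_) } @outside;
my $naive   = qr/\A(?:$alt)\z/i;

# The negated class as Perl actually treats it: fold first, then invert.
my $class   = qr/\A[^a-z]\z/i;

for my $c ('a', 'A', '_') {
    printf "%s  class=%d  naive=%d\n", $c,
        ($c =~ $class ? 1 : 0), ($c =~ $naive ? 1 : 0);
}
```

The naive form accepts "a" because the enumerated "A" matches it under /i, while the real class rejects both cases of the letter.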
The programmer expects the case-insensitive flag to be a convenience to
avoid enumerating all case variations, much like the character class
negation is a convenience to avoid enumerating the entire character set
without a few unwanted characters.
Again, I agree.
Now, can full Unicode case-folding semantics be integrated into this
existing mental model without causing unexpected breakage like the
example George encountered? I'm not going to try to address that
question at the moment, but I wanted to add some food for thought.
Deven
Tom Christiansen
2011-04-29 19:37:08 UTC
Permalink
[NB: file greek_mfold.t included for regex tests]
ᾲ στο διάολο (lowercase)
Ὰͅ Στο Διάολο (titlecase)
ᾺΙ ΣΤΟ ΔΙΆΟΛΟ (uppercase)
when you are doing a case-insensitive match with /ᾲ/.
Right now, /^ᾲ/i will match all three lines, and similarly
both of !/^ᾲ/i and /^[^ᾲ]/i will rule all three of those out.
The pressing question is whether this is what someone familiar
with Greek would *expect* to happen? Is the current behavior
described above what they would expect, or isn't it?
Another of my Greek-enabled correspondents has gotten back to me,
and he says that what Perl does there with those matches is what
he would indeed expect it to do. I didn't tell him that it only
does that in blead, not in 5.12.

Remembering what Karl said about the .t files, here's a little tester
that verifies this all works. All tests pass under blead, but 6/30
fail under 5.12.3. So we *have* made progress. I shouldn't care to
lose that, but I don't want to freak out the byte-ish people either.

Probably though the byte-ish people shall *always* be freaked out
about Unicode. :)

--tom
Nicholas Clark
2011-04-29 21:14:01 UTC
Permalink
Post by Karl Williamson
The programmer expects the case-insensitive flag to be a convenience to
avoid enumerating all case variations, much like the character class
negation is a convenience to avoid enumerating the entire character set
without a few unwanted characters.
Again, I agree.
Except that negation can't actually be equivalent to enumerating the entire
character set less unwanted, else this would match:

$ ./perl -Ilib -lwe '$_ = "ss"; utf8::upgrade($_); print /\A[^ ]\z/i ? "Y" : "N"'
N

because "all of Unicode less space" includes ß, and /ß/i matches "ss"

So negation is behaving equivalent to multiple non-match (lookahead)
assertions, and a match on qr/./ (ie consume exactly one code point)

[which is making sense to me now, but is a surprise if you're thinking in sets]



aargh. Also, as this matches:

$ ./perl -Ilib -lwe '$_ = "ss"; utf8::upgrade($_); print /\A[\x80-\xFF]\z/i ? "Y" : "N"'
Y

shouldn't this?

$ ./perl -Ilib -lwe '$_ = "ss"; utf8::upgrade($_); print /\A[\x00-\xFF]\z/i ? "Y" : "N"'
N


(I was trying to test whether [^ ] was equivalent to [\x00-\x1F\x21-\x{1FFFF}]
and finding it a surprise)

Nicholas Clark
Tom Christiansen
2011-04-29 21:30:16 UTC
Permalink
Post by Nicholas Clark
$ ./perl -Ilib -lwe '$_ = "ss"; utf8::upgrade($_); print /\A[\x80-\xFF]\z/i ? "Y" : "N"'
Y
shouldn't this?
$ ./perl -Ilib -lwe '$_ = "ss"; utf8::upgrade($_); print /\A[\x00-\xFF]\z/i ? "Y" : "N"'
N
Oh my.

OK, I've tried all the cases of i-j for i ranging from 0..0xDF and
for j ranging from 0xDF .. 0x100.

$_ = "ss";
utf8::upgrade($_);
for $i ( 0 .. 0xDF ) {
for $j ( 0xDF .. 0x100 ) {
$pat = sprintf "\\A[\\x{%02X}-\\x{%02X}]\\z", $i, $j;
printf "%s\t%s\n", $pat, /$pat/i ? "Y" : "N";
}
}

With these results:

% perl5.12.3 /tmp/range | grep -c 'Y$'
109
% perl5.12.3 /tmp/range | grep -c 'N$'
7507

% blead /tmp/range | grep -c 'N$'
3944
% -Ilib /tmp/range | grep -c 'Y$'
3672

--tom
Karl Williamson
2011-04-29 23:29:36 UTC
Permalink
Post by Nicholas Clark
$ ./perl -Ilib -lwe '$_ = "ss"; utf8::upgrade($_); print /\A[\x80-\xFF]\z/i ? "Y" : "N"'
Y
shouldn't this?
$ ./perl -Ilib -lwe '$_ = "ss"; utf8::upgrade($_); print /\A[\x00-\xFF]\z/i ? "Y" : "N"'
N
Yes it should match, and it is a bug. And you have found a fundamental
bug in the way Perl implements bracketed character classes with
multi-character folds. I have always been somewhat uncomfortable with
the mechanism, and had worked out how to change it in 5.16, and would
have done it in 5.14 given enough tuits, but I also had performance
concerns, as it means more parsing in pass 1 of the compilation that
has to be duplicated in pass 2. I imagine I would have raised the
priority if I had known about this.

What's happening is this, as I discovered with a few minutes in gdb: A
[class] creates an atomic node. In the second expression, the first 's'
matches the 's' in the class. But there is nothing left for the second
's' in the string to match, and since the node is atomic, the engine
doesn't know enough to backtrack and try something else, so the regex
fails. In the first example, there is no 's' in the class to match, so
it tries the full 'ss' and succeeds.

My planned solution was to internally rewrite the class as
(?:[\x00-\xFF]|(?i)ss)

The complement would have been parsed as
(?:(?![\x00-\xFF]|(?i)ss).)

as Nicholas has astutely noticed.

These would generate multiple nodes that the engine would be able to
backtrack over.
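
The positive half of that rewrite can be tried by hand on any Perl. This is only a sketch of the idea, not what the engine does internally, and the variable names are made up for illustration: the single-character class and the two-character fold become separate alternatives, so there is something to backtrack into.

```perl
use strict;
use warnings;

my $str = "ss";
utf8::upgrade($str);

# Hand-written stand-in for the planned rewrite of /[\x00-\xFF]/i:
# one branch for the single-character class, one for the "ss" fold.
my $rewritten = qr/\A(?:[\x00-\xFF]|(?i:ss))\z/;

# The first branch consumes one "s", \z then fails, and the engine
# backtracks into the second branch, which consumes both characters.
my $positive = $str =~ $rewritten ? "Y" : "N";

print "$positive\n";
```

Because the alternation is an ordinary group rather than one atomic class node, the failed first branch does not end the match attempt.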
Tom Christiansen
2011-04-29 23:57:54 UTC
Permalink
Post by Nicholas Clark
(I was trying to test whether [^ ] was equivalent to
[\x00-\x1F\x21-\x{1FFFF}] and finding it a surprise)
If you meant \x{10FFFF} there, that's something I mistype all the
time, too. I've been trying to train myself to write \x{10_FFFF}
instead so I can see it more clearly.

--tom
Deven T. Corzine
2011-04-29 17:34:52 UTC
Permalink
Post by Nicholas Clark
[...]
/[^\xDF]/ /(?:\x0|\x1|\x2| ... |\xDE|\xE0| ... )/
(a long list, but a bounded list)
I hope we can all agree on this much. Programmers expect negation in a
character class to be a convenience to allow them to enumerate the few
rather than the many, nothing more.
Post by Nicholas Clark
[...]
/#[^\xDF]#/i /#(?:\x0|\x1|\x2| ... |\xDE|\xE0| ... )#/i
*But* it doesn't
[...]
Where did I make a mistake?
Nicholas Clark
You're on the right track, and it sounds good at first, but let's set
Unicode aside for the moment and consider this regex (inspired by The
Sidhekin's example):

/[^a-z]/i

By your logic above, this could be interpreted as:

/(?:\x0|\x1|\x2| ... |\^|_|`|\{|\||\}| ... )/i

Clearly, this is NOT what the programmer wants, because the characters
other than "a" through "z" include the characters from "A" through "Z", and
if these are then matched with case-insensitivity, this expansion then
matches EVERY character, including the letters "a" through "z" which
were explicitly excluded from the class.

The real equivalence is instead this:

/[^A-Za-z]/

which could also be written as:

/(?:\x0|\x1|\x2| ... |\>|\?|\@|\[|\\|\]|\^|\_|\`|\{|\||\}| ... )/

The programmer expects the case-insensitive flag to be a convenience to
avoid enumerating all case variations, much like the character class
negation is a convenience to avoid enumerating the entire character set
minus a few unwanted characters.

Now, can full Unicode case-folding semantics be integrated into this
existing mental model without causing unexpected breakage like the
example George encountered? I'm not going to try to address that
question at the moment, but I wanted to add some food for thought.
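
The two readings of /[^a-z]/i above can be compared mechanically. A minimal sketch, restricted to the 128 ASCII characters so that Unicode folds stay out of the picture; the variable names are illustrative only.

```perl
use strict;
use warnings;

my @ascii = map { chr } 0 .. 127;

# Naive expansion: enumerate the complement of a-z as \xNN alternatives,
# then apply /i to the whole list.
my $alternatives = join "|",
    map { sprintf '\\x%02X', ord } grep { !/[a-z]/ } @ascii;
my $naive = qr/\A(?:$alternatives)\z/i;

# Count how many ASCII characters each form accepts.
my $naive_count    = grep { $_ =~ $naive } @ascii;
my $negated_count  = grep { /\A[^a-z]\z/i } @ascii;
my $explicit_count = grep { /\A[^A-Za-z]\z/ } @ascii;

print "naive=$naive_count negated=$negated_count explicit=$explicit_count\n";
```

The naive expansion accepts all 128 characters, because "A" through "Z" are in the enumerated list and /i lets them match "a" through "z". The negated class and the explicit /[^A-Za-z]/ both accept the same 76 characters.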

Deven
George Greer
2011-04-29 13:40:15 UTC
Permalink
Post by Nicholas Clark
Post by Karl Williamson
Post by Tom Christiansen
Post by Tom Christiansen
Wouldn't backing out multichar folds for 5.14 introduce a regression?
% perl5.12.0 -E 'say "\x{FB00}" =~ /ff/i || 0'
1
...
% perl5.12.3 -E 'say "\x{FB00}" =~ /ff/i || 0'
1
--tom
Yes it would. My point was that appears to be where Unicode is headed.
But there are no guarantees that that is where they'll end up.
A middle position would be to disable them only in bracketed character
classes. I think that the most astonishment stems from those, when they
are inverted. This is where it was most buggy pre-5.14. There were
This stuff is confusing.
$ perl5.8.9 -lwe '$_ = "ss"; utf8::upgrade($_); print "$] ", /\A[^\xDF]+\z/i ? "Y" : "N";'
5.008009 Y
$ perl -lwe '$_ = "ss"; utf8::upgrade($_); print "$] ", /\A[^\xDF]+\z/i ? "Y" : "N";'
5.012003 Y
$ ./perl -Ilib -lwe '$_ = "ss"; utf8::upgrade($_); print "$] ", /\A[^\xDF]+\z/i ? "Y" : "N";'
5.014000 N
and that what happens is that to [^\xDF] is processed all in one, not as a
a range
an inverted range
in a case insensitive match
so it's not implemented as a human *might* think, in terms of
* process the ranges inside the [^...] construction to make a list of code
points (in my case that's one code point, U+00DF)
* [^...] means invert the list (in my case, that's several million code points)
* now match the inverted list against the input string
* oh yes, do that insensitively
but all in one step, given that (if I'm understanding this correctly)
/[\xDF]/i is equivalent to /ss/i
so
$_ =~ /[\xDF]/i implies $_ =~ /ss/i
and hence that
$_ =~ /[^\xDF]/i implies $_ !~ /ss/i
and it's that last jump that is really catching everyone out.
Correct. Going back to the original (somewhat nonsensical[1]) regex that
triggered this problem:

/[^\x00-\x1f\x7f-\xff :]+:/i

So "s" is an acceptable part of the regex but due to multi-character case
folding "ss" is not. So you have the peculiar case that:

"s s" =~ /^[^\xDF]+$/i => Y
"ss" =~ /^[^\xDF]+$/i => N

which can end up very surprising when your word isn't German and the only
reason \xDF is in the list is because it was caught in a range.
Post by Nicholas Clark
Or am I getting this subtly wrong?
Whichever way, it does feel that this spec currently, where inversion happens
at the point of insensitive matching, has emergent behaviour which makes it
dangerous and counterintuitive to the point of uselessness to almost anyone in
the real world.
Yes. If "s" is in the set but "ss" isn't, which one "wins"? Do you
exclude "class" because it has "ss" or accept it because "s" isn't
excluded?
Post by Nicholas Clark
Post by Karl Williamson
To state more clearly, I guess I'm now putting forth the idea that the
least worst case for 5.14 is that we say that a bracketed character
class can only match a single input character. Most people expect that
anyway, and it would have the fewest regressions. Almost all
regressions would be of the form that /[?]/i would no longer mean the
same thing as /?/i.
Expressing it like that troubles me too, as the way my mental model works,
a (non-inverted) character range of one in my head is the same as a literal.
Post by Karl Williamson
The idea scares me of allowing a non-inverted class match multiple char
folds vs an inverted one
Could you give examples of what you mean by this? I'm not quite sure I'm
understanding it correctly. (And then again, I'm not sure if my head will
cope if I did understand it)
I think he means that [\xDF] would use the multi-character fold to "ss"
but [^\xDF] would not. Whether that would be surprising as not reversible
is another question.
Post by Nicholas Clark
I think what I'm finding really confusing is that the current spec and
implementation means that the inversion and folding all happen together.
$ ./perl -Ilib -lwe '$_ = "ss"; utf8::upgrade($_); print "$] ", /\A[^\xDF]+\z/i ? "Y" : "N";'
5.014000 N
1: U+00DF
2: everything but the list in step 1
3: one or more of the set of step 2
4: match insensitively with step 3, anchored
I am (so far) lacking the imagination to spot how to blow holes in that plan.
Your theory being that since "s" matches in step 2 that it will then be
excluded from the "ss" match in 4 and thus the regex succeeds? ("Y") Or
maybe someone really did want "ss" excluded?


1: The regex should never have used /i if it was including "all" the
codepoints anyway. The comment in the module states:

# Pattern to match a RFC822 Field name ( Extract from RFC #822)
#
# field = field-name ":" [ field-body ] CRLF
#
# field-name = 1*<any CHAR, excluding CTLs, SPACE, and ":">
#
# CHAR = <any ASCII character> ; ( 0-177, 0.-127.)
# CTL = <any ASCII control ; ( 0- 37, 0.- 31.)
# character and DEL> ; ( 177, 127.)
# I have included the trailing ':' in the field-name
#
our $FIELD_NAME = '[^\x00-\x1f\x7f-\xff :]+:';
--
George Greer
Nicholas Clark
2011-04-29 17:36:59 UTC
Permalink
Post by Nicholas Clark
which would mean that
/[\xDF]/ /\xDF/
/[A\xDF]/ /(?:A|\xDF)/
and that
/[^\xDF]/ /(?:\x0|\x1|\x2| ... |\xDE|\xE0| ... )/
(a long list, but a bounded list)
and that
/#[^\xDF]#/ /#(?:\x0|\x1|\x2| ... |\xDE|\xE0| ... )#/
/#[^\xDF]#/ /#(?:\x0|\x1|\x2| ... |\xDE|\xE0| ... )#/
/#[^\xDF]#/i /#(?:\x0|\x1|\x2| ... |\xDE|\xE0| ... )#/i
*But* it doesn't
*And* that the way that the Unicode consortium wants this to work, there's
actually no regexp construction of literal atoms, alternation and grouping,
to express it.
I'm wrong. (My excuse is "it's hard, and I have a cold")
Post by Nicholas Clark
Where did I make a mistake?
*My* mental model of inversion is set-like. I think of it like this:

/#[^\xDF]#/ /#(?:\x0|\x1|\x2| ... |\xDE|\xE0| ... )#/
/#[^\xDF]#/i /#(?:\x0|\x1|\x2| ... |\xDE|\xE0| ... )#/i

and

/#[^\xD0-\xD3]#/ /#(?:\x0|\x1|\x2| ... |\xCF|\xD4| ... )#/


ie "not this" maps to "that" or "that" or "that" or...
[where '"that" or "that" or "that" or...' is the set complement of "this"]

The Unicode model of inversion is

/#[^\xDF]#/ /#(?!\xDF).#/
/#[^\xDF]#/i /#(?!\xDF).#/i

and

/#[^\xD0-\xD3]#/ /#(?!\xD0)(?!\xD1)(?!\xD2)(?!\xD3).#/

ie "not this" maps to "negative lookahead for this" "anything"
and for a range "not this or that" maps to "negative lookahead for this"
"negative lookahead for that" "anything"


and the reason this is all so frightfully confusing is that when we see

/#[^\xDF]#/i

we think "that range matches 1 code point". But the Unicode model translates
it to:

/#(?!\xDF).#/i

and that negative lookahead assertion could match multiple code points, even
though the . paired with it will only *consume* one.

[on re-reading. Yes, that one above is not actually going to be a problem. But
our friend here is:

/#[^\xDF]+#/i

because Unicode translates *that* to:

/\#
(?:
(?!\xDF) # not followed by something that matches ß
. # any one code point
)+ # repeated
\#
/i


and "of course", for "#ss#", after the "#" is consumed, the assertion
"not followed by something that matches ß" fails, so the match fails.
It's obvious. When you think of it *this* way. But we don't.
]
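
The two mental models can be written out side by side so that the result does not depend on which perl runs it. This is a hand-rolled emulation, not the engine's real code path: the set model is the class with no folding at all, and the lookahead model spells the "ss" fold out explicitly instead of relying on /i to fold U+00DF. Names are illustrative.

```perl
use strict;
use warnings;

# Set-like model: no case folding, the class is "every code point except U+00DF".
my $set_model = qr/\#[^\xDF]+\#/;

# Lookahead model written out by hand: the assertion may inspect two code
# points ("ss" in any case), but the dot consumes exactly one.
my $lookahead_model = qr/\#(?:(?!\xDF|(?i:ss)).)+\#/s;

my %result;
for my $str ("#ss#", "#s s#") {
    $result{$str} = [
        ($str =~ $set_model       ? "Y" : "N"),
        ($str =~ $lookahead_model ? "Y" : "N"),
    ];
    print "$str set=$result{$str}[0] lookahead=$result{$str}[1]\n";
}
```

Both models accept "#s s#". Only the set model accepts "#ss#", since the lookahead rejects the position right after the first "#".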
Post by Nicholas Clark
Correct. Going back to the original (somewhat nonsensical[1]) regex that
/[^\x00-\x1f\x7f-\xff :]+:/i
So "s" is an acceptable part of the regex but due to multi-character case
"s s" =~ /^[^\xDF]+$/i => Y
"ss" =~ /^[^\xDF]+$/i => N
which can end up very surprising when your word isn't German and the only
reason \xDF is in the list is because it was caught in a range.
In this case it's within a range. But I find it surprising even when I write
it as a single code point, but inverted.
Post by Nicholas Clark
Post by Nicholas Clark
1: U+00DF
2: everything but the list in step 1
3: one or more of the set of step 2
4: match insensitively with step 3, anchored
I am (so far) lacking the imagination to spot how to blow holes in that plan.
Your theory being that since "s" matches in step 2 that it will then be
excluded from the "ss" match in 4 and thus the regex succeeds? ("Y") Or
maybe someone really did want "ss" excluded?
No, my theory being that my head translates a range inversion to a very long
list of positive alternations ("possibility" *or* "possibility" *or* ...)

Whereas Unicode translates a range inversion to a short superposition of
negative assertions, followed by one anything.
("exclude" *and* "exclude") "anything"

ie I'm treating ranges as sets of characters, and range inversions as set
complements. They aren't. And their negative assertions can match multiple
code points, whereas their match "anything" matches exactly one. And it's
*that* disparity that is tripping (nearly) everyone up. Unicode's model for
how to implement range inversions is inconsistent between how far forwards
it rejects on, versus how far forward it consumes on.

[even if what it achieves is actually useful for expressing how to process
the natural language text in question]
Post by Nicholas Clark
1: The regex should never have used /i if it was including "all" the
Yes. Possibly that's the real bug here.

You *can* treat ranges as sets, but only if you turn off all notions of case
folding.

Nicholas Clark
Aristotle Pagaltzis
2011-05-01 06:41:14 UTC
Permalink
Correct. Going back to the original (somewhat nonsensical[1])
/[^\x00-\x1f\x7f-\xff :]+:/i
So "s" is an acceptable part of the regex but due to
multi-character case folding "ss" is not. So you have the
"s s" =~ /^[^\xDF]+$/i => Y
"ss" =~ /^[^\xDF]+$/i => N
which can end up very surprising when your word isn't German
and the only reason \xDF is in the list is because it was
caught in a range.
It’s surprising even when your word is German. I think the
orthography reform has made it so you can always substitute
a double s for a sharp s. (If memory serves, this was not
always the case before. I’m unsure on both counts.) But you
can definitely not replace any old double s by a sharp s. The
canonical example is “Wasser”: spelling it “Waßer” has always
been an error and so it remains.

This means the regex engine cannot make *any* reasonable guess
whatsoever at which match is desired or even acceptable in any
particular case without the user indicating it explicitly.

I’m iffy about the entire notion of multi-character case folds
(for regex matching), outside of designated pure ligatures.

Regards,
--
Aristotle Pagaltzis // <http://plasmasturm.org/>
The Sidhekin
2011-04-29 16:33:13 UTC
Permalink
Post by Nicholas Clark
This stuff is confusing.
[...]
I think what I'm finding really confusing is that the current spec and
Post by Nicholas Clark
implementation means that the inversion and folding all happen together.
$ ./perl -Ilib -lwe '$_ = "ss"; utf8::upgrade($_); print "$] ",
/\A[^\xDF]+\z/i ? "Y" : "N";'
5.014000 N
1: U+00DF
2: everything but the list in step 1
3: one or more of the set of step 2
4: match insensitively with step 3, anchored
***@bluebird[18:19:23]~$ perl -lwe '$_ = "ss"; utf8::upgrade($_); print
"$] ", /\A[^\x53]+\z/i ? "Y" : "N";'
5.010001 N
***@bluebird[18:23:05]~$

So, clearly this *doesn't* mean:

1: U+0053
2: everything but the list in step 1
3: one or more of the set of step 2
4: match insensitively with step 3, anchored

Indeed, if it did, [^a-z] would be identical to (?s:.) under /i.


Forget Greek for the moment. In English: What would you expect
(?i:[^a-z]) to match?

Any character at all?

Any character but "a".."z"?

Any character but "a".."z", "A".."Z"?

Any character but "a".."z", "A".."Z", &c?


These could generalize to German: (?i:[^ß])

Any character at all?

Any character but "ß"?

Any character but "ß", "ss", "Ss", "SS"?

Any character but "ß", "ss", "Ss", "SS", &c?


Eirik
Tom Christiansen
2011-04-29 16:36:41 UTC
Permalink
Post by The Sidhekin
These could generalize to German: (?i:[^ß])
Any character at all?
Any character but "ß"?
Any character but "ß", "ss", "Ss", "SS"?
Any character but "ß", "ss", "Ss", "SS", &c?
Of those, only "ß" is "a character". The others
are *two* characters.

--tom
The Sidhekin
2011-04-29 17:05:14 UTC
Permalink
So, hang on, are we looking at "ss" being matched by /\A[^ß]{2}\z/i – but
not by /\A[^ß]+\z/? :-\
... or more to the point, not by /\A[^ß]+\z/i? (I should stick to
(?i:...) – I tend to forget the /i too often.)
(... backtracking bug?)
(... backtracking into a character class?!)


Eirik
The Sidhekin
2011-04-29 16:59:35 UTC
Permalink
Post by Tom Christiansen
Post by The Sidhekin
These could generalize to German: (?i:[^ß])
Any character at all?
Any character but "ß"?
Any character but "ß", "ss", "Ss", "SS"?
Any character but "ß", "ss", "Ss", "SS", &c?
Of those, only "ß" is "a character". The others
are *two* characters.
A matter of definition, no? But you're right, if uncharacteristically
terse. ;-)

No matter how you define "a character", (?:[^ß]) could easily match either
half of any of "the others".

So, hang on, are we looking at "ss" being matched by /\A[^ß]{2}\z/i – but
not by /\A[^ß]+\z/? :-\

(... backtracking bug?)


Eirik
Karl Williamson
2011-04-29 17:43:10 UTC
Permalink
Post by Nicholas Clark
and that what happens is that to [^\xDF] is processed all in one, not as a
a range
an inverted range
in a case insensitive match
so it's not implemented as a human*might* think, in terms of
* process the ranges inside the [^...] construction to make a list of code
points (in my case that's one code point, U+00DF)
* [^...] means invert the list (in my case, that's several million code points)
* now match the inverted list against the input string
* oh yes, do that insensitively
The crux is your "oh yes, do that insensitively". The word "that" means
the previous step has to be modified. The way it currently works for
cases like this is that it creates the union of the characters not to
match and their folds, plus a flag that says complement the result at
execution time, which means the list is essentially all the non-matches.
A single 's' is not in the list of non-matches, but 'ss' is. George is
right, which wins?
Tom Christiansen
2011-04-29 01:24:01 UTC
Permalink
Post by Karl Williamson
One more thought. We could add something in 5.16 to enable multi-char
matching. A regex modifier or pragma. I think it's too late for 5.14
to do something like that.
Couldn't people who want (something like) the old behavior "just" put a

use if $] >= 5.014, re => "/aa";

in scope? That makes George's code again do the same thing under perl
5.12.3 and under blead. Is the goal for the old code to continue to work
untouched that way? I cannot see how to back out the multichar fold stuff.
I'm sure it is tied up with all the rest of the Unicode strings business.
And if we *break* multichar folds, we break something that has worked for a
very long time:

% perl5.8.1 -le 'print "\x{FB00}" =~ /ff/i || 0'
1

I don't think we can do that in good conscience.

--tom
Tom Christiansen
2011-04-29 01:41:47 UTC
Permalink
I don't know what choices are available now with respect
to 5.14.

To disable the multichar fold stuff seems a bigger
deal than shipping with it as is. You would break
stuff that has worked since 5.8. That's going too far.

I sincerely doubt that Karl can back out *just* the changes he's
made that removed the bugs with multichar folds, returning us to
that shaky position where they kinda worked sometimes but not
others, without *also* taking away the reliable Unicode strings.

People who want their stuff to be treated as binary octets
can use /aa. That's why it was added.
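
For the record, here is George's scrambled header run through the same pattern with /aa added inline. A small sketch; it needs 5.14 or later because that is where the modifier first appears, and the variable names are only for illustration.

```perl
use 5.014;   # the /a and /aa modifiers first appeared in 5.14
use strict;
use warnings;

my $field = "X-Xoqp-SDR-FpCqar4-Duooery-Faad-laeC_cCesspfpads:";
utf8::upgrade($field);

# /aa forbids case-insensitive matches between ASCII and non-ASCII
# characters, so "ss" in the target can no longer fold to U+00DF.
my $matched = $field =~ /^[^\x00-\x1f\x7f-\xff :]+:/iaa ? 1 : 0;
print "$matched\n";
```

This agrees with the "1 /iaa" row in George's original report.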

I understand that this is all surprising. And not particularly
pleasant. But I don't see a way forward that involves going backwards.

--tom
Karl Williamson
2011-04-29 01:51:48 UTC
Permalink
Post by Tom Christiansen
Post by Karl Williamson
One more thought. We could add something in 5.16 to enable multi-char
matching. A regex modifier or pragma. I think it's too late for 5.14
to do something like that.
Couldn't people who want (something like) the old behavior "just" put a
use if $]>= 5.014, re => "/aa";
in scope? That makes George's code again do the same thing under perl
5.12.3 and under blead. Is the goal for the old code to continue to work
untouched that way?
The problem here is that it is not obvious to anyone what is happening.
One shouldn't have to use the debugger to figure out why one's regex
is not doing what one thought it would.


I cannot see how to back out the multichar fold stuff.
Post by Tom Christiansen
I'm sure it is tied up with all the rest of the Unicode strings business
Again, it is trivial to cause the multichar folds to not be generated.
That's not the issue.
Post by Tom Christiansen
And if we *break* multichar folds, we break something that has worked for a
% perl5.8.1 -le 'print "\x{FB00}" =~ /ff/i || 0'
1
I don't think we can do that in good conscience.
This is a real issue that we have to think carefully about.
Tom Christiansen
2011-04-29 15:31:11 UTC
Permalink
Running

% unichars -gas 'grep { length > 1 } lc, ucfirst, uc'

shows that for multichar folds, there are 6 Armenian, 16 Latin, and 81
Greek code points. The Latin examples are comparatively rare and mostly
concerned with compatibility ligatures to ensure that round-tripping with
legacy encodings will preserve the originals.

In contrast, the Greek examples look perfectly normal, routine, and
expected — and not just because of the YPOGEGRAMMENI, either. That's
why I feel we really need to be thinking of Greek cases to help us
assess real-world expectations on matches involving multichar folds.

1 և U+0587 GC=Ll SC=Armenian ARMENIAN SMALL LIGATURE ECH YIWN
2 ﬔ U+FB14 GC=Ll SC=Armenian ARMENIAN SMALL LIGATURE MEN ECH
3 ﬕ U+FB15 GC=Ll SC=Armenian ARMENIAN SMALL LIGATURE MEN INI
4 ﬗ U+FB17 GC=Ll SC=Armenian ARMENIAN SMALL LIGATURE MEN XEH
5 ﬓ U+FB13 GC=Ll SC=Armenian ARMENIAN SMALL LIGATURE MEN NOW
6 ﬖ U+FB16 GC=Ll SC=Armenian ARMENIAN SMALL LIGATURE VEW NOW

1 ẚ U+1E9A GC=Ll SC=Latin LATIN SMALL LETTER A WITH RIGHT HALF RING
2 ﬃ U+FB03 GC=Ll SC=Latin LATIN SMALL LIGATURE FFI
3 ﬄ U+FB04 GC=Ll SC=Latin LATIN SMALL LIGATURE FFL
4 ﬀ U+FB00 GC=Ll SC=Latin LATIN SMALL LIGATURE FF
5 ﬁ U+FB01 GC=Ll SC=Latin LATIN SMALL LIGATURE FI
6 ﬂ U+FB02 GC=Ll SC=Latin LATIN SMALL LIGATURE FL
7 ẖ U+1E96 GC=Ll SC=Latin LATIN SMALL LETTER H WITH LINE BELOW
8 İ U+0130 GC=Lu SC=Latin LATIN CAPITAL LETTER I WITH DOT ABOVE
9 ǰ U+01F0 GC=Ll SC=Latin LATIN SMALL LETTER J WITH CARON
10 ß U+00DF GC=Ll SC=Latin LATIN SMALL LETTER SHARP S
11 ﬅ U+FB05 GC=Ll SC=Latin LATIN SMALL LIGATURE LONG S T
12 ﬆ U+FB06 GC=Ll SC=Latin LATIN SMALL LIGATURE ST
13 ẗ U+1E97 GC=Ll SC=Latin LATIN SMALL LETTER T WITH DIAERESIS
14 ẘ U+1E98 GC=Ll SC=Latin LATIN SMALL LETTER W WITH RING ABOVE
15 ẙ U+1E99 GC=Ll SC=Latin LATIN SMALL LETTER Y WITH RING ABOVE
16 ŉ U+0149 GC=Ll SC=Latin LATIN SMALL LETTER N PRECEDED BY APOSTROPHE

1 ᾀ U+1F80 GC=Ll SC=Greek GREEK SMALL LETTER ALPHA WITH PSILI AND YPOGEGRAMMENI
2 ᾁ U+1F81 GC=Ll SC=Greek GREEK SMALL LETTER ALPHA WITH DASIA AND YPOGEGRAMMENI
3 ᾂ U+1F82 GC=Ll SC=Greek GREEK SMALL LETTER ALPHA WITH PSILI AND VARIA AND YPOGEGRAMMENI
4 ᾃ U+1F83 GC=Ll SC=Greek GREEK SMALL LETTER ALPHA WITH DASIA AND VARIA AND YPOGEGRAMMENI
5 ᾄ U+1F84 GC=Ll SC=Greek GREEK SMALL LETTER ALPHA WITH PSILI AND OXIA AND YPOGEGRAMMENI
6 ᾅ U+1F85 GC=Ll SC=Greek GREEK SMALL LETTER ALPHA WITH DASIA AND OXIA AND YPOGEGRAMMENI
7 ᾆ U+1F86 GC=Ll SC=Greek GREEK SMALL LETTER ALPHA WITH PSILI AND PERISPOMENI AND YPOGEGRAMMENI
8 ᾇ U+1F87 GC=Ll SC=Greek GREEK SMALL LETTER ALPHA WITH DASIA AND PERISPOMENI AND YPOGEGRAMMENI
9 ᾈ U+1F88 GC=Lt SC=Greek GREEK CAPITAL LETTER ALPHA WITH PSILI AND PROSGEGRAMMENI
10 ᾉ U+1F89 GC=Lt SC=Greek GREEK CAPITAL LETTER ALPHA WITH DASIA AND PROSGEGRAMMENI
11 ᾊ U+1F8A GC=Lt SC=Greek GREEK CAPITAL LETTER ALPHA WITH PSILI AND VARIA AND PROSGEGRAMMENI
12 ᾋ U+1F8B GC=Lt SC=Greek GREEK CAPITAL LETTER ALPHA WITH DASIA AND VARIA AND PROSGEGRAMMENI
13 ᾌ U+1F8C GC=Lt SC=Greek GREEK CAPITAL LETTER ALPHA WITH PSILI AND OXIA AND PROSGEGRAMMENI
14 ᾍ U+1F8D GC=Lt SC=Greek GREEK CAPITAL LETTER ALPHA WITH DASIA AND OXIA AND PROSGEGRAMMENI
15 ᾎ U+1F8E GC=Lt SC=Greek GREEK CAPITAL LETTER ALPHA WITH PSILI AND PERISPOMENI AND PROSGEGRAMMENI
16 ᾏ U+1F8F GC=Lt SC=Greek GREEK CAPITAL LETTER ALPHA WITH DASIA AND PERISPOMENI AND PROSGEGRAMMENI
17 ᾲ U+1FB2 GC=Ll SC=Greek GREEK SMALL LETTER ALPHA WITH VARIA AND YPOGEGRAMMENI
18 ᾳ U+1FB3 GC=Ll SC=Greek GREEK SMALL LETTER ALPHA WITH YPOGEGRAMMENI
19 ᾴ U+1FB4 GC=Ll SC=Greek GREEK SMALL LETTER ALPHA WITH OXIA AND YPOGEGRAMMENI
20 ᾶ U+1FB6 GC=Ll SC=Greek GREEK SMALL LETTER ALPHA WITH PERISPOMENI
21 ᾷ U+1FB7 GC=Ll SC=Greek GREEK SMALL LETTER ALPHA WITH PERISPOMENI AND YPOGEGRAMMENI
22 ᾼ U+1FBC GC=Lt SC=Greek GREEK CAPITAL LETTER ALPHA WITH PROSGEGRAMMENI
23 ᾐ U+1F90 GC=Ll SC=Greek GREEK SMALL LETTER ETA WITH PSILI AND YPOGEGRAMMENI
24 ᾑ U+1F91 GC=Ll SC=Greek GREEK SMALL LETTER ETA WITH DASIA AND YPOGEGRAMMENI
25 ᾒ U+1F92 GC=Ll SC=Greek GREEK SMALL LETTER ETA WITH PSILI AND VARIA AND YPOGEGRAMMENI
26 ᾓ U+1F93 GC=Ll SC=Greek GREEK SMALL LETTER ETA WITH DASIA AND VARIA AND YPOGEGRAMMENI
27 ᾔ U+1F94 GC=Ll SC=Greek GREEK SMALL LETTER ETA WITH PSILI AND OXIA AND YPOGEGRAMMENI
28 ᾕ U+1F95 GC=Ll SC=Greek GREEK SMALL LETTER ETA WITH DASIA AND OXIA AND YPOGEGRAMMENI
29 ᾖ U+1F96 GC=Ll SC=Greek GREEK SMALL LETTER ETA WITH PSILI AND PERISPOMENI AND YPOGEGRAMMENI
30 ᾗ U+1F97 GC=Ll SC=Greek GREEK SMALL LETTER ETA WITH DASIA AND PERISPOMENI AND YPOGEGRAMMENI
31 ᾘ U+1F98 GC=Lt SC=Greek GREEK CAPITAL LETTER ETA WITH PSILI AND PROSGEGRAMMENI
32 ᾙ U+1F99 GC=Lt SC=Greek GREEK CAPITAL LETTER ETA WITH DASIA AND PROSGEGRAMMENI
33 ᾚ U+1F9A GC=Lt SC=Greek GREEK CAPITAL LETTER ETA WITH PSILI AND VARIA AND PROSGEGRAMMENI
34 ᾛ U+1F9B GC=Lt SC=Greek GREEK CAPITAL LETTER ETA WITH DASIA AND VARIA AND PROSGEGRAMMENI
35 ᾜ U+1F9C GC=Lt SC=Greek GREEK CAPITAL LETTER ETA WITH PSILI AND OXIA AND PROSGEGRAMMENI
36 ᾝ U+1F9D GC=Lt SC=Greek GREEK CAPITAL LETTER ETA WITH DASIA AND OXIA AND PROSGEGRAMMENI
37 ᾞ U+1F9E GC=Lt SC=Greek GREEK CAPITAL LETTER ETA WITH PSILI AND PERISPOMENI AND PROSGEGRAMMENI
38 ᾟ U+1F9F GC=Lt SC=Greek GREEK CAPITAL LETTER ETA WITH DASIA AND PERISPOMENI AND PROSGEGRAMMENI
39 ῂ U+1FC2 GC=Ll SC=Greek GREEK SMALL LETTER ETA WITH VARIA AND YPOGEGRAMMENI
40 ῃ U+1FC3 GC=Ll SC=Greek GREEK SMALL LETTER ETA WITH YPOGEGRAMMENI
41 ῄ U+1FC4 GC=Ll SC=Greek GREEK SMALL LETTER ETA WITH OXIA AND YPOGEGRAMMENI
42 ῆ U+1FC6 GC=Ll SC=Greek GREEK SMALL LETTER ETA WITH PERISPOMENI
43 ῇ U+1FC7 GC=Ll SC=Greek GREEK SMALL LETTER ETA WITH PERISPOMENI AND YPOGEGRAMMENI
44 ῌ U+1FCC GC=Lt SC=Greek GREEK CAPITAL LETTER ETA WITH PROSGEGRAMMENI
45 ΐ U+0390 GC=Ll SC=Greek GREEK SMALL LETTER IOTA WITH DIALYTIKA AND TONOS
46 ῒ U+1FD2 GC=Ll SC=Greek GREEK SMALL LETTER IOTA WITH DIALYTIKA AND VARIA
47 ΐ U+1FD3 GC=Ll SC=Greek GREEK SMALL LETTER IOTA WITH DIALYTIKA AND OXIA
48 ῖ U+1FD6 GC=Ll SC=Greek GREEK SMALL LETTER IOTA WITH PERISPOMENI
49 ῗ U+1FD7 GC=Ll SC=Greek GREEK SMALL LETTER IOTA WITH DIALYTIKA AND PERISPOMENI
50 ῤ U+1FE4 GC=Ll SC=Greek GREEK SMALL LETTER RHO WITH PSILI
51 ΰ U+03B0 GC=Ll SC=Greek GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND TONOS
52 ὐ U+1F50 GC=Ll SC=Greek GREEK SMALL LETTER UPSILON WITH PSILI
53 ὒ U+1F52 GC=Ll SC=Greek GREEK SMALL LETTER UPSILON WITH PSILI AND VARIA
54 ὔ U+1F54 GC=Ll SC=Greek GREEK SMALL LETTER UPSILON WITH PSILI AND OXIA
55 ὖ U+1F56 GC=Ll SC=Greek GREEK SMALL LETTER UPSILON WITH PSILI AND PERISPOMENI
56 ῢ U+1FE2 GC=Ll SC=Greek GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND VARIA
57 ΰ U+1FE3 GC=Ll SC=Greek GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND OXIA
58 ῦ U+1FE6 GC=Ll SC=Greek GREEK SMALL LETTER UPSILON WITH PERISPOMENI
59 ῧ U+1FE7 GC=Ll SC=Greek GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND PERISPOMENI
60 ᾠ U+1FA0 GC=Ll SC=Greek GREEK SMALL LETTER OMEGA WITH PSILI AND YPOGEGRAMMENI
61 ᾡ U+1FA1 GC=Ll SC=Greek GREEK SMALL LETTER OMEGA WITH DASIA AND YPOGEGRAMMENI
62 ᾢ U+1FA2 GC=Ll SC=Greek GREEK SMALL LETTER OMEGA WITH PSILI AND VARIA AND YPOGEGRAMMENI
63 ᾣ U+1FA3 GC=Ll SC=Greek GREEK SMALL LETTER OMEGA WITH DASIA AND VARIA AND YPOGEGRAMMENI
64 ᾤ U+1FA4 GC=Ll SC=Greek GREEK SMALL LETTER OMEGA WITH PSILI AND OXIA AND YPOGEGRAMMENI
65 ᾥ U+1FA5 GC=Ll SC=Greek GREEK SMALL LETTER OMEGA WITH DASIA AND OXIA AND YPOGEGRAMMENI
66 ᾦ U+1FA6 GC=Ll SC=Greek GREEK SMALL LETTER OMEGA WITH PSILI AND PERISPOMENI AND YPOGEGRAMMENI
67 ᾧ U+1FA7 GC=Ll SC=Greek GREEK SMALL LETTER OMEGA WITH DASIA AND PERISPOMENI AND YPOGEGRAMMENI
68 ᾨ U+1FA8 GC=Lt SC=Greek GREEK CAPITAL LETTER OMEGA WITH PSILI AND PROSGEGRAMMENI
69 ᾩ U+1FA9 GC=Lt SC=Greek GREEK CAPITAL LETTER OMEGA WITH DASIA AND PROSGEGRAMMENI
70 ᾪ U+1FAA GC=Lt SC=Greek GREEK CAPITAL LETTER OMEGA WITH PSILI AND VARIA AND PROSGEGRAMMENI
71 ᾫ U+1FAB GC=Lt SC=Greek GREEK CAPITAL LETTER OMEGA WITH DASIA AND VARIA AND PROSGEGRAMMENI
72 ᾬ U+1FAC GC=Lt SC=Greek GREEK CAPITAL LETTER OMEGA WITH PSILI AND OXIA AND PROSGEGRAMMENI
73 ᾭ U+1FAD GC=Lt SC=Greek GREEK CAPITAL LETTER OMEGA WITH DASIA AND OXIA AND PROSGEGRAMMENI
74 ᾮ U+1FAE GC=Lt SC=Greek GREEK CAPITAL LETTER OMEGA WITH PSILI AND PERISPOMENI AND PROSGEGRAMMENI
75 ᾯ U+1FAF GC=Lt SC=Greek GREEK CAPITAL LETTER OMEGA WITH DASIA AND PERISPOMENI AND PROSGEGRAMMENI
76 ῲ U+1FF2 GC=Ll SC=Greek GREEK SMALL LETTER OMEGA WITH VARIA AND YPOGEGRAMMENI
77 ῳ U+1FF3 GC=Ll SC=Greek GREEK SMALL LETTER OMEGA WITH YPOGEGRAMMENI
78 ῴ U+1FF4 GC=Ll SC=Greek GREEK SMALL LETTER OMEGA WITH OXIA AND YPOGEGRAMMENI
79 ῶ U+1FF6 GC=Ll SC=Greek GREEK SMALL LETTER OMEGA WITH PERISPOMENI
80 ῷ U+1FF7 GC=Ll SC=Greek GREEK SMALL LETTER OMEGA WITH PERISPOMENI AND YPOGEGRAMMENI
81 ῼ U+1FFC GC=Lt SC=Greek GREEK CAPITAL LETTER OMEGA WITH PROSGEGRAMMENI

Here are the only Latin titlecase letters; none have multichar folds:

1 ǲ U+01F2 GC=Lt SC=Latin LATIN CAPITAL LETTER D WITH SMALL LETTER Z
2 ǅ U+01C5 GC=Lt SC=Latin LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON
3 ǈ U+01C8 GC=Lt SC=Latin LATIN CAPITAL LETTER L WITH SMALL LETTER J
4 ǋ U+01CB GC=Lt SC=Latin LATIN CAPITAL LETTER N WITH SMALL LETTER J

Another interesting point with Eszett is that even though both the lowercase
and uppercase forms have primary UCA strengths identical to that of "ss":

% unichars 'UCA eq UCA("ss")'
ß U+00DF GC=Ll SC=Latin LATIN SMALL LETTER SHARP S
ẞ U+1E9E GC=Lu SC=Latin LATIN CAPITAL LETTER SHARP S

It turns out that the weirdness of the lowercase version
does not occur with the uppercase version:

lowercase "\x{DF}" => "\x{DF}"
titlecase "\x{DF}" => "Ss"
uppercase "\x{DF}" => "SS"

lowercase "\x{1E9E}" => "\x{DF}"
titlecase "\x{1E9E}" => "\x{1E9E}"
uppercase "\x{1E9E}" => "\x{1E9E}"

This is really bizarre, but it also shows why using casefolding in
matches, whether simple or full, is still not as good as checking
for *whether they are the same letters*, which is what the primary
strength comparison is doing.
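
The case-mapping table above can be regenerated with the built-in lc, ucfirst and uc. A short sketch: the Latin-1 string is upgraded first so that Unicode rules apply to U+00DF, and the variable names are just for illustration.

```perl
use strict;
use warnings;

binmode STDOUT, ":encoding(UTF-8)";

my $lower_sharp_s = "\x{DF}";
utf8::upgrade($lower_sharp_s);      # ensure Unicode rules for the Latin-1 string
my $upper_sharp_s = "\x{1E9E}";

# Print the lowercase, titlecase and uppercase mapping of each Eszett.
for my $char ($lower_sharp_s, $upper_sharp_s) {
    printf "U+%04X lc=%s ucfirst=%s uc=%s\n",
        ord($char), lc($char), ucfirst($char), uc($char);
}
```

Only U+00DF grows to two characters under ucfirst and uc; U+1E9E maps to itself in both.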

In RL3.4, "Tailored Loose Matches" at

http://unicode.org/reports/tr18/#Tailored_Loose_Matches

they give an example syntax using \v{PRIMARY} to indicate such, but this
they put in the locale order, not just regular UCA primary. They would
have you say [\v{PRIMARY}\x{DF}] to mean something whose UCA1 strength
is the same as U+00DF's, and thus "ss", "Ss", "SS", and also U+1E9E.

I guess if you also included RL2.2 Extended Grapheme Clusters, which is
where \b{g} vs \b{w} etc come in, that could be written [\v{PRIMARY}\q{ss}]
with a custom contraction of "ss". I fear the ramifications of multichar
folds and any other contractions. I can easily imagine something like
this in Perl:

use re "UCA=1"; # better than /i !!

/\x{DF}/ # includes "SS", "ss", "Ss", "ß", "ẞ"
/[abd\x{DF}]/ # same, plus "Å", "ẚ", "ª", "ℬ", "đ", "ꝺ", ...
/[^\x{DF}]/ # NOT "ss", NOT "šš", NOT "ⓢⓢ", NOT "ſſ", NOT "ꞄꞄ", ...

The last one gets seriously strange, doesn't it now? It forbids doubled
letters, but DOES allow the singles: "s", "š", "ⓢ", "ſ", "Ꞅ", etc.

Perhaps something along those lines is what we'll eventually have to do
to get the multichar folds working in sets and set-complements in a way
that doesn't confuse the user who's still thinking in 7/8-bit repertoires.

This is the kind of thing I meant when I said I didn't think the Unicode
folks had thought through all the issues with case-insensitive matching.

--tom