Running
% unichars -gas 'grep { length > 1 } lc, ucfirst, uc'
shows that for multichar folds, there are 6 Armenian, 16 Latin, and 81
Greek code points. The Latin examples are comparatively rare and mostly
concerned with compatibility ligatures to ensure that round-tripping with
legacy encodings will preserve the originals.
In contrast, the Greek examples look perfectly normal, routine, and
expected — and not just because of the YPOGEGRAMMENI, either. That's
why I feel we really need to be thinking of Greek cases to help us
assess real-world expectations on matches involving multichar folds.
1 և U+0587 GC=Ll SC=Armenian ARMENIAN SMALL LIGATURE ECH YIWN
2 ﬔ U+FB14 GC=Ll SC=Armenian ARMENIAN SMALL LIGATURE MEN ECH
3 ﬕ U+FB15 GC=Ll SC=Armenian ARMENIAN SMALL LIGATURE MEN INI
4 ﬗ U+FB17 GC=Ll SC=Armenian ARMENIAN SMALL LIGATURE MEN XEH
5 ﬓ U+FB13 GC=Ll SC=Armenian ARMENIAN SMALL LIGATURE MEN NOW
6 ﬖ U+FB16 GC=Ll SC=Armenian ARMENIAN SMALL LIGATURE VEW NOW
1 ẚ U+1E9A GC=Ll SC=Latin LATIN SMALL LETTER A WITH RIGHT HALF RING
2 ﬃ U+FB03 GC=Ll SC=Latin LATIN SMALL LIGATURE FFI
3 ﬄ U+FB04 GC=Ll SC=Latin LATIN SMALL LIGATURE FFL
4 ﬀ U+FB00 GC=Ll SC=Latin LATIN SMALL LIGATURE FF
5 ﬁ U+FB01 GC=Ll SC=Latin LATIN SMALL LIGATURE FI
6 ﬂ U+FB02 GC=Ll SC=Latin LATIN SMALL LIGATURE FL
7 ẖ U+1E96 GC=Ll SC=Latin LATIN SMALL LETTER H WITH LINE BELOW
8 İ U+0130 GC=Lu SC=Latin LATIN CAPITAL LETTER I WITH DOT ABOVE
9 ǰ U+01F0 GC=Ll SC=Latin LATIN SMALL LETTER J WITH CARON
10 ß U+00DF GC=Ll SC=Latin LATIN SMALL LETTER SHARP S
11 ﬅ U+FB05 GC=Ll SC=Latin LATIN SMALL LIGATURE LONG S T
12 ﬆ U+FB06 GC=Ll SC=Latin LATIN SMALL LIGATURE ST
13 ẗ U+1E97 GC=Ll SC=Latin LATIN SMALL LETTER T WITH DIAERESIS
14 ẘ U+1E98 GC=Ll SC=Latin LATIN SMALL LETTER W WITH RING ABOVE
15 ẙ U+1E99 GC=Ll SC=Latin LATIN SMALL LETTER Y WITH RING ABOVE
16 ŉ U+0149 GC=Ll SC=Latin LATIN SMALL LETTER N PRECEDED BY APOSTROPHE
1 ᾀ U+1F80 GC=Ll SC=Greek GREEK SMALL LETTER ALPHA WITH PSILI AND YPOGEGRAMMENI
2 ᾁ U+1F81 GC=Ll SC=Greek GREEK SMALL LETTER ALPHA WITH DASIA AND YPOGEGRAMMENI
3 ᾂ U+1F82 GC=Ll SC=Greek GREEK SMALL LETTER ALPHA WITH PSILI AND VARIA AND YPOGEGRAMMENI
4 ᾃ U+1F83 GC=Ll SC=Greek GREEK SMALL LETTER ALPHA WITH DASIA AND VARIA AND YPOGEGRAMMENI
5 ᾄ U+1F84 GC=Ll SC=Greek GREEK SMALL LETTER ALPHA WITH PSILI AND OXIA AND YPOGEGRAMMENI
6 ᾅ U+1F85 GC=Ll SC=Greek GREEK SMALL LETTER ALPHA WITH DASIA AND OXIA AND YPOGEGRAMMENI
7 ᾆ U+1F86 GC=Ll SC=Greek GREEK SMALL LETTER ALPHA WITH PSILI AND PERISPOMENI AND YPOGEGRAMMENI
8 ᾇ U+1F87 GC=Ll SC=Greek GREEK SMALL LETTER ALPHA WITH DASIA AND PERISPOMENI AND YPOGEGRAMMENI
9 ᾈ U+1F88 GC=Lt SC=Greek GREEK CAPITAL LETTER ALPHA WITH PSILI AND PROSGEGRAMMENI
10 ᾉ U+1F89 GC=Lt SC=Greek GREEK CAPITAL LETTER ALPHA WITH DASIA AND PROSGEGRAMMENI
11 ᾊ U+1F8A GC=Lt SC=Greek GREEK CAPITAL LETTER ALPHA WITH PSILI AND VARIA AND PROSGEGRAMMENI
12 ᾋ U+1F8B GC=Lt SC=Greek GREEK CAPITAL LETTER ALPHA WITH DASIA AND VARIA AND PROSGEGRAMMENI
13 ᾌ U+1F8C GC=Lt SC=Greek GREEK CAPITAL LETTER ALPHA WITH PSILI AND OXIA AND PROSGEGRAMMENI
14 ᾍ U+1F8D GC=Lt SC=Greek GREEK CAPITAL LETTER ALPHA WITH DASIA AND OXIA AND PROSGEGRAMMENI
15 ᾎ U+1F8E GC=Lt SC=Greek GREEK CAPITAL LETTER ALPHA WITH PSILI AND PERISPOMENI AND PROSGEGRAMMENI
16 ᾏ U+1F8F GC=Lt SC=Greek GREEK CAPITAL LETTER ALPHA WITH DASIA AND PERISPOMENI AND PROSGEGRAMMENI
17 ᾲ U+1FB2 GC=Ll SC=Greek GREEK SMALL LETTER ALPHA WITH VARIA AND YPOGEGRAMMENI
18 ᾳ U+1FB3 GC=Ll SC=Greek GREEK SMALL LETTER ALPHA WITH YPOGEGRAMMENI
19 ᾴ U+1FB4 GC=Ll SC=Greek GREEK SMALL LETTER ALPHA WITH OXIA AND YPOGEGRAMMENI
20 ᾶ U+1FB6 GC=Ll SC=Greek GREEK SMALL LETTER ALPHA WITH PERISPOMENI
21 ᾷ U+1FB7 GC=Ll SC=Greek GREEK SMALL LETTER ALPHA WITH PERISPOMENI AND YPOGEGRAMMENI
22 ᾼ U+1FBC GC=Lt SC=Greek GREEK CAPITAL LETTER ALPHA WITH PROSGEGRAMMENI
23 ᾐ U+1F90 GC=Ll SC=Greek GREEK SMALL LETTER ETA WITH PSILI AND YPOGEGRAMMENI
24 ᾑ U+1F91 GC=Ll SC=Greek GREEK SMALL LETTER ETA WITH DASIA AND YPOGEGRAMMENI
25 ᾒ U+1F92 GC=Ll SC=Greek GREEK SMALL LETTER ETA WITH PSILI AND VARIA AND YPOGEGRAMMENI
26 ᾓ U+1F93 GC=Ll SC=Greek GREEK SMALL LETTER ETA WITH DASIA AND VARIA AND YPOGEGRAMMENI
27 ᾔ U+1F94 GC=Ll SC=Greek GREEK SMALL LETTER ETA WITH PSILI AND OXIA AND YPOGEGRAMMENI
28 ᾕ U+1F95 GC=Ll SC=Greek GREEK SMALL LETTER ETA WITH DASIA AND OXIA AND YPOGEGRAMMENI
29 ᾖ U+1F96 GC=Ll SC=Greek GREEK SMALL LETTER ETA WITH PSILI AND PERISPOMENI AND YPOGEGRAMMENI
30 ᾗ U+1F97 GC=Ll SC=Greek GREEK SMALL LETTER ETA WITH DASIA AND PERISPOMENI AND YPOGEGRAMMENI
31 ᾘ U+1F98 GC=Lt SC=Greek GREEK CAPITAL LETTER ETA WITH PSILI AND PROSGEGRAMMENI
32 ᾙ U+1F99 GC=Lt SC=Greek GREEK CAPITAL LETTER ETA WITH DASIA AND PROSGEGRAMMENI
33 ᾚ U+1F9A GC=Lt SC=Greek GREEK CAPITAL LETTER ETA WITH PSILI AND VARIA AND PROSGEGRAMMENI
34 ᾛ U+1F9B GC=Lt SC=Greek GREEK CAPITAL LETTER ETA WITH DASIA AND VARIA AND PROSGEGRAMMENI
35 ᾜ U+1F9C GC=Lt SC=Greek GREEK CAPITAL LETTER ETA WITH PSILI AND OXIA AND PROSGEGRAMMENI
36 ᾝ U+1F9D GC=Lt SC=Greek GREEK CAPITAL LETTER ETA WITH DASIA AND OXIA AND PROSGEGRAMMENI
37 ᾞ U+1F9E GC=Lt SC=Greek GREEK CAPITAL LETTER ETA WITH PSILI AND PERISPOMENI AND PROSGEGRAMMENI
38 ᾟ U+1F9F GC=Lt SC=Greek GREEK CAPITAL LETTER ETA WITH DASIA AND PERISPOMENI AND PROSGEGRAMMENI
39 ῂ U+1FC2 GC=Ll SC=Greek GREEK SMALL LETTER ETA WITH VARIA AND YPOGEGRAMMENI
40 ῃ U+1FC3 GC=Ll SC=Greek GREEK SMALL LETTER ETA WITH YPOGEGRAMMENI
41 ῄ U+1FC4 GC=Ll SC=Greek GREEK SMALL LETTER ETA WITH OXIA AND YPOGEGRAMMENI
42 ῆ U+1FC6 GC=Ll SC=Greek GREEK SMALL LETTER ETA WITH PERISPOMENI
43 ῇ U+1FC7 GC=Ll SC=Greek GREEK SMALL LETTER ETA WITH PERISPOMENI AND YPOGEGRAMMENI
44 ῌ U+1FCC GC=Lt SC=Greek GREEK CAPITAL LETTER ETA WITH PROSGEGRAMMENI
45 ΐ U+0390 GC=Ll SC=Greek GREEK SMALL LETTER IOTA WITH DIALYTIKA AND TONOS
46 ῒ U+1FD2 GC=Ll SC=Greek GREEK SMALL LETTER IOTA WITH DIALYTIKA AND VARIA
47 ΐ U+1FD3 GC=Ll SC=Greek GREEK SMALL LETTER IOTA WITH DIALYTIKA AND OXIA
48 ῖ U+1FD6 GC=Ll SC=Greek GREEK SMALL LETTER IOTA WITH PERISPOMENI
49 ῗ U+1FD7 GC=Ll SC=Greek GREEK SMALL LETTER IOTA WITH DIALYTIKA AND PERISPOMENI
50 ῤ U+1FE4 GC=Ll SC=Greek GREEK SMALL LETTER RHO WITH PSILI
51 ΰ U+03B0 GC=Ll SC=Greek GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND TONOS
52 ὐ U+1F50 GC=Ll SC=Greek GREEK SMALL LETTER UPSILON WITH PSILI
53 ὒ U+1F52 GC=Ll SC=Greek GREEK SMALL LETTER UPSILON WITH PSILI AND VARIA
54 ὔ U+1F54 GC=Ll SC=Greek GREEK SMALL LETTER UPSILON WITH PSILI AND OXIA
55 ὖ U+1F56 GC=Ll SC=Greek GREEK SMALL LETTER UPSILON WITH PSILI AND PERISPOMENI
56 ῢ U+1FE2 GC=Ll SC=Greek GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND VARIA
57 ΰ U+1FE3 GC=Ll SC=Greek GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND OXIA
58 ῦ U+1FE6 GC=Ll SC=Greek GREEK SMALL LETTER UPSILON WITH PERISPOMENI
59 ῧ U+1FE7 GC=Ll SC=Greek GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND PERISPOMENI
60 ᾠ U+1FA0 GC=Ll SC=Greek GREEK SMALL LETTER OMEGA WITH PSILI AND YPOGEGRAMMENI
61 ᾡ U+1FA1 GC=Ll SC=Greek GREEK SMALL LETTER OMEGA WITH DASIA AND YPOGEGRAMMENI
62 ᾢ U+1FA2 GC=Ll SC=Greek GREEK SMALL LETTER OMEGA WITH PSILI AND VARIA AND YPOGEGRAMMENI
63 ᾣ U+1FA3 GC=Ll SC=Greek GREEK SMALL LETTER OMEGA WITH DASIA AND VARIA AND YPOGEGRAMMENI
64 ᾤ U+1FA4 GC=Ll SC=Greek GREEK SMALL LETTER OMEGA WITH PSILI AND OXIA AND YPOGEGRAMMENI
65 ᾥ U+1FA5 GC=Ll SC=Greek GREEK SMALL LETTER OMEGA WITH DASIA AND OXIA AND YPOGEGRAMMENI
66 ᾦ U+1FA6 GC=Ll SC=Greek GREEK SMALL LETTER OMEGA WITH PSILI AND PERISPOMENI AND YPOGEGRAMMENI
67 ᾧ U+1FA7 GC=Ll SC=Greek GREEK SMALL LETTER OMEGA WITH DASIA AND PERISPOMENI AND YPOGEGRAMMENI
68 ᾨ U+1FA8 GC=Lt SC=Greek GREEK CAPITAL LETTER OMEGA WITH PSILI AND PROSGEGRAMMENI
69 ᾩ U+1FA9 GC=Lt SC=Greek GREEK CAPITAL LETTER OMEGA WITH DASIA AND PROSGEGRAMMENI
70 ᾪ U+1FAA GC=Lt SC=Greek GREEK CAPITAL LETTER OMEGA WITH PSILI AND VARIA AND PROSGEGRAMMENI
71 ᾫ U+1FAB GC=Lt SC=Greek GREEK CAPITAL LETTER OMEGA WITH DASIA AND VARIA AND PROSGEGRAMMENI
72 ᾬ U+1FAC GC=Lt SC=Greek GREEK CAPITAL LETTER OMEGA WITH PSILI AND OXIA AND PROSGEGRAMMENI
73 ᾭ U+1FAD GC=Lt SC=Greek GREEK CAPITAL LETTER OMEGA WITH DASIA AND OXIA AND PROSGEGRAMMENI
74 ᾮ U+1FAE GC=Lt SC=Greek GREEK CAPITAL LETTER OMEGA WITH PSILI AND PERISPOMENI AND PROSGEGRAMMENI
75 ᾯ U+1FAF GC=Lt SC=Greek GREEK CAPITAL LETTER OMEGA WITH DASIA AND PERISPOMENI AND PROSGEGRAMMENI
76 ῲ U+1FF2 GC=Ll SC=Greek GREEK SMALL LETTER OMEGA WITH VARIA AND YPOGEGRAMMENI
77 ῳ U+1FF3 GC=Ll SC=Greek GREEK SMALL LETTER OMEGA WITH YPOGEGRAMMENI
78 ῴ U+1FF4 GC=Ll SC=Greek GREEK SMALL LETTER OMEGA WITH OXIA AND YPOGEGRAMMENI
79 ῶ U+1FF6 GC=Ll SC=Greek GREEK SMALL LETTER OMEGA WITH PERISPOMENI
80 ῷ U+1FF7 GC=Ll SC=Greek GREEK SMALL LETTER OMEGA WITH PERISPOMENI AND YPOGEGRAMMENI
81 ῼ U+1FFC GC=Lt SC=Greek GREEK CAPITAL LETTER OMEGA WITH PROSGEGRAMMENI
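For anyone without unichars at hand, the test in that one-liner can be
approximated in plain Perl. This is only a sketch under some assumptions:
the sub names has_multichar_casing and count_by_script are made up for
illustration, only the BMP is scanned (every code point listed above lives
there), and the tallies depend on the Unicode version your perl ships with.

```perl
use v5.16;          # enables strict and unicode_strings
use warnings;

# True if lc, ucfirst, or uc maps the one-character string $ch to more
# than one character: the same test as the unichars one-liner.
sub has_multichar_casing {
    my ($ch) = @_;
    my $n = grep { length($_) > 1 } lc($ch), ucfirst($ch), uc($ch);
    return $n > 0;
}

# Tally the matching BMP code points by script; surrogates are skipped.
sub count_by_script {
    my %count;
    for my $cp (0 .. 0xD7FF, 0xE000 .. 0xFFFF) {
        my $ch = chr $cp;
        next unless has_multichar_casing($ch);
        my $script = $ch =~ /\p{Script=Armenian}/ ? 'Armenian'
                   : $ch =~ /\p{Script=Latin}/    ? 'Latin'
                   : $ch =~ /\p{Script=Greek}/    ? 'Greek'
                   :                                'Other';
        $count{$script}++;
    }
    return \%count;
}

my $count = count_by_script();
printf "%-8s %d\n", $_, $count->{$_} for sort keys %$count;
```

The unicode_strings feature matters here: without it, code points such as
U+00DF that fit in a single byte may get byte semantics and come back
unchanged from uc.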
Here are the only Latin titlecase letters; none have multichar folds:
1 ǲ U+01F2 GC=Lt SC=Latin LATIN CAPITAL LETTER D WITH SMALL LETTER Z
2 ǅ U+01C5 GC=Lt SC=Latin LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON
3 ǈ U+01C8 GC=Lt SC=Latin LATIN CAPITAL LETTER L WITH SMALL LETTER J
4 ǋ U+01CB GC=Lt SC=Latin LATIN CAPITAL LETTER N WITH SMALL LETTER J
Another interesting point with Eszett is that both of them, lowercase and
uppercase, have primary UCA strengths identical to that of "ss":
% unichars 'UCA eq UCA("ss")'
ß U+00DF GC=Ll SC=Latin LATIN SMALL LETTER SHARP S
ẞ U+1E9E GC=Lu SC=Latin LATIN CAPITAL LETTER SHARP S
It turns out that the weirdness of the lowercase version
does not occur with the uppercase version:
lowercase "\x{DF}" => "\x{DF}"
titlecase "\x{DF}" => "Ss"
uppercase "\x{DF}" => "SS"
lowercase "\x{1E9E}" => "\x{DF}"
titlecase "\x{1E9E}" => "\x{1E9E}"
uppercase "\x{1E9E}" => "\x{1E9E}"
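The mappings above can be checked with the case builtins. The following is
a minimal sketch, assuming perl 5.16 or later for the fc builtin; the subs
show and case_forms are hypothetical helpers, not part of any module, and
the foldcase row is an addition to the three rows listed above.

```perl
use v5.16;          # strict, unicode_strings, and the fc builtin
use warnings;

# Render non-ASCII characters as \x{...} escapes; leave ASCII alone.
sub show {
    my ($s) = @_;
    return join '',
        map { ord($_) > 0x7E ? sprintf('\x{%X}', ord($_)) : $_ }
        split //, $s;
}

# Return the four case mappings of $s, keyed by name.
sub case_forms {
    my ($s) = @_;
    return (
        lowercase => lc($s),
        titlecase => ucfirst($s),
        uppercase => uc($s),
        foldcase  => fc($s),
    );
}

for my $s ("\x{DF}", "\x{1E9E}") {
    my %form = case_forms($s);
    printf qq{%-9s "%s" => "%s"\n}, $_, show($s), show($form{$_})
        for qw(lowercase titlecase uppercase foldcase);
}
```

Under full case folding both U+00DF and U+1E9E fold to "ss", so fc agrees
on the pair even though uc and ucfirst do not.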
This is really bizarre, but it also shows why using casefolding in
matches, whether simple or full, is still not as good as checking
for *whether they are the same letters*, which is what the primary
strength comparison is doing.
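A primary-strength comparison of this kind can be tried out with the
Unicode::Collate module that ships with perl. What follows is a sketch, not
a proposal for how matching should be implemented: same_letters is a
made-up name, the collator uses the default UCA table with no locale
tailoring, and normalization is switched off on the assumption that the
input is already normalized.

```perl
use v5.16;
use warnings;
use Unicode::Collate;

# Compare at primary strength only (level 1), so differences of case and
# accent are ignored.  Assumption: inputs are already normalized.
my $primary = Unicode::Collate->new(level => 1, normalization => undef);

# True if $x and $y have the same primary collation keys.
sub same_letters {
    my ($x, $y) = @_;
    return $primary->eq($x, $y);
}

for my $s ("ss", "Ss", "SS", "\x{DF}", "\x{1E9E}", "s") {
    printf "%-14s %s\n",
        join(' ', map { sprintf('U+%04X', ord($_)) } split //, $s),
        same_letters($s, "ss") ? 'same letters as "ss"' : 'different';
}
```

Only the lone "s" should come out different; the two Eszetts and every case
variant of "ss" share the same primary keys.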
In RL3.4, "Tailored Loose Matches" at
http://unicode.org/reports/tr18/#Tailored_Loose_Matches
they give an example syntax using \v{PRIMARY} to indicate such, but they
tie it to the locale's ordering, not just the regular UCA primary. They
would have you say [\v{PRIMARY}\x{DF}] to mean something whose UCA1 strength
is the same as U+00DF's, and thus "ss", "Ss", "SS", and also U+1E9E.
I guess if you also included RL2.2 Extended Grapheme Clusters, which is
where \b{g} vs \b{w} etc come in, that could be written [\v{PRIMARY}\q{ss}]
with a custom contraction of "ss". I fear the ramifications of multichar
folds and any other contractions. I can easily imagine something like
this in Perl:
use re "UCA=1"; # better than /i !!
/\x{DF}/ # includes "SS", "ss", "Ss", "ß", "ẞ"
/[abd\x{DF}]/ # same, plus "Å", "ẚ", "ª", "ℬ", "đ", "ꝺ", ...
/[^\x{DF}]/ # NOT "ss", NOT "šš", NOT "ⓢⓢ", NOT "ſſ", NOT "ꞄꞄ", ...
The last one gets seriously strange, doesn't it now? It forbids doubled
letters, but DOES allow the singles: "s", "š", "ⓢ", "ſ", "Ꞅ", etc.
Perhaps something along those lines is what we'll eventually have to do
to get the multichar folds working in sets and set-complements in a way
that doesn't confuse the user who's still thinking in 7/8-bit repertoires.
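There is no such "UCA=1" pragma today, but the single-pattern case can be
roughly approximated with the index method of Unicode::Collate. This is a
sketch under stated assumptions: first_match is a hypothetical helper, the
default UCA table is used without tailoring, and the text is assumed to be
normalized already. It does not attempt the bracketed sets or their
complements.

```perl
use v5.16;
use warnings;
use Unicode::Collate;

binmode STDOUT, ':encoding(UTF-8)';

# index() croaks if normalization is enabled, so it is switched off here;
# the text is assumed to be in NFC already.
my $uca1 = Unicode::Collate->new(level => 1, normalization => undef);

# Return the first part of $text that is primary-equal to $pattern,
# or undef when nothing matches.
sub first_match {
    my ($text, $pattern) = @_;
    my ($pos, $len) = $uca1->index($text, $pattern);
    return defined $pos ? substr($text, $pos, $len) : undef;
}

for my $text ("STRASSE", "Strasse", "Stra\x{DF}e", "Sense", "s") {
    my $hit = first_match($text, "\x{DF}");
    printf "%s: %s\n", $text, defined $hit ? qq{matched "$hit"} : 'no match';
}
```

A pattern of U+00DF finds "SS", "ss", and "ß" alike, while a lone "s"
finds nothing, which is the same single-versus-doubled asymmetry that makes
the complemented set so strange.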
This is the kind of thing I meant when I said I didn't think the Unicode
folks had thought through all the issues with case-insensitive matching.
--tom