Discussion:
Switching to UTF-8
Markus Kuhn
2002-05-01 19:02:57 UTC
Permalink
Now I'm almost done with switching to ko_KR.UTF-8 on my Linux box. It
works more or less fine in that I can do *more than* what I could
do under ko_KR.EUC-KR.
I have for some time now been using UTF-8 more frequently than
ISO 8859-1. The three critical milestones that still keep me from
moving entirely to UTF-8 are

a) Many people still use Netscape 4, where UTF-8 triggers the use of a
far too large and ugly font.

Solution: Mozilla 0.9.9 is 1-2 orders of magnitude more stable
(more clicks between crashes) than Netscape 4 now. I urge
distribution maintainers to discontinue shipping Netscape 4 as
soon as possible. There are loads of other problems with
Netscape 4 as well, for example the broken CSS support.

b) LaTeX - I'm not keen on switching to Omega just to have UTF-8
umlauts in my TeX sources.

Solution: LaTeX already has an input encoding package, and with

http://www.unruh.de/DniQ/latex/unicode/

you now just replace in your LaTeX header the old

\usepackage[latin1]{inputenc}

with

\usepackage[utf8]{inputenc}

and everything works as desired. I hope this UTF-8 extension
will soon find its way into standard LaTeX distributions.
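For reference, a complete minimal document using that package option looks like this (a sketch; it assumes the utf8 inputenc extension from the URL above is installed):

```latex
% Minimal test document; assumes the utf8 inputenc option from the
% URL above is installed alongside standard LaTeX.
\documentclass{article}
\usepackage[utf8]{inputenc}
\begin{document}
Umlauts directly in the source: ä ö ü ß
\end{document}
```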

c) Emacs - Current Emacs UTF-8 support is still a bit too provisional
for my comfort. In particular, I don't like that the UTF-8 mode is not
binary transparent. Work on turning Emacs completely into a UTF-8
editor is under way, and I'd be very curious to hear about the
current status and whether there is anything to test already.
Anyone?

So it is really just Emacs now that still keeps me from completely
moving to UTF-8 forever. (There is also bash/readline, but I simply
manage to avoid non-ASCII characters in command lines, so it's not a
really big problem in daily life.)

Markus
--
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org, WWW: <http://www.cl.cam.ac.uk/~mgk25/>
Florian Weimer
2002-05-01 20:43:59 UTC
Permalink
Post by Markus Kuhn
c) Emacs - Current Emacs UTF-8 support is still a bit too provisional
for my comfort. In particular, I don't like that the UTF-8 mode is not
binary transparent. Work on turning Emacs completely into a UTF-8
editor is under way, and I'd be very curious to hear about the
current status and whether there is anything to test already.
Anyone?
AFAIK, there is some activity on the Emacs 22 branch. XEmacs is in
the process of switching to UCS for its internal character set, too.
Gaspar Sinai
2002-05-02 00:51:44 UTC
Permalink
Post by Florian Weimer
Post by Markus Kuhn
c) Emacs - Current Emacs UTF-8 support is still a bit too provisional
for my comfort. In particular, I don't like that the UTF-8 mode is not
binary transparent. Work on turning Emacs completely into a UTF-8
editor is under way, and I'd be very curious to hear about the
current status and whether there is anything to test already.
Anyone?
AFAIK, there is some activity on the Emacs 22 branch. XEmacs is in
the process of switching to UCS for its internal character set, too.
I am not much of an Emacs guy but if I were I would probably
use QEmacs, which looks pretty decent to me:

http://fabrice.bellard.free.fr/qemacs/

As I don't use Emacs, I can't really tell the difference;
it might not have all the functionality that Emacs has. But
I have a feeling that the functionality you can expect from a
text editor is there.

I like that QEmacs has a much smaller memory footprint and binary
size than “mainstream” Emacs.

Open Source is funny: you probably will never hear Microsoft
praising Java ☺

Gáspár・ガーシュパール・Гашьпар・갓팔・Γασπαρ
ᏱᎦᏊ ᎣᏌᏂᏳ ᎠᏓᏅᏙ ᎠᏓᏙᎵᎩ ᏂᎪᎯᎸᎢ ᎾᏍᏋ ᎤᏠᏯᏍᏗ ᏂᎯ.
Yann Dirson
2002-05-06 21:08:54 UTC
Permalink
Post by Gaspar Sinai
I am not much of an Emacs guy but if I were I would probably
use QEmacs, which looks pretty decent to me:
http://fabrice.bellard.free.fr/qemacs/
I had a quick look at QEmacs a couple of weeks ago, for other reasons
(namely DocBook support), and found that it is a project in its early
phases of development, nowhere near a full-blown editor.
--
Yann Dirson <***@altern.org> | Why make M$-Bill richer & richer ?
Debian-related: <***@debian.org> | Support Debian GNU/Linux:
Pro: <***@fr.alcove.com> | Freedom, Power, Stability, Gratuity
http://ydirson.free.fr/ | Check <http://www.debian.org/>
Tomohiro KUBOTA
2002-05-02 02:38:38 UTC
Permalink
Hi,

At Wed, 01 May 2002 20:02:57 +0100,
Post by Markus Kuhn
I have for some time now been using UTF-8 more frequently than
ISO 8859-1. The three critical milestones that still keep me from
moving entirely to UTF-8 are
How about bash? Do you know of any improvements?

Please note that tcsh already supports east Asian EUC-like
multibyte encodings. I don't know whether it also supports UTF-8.

How about zsh?


For Japanese, the character width problems and mapping table
problems must be solved before migration to UTF-8 can even _start_.
(This is why several "Japanese localization patches" exist for
UTF-8-based software such as Mutt. We should find ways to make such
localization patches unnecessary.)

Also, I want people who develop UTF-8-based software to make it a
habit to specify the range of their UTF-8 support. For example,

* range of code points
U+0000 - U+2FFF? All of the BMP? SMP/SIP?

* special processing
combining characters? bidi? Arabic shaping? Indic scripts?
Mongolian (which needs vertical writing)? How about wcwidth()?

* input methods
Any way to input complex languages which cannot be supported
by the xkb mechanism (e.g., CJK)? XIM? IIIMP? (How about Gnome2?)
Or any software-specific input methods (like Emacs or Yudit)?

* font availability
Though each program is not responsible for this, "this software
is designed to require the Times font" means that it cannot display
non-Latin/Greek/Cyrillic characters.

Though people in the ISO 8859-1/2/15 regions don't have to care
about these points, other people can easily trust a piece of
"UTF-8-supported" software, be disappointed when they use it, and
then come to distrust "UTF-8-supported" software in general. We
should keep that from happening.
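The wcwidth() point above can be made concrete with a small sketch (Python's standard unicodedata module is used here purely for illustration): terminal software has to know that CJK characters occupy two columns, which is what the Unicode East Asian Width property records.

```python
import unicodedata

# East Asian Width property: "W" (wide) characters occupy two terminal
# columns, "Na" (narrow) ones occupy one. A wcwidth() implementation
# that ignores this breaks cursor positioning for CJK text.
for ch in ("A", "漢", "한"):
    print(ch, unicodedata.east_asian_width(ch))
# "A" is Na (narrow); "漢" and "한" are W (wide)
```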

---
Tomohiro KUBOTA <***@debian.org>
http://www.debian.or.jp/~kubota/
"Introduction to I18N" http://www.debian.org/doc/manuals/intro-i18n/
Glenn Maynard
2002-05-02 04:16:25 UTC
Permalink
Post by Tomohiro KUBOTA
* input methods
Any way to input complex languages which cannot be supported
by the xkb mechanism (e.g., CJK)? XIM? IIIMP? (How about Gnome2?)
Or any software-specific input methods (like Emacs or Yudit)?
How much extra work do X apps currently need to do to support input
methods?

In Windows, you do need to do a little--there's a small API to tell the
input method the cursor position (for when it opens a character selection
box) and to receive characters. (The former can be omitted and it'll
still be usable, if annoying--the dialog will be at 0x0. The latter can
be omitted for Unicode-based programs, or if the system codepage happens
to match the characters.)

It's little enough to add it easily to programs, but the fact that it
exists at all means that I can't enter CJK into most programs. Since
the regular 8-bit character message is in the system codepage, it's
impossible to send CJK through.

How does this compare with the situation in X?
Post by Tomohiro KUBOTA
* font availability
Though each program is not responsible for this, "this software
is designed to require the Times font" means that it cannot display
non-Latin/Greek/Cyrillic characters.
I can't think of ever using an (untranslated, English) X program and having
it display anything but Latin characters. When is this actually a problem?
--
Glenn Maynard
Tomohiro KUBOTA
2002-05-02 06:06:55 UTC
Permalink
Hi,

At Thu, 2 May 2002 00:16:25 -0400,
Post by Glenn Maynard
Post by Tomohiro KUBOTA
* input methods
Any way to input complex languages which cannot be supported
by the xkb mechanism (e.g., CJK)? XIM? IIIMP? (How about Gnome2?)
Or any software-specific input methods (like Emacs or Yudit)?
How much extra work do X apps currently need to do to support input
methods?
A lot of work. I think this is one problematic point of XIM, and it
is why so few programs (only those developed by the few developers
who know XIM) can accept CJK input.

The X.org distribution (and the XFree86 distribution) includes a
specification of the XIM protocol. However, it is difficult (at
least I could not understand it). So, for practical use by
developers,
http://www.ainet.or.jp/~inoue/im/index-e.html
would be useful for developing XIM clients. I have not read a good
introductory article on developing XIM servers.

I think that a low-level API should integrate XIM (or other input
method protocol) support so that XIM-ignorant developers (i.e.,
almost all developers in the world) can use it and thus not
inconvenience CJK users. Gnome2 seems to take this approach.
However, I wonder why Xlib doesn't have wrapper functions which
hide the troubles of XIM programming.
Post by Glenn Maynard
It's little enough to add it easily to programs, but the fact that it
exists at all means that I can't enter CJK into most programs. Since
the regular 8-bit character message is in the system codepage, it's
impossible to send CJK through.
Well, I am talking about Unicode-based software. More and more
developers around the world are starting to understand that 8 bits
are not enough for Unicode, because Unicode is now a universal fact.
I am optimistic in this field; many developers will soon think 8-bit
characters are a bad idea. However, it is unlikely that many
developers will recognize the need for XIM (or other input method)
support in the near future, because it is needed only for CJK
languages. My concern is how to get these XIM-ignorant people to
develop CJK-capable software.
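The point about 8-bit characters can be made concrete with a short sketch (Python is used here purely for illustration): a single kanji already needs three bytes in UTF-8, so any interface that hands characters around as single bytes cannot carry it.

```python
# One kanji is one code point but three UTF-8 bytes, so an API that
# delivers "a character" as a single 8-bit value cannot transmit it.
s = "漢"                      # U+6F22
data = s.encode("utf-8")
print(len(s), len(data))      # 1 code point, 3 bytes
print(data)                   # b'\xe6\xbc\xa2'
```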
Post by Glenn Maynard
How does this compare with the situation in X?
Though I don't know about Windows programming, I often use Windows
for my work. Imported software usually cannot handle Japanese
because of font problems. However, the input method (IME?) seems
to be invoked even in such imported software.
Post by Glenn Maynard
Post by Tomohiro KUBOTA
* font availability
Though each program is not responsible for this, "this software
is designed to require the Times font" means that it cannot display
non-Latin/Greek/Cyrillic characters.
I can't think of ever using an (untranslated, English) X program and having
it display anything but Latin characters. When is this actually a problem?
For example, XCreateFontSet("-*-times-*") cannot display Japanese
because no Japanese fonts match that name. (Instead, "mincho" and
"gothic" are the popular Japanese typefaces.) This type of
implementation is often seen in window managers and their theme
files.

---
Tomohiro KUBOTA <***@debian.org>
http://www.debian.or.jp/~kubota/
"Introduction to I18N" http://www.debian.org/doc/manuals/intro-i18n/
Jungshik Shin
2002-05-02 06:03:06 UTC
Permalink
Post by Glenn Maynard
Post by Tomohiro KUBOTA
* input methods
Any way to input complex languages which cannot be supported
by the xkb mechanism (e.g., CJK)? XIM? IIIMP? (How about Gnome2?)
Or any software-specific input methods (like Emacs or Yudit)?
How much extra work do X apps currently need to do to support input
methods?
In Windows, you do need to do a little--there's a small API to tell the
input method the cursor position (for when it opens a character selection
...
Post by Glenn Maynard
How does this compare with the situation in X?
I know very little about the Win32 APIs, but according to what little
I learned from the Mozilla source code, it doesn't seem to be as
simple as you wrote on Windows, either. Actually, my impression is
that the Windows IME APIs are almost parallel (concept-wise) to the
XIM APIs. (BTW, MS Windows XP introduced an enhanced set of
IM-related APIs called TSF?.) In both cases, you have to determine
what type of preediting support (in XIM terms: over-the-spot,
on-the-spot, off-the-spot, and none?) is shared by clients and the
IM server. Depending on the preediting type, the amount of work to
be done by clients varies.


I'm afraid your impression that Windows IME clients have very little
to do to get keyboard input comes from your not having written
programs that can accept input from a CJK IME (input method editor),
as seems to be confirmed by what I'm quoting below.

It just occurred to me that Mozilla.org has an excellent summary
of input method support on the three major platforms (Unix/X11,
MacOS, MS-Windows). See

http://www.mozilla.org/projects/intl/input-method-spec.html.
Post by Glenn Maynard
It's little enough to add it easily to programs, but the fact that it
exists at all means that I can't enter CJK into most programs. Since
the regular 8-bit character message is in the system codepage, it's
impossible to send CJK through.
Even in English or any SBCS-based Windows 9x/ME, you
can write programs that can accept CJK characters from CJK (global)
IMEs. Mozilla, MS IE, MS Word, and MS OE are good examples.

Jungshik Shin
Glenn Maynard
2002-05-02 06:35:42 UTC
Permalink
Post by Jungshik Shin
I know very little about the Win32 APIs, but according to what little
I learned from the Mozilla source code, it doesn't seem to be as
simple as you wrote on Windows, either. Actually, my impression is
that the Windows IME APIs are almost parallel (concept-wise) to the
XIM APIs. (BTW, MS Windows XP introduced an enhanced set of
IM-related APIs called TSF?.) In both cases, you have to determine
what type of preediting support (in XIM terms: over-the-spot,
on-the-spot, off-the-spot, and none?) is shared by clients and the
IM server. Depending on the preediting type, the amount of work to
be done by clients varies.
I'm afraid your impression that Windows IME clients have very little
to do to get keyboard input comes from your not having written
programs that can accept input from a CJK IME (input method editor),
as seems to be confirmed by what I'm quoting below.
I wrote the patch for PuTTY to accept input from Win2K's IME, and some
fixes for Vim's. What I said is all that's necessary for simple
support, and the vast majority of applications don't need any more than
that.

Of course, what you do with this input is up to the application, and if
you have no support for storing anything but text in the system codepage,
there might be a lot of work to do. That's a different topic entirely,
of course.
Post by Jungshik Shin
It just occurred to me that Mozilla.org has an excellent summary
of input method support on the three major platforms (Unix/X11,
MacOS, MS-Windows). See
http://www.mozilla.org/projects/intl/input-method-spec.html.
I've never seen any application do anything other than what this
describes as "Over-The-Spot composition". This includes system dialogs,
Word, Notepad and IE.

This document incorrectly says:

"Windows does not use the off-the-spot or over-the-spot styles of input."

As far as I know, Windows uses *only* "over-the-spot" input. Perhaps
on-the-spot can be implemented (and most people would probably agree
that it's cosmetically better), but it would probably take a lot more
work.

Ex:
http://zewt.org/~glenn/over1.jpg
http://zewt.org/~glenn/over2.jpg

(The rest of the first half of the document describes input styles that
most programs don't use.) The document states "Last modified May 18,
1999", so the information on it is probably out of date.

The only other thing you have to handle is described in "Platform
Protocols": WM_IME_COMPOSITION. The other two messages can be ignored.

The only API function listed here that's often needed is SetCaretPosition,
to set the cursor position.
Post by Jungshik Shin
Post by Glenn Maynard
It's little enough to add it easily to programs, but the fact that it
exists at all means that I can't enter CJK into most programs. Since
the regular 8-bit character message is in the system codepage, it's
impossible to send CJK through.
Even in English or any SBCS-based Windows 9x/ME, you
can write programs that can accept CJK characters from CJK (global)
IMEs. Mozilla, MS IE, MS Word, and MS OE are good examples.
Yes, you're agreeing with what you quoted.
--
Glenn Maynard
Gaspar Sinai
2002-05-02 08:31:12 UTC
Permalink
Post by Glenn Maynard
I wrote the patch for PuTTY to accept input from Win2K's IME, and some
fixes for Vim's. What I said is all that's necessary for simple
support, and the vast majority of applications don't need any more than
that.
Of course, what you do with this input is up to the application, and if
you have no support for storing anything but text in the system codepage,
there might be a lot of work to do. That's a different topic entirely,
of course.
Post by Jungshik Shin
It just occurred to me that Mozilla.org has an excellent summary
of input method support on the three major platforms (Unix/X11,
MacOS, MS-Windows). See
http://www.mozilla.org/projects/intl/input-method-spec.html.
I've never seen any application do anything other than what this
describes as "Over-The-Spot composition". This includes system dialogs,
Word, Notepad and IE.
"Windows does not use the off-the-spot or over-the-spot styles of input."
As far as I know, Windows uses *only* "over-the-spot" input. Perhaps
on-the-spot can be implemented (and most people would probably agree
that it's cosmetically better), but it would probably take a lot more
work.
http://zewt.org/~glenn/over1.jpg
http://zewt.org/~glenn/over2.jpg
(The rest of the first half of the document describes input styles that
most programs don't use.) The document states "Last modified May 18,
1999", so the information on it is probably out of date.
The only other thing you have to handle is described in "Platform
Protocols": WM_IME_COMPOSITION. The other two messages can be ignored.
The only API function listed here that's often needed is SetCaretPosition,
to set the cursor position.
I also found it easy to program the input method for my Windows
port of Yudit. What I could not figure out was how to do simple
things like setting the background and foreground colors of the
input method window. My guess is that you cannot set them
programmatically on Windows - on X11 it is trivial.

What X needs is:

o Library with a simple interface for pluggable input methods, with
good docs, so that anyone could contribute easily. The number of
input methods will grow in inverse proportion to the work that
needs to be done to develop one input method. For instance, making
an input method a single text file encouraged Yudit users to
contribute over 80 input maps so far.

The development time required is important. People usually write
just a few input methods for the scripts they know - if there is a
steep learning curve they might not even start. I did look at IIIMF:
http://www.li18nux.org/subgroups/im/IIIMF/index.html
and tried to make it work - but it required too much time so
I gave up.

o A collection of input methods that can be transparently activated
and can pass UTF-8 input strings from any of its input methods,
selectable programmatically, or even from an external GUI of the
input method collection. This way we could avoid restarting
naive programs with a new XMODIFIERS variable.

Cheers
gaspar
Roger So
2002-05-02 13:43:29 UTC
Permalink
Post by Gaspar Sinai
The development time required is important. People usually write
just a few input methods for the scripts they know - if there is a
steep learning curve they might not even start. I did look at IIIMF:
http://www.li18nux.org/subgroups/im/IIIMF/index.html
and tried to make it work - but it required too much time so
I gave up.
Have you revisited the above page recently? In particular, a C library
(libiiimf) will be released shortly, which will greatly simplify the
development of IIIMP clients and servers, just as Xlib did for the
X protocol. (Well, Xlib is just bearable, but it's much better than
working with the raw protocol.)
Post by Gaspar Sinai
o A collection of input methods that can be transparently activated
and can pass UTF-8 input strings from any of its input methods,
selectable programmatically, or even from an external GUI of the
input method collection. This way we could avoid restarting
naive programs with a new XMODIFIERS variable.
IIIM clients are meant to be able to switch to different IIIM language
engines at runtime. It's implemented in Xlib-I18N, available as part of
the IIIMF source. It's also implemented in Solaris 8 and 9.

Regards
Roger
--
Roger So Debian Developer
Sun Wah Linux Limited i18n/L10n Project Leader
Tel: +852 2250 0230 ***@sw-linux.com
Fax: +852 2259 9112 http://www.sw-linux.com/
Jungshik Shin
2002-05-02 06:14:29 UTC
Permalink
Post by Tomohiro KUBOTA
At Wed, 01 May 2002 20:02:57 +0100,
Post by Markus Kuhn
I have for some time now been using UTF-8 more frequently than
ISO 8859-1. The three critical milestones that still keep me from
moving entirely to UTF-8 are
How about bash? Do you know of any improvements?
Please note that tcsh already supports east Asian EUC-like
multibyte encodings. I don't know whether it also supports UTF-8.
It doesn't seem to support UTF-8 locales as of tcsh 6.10.0
(2000-11-19). I can't find anything about UTF-8 at
http://www.tcsh.org. The newest release is 6.11.0. The same is true
of zsh (http://www.zsh.org).
Post by Tomohiro KUBOTA
combining characters? bidi? Arabic shaping? Indic scripts?
and Hangul :-)
Post by Tomohiro KUBOTA
Mongolian (which needs vertical writing)? How about wcwidth()?
Pango and ST should certainly help, here....
Post by Tomohiro KUBOTA
* input methods
Any way to input complex languages which cannot be supported
by the xkb mechanism (e.g., CJK)? XIM? IIIMP? (How about Gnome2?)
You meant IIIMF, didn't you? If there's an actual implementation,
I'd love to try it out. We need a Windows 2k/XP or MacOS 9/X style
keyboard/IM switching mechanism/UI so that keyboard/IM modules
targeted at/customized for each language can coexist and be brought
up as necessary. IIIMF seems to be the only way unless somebody
writes a gigantic one-size-fits-all XIM server for UTF-8 locale(s).

How about just running your favorite XIM under ja_JP.EUC-JP while
all other applications are launched under ja_JP.UTF-8? As you know
well, that just works fine, although the character repertoire you can
enter is limited to that of EUC-JP. Of course, this is not full-blown
UTF-8 support, but at least it should give you the same degree of
Japanese input support under ja_JP.UTF-8 as under ja_JP.EUC-JP. Well,
then you would ask what the point of moving to UTF-8 is. You can at
least display more characters under UTF-8 than under EUC-JP, can't
you? :-)

In the Korean case, as I wrote a couple of days ago, I had to
modify Ami (a popular Korean XIM) to make it run under ko_KR.UTF-8,
because otherwise, even though my applications were running under
and fully aware of UTF-8 (e.g. vim under a UTF-8 xterm), I couldn't
enter the over 8,000 Hangul syllables that are in Unicode but not in
EUC-KR. Moreover, under ko_KR.UTF-8, XTerm-16x and Vim 6.1 with a
single-line patch work almost flawlessly with the U+1100 Hangul
Jamos. Markus, can you update your UTF-8 FAQ on this issue? XTerm
has been supporting the Thai script, and that almost automagically
brought in Middle Korean support as a by-product.

BTW, Xkb may work for Korean Hangul too, and we don't need XIM if
we use the 'three-set keyboard' instead of the 'two-set keyboard'
and can live without Hanja. I have to learn more about Xkb to be
certain, though.
Post by Tomohiro KUBOTA
Or, any software-specific input methods (like Emacs or Yudit)?
Yudit supports Indic, Thai, and Arabic pretty well as far as I know.
And, judging from what Gaspar wrote to me, Middle Korean support
with the U+1100 jamo is not far away. Most of what's necessary is
firmly in place, because Gaspar has written very generic
complex-script support routines which can hopefully be used for
Middle Korean without much effort.

Jungshik Shin
Tomohiro KUBOTA
2002-05-02 07:11:03 UTC
Permalink
Hi,

At Thu, 2 May 2002 02:14:29 -0400 (EDT),
Post by Jungshik Shin
You meant IIIMF, didn't you? If there's an actual implementation,
I'd love to try it out. We need a Windows 2k/XP or MacOS 9/X style
keyboard/IM switching mechanism/UI so that keyboard/IM modules
targeted at/customized for each language can coexist and be brought
up as necessary. IIIMF seems to be the only way unless somebody
writes a gigantic one-size-fits-all XIM server for UTF-8 locale(s).
I heard from the Project HEKE people (http://www.kmc.gr.jp/proj/heke/)
that IIIMF has some security problems. I don't know whether that is
true, nor whether the problem (if any) has been solved.

There _is_ already an implementation of IIIMF. You can download
it from the Li18nux site. However, I did not succeed in trying it.
Since I have heard several reports from IIIMF users, the failure is
probably simply my fault.

There seem to be some XIM-based implementations which can input
multiple complex languages.

One is the "ximswitch" program in the Kondara Linux distribution
(http://www.kondara.org). I downloaded it but have not tested it yet.

Another is mlterm (http://mlterm.sourceforge.net/), which is an
entirely client-side solution for switching between multiple XIM
servers. Though I don't think it is a good idea to require clients
to have such mechanisms, it is so far the only practical way to
realize multiple-language input.
Post by Jungshik Shin
How about just running your favorite XIM under ja_JP.EUC-JP while
all other applications are launched under ja_JP.UTF-8? As you know
well, that just works fine, although the character repertoire you can
enter is limited to that of EUC-JP. Of course, this is not full-blown
UTF-8 support, but at least it should give you the same degree of
Japanese input support under ja_JP.UTF-8 as under ja_JP.EUC-JP. Well,
then you would ask what the point of moving to UTF-8 is. You can at
least display more characters under UTF-8 than under EUC-JP, can't
you? :-)
So far, there is no conversion engine which requires a character
set beyond EUC-JP. Thus, EUC-JP is enough for now. If someone wants
to develop an input engine which supports more characters, he/she
will want to use UTF-8. However, I think nobody in Japan feels a
strong need for it, beyond pure technical interest in Unicode
itself.
Post by Jungshik Shin
BTW, Xkb may work for Korean Hangul too, and we don't need XIM if
we use the 'three-set keyboard' instead of the 'two-set keyboard'
and can live without Hanja. I have to learn more about Xkb to be
certain, though.
I see. This is not true for Japanese. Japanese people do need
grammar and context analysis software to get Kanji text.
How about Chinese?


---
Tomohiro KUBOTA <***@debian.org>
http://www.debian.or.jp/~kubota/
"Introduction to I18N" http://www.debian.org/doc/manuals/intro-i18n/
Roger So
2002-05-02 13:54:37 UTC
Permalink
Post by Tomohiro KUBOTA
There _is_ already an implementation of IIIMF. You can download
it from the Li18nux site. However, I did not succeed in trying it.
Since I have heard several reports from IIIMF users, the failure is
probably simply my fault.
Note that the source from Li18nux will try to use its own encoding
conversion mechanism on Linux, which is broken. You need to tell it
to use iconv instead.

Maybe I should attempt to package it for Debian again, now that woody is
almost out of the way. (I have the full IIIMF stuff working well on my
development machine.)
Post by Tomohiro KUBOTA
Post by Jungshik Shin
BTW, Xkb may work for Korean Hangul too, and we don't need XIM if
we use the 'three-set keyboard' instead of the 'two-set keyboard'
and can live without Hanja. I have to learn more about Xkb to be
certain, though.
I see. This is not true for Japanese. Japanese people do need
grammar and context analysis software to get Kanji text.
How about Chinese?
I don't think xkb is sufficient because (1) there's a large number of
different Chinese input methods out there, and (2) most of the input
methods require the user to choose from a list of candidates after
preedit.

I _do_ think xkb is sufficient for Japanese though, if you limit
"Japanese" to only hiragana and katakana. ;)

Regards
Roger
--
Roger So Debian Developer
Sun Wah Linux Limited i18n/L10n Project Leader
Tel: +852 2250 0230 ***@sw-linux.com
Fax: +852 2259 9112 http://www.sw-linux.com/
Tomohiro KUBOTA
2002-05-05 11:00:22 UTC
Permalink
Hi,

At 02 May 2002 23:54:37 +1000,
Post by Roger So
Note that the source from Li18nux will try to use its own encoding
conversion mechanism on Linux, which is broken. You need to tell it
to use iconv instead.
I didn't know that, because I am not a user of IIIMF or other
Li18nux products. How is it broken?
Post by Roger So
Maybe I should attempt to package it for Debian again, now that woody is
almost out of the way. (I have the full IIIMF stuff working well on my
development machine.)
I found that Debian has an "iiimecf" package. Do you know what it is?
Post by Roger So
I don't think xkb is sufficient because (1) there's a large number of
different Chinese input methods out there, and (2) most of the input
methods require the user to choose from a list of candidates after
preedit.
I _do_ think xkb is sufficient for Japanese though, if you limit
"Japanese" to only hiragana and katakana. ;)
I believe you are joking about such a limitation.
The Japanese language has far fewer vowels and consonants than
Korean, which results in many more homonyms than in Korean. Thus, I
think native Japanese speakers won't decide to abolish Kanji.
(Please don't joke on an international mailing list, because people
who don't know about Japanese may think you are being serious.)

Even if we limit ourselves to input of hiragana/katakana, xkb may
not be sufficient. For a one-key-one-hiragana/katakana method, I
think xkb can be used. However, more than half of Japanese computer
users use Romaji-kana conversion, a roughly two-keys-one-kana
method. The complexity of the algorithm is like that of a two- or
three-key input method for Hangul, I think. Do you think such an
algorithm can be implemented with xkb? If yes, I think Romaji-kana
conversion can be implemented with xkb as well.
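The kind of conversion being discussed can be sketched in a few lines (a toy illustration, not any real input method; the table and function names are invented): Romaji-kana conversion is essentially longest-prefix matching of ASCII key sequences against a conversion table, which is more stateful than xkb's one-keysym-per-key model.

```python
# Toy Romaji-to-hiragana converter: each kana needs one to three
# ASCII keys, so the converter must do longest-prefix matching over
# the input buffer. ROMAJI is a tiny subset of a real table.
ROMAJI = {
    "a": "あ", "i": "い", "u": "う", "ka": "か", "na": "な",
    "ni": "に", "ho": "ほ", "shi": "し", "n": "ん",
}

def to_kana(romaji: str) -> str:
    out, i = [], 0
    while i < len(romaji):
        # Try the longest possible key first (3, then 2, then 1 chars).
        for length in (3, 2, 1):
            chunk = romaji[i:i + length]
            if chunk and chunk in ROMAJI:
                out.append(ROMAJI[chunk])
                i += len(chunk)
                break
        else:
            out.append(romaji[i])  # pass through anything unknown
            i += 1
    return "".join(out)

print(to_kana("nihon"))  # → にほん
```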

---
Tomohiro KUBOTA <***@debian.org>
http://www.debian.or.jp/~kubota/
"Introduction to I18N" http://www.debian.org/doc/manuals/intro-i18n/
Roger So
2002-05-05 14:21:41 UTC
Permalink
Post by Tomohiro KUBOTA
At 02 May 2002 23:54:37 +1000,
Post by Roger So
Note that the source from Li18nux will try to use its own encoding
conversion mechanism on Linux, which is broken. You need to tell it
to use iconv instead.
I didn't know that, because I am not a user of IIIMF or other
Li18nux products. How is it broken?
The csconv library that IIIMF comes with doesn't work properly (at least
I didn't get it to work), possibly because of endianness issues. csconv
is meant to be a cross-platform replacement for iconv.
Post by Tomohiro KUBOTA
Post by Roger So
Maybe I should attempt to package it for Debian again, now that woody is
almost out of the way. (I have the full IIIMF stuff working well on my
development machine.)
I found that Debian has "iiimecf" package. Do you know what it is?
It's the IIIM Emacs Client Framework. As the name implies, it's an
implementation of an IIIM client in Emacs. I've never tried it out, as
I don't use Emacs. :)

Is it used by anyone? Last time I checked, popularity-contest said
nobody was using it...
Post by Tomohiro KUBOTA
Post by Roger So
I _do_ think xkb is sufficient for Japanese though, if you limit
"Japanese" to only hiragana and katakana. ;)
I believe you are joking about such a limitation.
The Japanese language has far fewer vowels and consonants than
Korean, which results in many more homonyms than in Korean. Thus, I
think native Japanese speakers won't decide to abolish Kanji.
(Please don't joke on an international mailing list, because people
who don't know about Japanese may think you are being serious.)
Sorry, it wasn't meant to be a serious comment. :)

Cheers

Roger
--
Roger So Debian Developer
Sun Wah Linux Limited i18n/L10n Project Leader
Tel: +852 2250 0230 ***@sw-linux.com
Fax: +852 2259 9112 http://www.sw-linux.com/
Bruno Haible
2002-05-06 10:53:12 UTC
Permalink
Post by Roger So
The csconv library that IIIMF comes with doesn't work properly (at least
I didn't get it to work), possibly because of endianess issues. csconv
is meant to be a cross-platform replacement for iconv.
Furthermore, it is not multithread-safe (because it uses setlocale()
internally), whereas iconv is.

Bruno
Jungshik Shin
2002-05-05 23:12:31 UTC
Permalink
Post by Tomohiro KUBOTA
At 02 May 2002 23:54:37 +1000,
Post by Roger So
I _do_ think xkb is sufficient for Japanese though, if you limit
"Japanese" to only hiragana and katagana. ;)
I believe you are joking about such a limitation.
The Japanese language has far fewer vowels and consonants than Korean,
which results in many more homonyms than in Korean. Thus, I think
Well, actually it's not so much the difference in
the number of consonants and vowels as the fact that Korean has
both closed and open syllables, while Japanese has only open syllables,
that makes Japanese have many more homonyms than Korean.
Post by Tomohiro KUBOTA
native Japanese speakers won't decide to abolish Kanji.
I don't think the Japanese ever will, either. However, I'm afraid
having too many homonyms is a little too 'feeble' a 'rationale' for
not being able to convert to all-phonetic scripts like Hiragana and
Katakana. The easiest counter-argument is to ask how Japanese speakers
can tell which homonym is meant in oral communication, if Kanji is so
important for disambiguating homonyms. They don't have any Kanji to
help them (well, sometimes you may have to write down Kanji to break
the ambiguity in the middle of a conversation, but I guess that's mostly
limited to proper nouns). I've heard that they don't have much trouble,
because context helps a listener a lot in figuring out which
of many homonyms is meant by a speaker. This is true in any language.
Arguably, the same thing could help readers in written communication.
Of course, using logographic/ideographic characters like Kanji certainly
helps readers very much, and that should be a very good reason for Japanese
to keep Kanji in their writing system.

The English writing system is also 'logographic' in a sense (as is modern
Korean orthography in pure Hangul, since it departs from strict agreement
between pronunciation and spelling), and a spelling reform (giving
English a degree of agreement between spelling and
pronunciation similar to Spanish) would make written
text harder to read, depriving English written text of its 'logographic'
nature. On the other hand, it would help learners and writers. It has
always been a struggle between readers and writers, listeners and speakers....
Post by Tomohiro KUBOTA
xkb can be used. However, more than half of Japanese computer
users use Romaji-kana conversion, two-keys-one-hiragana/katakana
method. The complexity of the algorithm is like that of the two- or
three-key input method of Hangul, I think. Do you think such an algorithm
can be implemented as xkb? If yes, I think Romaji-kana conversion
(whose complexity is like that of the Hangul input method) can be implemented
as xkb.
I would also like to know whether it's possible with Xkb. BTW, if
we use three-set keyboards (where leading consonants and trailing
consonants are assigned separate keys) and use U+1100 Hangul Conjoining
Jamos, Korean Hangul input is entirely possible with Xkb alone.

Jungshik Shin
Tomohiro KUBOTA
2002-05-06 01:11:55 UTC
Permalink
Hi,

At Sun, 5 May 2002 19:12:31 -0400 (EDT),
Post by Jungshik Shin
Post by Tomohiro KUBOTA
I believe you are joking about such a limitation.
The Japanese language has far fewer vowels and consonants than Korean,
which results in many more homonyms than in Korean. Thus, I think
Well, actually it's not so much the difference in
the number of consonants and vowels as the fact that Korean has
both closed and open syllables, while Japanese has only open syllables,
that makes Japanese have many more homonyms than Korean.
You may be right. Anyway, the true reason is that the Japanese
language has a lot of words from old Chinese. Words
which are not homonyms in Chinese may become homonyms in Japanese.
(They may or may not be homonyms in Korean. I believe that
Korean also has a lot of Chinese-origin words.) Since the way to
coin a new word is based on the Kanji system, the Japanese language
would lose vitality without Kanji.
Post by Jungshik Shin
I don't think the Japanese ever will, either. However, I'm afraid
having too many homonyms is a little too 'feeble' a 'rationale' for
not being able to convert to all-phonetic scripts like Hiragana and
Katakana.
...
Since I don't represent Japanese people, I won't say whether it is
a good idea or not to have many homonyms. You are right, there
are many other reasons for/against using Kanji, and I cannot
explain everything.

Japanese pronunciation does cause trouble, though it is often
helped by accent or rhythm. However, in some cases, neither
accent nor context can help. For example, both science and
chemistry are "kagaku" in Japanese. So we sometimes call
chemistry "bakegaku", where "bake" is another reading of the
"ka" in chemistry. Another famously confusing pair of words
is "private (organization)" and "municipal (organization)",
both called "shiritu". Thus, "private" is sometimes
called "watakushiritu" and "municipal" "ichiritu";
again, these alias names come from different readings of the kanji.
If you listen to Japanese news programs every day, you will
encounter such examples some day.

These days more and more Japanese people want to learn more
Kanji to exploit its rich expressive power, though
I am not one of these Kanji learners.
Post by Jungshik Shin
I would also like to know whether it's possible with Xkb. BTW, if
we use three-set keyboards (where leading consonants and trailing
consonants are assigned separate keys) and use U+1100 Hangul Conjoining
Jamos, Korean Hangul input is entirely possible with Xkb alone.
Note for xkb experts who don't know Hiragana/Katakana/Hangul:
input methods of these scripts need backtracking. For example,
in Hangul, imagine I hit keys in the c-v-c-v (c: consonant,
v: vowel) sequence. When I hit c-v-c, it should represent one
Hangul syllable "c-v-c". However, when I hit the next v, it
should be two Hangul syllables of "c-v c-v".

In Hiragana/Katakana, processing of "n" is complex (though
it may be less complex than Hangul).
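The backtracking described above can be modelled as a small automaton. A minimal sketch (the letters c/v stand in for real jamo keys, and the function name is made up for illustration):

```python
# Sketch of the two-set Hangul backtracking described above: a consonant
# typed after c-v is provisionally a trailing consonant, but if a vowel
# follows, it must be re-read as the lead of the next syllable.

def split_syllables(keys: str) -> list:
    """Group a sequence of 'c'/'v' keys into syllables."""
    syllables, current = [], ""
    for i, k in enumerate(keys):
        if k == "v":
            current += k
            continue
        nxt = keys[i + 1] if i + 1 < len(keys) else ""
        if "v" in current and nxt == "v":
            # Backtrack: this consonant opens the next syllable.
            syllables.append(current)
            current = k
        else:
            current += k
    if current:
        syllables.append(current)
    return syllables

print(split_syllables("cvcv"))  # ['cv', 'cv']
print(split_syllables("cvc"))   # ['cvc']
```

The one-key lookahead is the whole difficulty: a plain key-to-symbol table cannot express "the meaning of this consonant depends on the next key".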

---
Tomohiro KUBOTA <***@debian.org>
http://www.debian.or.jp/~kubota/
"Introduction to I18N" http://www.debian.org/doc/manuals/intro-i18n/
Pablo Saratxaga
2002-05-06 05:46:33 UTC
Permalink
Kaixo!
Post by Tomohiro KUBOTA
input methods of these scripts need backtracking. For example,
in Hangul, imagine I hit keys in the c-v-c-v (c: consonant,
v: vowel) sequence. When I hit c-v-c, it should represent one
Hangul syllable "c-v-c". However, when I hit the next v, it
should be two Hangul syllables of "c-v c-v".
That is only the case with a 2-mode keyboard; with a 3-mode keyboard there
is no ambiguity, as there are three groups of keys V, C1, C2, allowing
for all the possible combinations: V-C2, C1-V-C2. E.g. there are two keys
for each consonant: one for the leading syllable consonant, and one for the
ending syllable consonant. (I think the small round glyph to fill an empty
place in a syllable is always at place C2; that is, c-v is always written
C1-V-C2 with a special C2 that is not written in Latin transliteration.)
Post by Tomohiro KUBOTA
In Hiragana/Katakana, processing of "n" is complex (though
it may be less complex than Hangul).
No. The "N" is just a kana like any other, no complexity at all involve=
d.
Complexity only happens when typing in latin letters. That is why
the use of transliteration typing will always require an input
method anyways, it cannot be handled with just Xkb.
Post by Tomohiro KUBOTA
---
http://www.debian.or.jp/~kubota/
"Introduction to I18N" http://www.debian.org/doc/manuals/intro-i18n/
--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/
--
Ki ça vos våye bén,
Pablo Saratxaga

http://www.srtxg.easynet.be/ PGP Key available, key ID: 0x8F0E4975
Jungshik Shin
2002-05-07 02:43:32 UTC
Permalink
Post by Tomohiro KUBOTA
input methods of these scripts need backtracking. For example,
in Hangul, imagine I hit keys in the c-v-c-v (c: consonant,
v: vowel) sequence. When I hit c-v-c, it should represent one
Hangul syllable "c-v-c". However, when I hit the next v, it
should be two Hangul syllables of "c-v c-v".
That is only the case with a 2-mode keyboard; with a 3-mode keyboard there
is no ambiguity, as there are three groups of keys V, C1, C2, allowing
for all the possible combinations: V-C2, C1-V-C2. E.g. there are two keys

'V-C2 and C1-V-C2' should be 'C1-V and C1-V-C2' :-)

To go all the way with Xkb, even the three-set keyboard layout has to be
modified a little, because some clusters of vowels and consonants
are not assigned separate keys but have to be entered by a sequence
of keys assigned to basic/simple vowels and consonants. Alternatively,
programs have to be modified to truly support the 'L+V+T*' model of Hangul
syllables as stipulated in TUS 3.0, p. 53.
for each consonant: one for the leading syllable consonant, and one for the
ending syllable consonant. (I think the small round glyph to fill an empty
place in a syllable is always at place C2; that is, c-v is always written
C1-V-C2 with a special C2 that is not written in Latin transliteration.)

You almost got it right, except that IEung ('ㅇ') is NULL at the
syllable onset position (i.e. it's a place holder for syllables that
begin with a vowel and does not appear in Latin transliteration). IEung
is not NULL at the syllable coda position but corresponds to [ng] (IPA:
[ŋ]) as in 'young'. To put it your way, a V-C2 syllable is always written
as IEung-V-C2 with IEung having no phonetic value. Here I assumed
we're not talking about the orthography of the 15th century ;-)

Jungshik Shin
Tomohiro KUBOTA
2002-05-07 02:54:09 UTC
Permalink
Hi,

At Mon, 6 May 2002 07:46:33 +0200,
Post by Tomohiro KUBOTA
In Hiragana/Katakana, processing of "n" is complex (though
it may be less complex than Hangul).
No. The "N" is just a kana like any other, no complexity at all involved.
Complexity only happens when typing in latin letters. That is why
the use of transliteration typing will always require an input
method anyways, it cannot be handled with just Xkb.
In my above sentence, "n" is a Latin letter. It may correspond to
HIRAGATA/KATAKANA LETTER N *or* 1st key stroke to n-a, n-i, n-u, n-e,
n-o, n-y-a, n-y-u, or n-y-o. (Key strokes of n-y-a should give
HIRAGANA/KATAKANA LETTER NI and following HIRAGANA/KATAKANA LETTER
SMALL YA.)

Anyway, I understand your point that Latin -> Hiragana/Katakana
cannot be implemented as xkb.

---
Tomohiro KUBOTA <***@debian.org>
http://www.debian.or.jp/~kubota/
"Introduction to I18N" http://www.debian.org/doc/manuals/intro-i18n/
Pablo Saratxaga
2002-05-02 11:26:46 UTC
Permalink
Kaixo!

On Thu, May 02, 2002 at 02:14:08AM -0400, Jungshik Shin wrote:
Post by Jungshik Shin
BTW, Xkb may work for Korean Hangul, too and we don't need
XIM if we use 'three-set keyboard' instead of 'two-set keyboard'
If it is indeed doable, it wouldn't be very practical due to the large
number of possible combinations (the Xkb-based solution consists of
populating the Compose file with the various possibilities, I suppose).

But it's true that hangul-only typing doesn't require any user
interactivity at all; an on-the-spot method that just analyzes the input
and converts it to preformed hangul syllables on the fly is enough.

Note that Vietnamese also has a similar need for such a kind of simple
input method; but as the raw XIM protocol is too complex for a simple
non-interactive input method, nobody has yet gone to the
trouble of writing one (currently you can type Vietnamese using dead keys
for the accents and the usual Xkb mechanism, as for other Latin-alphabet
based languages, but that isn't the preferred typing method for Vietnamese
people).


Now, if by "Xkb is enough to type Korean" you meant typing directly
the single jamos without composing, yes, that's perfectly possbile;
but the produced output won't be in the standardized precomposed form
for the common korena syllabes, that could be a compatibility problem
if you exchange files written that way.
Post by Jungshik Shin
Post by Tomohiro KUBOTA
Or, any software-specific input methods (like Emacs or Yudit)?
=20
Yudit supports Indic, Thai, Arabic pretty well as far as I know.
Well, until now it used ASCII transliteration tables, which is a bit
of a pain if you use a non-US keyboard on a regular basis, as you can't
directly type the characters available on your keyboard; that is
counterintuitive and frustrating.
However, I read on the web page of yudit that the latest version supports
Xutf8LookupString. That's good news, as it would allow someone to easily
type in Indic, Thai, Arabic, etc. without having to do anything special
at all, just using their standard keyboard layout, as in all other
programs. (The transliteration tables are good to have for the occasional
input in other languages/scripts, but for the default everyday input
in the user's language they shouldn't be needed.)

--
Ki ça vos våye bén,
Pablo Saratxaga

http://www.srtxg.easynet.be/ PGP Key available, key ID: 0x8F0E4975
Jungshik Shin
2002-05-05 23:33:57 UTC
Permalink
Post by Pablo Saratxaga
Post by Jungshik Shin
BTW, Xkb may work for Korean Hangul, too and we don't need
XIM if we use 'three-set keyboard' instead of 'two-set keyboard'
If it is indeed doable, it wouldn't be very practical due to the large
number of possible combinations (the Xkb-based solution consists of
As you correctly noticed, I was talking about using U+1100 Jamos.
Some Koreans believe it was the greatest mistake of the Korean nat'l
standard body to insist that 11,172 syllables be encoded in Unicode/ISO
10646 and to prevail in ISO/IEC JTC1/SC2/WG2. The 11,172 precomposed
syllables are not sufficient for _even modern_ Korean; we need U+1100
Jamo support anyway for Middle (and future) Korean, and the inclusion of
the 11,172 syllables only delayed support for U+1100 Jamos by giving (or
rather strengthening) a **false** impression among developers that Hangul
doesn't need to be treated as a complex script (as Indic and Thai scripts
are). For instance, Sun's complex text/script support plan _did_ not mention
Korean Hangul while listing various South and Southeast Asian scripts and
Hebrew and Arabic as targets of complex text processing. It's frustrating
to have to debunk, time and again, the myth that Korean Hangul can be
treated just like the Japanese and Chinese writing systems and that it has
nothing in common with South and Southeast Asian scripts. At the
moment, it's only Microsoft that fully understands the issue and offers
the full range of Hangul support. Fortunately, Pango is moving in
the right direction, and hopefully ST and ATSUI (of MacOS) will help, too.
Post by Pablo Saratxaga
But it's true that hangul-only typing doesn't require any user
interactivity at all; an on-the-spot method that just analyzes the input
and converts it to preformed hangul syllables on the fly is enough.
Yup. It's basically a not-so-complex automaton. For a three-set
keyboard it's very simple, while for a two-set keyboard it's a bit more
complicated. An automaton for a two-set Middle Korean keyboard would be much
more complicated than for a three-set one. Of course, Hanja
input does require dictionary look-up and user interaction.
Post by Pablo Saratxaga
Now, if by "Xkb is enough to type Korean" you meant typing directly
the single jamos without composing, yes, that's perfectly possbile;
You have to note that I had two conditions under which
that might be possible. One of them is that a 'three-set' keyboard is used.
A 'three-set' keyboard distinguishes between leading consonants
and trailing consonants, while a 'two-set' keyboard doesn't.
The other is that we use U+1100 Jamos to represent Hangul.
Post by Pablo Saratxaga
but the produced output won't be in the standardized precomposed form
for the common Korean syllables, which could be a compatibility problem
if you exchange files written that way.
Well, I have to put 'standard' in quotes in "standard precomposed form" :-).
It's certainly true that the precomposed form is widely used for Korean.
However, many people including me want to go all the way to using
exclusively U+1100 Hangul Jamos for both modern and Middle Korean once
a large enough number of programs and fonts support that. To achieve
backward compatibility, post- and pre-processing (to convert modern syllables
into and out of NFC precomposed forms) can be done if necessary.
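For modern syllables, the post- and pre-processing mentioned here boils down to the arithmetic L/V/T composition defined in the Unicode standard. A minimal sketch (the function name is invented for illustration):

```python
# Compose one modern Hangul syllable from U+1100 conjoining jamo using
# the arithmetic mapping given in the Unicode standard.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
V_COUNT, T_COUNT = 21, 28

def compose_lvt(lead, vowel, trail=None):
    l = ord(lead) - L_BASE                   # leading consonant index
    v = ord(vowel) - V_BASE                  # vowel index
    t = ord(trail) - T_BASE if trail else 0  # trailing consonant index (0 = none)
    return chr(S_BASE + (l * V_COUNT + v) * T_COUNT + t)

# U+1112 HIEUH + U+1161 A + U+11AB NIEUN -> U+D55C
print(compose_lvt("\u1112", "\u1161", "\u11ab"))  # 한
```

Decomposition is the same arithmetic run backwards, so round-tripping between pure-jamo text and NFC precomposed form is mechanical for modern syllables; only Middle Korean clusters outside the 11,172 block need more care.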

Jungshik Shin

P.S. Attached is an example of Xkb definition for a 3-set Korean keyboard. It's
made by PARK Won Kyu <***@chem.skku.ac.kr>.
Bruno Haible
2002-05-02 12:23:19 UTC
Permalink
Post by Markus Kuhn
There is also bash/readline
SuSE 8.0 ships with a bash/readline that works fine with (at least)
width-1 characters in a UTF-8 locale.

There is also an alpha release of a readline version that attempts to
handle single-width, double-width and zero-width characters in all
multibyte locales. But it's alpha (read: it doesn't work for me yet).
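The width handling mentioned here amounts to classifying each character as zero-, single-, or double-width. A rough sketch using Unicode East Asian Width data (function names are made up, and a real readline would follow the locale's wcwidth() instead):

```python
# Rough width classification in the spirit of wcwidth(): combining marks
# take no columns, East Asian Wide/Fullwidth characters take two.
import unicodedata

def char_width(ch):
    if unicodedata.combining(ch):
        return 0  # zero-width combining mark
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1

def display_width(s):
    return sum(char_width(ch) for ch in s)

print(display_width("abc"))    # 3
print(display_width("日本語"))  # 6
```

Getting this mapping right per locale is most of the work: cursor motion, line wrapping, and redisplay all depend on column counts rather than byte or character counts.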

Bruno
Markus Kuhn
2002-05-02 18:07:27 UTC
Permalink
Post by Bruno Haible
There is also an alpha release of a readline version that attempts to
handle single-width, double-width and zero-width characters in all
multibyte locales. But it's alpha (read: it doesn't work for me yet).
Yes, it seems the train is rolling now for UTF-8 support in
bash/readline as well, which is excellent news.

ftp://ftp.cwru.edu/hidden/bash-2.05b-alpha1.tar.gz
ftp://ftp.cwru.edu/hidden/readline-4.3-alpha1.tar.gz

Anyone interested in joining the bash-testers list to help iron out any
problems with UTF-8 support in bash/readline should contact
Chet Ramey <***@po.cwru.edu>.

http://cnswww.cns.cwru.edu/~chet/readline/rltop.html

Markus
--
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org, WWW: <http://www.cl.cam.ac.uk/~mgk25/>
Mike Fabian
2002-05-10 23:18:22 UTC
Permalink
Post by Bruno Haible
Post by Markus Kuhn
There is also bash/readline
SuSE 8.0 ships with a bash/readline that works fine with (at least)
width-1 characters in a UTF-8 locale.
Seems to work fine for double-width characters as well, and it also
appears to work fine in legacy locales like ja_JP.eucJP, zh_TW.Big5
...
--
Mike Fabian <***@suse.de> http://www.suse.de/~mfabian
睡眠不足はいい仕事の敵だ。 ("Lack of sleep is the enemy of good work.")