"JJ" <***@vfemail.net> wrote
| You're right. The "Lucida Console" font does not have a Unicode block for
| CJK characters. However, I use the "Microsoft Sans Serif" font for the
| default Windows GUI via Windows Classic theme. "Microsoft Sans Serif" font
| does not have a Unicode block for CJK characters either. Yet, Windows
| Explorer can display the CJK characters correctly.
|
| It's similar to using the "Lucida Console" font (or any other
| TrueType/OpenType font) in Notepad. If you copy any CJK character from,
| e.g., Character Map, Notepad can display the characters correctly. This is
| possible because the system borrows character glyphs from other fonts which
| have them. CMD, however, behaves differently.
I just tested Lucida in my console window on XP.
I get a rectangle for a Chinese character. Ditto with
Notepad, which I keep set to Verdana. Windows
Explorer is probably more sophisticated. Likewise with
browsers. For instance, I keep a webpage for reference
that I created with the full unicode set, showing
each character as:

    decimal value    character    UTF-8 byte values
I set the font as Verdana in CSS, but foreign characters
still show up. Presumably the browser knows to pick a
font that suits. I know that Firefox has settings in
about:config for that. So if I use something like
&#24692; to show the Unicode Chinese character 恴
(24692 decimal, 6074 hex), the browser knows how to
deal with it. I suspect those fonts may be built in.
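A quick sketch in Python (nothing Windows-specific, just to
make the decimal/hex/entity relationship concrete):

    import html

    cp = 24692                          # the decimal codepoint
    print(hex(cp))                      # 0x6074 -- the hexadecimal version
    print(chr(cp))                      # the character itself: 恴
    print(html.unescape("&#24692;"))    # browsers resolve the decimal entity
    print(html.unescape("&#x6074;"))    # ...or the hex form, to the same 恴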
But browsers are designed to show anything graphical.
Plain text windows are usually designed to show only
one font. I'm surprised your Notepad shows the characters.
Maybe MS made it more sophisticated in Vista/7 and it's
no longer a plain Win32 text window.
Also note with respect to Mike S's post: Local codepage
has nothing to do with unicode characters. It started out
as ASCII, using one byte. In 7-bit ASCII, 0-127 are basic
English characters. With the need to support foreign
languages, ANSI was developed. Still one byte per character.
0-127 are still the same. 128-255 are displayed depending
on the local codepage. In the Western codepage (1252),
#149 is a bullet and #253 is "ý". In the Cyrillic
codepage (1251), #253 is the Russian letter "э"; in the
Turkish codepage (1254) it's the dotless "ı". The codepage
setting decides that. You can set your system to function
as Russian, Turkish, etc.
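A Python line per codepage shows the same byte landing on
different characters:

    b = bytes([253])              # a single byte, value 253 (0xFD)
    print(b.decode("cp1252"))     # 'ý' -- Western European
    print(b.decode("cp1251"))     # 'э' -- Cyrillic/Russian
    print(b.decode("cp1254"))     # 'ı' -- Turkish dotless i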
That solved the problem except for Korean, Chinese and
Japanese, which use a multibyte character set to deal with
the limitations of ANSI. Text is still parsed one byte at
a time, but certain "lead" byte values signal that the next
byte belongs to the same character, so one character can
take two bytes. So 65 is "A", for instance, but 120 65
might be the character for "tree" in the Japanese codepage.
(Just an example. I don't know the signifier numbers
offhand. Nor do I know Japanese. :)
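Here's the idea in Python, using the real Japanese codepage
(932, i.e. Shift-JIS) and the actual kanji for "tree":

    s = "A\u6728"              # "A" followed by the kanji for "tree"
    data = s.encode("cp932")
    print(list(data))          # "A" stays the single byte 65; the kanji
                               # becomes two bytes, the first of which is
                               # a lead byte (0x81-0x9F or 0xE0-0xFC)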
That's all in the world of one-byte encoding (which
confusingly includes multi-byte Asian characters).
Unicode, as Windows stores it, is a two-byte encoding
(UTF-16). All characters needed have a number of their
own. Russian characters are 1024-1279, for example.
Chinese characters are up in the 20,000s to 40,000s.
It's an entirely different approach. 0-127 are still
the same as ASCII, but the bytes for "ab" in ASCII or
ANSI are 97-98. In unicode they're 97-0-98-0 (the low
byte comes first on Windows). Always 2 bytes.
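In Python, the byte-level difference looks like this:

    s = "ab"
    print(list(s.encode("cp1252")))     # [97, 98]       -- ANSI, 1 byte each
    print(list(s.encode("utf-16-le")))  # [97, 0, 98, 0] -- unicode, 2 bytes each
    print(ord("ф"), ord("恴"))          # 1092 24692 -- a Cyrillic and a CJK number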
That created a problem. The computing world was
based on 1 byte = 1 character. Even multibyte encoding
reads one byte at a time, each byte a number from
0-255. Unicode is made up of numbers from 0 to
65535, using 2 bytes for each number. Completely
different encoding.
Unicode has been around for many years, but it
requires different treatment. Different programming
APIs. Webpages have traditionally been written in ANSI.
JPG EXIF tags are in ANSI. Etc. Unicode is also largely
superfluous to those of us in N. America and Europe. So
it's been slow to be adopted.
To make the transition smoother, UTF-8 was
created. UTF-8 is similar to the multibyte Asian
encoding. It renders each unicode character as one
ASCII byte, or as a lead byte followed by one or
more continuation bytes. So text can still be parsed
one byte at a time. Webpages can be ANSI or UTF-8
without changing the basic file structure. There
are no pesky null characters to screw things up.
All that's needed is for the browser to know which
way to parse. And of course, it still doesn't matter
much in the West. So everyone's happy. Since UTF-8
does actually function as unicode, codepages are
not used.
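One last Python sketch, showing the flag bytes at work:

    data = "a恴".encode("utf-8")
    print([hex(b) for b in data])   # ['0x61', '0xe6', '0x81', '0xb4']
    # 0x61 is plain ASCII "a". 0xE6 is a lead byte announcing a
    # 3-byte character; 0x81 and 0xB4 are continuation bytes (both
    # begin with the bits 10). Note there are no zero bytes, so old
    # byte-at-a-time code keeps working.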
Your console window probably deals in unicode.
But a font only contains glyphs for some of the
characters. So if the window can only render one
font at a time, then it won't be able to render
anything that Lucida Console doesn't have a glyph for.
That may be more than anyone cares to know. :)
But I figure it's worth explaining because the whole
thing can get very confusing and there's a lot of
misinformation about what's what when it comes to
character encoding.