"Wolf K" <***@sympatico.ca> wrote
| > The OP wants to use a plain-text editor that only uses standard ASCII
| > (not "extended ASCII", or codes - i. e. characters between 32 and 126
| > decimal [plus newline]). He hasn't said why yet, but I understand what
| > he wants. (I was going to say "... like Notepad", but Notepad does allow
| > so-called "Extended ASCII", i. e. one particular set of the codes up to
| > 255.) He is hoping for something that will render such text into
| > nearest-equivalent (such as quotes that have directional qualities all
| > into code 34 decimal).
| > []
|
| NB: ASCII is not ANSI. Ansi is ASCII plus codes 128 to 255.
|
| Note ANSI 171 and 187. These are diagonal quotes, equivalent to curly
| quotes.
|
171 and 187 are double chevrons. That's not the same
as curly quotes. You're thinking of 147 and 148.
But that's true only for the standard English codepage.
If you're Russian you'll see Cyrillic characters. Something
like a capital Y and an oval with a vertical line through it.
ASCII is standard in all uses and matches the same
numbers in unicode. It specifies a basic western character
set for byte values 0 to 127. ANSI uses a "local codepage" to
define characters 128-255, while retaining the ASCII values
up to 127. The standard webpage encoding used to be
Windows English codepage ANSI. (ASCII in most cases.)
Now UTF-8 is more common.
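For anyone who wants to see the codepage effect directly, here's a rough sketch in Python. The codec names (cp1252, cp1251, cp1253) are Python's names for the Western, Cyrillic and Greek Windows codepages, and the byte values are just ones I picked for illustration:

```python
# Decode the same byte value under different Windows codepages.
# The codepage list and sample bytes are illustrative choices.
CODEPAGES = ["cp1252", "cp1251", "cp1253"]  # Western, Cyrillic, Greek

def show_byte(value):
    """Return {codepage: character} for one byte value (0-255)."""
    raw = bytes([value])
    return {cp: raw.decode(cp, errors="replace") for cp in CODEPAGES}

# 65 is in the ASCII range; 171, 187 and 196 are above it.
for value in (65, 171, 187, 196):
    print(value, show_byte(value))
```

Byte 65 comes out as "A" everywhere because it's ASCII. Byte 196 comes out as a different letter in each codepage.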
UTF-8 is a way to express unicode as a stream of single
bytes. UTF-16, what's usually just referred to as unicode
on Windows, encodes most characters in 2 bytes (4 for the
rarer ones), so each character can have its own specific
number in order to fit English, Russian and everything
else. ASCII and ANSI use a one-byte-per-character
encoding, except with a few Asian languages.
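A quick way to check the byte counts, as a sketch only. The four sample characters are my own picks, and "utf-16-le" is used so Python doesn't add a byte order mark to the count:

```python
# Sample characters: A, e-acute, left curly quote, and an emoji.
samples = ["A", "\u00e9", "\u201c", "\U0001F600"]

def encoded_sizes(ch):
    """Bytes needed for one character in UTF-8 and in UTF-16 (no BOM)."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))

for ch in samples:
    utf8_len, utf16_len = encoded_sizes(ch)
    print(f"U+{ord(ch):04X}: UTF-8 {utf8_len} byte(s), UTF-16 {utf16_len} byte(s)")
```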
In order to internationalize the Web with minimal upset,
UTF-8 became standard. It allows for encoding unicode
in a one-byte system. The first 128 values are still ASCII.
Byte values 128 and up are combined into sequences of 2 to 4
bytes per character. Thus all languages can be encoded in one system.
It's still read 1 byte at a time and most webpages don't
change because most are still basically ASCII. (Whereas
if we'd converted to unicode, all webpage files would
have had to be converted to 2-byte encoding, making for
a lot of work and doubling the size of HTML files.)
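Here's a small sketch of how a reader going 1 byte at a time can tell how long a UTF-8 sequence is from its first byte. The function name and the sample HTML string are made up for the example:

```python
def utf8_sequence_length(first_byte):
    """Length of the UTF-8 sequence that starts with first_byte, or 0 if
    the byte cannot start one (a continuation byte or an invalid value)."""
    if first_byte < 0x80:
        return 1              # 0xxxxxxx: plain ASCII
    if 0xC2 <= first_byte <= 0xDF:
        return 2              # 110xxxxx
    if 0xE0 <= first_byte <= 0xEF:
        return 3              # 1110xxxx
    if 0xF0 <= first_byte <= 0xF4:
        return 4              # 11110xxx
    return 0                  # 0x80-0xC1 and 0xF5-0xFF

# A page that is pure ASCII has identical bytes in UTF-8,
# while the 2-byte encoding doubles it.
html = "<p>Hello, world</p>"
same_bytes = html.encode("utf-8") == html.encode("ascii")
print(same_bytes, len(html.encode("utf-8")), len(html.encode("utf-16-le")))
```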
The problem comes when UTF-8 is read as ANSI. (Most
text is still handled in one-byte-per-character ASCII/ANSI
encoding. Even things like JPG EXIF tags and PE file
import headers are ASCII/ANSI.)
There are 3 bytes in UTF-8 that indicate a left curly
quote: E2 80 9C in hex. Read as English ANSI those come out
as a lowercase a with a circumflex, a Euro sign, and an oe
ligature. In the browser it's a left curly quote. In
Notepad it shows up as 3 wacky characters. The two
programs are interpreting the bytes by different standards.
So the text is corrupted. And that's just in English. A
browser reading the UTF-8 can display it properly and in
most cases will "sniff" the page to identify it even if the
HTML code does not specify. But when that single-byte
text is pasted to ANSI you see the ANSI characters. You
might see the Euro. A Russian will see something else. A
Greek will see a third thing.
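The mix-up is easy to reproduce. This is only a sketch: the helper name is invented, and errors="replace" is there because some codepages leave a few byte values undefined:

```python
quote = "\u201c"                  # left curly double quote
raw = quote.encode("utf-8")       # the bytes a UTF-8 webpage would contain

def misread(data, codepage):
    """Decode UTF-8 bytes as if they were text in a one-byte codepage."""
    return data.decode(codepage, errors="replace")

print(" ".join(f"{b:02X}" for b in raw))
for cp in ("cp1252", "cp1251", "cp1253"):
    print(cp, repr(misread(raw, cp)))
```

Same three bytes each time. Only the interpretation changes.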
What Harry is asking for is a simple way to convert
UTF-8 to ANSI using the standard English codepage. That
requires converting the string by parsing
the bytes. When the parser encounters a byte of 128 or more
it would need to decide how to treat it. Is it an
ANSI bullet, character 149 in English? Or is byte 149
one of 2, 3, or 4 bytes, together indicating a character
in UTF-8? If it turns out to be, say, 3 bytes that render a
left curly quote in UTF-8, some kind of filter has to recognize
that exact pattern and say, "Oh, that's a quote. We'll just
substitute character 34 for those 3 bytes."
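One common way to make that decision, sketched here under the assumption that the input is either UTF-8 or English ANSI (cp1252) and nothing else: try strict UTF-8 first, since random ANSI text with high bytes rarely happens to be valid UTF-8, and fall back to the codepage if it fails. The function name is hypothetical:

```python
def decode_guess(data):
    """Return (text, encoding_used). Tries strict UTF-8 first and falls
    back to cp1252. This is a heuristic, not a guaranteed detection."""
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        return data.decode("cp1252", errors="replace"), "cp1252"

# Curly quotes stored as UTF-8 (3 bytes each) vs. as ANSI 147 and 148.
print(decode_guess(b"\xe2\x80\x9chi\xe2\x80\x9d"))
print(decode_guess(b"\x93hi\x94"))
```

Both inputs end up as the same text. After that the substitution step only has to deal with characters, not byte patterns.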
So Harry's solution has to treat each specific UTF-8
character and decide what to substitute. It's not a 1-to-1
correspondence. In other words, Notepad already translated
the UTF-8 to ANSI, but now it has to be transliterated.
If those quotes were written as character 34 in the first
place then the encoding would not matter. Everyone would
see ", because " is in the ASCII range.
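The substitution step could look something like this. It's a minimal sketch: the table is my own short list, not any standard, and anything not in the table simply becomes "?":

```python
# Hypothetical substitution table: fancy character -> plain ASCII stand-in.
ASCII_SUBSTITUTES = {
    "\u201c": '"', "\u201d": '"',    # curly double quotes
    "\u2018": "'", "\u2019": "'",    # curly single quotes
    "\u00ab": '"', "\u00bb": '"',    # double chevrons
    "\u2013": "-", "\u2014": "--",   # en dash, em dash
    "\u2026": "...",                 # ellipsis
    "\u2022": "*",                   # bullet
    "\u00a0": " ",                   # non-breaking space
}
TABLE = str.maketrans(ASCII_SUBSTITUTES)

def to_plain_ascii(text):
    """Replace known fancy characters, then turn anything else outside
    ASCII into '?'."""
    text = text.translate(TABLE)
    return text.encode("ascii", errors="replace").decode("ascii")

print(to_plain_ascii("\u201cHi\u201d \u2014 it\u2019s\u2026"))
```

It's not 1-to-1 because one character can turn into two or three, as with the ellipsis.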
Whiskers made an interesting point that I wasn't aware of:
The page he links says that MS Office products have an option
for fancy characters like curly quotes. Maybe that helps explain
why so many of them are on webpages. MS Office users are
among the most parochial of all computer users. They're usually
not tech-literate but are computer-literate. The result is millions
of people who equate their computer with MS Office and
assume the whole world also uses MS Office. They're the people
who send emails from Word or send a 60,000 byte DOC file to
communicate 1 sentence of 24 bytes. Many of those same people
are probably also creating webpages from MS Word, oblivious
of the travesty.