LANG, locale, unicode, setup.py and Debian packaging

Discussion:

Donn

2008-01-13 17:28:54 UTC

No. It may use replacement characters (i.e. a question mark, or an empty
square box), but if you don't see such characters, then the terminal has
successfully decoded the file names. Whether it also correctly decoded
them is something for you to check (i.e. do they look right?)

Okay.

So, the picture I get is:
*If* my locale *happens* to be the right one then the filename will appear
properly. If it does not cover that file, then that filename will appear
with ? marks in the name.
Because I use en_ZA.utf8 it's doing a very good job of decoding a wide variety
of filenames and therefore I rarely see ? characters.

What happens if there is a filename that cannot be represented in its
entirety? i.e. every character is 'replaced'. Does it simply vanish, or does
it appear as "?????????" ? :)

I spent an hour trying to find a single file on the web that did *not* have
(what seemed like) ascii characters in it and failed. Even urls on Japanese
websites use western characters (a TCP/IP issue, I suspect). I was hoping to
find a filename in Kanji (?) ending in .jpg or something so that I could
download it and see what my system (and Python) made of it.

Thanks again,
\d

--
"Life results from the non-random survival of randomly varying replicators."
-- Richard Dawkins

Fonty Python and other dev news at:
http://otherwiseingle.blogspot.com/

Donn

2008-01-13 11:51:58 UTC

Martin,

Yes. It does so when it fails to decode the byte string according to the
file system encoding (which, in turn, is based on the locale).

That's at least one way I can weed-out filenames that are going to give me
trouble; if Python itself can't figure out how to decode it, then I can also
fail with honour.

I will try the technique given
on: http://www.pyzine.com/Issue008/Section_Articles/article_Encodings.html#guessing-the-encoding
Perhaps that will help.

I would advise against such a strategy. Instead, you should first
understand what the encodings of the file names actually *are*, on
a real system, and draw conclusions from that.

I don't follow you here. The encoding of file names *on* a real system are
(for Linux) byte strings of potentially *any* encoding. os.listdir() may even
fail to grok some of them. So, I will have a few elements in a list that are
not unicode, I can't ask the O/S for any help and therefore I should be able
to pass that byte string to a function as suggested in the article to at
least take one last stab at identifying it.
Or is that a waste of time because os.listdir() has already tried something
similar (and prob. better)?

I notice that this doesn't include "to allow the user to enter file
names", so it seems there is no input of file names, only output.

I forgot to mention the command-line interface... I actually had trouble with
that too. The user can start the app like this:
fontypython /some/folder/
or
fontypython SomeFileName
And that introduces input in some kind of encoding. I hope that
locale.getpreferredencoding() will be the right one to handle that.

Is such input (passed-in via sys.argv) in byte-strings or unicode? I can find
out with type() I guess.
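
For readers on Python 3, which postdates this thread: there sys.argv holds str objects, and on POSIX they are decoded with the surrogateescape error handler, so the original bytes can be recovered. A minimal sketch, assuming a POSIX system; raw_argument is a hypothetical helper name, and the argument is simulated rather than read from a real command line:

```python
import os

def raw_argument(arg):
    """Return the original bytes behind a command-line argument (POSIX)."""
    # os.fsencode reverses the decoding applied to sys.argv, turning
    # escaped surrogates back into the bytes the shell passed in.
    return os.fsencode(arg)

# Simulate an argument whose bytes are Latin-1 and not valid UTF-8.
original = b"M\xd6gul.ttf"
as_text = os.fsdecode(original)      # roughly what sys.argv would hold
recovered = raw_argument(as_text)
```

In Python 2, the version used in the thread, sys.argv simply held byte strings on Unix.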

As to the rest, no, there's no other keyboard input for filenames. There *is*
a 'filter' which is used as a regex to filter 'bold', 'italic' or whatever. I
fully expect that to give me a hard time too.

Then I suggest this technique of keeping bytestring/unicode string
pairs. Use the Unicode string for display, and the byte string for
accessing the disc.

Thanks, that's a good idea - I think I'll implement a dictionary to keep both
and work things that way.
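
A minimal sketch of that dictionary, written in Python 3 syntax rather than the Python 2 of the thread. The name build_name_table and the default of UTF-8 are assumptions for illustration; keying by the raw bytes avoids collisions when two names decode to the same display text.

```python
import os
import tempfile

def build_name_table(directory, encoding="utf-8"):
    """Map each raw bytes file name to a displayable string."""
    table = {}
    # A bytes path makes os.listdir return undecoded bytes names.
    for raw in os.listdir(os.fsencode(directory)):
        # "replace" never raises; undecodable bytes become U+FFFD.
        table[raw] = raw.decode(encoding, "replace")
    return table

# Demonstration in a throwaway directory.
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, "plain.ttf"), "w") as handle:
    handle.write("")
names = build_name_table(demo_dir)
```

The values go to the GUI or console; the keys, joined onto the bytes directory path, go to open().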

u"M\xd6gul".encode("ascii","ignore")

'Mgul'

u"M\xd6gul".encode("ascii","replace")

'M?gul'

Well, that was what I expected to see too. I must have been doing something
stupid.

\d

"Martin v. Löwis"

2008-01-13 12:03:40 UTC

Post by Donn

I would advise against such a strategy. Instead, you should first
understand what the encodings of the file names actually *are*, on
a real system, and draw conclusions from that.

I don't follow you here. The encoding of file names *on* a real system are
(for Linux) byte strings of potentially *any* encoding.

No. On a real system, nothing is potential, but everything is actual.

So on *your* system, today: what encoding are the filenames encoded in?
We are not talking about arbitrary files, right, but about font files?
What *actual* file names do these font files have?

On my system, all font files have ASCII-only file names, even if they
are for non-ASCII characters.

Post by Donn
os.listdir() may even
fail to grok some of them. So, I will have a few elements in a list that are
not unicode, I can't ask the O/S for any help and therefore I should be able
to pass that byte string to a function as suggested in the article to at
least take one last stab at identifying it.

It won't identify it. It will just give you *some* Unicode string.

Post by Donn
Or is that a waste of time because os.listdir() has already tried something
similar (and prob. better)?

"better" is a difficult notion here. Is it better to produce some
result, possibly incorrect, or is it better to give up?

Post by Donn
I forgot to mention the command-line interface... I actually had trouble with
fontypython /some/folder/
or
fontypython SomeFileName
And that introduces input in some kind of encoding. I hope that
locale.getpreferredencoding() will be the right one to handle that.

If the user has set up his machine correctly: yes.

Post by Donn

u"M\xd6gul".encode("ascii","ignore")

'Mgul'

u"M\xd6gul".encode("ascii","replace")

'M?gul'

Well, that was what I expected to see too. I must have been doing something
stupid.

Most likely, you did not invoke .encode on a Unicode string.
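
To spell that out: in Python 2, calling .encode on a byte string first decoded it implicitly as ASCII, and that hidden decode raised no matter which error handler was passed to .encode. A small sketch in Python 3 syntax that makes the hidden step explicit (variable names are illustrative only):

```python
text = "M\xd6gul"              # a character string (u"..." in Python 2)
raw = text.encode("utf-8")     # a byte string: b'M\xc3\x96gul'

# Encoding a character string with an error handler never raises.
ignored = text.encode("ascii", "ignore")
replaced = text.encode("ascii", "replace")

# Python 2 ran this step silently when .encode was called on a byte
# string; the handler given to .encode did not apply to it.
try:
    raw.decode("ascii")
    hidden_step_failed = False
except UnicodeDecodeError:
    hidden_step_failed = True
```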

Regards,
Martin

"Martin v. Löwis"

2008-01-13 13:17:17 UTC

If you can all ls them, and if the file names come out right, then
they'll have the same encoding.

I can't always *type* some of their names and have to use copy/paste to, for
example, ls one of them.
Again, it's working from ignorance (my own) : I assume filenames in different
countries will be in character sets that I have never (nor will I ever) see.

I never heard before that font files use non-ASCII file names, and I
don't see the point in doing so - isn't there typically a font name
*inside* the font file as well, so that you'd rather use that for
display than the file name?

Of course, *other* files (text files, images etc) will often use
non-ASCII file names. However, they won't normally have mixed
encodings - most user-created files on a single system should typically
have the same encoding (there are exceptions possible, of course).

Post by "Martin v. Löwis"
If the user has set up his machine correctly: yes.

Meaning, I am led to assume, the LANG variable primarily?

Yes.

Regards,
Martin

Donn

2008-01-13 12:27:54 UTC

Post by "Martin v. Löwis"
So on *your* system, today: what encoding are the filenames encoded in?
We are not talking about arbitrary files, right, but about font files?
What *actual* file names do these font files have?
On my system, all font files have ASCII-only file names, even if they
are for non-ASCII characters.

I guess I'm confused by that. I can ls them, so they appear and thus have
characters displayed. I can open and cat them and thus the O/S can access
them, but I don't know whether their characters are strictly in ascii-limits
or drawn from a larger set like unicode. I mean, I have seen Japanese
characters in filenames on my system, and that can't be ascii.

You see, I have a large collection of fonts going back over 10 years and they
came from usenet years ago and so have filenames mangled all to hell.

I can't always *type* some of their names and have to use copy/paste to, for
example, ls one of them.

Again, it's working from ignorance (my own) : I assume filenames in different
countries will be in character sets that I have never (nor will I ever) see.
But I have to cover them somehow.

Post by "Martin v. Löwis"

Post by Donn
Or is that a waste of time because os.listdir() has already tried
something similar (and prob. better)?

"better" is a difficult notion here. Is it better to produce some
result, possibly incorrect, or is it better to give up?

I think I see, combined with your previous advice - I will keep byte strings
alongside unicode and where I can't get to the unicode for that string, I
will keep an 'ignore' or 'replace' unicode, but I will still have the byte
string and will access the file with that anyway.

Post by "Martin v. Löwis"
If the user has set up his machine correctly: yes.

Meaning, I am led to assume, the LANG variable primarily?

\d

Donn

2008-01-14 22:55:07 UTC

You get the full locale name with locale.setlocale(category) (i.e.
without the second argument)

Ah. Can one call it after the full call has been done:
locale.setlocale(locale.LC_ALL,'')
locale.setlocale(locale.LC_ALL)
Without any issues?

I need that two-letter code that's hidden in a
typical locale like en_ZA.utf8 -- I want that 'en' part.

Okay, I need it because I have a tree of dirs: en, it, fr and so on for the
help files -- it's to help build a path to the right html file for the
language being supported.

Not sure why you want that. Notice that the locale name is fairly system
specific, in particular on non-POSIX systems. It may be
"English_SouthAfrica" on some systems.

Wow, another thing I had no idea about. So far all I've seen are the
xx_yy.utf8 shaped ones.

I will have some trouble then, with the help system.

Thanks,
\d

--
"There may be fairies at the bottom of the garden. There is no evidence for
it, but you can't prove that there aren't any, so shouldn't we be agnostic
with respect to fairies?"
-- Richard Dawkins


"Martin v. Löwis"

2008-01-14 23:23:23 UTC

Post by Donn
locale.setlocale(locale.LC_ALL,'')
locale.setlocale(locale.LC_ALL)
Without any issues?

If you pass LC_ALL, then some systems will give you funny results
(semicolon-separated enumerations of all the categories). Instead,
pick a specific category, e.g. LC_CTYPE.
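
A short sketch of that query pattern. It only uses locale.setlocale from the standard library; the try/except is an added precaution for systems where the requested locale is not installed.

```python
import locale

# Adopt whatever the environment (LANG, LC_*) requests. This can raise
# locale.Error if the requested locale is not available.
try:
    locale.setlocale(locale.LC_ALL, "")
except locale.Error:
    pass

# Query a single category; omitting the second argument only reports.
current_ctype = locale.setlocale(locale.LC_CTYPE)
```

Querying LC_CTYPE alone sidesteps the composite string that an LC_ALL query can return on some systems.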

Post by Donn

I need that two-letter code that's hidden in a
typical locale like en_ZA.utf8 -- I want that 'en' part.

Okay, I need it because I have a tree of dirs: en, it, fr and so on for the
help files -- it's to help build a path to the right html file for the
language being supported.

Ok - taking the first two letters should then be fine, assuming all your
directories have two-letter codes.

Post by Donn

Not sure why you want that. Notice that the locale name is fairly system
specific, in particular on non-POSIX systems. It may be
"English_SouthAfrica" on some systems.

Wow, another thing I had no idea about. So far all I've seen are the
xx_yy.utf8 shaped ones.
I will have some trouble then, with the help system.

If you have "unknown" systems, you can try to use locale.normalize.
This has a hard-coded database which tries to deal with some different
spellings. For "English", it will give you en_EN.ISO8859-1.

OTOH, if your software only works on POSIX systems, anyway, I think
it is a fair assumption that they use two-letter codes for the
languages (the full language name is only used on Windows, AFAIK).
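
A quick way to see what locale.normalize does with a few spellings. The results depend on the alias table shipped with the Python version in use, so no particular output is assumed here.

```python
import locale

# Collect the normalized form of a few differently spelled names.
samples = ("English", "en_ZA.utf8", "de_DE@euro")
normalized = {name: locale.normalize(name) for name in samples}

for name, result in normalized.items():
    print(name, "->", result)
```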

Notice that xx_yy.utf8 definitely is *not* the only syntactical form.
utf8 is spelled in various ways (lower and upper case, with and without
dash), and there may be other encodings (see the en_EN example above),
or no encoding at all in the locale name, and there may be "modifiers":

aa_ER@saaho (saaho dialect in Eritrea)
be_BY@latin (as opposed to the Cyrillic be_BY locale)
    likewise for sr_RS
de_DE@euro (as opposed to the D-Mark locale); likewise for other
    members of the Euro zone
ca_ES.UTF-8@valencia (Valencian - Southern Catalan)
    (no real difference to ca_ES@euro, but differences in
    message translations)
gez_ER@abegede (Ge'ez language in Eritrea with Abegede collation)
tt_RU@iqtelif.UTF-8 (Tatar language written in IQTElif alphabet)
uz_UZ@cyrillic (as opposed to latin uz_UZ)

There used to be a @bokmal modifier for Norwegian (as opposed to
the Nynorsk grammar), but they have separate language codes now
(nb vs. nn).
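
Given that variety, a sketch of extracting only the language code might look like this. The function name language_code and the regular expression are assumptions for illustration; it handles POSIX-style names only and returns None for names such as "C" or "English_SouthAfrica".

```python
import re

# Assumed shape: language[_TERRITORY][.codeset][@modifier], with the
# codeset and modifier allowed in either order (e.g. tt_RU@iqtelif.UTF-8).
# Only the leading two- or three-letter language code is captured.
_LANGUAGE = re.compile(r"^([A-Za-z]{2,3})(?=$|[_.@])")

def language_code(locale_name):
    """Return the leading language code, or None if there is none."""
    match = _LANGUAGE.match(locale_name)
    return match.group(1).lower() if match else None
```

A caller building a help-file path would still need a default directory for the None case.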

Regards,
Martin

"Martin v. Löwis"

2008-01-13 20:43:28 UTC

Now, I want to open that file from Python, and I create a path with
paf = ['/home/donn/.fontypython/M\xc3\x96gul.pog']
I *think* that the situation is impossible because the system cannot resolve
the correct filename (due the locale being ANSI and the filename being other)
but I am not 100% sure.

Not at all. The string you pass is a *byte* string, not a character
string. You may think that the first letter of it is an aitch,
but that's just your interpretation - it really is the byte 104.

The operating system does not interpret the file names as characters
at all, with the exception of treating byte 47 as the path separator
(typically interpreted by people as "slash").

Your locale becomes only relevant when displaying file names, and
having to choose what glyphs to use.
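
The point about bytes can be checked directly. A small sketch in Python 3 syntax, where byte strings carry a b prefix; the path is the one from the question and is never opened here.

```python
path = b"/home/donn/.fontypython/M\xc3\x96gul.pog"

# Indexing a byte string yields integers, not characters.
separator = path[0]        # 47, conventionally displayed as "/"
aitch = path[1]            # 104, conventionally displayed as "h"

# Splitting on byte 47 needs no knowledge of any encoding.
components = path.split(b"/")
```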

1. f = codecs.open( paf, "r", "utf8" )
I had hopes for this one.
2. f = codecs.open( paf, "r", locale.getpreferredencoding())
3. f = open( paf, "r")

Now you are mixing two important concepts - the *contents*
of the file with the *name* of the file. These are entirely
independent, and the file name may be in one encoding and
the file contents in another, or the file contents may not
represent character data at all.

All these three APIs try to get to the *contents* of the
file, by opening it.

The name is already a byte string (as a character string,
it would have started with u'...'), so there is no need
to encode it. What the content of a .pog file is, I don't
know, so I can't tell you what encoding it is encoded in.

But none will open it - all get a UnicodeDecodeError. This aligns with my
suspicions, but I wanted to bounce it off you to be sure.

Option three should have worked if paf was a string, but
above, I see it as a *list* of strings. So try

f = open(paf[0], "r")

where paf[0] should be '/home/donn/.fontypython/M\xc3\x96gul.pog',
as paf is ['/home/donn/.fontypython/M\xc3\x96gul.pog']

Still, I question that you *really* got a UnicodeDecodeError
for three: I get

TypeError: coercing to Unicode: need string or buffer, list found

Can you please type

paf = ['/home/donn/.fontypython/M\xc3\x96gul.pog']
f = open(paf, "r")

at the interactive prompt, and report the *complete* shell output?
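
For comparison, a sketch of the same experiment in Python 3, where the wording of the message differs but the exception type is the same. The file does not need to exist, because the error is raised before the file system is consulted.

```python
paf = [b"/home/donn/.fontypython/M\xc3\x96gul.pog"]

try:
    open(paf, "r")                 # a list, not a path
    outcome = "opened"
except TypeError as error:
    # open() rejects the argument type before looking at the disk.
    outcome = type(error).__name__
```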

1. Does it imply that the filename will be opened (with the name as it's
type : i.e. bytestring or unicode ) and written *into* as <encoding>
2. Imply that filename will be encoded via <encoding> and written into as
<encoding>
It's fuzzy, how is the filename handled?

See above. The encoding in codecs.open has no effect at all on
the file name; it only talks about the file content.
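
A sketch that separates the two concepts, assuming a POSIX file system. It uses the built-in open of Python 3, whose encoding parameter plays the role that codecs.open plays in the thread; the file name and content are made up for the demonstration.

```python
import os
import tempfile

directory = os.fsencode(tempfile.mkdtemp())
# The name is passed as bytes and reaches the OS unchanged.
name = os.path.join(directory, b"M\xc3\x96gul.pog")

# The encoding argument governs the content only, here Latin-1.
with open(name, "w", encoding="latin-1") as handle:
    handle.write("caf\xe9")

with open(name, "rb") as handle:
    content_bytes = handle.read()

listed = os.listdir(directory)
```

The name on disk holds UTF-8 bytes while the content holds Latin-1 bytes, and neither setting influenced the other.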

Regards,
Martin

"Martin v. Löwis"

2008-01-14 21:48:28 UTC

Given that getlocale() is not to be used, what's the best way to get the
locale later in the app?

You get the full locale name with locale.setlocale(category) (i.e.
without the second argument)

I need that two-letter code that's hidden in a
typical locale like en_ZA.utf8 -- I want that 'en' part.

Not sure why you want that. Notice that the locale name is fairly system
specific, in particular on non-POSIX systems. It may be
"English_SouthAfrica" on some systems.

If you are certain that *your* locale names will only ever be of the
form <languagecode>[_<countrycode>][.<encoding>][@modifier] (or whatever
the syntax is), take anything before the underscore as the language code.

However, you should reevaluate why you need that.

BTW - things are hanging-together much better now, thanks to your info. I have
it running in locale 'C' as well as my other test locales. What a relief!

Great!

Martin

Donn

2008-01-14 08:19:30 UTC

Post by "Martin v. Löwis"
Can you please type
paf = ['/home/donn/.fontypython/M\xc3\x96gul.pog']
f = open(paf, "r")

I think I was getting a ghost error from another try somewhere higher up. You
are correct, this does open the file - no matter what the locale is.

I have decided to keep the test for a decode error because files created under
different locales should not be written-to under the current one. I don't
know if one can mix encodings in a single text file, but I don't have time to
find out.

It is getting messy with my test files created in differing locales, and my
code changing so quickly.

Post by "Martin v. Löwis"
See above. The encoding in codecs.open has no effect at all on
the file name; it only talks about the file content.

Thanks, I suspected as much but it's a subtle thing.

Best,
\d

--
"It is almost as if the human brain were specifically designed to
misunderstand Darwinism, and to find it hard to believe.."
-- Richard Dawkins


"Martin v. Löwis"

2008-01-14 08:22:11 UTC

Post by Donn
I have decided to keep the test for a decode error because files created under
different locales should not be written-to under the current one. I don't
know if one can mix encodings in a single text file, but I don't have time to
find out.

Of course it's *possible*. However, you need to have a detailed format
specification to make it feasible. For example, in a font file, you
could specify that the font name is in UTF-8, the vendor name is in
Latin-1, and the description is in UTF-16, but it would be really stupid
to specify such a format (not that this would stop people from
specifying such formats - in ZIP files, some file names are encoded in
CP437 and some in UTF-8, depending on a per-filename flag).
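
The ZIP flag can be observed with the standard zipfile module. A sketch using an in-memory archive; the member names are invented, and 0x800 is bit 11 of the general purpose bit flag, which marks a UTF-8 file name.

```python
import io
import zipfile

buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("plain.txt", "ascii name")
    archive.writestr("M\xd6gul.txt", "non-ascii name")

UTF8_NAME_FLAG = 0x800   # bit 11 of the general purpose bit flag

# Read the archive back and check the flag for each member.
with zipfile.ZipFile(buffer) as archive:
    flags = {info.filename: bool(info.flag_bits & UTF8_NAME_FLAG)
             for info in archive.infolist()}
```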

Regards,
Martin

Donn

2008-01-14 16:02:56 UTC

Given that getlocale() is not to be used, what's the best way to get the
locale later in the app? I need that two-letter code that's hidden in a
typical locale like en_ZA.utf8 -- I want that 'en' part.

BTW - things are hanging-together much better now, thanks to your info. I have
it running in locale 'C' as well as my other test locales. What a relief!

\d

Donn

2008-01-13 20:57:53 UTC

Post by "Martin v. Löwis"
Now you are mixing two important concepts - the *contents*
of the file with the *name* of the file.

Then I suspect the error may be due to the contents having been written in
utf8 from previous runs. Phew!

It's bedtime on my end, so I'll try it again when I get a chance during the
week.

Thanks muchly.
\d

--
snappy repartee: What you'd say if you had another chance.


Donn

2008-01-13 07:30:07 UTC

Martin,
I really appreciate your reply. I have been working in a vacuum on this and
without any experience. I hope you don't mind if I ask you a bunch of
questions. If I can get over some conceptual 'humps' then I'm sure I can
produce a better app.

That's a bug in the app. It shouldn't assume that environment variables
are UTF-8. Instead, it should assume that they are in the locale's
encoding, and compute that encoding with locale.getpreferredencoding.
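
A sketch of that advice in Python 3 syntax. The helper name decode_environment_value is hypothetical, and the "replace" handler is an added choice so that a mismatched value degrades instead of raising.

```python
import locale

def decode_environment_value(raw):
    """Decode a raw environment value using the locale's encoding."""
    # False means: report the encoding without calling setlocale.
    encoding = locale.getpreferredencoding(False)
    # "replace" keeps going when the bytes do not fit the encoding.
    return raw.decode(encoding, "replace")

home = decode_environment_value(b"/home/donn")
```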

I see what you are saying and agree, and I am confused about files and
filenames. My app has to handle font files which can come from anywhere. If
the locale (locale.getpreferredencoding) returns something like "ANSI" and I
am doing an os.listdir() then I lose the plot a little...

It seems to me that filenames are like snapshots of the locales where they
originated. If there's a font file from India and I want to open it on my
system in South Africa (and I have LANG=C) then it seems that it's impossible
to do. If I access the filename it throws a UnicodeDecodeError. If I
use 'replace' or 'ignore' then I am mangling the filename and I won't be able
to open it.

The same goes for adding 'foreign' filenames to paths with any kind of string
operation.

My (admittedly uninformed) conception is that by forcing the app to always
use utf8 I can access any filename in any encoding. The problem seems to be
that I cannot know *what* encoding (and I get encode/decode mixed up still,
very new to it all) that particular filename is in.

Am I right? Wrong? Deluded? :) Please fill me in.

If you print non-ASCII strings to the terminal, and you can't be certain
that the terminal supports the encoding in the string, and you can't
reasonably deal with the exceptions, you should accept moji-bake, by
specifying the "replace" error handler when converting strings to the
terminal's encoding.
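
A sketch of the "replace" approach for terminal output, in Python 3 syntax. The helper name printable is an assumption; it falls back to ASCII when the output stream reports no encoding.

```python
import sys

def printable(text, encoding=None):
    """Return text with characters the terminal cannot show replaced."""
    # sys.stdout.encoding may be None when output is redirected.
    encoding = encoding or sys.stdout.encoding or "ascii"
    # Characters outside the encoding become "?" instead of raising.
    return text.encode(encoding, "replace").decode(encoding)

shown_in_c_locale = printable("M\xd6gul", "ascii")
```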

I went through this exercise recently and had no joy. It seems the string I
chose to use simply would not render - even under 'ignore' and 'replace'.
It's really frustrating because I don't speak a non-ascii language and so
can't know if I am testing real-world strings or crazy Tolkien strings.

Another aspect of this is wxPython. My app requires the unicode build so that
strings have some hope of displaying on the widgets. If I access a font file
and fetch the family name - that can be encoded in any way, again unknown,
and I want to fetch it as 'unicode' and pass it to the widgets and not worry
about what's really going on. Given that, I thought I'd extend the 'utf8'
only concept to the app in general. I am sure I am wrong, but I feel cornered
at the moment.

3. I made the decision to check the locale and stop the app if the return
from getlocale is (None,None).

I would avoid locale.getlocale. It's a pointless function (IMO).

Could you say why?

Here's my use of it:
locale.setlocale( locale.LC_ALL, "" )
loc = locale.getlocale()[0]
if loc == None:
    loc = locale.getlocale()
    if loc == (None, None):
        print localeHelp # not utf-8 (I think)
        raise SystemExit
# Now gettext
domain = "all"
gettext.install( domain, localedir, unicode = True )
lang = gettext.translation(domain, localedir, languages = [loc] )
lang.install(unicode = True )

So, I am using getlocale to get a tuple/list (easy, no?) to pass to the
gettext.install function.

Your program definitely, absolutely must work in the C locale. Of
course, you cannot have any non-ASCII characters in that locale, so
deal with it.

This would mean cutting-out a percentage of the external font files that can
be used by the app. Is there no modern standard regarding the LANG variable
and locales these days? My locale -a reports a bunch of xx_XX.utf8 locales.
Does it even make sense to use a non-utf8 locale anymore?

If you have solved that, chances are high that it will work in other
locales as well (but be sure to try Turkish, as that gives a
surprising meaning to "I".lower()).

Oh boy, this gives me cold chills. I don't have the resources to start
worrying about every single language's edge-cases. This is kind of why I was
leaning towards a "use a utf8 locale please" approach.

\d


"Martin v. Löwis"

2008-01-15 06:08:18 UTC

Has it been decided how Python 3.0 will implement os.listdir on Unix?
Will there be only a single attempt to encode using the current locale
or will there be a backup technique?

That's what it currently does.

I'd probably define an optional
encoding parameter so you can ask for os.listdir(encoding="iso-8859-1")
although that then propagates into open, ...

I had the same idea, and I think that parameter should be added.

For open(), I think we should continue to accept byte strings as file
names.

Regards,
Martin

"Martin v. Löwis"

2008-01-13 11:26:17 UTC

I have found that os.listdir() does not always return unicode objects when
passed a unicode path. Sometimes "byte strings" are returned in the list,
mixed-in with unicodes.

Yes. It does so when it fails to decode the byte string according to the
file system encoding (which, in turn, is based on the locale).

I will try the technique given
on:http://www.pyzine.com/Issue008/Section_Articles/article_Encodings.html#guessing-the-encoding
Perhaps that will help.

I would advise against such a strategy. Instead, you should first
understand what the encodings of the file names actually *are*, on
a real system, and draw conclusions from that.

I gather you mean that I should get a unicode path, encode it to a byte string
and then pass that to os.listdir
Then, I suppose, I will have to decode each resulting byte string (via the
detect routines mentioned in the link above) back into unicode - passing
those I simply cannot interpret.

That's what I meant, yes. Again, you have a number of options - passing
those that you cannot interpret is but one option. Another option is to
accept moji-bake.

Then, if the locale's encoding cannot decode the file names, you have
several options
a) don't try to interpret the file names as character strings, i.e.
don't decode them. Not sure why you need the file names - if it's
only to open the files, and never to present the file name to the
user, not decoding them might be feasible

So, you reckon I should stick to byte-strings for the low-level file open
stuff? It's a little complicated by my using Python Imaging to access the
font files. It hands it all over to Freetype and really leaves my sphere of
savvy.
I'll do some testing with PIL and byte-string filenames. I wish my memory was
better, I'm pretty sure I've been down that road and all my results kept
pushing me to stick to unicode objects as far as possible.

I would be surprised if PIL/freetype would not support byte string file
names if you read those directly from the disk. OTOH, if the user has
selected/typed a string at a GUI, and you encode that - I can easily
see how that might have failed.

That's correct, and there is no solution (not in Python, not in any
other programming language). You have to make trade-offs. For that,
you need to analyze precisely what your requirements are.

1. To open font files from any source (locale.)
2. To display their filename on the gui and the console.
3. To fetch some text meta-info (family etc.) via PIL/Freetype and display
same.
4. To write the path and filename to text files.
5. To make soft links (path + filename) to another path.
So, there's a lot of unicode + unicode and os.path.join and so forth going on.

I notice that this doesn't include "to allow the user to enter file
names", so it seems there is no input of file names, only output.

Then I suggest this technique of keeping bytestring/unicode string
pairs. Use the Unicode string for display, and the byte string for
accessing the disc.

Post by Donn
I went through this exercise recently and had no joy. It seems the string
I chose to use simply would not render - even under 'ignore' and
'replace'.

I don't understand what "would not render" means.

I meant it would not print the name, but constantly throws ascii related
errors.

That cannot be. Both the ignore and the replace error handlers will
silence all decoding errors.

I don't know if the character will survive this email, but the text I was
trying to display (under LANG=C) in a python script (not the immediate-mode
interpreter) was: "M?gul". The second character is a capital O with an umlaut
(double-dots I think) above it. For some reason I could not get that to
display as "M?gul" or "Mgul".

Post by Donn
u"M\xd6gul".encode("ascii","ignore")

'Mgul'

Post by Donn
u"M\xd6gul".encode("ascii","replace")

'M?gul'

Regards,
Martin

Donn

2008-01-13 10:24:10 UTC

Martin,
Thanks, food for thought indeed.

On Unix, yes. On Windows, NTFS and VFAT represent file names as Unicode
strings always, independent of locale. POSIX file names are byte
strings, and there isn't any good support for recording what their
encoding is.

I get my filenames from two sources:
1. A wxPython treeview control (unicode build)
2. os.listdir() with a unicode path passed to it

I have found that os.listdir() does not always return unicode objects when
passed a unicode path. Sometimes "byte strings" are returned in the list,
mixed-in with unicodes.

I will try the technique given
on:http://www.pyzine.com/Issue008/Section_Articles/article_Encodings.html#guessing-the-encoding
Perhaps that will help.

If you think you may have file names with mixed locales, and
the current locale might not match the file name's locale, you should
be using the byte string variant on Unix (which it seems you are already
doing).

b) guess an encoding. For file names on Linux, UTF-8 is fairly common,
so it might be a reasonable guess.
c) accept lossy decoding, i.e. decode with some encoding, and use
"replace" as the error handler. You'll have to preserve the original
file names along with the decoded versions if you later also want to
operate on the original file.

Okay, I'm getting your drift.

That's not true. Try open("\xff","w"), then try interpreting the file
name as UTF-8. Some byte strings are not meaningful UTF-8, hence that
approach cannot work.

Okay.

That's correct, and there is no solution (not in Python, not in any
other programming language). You have to make trade-offs. For that,
you need to analyze precisely what your requirements are.

I would say the requirements are:
1. To open font files from any source (locale.)
2. To display their filename on the gui and the console.
3. To fetch some text meta-info (family etc.) via PIL/Freetype and display
same.
4. To write the path and filename to text files.
5. To make soft links (path + filename) to another path.

So, there's a lot of unicode + unicode and os.path.join and so forth going on.

Post by Donn
I went through this exercise recently and had no joy. It seems the string
I chose to use simply would not render - even under 'ignore' and
'replace'.

I don't understand what "would not render" means.

I meant it would not print the name, but constantly throws ascii related
errors.

I don't know if the character will survive this email, but the text I was
trying to display (under LANG=C) in a python script (not the immediate-mode
interpreter) was: "M?gul". The second character is a capital O with an umlaut
(double-dots I think) above it. For some reason I could not get that to
display as "M?gul" or "Mgul".
BTW, I just made that up - it means nothing (to me). I hope it's not a swear
word in some other language :)

As for font files - I don't know what encoding the family is in, but
I would sure hope that the format specification of the font file format
would also specify what the encoding for the family name is, or that
there are at least established conventions.

You'd think. It turns out that font files are anything but simple. I am doing
my best to avoid being sucked-into the black hole of complexity they
represent. I must stick to what PIL/Freetype can do. The internals of
font-files are waaaaaay over my head.

Post by Donn

I would avoid locale.getlocale. It's a pointless function (IMO).

As a consequence, it will return None if it doesn't know better.
If all you want is the charset of the locale, use
locale.getpreferredencoding().

Brilliant summary - thanks a lot for that.

You could just leave out the languages parameter, and trust gettext
to find some message catalog.

Right - I'll give that a go.
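
A sketch of the gettext call without the languages parameter, in Python 3 syntax (the unicode keyword from the Python 2 code no longer exists). The empty temporary directory stands in for localedir, and fallback=True is an added choice so that a missing catalog yields untranslated strings instead of an error.

```python
import gettext
import tempfile

localedir = tempfile.mkdtemp()      # empty: no message catalogs here
domain = "all"

# With languages omitted, gettext consults LANGUAGE, LC_ALL,
# LC_MESSAGES and LANG by itself.
translation = gettext.translation(domain, localedir, fallback=True)
translation.install()               # makes _() available as a builtin
greeting = translation.gettext("Hello")
```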

Post by Donn
This would mean cutting-out a percentage of the external font files that
can be used by the app.

See above. There are other ways to trade-off. Alternatively, you could
require that the program finds a richer locale, and bail out if the
locale is just "C".

That's kind of what the OP is all about. If I make this a 'design decision'
then it means I have a problem with the Debian packaging (and RPM?) rules
that require a "C" locale support.
I think I shall have to break the links between my setup.py and the rest of my
app - so that setup.py will allow LANG=C but the app (when run) will not.

That doesn't help. For Turkish in particular, the UTF-8 locale is worse
than the ISO-8859-9 locale, as the lowercase I takes two bytes in UTF-8,
so tolower can't really work in the UTF-8 locale (but can in the
ISO-8859-9 locale).
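
To make the byte counts concrete, here is a small sketch in Python 3 (which postdates this thread; the variable names are only illustrative). It encodes the Turkish dotless lowercase i, U+0131, in both charsets:

```python
# In Turkish, uppercase "I" lowercases to dotless "ı" (U+0131), not to "i".
dotless_i = "\u0131"

utf8_bytes = dotless_i.encode("utf-8")         # multi-byte sequence in UTF-8
latin5_bytes = dotless_i.encode("iso-8859-9")  # single byte in ISO-8859-9

print(len(utf8_bytes), len(latin5_bytes))
```

A byte-at-a-time tolower() receives one byte for "I" but would have to emit two bytes in UTF-8, which is why it can only do the right thing in the single-byte locale.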

Wow. I still get cold chills -- but I assume that once the right encoding is
known this sort of thing will be okay.

Thanks again. It's coming together slowly.
\d

"Martin v. Löwis"

2008-01-13 09:27:38 UTC

Post by Donn
It seems to me that filenames are like snapshots of the locales where they
originated.

Post by Donn
If there's a font file from India and I want to open it on my
system in South Africa (and I have LANG=C) then it seems that it's impossible
to do. If I access the filename it throws a UnicodeDecodeError. If I
use 'replace' or 'ignore' then I am mangling the filename and I won't be able
to open it.

Correct. Notice that there are two ways (currently) in Python to get a
directory listing: with a Unicode directory name, which returns Unicode
strings, and with a byte string directory name, which returns byte
strings. If you think you may have file names with mixed locales, and
the current locale might not match the file name's locale, you should
be using the byte string variant on Unix (which it seems you are already
doing).

Then, if the locale's encoding cannot decode the file names, you have
several options
a) don't try to interpret the file names as character strings, i.e.
don't decode them. Not sure why you need the file names - if it's
only to open the files, and never to present the file name to the
user, not decoding them might be feasible
b) guess an encoding. For file names on Linux, UTF-8 is fairly common,
so it might be a reasonable guess.
c) accept lossy decoding, i.e. decode with some encoding, and use
"replace" as the error handler. You'll have to preserve the original
file names along with the decoded versions if you later also want to
operate on the original file.
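
The three options above can be combined in one rough sketch. It is written for Python 3, which postdates this thread, and the function names display_name and list_font_files are made up for illustration. The raw byte name is always kept for opening the file (option a); a text form for display or logging is produced by guessing UTF-8 first (option b) and falling back to a lossy decode (option c):

```python
import os

def display_name(raw_name: bytes) -> str:
    """Best-effort text form of a file name, for display or logging only."""
    try:
        return raw_name.decode("utf-8")            # option b: guess UTF-8
    except UnicodeDecodeError:
        return raw_name.decode("ascii", "replace")  # option c: lossy decode

def list_font_files(directory: bytes):
    """Return (raw_name, display_name) pairs for every entry in directory."""
    pairs = []
    for raw_name in os.listdir(directory):  # bytes in, bytes out
        # option a: raw_name stays undecoded and is what gets passed to open()
        pairs.append((raw_name, display_name(raw_name)))
    return pairs
```

Keeping the pair together means the lossy display form is never used to reach the file on disk.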

Post by Donn
My (admittedly uninformed) conception is that by forcing the app to always
use utf8 I can access any filename in any encoding.

That's not true. Try open("\xff","w"), then try interpreting the file
name as UTF-8. Some byte strings are not meaningful UTF-8, hence that
approach cannot work.

You *can* interpret all file names as ISO-8859-1, but then some file
names will show mojibake.
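
Both points can be checked without touching the file system. The following is a minimal sketch in Python 3 (which postdates this thread); the sample name "MÖgul" and the variable names are only illustrative:

```python
raw_name = b"\xff"  # a legal Unix file name, but not valid UTF-8

try:
    raw_name.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# ISO-8859-1 maps every byte value to a character, so decoding never fails
# and encoding the result gives back the original bytes.
as_latin1 = raw_name.decode("iso-8859-1")
round_trip = as_latin1.encode("iso-8859-1")

# The price: a name that was really UTF-8 comes out as the wrong characters.
mojibake = "M\u00d6gul".encode("utf-8").decode("iso-8859-1")

print(utf8_ok, round_trip == raw_name, repr(mojibake))
```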

Post by Donn
The problem seems to be
that I cannot know *what* encoding (and I get encode/decode mixed up still,
very new to it all) that particular filename is in.

That's correct, and there is no solution (not in Python, not in any
other programming language). You have to make trade-offs. For that,
you need to analyze precisely what your requirements are.

Post by Donn
I went through this exercise recently and had no joy. It seems the string I
chose to use simply would not render - even under 'ignore' and 'replace'.

I don't understand what "would not render" means.

Post by Donn
It's really frustrating because I don't speak a non-ascii language and so
can't know if I am testing real-world strings or crazy Tolkien strings.

I guess your choices are to either give up, or learn.

Post by Donn
Another aspect of this is wxPython. My app requires the unicode build so that
strings have some hope of displaying on the widgets. If I access a font file
and fetch the family name - that can be encoded in any way, again unknown,
and I want to fetch it as 'unicode' and pass it to the widgets and not worry
about what's really going on. Given that, I thought I'd extend the 'utf8'
only concept to the app in general. I am sure I am wrong, but I feel cornered
at the moment.

Don't confuse "utf8 only" with "unicode only". Having all strings as
Unicode strings is a good thing. Assuming that all encoded text is
encoded in UTF-8 (which is but one encoding for Unicode) is likely
incorrect.

As for font files - I don't know what encoding the family is in, but
I would sure hope that the format specification of the font file format
would also specify what the encoding for the family name is, or that
there are at least established conventions.

Post by Donn

3. I made the decision to check the locale and stop the app if the return
from getlocale is (None,None).

I would avoid locale.getlocale. It's a pointless function (IMO).

Could you say why?

It tries to emulate the C library, but does so incorrectly; this
is inherently unfixable because behavior of the C library can vary
across platforms, and Python can't possibly encode the behavior
of all C libraries in existence on all platforms. In particular,
it has a hard-coded list of what charsets are in use in what locale,
and that list necessarily must be incomplete and may be incorrect.

As a consequence, it will return None if it doesn't know better.
If all you want is the charset of the locale, use
locale.getpreferredencoding().
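
As a minimal sketch of that advice, assuming Python 3 (which postdates this thread), the charset can be read like this. The try/except around setlocale is an addition for robustness, not something from the discussion:

```python
import locale

# Adopt the user's locale settings (LANG, LC_ALL, ...) for this process.
try:
    locale.setlocale(locale.LC_ALL, "")
except locale.Error:
    pass  # unsupported settings: stay in the default "C" locale

# Ask only for the charset; there is no (language, encoding) tuple to unpack
# and no None to special-case.
encoding = locale.getpreferredencoding()
print(encoding)
```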

Post by Donn
gettext.install( domain, localedir, unicode = True )
lang = gettext.translation(domain, localedir, languages = [loc] )

You could just leave out the languages parameter, and trust gettext
to find some message catalog.

Post by Donn
So, I am using getlocale to get a tuple/list (easy, no?) to pass to the
gettext.install function.

Sure - but the parameter is optional.

Post by Donn

Your program definitely, absolutely must work in the C locale. Of
course, you cannot have any non-ASCII characters in that locale, so
deal with it.

This would mean cutting-out a percentage of the external font files that can
be used by the app.

See above. There are other ways to trade-off. Alternatively, you could
require that the program finds a richer locale, and bail out if the
locale is just "C".
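
One possible shape for such a bail-out, sketched in Python 3 (which postdates this thread). The helper names is_ascii_only and require_rich_locale and the wording of the message are invented for illustration. Note also that recent Python 3 versions may themselves switch a bare "C" locale to UTF-8, in which case the check passes even under LANG=C:

```python
import codecs
import locale
import sys

def is_ascii_only(encoding: str) -> bool:
    """True if the encoding name is an alias of plain ASCII."""
    try:
        return codecs.lookup(encoding).name == "ascii"
    except LookupError:
        return False  # unknown charset: not provably ASCII

def require_rich_locale() -> None:
    """Exit with a message when the locale offers nothing beyond ASCII."""
    if is_ascii_only(locale.getpreferredencoding()):
        sys.exit("This program needs a non-ASCII locale, e.g. LANG=en_ZA.utf8")
```

The application's entry point would call require_rich_locale(), while setup.py simply would not, so packaging under LANG=C keeps working.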

Post by Donn
Is there no modern standard regarding the LANG variable
and locales these days? My locale -a reports a bunch of xx_XX.utf8 locales.
Does it even make sense to use a non-utf8 locale anymore?

It's not your choice, but the user's. People still use non-UTF-8 locales
heavily, and likely will continue to do so for at least 10 more years.

Post by Donn
Oh boy, this gives me cold chills. I don't have the resources to start
worrying about every single language's edge-cases. This is kind of why I was
leaning towards a "use a utf8 locale please" approach.

Neil Hodgson

2008-01-14 23:58:05 UTC

That's not true. Try open("\xff","w"), then try interpreting the file
name as UTF-8. Some byte strings are not meaningful UTF-8, hence that
approach cannot work.

Has it been decided how Python 3.0 will implement os.listdir on
Unix? Will there be only a single attempt to encode using the current
locale or will there be a backup technique? I'd probably define an
optional encoding parameter so you can ask for
os.listdir(encoding="iso-8859-1") although that then propagates into
open, ...

Neil

Donn

2008-01-13 18:09:22 UTC

Martin,
I want to thank you for your patience, you have been sterling. I have an
overview this evening that I did not have this morning. I have started fixing
my code and the repairs may not be that extreme after all.

I'll hack-on and get it done. I *might* bug you again, but I'll resist at all
costs :)

Much appreciated.
\d

--
"A computer without Windows is like chocolate cake without mustard."
-- Anonymous Coward /.

Fonty Python and other dev news at:
http://otherwiseingle.blogspot.com/

Donn

2008-01-13 18:48:08 UTC

Well, that didn't take me long... Can you help with this situation?
I have a file named "MÖgul.pog" in this directory:
/home/donn/.fontypython/

I set my LANG=C

Now, I want to open that file from Python, and I create a path with
os.path.join() and an os.listdir() which results in this byte string:
paf = ['/home/donn/.fontypython/M\xc3\x96gul.pog']

I *think* that the situation is impossible because the system cannot resolve
the correct filename (due to the locale being ANSI and the filename being other)
but I am not 100% sure.

So, I have been trying combinations of open:
1. f = codecs.open( paf, "r", "utf8" )
I had hopes for this one.
2. f = codecs.open( paf, "r", locale.getpreferredencoding())
3. f = open( paf, "r")

But none will open it - all get a UnicodeDecodeError. This aligns with my
suspicions, but I wanted to bounce it off you to be sure.

It does not really mesh with our previous words about opening all files as
bytestrings, and admits failure to open this file.

Also, this codecs.open(filename, "r", <encoding>) function:
1. Does it imply that the file will be opened by its name as given (whatever
its type: bytestring or unicode) and its contents written as <encoding>, or
2. Does it imply that the filename will be encoded via <encoding> and the
contents written as <encoding>?
It's fuzzy: how is the filename handled?
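
For what it is worth, the encoding argument describes the file's contents, not its name. The sketch below is in Python 3 (which postdates this thread), where the built-in open with an encoding argument plays the role of codecs.open. The scratch directory and the file contents are invented for the demonstration; only the byte-string name is taken from the post:

```python
import os
import tempfile

raw_name = b"M\xc3\x96gul.pog"  # the byte string from the post, left undecoded

directory = os.fsencode(tempfile.mkdtemp())  # scratch directory, as bytes
path = os.path.join(directory, raw_name)     # bytes joined with bytes stays bytes

# The byte-string path goes to the operating system as-is; the encoding
# argument only governs how the text inside the file is written and read.
with open(path, "w", encoding="utf-8") as f:
    f.write("fonts\n")

with open(path, "r", encoding="utf-8") as f:
    contents = f.read()

print(contents)
```

Because the name is never decoded here, the locale's charset does not come into play when the file is opened.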

\d

--
He has Van Gogh's ear for music. -- Billy Wilder

Fonty Python and other dev news at:
http://otherwiseingle.blogspot.com/

Donn

2008-01-13 13:50:49 UTC

If you can ls them all, and if the file names come out right, then
they'll have the same encoding.

Could it not be that the app doing the output (say konsole) could be
displaying a filename as best as it can (doing the ignore/replace trick and
using whatever fonts it can reach) and this would disguise the situation?
I don't think one can call any string a plain ascii string anymore.

I have been looking for somewhere online that I can download files obviously
in a non-ascii set (like japan someplace) but can't find anything easy. I
want to see exactly how my system (Kubuntu 7.10) handles things.

I never heard before that font files use non-ASCII file names,

They are files, named as any other file - those that are created by people get
called whatever they want, under whatever locale they used.
Still, I don't fully understand how this is all handled.

don't see the point in doing so - isn't there typically a font name
*inside* the font file as well, so that you'd rather use that for
display than the file name?

Yes, but sometimes I can't reach that - segfaults and so forth. I also need to
write the filename to a text file for logging.

Of course, *other* files (text files, images etc) will often use
non-ASCII file names.

Same as font files - I am talking mainly about TTF files here. Mainly Arrr,
pass the rum, matey fonts ;) (Which I don't use in designs, but have kept
over the years.)

However, they won't normally have mixed
encodings - most user-created files on a single system should typically
have the same encoding (there are exceptions possible, of course).

Well, if I am collecting fonts from all over the place then I get a mixed-bag.

Meaning, I am led to assume, the LANG variable primarily?

Yes.

Thanks. Good to know I'm on the right track.

\d

"Martin v. Löwis"

2008-01-13 17:02:00 UTC

Post by Donn
Could it not be that the app doing the output (say konsole) could be
displaying a filename as best as it can (doing the ignore/replace trick and
using whatever fonts it can reach) and this would disguise the situation?

Post by Donn
I have been looking for somewhere online that I can download files obviously
in a non-ascii set (like japan someplace) but can't find anything easy. I
want to see exactly how my system (Kubuntu 7.10) handles things.

So which encoding does sys.getfilesystemencoding() say is used for
filenames?

Regards,
Martin

"Martin v. Löwis"

2008-01-13 17:51:06 UTC

Post by Donn
What happens if there is a filename that cannot be represented in its
entirety? i.e. every character is 'replaced'. Does it simply vanish, or does
it appear as "?????????" ? :)

The latter. I did open(u"\u20ac\u20ac","w") in an UTF-8 locale, then did
"LANG=C ls", and it gave me ?????? (as the two characters use 6 bytes)

Post by Donn
I spent an hour trying to find a single file on the web that did *not* have
(what seemed like) ascii characters in it and failed. Even urls on Japanese
websites use western characters ( a tcp/ip issue I suspect).

Actually, an HTTP and URL issue. Non-ASCII URLs aren't really supported
on the web.

Post by Donn
I was hoping to
find a filename in Kanji (?) ending in .jpg or something so that I could
download it and see what my system (and Python) made of it.

Use a text editor instead to create such a file. For example, create
a new document, and save it as "????.txt" (which Google says means
"casestudies.txt")

Regards,
Martin

"Martin v. Löwis"

2008-01-12 23:08:42 UTC

2. If this returns "C" or anything without 'utf8' in it, then things start
to go downhill:
2a. The app assumes unicode objects internally. i.e. Whenever there is
a "string like this" in a var it's supposed to be unicode. Whenever
something comes into the app (from a filename, a file's contents, the
command-line) it's assumed to be a byte-string that I decode("utf8") on
before placing it into my objects etc.

2b. Because of 2a and if the locale is not 'utf8 aware' (i.e. "C") I start
getting all the old 'ascii' unicode decode errors. This happens at every
string operation, at every print command and is almost impossible to fix.

3. I made the decision to check the locale and stop the app if the return
from getlocale is (None,None).

I would avoid locale.getlocale. It's a pointless function (IMO).

Also, what's the purpose of this test?

Does anyone have some ideas? Is there a universal "proper" locale that we
could set a system to *before* the Debian build stuff starts? What would
that be - en_US.utf8?

Your program definitely, absolutely must work in the C locale. Of
course, you cannot have any non-ASCII characters in that locale, so
deal with it.

If you have solved that, chances are high that it will work in other
locales as well (but be sure to try Turkish, as that gives a
surprising meaning to "I".lower()).

Regards,
Martin

Donn Ingle

2008-01-12 08:25:02 UTC

Hello,
I hope someone can illuminate this situation for me.

Here's the nutshell:

1. On start I call locale.setlocale(locale.LC_ALL,''), then getlocale.

2. If this returns "C" or anything without 'utf8' in it, then things start
to go downhill:
2a. The app assumes unicode objects internally. i.e. Whenever there is
a "string like this" in a var it's supposed to be unicode. Whenever
something comes into the app (from a filename, a file's contents, the
command-line) it's assumed to be a byte-string that I decode("utf8") on
before placing it into my objects etc.
2b. Because of 2a and if the locale is not 'utf8 aware' (i.e. "C") I start
getting all the old 'ascii' unicode decode errors. This happens at every
string operation, at every print command and is almost impossible to fix.

3. I made the decision to check the locale and stop the app if the return
from getlocale is (None,None).

4. My setup.py (distutils) also tests locale (because it then loads gettext
to give localized information to the user during setup).

5. Because it's doing a raise SystemExit if the locale is (None,None) which
happens if LANG is set to "C", the setup.py stops.

6. Someone is helping me to package the app for Debian/Ubuntu. During the
bizarre amount of Voodoo they invoke to do that, the setup.py is being run
and it is breaking out because their LANG is set to "C"

7. I have determined, as best I can, that Python relies on LANG being set to
a proper string like en_ZA.utf8 (xx_YY.encoding) and anything else will
start Python with the default encoding of 'ascii' thus throwing the entire
app into a medieval dustbin as far as i18n goes.

8. Since I can't control the LANG of the user's system, and I am relying on
it containing 'utf8' in the locale results.. well I seem to be in a
catch-22 here.

Does anyone have some ideas? Is there a universal "proper" locale that we
could set a system to *before* the Debian build stuff starts? What would
that be - en_US.utf8?

Any words of wisdom would help.
\d