Discussion:
Lazarus (UTF8) and Windows: SysToUTF8, UTF8ToSys... Is there a better solution?
Marcos Douglas
2013-12-15 03:02:17 UTC
Permalink
Hi,

I'm using Lazarus (trunk) and FPC 2.6.2 to program on Windows (XP, 7 and 8).
As everyone knows, Lazarus is UTF8 but FPC 2.6.2 is ANSI.

I would like to know how you work (on Windows) when using Lazarus.

How I work:
1. I use [string] to represent any string type. But some libs (DLL,
ActiveX, etc) use WideString;
2. If I have to create a file, I use UTF8ToSys(FileName)... and if I
have a TStringList I use SS.Text := UTF8ToSys(Text)... and at the end
SS.SaveToFile(UTF8ToSys(FileName)) -- see the sketch below.
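To make that concrete, a minimal sketch of the pattern (the procedure and
parameter names are made up; UTF8ToSys comes from the Lazarus LazUTF8 unit,
older Lazarus versions had it in FileUtil):

uses Classes, LazUTF8;  // UTF8ToSys/SysToUTF8 live here in current Lazarus

procedure SaveReport(const FileNameUTF8, TextUTF8: string);
var
  SS: TStringList;
begin
  SS := TStringList.Create;
  try
    SS.Text := UTF8ToSys(TextUTF8);          // convert before handing text to the ANSI RTL
    SS.SaveToFile(UTF8ToSys(FileNameUTF8));  // ...and again for the file name
  finally
    SS.Free;
  end;
end;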

I don't know what you think, but this is VERY frustrating. If you
forget even one call to UTF8ToSys... errors can happen, and the same
goes for SysToUTF8.

So, I searched for a good solution, but the Wiki still says the same thing:
http://wiki.freepascal.org/LCL_Unicode_Support#Dealing_with_directory_and_filenames

...use SysToUTF8, UTF8ToSys.

Then I searched in Lazarus' sources and -- as you can see in the image
attachment -- Lazarus uses the same "technique".


I know the FPC team is working on a Unicode version of FPC, but I
don't know when it will be available -- or whether the problem will
persist.

I have many systems coded in FPC+Lazarus that only run on Windows, so I ask you:
Is there some trick to make FPC+Lazarus use only ANSI?


Marcos Douglas
Hans-Peter Diettrich
2013-12-15 04:01:35 UTC
Permalink
Marcos Douglas wrote:

> How I work:
> 1. I use [string] to represent any type string. But some libs (DLL,
> ActiveX, etc) uses WideString;

That's Windows specific, not portable.

> 2. If I have to create a file, I use UTF8ToSys(FileName)...

Okay for filenames, even if IMO it should not be necessary.

> and if I
> have a TStringList I use SS.Text := UTF8ToSys(Text)... and at the end
> SS.SaveToFile(UTF8ToSys(FileName));

Why that?


> I have many systems coded in FPC+Lazarus only to run on Windows so I ask you:
> Is there some trick to make the FPC+Lazarus to use only ANSI?

Why that? Lazarus is using UTF-8 throughout, so that writing and reading
files will work the same on all targets.

DoDi


--
Marco van de Voort
2013-12-15 10:56:08 UTC
Permalink
On Sun, Dec 15, 2013 at 05:01:35AM +0100, Hans-Peter Diettrich wrote:
> > I have many systems coded in FPC+Lazarus only to run on Windows so I ask you:
> > Is there some trick to make the FPC+Lazarus to use only ANSI?
>
> Why that? Lazarus is using UTF-8 throughout, so that writing and reading
> files will work the same on all targets.

The Tstringlist.save* routines that Marcos mentions are FPC.

--
Bart
2013-12-15 15:08:35 UTC
Permalink
We have TStringListUtf8 and TFileStreamUTF8 in LazUtf8Classes unit.

Furthermore, all file routines in FileUtil and LazFileUtils do all this
conversion automatically for you. They expect UTF8 strings as their
parameters (all LCL strings are UTF8) and (on NT based platforms) use
the WideString API to implement this.

Mutatis mutandis for ParamStrUtf8.

So for basic stuff all this is already taken care of.
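For example, the same save-to-file task from the start of the thread, written
against these units (a sketch; unit placement follows current Lazarus sources,
the names SaveReportUTF8/FileNameUTF8/TextUTF8 are made up):

uses SysUtils, FileUtil, LazUTF8Classes;  // FileExistsUTF8 is in FileUtil/LazFileUtils

procedure SaveReportUTF8(const FileNameUTF8, TextUTF8: string);
var
  SS: TStringListUTF8;
begin
  SS := TStringListUTF8.Create;    // its SaveToFile/LoadFromFile take UTF-8 file names
  try
    SS.Text := TextUTF8;
    SS.SaveToFile(FileNameUTF8);   // no UTF8ToSys needed; the WideString API is used on NT
  finally
    SS.Free;
  end;
  if not FileExistsUTF8(FileNameUTF8) then  // the FileUtil routines expect UTF-8 too
    raise Exception.Create('save failed');
end;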

As for DLL, ActiveX, well that is platform specific, and you will have
to convert.

Bart

--
Marcos Douglas
2013-12-15 15:25:18 UTC
Permalink
On Sun, Dec 15, 2013 at 1:08 PM, Bart <***@gmail.com> wrote:
> We have TStringListUtf8 and TFileStreamUTF8 in LazUtf8Classes unit.
>
> Furthermore, all file routines in FileUtil and LazFileUtils do all this
> conversion automatically for you. They expect UTF8 strings as their
> parameters (all LCL strings are UTF8) and (on NT based platforms) use
> the WideString API to implement this.
>
> Mutatis mutandis for ParamStrUtf8.
>
> So for basic stuff all this is already taken care of.

Only in Lazarus' context... but I have some components that live only
in FPC's context.
These components do not use Lazarus' routines and that is the BIG
problem. I need to "remember" to pass only ANSI strings to these
components, and to remember to convert the components' output strings
before using them in Lazarus.

> As for DLL, ActiveX, well that is platform specific, and you will have
> to convert.

Forget that, this is not a problem. I'm using many DLLs and ActiveX
components and I know this is not portable... that's OK. I don't need
any conversion help there.

Thanks,
Marcos Douglas

--
Reinier Olislagers
2013-12-15 17:13:32 UTC
Permalink
On 15/12/2013 16:25, Marcos Douglas wrote:
> On Sun, Dec 15, 2013 at 1:08 PM, Bart <***@gmail.com> wrote:
>> So for basic stuff all this is already taken care of.
>
> Only in Lazarus' context... but I have some components that live only
> in FPC's context.
> These components do not use Lazarus' routines and that is the BIG
> problem. I need to "remember" to pass only ANSI strings to these
> components, and to remember to convert the components' output strings
> before using them in Lazarus.

Why not just include a project reference to LCLBase (IIRC that should be
enough) and just always use the LCL units until FPC catches up?


--
Marcos Douglas
2013-12-15 19:47:09 UTC
Permalink
On Sun, Dec 15, 2013 at 3:13 PM, Reinier Olislagers
<***@gmail.com> wrote:
> On 15/12/2013 16:25, Marcos Douglas wrote:
>> On Sun, Dec 15, 2013 at 1:08 PM, Bart <***@gmail.com> wrote:
>>> So for basic stuff all this is already taken care of.
>>
>> Only in Lazarus' context... but I have some components that is only
>> FPC's context.
>> These components do not use Lazarus' routines and that is the BIG
>> problem. I need to "remember" in pass only ANSI strings for these
>> components as remember to convert the component's output string
>> results to use in Lazarus.
>
> Why not just include a project reference to LCLBase (IIRC that should be
> enough) and just always use the LCL units until FPC catches up?

You propose including the LCL in packages that have no LCL references,
and changing all the code of these packages to use UTF8 functions??

I think I didn't understand what you proposed...

Best regards,
Marcos Douglas

--
Reinier Olislagers
2013-12-16 06:30:29 UTC
Permalink
On 15/12/2013 20:47, Marcos Douglas wrote:
> On Sun, Dec 15, 2013 at 3:13 PM, Reinier Olislagers
> <***@gmail.com> wrote:
>> On 15/12/2013 16:25, Marcos Douglas wrote:
>>> On Sun, Dec 15, 2013 at 1:08 PM, Bart <***@gmail.com> wrote:
>>> problem. I need to "remember" in pass only ANSI strings for these
>>> components as remember to convert the component's output string
>>> results to use in Lazarus.
>>
>> Why not just include a project reference to LCLBase (IIRC that should be
>> enough) and just always use the LCL units until FPC catches up?
>
> You propose include LCL in packages that not have LCL references and
> change all code of these packages to use UTF8 functions??
>
> I think didn't understand what you proposed...
I think I don't understand what you're after? You said yourself you
don't want to pass ANSI strings!?!?

My suggestion is to replace calls to the FPC (ANSI) RTL with LCL UTF8
equivalents in your code where possible.

Apart from that there's not much else you can do except contribute
patches to help "unicode-ise" the FPC RTL...

--
Marcos Douglas
2013-12-16 11:15:22 UTC
Permalink
On Mon, Dec 16, 2013 at 3:30 AM, Reinier Olislagers
<***@gmail.com> wrote:
> On 15/12/2013 20:47, Marcos Douglas wrote:
>> On Sun, Dec 15, 2013 at 3:13 PM, Reinier Olislagers
>> <***@gmail.com> wrote:
>>> On 15/12/2013 16:25, Marcos Douglas wrote:
>>>> On Sun, Dec 15, 2013 at 1:08 PM, Bart <***@gmail.com> wrote:
>>>> problem. I need to "remember" in pass only ANSI strings for these
>>>> components as remember to convert the component's output string
>>>> results to use in Lazarus.
>>>
>>> Why not just include a project reference to LCLBase (IIRC that should be
>>> enough) and just always use the LCL units until FPC catches up?
>>
>> You propose include LCL in packages that not have LCL references and
>> change all code of these packages to use UTF8 functions??
>>
>> I think didn't understand what you proposed...
> I think I don't understand what you're after? You said yourself you
> don't want to pass ANSI strings!?!?

I said "a trick to use only ANSI".

> My suggestion is to replace calls to the FPC (ANSI) RTL with LCL UTF8
> equivalents in your code where possible.

But FPC is still ANSI and all read/write in FPC is ANSI. We cannot
override these internal FPC routines... and some packages are used
by other non-GUI programs that have no dependencies on the LCL.

> Apart from that there's not much else you can do except contribute
> patches to help "unicode-ise" the FPC RTL...

Ok... But first I would like to know who else has these same problems,
working on Windows like me, and what these people do to work around
these problems.

Regards,
Marcos Douglas

--
Reinier Olislagers
2013-12-16 11:18:49 UTC
Permalink
On 16/12/2013 12:15, Marcos Douglas wrote:
> On Mon, Dec 16, 2013 at 3:30 AM, Reinier Olislagers
> <***@gmail.com> wrote:
>> On 15/12/2013 20:47, Marcos Douglas wrote:
>>> On Sun, Dec 15, 2013 at 3:13 PM, Reinier Olislagers
>>> <***@gmail.com> wrote:
>>>> On 15/12/2013 16:25, Marcos Douglas wrote:
>>>>> On Sun, Dec 15, 2013 at 1:08 PM, Bart <***@gmail.com> wrote:
>> My suggestion is to replace calls to the FPC (ANSI) RTL with LCL UTF8
>> equivalents in your code where possible.
>
> But FPC still ANSI and all read/write in FPC is ANSI. We can not
> override these internal FPC's routines... and, some packages are used
> by others non-GUI program and that not have dependencies to LCL.
Sigh. I don't think you understood what I wrote.
And LCLBase doesn't pull in GUI stuff, IIUC...

>> Apart from that there's not much else you can do except contribute
>> patches to help "unicode-ise" the FPC RTL...
>
> Ok... But first I would like to know who have these same problems,
> like me, working on Windows and what these people do to bypass these
> problems.
As is customary whenever somebody mentions the magic "U" word, I'm sure
you'll get a lot of responses....


--
Marcos Douglas
2013-12-16 11:51:13 UTC
Permalink
On Mon, Dec 16, 2013 at 8:18 AM, Reinier Olislagers
<***@gmail.com> wrote:
> On 16/12/2013 12:15, Marcos Douglas wrote:
>> On Mon, Dec 16, 2013 at 3:30 AM, Reinier Olislagers
>> <***@gmail.com> wrote:
>>> On 15/12/2013 20:47, Marcos Douglas wrote:
>>>> On Sun, Dec 15, 2013 at 3:13 PM, Reinier Olislagers
>>>> <***@gmail.com> wrote:
>>>>> On 15/12/2013 16:25, Marcos Douglas wrote:
>>>>>> On Sun, Dec 15, 2013 at 1:08 PM, Bart <***@gmail.com> wrote:
>>> My suggestion is to replace calls to the FPC (ANSI) RTL with LCL UTF8
>>> equivalents in your code where possible.
>>
>> But FPC still ANSI and all read/write in FPC is ANSI. We can not
>> override these internal FPC's routines... and, some packages are used
>> by others non-GUI program and that not have dependencies to LCL.
>
> Sigh. I don't think you understood what I wrote.

So explain better, please.

> And LCLBase doesn't pull in GUI stuff, IIUC...
>

Sigh. I know.
What I think you did not understand is:
Some packages do not work (internally) using UTF8 strings. I could
change all the code to use UTF8 in every package and every project...
but this would be a massive amount of work.

>>> Apart from that there's not much else you can do except contribute
>>> patches to help "unicode-ise" the FPC RTL...
>>
>> Ok... But first I would like to know who have these same problems,
>> like me, working on Windows and what these people do to bypass these
>> problems.
> As is customary whenever somebody mentions the magic "U" word, I'm sure
> you'll get a lot of responses....

Sorry, I didn't understand... (sarcasm?)

Thank you,
Marcos Douglas

--
Reinier Olislagers
2013-12-16 12:10:13 UTC
Permalink
On 16/12/2013 12:51, Marcos Douglas wrote:
> On Mon, Dec 16, 2013 at 8:18 AM, Reinier Olislagers
> <***@gmail.com> wrote:
>> On 16/12/2013 12:15, Marcos Douglas wrote:
>>> On Mon, Dec 16, 2013 at 3:30 AM, Reinier Olislagers
>>> <***@gmail.com> wrote:
>>>> On 15/12/2013 20:47, Marcos Douglas wrote:
>>>>> On Sun, Dec 15, 2013 at 3:13 PM, Reinier Olislagers
>>>>> <***@gmail.com> wrote:
>>>>>> On 15/12/2013 16:25, Marcos Douglas wrote:
>>>>>>> On Sun, Dec 15, 2013 at 1:08 PM, Bart <***@gmail.com> wrote:
>>>> My suggestion is to replace calls to the FPC (ANSI) RTL with LCL UTF8
>>>> equivalents in your code where possible.
>>>
>>> But FPC still ANSI and all read/write in FPC is ANSI. We can not
>>> override these internal FPC's routines... and, some packages are used
>>> by others non-GUI program and that not have dependencies to LCL.
>>
>> Sigh. I don't think you understood what I wrote.
>
> So explain better, please.
I tried 2 times already, sorry. It's not important anyway.

> Sigh. I know.
> I think you did not understand is:
> Some packages do not work (internally) using UTF8 strings. I can
> change all code to use UTF8 in all package and all projects... but
> this would be a massive work.
And my point is (unfortunately also... sigh...) that you cannot expect
that to change magically.
Conversion is going to be required unless and until... well I'll stop
here, see the many U* threads on this list, the FPC one and fpc-devel.

>> As is customary whenever somebody mentions the magic "U" word, I'm sure
>> you'll get a lot of responses....
>
> Sorry, I didn't understand... (sarcasm?)
Yes, but not directed at you - at the reactions of this list whenever
that word with U is mentioned ;)


--
Marcos Douglas
2013-12-16 15:32:58 UTC
Permalink
On Mon, Dec 16, 2013 at 9:10 AM, Reinier Olislagers
<***@gmail.com> wrote:
> On 16/12/2013 12:51, Marcos Douglas wrote:
>> On Mon, Dec 16, 2013 at 8:18 AM, Reinier Olislagers
>> <***@gmail.com> wrote:
>>> On 16/12/2013 12:15, Marcos Douglas wrote:
>>>> On Mon, Dec 16, 2013 at 3:30 AM, Reinier Olislagers
>>>> <***@gmail.com> wrote:
>>>>> On 15/12/2013 20:47, Marcos Douglas wrote:
>>>>>> On Sun, Dec 15, 2013 at 3:13 PM, Reinier Olislagers
>>>>>> <***@gmail.com> wrote:
>>>>>>> On 15/12/2013 16:25, Marcos Douglas wrote:
>>>>>>>> On Sun, Dec 15, 2013 at 1:08 PM, Bart <***@gmail.com> wrote:
>>>>> My suggestion is to replace calls to the FPC (ANSI) RTL with LCL UTF8
>>>>> equivalents in your code where possible.
>>>>
>>>> But FPC still ANSI and all read/write in FPC is ANSI. We can not
>>>> override these internal FPC's routines... and, some packages are used
>>>> by others non-GUI program and that not have dependencies to LCL.
>>>
>>> Sigh. I don't think you understood what I wrote.
>>
>> So explain better, please.
> I tried 2 times already, sorry. It's not important anyway.

Ok, no problem.

>> Sigh. I know.
>> I think you did not understand is:
>> Some packages do not work (internally) using UTF8 strings. I can
>> change all code to use UTF8 in all package and all projects... but
>> this would be a massive work.
> And my point is (unfortunately also... sigh...) that you cannot expect
> that to change magically.
> Conversion is going to be required unless and until... well I'll stop
> here, see the many U* threads on this list, the FPC one and fpc-devel.

Ok, thank you.

>>> As is customary whenever somebody mentions the magic "U" word, I'm sure
>>> you'll get a lot of responses....
>>
>> Sorry, I didn't understand... (sarcasm?)
> Yes, but not directed at you - at the reactions of this list whenever
> that word with U is mentioned ;)

Hm... Ok... thanks again.


Marcos Douglas

--
Hans-Peter Diettrich
2013-12-18 11:03:56 UTC
Permalink
Reinier Olislagers wrote:

> Apart from that there's not much else you can do except contribute
> patches to help "unicode-ise" the FPC RTL...

The new AnsiStrings (with Encoding and automatic conversion) should be
sufficient, Unicode is not required. In fact a move to a Unicode RTL
would require that either Lazarus is converted, too, or that 2 RTL
flavors (Ansi and Unicode) must be supported. Not a good idea, IMO.
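For reference, a minimal sketch of what those codepage-aware AnsiStrings look
like (as implemented in FPC trunk/2.7.1 and later; the type name is made up):

program CpAwareSketch;
{$mode objfpc}{$H+}
type
  TUtf8Str = type AnsiString(CP_UTF8);  // AnsiString with a declared code page
var
  U: TUtf8Str;
  W: UnicodeString;
begin
  W := 'hello';
  U := W;   // the RTL converts UTF-16 -> UTF-8 automatically
  W := U;   // ...and back
end.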

DoDi



--
Marco van de Voort
2013-12-22 18:58:14 UTC
Permalink
On Wed, Dec 18, 2013 at 12:03:56PM +0100, Hans-Peter Diettrich wrote:
> > Apart from that there's not much else you can do except contribute
> > patches to help "unicode-ise" the FPC RTL...
>
> The new AnsiStrings (with Encoding and automatic conversion) should be
> sufficient, Unicode is not required. In fact a move to a Unicode RTL
> would require that either Lazarus is converted, too, or that 2 RTL
> flavors (Ansi and Unicode) must be supported. Not a good idea, IMO.

Keeping UTF8 on Windows makes a majority platform seem only half supported.
Not good either. Worse, it is Delphi incompatible.


--
Hans-Peter Diettrich
2013-12-22 22:52:04 UTC
Permalink
Marco van de Voort wrote:

> Keeping UTF8 on Windows makes a majority platform seem only half supported.
> Not good either. Worse, it is Delphi incompatible.

You favor a special FPC and Lazarus for Windows, in addition to the
UTF-8 version for all other platforms?

DoDi


--
Marco van de Voort
2013-12-23 10:32:17 UTC
Permalink
On Sun, Dec 22, 2013 at 11:52:04PM +0100, Hans-Peter Diettrich wrote:
> > Keeping UTF8 on Windows makes a majority platform seem only half supported.
> > Not good either. Worse, it is Delphi incompatible.
>
> You favor a special FPC and Lazarus for Windows, in addition to the
> UTF-8 version for all other platforms?

IMHO the utf8 is not a done deal, and Delphi compatibility requires at least
also UTF16 on other platforms.

Qt is utf16, and so is Cocoa. Only GTK is utf8.

So I would say UTF16, and maybe, if there is demand, some can get utf8 :-)



--
Jürgen Hestermann
2013-12-23 17:52:21 UTC
Permalink
On 2013-12-23 11:32, Marco van de Voort wrote:
> So I would say UTF16, and maybe, if there is demand, some can get utf8 :-)

The question is:
Should FPC and LCL use a fixed encoding for all platforms
or should the encoding be adapted for each WidgetSet/OS?


If it should be the same for all platforms then it should be UTF8 IMO.
UTF16 is the most horrible decision (all bad things combined).
UTF32 would at least have the advantage of a fixed character size
but pays for this with *a lot* of memory consumption.
UTF8 has the lowest memory demand (in general) and good
backward compatibility.


On the other hand, adapting the string encoding for each
Widgetset/OS would be a can of worms IMO.
A lot of additional knowledge about strings is put on the programmer
because handling of strings has to be done differently depending on OS.
That would be a hazardous decision and would only be of use if programs
are written exclusively for one OS.
But FPC/Lazarus is meant to be portable so this should not be done.


--
Marco van de Voort
2013-12-23 22:08:01 UTC
Permalink
On Mon, Dec 23, 2013 at 06:52:21PM +0100, Jürgen Hestermann wrote:
> Am 2013-12-23 11:32, schrieb Marco van de Voort:
> > So I would say UTF16, and maybe, if there is demand, some can get utf8 :-)
>
> The question is:
> Should FPC and LCL use a fixed encoding for all platforms
> or should the encoding be adapted for each WidgetSet/OS?

Not necessarily. Supporting both on both platforms is a sane reason too.

One can't ditch utf16 because of Delphi compatibility. It will be hard to
ditch utf8 because of old Lazarus compatibility.

But if I have to choose to kill one, it is utf8. It is the lesser used choice
for unicode strings INSIDE APPLICATIONS. Yes, UTF8 is dominant in documents, but
not in APIs.

> If it should be the same for all platforms then it should be UTF8 IMO.
> UTF16 is the most horrible decision (all bad things combined).

For what? Most of the sentiments I hear are echoed discussions on the web
that are mostly about document encodings, NOT application internal
encodings.

However we

> UTF32 would at least have the advantage of fixed character size
> but pays this with *a lot* of memory consumption.

(it is not a fixed character size, but a fixed code point size)

> UTF8 has the lowest memory demand

Not according to 1 billion Chinese.

> (in general) and a good backward compatibility.

Hardly. Only for western languages, and even there conversions often go
wrong. That's why the whole BOM kludge became so important.

> On the other hand, adapting the string encoding for each
> Widgetset/OS would be a can of worms IMO.

If you feel that way, I think Delphi compatibility should prevail. Old
Lazarus code needs to be modified anyway.

Note that the language support for utf8 breaks down when you pass e.g. a
"string" to rawbytestring on Windows. (because it is converted to the
default 1-byte encoding, which is not utf8 in general).

As said, UTF8 on Windows is a crutch, and attempts to work around that move
Lazarus in the direction of "portability to everything as long as it is
unix" philosophies, a la Cygwin.

IMHO a bad direction. FPC has in general avoided having an outright
preference and IMHO should continue to do so.

> A lot of additional knowledge about strings is put on the programmer
> because handling of strings has to be done differently depending on OS.

It will anyway, even with utf8. Constructs that happen to work with Linux
will fail on Windows. Because on Windows the default 1-byte encoding is not
UTF8.

Moreover, I think people step over the Delphi compatibility card too easily.
Way, way, way too easily.

> But FPC/Lazarus is meant to be portable so this should not be done.

FPC/Lazarus is supposed to be portable, not an emulated Unix on everything.
Using another system's default encoding is emulation, not portability.

--
Hans-Peter Diettrich
2013-12-24 05:18:41 UTC
Permalink
Marco van de Voort wrote:
> On Mon, Dec 23, 2013 at 06:52:21PM +0100, Jürgen Hestermann wrote:
>> Am 2013-12-23 11:32, schrieb Marco van de Voort:
>> > So I would say UTF16, and maybe, if there is demand, some can get utf8 :-)
>>
>> The question is:
>> Should FPC and LCL use a fixed encoding for all platforms
>> or should the encoding be adapted for each WidgetSet/OS?
>
> Not necessarily. Supporting both on both platforms is a sane reason too.
>
> One can't ditch utf16 because of Delphi compatibility. It will be hard to
> ditch utf8 because of old Lazarus compatibility.

In the meantime we have 2 Delphi compiler/RTL versions:
- Ansi (Win32)
- Unicode (UTF-16, multi target)
and 4 GUI versions
- VCL Win32
- CLX
- VCL.NET
- FireMonkey
summing up to 8 versions in theory, and 3 versions in practice.

So what does "Delphi compatible" mean *really*?

The FPC compiler supports multiple targets, and most probably can be
managed to support both string types using the *same code base*
(maintenance issue!). IMO this does *not* apply to the libraries (RTL
and LCL) and existing applications, where Lazarus counts as the most
important and prominent application. We can be happy to have one single
LCL and IDE version, which is already incompatible due to the use of
UTF-8 strings instead of Ansi. Multiple versions, for compatibility with
the other Delphi combinations, are beyond *development capacities*.

This sheds a very different light on Delphi compatibility, meaning that
a Unicode LCL and IDE can *not* be supported in parallel to the existing
UTF-8 implementation. Dumping UTF-8 would discontinue support for the
*entire* range of *existing* LCL applications, i.e. lose all the
current Lazarus users :-(

So what should be the intended *audience* for a future Lazarus version?

IMO the biggest group are old fashioned Delphi (D7) users, which want
their existing Ansi/VCL code base supported *without* complications and
incompatibilities introduced by the newer Delphi versions. The subject
of this thread clearly indicates that UTF-8 is *not* a solution for this
group of users.

Another important user group is targeting mobiles, where time will tell
whether FM will ever succeed, or shares the fate of Kylix or VCL.NET.
IMO these should be happy already with fpGUI or mseGUI, no need to raise
another competitor in this area.


> But if I have to chose to kill one, it is utf8. It is the lesser used choice
> for unicode strings INSIDE APPLICATIONS. Yes, UTF8 is dominant in documents, but
> not in APIs.

That's my conclusion as well. But is that new audience worth abandoning
the entire existing Lazarus audience?

DoDi


--
Marcos Douglas
2013-12-24 14:22:41 UTC
Permalink
On Tue, Dec 24, 2013 at 3:18 AM, Hans-Peter Diettrich
<***@aol.com> wrote:
> Marco van de Voort schrieb:
>>
>> On Mon, Dec 23, 2013 at 06:52:21PM +0100, Jürgen Hestermann wrote:
>>
>>> Am 2013-12-23 11:32, schrieb Marco van de Voort:
>>> > So I would say UTF16, and maybe, if there is demand, some can get utf8
>>> :-)
>>>
>>> The question is:
>>> Should FPC and LCL use a fixed encoding for all platforms
>>> or should the encoding be adapted for each WidgetSet/OS?
>>
>>
>> Not necessarily. Supporting both on both platforms is a sane reason too.
>>
>> One can't ditch utf16 because of Delphi compatibility. It will be hard to
>> ditch utf8 because of old Lazarus compatibility.
>
>
> In the meantime we have 2 Delphi compiler/RTL versions:
> - Ansi (Win32)
> - Unicode (UTF-16, multi target)
> and 4 GUI versions
> - VCL Win32
> - CLX
> - VCL.NET
> - FireMonkey
> summing up to 8 versions in theory, and 3 versions in practice.
>
> So what does "Delphi compatible" mean *really*?
>
> The FPC compiler supports multiple targets, and most probably can be managed
> to support both string types using the *same code base* (maintenance
> issue!). IMO this does *not* apply to the libraries (RTL and LCL) and
> existing applications, where Lazarus counts as the most important and
> prominent application. We can be happy to have one single LCL and IDE
> version, which is already incompatible due to the use of UTF-8 strings
> instead of Ansi. Multiple versions, for compatibility with the other Delphi
> combinations, are beyond *development capacities*.
>
> This sheds a very different light on Delphi compatibility, meaning that a
> Unicode LCL and IDE can *not* be supported in parallel to the existing UTF-8
> implementation. Dumping UTF-8 would discontinue support for the *entire*
> range of *existing* LCL applications, i.e. loose all the current Lazarus
> users :-(
>
> So what should be the intended *audience* for a future Lazarus version?
>
> IMO the biggest group are old fashioned Delphi (D7) users, which want their
> existing Ansi/VCL code base supported *without* complications and
> incompatibilities introduced by the newer Delphi versions. The subject of
> this thread clearly indicates that UTF-8 is *not* a solution for this group
> of users.

I started this thread. My problem isn't using UTF-8 on Windows... my
problem is using different encodings in the same code, i.e. RTL <> LCL.

Having to always use functions to convert strings between the RTL and the
LCL, and vice versa, is IMHO wrong because the final code is confusing. In
a huge application you still need to think "is this UTF-8 or ANSI/UTF-16
here?"

> Another important user group is targeting mobiles, where time will tell
> whether FM will ever succeed, or shares the fate of Kylix or VCL.NET. IMO
> these should be happy already with fpGUI or mseGUI, no need to raise another
> competitor in this area.
>
>
>
>> But if I have to chose to kill one, it is utf8. It is the lesser used
>> choice
>> for unicode strings INSIDE APPLICATIONS. Yes, UTF8 is dominant in
>> documents, but
>> not in APIs.
>
>
> That's my conclusion as well. But is that new audience worth to abandon the
> entire existing Lazarus audience?

Of course nobody will abandon the entire existing Lazarus audience. If
the RTL becomes UTF-16, UTF-32, whatever, Lazarus will -- I think --
continue working with UTF-8.

Marcos Douglas

--
Jürgen Hestermann
2013-12-24 17:13:01 UTC
Permalink
On 24.12.2013 15:22, Marcos Douglas wrote:
> Having to always use functions to convert strings between the RTL and the
> LCL, and vice versa, is IMHO wrong because the final code is confusing. In
> a huge application you still need to think "is this UTF-8 or ANSI/UTF-16
> here?"

That's true.
It's a pain to pay attention to this.
All units used should use the same string encoding IMO.
But which?
I think that's the discussion in this thread.



> If the RTL becomes UTF-16, UTF-32, whatever, Lazarus will -- I think --
> continue working with UTF-8.

But that would be a real pain.
In a program it should be possible to use strings
without the need to convert back and forth between encodings.
So all strings from/to FPC and LCL routines should have the same encoding.


--
Marcos Douglas
2013-12-24 17:27:50 UTC
Permalink
On Tue, Dec 24, 2013 at 3:13 PM, Jürgen Hestermann
<***@gmx.de> wrote:
> Am 24.12.2013 15:22, schrieb Marcos Douglas:
>
>> Use functions, always, to convert string between RTL and LCL and
>> vice-versa IHMO is wrong because the final code is confusing. In a
>> huge application you still need to think "here is UTF-8 or
>> ANSI/UTF-16?"
>
> That's true.
> It's a pain to pay attention to this.

Someone agreed! :-)

> All units used should use the same string encoding IMO.
> But which?
> I think that's the discussion in this thread.

Yes, this is the major problem... ;-)

>> If the RTL will be UTF-16, UTF-32, whatever the Lazarus will continues --
>> I think -- working using UTF-8.
>
> But that would be a real pain.

It would not be... it IS a real pain today.

> In a program it should be possible to use strings
> without the need to convert back and forth between encodings.
> So all strings from/to FPC and LCL routines should have the same encoding.

This will depend only on the FPC team...

When I created this thread I was looking for a way to only minimize
this problem but...


Marcos Douglas

--
Graeme Geldenhuys
2013-12-25 10:12:20 UTC
Permalink
On 2013-12-25 10:05, Graeme Geldenhuys wrote:
>> But which?
>
> UTF-8 of course! It's the newest Unicode encoding that overcomes all
> problems found in other encodings.


This guy explains it very well.

https://www.youtube.com/watch?v=MijmeoH9LT4


Regards,
- Graeme -

--
fpGUI Toolkit - a cross-platform GUI toolkit using Free Pascal
http://fpgui.sourceforge.net/

--
Graeme Geldenhuys
2013-12-25 10:05:13 UTC
Permalink
On 2013-12-24 17:13, Jürgen Hestermann wrote:
> All units used should use the same string encoding IMO.
> But which?

UTF-8 of course! It's the newest Unicode encoding that overcomes all
problems found in other encodings. It is also the only Unicode encoding
that is backwards compatible with ASCII - hence the W3C and the rest of
the Internet etc standardised on it. It is also future proof and can
(again) be extended to the full (4 byte) range or to using 5 or 6 byte code
points [1]. Performance-wise, it is also NOT any slower than any of the
other Unicode encodings.

Probably the only reason UTF-16 is still being used is because of
Windows - which used to use UCS2, and moving to UTF-16 was easier at the
time (and I don't think UTF-8 existed at that point).



[1] A couple of years back they limited the range of UTF-8 so that it stays
compatible for now with the limited range of UTF-16. But the UTF-8
encoding can actually go all the way to 6 bytes per code point, which is
an absolutely massive range.


Regards,
- Graeme -

--
fpGUI Toolkit - a cross-platform GUI toolkit using Free Pascal
http://fpgui.sourceforge.net/

--
Hans-Peter Diettrich
2013-12-25 00:19:04 UTC
Permalink
Marcos Douglas wrote:
> On Tue, Dec 24, 2013 at 3:18 AM, Hans-Peter Diettrich
> <***@aol.com> wrote:

> I started this thread. My problem isn't to use UTF-8 on Windows... my
> problem is use different encodings on the same code, ie, RTL <> LCL.

This mix would cause problems, of course.

> Having to always use functions to convert strings between the RTL and the
> LCL, and vice versa, is IMHO wrong because the final code is confusing. In
> a huge application you still need to think "is this UTF-8 or ANSI/UTF-16
> here?"

The simplest (feasible) solution IMO is the adaptation of (OS...) string
types behind the scenes, i.e. inside the RTL and widgetsets. Then you can
have any common encoding in the application and library API, while
encoding-dependent code is encapsulated in lower level functions
receiving explicit (Unicode, UTF8String...) string types, so that the
compiler can insert required conversions. Such explicit parameter types
would also be required for legacy code, where a specific encoding is
assumed. I'm not sure how this conversion process can be automated or
supported; perhaps removing/renaming the traditional UTF8... functions
would help in spotting the procedures that require special attention.

The number of automatic conversions can be reduced in the next step, by
e.g. adding overrides, or conditional code, for both string types one by
one, as time permits.
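A minimal sketch of that encapsulation (hypothetical names, and assuming the
codepage-aware strings of FPC 2.7.1+ so the inserted conversions are correct):

program EncapsulationSketch;
{$mode objfpc}{$H+}

// the OS-facing routine fixes its encoding in the signature;
// encoding-dependent work stays inside it
procedure OsWriteName(const AName: UnicodeString);
begin
  // a real implementation would call the wide Windows API here
  WriteLn(Length(AName), ' UTF-16 code units received');
end;

var
  FileName: UTF8String;
begin
  FileName := 'report.txt';
  OsWriteName(FileName);  // the compiler inserts the UTF-8 -> UTF-16 conversion
end.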

DoDi


--
Marcos Douglas
2013-12-24 14:33:41 UTC
Permalink
On Tue, Dec 24, 2013 at 12:19 PM, Marco van de Voort <***@stack.nl> wrote:
> On Tue, Dec 24, 2013 at 06:18:41AM +0100, Hans-Peter Diettrich wrote:
>> >
>> > Not necessarily. Supporting both on both platforms is a sane reason too.
>> >
>> > One can't ditch utf16 because of Delphi compatibility. It will be hard to
>> > ditch utf8 because of old Lazarus compatibility.
>>
>> In the meantime we have 2 Delphi compiler/RTL versions:
>> - Ansi (Win32)
>> - Unicode (UTF-16, multi target)
>> and 4 GUI versions
>> - VCL Win32
>> - CLX
>> - VCL.NET
>> - FireMonkey
>> summing up to 8 versions in theory, and 3 versions in practice.
>
> The older delphi compilers are unsupported. We never supported anything but
> VCL 32/64, so this list seems artificially inflated to me.
>
>> So what does "Delphi compatible" mean *really*?
>
> The same as it always has. VCL, and language level at a distance. The rest
> is irrelevant.
>
>> The FPC compiler supports multiple targets, and most probably can be
>> managed to support both string types using the *same code base*
>> (maintenance issue!).
>
> Yes.
>
>> IMO this does *not* apply to the libraries (RTL
>> and LCL)
>
> RTL is less of a problem than one might think. The problem mostly only comes
> in at the classes level.
>
>>and existing applications, where Lazarus counts as the most
>> important and prominent application.
>
> Existing Lazarus applications are toast anyway, without changes.
>
>> We can be happy to have one single LCL and IDE version, which is already
>> incompatible due to the use of UTF-8 strings instead of Ansi. Multiple
>> versions, for compatibility with the other Delphi combinations, are beyond
>> *development capacities*.
>
> Then drop the old stuff, and simply go for full compatibility. Anything else
> will only cause the loss of all OSS Delphi projects (and even the commercial
> ones that support Lazarus).
>
> And people like me that are torn between both systems.
>
>> This sheds a very different light on Delphi compatibility, meaning that
>> a Unicode LCL and IDE can *not* be supported in parallel to the existing
>> UTF-8 implementation.
>
> There is no existing UTF8 implementation that can be continued as is anyway.
>
>> Dumping UTF-8 would discontinue support for the *entire* range of
>> *existing* LCL applications, i.e. lose all the
>> current Lazarus users :-(
>>
>> So what should be the intended *audience* for a future Lazarus version?
>
>> IMO the biggest group are old fashioned Delphi (D7) users, which want
>> their existing Ansi/VCL code base supported *without* complications and
>> incompatibilities introduced by the newer Delphi versions. The subject
>> of this thread clearly indicates that UTF-8 is *not* a solution for this
>> group of users.
>
> It was like that two years ago. But I see more and more people migrate to
> the unicode versions, and updating packages. The D7 base is eroding, and
> worse, many of its users are mostly hedging bets to keep their codebases
> running. Not to make new code. (and we need people that DO things)
>
> It's like with turbo pascal in the (1.0.x) past. Yes, the numbers are huge,
> but all they say is they want something 100% compatible to effortless keep
> their codebases running. But when the times come to actually _invest_ in
> the code again, they pick something that is at least halfwhat modern. And
> all you are stuck with is oldtimers and l33t tinkerers.
>
> That is the curse of supporting legacy targets, you can't do that forever
> without making yourself irrelevant.
>
> Keep in mind that any Lazarus solution in production use based on 2.8.x is
> years away. The current activity levels in that group will be even less. Our
> decisions must be aimed not at the situation now, but good for at least 5
> years.
>
>> Another important user group is targeting mobiles, where time will tell
>> whether FM will ever succeed, or shares the fate of Kylix or VCL.NET.
>
> Everywhere I see FM (Mobile plugin) buyers, I see existing Delphi users
> hoping for an easy conversion to mobile and a quick buck to tide them over
> the crisis. Not real go-getters that really go for mobile.
>
> That makes me think this is not sustainable.
>
> But Embarcadero is said to use it heavily internally, so they won't quickly
> kill it off, and I assume a certain kind of customers will adapt it.
>
> But IMHO for us it is irrelevant
>
>> IMO these should be happy already with fpGUI or mseGUI, no need to raise
>> another competitor in this area.
>
> I don't really see any adaptation there. Those teams and offerings are again
> a magnitude smaller than Lazarus, and for most of those users switching from
> Embacadero to Lazarus is already the biggest step they are willing to make.
>
>> > But if I have to chose to kill one, it is utf8. It is the lesser used choice
>> > for unicode strings INSIDE APPLICATIONS. Yes, UTF8 is dominant in documents, but
>> > not in APIs.
>>
>> That's my conclusion as well. But is that new audience worth to abandon
>> the entire existing Lazarus audience?
>
> I myself hope for the two tracks way. It satisfies multiple demands, and the
> extra work is offset by less rewriting from current Delphi sources and less
> discussion.
>
> But the prime point is that IMHO an utf8 Windows is insane, and it should be
> possible to port modern Delphi VCL apps at least to Windows. Preferably to
> all.

Sorry if I say something crazy, but what do you think about using UTF-16 in
{$mode delphi} and UTF-8 in {$mode fpc}?

Marcos Douglas

--
Sven Barth
2013-12-24 21:08:38 UTC
Permalink
On 24.12.2013 15:34, "Marcos Douglas" <***@delfire.net> wrote:
> Sorry if I say something crazy, but what do you think to use UTF-16 on
> {mode delphi} and UTF-8 in {mode fpc}?

That is already the case with mode delphiunicode. But the big problem is
classes and their inheritance. Take TStringList for example. Let's assume
it's declared with String=AnsiString and you override it in a unit with
String=UnicodeString; then you'll get problems with overloads/overrides,
because UnicodeString <> AnsiString.

The mode concept is all good and well, but here it breaks down... :(
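A minimal sketch of that clash, with the mode-dependent "string" written out
explicitly (hypothetical classes; the override line is exactly where the
compiler gives up, so this intentionally does not compile):

program OverrideClash;
{$mode objfpc}{$H+}
type
  TBaseList = class                 // think: compiled where string = AnsiString
    procedure Add(const S: AnsiString); virtual;
  end;

  TUnicodeList = class(TBaseList)
    // in a unit where string = UnicodeString the override becomes this, and it
    // fails: there is no ancestor method with a matching parameter type
    procedure Add(const S: UnicodeString); override;
  end;

procedure TBaseList.Add(const S: AnsiString); begin end;
procedure TUnicodeList.Add(const S: UnicodeString); begin end;

begin
end.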

Regards,
Sven
Marco van de Voort
2013-12-24 14:19:30 UTC
Permalink
On Tue, Dec 24, 2013 at 06:18:41AM +0100, Hans-Peter Diettrich wrote:
> >
> > Not necessarily. Supporting both on both platforms is a sane reason too.
> >
> > One can't ditch utf16 because of Delphi compatibility. It will be hard to
> > ditch utf8 because of old Lazarus compatibility.
>
> In the meantime we have 2 Delphi compiler/RTL versions:
> - Ansi (Win32)
> - Unicode (UTF-16, multi target)
> and 4 GUI versions
> - VCL Win32
> - CLX
> - VCL.NET
> - FireMonkey
> summing up to 8 versions in theory, and 3 versions in practice.

The older delphi compilers are unsupported. We never supported anything but
VCL 32/64, so this list seems artificially inflated to me.

> So what does "Delphi compatible" mean *really*?

The same as it always has. VCL, and language level at a distance. The rest
is irrelevant.

> The FPC compiler supports multiple targets, and most probably can be
> managed to support both string types using the *same code base*
> (maintenance issue!).

Yes.

> IMO this does *not* apply to the libraries (RTL
> and LCL)

RTL is less of a problem than one might think. The problem mostly only comes
in at the classes level.

>and existing applications, where Lazarus counts as the most
> important and prominent application.

Existing Lazarus applications are toast anyway, without changes.

> We can be happy to have one single LCL and IDE version, which is already
> incompatible due to the use of UTF-8 strings instead of Ansi. Multiple
> versions, for compatibility with the other Delphi combinations, are beyond
> *development capacities*.

Then drop the old stuff, and simply go for full compatibility. Anything else
will only cause the loss of all OSS Delphi projects (and even the commercial
ones that support Lazarus).

And people like me that are torn between both systems.

> This sheds a very different light on Delphi compatibility, meaning that
> a Unicode LCL and IDE can *not* be supported in parallel to the existing
> UTF-8 implementation.

There is no existing UTF8 implementation that can be continued as is anyway.

> Dumping UTF-8 would discontinue support for the *entire* range of
> *existing* LCL applications, i.e. lose all the
> current Lazarus users :-(
>
> So what should be the intended *audience* for a future Lazarus version?

> IMO the biggest group are old fashioned Delphi (D7) users, which want
> their existing Ansi/VCL code base supported *without* complications and
> incompatibilities introduced by the newer Delphi versions. The subject
> of this thread clearly indicates that UTF-8 is *not* a solution for this
> group of users.

It was like that two years ago. But I see more and more people migrate to
the unicode versions, and updating packages. The D7 base is eroding, and
worse, many of its users are mostly hedging bets to keep their codebases
running. Not to make new code. (and we need people that DO things)

It's like with turbo pascal in the (1.0.x) past. Yes, the numbers are huge,
but all they say is they want something 100% compatible to effortlessly keep
their codebases running. But when the time comes to actually _invest_ in
the code again, they pick something that is at least halfway modern. And
all you are stuck with is oldtimers and l33t tinkerers.

That is the curse of supporting legacy targets, you can't do that forever
without making yourself irrelevant.

Keep in mind that any Lazarus solution in production use based on 2.8.x is
years away. The current activity levels in that group will be even less. Our
decisions must be aimed not at the situation now, but good for at least 5
years.

> Another important user group is targeting mobiles, where time will tell
> whether FM will ever succeed, or shares the fate of Kylix or VCL.NET.

Everywhere I see FM (Mobile plugin) buyers, I see existing Delphi users
hoping for an easy conversion to mobile and a quick buck to tide them over
the crisis. Not real go-getters that really go for mobile.

That makes me think this is not sustainable.

But Embarcadero is said to use it heavily internally, so they won't quickly
kill it off, and I assume a certain kind of customer will adopt it.

But IMHO for us it is irrelevant

> IMO these should be happy already with fpGUI or mseGUI, no need to raise
> another competitor in this area.

I don't really see any adoption there. Those teams and offerings are again an
order of magnitude smaller than Lazarus, and for most of those users switching
from Embarcadero to Lazarus is already the biggest step they are willing to make.

> > But if I have to chose to kill one, it is utf8. It is the lesser used choice
> > for unicode strings INSIDE APPLICATIONS. Yes, UTF8 is dominant in documents, but
> > not in APIs.
>
> That's my conclusion as well. But is that new audience worth to abandon
> the entire existing Lazarus audience?

I myself hope for the two tracks way. It satisfies multiple demands, and the
extra work is offset by less rewriting from current Delphi sources and less
discussion.

But the prime point is that IMHO an utf8 Windows is insane, and it should be
possible to port modern Delphi VCL apps at least to Windows. Preferably to
all.

--
Jürgen Hestermann
2013-12-24 11:18:49 UTC
Permalink
On 2013-12-23 23:08, Marco van de Voort wrote:
> But if I have to chose to kill one, it is utf8. It is the lesser used choice
> for unicode strings INSIDE APPLICATIONS. Yes, UTF8 is dominant in documents, but
> not in APIs.

But in APIs it would not matter much to convert (in general the time for conversion
is negligible compared to the time that is needed for the rest around the API call).

I have written a file manager for Windows that can log and store millions of files in memory.
It uses the (UTF16) unicode API from Windows and converts the file names to UTF8 internally.
There exists another file manager which uses UTF16 internally and which can also log millions of files.
When logging the same source I can't see any difference in performance (even when logging
multiple times so that everything is cached!) although I have to convert and the other one does not.
But the memory footprints are very different.


>> UTF16 is the most horrible decision (all bad things combined).
> For what? Most of the sentiments I hear are echoed discussions on the web
> that are mostly about document encodings, NOT application internal
> encodings.

IMO this decision is based on the assumption to choose one encoding for everything.
So the same encoding is used *everywhere* as much as possible.
Then UTF8 is the best solution.
Why use UTF16/32? They cannot be treated the same as ancient ANSI strings either.
So what would be the reason behind it? Just wasting memory?


>> UTF8 has the lowest memory demand
> Not according to 1 billion Chinese.

How many of the strings stored and processed on a chinese computer are in chinese language?
A lot of the strings are still in english (HTML etc.).
So for asian countries the real memory demand is a mix and is not so easy to determine.
In most western countries UTF8 definitely uses less memory.


>> On the other hand, adapting the string encoding for each
>> Widgetset/OS would be a can of worms IMO.
> If you feel that way, I think Delphi compatibility should prevail.

Why this?
Free Pascal/Lazarus should fledge and not repeat all the bad decisions of Borland/Embarcadero/..


> Note that the language support for utf8 breaks down when you pass e.g. a
> "string" to rawbytestring on Windows. (because it is converted to the
> default 1-byte encoding, which is not utf8 in general).

I am not sure what you are talking about here.
For Windows I would use the unicode (UTF16) API interface exclusively and
convert it to UTF8 internally. From then on, everything should be UTF8.


> As said, UTF8 on Windows is a crutch, and attempts to workaround that moves
> Lazarus in the direction of "portability to everything as long as it is
> unix" philosophies, a la Cygwin.

For me the decision of what Unicode encoding should be used is primarily OS independent.
Just do the conversion once at the API interface level, but then use internally what was
decided to be "the best" (UTF8 IMO). Conversions seem to be unavoidable anyway.
So it is just a decision of where and when they take place.
And the API level is a good place IMO.
And when other OSes use the same encoding it is even better, but that is not the reason to choose one or the other.


>> A lot of additional knowledge about strings is put on the programmer
>> because handling of strings has to be done differently depending on OS.

No! That's just the aim: if *all* Free Pascal/Lazarus programmers can rely on having
UTF8 in all cases then you only need to handle UTF8 strings.
No IFDEFS to handle UTF16 on Windows and UTF8 on Linux.
The same code just works on *all* platforms!


> Constructs that happen to work with Linux will fail on Windows.
> Because on Windows the default 1-byte encoding is not UTF8.

The ANSI interface should not be used anymore. It is obsolete and only needed
for ancient OSes like DOS. But programmers should not be encouraged to use it
on modern platforms. Just use UTF8 *everywhere*. That should be the aim IMO.


--
Marco van de Voort
2013-12-24 14:26:31 UTC
Permalink
On Tue, Dec 24, 2013 at 12:18:49PM +0100, Jürgen Hestermann wrote:
> > But if I have to chose to kill one, it is utf8. It is the lesser used choice
> > for unicode strings INSIDE APPLICATIONS. Yes, UTF8 is dominant in documents, but
> > not in APIs.
>
> But in APIs it would not matter much to convert (in general the time for
> conversion is negligible compared to the time that is needed for the rest
> around the API call).

Maybe. (less so for fine grained structures, e.g. think a virtual stringgrid
with many nodes). But that is not the point, it makes that alien.

> I have written a file manager for Windows that can log and store millions
> of files in memory. It uses the (UTF16) unicode API from Windows and
> converts the file names as UTF8 internally. There exists another file
> manager who uses UTF16 internally too which can also log millions of
> files.

That's because the cost is hidden by slow harddisk movement. Directory
scanning is about the slowest operation one can do on the same computer.

> >> UTF16 is the most horrible decision (all bad things combined).
> > For what? Most of the sentiments I hear are echoed discussions on the web
> > that are mostly about document encodings, NOT application internal
> > encodings.
>
> IMO this decision is based on the assumption to choose one encoding for everything.

Well, that is a wrong (and arbitrary) assumption then. If you want to play
make-believe and reorganize the world top-down, be my guest, but leave
practical design matters in the current world out of it. We simply have to
live in the world we live in, not the world that could have been.

Document encodings are for end users, programming
interfaces are for programmers. Lazarus users are programmers, except for
source encoding, which IMHO can happily remain utf8 (just like Delphi
btw).

> How many of the strings stored and processed on a chinese computer are in
> chinese language? A lot of the strings are still in english (HTML etc.).

Tags are. Text isn't. Depends on the webpage, but we are not talking about
webpages. It was merely to point out that your size criterion is random and
arbitrary. It actually only matters for European languages.

> >> On the other hand, adapting the string encoding for each
> >> Widgetset/OS would be a can of worms IMO.
> > If you feel that way, I think Delphi compatibility should prevail.
>
> Why this? Free Pascal/Lazarus should fledge and not repeat all the bad
> decissions of Borland/Embarcadero/..

Because most of users convert, and because having a clear agreed standard to work
against is beneficial.

We have played this game many times before, and the only thing the Delphi
naysayers seem to agree upon is saying nay to Delphi compatibility. As for
what else there should be, there are as many opinions as people. (and
often more, e.g. with work and private hats on)

... running out of time, got a train to catch. Will continue later.

--
Reinier Olislagers
2013-12-24 15:21:45 UTC
Permalink
<let's say irony>
Well (not directed at you, Marco), this thread has certainly exceeded my
expectations. Thanks, very enjoyable!
Also nice that it's so similar to the same one last year (on the fpc
list, I think).
</let's say irony>

<lame joke>
Anyway, I'll go massage a reindeer now...
</lame joke>

<serious>
Best wishes,
Reinier
</serious>


--
Marcos Douglas
2013-12-24 17:35:14 UTC
Permalink
On Tue, Dec 24, 2013 at 1:21 PM, Reinier Olislagers
<***@gmail.com> wrote:
> <let's say irony>
> Well (not directed at you, Marco), this thread has certainly exceeded my
> expectations. Thanks, very enjoyable!
> Also nice that it's so similar to the same one last year (on the fpc
> list, I think).
> </let's say irony>
>
> <lame joke>
> Anyway, I'll go massage a reindeer now...
> </lame joke>
>
> <serious>
> Best wishes,
> Reinier
> </serious>

<curious>
So the answer for these problems is?
</curious>

Sorry, but it is a real problem. No answer has been presented so far and I
have many systems coded using Lazarus... this is important for me and,
I think, for many Windows programmers.

Why do these questions bother you?


Best wishes,
Marcos Douglas

--
Juha Manninen
2013-12-24 21:02:32 UTC
Permalink
On Tue, Dec 24, 2013 at 7:35 PM, Marcos Douglas <***@delfire.net> wrote:
> So the answer for these problems is?

There is no answer. No, actually there are many answers but they all
have problems. If you want your code to also compile with Delphi, then
you have even more answers and problems. :)

> Why these questions bother you?

The same questions and debate have continued for a long time, but there are no
good answers yet.
Unfortunately this is a transition period. Things will become smoother
when time passes.

Right now, with FPC/Lazarus, you must use UTF-8 with the specialized
UTF-8 functions, and convert when needed. Yes, I know this solution
has issues.

Juha

--
Hans-Peter Diettrich
2013-12-25 00:40:50 UTC
Permalink
Reinier Olislagers wrote:

> <serious>
> Best wishes,
> Reinier
> </serious>

My best wishes to everybody, too :-)

DoDi


--
Jürgen Hestermann
2013-12-24 17:39:32 UTC
Permalink
On 24.12.2013 15:26, Marco van de Voort wrote:
>> But in APIs it would not matter much to convert (in general the time for
>> conversion is negligible compared to the time that is needed for the rest
>> around the API call).
> Maybe. (less so for fine grained structures, e.g. think a virtual stringgrid
> with many nodes).

Well, I use VirtualStringGrid with millions of nodes in my program
and it works like a charm (with UTF8) on Windows.


>> I have written a file manager for Windows that can log and store millions
>> of files in memory. It uses the (UTF16) unicode API from Windows and
>> converts the file names to UTF8 internally. There exists another file
>> manager which uses UTF16 internally and which can also log millions of
>> files.
> That's because the cost is hidden by slow harddisk movement.

When I log my whole hard disk multiple times the logging comes
totally from memory.
There is no hard disk movement at all.
Still there is no difference.


>> IMO this decision is based on the assumption to choose one encoding
>> for everything.
> Well, that is a wrong (and arbitrary) assumption then. If you want to play
> make-believe and reorganize the world top-down, be my guest, but leave
> practical design matters in the current world out of it. We simply have to
> live in the world we live in, not the world that could have been.

It is the world of exactly such thinking.
People are creating this.
It's not a law of nature.


> Document encodings are for end users, programming
> interfaces are for programmers. Lazarus users are programmers, except for
> sourceencoding which IMHO can remain utf8 just happily (just like Delphi
> btw)

That's nonsense.
The *need* to convert between encodings should be restricted to just
that: Only when it is needed.
We should never force people (neither programmers nor users nor anyone)
to juggle encodings.
The aim should be to *reduce* conversions not to increase them.



--
Hans-Peter Diettrich
2013-12-25 00:36:43 UTC
Permalink
Jürgen Hestermann wrote:

> The ANSI interface should not be used anymore. It is obsolete and only
> needed for ancient OSes like DOS. But programmers should not be
> encouraged to use it on modern platforms. Just use UTF8 *everywhere*.
> That should be the aim IMO.

Whenever the encoding matters, most users and applications are best off
with their regional Ansi encoding - all used characters are single
bytes. UTF-16 extends the range of languages whose characters can be
assumed to have a fixed size, i.e. all character sets in the BMP. Such
fixed-size characters IMO are on the top of the wishlist of most users,
so that none of them ever will be happy with UTF-8. Certainly UTF-8 was
the best choice when Delphi (and FPC) did not have native UTF-16
strings, but when we have Unicode strings, now or soon, it should be
dropped.

DoDi


--
Graeme Geldenhuys
2013-12-25 10:17:53 UTC
Permalink
On 2013-12-25 10:03, Jürgen Hestermann wrote:
> So UTF16 has all drawbacks of all encodings but no benefit (except
> that this awful decision is used by Windows).

+1


Regards,
- Graeme -

--
fpGUI Toolkit - a cross-platform GUI toolkit using Free Pascal
http://fpgui.sourceforge.net/

--
Jürgen Hestermann
2013-12-25 10:03:55 UTC
Permalink
On 2013-12-25 01:36, Hans-Peter Diettrich wrote:
> Whenever the encoding matters, most users and applications are best off
> with their regional Ansi encoding - all used characters are single bytes.

You forget that using ANSI API functions on Windows not only has the drawback
that you cannot access all files (which have unicode characters in them)
but also that there is the limit of 255 characters for the path length
(while unicode API functions allow up to 32k characters).
So you run into problems in 2 cases:

1.) if strings (file names) contain non-ANSI unicode characters
2.) if paths are longer than 255 characters

Do you really advise people nowadays to restrict their programs so much by using ANSI API functions?
I wouldn't. I was always wondering why so many programs fail with these 2 limitations on
Windows after an alternative has been available for such a long time.
Now you want to extend this time by yet another generation of programmers.
That's not good. Hopefully not too many programmers follow this road...

> UTF-16 extends the range of languages whose characters can be assumed to have a fixed size,

That's not true.
You still cannot rely on a fixed number of bytes per character in UTF16 either.
Also, UTF8 does not have the BOM problem that UTF16 and UTF32 have.
So UTF16 has all the drawbacks of all encodings but no benefit (except that this awful decision is used by Windows).


--
Hans-Peter Diettrich
2013-12-25 10:34:49 UTC
Permalink
Jürgen Hestermann schrieb:
> Am 2013-12-25 01:36, schrieb Hans-Peter Diettrich:
> > Whenever the encoding matters, most users and applications are best off
> > with their regional Ansi encoding - all used characters are single bytes.
>
> You forget that using ANSI API functions on Windows not only has the
> drawback
> that you cannot access all files (which have unicode characters in them)
> but also that there is the limit of 255 characters for the path length
> (while unicode API functions allow up to 32k characters).

For that purpose (file names) I vote for a dedicated string type that
matches the target platform requirements. Then the user does not have to look
at filenames on a per-character basis.
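
A minimal sketch of that idea (the type name and the platform mapping below are purely illustrative, not an existing FPC declaration):

type
  {$ifdef Windows}
  TNativeFileName = UnicodeString;   // matches the UTF-16 Windows (W) API
  {$else}
  TNativeFileName = AnsiString;      // byte string, as the POSIX file APIs expect
  {$endif}

File routines would then take and return TNativeFileName, so callers never have to inspect the name character by character.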


> Do you realy advice people nowadays to restrict their programs so far by
> using ANSI API functions?

How many users have to use API functions, which are bound to a single
platform? And which of these do not understand how to handle strings of
whatever encoding?

DoDi


--
Mattias Gaertner
2013-12-16 10:41:08 UTC
Permalink
On Sun, 15 Dec 2013 17:47:09 -0200
Marcos Douglas <***@delfire.net> wrote:

> On Sun, Dec 15, 2013 at 3:13 PM, Reinier Olislagers
> <***@gmail.com> wrote:
> > On 15/12/2013 16:25, Marcos Douglas wrote:
> >> On Sun, Dec 15, 2013 at 1:08 PM, Bart <***@gmail.com> wrote:
> >>> So for basic stuff all this is already taken care of.
> >>
> >> Only in Lazarus' context... but I have some components that is only
> >> FPC's context.
> >> These components do not use Lazarus' routines and that is the BIG
> >> problem. I need to "remember" in pass only ANSI strings for these
> >> components as remember to convert the component's output string
> >> results to use in Lazarus.
> >
> > Why not just include a project reference to LCLBase (IIRC that should be
> > enough) and just always use the LCL units until FPC catches up?
>
> You propose include LCL in packages that not have LCL references and
> change all code of these packages to use UTF8 functions??

Note:
The Lazarus UTF8 functions are in the package LazUtils.
You don't need to use the LCL (nor LCLBase) if you only need the
UTF8 functions.
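
For example (a small sketch; it assumes the LazUTF8 unit from LazUtils, which provides conversion helpers such as UTF8ToSys and SysToUTF8, and the variable names are placeholders):

uses LazUTF8;   // LazUtils package, no LCL needed
...
// pass a UTF-8 string to an ANSI-based FPC routine and convert the result back
AnsiName := UTF8ToSys(Utf8Name);
Utf8Name := SysToUTF8(AnsiName);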

Mattias


--
Marcos Douglas
2013-12-16 11:22:54 UTC
Permalink
On Mon, Dec 16, 2013 at 7:41 AM, Mattias Gaertner
<nc-***@netcologne.de> wrote:
> On Sun, 15 Dec 2013 17:47:09 -0200
> Marcos Douglas <***@delfire.net> wrote:
>
>> On Sun, Dec 15, 2013 at 3:13 PM, Reinier Olislagers
>> <***@gmail.com> wrote:
>> > On 15/12/2013 16:25, Marcos Douglas wrote:
>> >> On Sun, Dec 15, 2013 at 1:08 PM, Bart <***@gmail.com> wrote:
>> >>> So for basic stuff all this is already taken care of.
>> >>
>> >> Only in Lazarus' context... but I have some components that is only
>> >> FPC's context.
>> >> These components do not use Lazarus' routines and that is the BIG
>> >> problem. I need to "remember" in pass only ANSI strings for these
>> >> components as remember to convert the component's output string
>> >> results to use in Lazarus.
>> >
>> > Why not just include a project reference to LCLBase (IIRC that should be
>> > enough) and just always use the LCL units until FPC catches up?
>>
>> You propose include LCL in packages that not have LCL references and
>> change all code of these packages to use UTF8 functions??
>
> Note:
> The Lazarus UTF8 functions are in the package LazUtils.
> You don't need to use the LCL (nor LCLBase) if you only need the
> UTF8 functions.
>
> Mattias

You're right.
IMHO, the real problem is not using LCL/LazUtils but having multiple
calls to SysToUTF8/UTF8ToSys all over the code.

Question:
After the new FPC version is released, how will Lazarus work together
with "FPC Unicode"?


Marcos Douglas

--
Mattias Gaertner
2013-12-16 12:28:47 UTC
Permalink
On Mon, 16 Dec 2013 09:22:54 -0200
Marcos Douglas <***@delfire.net> wrote:

> On Mon, Dec 16, 2013 at 7:41 AM, Mattias Gaertner
>[...]
> > Note:
> > The Lazarus UTF8 functions are in the package LazUtils.
> > You don't need to use the LCL (nor LCLBase) if you only need the
> > UTF8 functions.
> >
> > Mattias
>
> You're right.
> IMHO, the real problem is not to use LCL/LazUtils but having multiple
> calls to SysToUTF8..UTF8ToSys in all code.

The problem is not specific to the LCL. The problem is using libraries
with different encodings.


> Question:
> After the new FPC version is released, how Lazarus will work together
> "FPC Unicode"?

What do you mean by "FPC Unicode"? The new compiler feature
of strings with encoding and various changes to the RTL, or the idea to
release an FPC flavor with dotted unit names and String=UnicodeString?


Mattias

--
Marcos Douglas
2013-12-16 15:43:41 UTC
Permalink
On Mon, Dec 16, 2013 at 9:28 AM, Mattias Gaertner
<nc-***@netcologne.de> wrote:
> On Mon, 16 Dec 2013 09:22:54 -0200
> Marcos Douglas <***@delfire.net> wrote:
>
>> On Mon, Dec 16, 2013 at 7:41 AM, Mattias Gaertner
>>[...]
>> > Note:
>> > The Lazarus UTF8 functions are in the package LazUtils.
>> > You don't need to use the LCL (nor LCLBase) if you only need the
>> > UTF8 functions.
>> >
>> > Mattias
>>
>> You're right.
>> IMHO, the real problem is not to use LCL/LazUtils but having multiple
>> calls to SysToUTF8..UTF8ToSys in all code.
>
> The problem is not specific to the LCL. The problem is using libraries
> with different encodings.

Of course!
But I'm using Lazarus and... LCL... so I'm searching for a better way to
continue programming without worrying about these problems -- or at least
to reduce them.

Just to clarify: I'm not judging the LCL or anything. I'm only
searching for a solution to my problems. If I can help the Lazarus team
in the near future, that would be great.

>> Question:
>> After the new FPC version is released, how Lazarus will work together
>> "FPC Unicode"?
>
> What do you mean with "FPC Unicode". The new compiler feature
> of strings with encoding and various changes to the RTL or the idea to
> release a FPC flavor with dotted unitnames and String=UnicodeString?

I mean "new compiler feature of strings with encoding and various
changes to the RTL".
Lazarus will change something? Will use "FPC new feature of strings"
or continues using UTF8?

Best regards,
Marcos Douglas

--
Mattias Gaertner
2013-12-16 17:22:13 UTC
Permalink
On Mon, 16 Dec 2013 13:43:41 -0200
Marcos Douglas <***@delfire.net> wrote:

>[...]
> > The problem is not specific to the LCL. The problem is using libraries
> > with different encodings.
>
> Of course!
> But I'm using Lazarus and... LCL... so, I'm searching a better way to
> continue programming without concern about these problems -- at least
> decrease.

You are not alone.


> Just for clarify: I'm not judging the LCL or whatever. I'm only
> searching a solution to my problems. If I can help the Lazarus team,
> in a near future, would be great.

You are welcome.
The LCL uses one encoding on all platforms, while still using native
widgetsets and native file handles. That was already a big step to
decrease the amount of conversions. And many packages using the LCL
followed that approach. That further decreased conversions.
The big missing piece in the puzzle is the RTL. Now FPC has extended
'string' and is adapting the RTL for unicode.
Of course this does not magically solve the general problem. All
libraries must be adapted including the LCL.


> >> Question:
> >> After the new FPC version is released, how Lazarus will work together
> >> "FPC Unicode"?
> >
> > What do you mean with "FPC Unicode". The new compiler feature
> > of strings with encoding and various changes to the RTL or the idea to
> > release a FPC flavor with dotted unitnames and String=UnicodeString?
>
> I mean "new compiler feature of strings with encoding and various
> changes to the RTL".
> Lazarus will change something? Will use "FPC new feature of strings"
> or continues using UTF8?

One of the new FPC strings is UTF8String. So for compatibility the
first step is to use that. That is not yet complete.

Mattias

--
Marcos Douglas
2013-12-17 01:41:34 UTC
Permalink
On Mon, Dec 16, 2013 at 3:22 PM, Mattias Gaertner
<nc-***@netcologne.de> wrote:
> On Mon, 16 Dec 2013 13:43:41 -0200
> Marcos Douglas <***@delfire.net> wrote:
>
>>[...]
>> > The problem is not specific to the LCL. The problem is using libraries
>> > with different encodings.
>>
>> Of course!
>> But I'm using Lazarus and... LCL... so, I'm searching a better way to
>> continue programming without concern about these problems -- at least
>> decrease.
>
> You are not alone.
>
>
>> Just for clarify: I'm not judging the LCL or whatever. I'm only
>> searching a solution to my problems. If I can help the Lazarus team,
>> in a near future, would be great.
>
> You are welcome.
> The LCL uses one encoding on all platforms, while still using native
> widgetsets and native file handles. That was already a big step to
> decrease the amount of conversions. And many packages using the LCL
> followed that approach. That further decreased conversions.
> The big missing piece in the puzzle is the RTL. Now FPC has extended
> 'string' and is adapting the RTL for unicode.

I think using one encoding on all platforms is good; however, Windows
uses UTF-16. All strings from/to Windows need to be converted, right?
Is this not a penalty for the Windows platform?
Which encoding did the FPC team choose to use on Windows in the next release of
the compiler, UTF-16?

> Of course this does not magically solve the general problem. All
> libraries must be adapted including the LCL.

Of course.

>> >> Question:
>> >> After the new FPC version is released, how Lazarus will work together
>> >> "FPC Unicode"?
>> >
>> > What do you mean with "FPC Unicode". The new compiler feature
>> > of strings with encoding and various changes to the RTL or the idea to
>> > release a FPC flavor with dotted unitnames and String=UnicodeString?
>>
>> I mean "new compiler feature of strings with encoding and various
>> changes to the RTL".
>> Lazarus will change something? Will use "FPC new feature of strings"
>> or continues using UTF8?
>
> One of the new FPC strings is UTF8String. So for compatibility the
> first step is to use that. That is not yet complete.

Lazarus uses only 'string', not UTF8String, UnicodeString,
RawByteString, etc. Will this change?

Marcos Douglas

--
Mattias Gaertner
2013-12-17 08:23:52 UTC
Permalink
On Mon, 16 Dec 2013 23:41:34 -0200
Marcos Douglas <***@delfire.net> wrote:

>[...]
> I think use one encoding on all platforms is good, however Windows
> uses UTF-16. All string from/to Windows needs to be converted, right?
> Is this not a penalty for Windows platform?

The Windows API is UTF-16, most text files and databases are not. You
will always have some conversions. In most cases the conversion is
hardly measurable, but there are cases where UTF-16 is better and cases
where UTF-8 is better. And in some cases even UTF-32 is better.
The LCL is a graphical library, so string speed hardly matters.


> What coding FPC's team chose to use on Windows in the next release of
> the compiler, UTF-16?

I'm not a member of the FPC team, but afaik UTF-16 would be a big
incompatibility, so the next release will still use system codepage
strings for Windows. Maybe eventually another flavor with UTF-16.


>[...]
> Lazarus uses only 'string', not UTF8String, UnicodeString,
> RawByteString, etc. This will change?

Lazarus used only string (AnsiString), now it has to be more specific
at some places to avoid unnecessary conversions by the compiler.

Mattias

--
Mattias Gaertner
2013-12-17 23:45:26 UTC
Permalink
On Tue, 17 Dec 2013 21:41:09 -0200
Marcos Douglas <***@delfire.net> wrote:

>[...]
> So, nothing will change for Lazarus users after the new FPC release.

I do hope that something will change after a new FPC release. :)

Mattias

--
Marcos Douglas
2013-12-18 01:16:55 UTC
Permalink
On Tue, Dec 17, 2013 at 9:45 PM, Mattias Gaertner
<nc-***@netcologne.de> wrote:
> On Tue, 17 Dec 2013 21:41:09 -0200
> Marcos Douglas <***@delfire.net> wrote:
>
>>[...]
>> So, nothing will change for Lazarus users after the new FPC release.
>
> I do hope that something will change after a new FPC release. :)

Touché :)

Marcos Douglas

--
Marcos Douglas
2013-12-18 01:52:49 UTC
Permalink
On Tue, Dec 17, 2013 at 11:16 PM, Marcos Douglas <***@delfire.net> wrote:
> On Tue, Dec 17, 2013 at 9:45 PM, Mattias Gaertner
> <nc-***@netcologne.de> wrote:
>> On Tue, 17 Dec 2013 21:41:09 -0200
>> Marcos Douglas <***@delfire.net> wrote:
>>
>>>[...]
>>> So, nothing will change for Lazarus users after the new FPC release.
>>
>> I do hope that something will change after a new FPC release. :)
>
> Touché :)

I would like to understand: why do Java, .Net and others use UTF-16 as
their default encoding, and why did the Lazarus team choose UTF-8?

Thanks,
Marcos Douglas

--
Martin Schreiber
2013-12-18 06:56:38 UTC
Permalink
On Wednesday 18 December 2013 02:52:49 Marcos Douglas wrote:
>
> I would like to understand: Why Java, .Net and others use UTF-16 as
> default encode and why Lazarus team chose UTF-8?
>
One reason is that Free Pascal did not support utf-16 at the time Unicode
became urgent and later the FPC implementation of utf-16 strings was buggy.
So there was no other choice than to use utf-8 in standard FPC 8 bit strings.

Martin

--
Michael Van Canneyt
2013-12-18 08:05:26 UTC
Permalink
On Tue, 17 Dec 2013, Marcos Douglas wrote:

> On Tue, Dec 17, 2013 at 11:16 PM, Marcos Douglas <***@delfire.net> wrote:
>> On Tue, Dec 17, 2013 at 9:45 PM, Mattias Gaertner
>> <nc-***@netcologne.de> wrote:
>>> On Tue, 17 Dec 2013 21:41:09 -0200
>>> Marcos Douglas <***@delfire.net> wrote:
>>>
>>>> [...]
>>>> So, nothing will change for Lazarus users after the new FPC release.
>>>
>>> I do hope that something will change after a new FPC release. :)
>>
>> Touché :)
>
> I would like to understand: Why Java, .Net and others use UTF-16 as
> default encode and why Lazarus team chose UTF-8?

The impact of switching to UTF-8 is less when you care about backwards compatibility.

Michael.
Marcos Douglas
2013-12-18 15:11:55 UTC
Permalink
On Wed, Dec 18, 2013 at 3:56 AM, Martin Schreiber <***@gmail.com> wrote:
> On Wednesday 18 December 2013 02:52:49 Marcos Douglas wrote:
>>
>> I would like to understand: Why Java, .Net and others use UTF-16 as
>> default encode and why Lazarus team chose UTF-8?
>>
> One reason is that Free Pascal did not support utf-16 at the time Unicode
> became urgent and later the FPC implementation of utf-16 strings was buggy.
> So there was no other choice than to use utf-8 in standard FPC 8 bit strings.



On Wed, Dec 18, 2013 at 5:05 AM, Michael Van Canneyt
<***@freepascal.org> wrote:
>
> On Tue, 17 Dec 2013, Marcos Douglas wrote:
>
>> On Tue, Dec 17, 2013 at 11:16 PM, Marcos Douglas <***@delfire.net> wrote:
>>>
>> [...]
>>
>> I would like to understand: Why Java, .Net and others use UTF-16 as
>> default encode and why Lazarus team chose UTF-8?
>
> The impact of switching to UTF-8 is less when you care about backwards
> compatibility.

Now I understand. Thank you Martin and Michael .

Marcos Douglas

--
Martin Schreiber
2013-12-18 17:24:35 UTC
Permalink
On Wednesday 18 December 2013 09:05:26 Michael Van Canneyt wrote:
> >
> > I would like to understand: Why Java, .Net and others use UTF-16 as
> > default encode and why Lazarus team chose UTF-8?
>
> The impact of switching to UTF-8 is less when you care about backwards
> compatibility.
>
I do not quite agree. Moving from a single byte encoding to utf-8 breaks many
non-English applications which use character indexing.
MSElang supports string8 (utf-8), string16 (utf-16), string32 (UCS4) and
bytestring (any 8 bit encoding and/or binary data).
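
A small illustration of the character-index problem (a sketch only; it assumes the source file is saved as UTF-8 and the literal is arbitrary):

program CharIndexDemo;
// without a {$codepage} directive, FPC 2.6.x stores the literal's bytes unchanged
var
  s: string;
begin
  s := 'Grüße';          // 5 characters, but 7 bytes in UTF-8
  WriteLn(Length(s));    // prints 7: Length counts bytes, not characters
  // s[3] is now only the first byte of the two-byte sequence for 'ü';
  // code that treats s[i] as "one character" silently breaks
end.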

Martin

--
Juha Manninen
2013-12-18 09:48:52 UTC
Permalink
On Wed, Dec 18, 2013 at 3:52 AM, Marcos Douglas <***@delfire.net> wrote:
> I would like to understand: Why Java, .Net and others use UTF-16 as
> default encode and why Lazarus team chose UTF-8?

... and don't forget Windows API.
I believe the decision was made by people who didn't know the issue
well enough, and some decision had to be pushed quickly.
I am slowly learning the issues around Unicode. UTF-8 seems to be the
best encoding for most purposes.
The only benefit of UTF-16 was supposed to be its fixed-size character
length, but in the end that did not happen: all the characters in the world did
not fit into the 16-bit space.
It means UTF-16 wastes space without any real benefits. Only UTF-32
would bring the fixed-size character benefit.
What's more, UTF-16 is confusing because it has variations. It all is
well explained here:
http://www.utf8everywhere.org/

At my work we must switch to Unicode but the details of how to do it
are still open. The code now works with both Delphi and FPC.
There is a highly optimized DB engine where most data fits in a cache
at run-time, making it lightning fast. UTF-16 would almost double the
space requirement and thus is out of the question. The core parts must use
UTF-8 anyway. One choice is to dump Delphi completely and use
FPC+Lazarus for everything. Let's see...

Juha

--
Marcos Douglas
2013-12-18 15:19:22 UTC
Permalink
On Wed, Dec 18, 2013 at 6:48 AM, Juha Manninen
<***@gmail.com> wrote:
> On Wed, Dec 18, 2013 at 3:52 AM, Marcos Douglas <***@delfire.net> wrote:
>> I would like to understand: Why Java, .Net and others use UTF-16 as
>> default encode and why Lazarus team chose UTF-8?
>
> ... and don't forget Windows API.
> I believe the decision was made by people who didn't know the issue
> well enough, and some decision had to be pushed quickly.
> I am slowly learning the issues around Unicode. UTF-8 seems to be the
> best encoding for most purposes.
> The only benefit of UTF-16 was supposed to be its fixed size character
> length, but finally it did not happen. All characters in the world did
> not fit into 16-bit space.
> It means UTF-16 wastes space without any real benefits. Only UTF-32
> would bring the fixed size character benefit.
> What more, UTF-16 is confusing because it has variations. It all is
> well explained here:
> http://www.utf8everywhere.org/

I will read, thanks.

> At my work we must switch to Unicode but the details of how to do it
> are still open. The code now works with both Delphi and FPC.
> There is a highly optimized DB engine where most data fits in a cache
> at run-time making it lightning fast. UTF-16 would almost double the
> space requirement and thus is out of question. The core parts must use
> UTF-8 anyway. One choice is to dump Delphi completely and use
> FPC+Lazarus for everything. Lets see...

Here too, more or less... I'm thinking of switching all my own packages to UTF-8.
But, in your code, how do you work in Delphi -- or with Lazarus on
Windows -- using your core parts? Are there many calls to
SysToUTF8 and/or UTF8ToSys from the core to Windows?

Thanks,
Marcos Douglas

--
Martin Schreiber
2013-12-18 17:29:04 UTC
Permalink
On Wednesday 18 December 2013 16:19:22 Marcos Douglas wrote:

> > What more, UTF-16 is confusing because it has variations. It all is
> > well explained here:
> > http://www.utf8everywhere.org/
>
> I will read, thanks.
>
Read it with a grain of salt. ;-)
It does not consider all aspects of the matter.

Martin

--
Marcos Douglas
2013-12-18 19:53:29 UTC
Permalink
On Wed, Dec 18, 2013 at 2:29 PM, Martin Schreiber <***@gmail.com> wrote:
> On Wednesday 18 December 2013 16:19:22 Marcos Douglas wrote:
>
>> > What more, UTF-16 is confusing because it has variations. It all is
>> > well explained here:
>> > http://www.utf8everywhere.org/
>>
>> I will read, thanks.
>>
> Read it with a grain of salt. ;-)
> It does not consider all aspects of the matter.

Thanks for this tip. :)

Marcos Douglas

--
Juha Manninen
2013-12-18 18:25:45 UTC
Permalink
On Wed, Dec 18, 2013 at 5:19 PM, Marcos Douglas <***@delfire.net> wrote:
> Here too, more or less... I'm thinking to switch all own packages to UTF-8.
> But, in your codes, how do you works on Delphi -- or with Lazarus on
> Windows -- using your core parts? There are many calls from/to
> SysToUTF8 and/or UTF8ToSys from core to Windows?

If you need to call WinAPI, then you must convert obviously.
In our case API calls are not needed by the core program. It is
cross-platform code. However using a new Unicode-Delphi would cause
many problems because all VCL functions and classes, including
TStringList, expect UTF-16 string. When using UTF8String, the compiler
converts between encodings all the time.
UTF-8 is needed in many places, thus we would need to duplicate much
of VCL code for UTF-8. No good.
Using UTF-8 with FPC/Lazarus would simplify the task. LCL classes and
functions work as expected etc.
I was even presented with the possibility of doing a hybrid Ansi/UTF-8 system
and a gradual data conversion plan.
If Length(s) = UTF8Length(s), then the string is an AnsiString, and so on...
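
In code, such a check could look roughly like this (a sketch; UTF8Length is assumed to come from the LazUTF8 unit of LazUtils, and the two handler routines are hypothetical):

uses LazUTF8;
...
// equal byte and codepoint counts mean the text contains no multi-byte
// UTF-8 sequences, so it can still be handled as legacy single-byte data
if Length(s) = UTF8Length(s) then
  HandleAsLegacyAnsi(s)
else
  HandleAsUtf8(s);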

If you call WinAPI a lot, then with UTF-8 you must convert encodings.
But, if you are calling WinAPI a lot, then you are in trouble anyway.

As Michael Van Canneyt wrote, backwards compatibility with UTF-8 is
good. For example all our lower-ascii data will work without
conversions.
Also lots of code which is not designed for Unicode will continue to
work with UTF-8 but not with UTF-16. For example, parsers for common
markup languages (HTML, XML, BB-code) still magically work because all
tags are in the lower-ascii area.
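
A tiny illustration of why that works (a made-up helper, not code from any of the projects mentioned): '<' and '>' are ASCII, and no UTF-8 continuation byte can collide with an ASCII value, so a purely byte-oriented scan still finds the tags.

uses StrUtils;   // for PosEx

function ExtractFirstTag(const s: string): string;
var
  p1, p2: SizeInt;
begin
  p1 := Pos('<', s);              // byte-wise search is safe on UTF-8 input
  if p1 = 0 then Exit('');
  p2 := PosEx('>', s, p1 + 1);
  if p2 = 0 then Exit('');
  Result := Copy(s, p1, p2 - p1 + 1);
end;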

Juha

--
Marcos Douglas
2013-12-19 21:58:18 UTC
Permalink
On Wed, Dec 18, 2013 at 3:25 PM, Juha Manninen
<***@gmail.com> wrote:
> On Wed, Dec 18, 2013 at 5:19 PM, Marcos Douglas <***@delfire.net> wrote:
>> Here too, more or less... I'm thinking to switch all own packages to UTF-8.
>> But, in your codes, how do you works on Delphi -- or with Lazarus on
>> Windows -- using your core parts? There are many calls from/to
>> SysToUTF8 and/or UTF8ToSys from core to Windows?
>
> If you need to call WinAPI, then you must convert obviously.
> In our case API calls are not needed by the core program. It is
> cross-platform code. However using a new Unicode-Delphi would cause
> many problems because all VCL functions and classes, including
> TStringList, expect UTF-16 string. When using UTF8String, the compiler
> converts between encodings all the time.

Using UTF8String the compiler converts to UTF-16 automatically?

> UTF-8 is needed in many places, thus we would need to duplicate much
> of VCL code for UTF-8. No good.
> Using UTF-8 with FPC/Lazarus would simplify the task. LCL classes and
> functions work as expected etc.
> I was even presented a possibility of doing a hybrid Ansi/UTF-8 system
> and a gradual data conversion plan.
> If Lenght(s) = UTF8Lenght(s), then the string is an AnsiString, and so on...

I thought of, e.g., overriding the RTL classes and functions:
type
  TStringList = class(Classes.TStringList)
    // reimplement the relevant methods here to work with UTF-8
  end;

> If you call WinAPI a lot, then with UTF-8 you must convert encodings.
> But, if you are calling WinAPI a lot, then you are in trouble anyway.

I disagree if you code a program only to run on Windows. ;-)

> As Michael Van Canneyt wrote, backwards compatibility with UTF-8 is
> good. For example all our lower-ascii data will work without
> conversions.
> Also lots of code which is not designed for Unicode, will continue to
> work with UTF-8 but not with UTF-16. For example parsers for common
> markup languages (HTML, XML, BB-code) still magically work because all
> tags are in lower-ascii area.

I agree.

Marcos Douglas

--
Juha Manninen
2013-12-19 23:46:18 UTC
Permalink
On Thu, Dec 19, 2013 at 11:58 PM, Marcos Douglas <***@delfire.net> wrote:
> Using UTF8String the compiler converts to UTF-16 automatically?

Yes, Delphi does that. Future FPC versions will do automatic
conversion, too, but not only to UTF-16.

> I thought that, e.g., override the RTL classes and functions:
> type
> TStringList = class(Classes.TStringList)
> // using UTF-8
> end

No, with FPC you don't need to do that. The existing StringList works ok.

With Delphi you would need to copy the whole class, name it
TUtf8StringList, and replace "string" with "UTF8String".
This new class must NOT inherit from Classes.TStringList.

Juha

--
Marcos Douglas
2013-12-20 00:47:27 UTC
Permalink
On Thu, Dec 19, 2013 at 9:46 PM, Juha Manninen
<***@gmail.com> wrote:
> On Thu, Dec 19, 2013 at 11:58 PM, Marcos Douglas <***@delfire.net> wrote:
>> Using UTF8String the compiler converts to UTF-16 automatically?
>
> Yes, Delphi does that. Future FPC versions will do automatic
> conversion, too, but not only to UTF-16.

Not _only_ to UTF-16? It will depend on the OS?

>> I thought that, e.g., override the RTL classes and functions:
>> type
>> TStringList = class(Classes.TStringList)
>> // using UTF-8
>> end
>
> No, with FPC you don't need to do that. The existing StringList works ok.

For now (2.6.2) it works ok only for AnsiString... I'm talking about
coding the TStringList class to work with UTF-8 but with no changes to the
string type of the arguments.

> With Delphi you would need to copy the whole class, name it
> TUtf8StringList, and replace "string" with "UTF8String".
> This new class must NOT inherit from Classes.TStringList.

The same here... I think.

Marcos Douglas

--
Juha Manninen
2013-12-20 06:22:38 UTC
Permalink
On Fri, Dec 20, 2013 at 2:47 AM, Marcos Douglas <***@delfire.net> wrote:
> Not _only_ to UTF-16? It will depend on the OS?

No, FPC string will know its encoding and the conversion is made to
any encoding but only when needed.
Let's not go deeper into this subject here. The details of future FPC
are still open and they are not yet documented.
When "Unicode" is mentioned, usually people start to argue about how
it SHOULD be done.
You can search in fpc-pascal and fpc-dev histories for that.


> For now (2.6.2) works ok only for AnsiString... I'm talking about
> codify TStringList class to work with UTF-8 but no changes in string
> type arguments.

Again no. TStringList in 2.6.2 works ok for UTF-8 encoded strings, too.
The same is true for future FPC versions because they are not
hard-coded for UTF-16 (as Delphi is).


>> With Delphi you would need to copy the whole class, name it
>> TUtf8StringList, and replace "string" with "UTF8String".
>> This new class must NOT inherit from Classes.TStringList.
>
> The same here... I think.

No no no :)

Juha

--
Michael Schnell
2013-12-20 09:23:06 UTC
Permalink
On 12/20/2013 07:22 AM, Juha Manninen wrote:
> When "Unicode" is mentioned, usually people start to argue about how
> it SHOULD be done.

:-) :-) :-)

Especially because the big boss "Delphi" does not provide a really good
model to go for. Delphi string handling is done with UTF16 in mind (using other
encodings results in bad performance and other problems) and with
no respect to portability at all.

And in spite of that there still are some souls that claim Unicode
support is not a complicated thing :-(

-Michael

--
Marcos Douglas
2013-12-21 01:08:52 UTC
Permalink
On Fri, Dec 20, 2013 at 4:22 AM, Juha Manninen
<***@gmail.com> wrote:
> On Fri, Dec 20, 2013 at 2:47 AM, Marcos Douglas <***@delfire.net> wrote:
>> Not _only_ to UTF-16? It will depend on the OS?
>
> No, FPC string will know its encoding and the conversion is made to
> any encoding but only when needed.
> Let's not go deeper into this subject here. The details of future FPC
> are still open and they are not yet documented.
> When "Unicode" is mentioned, usually people start to argue about how
> it SHOULD be done.
> You can search in fpc-pascal and fpc-dev histories for that.

Ok, you're right.

>> For now (2.6.2) works ok only for AnsiString... I'm talking about
>> codify TStringList class to work with UTF-8 but no changes in string
>> type arguments.
>
> Again no. TStringList in 2.6.2 works ok for UTF-8 encoded strings, too.
> The same is true for future FPC versions because they are not
> hard-coded for UTF-16 (as Delphi is).

I didn't understand. If I have a TStringList instance, on Windows, I
need to convert the Text property to ANSI. Some components, e.g.
TMemo, do these conversions automatically, but that is different.

>>> With Delphi you would need to copy the whole class, name it
>>> TUtf8StringList, and replace "string" with "UTF8String".
>>> This new class must NOT inherit from Classes.TStringList.
>>
>> The same here... I think.
>
> No no no :)
But you are talking about making a new StringList... that is not the proposal. ;-)

Marcos Douglas

--
Juha Manninen
2013-12-21 07:56:08 UTC
Permalink
On Sat, Dec 21, 2013 at 3:08 AM, Marcos Douglas <***@delfire.net> wrote:
> I didn't understand. If I have a TStringList instance, on Windows, I
> need to convert Text property to ANSI. But some components, e.g.
> TMemo, do these conversions automatically, but this is different.

TMemo is a GUI component. Then the string encoding matters and it must
be converted to the native widgetset encoding. Still, the conversion
is automatic. You don't need to care about it if you work with LCL
components only and not with WinAPI directly.

TStringList is not a GUI component. It can be used for example in an
embedded Linux program with no GUI.
It does not need to know the encoding (except for sorting maybe). With
FPC no automatic conversions happen.

In Delphi things are different. The auto-conversion happens ALWAYS
when assigning between eg. UTF-8 and UTF-16. It has nothing to do with
WinAPI, or any other widgetset API.
Native string is UTF-16. If you have
var MyUTF8Str: UTF8String;
...
StringList.Add(MyUTF8Str);   // <- triggers a conversion to UTF-16
MyUTF8Str := StringList[0];  // <- triggers a conversion back to UTF-8

The amazing thing is that such code works. Delphi does a good job in
converting the strings.
It is also reasonably fast, but still not acceptable in speed-critical
code. This was the problem in my employer's code base. We are
thinking how to use UTF-8 for the core program without triggering many
auto-conversions. One choice is to dump Delphi and use only FPC. Now
the code still works with both.

Juha

P.S.
I am still wondering why you are so fond of WinAPI while you have a
nice cross-platform API available.

--
Marcos Douglas
2013-12-21 14:33:33 UTC
Permalink
On Sat, Dec 21, 2013 at 5:56 AM, Juha Manninen
<***@gmail.com> wrote:
> On Sat, Dec 21, 2013 at 3:08 AM, Marcos Douglas <***@delfire.net> wrote:
>> I didn't understand. If I have a TStringList instance, on Windows, I
>> need to convert Text property to ANSI. But some components, e.g.
>> TMemo, do these conversions automatically, but this is different.
>
> TMemo is a GUI component.

I know, of course... :)

> Then the string encoding matters and it must
> be converted to the native widgetset encoding. Still, the conversion
> is automatic. You don't need to care about it if you work with LCL
> components only and not with WinAPI directly.

Yes and so I wrote "TMemo, do these conversions automatically, but
this is different".

> TStringList is not a GUI component. It can be used for example in an
> embedded Linux program with no GUI.

Yes again. I use TStrings a lot for transferring information in many
cases... no GUI involved.

> It does not need to know the encoding (except for sorting maybe). With
> FPC no automatic conversions happen.

And that is one of these problems, because I need to convert the Text
property to the right encoding.

> In Delphi things are different. The auto-conversion happens ALWAYS
> when assigning between eg. UTF-8 and UTF-16. It has nothing to do with
> WinAPI, or any other widgetset API.
> Native string is UTF-16. If you have
> var MyUTF8Str: UTF8String;
> ...
> StringList.Add(MyUTF8Str); <- triggers conversion
> MyUTF8Str := StringList[0]; <- triggers conversion again
>
> The amazing thing is that such code works. Delphi does a good job in
> converting the strings.

That's it!
I think you are talking about new versions of Delphi, right? I have
always read that the "new Unicode implementation" in new versions of
Delphi is wrong, broke things, etc., but you are describing a different view.
These conversions, IMHO, could be automatic -- as Delphi does -- when
I use the correct type of string, in that case UTF8String. So, I could
write my packages and opt to use only UTF8String or UTF16String in all
arguments and let the compiler convert for me. What is wrong with that
approach?

> It is also reasonably fast, but still not acceptable in a speed
> critical code. This was the problem in my employer's code base. We are
> thinking how to use UTF-8 for the core program without triggering many
> auto-conversions. One choice is to dump Delphi and use only FPC. Now
> the code still works with both.

If you do not want automatic conversions, use the RawByteString type.
Delphi does not do conversions in that case, right?
Thank you, I'm learning.

Marcos Douglas

> P.S.
> I am still wondering why you are so fond of WinAPI while you have a
> nice cross-platform API available.

Fond? Of course not! I use WinAPI when I need to or when I don't know
another way to do the same thing cross-platform. I'm a "classic
Delphi programmer". I still use Delphi (stopped at version 7) today, but for
all new projects I use Lazarus -- and MSEgui a little.
For example, I use PostMessage, SendMessage, PeekMessage a lot... Are
these cross-platform? If not, how can I do the same?

--
Juha Manninen
2013-12-21 15:18:05 UTC
Permalink
Marcos Douglas
2013-12-21 15:41:57 UTC
Permalink
On Sat, Dec 21, 2013 at 1:18 PM, Juha Manninen
<***@gmail.com> wrote:
> On Sat, Dec 21, 2013 at 4:33 PM, Marcos Douglas <***@delfire.net> wrote:
>> That's it!
>> I think you talking about of new versions of Delphi, right? So I
>> always read that "new Unicode implementation" in new versions of
>> Delphi is wrong, broke things, etc. but you is writing other vision.
>
> Yes, Delphi 2009+. Delphi 2009 is soon 5 years old, not really new any more.
> IMO it does a good job for such a fundamental change in string type.
> Only code that relies on "sizeof(char) = 1" does not work. It includes
> streaming strings, file I/O or I/O with some outside devices, using
> Length(Str) as a parameter for GetMem(), Move() etc.
> Most "clean" code works amazingly well, if you are ok with using
> UTF-16 everywhere.
>
>
>> These conversions, IMHO, could be automatic -- as Delphi does -- when
>> I use the correct type of string, in that case UT8String. So, I can
>> write my packages and opt to use only UTF8String or UTF16String in all
>> arguments and the compiler convert for me. What is wrong in that
>> approach?
>
> Nothing wrong I guess. I hope it will be possible with FPC. Still,
> let's not speculate more, we already have such mail threads in fpc-dev
> list that continued for months.
>
>
>> If you do not want automatic conversions, use the RawByteString type.
>> Delphi does not do conversions in that case, right?
>> Thank you, I'm learning.
>
> You can bypass the conversion sometimes by using RawByteString but it
> would be rather hackish. Remember, all VCL classes and string
> functions expect UTF-16. I don't want to try what happens if you pass
> them a UTF-8 encoded string using some hack.
> The bottom line is: Use only UTF-16 with Delphi and it works very well.


I have always said here -- on the FPC/Lazarus lists -- that FPC should never
follow Delphi, but you're making me change my mind about the Unicode
implementation.
Ok, no more speculation about how the next FPC will work with Unicode.

>> For example, I use a lot PostMessage, SendMessage, PeekMessage... Are
>> these cross-plataform? If not, how can I do the same?
>
> LCL (and VCL) typically use events, like TNotifyEvent. They are
> basically just call-back functions.

Oh, not the same. I use events a lot -- not only in Forms or GUI components --
in my core code, but PostMessage is very different, e.g., you call
PostMessage, show a modal Form and the process will start afterwards; the
task code is not inside the instance of the Form and the Form knows
nothing about the task.


Marcos Douglas

--
Juha Manninen
2013-12-22 09:06:17 UTC
Permalink
On Sat, Dec 21, 2013 at 5:41 PM, Marcos Douglas <***@delfire.net> wrote:
>> LCL (and VCL) typically use events, like TNotifyEvent. They are
>> basically just call-back functions.
> Oh, not same. I use a lot Events -- no only Form or GUI components --
> in my core codes but PostMessage is very different, eg., you call a
> PostMessage, show a Modal Form and the process will start after; the
> task code is not inside the instance of the Form and the Form knows
> nothing about the task.

Ok, true.
Some of the Windows messages are ported to be cross-platform. I have
used an OnIdle handler and sometimes threads when I want the action to
happen later.
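
One cross-platform possibility in the LCL, if I remember correctly, is Application.QueueAsyncCall; it behaves much like a single-shot PostMessage in that the call returns immediately and the method runs later from the event loop (the method name and data value below are just placeholders):

uses Forms;

procedure TMyForm.DoDeferredTask(Data: PtrInt);
begin
  // executed later from the LCL event loop, after the posting code has returned
end;

...
Application.QueueAsyncCall(@DoDeferredTask, 0);   // "post" and return at once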

Juha

--
Marcos Douglas
2013-12-22 16:54:55 UTC
Permalink
On Sun, Dec 22, 2013 at 7:06 AM, Juha Manninen
<***@gmail.com> wrote:
> On Sat, Dec 21, 2013 at 5:41 PM, Marcos Douglas <***@delfire.net> wrote:
>>> LCL (and VCL) typically use events, like TNotifyEvent. They are
>>> basically just call-back functions.
>> Oh, not same. I use a lot Events -- no only Form or GUI components --
>> in my core codes but PostMessage is very different, eg., you call a
>> PostMessage, show a Modal Form and the process will start after; the
>> task code is not inside the instance of the Form and the Form knows
>> nothing about the task.
>
> Ok, true.
> Some of the Windows message are ported to be cross-platform. I have
> used OnIdle handler and sometimes threads when I want the action to
> happen later.

I use threads too, but I like to make things as simple as possible and
threads can be hard sometimes. Using PostMessage is very easy and
simple.

Marcos Douglas

--
Jürgen Hestermann
2013-12-21 17:55:46 UTC
Permalink
Am 2013-12-21 16:18, schrieb Juha Manninen:
> The bottom line is: Use only UTF-16 with Delphi and it works very well.

I would not like Lazarus to do the same.
UTF16 is the worst of all possible unicode encodings.



--
Bob Axtell
2013-12-22 05:51:44 UTC
Permalink
>
>
anybody have a little time to help a 71 yr ole kid just startin' out?

--
Mark Morgan Lloyd
2013-12-22 11:38:02 UTC
Permalink
Bob Axtell wrote:
>>
>>
> anybody have a little time to help a 71 yr ole kid just startin' out?

What are you trying to interface to? If you don't need a graphical
control I extended FPC's standard serial unit to work with Win-32, I
think it's reliable although I usually use the Linux or Solaris
variants. http://bugs.freepascal.org/view.php?id=18946

--
Mark Morgan Lloyd
markMLl .AT. telemetry.co .DOT. uk

[Opinions above are the author's, not those of his employers or colleagues]

--
Bob Axtell
2013-12-22 13:31:45 UTC
Permalink
On 12/22/2013 4:38 AM, Mark Morgan Lloyd wrote:
> Bob Axtell wrote:
>>>
>>>
>> anybody have a little time to help a 71 yr ole kid just startin' out?
>
> What are you trying to interface to? If you don't need a graphical
> control I extended FPC's standard serial unit to work with Win-32, I
> think it's reliable although I usually use the Linux or Solaris
> variants. http://bugs.freepascal.org/view.php?id=18946
>
I am making a simple connection thru an RS232 LINK, and I have used
Comport with Delphi6 very successfully in the past. The graphical
control is needed to make the notebook app easily readable and usable by
our installer, who has ASUS netbooks with WinXP installed.

Thanks for taking me under your wing.

--Bob

--

The only place success comes before work is in the dictionary.

VINCE LOMBARDI


--
Juha Manninen
2013-12-22 09:13:22 UTC
Permalink
On Sat, Dec 21, 2013 at 7:55 PM, Jürgen Hestermann
<***@gmx.de> wrote:
>> The bottom line is: Use only UTF-16 with Delphi and it works very well.
> I would not like Lazarus to do the same.
> UTF16 is the worst of all possible unicode encodings.

I believe LCL will continue to use UTF-8. Nobody knows yet how many
changes are needed later with new FPC versions but no worries, that
question is not acute now.

Juha

--
Michael Schnell
2013-12-20 09:16:15 UTC
Permalink
On 12/20/2013 12:46 AM, Juha Manninen wrote:
> Yes, Delphi does that. Future FPC versions will do automatic
> conversion, too, but not only to UTF-16.

It's a long-winded debate whether or not this is a good idea from a
technical POV, but as Delphi does this, FPC seems to need to follow.

In fact there are decent positive aspects.

But it obviously is a negative aspect if TStringList and such functions
are implemented using a fixed encoding scheme forcing conversions to and
fro when e.g. using TStringList as an intermediate store. Here, a
"generic" implementation (which Delphi does not provide) would be good.
IMHO this is doable without losing Delphi compatibility or performance.

-Michael



--
Marcos Douglas
2013-12-21 01:11:51 UTC
Permalink
On Fri, Dec 20, 2013 at 7:16 AM, Michael Schnell <***@lumino.de> wrote:
> On 12/20/2013 12:46 AM, Juha Manninen wrote:
>>
>> Yes, Delphi does that. Future FPC versions will do automatic conversion,
>> too, but not only to UTF-16.
>
>
> It's a long winding debate whether or not this is a good idea from a
> technical POW, but as Delphi does this, FPC seems to need to follow.
>
> In fact there are decent positive aspects.
>
> But it obviously is a negative aspect if TStringList and such functions are
> implemented using a fixed encoding scheme forcing conversions to and fro
> when e.g. using TStringList as an intermediate store. Here, a "generic"
> implementation (which Delphi does not provide) would be good. IMHO this is
> doable without loosing Delphi compatibility or performance.

+1

That's what I was talking about in the previous mail: "using TStringList as an
intermediate store".

Marcos Douglas

--
Hans-Peter Diettrich
2013-12-23 11:58:31 UTC
Permalink
Juha Manninen schrieb:

> However using a new Unicode-Delphi would cause
> many problems because all VCL functions and classes, including
> TStringList, expect UTF-16 string. When using UTF8String, the compiler
> converts between encodings all the time.

Then you can give your favorite string type a unique name, and set it to
whatever is best in your favorite environment.

DoDi


--
Juha Manninen
2013-12-23 12:25:19 UTC
Permalink
On Mon, Dec 23, 2013 at 1:58 PM, Hans-Peter Diettrich
<***@aol.com> wrote:
> Then you can give your favorite string type a unique name, and set it to
> whatever is best in your favorite environment.

The favorite string type in this case would be UTF8String. It already
has a name. Please see what I was writing earlier.

Juha

--
Jy V
2013-12-20 16:09:20 UTC
Permalink
On Wed, Dec 18, 2013 at 6:48 AM, Juha Manninen
<***@gmail.com> wrote:
> What more, UTF-16 is confusing because it has variations. It all is
> well explained here:
> http://www.utf8everywhere.org/

my experience at: http://www.utf8bootcamp.org/
Marcos Douglas
2013-12-17 23:41:09 UTC
Permalink
On Tue, Dec 17, 2013 at 6:23 AM, Mattias Gaertner
<nc-***@netcologne.de> wrote:
> On Mon, 16 Dec 2013 23:41:34 -0200
> Marcos Douglas <***@delfire.net> wrote:
>
>>[...]
>> I think use one encoding on all platforms is good, however Windows
>> uses UTF-16. All string from/to Windows needs to be converted, right?
>> Is this not a penalty for Windows platform?
>
> The Windows API is UTF-16, most text files and databases are not. You
> will always have some conversions. In most cases the conversion is
> hardly measurable, but there are cases where UTF-16 is better and cases
> where UTF-8 is better. And in some cases even UTF-32 is better.
> The LCL is graphical library, so string speed hardly matters.

Yes, there will always be conversions... the problem is the conversions
inside a single program (LCL <> RTL). ;-)

>> What coding FPC's team chose to use on Windows in the next release of
>> the compiler, UTF-16?
>
> I'm not member of the FPC team, but afaik UTF-16 would be a big
> incompatibility, so the next release will still using system codepage
> string for Windows. Maybe eventually another flavor with UTF-16.

So, nothing will change for Lazarus users after the new FPC release.

>>[...]
>> Lazarus uses only 'string', not UTF8String, UnicodeString,
>> RawByteString, etc. This will change?
>
> Lazarus used only string (AnsiString), now it has to be more specific
> at some places to avoid unnecessary conversions by the compiler.

Ok, thank you for all explanations.

Marcos Douglas

--
Jürgen Hestermann
2013-12-17 18:15:15 UTC
Permalink
Am 2013-12-17 02:41, schrieb Marcos Douglas:
> I think use one encoding on all platforms is good, however Windows
> uses UTF-16. All string from/to Windows needs to be converted, right?
> Is this not a penalty for Windows platform?

I am just writing a file manager for Windows (hopefully I can port it to Linux later)
and I don't see any performance problems from using UTF8 in my program while the API is UTF16.
Most (if not all) things that I do with files take much longer than the string conversion, so it does not matter much.

IMO it makes very much sense to use UTF8 (which Linux uses anyway) throughout all programs.
Having multiple string encodings within Lazarus/FPC makes it worse than necessary.

Maybe some day all programs and OS's use UTF8 so that
we never need to think about encodings anymore (just dreaming...).
At least, the more UTF8 is used the fewer conversions are needed.

I think that UTF8 is the English language of encoding ;-)


--
Martin Schreiber
2013-12-17 18:51:51 UTC
Permalink
On Tuesday 17 December 2013 19:15:15 Jürgen Hestermann wrote:
>
> IMO it makes very much sense to use UTF8 (which Linux uses anyway)

Warning: Linux uses "array of byte" for filenames, not utf-8 AFAIK.

Martin

--
Jürgen Hestermann
2013-12-19 16:18:44 UTC
Permalink
Am 2013-12-17 19:51, schrieb Martin Schreiber:
> On Tuesday 17 December 2013 19:15:15 Jürgen Hestermann wrote:
>> IMO it makes very much sense to use UTF8 (which Linux uses anyway)
> Warning: Linux uses "array of byte" for filenames, not utf-8 AFAIK.
>

Of course, as long as it is only stored in the file system it is always just an array of bytes.
Encoding is irrelevant as long as you only identify files by the byte sequence.
But when displaying such a file name on the screen, the display component
must have a clue about the encoding so that the characters can be displayed.
There must be a convention for how to *interpret* that array of bytes.

But it seems that such a convention does not really exist, if I read this correctly:
http://unix.stackexchange.com/questions/2089/what-charset-encoding-is-used-for-filenames-and-paths-on-linux


--
Marcos Douglas
2013-12-18 01:16:34 UTC
Permalink
On Tue, Dec 17, 2013 at 4:15 PM, Jürgen Hestermann
<***@gmx.de> wrote:
> Am 2013-12-17 02:41, schrieb Marcos Douglas:
>
>> I think use one encoding on all platforms is good, however Windows
>> uses UTF-16. All string from/to Windows needs to be converted, right?
>> Is this not a penalty for Windows platform?
>
> I am just writing a file manager for Windows (hopefully can port it to Linux
> later)
> and I don't see any performance problems by using UTF8 in my program while
> the API is UTF16.
> Most (if not all) things that I do with files take much longer than the
> string conversion so it does not matter much.

Ok. But how do you work, using SysToUTF8 / UTF8ToSys?

> IMO it makes very much sense to use UTF8 (which Linux uses anyway)
> throughout all programs.
> Having multiple string encodings within Lazarus/FPC makes it worse than
> necessary.

It makes perfect sense to use only one encoding, whatever, in all code
of a single program. ;-)

>
> [...]

Marcos Douglas

--
Jürgen Hestermann
2013-12-19 16:33:45 UTC
Permalink
Am 2013-12-18 02:16, schrieb Marcos Douglas:
> On Tue, Dec 17, 2013 at 4:15 PM, Jürgen Hestermann
> <***@gmx.de> wrote:
>> I am just writing a file manager for Windows (hopefully can port it to Linux
>> later)
>> and I don't see any performance problems by using UTF8 in my program while
>> the API is UTF16.
>> Most (if not all) things that I do with files take much longer than the
>> string conversion so it does not matter much.
> Ok. But how do you work, using SysToUTF8 / UTF8ToSys?

I use the following:

---------------------------
var X,Path : UTF8String;
    FW     : Win32_Find_DataW;
    H      : THandle;

H := FindFirstFileW(pwidechar(UTF8Decode(WinAPIPathName(Path))),FW);
...
X := UTF8Encode(UnicodeString(FW.cFileName));
---------------------------

where WinAPIPathName just prepends the "\\?\" string to the pathname to overcome the 255 char length limitation.
Path is the UTF8 string for the file search and X holds the found file name(s) in UTF8 notation.
When I later need an API-call I convert back:

---------------------------
... Windows.DeleteFileW(pwidechar(UTF8Decode(WinAPIPathName(AppendDir(Path,X)))))
---------------------------
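
For reference, a helper like WinAPIPathName (Jürgen's own routine, not shown in the thread) could be sketched roughly as follows; a real version would also have to map UNC paths to the "\\?\UNC\" form, which this minimal sketch ignores:

function WinAPIPathName(const Path: UTF8String): UTF8String;
begin
  // prepend the long-path prefix so the W API accepts paths longer than 255 chars
  if Pos('\\?\', Path) = 1 then
    Result := Path
  else
    Result := '\\?\' + Path;
end;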







--
Marcos Douglas
2013-12-19 22:04:15 UTC
Permalink
On Thu, Dec 19, 2013 at 2:33 PM, Jürgen Hestermann
<***@gmx.de> wrote:
> Am 2013-12-18 02:16, schrieb Marcos Douglas:
>
>> On Tue, Dec 17, 2013 at 4:15 PM, Jürgen Hestermann
>> <***@gmx.de> wrote:
>>> I am just writing a file manager for Windows (hopefully can port it to
>>> Linux
>>> later)
>>> and I don't see any performance problems by using UTF8 in my program
>>> while
>>> the API is UTF16.
>>> Most (if not all) things that I do with files take much longer than the
>>> string conversion so it does not matter much.
>> Ok. But how do you work, using SysToUTF8 / UTF8ToSys?
>
> I use the following:
>
> ---------------------------
> var X,Path : UTF8String;
> FW : Win32_Find_DataW;
>
> H := FindFirstFileW(pwidechar(UTF8Decode(WinAPIPathName(Path))),FW);
> ...
> X := UTF8Encode(UnicodeString(FW.cFileName));
> ---------------------------
>
> where WinAPIPathName just prepends the "\\?\" string to the pathname to
> overcome the 255 char length limitation.
> Path is the UTF8 string for the file search and X holds the found file
> name(s) in UTF8 notation.
> When I later need an API-call I convert back:
>
> ---------------------------
> ...
> Windows.DeleteFileW(pwidechar(UTF8Decode(WinAPIPathName(AppendDir(Pfad,X)))))
> ---------------------------

Well, the same problem...
If there is no solution (for now), I prefer using SysToUTF8/UTF8ToSys
because it is simpler than using the WideString API and conversions to
UnicodeString, UTF8Decode, etc. Don't you think?

Marcos Douglas

--
Bart
2013-12-19 22:39:46 UTC
Permalink
On 12/19/13, Marcos Douglas <***@delfire.net> wrote:

> Well, the same problem...
> If there is no solution (for now), I prefer using SysToUTF8/ UTF8ToSys
> because is more simpler than use WideString API and conversion to
> UnicodeString, UTF8Decode, etc. Don't you think?

It fails if any of the characters is outside the current codepage
(e.g. Chinese or Taiwanese on my system). In that case the Wide API's
do work.

Bart

--
Jürgen Hestermann
2013-12-20 06:19:08 UTC
Permalink
Am 2013-12-19 23:04, schrieb Marcos Douglas:
> Well, the same problem...
> If there is no solution (for now), I prefer using SysToUTF8/ UTF8ToSys
> because is more simpler than use WideString API and conversion to
> UnicodeString, UTF8Decode, etc. Don't you think?

As Bart already mentioned, the ANSI (SYS) interface does *not* support Unicode.
Also, you are not able to access long paths (longer than 255 characters) when using ANSI API functions.
Therefore the [W]ide (unicode) character API functions are a must.


--
Marcos Douglas
2013-12-20 19:55:48 UTC
Permalink
On Fri, Dec 20, 2013 at 3:19 AM, Jürgen Hestermann
<***@gmx.de> wrote:
> Am 2013-12-19 23:04, schrieb Marcos Douglas:
>
>> Well, the same problem...
>> If there is no solution (for now), I prefer using SysToUTF8/ UTF8ToSys
>> because is more simpler than use WideString API and conversion to
>> UnicodeString, UTF8Decode, etc. Don't you think?
>
> As Bart already mentions, the ANSI (SYS) interface does *not* support
> Unicode.
> Also, you are not be able to access long paths (longer than 255 characters)
> when using ANSI API functions.
> Therefore the [W]ide (unicode) character API functions are a must.

So, these limitations exist in Lazarus too, right?

Marcos Douglas

--
Mattias Gaertner
2013-12-20 22:44:51 UTC
Permalink
On Fri, 20 Dec 2013 17:55:48 -0200
Marcos Douglas <***@delfire.net> wrote:

> On Fri, Dec 20, 2013 at 3:19 AM, Jürgen Hestermann
> <***@gmx.de> wrote:
> > Am 2013-12-19 23:04, schrieb Marcos Douglas:
> >
> >> Well, the same problem...
> >> If there is no solution (for now), I prefer using SysToUTF8/ UTF8ToSys
> >> because is more simpler than use WideString API and conversion to
> >> UnicodeString, UTF8Decode, etc. Don't you think?
> >
> > As Bart already mentions, the ANSI (SYS) interface does *not* support
> > Unicode.
> > Also, you are not be able to access long paths (longer than 255 characters)
> > when using ANSI API functions.
> > Therefore the [W]ide (unicode) character API functions are a must.
>
> So, these limitations exist in Lazarus too, right?

The file functions with UTF8 in their name internally use the W functions
under Windows.
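
For example (a sketch; it assumes the *UTF8 wrappers from the FileUtil/LazFileUtils units, which take UTF-8 file names and call the wide API internally on Windows):

uses LazFileUtils;
...
if FileExistsUTF8(FileName) then    // FileName is a UTF-8 string
  DeleteFileUTF8(FileName);         // ends up calling DeleteFileW on Windows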

Mattias

--
Marcos Douglas
2013-12-21 01:13:25 UTC
Permalink
On Fri, Dec 20, 2013 at 8:44 PM, Mattias Gaertner
<nc-***@netcologne.de> wrote:
> On Fri, 20 Dec 2013 17:55:48 -0200
> Marcos Douglas <***@delfire.net> wrote:
>
>> On Fri, Dec 20, 2013 at 3:19 AM, Jürgen Hestermann
>> <***@gmx.de> wrote:
>> > Am 2013-12-19 23:04, schrieb Marcos Douglas:
>> >
>> >> Well, the same problem...
>> >> If there is no solution (for now), I prefer using SysToUTF8/ UTF8ToSys
>> >> because is more simpler than use WideString API and conversion to
>> >> UnicodeString, UTF8Decode, etc. Don't you think?
>> >
>> > As Bart already mentions, the ANSI (SYS) interface does *not* support
>> > Unicode.
>> > Also, you are not be able to access long paths (longer than 255 characters)
>> > when using ANSI API functions.
>> > Therefore the [W]ide (unicode) character API functions are a must.
>>
>> So, these limitations exist in Lazarus too, right?
>
> The file functions with UTF8 in name use internally the W functions
> under Windows.

I didn't know, thanks.

Marcos Douglas

--
Marco van de Voort
2013-12-22 18:56:19 UTC
Permalink
On Sun, Dec 15, 2013 at 06:13:32PM +0100, Reinier Olislagers wrote:
> > FPC's context.
> > These components do not use Lazarus' routines and that is the BIG
> > problem. I need to "remember" in pass only ANSI strings for these
> > components as remember to convert the component's output string
> > results to use in Lazarus.
>
> Why not just include a project reference to LCLBase (IIRC that should be
> enough) and just always use the LCL units until FPC catches up?

FPC 2.7.x can compile the windows unit in unicode (UTF16) mode. Most system and
sysutils file related routines are already unicode (UTF16 with Rawbytestring
overload).


--
Marcos Douglas
2013-12-22 19:06:27 UTC
Permalink
On Sun, Dec 22, 2013 at 4:56 PM, Marco van de Voort <***@stack.nl> wrote:
> On Sun, Dec 15, 2013 at 06:13:32PM +0100, Reinier Olislagers wrote:
>> > FPC's context.
>> > These components do not use Lazarus' routines and that is the BIG
>> > problem. I need to "remember" in pass only ANSI strings for these
>> > components as remember to convert the component's output string
>> > results to use in Lazarus.
>>
>> Why not just include a project reference to LCLBase (IIRC that should be
>> enough) and just always use the LCL units until FPC catches up?
>
> FPC 2.7.x can compile the windows unit in unicode (UTF16) mode. Most system and
> sysutils file related routines are already unicode (UTF16 with Rawbytestring
> overload).

So FPC 2.7.x can compile the windows unit in unicode (UTF16) mode. But
how will it work with Lazarus, which uses UTF-8? Lazarus will not
change to UTF-16 -- only for Windows -- so will everything stay the
same for Windows programmers?

Thanks,
Marcos Douglas

--
Marco van de Voort
2013-12-23 10:38:09 UTC
Permalink
On Sun, Dec 22, 2013 at 05:06:27PM -0200, Marcos Douglas wrote:
> > FPC 2.7.x can compile the windows unit in unicode (UTF16) mode. Most system and
> > sysutils file related routines are already unicode (UTF16 with Rawbytestring
> > overload).
>
> So FPC 2.7.x can compile the windows unit in unicode (UTF16) mode. But
> how it will work with Lazarus that uses UTF-8?

Not without conversions. UTF8 on Windows IMHO _NEVER_ was a good idea.

> Lazarus will not to
> change to UTF-16 -- only for Windows -- then everything will stay the
> same to Windows programmers?

I think it is too early to say what will happen. One way or the other.
Everybody is still searching, and the current 2.6.x based UTF8 support will
need an overhaul anyway for 2.8.x.

I think 2.8.x will be a transition version anyway, and a definitive unicode
solution will only come in the major release after that.

--
Marcos Douglas
2013-12-23 17:56:21 UTC
Permalink
On Mon, Dec 23, 2013 at 8:38 AM, Marco van de Voort <***@stack.nl> wrote:
> On Sun, Dec 22, 2013 at 05:06:27PM -0200, Marcos Douglas wrote:
>> > FPC 2.7.x can compile the windows unit in unicode (UTF16) mode. Most system and
>> > sysutils file related routines are already unicode (UTF16 with Rawbytestring
>> > overload).
>>
>> So FPC 2.7.x can compile the windows unit in unicode (UTF16) mode. But
>> how it will work with Lazarus that uses UTF-8?
>
> Not without conversions. UTF8 on Windows IMHO _NEVER_ was a good idea.
>
>> Lazarus will not to
>> change to UTF-16 -- only for Windows -- then everything will stay the
>> same to Windows programmers?
>
> I think it is too early to say what will happen. One way or the other.
> Everybody is still searching, and the current 2.6.x based UTF8 support will
> need an overhaul anyway for 2.8.x.
>
> I think 2.8.x will be a transition version anyway, and a definitive unicode
> solution will only in the major release after that.

Ok, thanks for the explanation.

Marcos Douglas

--
Sven Barth
2013-12-15 20:25:03 UTC
Permalink
On 15.12.2013 16:25, Marcos Douglas wrote:
>> As for DLL, ActiveX, well that is platform specific, and you will have
>> to convert.
>
> Forget that, this is not a problem. I'm using many DLL and ActiveX and
> I know this is not portable... that's Ok. Do not need any conversion.

I think this was also meant in context of string encoding and not that
ActiveX itself is not portable...

Regards,
Sven

--
Marcos Douglas
2013-12-15 15:26:37 UTC
Permalink
On Sun, Dec 15, 2013 at 8:56 AM, Marco van de Voort <***@stack.nl> wrote:
> On Sun, Dec 15, 2013 at 05:01:35AM +0100, Hans-Peter Diettrich wrote:
>> > I have many systems coded in FPC+Lazarus only to run on Windows so I ask you:
>> > Is there some trick to make the FPC+Lazarus to use only ANSI?
>>
>> Why that? Lazarus is using UTF-8 throughout, so that writing and reading
>> files will work the same on all targets.
>
> The Tstringlist.save* routines that Marcos mentions are FPC.

Yes.

I know you use Delphi too and, of course, Windows.
Could you tell me how you work with this problem?

Thanks,
Marcos Douglas

--
Marcos Douglas
2013-12-15 15:13:33 UTC
Permalink
On Sun, Dec 15, 2013 at 2:01 AM, Hans-Peter Diettrich
<***@aol.com> wrote:
> Marcos Douglas schrieb:
>
>
>> How I work:
>> 1. I use [string] to represent any type string. But some libs (DLL,
>> ActiveX, etc) uses WideString;
>
>
> That's Windows specific, not portable.

Yes and this is not a problem.

>> 2. If I have to create a file, I use UTF8ToSys(FileName)...
>
>
> Okay for filenames, even if IMO it should not be necessary.

Not OK for me. That is why I wrote this mail...
It is not necessary ONLY if you use ANSI path names.

>> and if I
>> have a TStringList I use SS.Text := UTF8ToSys(Text)... and at the end
>> SS.SaveToFile(UTF8ToSys(FileName));
>
>
> Why that?

As Marco van de Voort said, SaveToFile is FPC.

>> I have many systems coded in FPC+Lazarus only to run on Windows so I ask
>> you:
>> Is there some trick to make the FPC+Lazarus to use only ANSI?
>
>
> Why that? Lazarus is using UTF-8 throughout, so that writing and reading
> files will work the same on all targets.

That is the problem! Lazarus writes files in UTF-8 but I can't write all
my Windows files in UTF-8.


Marcos Douglas

--