Discussion:
Does Lazarus support a complete Unicode Component Library?
Juha Manninen
2010-12-31 09:12:46 UTC
Hi

I was asked the title's question but I don't really know the answer.
How does it compare to the TNT unicode components for Delphi, or to the new
Delphi unicode support?

I know that FPC's UnicodeString type is not compatible with Delphi's, but it
is close anyway, right?

Regards,
Juha

--
Vincent Snijders
2010-12-31 09:30:22 UTC
2010/12/31 Juha Manninen <***@gmail.com>:
> Hi
>
> I was asked the title's question but I don't really know the answer.
> How does it compare to the TNT unicode components for Delphi, or to the new
> Delphi unicode support?

It is the same as the new Delphi unicode support, I think. All GUI
components support unicode out of the box; it uses UTF8 encoded
strings.
AFAIK Delphi uses UTF16, so in that way it is different.

Vincent

--
Juha Manninen
2010-12-31 10:29:33 UTC
Vincent Snijders wrote on Friday, 31 December 2010 at 11:30:22:
> It is the same as the new Delphi unicode support, I think. All GUI
> components support unicode out of the box, it uses UTF8 encoded
> strings.
> AFAIK Delphi uses UTF16, so in that way it is different.

... so UTF8 is also unicode?
I am still confused with all the encodings and string types.

Thanks.

Juha

--
Marc Weustink
2010-12-31 10:56:24 UTC
Juha Manninen wrote:
> Vincent Snijders wrote on Friday, 31 December 2010 at 11:30:22:
>> It is the same as the new Delphi unicode support, I think. All GUI
>> components support unicode out of the box, it uses UTF8 encoded
>> strings.
>> AFAIK Delphi uses UTF16, so in that way it is different.
>
> ... so UTF8 is also unicode?
> I am still confused with all the encodings and string types.

Yes, both are. And both are variable-width encodings: UTF16 encodes
characters in 16-bit words and UTF8 in 8-bit bytes.
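
As a quick illustration - a minimal sketch, assuming the source file is saved
as UTF-8 (UTF8Decode is in FPC's System unit):

program widths;

{$mode objfpc}{$H+}

var
  u8: AnsiString;      // UTF-8 encoded bytes
  u16: UnicodeString;  // UTF-16 encoded words
begin
  u8 := 'ä';              // one character, two UTF-8 bytes ($C3 $A4)
  u16 := UTF8Decode(u8);  // the same character, one UTF-16 word ($00E4)
  writeln(Length(u8));    // prints 2: Length counts bytes here
  writeln(Length(u16));   // prints 1: Length counts widechars here
end.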

For more info see
http://en.wikipedia.org/wiki/Unicode

Marc

--
DSK
2011-01-01 21:29:29 UTC
Juha,

>... so UTF8 is also unicode?
>I am still confused with all the encodings and string types.

The best article ever on it ...
http://www.joelonsoftware.com/articles/Unicode.html

--
DSK


--
Juha Manninen
2011-01-01 14:25:49 UTC
> It is the same as the new Delphi unicode support, I think. All GUI
> components support unicode out of the box, it uses UTF8 encoded
> strings.
> AFAIK Delphi uses UTF16, so in that way it is different.
>

I must ask a newbie question again. I never needed to pay attention to this
because char encodings in GUIs have worked well for my purposes.

The GUI text properties have type "string" which is ansistring with the
normal H+ setting.
TCaption is defined as "string", too.
Examples: TEdit.Text, TMemo.Lines[0]
What happens when I do:
var s: string;
...
s := TMemo.Lines[0];

Is it converted somehow?
The native widget's encoding is either UTF-8 or UTF-16.
Is the string actually a Utf8String or Utf16String then?
When do I need to pay attention to it?

Juha
Sven Barth
2011-01-01 18:13:26 UTC
On 01.01.2011 15:25, Juha Manninen wrote:
>
> It is the same as the new Delphi unicode support, I think. All GUI
> components support unicode out of the box, it uses UTF8 encoded
> strings.
> AFAIK Delphi uses UTF16, so in that way it is different.
>
>
> I must ask a newbie question again. I never needed to pay attention to
> this because char encodings in GUIs have worked well for my purposes.
>
> The GUI text properties have type "string" which is ansistring with the
> normal H+ setting.
> TCaption is defined as "string", too.
> Examples: TEdit.Text, TMemo.Lines[0]
> What happens when I do:
> var s: string;
> ...
> s := TMemo.Lines[0];
>
> Is it converted somehow?
> The native widget's encoding is either UTF-8 or UTF-16.
> Is the string actually a Utf8String or Utf16String then?
> When do I need to pay attention to it?

Currently there is no automatic conversion (it's planned in one of the
branches of FPC). For now a String (= AnsiString) can be seen as an
"array of byte". You as a developer are responsible for ensuring that the
string contains data in the correct encoding.

So in your above example the string that is stored in "s" will be UTF8
encoded, because it comes from the GUI. But if that string contains
multibyte characters, those characters will appear as sequences of single
"one byte" characters if you access the string using [], Pos, Copy, etc.

Example (note: this is not accurate UTF8 encoding, I'm just making that
up here)

TMemo.Lines[0] contains: 'hä?!' ( h a-umlaut ? ! )
I now assume that an a-umlaut is encoded as "ae" (which isn't really the
case, but it's for the sake of an example ^^)
s now contains: 'h a e ? !'

If you now want to access the second character of s you'd expect that
you'd get the a-umlaut, but if you do s[2] you'll get an "a". And if you
access the third one (s[3]) you'll get the "e" instead of "?".
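
In real UTF-8 the a-umlaut is the two bytes $C3 $A4, so a concrete version of
the same experiment - a minimal sketch, assuming a UTF-8 source file and
console - looks like this:

program bytes;

{$mode objfpc}{$H+}

var
  s: string;
begin
  s := 'hä?!';          // stored as the five bytes  h $C3 $A4 ? !
  writeln(Length(s));   // prints 5, although only 4 characters are visible
  writeln(Ord(s[2]));   // prints 195 ($C3): first byte of the 'ä' sequence
  writeln(Ord(s[3]));   // prints 164 ($A4): second byte, not the '?'
end.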

You need to convert the UTF8 string to a different one, e.g. UTF16:

var
  us: UnicodeString;
begin
  us := UTF8Encode(s);
end;

Now us[2] will return the a-umlaut.

I hope this example clears that up a bit, if not: just ask more questions ;)

Regards,
Sven

--
Vladimir Zhirov
2011-01-01 20:14:32 UTC
Sven Barth wrote:

> You need to convert the UTF8 string to a different one, e.g.
> UTF16:
>
> var
> us: UnicodeString;
> begin
> us := UTF8Encode(s);
> end;
>
> Now us[2] will return the a-umlaut.

I would suggest using Utf8Copy(s, 2, 1) instead. It helps
to avoid the conversion and works correctly even for characters
that take 4 bytes in UnicodeString/WideString (i.e. 2
wide characters). Utf8Copy is declared in the LCLProc unit.
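
A minimal usage sketch (assuming the LCL's LCLProc unit is on the unit path
and a UTF-8 source file):

program copychar;

{$mode objfpc}{$H+}

uses
  LCLProc;  // provides Utf8Copy (and UTF8Length, UTF8Pos, ...)

var
  s: string;
begin
  s := 'hä?!';                 // UTF-8 encoded: 5 bytes, 4 characters
  writeln(UTF8Copy(s, 2, 1));  // prints 'ä': indexes whole characters
  writeln(Copy(s, 2, 1));      // prints a lone $C3 byte: indexes raw bytes
end.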

--
Juha Manninen
2011-01-01 21:29:08 UTC
Vladimir Zhirov wrote on Saturday, 1 January 2011 at 22:14:32:
> Sven Barth wrote:
> > You need to convert the UTF8 string to a different one, e.g.
> > UTF16:
> >
> > var
> > us: UnicodeString;
> > begin
> > us := UTF8Encode(s);
> > end;
> >
> > Now us[2] will return the a-umlaut.
>
> I would suggest using Utf8Copy(s, 2, 1) instead. It helps
> to avoid conversion and works correctly even for characters
> that take 4 bytes in UnicodeString/WideString (i.e. 2
> wide characters). Utf8Copy is declared in LCLProc unit.

So the conversion is only needed if a char inside the string is accessed by
index?

I understand the principle but I didn't understand how the functions
UTF8Encode and UTF8Decode work. Of course I don't need to understand such
details because I am not an FPC developer, but anyway ...

UTF8Encode returns UTF8String and the AnsiString parameter is internally
typecast to UnicodeString. How can that work?

Maybe Sven's example should use UTF8Decode. It returns UnicodeString.
According to the debugger both functions convert the string to uppercase and
add some garbage to the beginning and end, but that may be a debugger error.


Regards,
Juha

--
Vladimir Zhirov
2011-01-01 22:51:42 UTC
Juha Manninen wrote:

> So the conversion is only needed if a char inside the string
> is accessed by index?

No, the conversion is completely optional.
Summing up what was suggested, there are two ways to access a character
by index in a UTF-8 string:

1. Convert it to WideString/UnicodeString and use MyWideString[Index];
2. Use Utf8Copy(MyString, Index, 1);

The limitation of the first approach is that it relies on the character fitting into 2 bytes
(a WideChar). As a result, it works incorrectly for characters of some languages and some special symbols
(see http://en.wikipedia.org/wiki/Supplementary_Multilingual_Plane#Supplementary_Multilingual_Plane
for a list of them). So this approach does not support "true" unicode, but works in most cases.

The second approach should handle this correctly (provided there are no bugs).
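
To see the difference, take a character outside the Basic Multilingual Plane,
e.g. U+1D11E (the musical G clef), which needs 4 bytes in UTF-8 and a
surrogate pair (2 widechars) in UTF-16. A sketch, assuming a UTF-8 source
file and the LCL's LCLProc unit:

program planes;

{$mode objfpc}{$H+}

uses
  LCLProc;

var
  s: string;          // UTF-8
  us: UnicodeString;  // UTF-16
begin
  s := '𝄞a';                   // U+1D11E followed by 'a': two characters
  us := UTF8Decode(s);
  writeln(Length(us));         // prints 3: the clef occupies two widechars,
                               // so us[2] is half a character, not the 'a'
  writeln(UTF8Copy(s, 2, 1));  // prints 'a': Utf8Copy counts code points
end.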

> UTF8Encode returns UTF8String and the AnsiString parameter is
> internally typecasted to UnicodeString. How can that work?
>
> Maybe Sven's example should use UTF8Decode.

Sure, UTF8Decode should have been used in this case.

--
Sven Barth
2011-01-02 11:47:28 UTC
On 01.01.2011 22:29, Juha Manninen wrote:
> Vladimir Zhirov wrote on Saturday, 1 January 2011 at 22:14:32:
>> Sven Barth wrote:
>>> You need to convert the UTF8 string to a different one, e.g.
>>> UTF16:
>>>
>>> var
>>> us: UnicodeString;
>>> begin
>>> us := UTF8Encode(s);
>>> end;
>>>
>>> Now us[2] will return the a-umlaut.
>>
>> I would suggest using Utf8Copy(s, 2, 1) instead. It helps
>> to avoid conversion and works correctly even for characters
>> that take 4 bytes in UnicodeString/WideString (i.e. 2
>> wide characters). Utf8Copy is declared in LCLProc unit.
>
> So the conversion is only needed if a char inside the string is accessed by
> index?
>

If you use the LCL in your application you can also use the UTF8Copy
function that Vladimir mentioned.

Let's say it this way: if your String contains UTF8 encoded text you
should not use [] or the normal Pos, Copy, etc. functions, because they
might return garbage. Use functions that can work with that encoding
(either by converting the string or by working directly on it).

> I understand the principle but I didn't understand how the functions
> UTF8Encode and UTF8Decode work. Of course I don't need to understand such
> details because I am not an FPC developer, but anyway ...
>
> UTF8Encode returns UTF8String and the AnsiString parameter is internally
> typecasted to UnicodeString. How can that work?
>

You looked at the wrong function. I meant the one below it, which takes a
UnicodeString as its argument. And this also solves the mystery:

Casting from AnsiString to UnicodeString invokes the WideString
Manager's Ansi2UnicodeMoveProc, which converts the supplied AnsiString to
a correct UTF16 string. Then the overload that takes a UnicodeString
argument is invoked (it's an overloaded function after all) and the
UTF16 string is converted to UTF8.
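
In short, the intended directions of the two helpers are (a minimal sketch):

// UTF8Encode: UnicodeString (UTF-16) -> UTF8String
// UTF8Decode: UTF8String             -> UnicodeString (UTF-16)
var
  s: string;          // UTF-8 text, e.g. from the LCL
  us: UnicodeString;
begin
  us := UTF8Decode(s);  // decode the UTF-8 bytes into UTF-16
  s := UTF8Encode(us);  // and encode them back again
end;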

> Maybe Sven's example should use UTF8Decode. It returns UnicodeString.
> According to the debugger both functions convert the string to uppercase and
> add some garbage to the beginning and end, but that may be a debugger error.

Yes, it should have used UTF8Decode. I used the wrong function. -.-

Regards,
Sven

--
Juha Manninen
2011-01-02 12:06:38 UTC
Sven Barth wrote on Sunday, 2 January 2011 at 13:47:28:
> Casting from AnsiString to UnicodeString invokes the WideString
> Manager's Ansi2UnicodeMoveProc which converts the supplied AnsiString to
> a correct UTF16 string.

Ok, there is some "compiler magic" here. I think I understand it now.

Regards,
Juha

--
Graeme Geldenhuys
2011-01-02 17:29:17 UTC
On 2 January 2011 13:47, Sven Barth <***@googlemail.com> wrote:
> Casting from AnsiString to UnicodeString invokes the WideString Manager's
> Ansi2UnicodeMoveProc which converts the supplied AnsiString to a correct
> UTF16 string.

Does that mean FPC and the LCL always treat the UnicodeString type as a UTF16
encoded type? If so, that is a rather odd "type name", because
"unicode" is NOT just UTF16; it is also UTF8, UTF16-LE,
UTF16-BE and UTF32. A better, and more correct, type name would then
have been UTF16String, just like there is a UTF8String type (though I
don't really know how the latter differs from AnsiString, which is
basically an array of bytes).


--
Regards,
  - Graeme -



--
Sven Barth
2011-01-02 19:33:53 UTC
On 02.01.2011 18:29, Graeme Geldenhuys wrote:
> On 2 January 2011 13:47, Sven Barth<***@googlemail.com> wrote:
>> Casting from AnsiString to UnicodeString invokes the WideString Manager's
>> Ansi2UnicodeMoveProc which converts the supplied AnsiString to a correct
>> UTF16 string.
>
> Does that mean FPC and the LCL always treat the UnicodeString type as a UTF16
> encoded type? If so, that is a rather odd "type name", because
> "unicode" is NOT just UTF16; it is also UTF8, UTF16-LE,
> UTF16-BE and UTF32. A better, and more correct, type name would then
> have been UTF16String, just like there is a UTF8String type (though I
> don't really know how the latter differs from AnsiString, which is
> basically an array of bytes).

Yes, UnicodeString (and WideString as well) is treated as a UTF16 encoded
string.

The type name might come from Delphi compatibility (tada!).

And currently UTF8String is defined as AnsiString, so there is no
difference (which could change once the cpstrnew branch is mature
enough).

Regards,
Sven

--
Graeme Geldenhuys
2011-01-02 21:22:43 UTC
On 2 January 2011 21:33, Sven Barth wrote:
>
> Yes, UnicodeString (and WideString as well) is treated as UTF16 encoded
> string.
>
> The type name might come from Delphi compatibility (tada!).

And once again, for someone like myself not using Delphi, it is quite
ridiculous to see the errors the Free Pascal project makes (or should
that rather be the errors Delphi makes) in such cases/examples. Trying
to fool all developers into thinking that "unicode" [as in the
UnicodeString type] only means UTF-16, because incidentally that is what
Microsoft uses in its Windows OS. Free Pascal is a cross-platform
compiler but it seems Microsoft even dictates what Free Pascal must
do. A shame really (and a slap in the face for any developer working
on a non-Microsoft platform). UnicodeString should really mean any of
the possible unicode encoding types.

Maybe the code-page enabled string type (cpstrnew branch) will use
some more "sane" name for its string type, or redefine the standard
String type to mean a code page / encoding enabled string type instead
of String = AnsiString.


> And currently UTF8String is defined as AnsiString, so there is currently no
> difference

That's what I thought. So why did they [FPC team] actually bother to
create the UTF8String type then?


--
Regards,
  - Graeme -



--
Henry Vermaak
2011-01-02 21:47:56 UTC
On 2 January 2011 21:22, Graeme Geldenhuys <***@gmail.com> wrote:
>> And currently UTF8String is defined as AnsiString, so there is currently no
>> difference
>
> That's what I thought. So why did they [FPC team] actually bother to
> create the UTF8String type then?

Maybe this is just to help developers, so you can use it to remind
yourself (or other people working on your code) that a certain string
contains utf8, so that you/they remember to use the conversion
functions, etc. Just a guess.
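
That guess is plausible: the alias costs nothing at runtime but carries
intent in declarations. A sketch (the first line is roughly how FPC declares
the alias; AppendUtf8Log is a hypothetical routine, only for illustration):

type
  UTF8String = type AnsiString;  // same layout, separate name

// The name tells callers which encoding the routine expects:
procedure AppendUtf8Log(const Line: UTF8String);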

Henry

--
Michael Schnell
2011-01-03 11:34:49 UTC
On 01/02/2011 10:22 PM, Graeme Geldenhuys wrote:
> the code-page enabled string type (cpstrnew branch)
I understand that cpstrnew does not necessarily use code pages, but it _can_
use them if that seems appropriate.

A conversion between the various UTF types and codepage-based encodings
is done automatically, if necessary. (Which will both prevent and
introduce confusion :-) )

Correct?

-Michael


--
Marco van de Voort
2011-02-12 17:49:11 UTC
On Sun, Jan 02, 2011 at 11:22:43PM +0200, Graeme Geldenhuys wrote:
> > The type name might come from Delphi compatibility (tada!).
>
> And once again, for someone like myself not using Delphi, it is quite
> ridiculous to see the errors the Free Pascal project makes

It doesn't. You just assume nobody thought about it before you, and that it
is an error. Moreover, you also assume that everybody has the same
sensitivities as you about certain topics, and again that is wrong.

> (or should that rather be the errors Delphi makes) in such cases/examples.
> Trying to fool all developers into thinking that "unicode" [as in
> UnicodeString type] only means UTF-16, because incidentally that is what
> Microsoft uses in its Windows OS.

We would never try to "fool" anybody to think such a thing. Please stop
putting words in our mouths.

> Maybe the code-page enabled string type (cpstrnew branch) will use
> some more "sane" name for its string type, or redefine the standard
> String type to mean a code page / encoding enabled string type instead
> of String = AnsiString.

This is all undecided. I lean towards splitting operating system targets
into a utf8 and a utf16 one for most platforms(*), since nobody will ever agree
on one encoding. Not even per platform.

(*) and a legacy "ansi" one if need be.

> > And currently UTF8String is defined as AnsiString, so there is currently no
> > difference
>
> That's what I thought. So why did they [FPC team] actually bother to
> create the UTF8String type then?

It's an alias for literal programming purposes. You can see from the
typename what a procedure expects, and it goes into the documentation.

--
Michael Schnell
2011-02-14 10:07:51 UTC
On 02/12/2011 06:49 PM, Marco van de Voort wrote:
>>> And currently UTF8String is defined as AnsiString, so there is
>>> currently no difference
>> That's what I thought. So why did they [FPC team] actually bother to
>> create the UTF8String type then?
> It's an alias for literal programming purposes.

While I disagree with much of his wording, I do agree with the OP that
having AnsiString and UTF8String as aliases for exactly the same type -
serving only to keep in mind which encoding the user manually
introduces - is very confusing. These names strongly suggest that a
statement like myAnsiString := myUTF8String is either detected as illegal or
forces an appropriate conversion.

-Michael

--
Marco van de Voort
2011-02-16 21:12:12 UTC
On Mon, Feb 14, 2011 at 11:07:51AM +0100, Michael Schnell wrote:
> On 02/12/2011 06:49 PM, Marco van de Voort wrote:
> >>> And currently UTF8String is defined as AnsiString, so there is
> >>> currently no difference
> >> That's what I thought. So why did they [FPC team] actually bother to
> >> create the UTF8String type then?
> > It's an alias for literal programming purposes.
>
> While I disagree with much of his wording, I do agree with the OP that
> having AnsiString and UTF8String as aliases for exactly the same type -
> serving only to keep in mind which encoding the user manually
> introduces - is very confusing. These names strongly suggest that a
> statement like myAnsiString := myUTF8String is either detected as illegal or
> forces an appropriate conversion.

I don't see why, and don't feel responsible for other people's speculation.

--
Felipe Monteiro de Carvalho
2011-02-14 11:16:55 UTC
On Sat, Feb 12, 2011 at 6:49 PM, Marco van de Voort <***@stack.nl> wrote:
> This is all undecided. I lean towards splitting operating system targets
> into a utf8 and a utf16 one for most platforms(*), since nobody will ever agree
> on one encoding.  Not even per platform.
>
> (*) and a legacy "ansi" one if need be.

Why do we need "targets"?

Wouldn't it be better to simply duplicate all string functions for
utf8 and utf16 and ansi if necessary?

That was my idea from the start in case the new string was merged. For example:

CompareText
UTF8CompareText
UTF16CompareText

The versions with a fixed encoding could refer to a generic unicode
version with undefined encoding.
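
In declaration form the scheme would look something like this (hypothetical
signatures, just sketching the proposal):

// One name per fixed encoding; the plain one would use whatever the
// generic unicode string's encoding happens to be.
function CompareText(const S1, S2: string): Integer;
function UTF8CompareText(const S1, S2: UTF8String): Integer;
function UTF16CompareText(const S1, S2: UnicodeString): Integer;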

--
Felipe Monteiro de Carvalho

--
Graeme Geldenhuys
2011-02-14 12:08:10 UTC
On 2011-02-14 13:16, Felipe Monteiro de Carvalho wrote:
>
> Why do we need "targets"?

I would imagine that such a new string type (possibly UnicodeString)
would simply default to the encoding used by each platform.

eg:
* a Linux UnicodeString will default to UnicodeString(utf8)
* a Windows UnicodeString will default to UnicodeString(utf16)
etc..

But that doesn't limit the developer, because the developer could simply
define a new string type and use that instead.

eg:
// alias types with their encodings set to something specific
UTF8String = UnicodeString(utf8);
UTF16String = UnicodeString(utf16);
CP850String = UnicodeString(cp850);
etc...


> Wouldn't it be better to simply duplicate all string functions for
> utf8 and utf16 and ansi if necessary?

Why? CompareText and all other such functions should simply take
UnicodeString parameters (in addition to the already existing
WideString, AnsiString and ShortString versions). The Unicode enabled
version of CompareText will then query the encodings used, do an
automatic conversion if needed, then do the comparison and return the
result.

eg:

var
  u8: UTF8String;
  u16: UTF16String;
  s850: CP850String;
  r: integer;
begin
  u8 := ...;
  u16 := ...;
  s850 := ...;
  r := CompareText(u8, u16);
  ...
  r := CompareText(u8, s850);
  ...
end;


> CompareText
> UTF8CompareText
> UTF16CompareText

So for every possible encoding and code page you want to make a new
function? That doesn't sound like a good plan to me. The encoding
information is inside the string type, so use it to do automatic
conversions.



Regards,
- Graeme -

--
fpGUI Toolkit - a cross-platform GUI toolkit using Free Pascal
http://fpgui.sourceforge.net/


--
Felipe Monteiro de Carvalho
2011-02-14 12:15:38 UTC
On Mon, Feb 14, 2011 at 1:08 PM, Graeme Geldenhuys
<***@gmail.com> wrote:
> But that doesn't limit the developer, because the developer could simply
> define a new string type and use that instead.

Maybe I used a bad example, but anyway, var parameters need to be an exact
type match.

> So for every possible encoding and code page you want to make a new
> function?

Of course not! Only the important ones. For me this means only UTF-8,
but I suppose some people might want UTF-16 too.

--
Felipe Monteiro de Carvalho

--
Graeme Geldenhuys
2011-02-14 13:35:32 UTC
On 2011-02-14 14:15, Felipe Monteiro de Carvalho wrote:
>
> Maybe I used a bad example, but anyway, var parameters need to be exact

"alias types" are not really new types, so will not affect var
parameters. So in my previous example, UTF16String will still be a
UnicodeString. The only difference will be that UTF16String has its
internal encoding bit set to UTF-16.

Here is a test program using the latest FPC 2.4.3 showing that TMyText =
String - the compiler sees no difference between the two types. So the
same should be valid for UnicodeString vs UnicodeString(...) that have
their encoding bit set to something other than the platform default.

---8<-----------8<-----------8<-----------8<-----------8<-----------
program project1;

{$mode objfpc}{$H+}

uses
  Classes;

type
  TMyText = String;

procedure TestMe(var AText: string);
begin
  writeln(AText);
  AText := 'Hello ' + AText;
end;

var
  s: TMyText;

begin
  s := 'Graeme';
  TestMe(s);
  writeln(s);
end.
---8<-----------8<-----------8<-----------8<-----------8<-----------



>> So for every possible encoding and code page you want to make a new
>> function?
>
> Of course not! Only the important ones. For me this means only UTF-8,
> but I supposed some people might want UTF-16 too.

Again I don't see why that is needed. You might only require UTF-8, but
somebody else wants UTF-16, and somebody else wants CP850 versions, etc.
Where does it end?

A simple function like:

function CompareText(const S1: UnicodeString;
  const S2: UnicodeString): integer;

should be able to work with any UnicodeString parameters (including
alias types that have different encoding bits set). So it should work
fine for your UTF-8 text, and somebody else's UTF-16 etc. text.


Regards,
- Graeme -

--
fpGUI Toolkit - a cross-platform GUI toolkit using Free Pascal
http://fpgui.sourceforge.net/


--
Felipe Monteiro de Carvalho
2011-02-14 14:00:45 UTC
On Mon, Feb 14, 2011 at 2:35 PM, Graeme Geldenhuys
<***@gmail.com> wrote:
> Here is a test program using latest FPC 2.4.3 showing that TMyString =
> String - the compiler sees no difference between the two types. So the
> same should be valid for UnicodeString vs UnicodeString(...) that have
> their encoding bit set to something other than the platform default.

It is pure speculation to assume that this behavior will remain valid.

--
Felipe Monteiro de Carvalho

--
Graeme Geldenhuys
2011-02-14 14:23:05 UTC
On 2011-02-14 16:00, Felipe Monteiro de Carvalho wrote:
>
> It is pure speculation to assume that this behavior will remain valid.

Well, it seems obvious to me that it should [and would]. From my
previous examples, the alias type is still a UnicodeString type. So why
wouldn't methods/procedures/functions that take UnicodeString types as
parameters work?

If the FPC implementation of such a UnicodeString (and the behaviour I
described) cannot handle such cases, then I would have to say that the
FPC implementation would be seriously crippled. Let's hope it doesn't go
that route.

Not that I care (because Delphi doesn't do everything perfect), but how
does Delphi 2010 handle such alias types - especially when passed as
parameters (var and const)?


Regards,
- Graeme -

--
fpGUI Toolkit - a cross-platform GUI toolkit using Free Pascal
http://fpgui.sourceforge.net/


--
Sven Barth
2011-02-15 15:38:11 UTC
On 14.02.2011 15:23, Graeme Geldenhuys wrote:
> Not that I care (because Delphi doesn't do everything perfect), but how
> does Delphi 2010 handle such alias types - especially when passed as
> parameters (var and const)?

In Delphi such strings with codepage need to be defined with "type" and
thus they are different types (not compatible regarding "var").

The following example fails to compile at the call of Test.

====source begin====
program strvartest;

{$APPTYPE CONSOLE}

uses
  SysUtils;

type
  CyrillicString = type AnsiString(1251);
  LatinString = type AnsiString(1252);

procedure Test(var aStr: CyrillicString);
begin
end;

var
  s: LatinString;
begin
  s := 'Foo';
  Test(s);
end.
====source end====

Regards,
Sven

--
Graeme Geldenhuys
2011-02-16 07:37:42 UTC
On 2011-02-15 17:38, Sven Barth wrote:
>
> In Delphi such strings with codepage need to be defined with "type" and
> thus they are different types (not compatible regarding "var").


Let's hope FPC doesn't take that stupid Delphi idea into its implementation.
A UnicodeString with a different encoding bit set is still a UnicodeString
type.

Thanks though for testing and letting us know what Delphi does.


Regards,
- Graeme -

--
fpGUI Toolkit - a cross-platform GUI toolkit using Free Pascal
http://fpgui.sourceforge.net/


--
Sven Barth
2011-02-16 08:21:18 UTC
On 16.02.2011 08:37, Graeme Geldenhuys wrote:
> On 2011-02-15 17:38, Sven Barth wrote:
>>
>> In Delphi such strings with codepage need to be defined with "type" and
>> thus they are different types (not compatible regarding "var").
>
>
> Let's hope FPC doesn't take that stupid Delphi idea into its implementation.
> A UnicodeString with a different encoding bit set is still a UnicodeString
> type.
>

First of all, the cpstrnew branch needs to be revived...

> Thanks though for testing and letting us know what Delphi does.

You're welcome. After all I have bought that Delphi XE Starter for
exactly such tests :D

Regards,
Sven

--
Sergei Gorelkin
2011-02-16 08:21:51 UTC
Graeme Geldenhuys wrote:
> On 2011-02-15 17:38, Sven Barth wrote:
>> In Delphi such strings with codepage need to be defined with "type" and
>> thus they are different types (not compatible regarding "var").
>
>
> Let's hope FPC doesn't take that stupid Delphi idea into its implementation.
> A UnicodeString with a different encoding bit set is still a UnicodeString
> type.
>
> Thanks though for testing and letting us know what Delphi does.
>
It's not as stupid as it may seem. When a string being passed to a function is empty, it is actually
a nil pointer, and its encoding cannot be determined at runtime. That is fine for value parameters, but for
var/out parameters the caller must know what data to assign. Making each encoding a distinct type
and having the compiler do the checks is probably the only way out.
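
The empty-string point can be checked directly (a minimal sketch):

program nilstr;

{$mode objfpc}{$H+}

var
  s: UTF8String = '';
begin
  // An empty dynamic string is stored as a nil pointer, so at run time
  // there is no string header left to read an encoding from; only the
  // static type still says "UTF-8", and only the compiler can see that.
  writeln(Pointer(s) = nil);  // prints TRUE
end.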

Sergei

--
Graeme Geldenhuys
2011-02-16 09:20:47 UTC
On 2011-02-16 10:21, Sergei Gorelkin wrote:
>
> When a string being passed to a
> function is empty, it is actually a nil pointer, and its encoding cannot
> be determined at runtime.

A new RTL function like...

function QueryEncoding(S1: UnicodeString): StringEncodingType;

...could easily solve this problem. Plus, even though the string is
empty, the internal type structure information for that string variable
had to be set up, so the information about the encoding should exist.

> Fine for value parameters, but for var/out
> parameters the caller must know what data to assign.

Again, if you simply use UnicodeString, the var parameter and the local
UnicodeString variable inside the function/procedure should have the
same encoding by default - so no conversion would be needed.

And even if they were different, the compiler could easily do the
auto-conversion to match the var parameter (as I described in my first
paragraph above).


Regards,
- Graeme -

--
fpGUI Toolkit - a cross-platform GUI toolkit using Free Pascal
http://fpgui.sourceforge.net/


--
Jürgen Hestermann
2011-02-14 18:03:31 UTC
Graeme Geldenhuys wrote:
> A simple function like:
> function CompareText(const S1: UnicodeString;
> const S2: UnicodeString): integer;
> should be able to work with any UnicodeString parameters (including
> alias types that have different encoding bits set). So it should work
> fine for your UTF-8 text, and somebody else's UTF-16 etc. text.

Do you mean that the compiler should convert the strings as needed in
the background (as between different integer types and/or floats) so
that you can call ListBox1.Items.Add(x) with x being UTF8string or
UTF16string or...? I am not sure whether this is a good idea, because the
programmer no longer knows how many conversions take place and therefore
cannot judge the performance impact anymore. On the other hand it would
make life easier for beginners.


--
José Mejuto
2011-02-14 19:13:45 UTC
Hello Lazarus-List,

Monday, February 14, 2011, 7:03:31 PM, you wrote:

JH> Do you mean that the compiler should convert the strings as needed in
JH> the background (as between different integer types and/or floats) so
JH> that you can call ListBox1.Items.Add(x) with x being UTF8string or
JH> UTF16string or...? I am not sure whether this is a good idea because the
JH> programmer no longer knows how many conversions take place and therefore
JH> cannot judge the performance impact anymore. On the other hand it would
JH> make the life easier for beginners.

I'm unable to see the "great" problems with "UnicodeString". The
conversions should be the minimum needed, and they will be. The problem
would be in the RTL, but not at user level. You say that the
programmer will not know how many conversions take place; that's
right, but I think they are guaranteed to be the minimum except in some
corner cases like "CompareText(UTF8String,WideString)", as one of the two
must be converted - but whichever one, it could be a fixed situation or
platform dependent, I do not know.

Many people are concerned about "speed" due to hidden conversions, so can
anybody tell me why? Maybe I'm blind and I cannot see something that
is absolutely a problem (except some pieces of the RTL).

--
Best regards,
José


--
Mattias Gaertner
2011-02-14 19:29:04 UTC
On Mon, 14 Feb 2011 20:13:45 +0100
José Mejuto <***@gmail.com> wrote:

> Hello Lazarus-List,
>
> Monday, February 14, 2011, 7:03:31 PM, you wrote:
>
> JH> Do you mean that the compiler should convert the strings as needed in
> JH> the background (as between different integer types and/or floats) so
> JH> that you can call ListBox1.Items.Add(x) with x being UTF8string or
> JH> UTF16string or...? I am not sure whether this is a good idea because the
> JH> programmer no longer knows how many conversions take place and therefore
> JH> cannot judge the performance impact anymore. On the other hand it would
> JH> make the life easier for beginners.
>
> I'm unable to see the "great" problems with "UnicodeString". The
> conversions should be the minimum needed, and they will be. The problem
> would be in the RTL, but not at user level.

Yes, since for example Linux allows invalid UTF-8 in file names,
any auto conversion of file names to UTF-16 is an error.


> You say that the
> programmer will not know how many conversions take place, that's
> right, but I think they are guaranteed to be the minimum except in some
> corner cases like "CompareText(UTF8String,WideString)" as one of both
> must be converted, but whichever one, could be a fixed situation or
> platform dependent, I do not know.
>
> Many people are concerned about "speed" due to hidden conversions, so can
> anybody tell me why? Maybe I'm blind and I cannot see something that
> is absolutely a problem (except some pieces of the RTL).

For instance searching needs a lot of compares. Comparing two
strings normally fails on the very first characters. An auto conversion
will always convert the whole string, including allocating and releasing
memory, easily slowing down the comparison by an order of magnitude.

Mattias


--
José Mejuto
2011-02-14 20:14:42 UTC
Hello Lazarus-List,

Monday, February 14, 2011, 8:29:04 PM, you wrote:

>> I'm unable to see the "great" problems with "UnicodeString". The
>> conversions should be the minimum needed, and they will be. The problem
>> would be in the RTL, but not at user level.
MG> Yes, since for example Linux allows invalid UTF-8 in file names,
MG> so any auto conversion of file names to UTF-16 is an error.

Hmmm... To me it looks like a Linux "problem"/"bug" for that kind of
access it is logical to me to use low level APIs. OK, that way you can
not access those files ? yes, but also in Windows there are similar
problems, some files can not be accessed using regular APIs and some
tricks must be used.

>> Many people are concerned about "speed" due to hidden conversions, so can
>> anybody tell me why? Maybe I'm blind and I cannot see something that
>> is absolutely a problem (except some pieces of the RTL).
MG> For instance searching needs a lot of compares. Comparing two
MG> strings normally fails on the very first characters. An auto conversion
MG> will always convert the whole string including allocating and releasing
MG> memory, easily slowing down the comparison by an order of magnitude.

These are the "corner cases" which cannot be handled by the usual
convert, operate, convert back sequence, but I think there are not many
cases like this. Of course, there are cases like a TStringList with
100000 items in UTF16 where you perform a search using a UTF8String;
then either you request a conversion of the string list (convert all
elements in one go) or you must keep your unicodestring in the default
unicode format for the platform.

I would like to see an example (a snippet) of such a problem which could
be a headache, but maybe that belongs on the fpc mailing lists?

--
Best regards,
José


--
Michael Schnell
2011-02-16 10:12:24 UTC
On 02/14/2011 08:29 PM, Mattias Gaertner wrote:
>
> For instance searching needs a lot of compares. Comparing two
> strings normally fails on the very first characters. An auto conversion
> will always convert the whole string including allocating and releasing
> memory, easily slowing down the comparison by an order of magnitude.
Very valid point!

So we could have an auto-converting compare function in the RTL that
(when finding different encodings) does not do a complete conversion
first and compare afterwards, but converts each multi-byte character of
both strings to the 32 bit Unicode character, compares these two
DWords, and stops after the first difference.
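
Something along these lines - a rough sketch, assuming {$mode objfpc} and
LCLProc's UTF8CharacterToUnicode helper; for brevity the UTF-16 side is
restricted to the BMP, so surrogate pairs would need extra handling:

uses
  LCLProc;

// Compare a UTF-8 string against a UTF-16 string code point by code point,
// stopping at the first difference - no full conversion, no temporary
// string allocations.
function CompareUtf8Utf16(const A: string; const B: UnicodeString): Integer;
var
  p: PChar;
  i, len: Integer;
  cpA, cpB: Cardinal;
begin
  p := PChar(A);
  i := 1;
  while (p^ <> #0) and (i <= Length(B)) do
  begin
    cpA := UTF8CharacterToUnicode(p, len);  // next UTF-8 code point
    cpB := Cardinal(B[i]);                  // next UTF-16 word (BMP only)
    if cpA <> cpB then
      Exit(Ord(cpA > cpB) - Ord(cpA < cpB));
    Inc(p, len);
    Inc(i);
  end;
  // Equal prefixes: the shorter string sorts first.
  Result := Ord(p^ <> #0) - Ord(i <= Length(B));
end;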

-Michael

--
Graeme Geldenhuys
2011-02-16 10:22:33 UTC
On 2011-02-14 21:29, Mattias Gaertner wrote:
>
> Yes, since for example Linux allows invalid UTF-8 in file names,
> any auto conversion of file names to UTF-16 is an error.

I have never noticed that, plus it seems more like a bug in Linux's
filesystem (whichever one you use - I use JFS only, and haven't noticed
such issues). So maybe file a bug report with the Linux project instead
of working around the issue forever.


> For instance searching needs a lot of compares. Comparing two
> strings normally fails on the very first characters. An auto conversion
> will always convert the whole string including allocating and releasing

You are missing the point. If full Unicode support exists in FPC and
Lazarus - say via the UnicodeString type - then the encoding of
the search string and the encoding of the text inside the editor will be
the same, and thus no conversion is needed.

The only exception here will be a "Find in files" under Windows, where
Windows uses UTF-16, but normally files are stored as UTF-8. Though Mac,
Linux and *BSD will not be affected by this as they use UTF-8 pretty
much everywhere.


Regards,
- Graeme -

--
fpGUI Toolkit - a cross-platform GUI toolkit using Free Pascal
http://fpgui.sourceforge.net/


--
Marco van de Voort
2011-02-14 19:52:27 UTC
On Mon, Feb 14, 2011 at 08:13:45PM +0100, José Mejuto wrote:
> Many people are concerned about "speed" due to hidden conversions, so can
> anybody tell me why? Maybe I'm blind and I cannot see something that
> is absolutely a problem (except some pieces of the RTL).

A typical example is when you mix two codebases which have a different opinion
about the string type. Then for every transition between those two codebases
you have a fair chance that a conversion is needed. It is thoroughly
possible that if you do a TStringList.IndexOf() you do as many
conversions as there are elements in the stringlist (if the string type you
pass is different from the TStringList one).

A minimum conversion scheme uses one type internally and only converts at
the boundaries of the system. But even that has worst cases, e.g.
operating on large database exports in a different format than the native one.

But at least that kind of problem is fairly localised. It is harder if it
is everywhere in the codebase, like in the former example.



--
José Mejuto
2011-02-14 20:28:49 UTC
Hello Lazarus-List,

Monday, February 14, 2011, 8:52:27 PM, you wrote:

>> Many people are concerned about "speed" due to hidden conversions, so can
>> anybody tell me why? Maybe I'm blind and I cannot see something that
>> is absolutely a problem (except some pieces of the RTL).
MvdV> A typical example is when you mix two codebases which have a different opinion
MvdV> about the string type. Then for every transition between those two codebases
MvdV> you have a fair chance that a conversion is needed. It is thoroughly
MvdV> possible that if you do a TStringList.IndexOf() you do as many
MvdV> conversions as there are elements in the stringlist (if the string type you
MvdV> pass is different from the TStringList one).

But you are in the same trouble if you use any other approach: either you
use your data in the same unicode format as the other codebase, or you
update the codebase to use your "new" unicode format.

There isn't a solution for such a situation. I'm currently working with
GeckoPort, which uses WideString everywhere along with other special
strings. I know that conversions must happen, so when I need to scan
for a string I first convert my data to the "native" format and then
perform the scan.

I think expecting a TStringList in an ANSI encoding to work transparently
and optimally with unicodestrings is just a dream; programmers should
update their codebase, if only for speed (to reduce
autoconversions) and so they do not need to decide constantly which
encoding format is needed to call this or that function.

Using a different RTL for each encoding is even worse IMHO. But this
is just a simple opinion.
--
Best regards,
José


--
José Mejuto
2011-02-15 18:08:57 UTC
Hello Lazarus-List,

Tuesday, February 15, 2011, 4:27:40 PM, you wrote:

MvdV> There is a solution that you can take FPC and Lazarus libraries out of the
MvdV> equation by splitting each target into a one and two byte encoding
MvdV> platform. That should fix the bulk of the problem.
[...]
MvdV> Update to what? The point is that neither of the two most used encodings (UTF8/16)
MvdV> is going away any time soon, and the split runs right through platforms.

So you are talking about a "platform" about the encoding, but in any
operative system platform, this means you can choose RTL Linux 32 bits
WideString or RTL Linux 32 bits UTF8, do not ?

>> Using a different RTL for each encoding is even worse IMHO. But this
>> is just a simple opinion.
MvdV> Let's hear the argumentation for that then.

Well, maybe I misunderstood the platform split...

--
Best regards,
José


--
Graeme Geldenhuys
2011-02-15 06:54:18 UTC
On 2011-02-14 21:13, José Mejuto wrote:
> right, but I think they are guaranteed to be the minimum except in some
> corner cases like "CompareText(UTF8String,WideString)", as one of the two
> must be converted - but whichever one,


Exactly - auto-conversion would be kept to a minimum without any user
intervention. In the above example CompareText() could quickly check
each string's encoding, and if one matches the platform default, that
string doesn't need a conversion.

So in your example above, if you ran that under Linux, only the second
parameter would require an encoding conversion.


Regards,
- Graeme -

--
fpGUI Toolkit - a cross-platform GUI toolkit using Free Pascal
http://fpgui.sourceforge.net/


--
Graeme Geldenhuys
2011-02-15 06:50:49 UTC
On 2011-02-14 20:03, Jürgen Hestermann wrote:
> Do you mean that the compiler should convert the strings as needed in
> the background (as between different integer types and/or floats) so
> that you can call ListBox1.Items.Add(x) with x being UTF8string or
> UTF16string or...?

Yes, but in reality how often would such conversions happen? TStringList
(used inside a TListBox) would use UnicodeString types. The encoding of
that type would default to whatever platform you compiled on, ie: under
Linux it would default to UTF-8, and under Windows it would default to
UTF-16.

So if you define a new string of UnicodeString type in code, it would
automatically match the encoding type of the TStringList, so when you
add a string item to the listbox, no conversion would be needed. This
would probably be the case 99% of the time.

The developer would explicitly have to create a new string and
manually set an encoding different from the platform default, before an
auto-conversion would be required.

In day-to-day work and in most cases auto-conversions will be kept to a
minimum - automatically.


Regards,
- Graeme -

--
fpGUI Toolkit - a cross-platform GUI toolkit using Free Pascal
http://fpgui.sourceforge.net/


--
Hans-Peter Diettrich
2011-02-15 14:32:58 UTC
Graeme Geldenhuys wrote:
> On 2011-02-14 20:03, Jürgen Hestermann wrote:
>> Do you mean that the compiler should convert the strings as needed in
>> the background (as between different integer types and/or floats) so
>> that you can call ListBox1.Items.Add(x) with x being UTF8string or
>> UTF16string or...?
>
> Yes, but in reality how often would such conversions happen? TStringList
> (used inside a TListBox) would use UnicodeString types. The encoding of
> that type would default to whatever platform you compiled on. ie: under
> Linux it would default to UTF-8, and under Windows it would default to
> UTF-16

You realize the problems that may result from the different char type
of such a target-specific string type?

IMO such strings should be encapsulated in a (polymorphic) class that
only allows char-type independent operations.

DoDi


--
Graeme Geldenhuys
2011-02-16 07:15:44 UTC
On 2011-02-15 16:32, Hans-Peter Diettrich wrote:
>
> You realize the problems that may result from the different char type
> of such a target-specific string type?

Please do share your thoughts...

I must add that I would be very surprised if Embarcadero doesn't use
natively encoded string types for the "unicode string" support in the
upcoming Delphi under Windows (UTF-16), Linux (UTF-8), Mac (UTF-8) etc.
I'm not 100% sure about the default Mac encoding, but seeing that it
comes from FreeBSD, I would guess UTF-8 there too.

As for saving text to a file: it is universally understood that UTF-8 is
used in such cases, because UTF-8 is the perfect encoding for streaming.
Hence the W3C also said all HTML, XML etc. should preferably be in UTF-8.

Regards,
- Graeme -

--
fpGUI Toolkit - a cross-platform GUI toolkit using Free Pascal
http://fpgui.sourceforge.net/


--
Hans-Peter Diettrich
2011-02-16 10:52:24 UTC
Graeme Geldenhuys wrote:
> On 2011-02-15 16:32, Hans-Peter Diettrich wrote:
>> You realize the problems that may result from the different char type
>> of such a target-specific string type?
>
> Please do share your thoughts...

Most people have been sure, in the past, that they use an SBCS, where
every character on screen is a char in memory. Consequently they use
indexed access to the chars in a string, and for...to loops. The same
procedures may work for UTF-16, where most characters also correspond to
one widechar, but this code will fail miserably on an UTF-8 platform,
where every single (visual) character can consist of any number of
chars - with no compiler warnings.
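
For instance, the classic pattern - fine for single-byte text, silently wrong
once the string holds UTF-8 (a sketch, assuming a UTF-8 source file):

program upcase1;

{$mode objfpc}{$H+}

var
  s: string;
  i: Integer;
begin
  s := 'Straße';
  for i := 1 to Length(s) do  // iterates over bytes, not characters
    s[i] := UpCase(s[i]);     // the two bytes of 'ß' are not ASCII letters
  writeln(s);                 // prints 'STRAßE' instead of 'STRASSE',
                              // and the compiler cannot warn about it
end.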

That's one reason why I think that it should be disallowed, in portable
code, to use any char type together with strings. Such restrictions
cannot be applied to specific string types, unless these are strictly
different from the old ShortStrings and AnsiStrings.

It would be nice, of course, for old style code, to have strings with a
known (app specific) and *immutable* encoding. String handling with such
a target independent string type would work properly on any target, as
long as the contents match the coder's expectations. In Cobol such
strings were for "usage computational", in contrast to "usage display"
with target specific encoding.


> I must add, that I would be very surprised if Embarcadero doesn't use
> native encoded string types for the "unicode string" support in the
> upcoming Delphi under Windows (UTF-16), Linux (UTF-8), Mac (UTF-8) etc..
> I'm not 100% sure about the default Mac encoding, but seeing that it
> comes from FreeBSD, I would guess UTF-8 there too.

AFAIK the UnicodeString allows for any dynamic encoding, be it SBCS, MBCS
or UTF-8/16. The element (char) size and encoding have become part of
every Unicode string descriptor.


> As for saving text to file...It is universally known to use UTF-8 in
> such cases, because UTF-8 is the perfect encoding for streaming. Hence
> the W3C also said all HTML, XML etc should be preferably in UTF-8.

Right, UTF-8 is the recommended external representation of text. No byte
order problems, no conversion losses...

DoDi


--
Sven Barth
2011-02-16 12:20:37 UTC
On 16.02.2011 11:52, Hans-Peter Diettrich wrote:
>> I must add, that I would be very surprised if Embarcadero doesn't use
>> native encoded string types for the "unicode string" support in the
>> upcoming Delphi under Windows (UTF-16), Linux (UTF-8), Mac (UTF-8) etc..
>> I'm not 100% sure about the default Mac encoding, but seeing that it
>> comes from FreeBSD, I would guess UTF-8 there too.
>
> AFAIK the UnicodeString allows for any dynamic encoding, be it SBCS, MBCS
> or UTF-8/16. The element (char) size and encoding have become part of
> every Unicode string descriptor.

This is wrong.

The following compiles:

type
  UTF8String = type AnsiString(65001);

but the following does not:

type
  UTF8String = type UnicodeString(65001); // ';' expected, but '(' found

Tested using Delphi XE (65001 is the codepage for UTF-8 on Windows).

Regards,
Sven

--
Sergei Gorelkin
2011-02-16 18:25:18 UTC
Sven Barth wrote:
> On 16.02.2011 11:52, Hans-Peter Diettrich wrote:
>>> I must add, that I would be very surprised if Embarcadero doesn't use
>>> native encoded string types for the "unicode string" support in the
>>> upcoming Delphi under Windows (UTF-16), Linux (UTF-8), Mac (UTF-8) etc..
>>> I'm not 100% sure about the default Mac encoding, but seeing that it
>>> comes from FreeBSD, I would guess UTF-8 there too.
>>
>> AFAIK the UnicodeString allows for any dynamic encoding, be it SBCS, MBCS
>> or UTF-8/16. The element (char) size and encoding have become part of
>> every Unicode string descriptor.
>
> This is wrong.
>
> The following compiles:
>
> type
> UTF8String = type AnsiString(65001);
>
> but the following does not:
>
> type
> UTF8String = type UnicodeString(65001); // ';' expected, but '(' found
>
> Tested using Delphi XE (65001 is the codepage for UTF-8 on Windows).
>
You are right. Likewise, type AnsiString(1200) can be declared, but it won't work (1200 is the UTF-16
codepage).
In Delphi, UnicodeString is a very separate type, something close to the current FPC design.
It has BytesPerChar and Encoding attributes, but they are fixed to 2 and 1200 respectively and their
purpose is unclear (to consume memory? to make it look like it's compatible with AnsiString?).

This has a lot of consequences in the RTL, e.g. passing them in 'array of const' uses the type field
vtUnicodeString, not vtAnsiString; assigning to a Variant uses varUString; a published property of
type UnicodeString has typekind=tkUString and so on. Some of these are already implemented in the FPC
RTL for compatibility reasons.

I'm afraid that due to this "compatibility" we're doomed to clone the Delphi implementation, however
crappy it is :(

Sergei

--
Hans-Peter Diettrich
2011-02-16 22:21:14 UTC
Sergei Gorelkin wrote:

> I'm afraid that due to this "compatibility" we're doomed to clone the
> Delphi implementation, however crappy it is :(

I'm not sure whether we have to be compatible with Delphi > 7, or whether
the Delphi Unicode implementation is crappy. At least the chosen
implementation allowed for quite a smooth transition from Ansi to Unicode.

DoDi


--
Graeme Geldenhuys
2011-02-17 06:54:58 UTC
On 2011-02-16 20:25, Sergei Gorelkin wrote:
> In Delphi, UnicodeString is a very separate type, something close to the
> current FPC design.
> It has BytesPerChar and Encoding attributes, but they are fixed to 2 and
> 1200 respectively and their purpose is unclear (to consume memory?

Now that to me is just stupid!


> I'm afraid that due to this "compatibility" we're doomed to clone the
> Delphi implementation, however crappy it is :(

If so, then I wouldn't waste any more of my time working on the
cpstrnew branch (which I have been doing quietly on my own). If I'm
going to invest time in something, I would hope to make it as good as I
can, and intuitive at the same time - fixing mistakes found in other
products/languages. Simply cloning a rubbish design for the sake of
cloning is not why I got into programming in the first place!

Hans-Peter also raised a valid point. FPC had a goal of being Delphi 7
compatible, so that should leave us open to be inventive, learn from
post-Delphi 7 mistakes, and make FPC even better than Delphi 7+. If FPC
just wants to be a Delphi clone, then why use FPC - just switch to the
"real" thing [Delphi]. They'll have cross-platform support soon.


Regards,
- Graeme -

--
fpGUI Toolkit - a cross-platform GUI toolkit using Free Pascal
http://fpgui.sourceforge.net/


--
Sven Barth
2011-02-17 08:11:57 UTC
On 17.02.2011 07:54, Graeme Geldenhuys wrote:
>> I'm afraid that due to this "compatibility" we're doomed to clone the
>> Delphi implementation, however crappy it is :(
>
> If so, then I wouldn't waste any more of my time working on the
> cpstrnew branch (which I have been doing quietly on my own). If I'm
> going to invest time in something, I would hope to make it as good as I
> can, and intuitive at the same time - fixing mistakes found in other
> products/languages. Simply cloning a rubbish design for the sake of
> cloning is not why I got into programming in the first place!
>

I'm also trying to fix mistakes Delphi introduced - at least in mode
objfpc.
Take my class helper implementation (WIP) for example:
Delphi allows "message", "virtual", "override", "published", etc, but
all those specifiers are ignored in the end (they just can't work with
the concept of class helpers). So I've simply forbidden them in mode objfpc.

> Hans-Peter also raised a valid point. FPC had a goal of being Delphi 7
> compatible, so that should leave us open to be inventive, learn from
> post-Delphi 7 mistakes, and make FPC even better than Delphi 7+. If FPC
> just wants to be a Delphi clone, then why use FPC - just switch to the
> "real" thing [Delphi]. They'll have cross-platform support soon.

You need to have Windows to even compile for other platforms... No,
thank you (and the IDE might not work in Wine, because it's stuffed with
.NET things).

Regards,
Sven

--
Hans-Peter Diettrich
2011-02-17 11:58:12 UTC
Sven Barth wrote:

>> Hans-Peter also raised a valid point. FPC had a goal of being Delphi 7
>> compatible, so that should leave us open to be inventive, learn from
>> post-Delphi 7 mistakes, and make FPC even better than Delphi 7+. If FPC
>> just wants to be a Delphi clone, then why use FPC - just switch to the
>> "real" thing [Delphi]. They'll have cross-platform support soon.
>
> You need to have Windows to even compile for other platforms... No,
> thank you (and the IDE might not work in Wine, because it's stuffed with
> .NET things).

If we cloned the multi-platform Delphi version(s), then we would have to
clone the CLX, not the VCL. New support for other platforms is not yet out,
so we cannot decide right now whether we can or will follow that branch.

IMO the new Unicode versions broke so much legacy code that FPC/Lazarus
could become a real successor of the last Ansi version, with free choice
of the added Lazarus Unicode handling (UTF-8 for now).

WRT the Delphi IDE: it can be run in a VM, which is what I prefer for all
commercial/long-term projects anyhow. Debugging cross-platform
applications requires a separate target machine anyhow, so
virtualization is almost a *must* for the new Delphi versions.

DoDi


--
Hans-Peter Diettrich
2011-02-16 22:14:21 UTC
Sven Barth wrote:

> The following compiles:
>
> type
> UTF8String = type AnsiString(65001);
>
> but the following does not:
>
> type
> UTF8String = type UnicodeString(65001); // ';' expected, but '(' found
>
> Tested using Delphi XE (65001 is the codepage for UTF-8 on Windows).

Please test again, with something like
MyUnicodeString.Encoding := 65001;

DoDi


--
Graeme Geldenhuys
2011-02-16 12:42:37 UTC
On 2011-02-16 12:52, Hans-Peter Diettrich wrote:
> Most people have been sure, in the past, that they use a SBCS, where
> every character on screen is a char in memory. And consequently they use
> indexed access to the chars in an string, and for...to loops.

Yes, and that code accesses string characters 99% of the time in a
sequential manner, be that left-to-right (or the other way round), hardly
ever randomly. So to overcome this supposed limitation, one simply
needs to create a StringIterator (which I already have in my projects
where character extraction is needed), and that will work just fine. So I
don't see this as a problem at all.
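
Such an iterator is small; a minimal sketch (a hypothetical type, assuming
well-formed UTF-8 input; start with BytePos = 1 and call NextChar until it
returns an empty string):

type
  TUtf8Iterator = record
    Text: string;      // UTF-8 encoded text
    BytePos: Integer;  // 1-based index of the next lead byte
  end;

// Return the next character as a (possibly multi-byte) string,
// or '' when the end of the text has been reached.
function NextChar(var It: TUtf8Iterator): string;
var
  b: Byte;
  len: Integer;
begin
  Result := '';
  if It.BytePos > Length(It.Text) then
    Exit;
  b := Ord(It.Text[It.BytePos]);
  // The lead byte encodes the length of the sequence.
  if b < $80 then
    len := 1
  else if b < $E0 then
    len := 2
  else if b < $F0 then
    len := 3
  else
    len := 4;
  Result := Copy(It.Text, It.BytePos, len);
  Inc(It.BytePos, len);
end;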

> The same
> procedures may work for UTF-16,

No, character indexes will not work for UTF-16 either. Not ALL Unicode
characters can fit into 2 bytes. Also, what about screen characters
that are made up of multiple code-points (combining diacritics etc.)?
eg:
U+0041 (A) + U+030A (̊) = Å

Depending on how that string is normalized, doing a MyString[1] might
only return 'A' and not Å as you would have expected.


> one widechar, but this code will fail miserably on an UTF-8 platform,

And so too for UTF-16 - as I have just shown. If you want to use UTF-16
like that (just because *most* of the Unicode code-points can fit into
2 bytes), then it is no better than UCS-2.



Regards,
- Graeme -

--
fpGUI Toolkit - a cross-platform GUI toolkit using Free Pascal
http://fpgui.sourceforge.net/


--
Hans-Peter Diettrich
2011-02-16 22:58:13 UTC
Permalink
Graeme Geldenhuys wrote:
> On 2011-02-16 12:52, Hans-Peter Diettrich wrote:
>> Most people have been sure, in the past, that they use a SBCS, where
>> every character on screen is a char in memory. And consequently they use
>> indexed access to the chars in an string, and for...to loops.
>
> Yes, and that code accesses string characters 99% of the time in a
> sequential manner, be that left-to-right (or the other way round), hardly
> ever randomly. So to overcome this supposed limitation, one simply
> needs to create a StringIterator (which I already have in my projects
> where character extraction is needed), and that will work just fine. So I
> don't see this as a problem at all.

What's the type of the loop variable???

The iteration costs time, so many users will insist on using "fast"
SBCS access. No doubt proper Unicode coding will require iterators,
unless Pos can return a valid index immediately.


>> The same
>> procedures may work for UTF-16,
>
> No, character indexes will not work for UTF-16 either. Not ALL Unicode
> Characters can fit into a 2-bytes.

When a Unicode string contains the same characters as an Ansi string,
then all these BMP characters fit into one widechar each.


> Also what about screen characters
> that are made up of multiple code-points (combining diacritics etc)?
> eg:
> U+0041 (A) + U+030A (̊) = Å

These are special Unicode issues that never were an issue with
Ansi strings, and should not be in Unicode either - as long as one deals
with the same content as before. Again the Cobol distinction applies: the
user does not have to bother with the internals of strings of "usage
display" - they are only read, written and displayed, plus whatever else
can be done in portable "high-level" string handling.

Dealing with *all* the Unicode quirks IMO is beyond "usual" coding; it
will be reserved for specialized text processing components or applications.

Perhaps you understand better now why I suggest a string type with an
immutable, application-defined codepage, for "traditional" coding? This
would be "usage computational", where the known rules for low-level
string handling apply, just as used with AnsiStrings.


> Depending on how that string is normalized, doing a MyString[1] might
> only return 'A' and not Å as you would have expected.

No difference to the current encoding, is there? You should not assume
that such non-canonical Unicode is or has ever been translated into a
single Ansi char by automatic conversion.


>> one widechar, but this code will fail miserably on an UTF-8 platform,
>
> And so too for UTF-16 - as I have just shown. If you want to use UTF-16
> like that (just because *most* of the Unicode code-points can fit into
> 2-bytes), then it is no better that UCS-2.

*Most* users will be happy with the BMP. Those using codepages outside
the BMP have always had to live with all that stuff.

IMO the most important thing about Unicode is to teach the users the
difference between low and high level string handling. Indexed access to
characters is a low level operation that should not be used in
Unicode-aware applications without specific knowledge. Low level string
handling requires exact knowledge of the encoding of a string,
with possible branches for the *expected* encodings and char types. High
level string handling is not character-based, so your objections do
not apply.

DoDi


--
Jürgen Hestermann
2011-02-17 06:19:48 UTC
Permalink
Hans-Peter Diettrich schrieb:
> Indexed access to characters is a low level operation, that should
not be used in Unicode-aware applications without specific knowledge.

I often search for substrings, delete them from the string, insert other
strings at certain places, etc.
How can you do all this without knowledge of the internal structure of
the string?


--
Michael Schnell
2011-02-17 09:28:31 UTC
Permalink
On 02/17/2011 07:19 AM, Jürgen Hestermann wrote:
>
> I often search for substrings, delete them from the string, insert
> other strings at certain places, etc.
> How can you do all this without knowledge of the internal structure of
> the string?
This (magically :-) ) does work with UTF8. You just can't use
MyString[i] or store a single character. You need to handle single
characters as strings.
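
For example, the plain byte-based RTL routines work as long as both
needle and haystack are valid UTF-8 (a sketch, untested; assumes the
source file is saved as UTF-8):

program ReplaceDemo;
uses
  SysUtils;
var
  s: string;
begin
  s := 'hällo wörld';   // UTF-8 encoded literal
  // byte-wise matching is safe here: a valid UTF-8 needle can only
  // match at a code point boundary
  s := StringReplace(s, 'ö', 'oe', [rfReplaceAll]);
  WriteLn(s);           // 'hällo woerld'
end.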

-Michael

--
Graeme Geldenhuys
2011-02-17 09:35:15 UTC
Permalink
Op 2011-02-17 11:28, Michael Schnell het geskryf:
> On 02/17/2011 07:19 AM, Jürgen Hestermann wrote:
>>
>> I often search for substrings, delete them from the string, insert
>> other strings at certain places, etc.
>> How can you do all this without knowledge of the internal structure of
>> the string?
> This (magically :-) ) does work with UTF8.

NO, it doesn't! You can't use FPC's Copy(), Pos() etc. reliably with
UTF-8 text, because those RTL functions work purely on ANSI text
(1-byte characters - speaking of String type text here) and don't know
about multi-byte characters, combining diacritics etc. Hence LCL and
fpGUI have special functions, similar to the RTL's, that know how to work
with UTF-8 encoded text, e.g. the UTF8Pos(), UTF8Length() and UTF8Copy() functions.
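
A small example of the difference (untested; needs the LCLProc unit from
the LCL - in newer Lazarus versions these helpers live in LazUTF8):

program Utf8Demo;
uses
  LCLProc;
var
  s: string;
begin
  s := 'hällo';                 // 6 bytes, 5 code points in UTF-8
  WriteLn(Length(s));           // 6  -- byte count
  WriteLn(UTF8Length(s));       // 5  -- code point count
  WriteLn(UTF8Copy(s, 2, 1));   // 'ä' -- second code point
end.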


Regards,
- Graeme -

--
fpGUI Toolkit - a cross-platform GUI toolkit using Free Pascal
http://fpgui.sourceforge.net/


--
Michael Schnell
2011-02-17 09:50:06 UTC
Permalink
On 02/17/2011 10:35 AM, Graeme Geldenhuys wrote:
>
> You can't use FPC's Copy(), Pos() etc reliably with
> UTF-8 text, because thouse RTL functions work purely on ANSI text
> (1-byte characters - speaking of String type text here) and don't know
> about multi-byte characters,
Thats the "magic" :-) . pos() does finds the correct multi-byte characters.
> combining diacritics etc.
This of course does not work, as these "Unicode quirks" (this name was
not introduced by me!) mean that the same visual character can be
encoded in different ways. I don't know if it is even possible (and
sensible) to support this at the language level.

-Michael

--
Mattias Gaertner
2011-02-17 10:04:20 UTC
Permalink
 
 

Graeme Geldenhuys <***@gmail.com> hat am 17. Februar 2011 um 10:35
geschrieben:

> Op 2011-02-17 11:28, Michael Schnell het geskryf:
> > On 02/17/2011 07:19 AM, Jürgen Hestermann wrote:
> >>
> >> I often search for substrings, delete them from the string, insert
> >> other strings at certain places, etc.
> >> How can you do all this without knowledge of the internal structure of
> >> the string?
> > This (magically :-) ) does work with UTF8.
>
> NO, it doesn't! You can't use FPC's Copy(), Pos() etc reliably with
> UTF-8 text, because those RTL functions work purely on ANSI text
> (1-byte characters - speaking of String type text here) and don't know
> about multi-byte characters, combining diacritics etc. 
Yes, it does. UTF8Pos simply calls Pos and converts the byte position to a code
point position.
Pos works because the lead byte of a UTF-8 code point is distinct from the
continuation bytes (which always start with %10), so if you search for a valid
UTF-8 string, Pos will return a valid UTF-8 position. Of course this is a byte
position.
And since Copy, Insert and Delete use byte positions as well, you can use them
together without trouble.
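
A small example of how those byte positions compose (untested; assumes
the source file is saved as UTF-8):

program BytePosDemo;
var
  s: string;
  p: Integer;
begin
  s := 'αβγ';                              // UTF-8: each Greek letter is 2 bytes
  p := Pos('β', s);                        // 3 -- a byte position
  WriteLn(Copy(s, p, Length(s) - p + 1));  // 'βγ' -- feeds Copy directly
  Delete(s, p, Length('β'));               // byte count of the needle
  WriteLn(s);                              // 'αγ'
end.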



> Hence LCL and fpGUI have special functions similar to RTL, that knows how to
> work with
> UTF-8 encoded text. eg: UTF8Pos(), UTF8Length and UTF8Copy() etc functions.
They are useful when you must deal with code points. For example TEdit.SelStart
and SelLength are in code points.


Mattias
Hans-Peter Diettrich
2011-02-17 12:41:50 UTC
Permalink
Graeme Geldenhuys schrieb:
> Op 2011-02-17 11:28, Michael Schnell het geskryf:
>> On 02/17/2011 07:19 AM, Jürgen Hestermann wrote:
>>> I often search for substrings, delete them from the string, insert
>>> other strings at certain places, etc.
>>> How can you do all this without knowledge of the internal structure of
>>> the string?
>> This (magically :-) ) does work with UTF8.
>
> NO, it doesn't! You can't use FPC's Copy(), Pos() etc reliably with
> UTF-8 text,

You can, if you do it the *right* way.

> because those RTL functions work purely on ANSI text
> (1-byte characters - speaking of String type text here) and don't know
> about multi-byte characters, combining diacritics etc.

Pos() certainly works with MBCS as well, and you cannot expect that
combining characters and ligatures are handled by the basic Unicode
functions. When Copy requires a byte count, you can compute it from the
difference of the index positions of the involved substrings. It would
be better, though, if the basic procedures did not deal with counts
or sizes at all.

> Hence LCL and
> fpGUI have special functions similar to the RTL's, that know how to work with
> UTF-8 encoded text. eg: UTF8Pos(), UTF8Length and UTF8Copy() etc functions.

This is a stupid idea, IMO. A "UTF8" prefix is inappropriate when it
comes to the distinction between physical and logical functionality.
E.g. the number of *logical* (maybe visible) characters can be
determined for any string encoding, and that function should have a
*unique* name and (possibly) overloaded implementations. Likewise a
SubString procedure could take two index positions, which can be
determined without knowledge of the string encoding. This way string
insertion or extraction does not require a re-parse of the strings, in
order to translate logical into physical indices and counts.
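
A minimal sketch of such a SubString (untested; the name and the
exclusive end index are my own choice):

// both indices are physical (byte) positions, e.g. straight from Pos();
// EndIdx is exclusive, i.e. one past the last wanted byte
function SubString(const s: string; StartIdx, EndIdx: Integer): string;
begin
  Result := Copy(s, StartIdx, EndIdx - StartIdx);
end;

// usage: no re-parse of the encoding is needed
// i := Pos('<', s);
// j := Pos('>', s);
// tag := SubString(s, i, j + 1);   // '<...>'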

IMO we simply have to agree that Length() is a physical property, the
number of elements in an array. A logical character count has a very
different meaning in string handling, and not even a *single* meaning,
when we start dealing with ligatures and other Unicode stuff[1].

[1] In a mix of LTR and RTL parts a distinction between sequential
physical and logical indices is required as well. The first RTL
codepoint physically follows the preceding LTR codepoint, but logically
(on screen...) it precedes the *next* LTR codepoint. I only see one
proper solution to such quirks, by restricting the arguments of string
handling functions to physical (array) indices. Logical increments of
such indices are at the discretion of the user, depending on his
understanding of the desired result. Library functions only can deal
with different encodings, but always will return physical indices.

DoDi


--
Hans-Peter Diettrich
2011-02-17 12:06:05 UTC
Permalink
Jürgen Hestermann schrieb:
> Hans-Peter Diettrich schrieb:
> > Indexed access to characters is a low level operation, that should
> not be used in Unicode-aware applications without specific knowledge.
>
> I often search for substrings, delete them from the string, insert other
> strings at certain places, etc.
> How can you do all this without knowledge of the internal structure of
> the string?

How *not*?

In exactly the cases you mention, the strings are treated only as strings,
not as arrays of chars - that's the difference between high and low
level string handling. Dealing with chars inside strings, and
incrementing/decrementing indices, is a kind of pointer arithmetic, which
likewise requires knowledge about pointers.

DoDi


--
Graeme Geldenhuys
2011-02-17 07:21:07 UTC
Permalink
Op 2011-02-17 00:58, Hans-Peter Diettrich het geskryf:
>
> What's the type of the loop variable???

Any type that can store 4 bytes, be that a string, a dynamic array or a
custom object/class type.


> The iteration costs time, so that many users will insist in using "fast"
> SBCS access.

That would also mean they can't use Unicode text - which is the whole
point of this conversation.


> No doubt that proper Unicode coding will require iterators,
> unless Pos can return an valid index immediately.

There are many ways of implementing fast unicode Pos(), Length(), Copy()
etc... I have read numerous implementations - some fast, some not.

> When an Unicode string contains the same characters as an Ansi string,
> then all these BMP characters fit into one widechar.

Yes, but still, not all Unicode characters fit into a widechar
(2 bytes). Most [if not all - I'm not sure here] spoken languages fit
into the BMP, but that might not always be the case. Maybe some day you
want to translate all your text into Klingon or Goa'uld or whatever
alien race visits our planet. Being prepared and supporting the full
Unicode range is the best option at the moment.


> These are special Unicode issues, that never have been an issue with
> Ansi strings, and should not be in Unicode - as long as dealing with the
> same content as before.

My example might not have been extensive enough to get the point across.
The point being that what you see on screen as a "character" might be a
combination of code-points. This is not an "issue of Unicode", but a
functionality of Unicode - hence the reason there are stacks of
information about various Unicode normalizations too. E.g. Macs keep
them separated, whereas under Linux I believe such combined diacritics are
replaced with a single code-point that can represent the same
information [if it exists].


> - they only are read, written and displayed, and what else can be made
> in portable "high-level" string handling.

Well, for any string handling in your application, you need to know the
difference between what is perceived as a Unicode "character" on the
screen, and the various ways such a "character" can be presented in a
language structure. There is no way around this, unless FPC defines that
such Unicode strings are always stored in some specific normalized manner.


> Dealing with *all* the Unicode quirks IMO is beyond "usual" coding, it
> will be reserved to specialized text processing components or applications.

I'm not arguing that point.


> *Most* users will be happy with the BMP. Those using codepages outside
> the BMP had to live with all that stuff, since ever.

Then you should call it UCS-2 support, and not Unicode support. We are
talking about implementing Unicode support here.


Regards,
- Graeme -

--
fpGUI Toolkit - a cross-platform GUI toolkit using Free Pascal
http://fpgui.sourceforge.net/


--
Sven Barth
2011-02-17 08:16:57 UTC
Permalink
[OT]

Am 17.02.2011 08:21, schrieb Graeme Geldenhuys:
> Maybe some day you
> want to translate all your text into Klingon or Goa'uld or whatever
> alien race visits our planet. Being prepared and supporting the full
> Unicode is the best option at the moment.

If I weren't at work right now, I would have laughed out loud. Thanks for
this comment :D

Regards,
Sven (a Star Trek and Stargate fan)

[/OT]

--
José Mejuto
2011-02-17 08:46:48 UTC
Permalink
Hello Lazarus-List,

Thursday, February 17, 2011, 8:21:07 AM, you wrote:

GG> Well, for any string handling in your application, you need to know the
GG> difference between what is perceived as a Unicode "character" on the
GG> screen, and the various ways such a "character" can be presented in a
GG> language structure. There is no way around this, unless FPC defines that
GG> such Unicode strings are always stored in some specific normalized manner.

I think FPC should handle them as code points, which means they are
normalized. If the string is not normalized, it is the responsibility of
the user to normalize it via the supported functions, or to create new
functions that support it unnormalized.

In other words, string support does not deal with their representation.

--
Best regards,
José


--
Michael Schnell
2011-02-17 09:25:13 UTC
Permalink
On 02/16/2011 11:58 PM, Hans-Peter Diettrich wrote:
>
> Dealing with *all* the Unicode quirks IMO is beyond "usual" coding, it
> will be reserved to specialized text processing components or
> applications.
+1 from here,
but I have never done anything with the Mac, and I was once told that the
Mac uses such quirks constantly.

-Michael

--
Marco van de Voort
2011-02-15 15:28:56 UTC
Permalink
On Tue, Feb 15, 2011 at 08:50:49AM +0200, Graeme Geldenhuys wrote:
>
> Yes, but in reality how often would such conversions happen?

Does that really matter? If I don't know beforehand that I can control such issues
with some strategic choices, I won't start using it at all.


--
Hans-Peter Diettrich
2011-02-16 23:08:30 UTC
Permalink
Marco van de Voort schrieb:

Can you please make your mailer preserve *spaces* in the subject?

DoDi


--
Jürgen Hestermann
2011-02-16 07:12:38 UTC
Permalink
Graeme Geldenhuys schrieb:
> Op 2011-02-14 20:03, Jürgen Hestermann het geskryf:
>> Do you mean that the compiler should convert the strings as needed in
>> the background (as between different integer types and/or floats) so
>> that you can call ListBox1.Items.Add(x) with x beeing UTF8string or
>> UTF16string or...?
> Yes, but in reality how often would such conversions happen? TStringList
> (used inside a TListBox) would use UnicodeString types. The encoding of
> that type would default to whatever platform you compiled on. ie: under
> Linux it would default to UTF-8, and under Windows it would default to
> UTF-16


That sounds like yet another approach. So up to now I see 3 models of how
strings could be handled:

--------------
1.) Full programmer responsibility (current model):
The programmer is fully responsible for (and has full control over) the
strings used in his program. Libraries mostly use UTF8, with some
exceptions like API-related libraries. The programmer needs to know about
the string types used in all used libraries, and if conversions are
needed he has to initiate them manually.

Pros:
The programmer knows exactly what happens under the hood, so he can judge
performance and incompatibilities (at least he should).

Cons:
Much harder to code, because he *needs* to know all the details of
string encodings in different libraries. When strings are saved to files
they would be compatible across OS platforms, because the programmer can
use the same type in all cases, so files can be exchanged across them.

--------------
2.) A generic "UnicodeString" is mapped to different real sting types
"under the hood". So the used string type in programs (and libraries
like LCL) differs from platform to platform. The programmer does not
even know what type is used. If a conversion is still needed for special
routines it would be done automatically in the background without the
programmer having to know about it. Other real string types like
UTF8string are available but it's not encouraged to use them.

Pros:
Easy to code. In general, deeper knowledge about string encodings and
their storage is not needed. String conversions are seldom needed.

Cons:
When non-unicode strings are used on a platform (i.e. ANSI on Windows)
but unicode is required by the program it becomes clumsy because then
the programmer has to use it's own (unicode) string type and then
conversion are needed for all library and other functions. When strings
are saved to files they may differ on different platforms so files
cannot be exchanged accross them. All libraries have to be rewritten to
handle different string types.


-------------
3.) A middle course: UTF8 is chosen to be the main string type, which
should be used whenever possible (within the LCL and other libraries), and
programmers are also encouraged to use it, so that conversions become
(more and more) unlikely. When using interfaces with different string
types (like OS APIs) there would be an automatic conversion in the
background.

Pros:
Easy to code. No doubt about the used string type and its capabilities
for the programmer; it's always UTF8 for him. When strings are saved to
disk they are all UTF8 on all platforms, so files can be exchanged
between Linux and Windows (and others).

Cons:
Because the LCL and other libraries use UTF8, there could be a performance
impact when compiling for a non-UTF8 OS (where the APIs use ANSI, UTF-16
or whatever).



I would prefer model 3.)


--
Graeme Geldenhuys
2011-02-16 07:30:23 UTC
Permalink
Op 2011-02-16 09:12, Jürgen Hestermann het geskryf:
>
> That sounds like yet another approach. So up to now I see 3 models of how
> strings could be handled:

UnicodeString should become the default. I would even go as far as
recommending that String = UnicodeString in a newer FPC version.

UTF8String is not a special type, it is simply a UnicodeString that has
its encoding bit set to UTF-8. UTF8String is still a UnicodeString as
far as the compiler is concerned. It's an "alias type".

Just like we now use String everywhere, in the future we could use
UnicodeString everywhere and not need to worry about what encoding we
are using. This should even be the case when we talk to OS APIs or
external libraries. By default, libraries under Windows use UTF-16. Your
UnicodeString under Windows will already be in UTF-16 - thus no
conversion needed. Same for Linux. By default most libraries use UTF-8,
and UnicodeString will automatically be in UTF-8 encoding anyway - thus
again no conversion needed.

All RTL functions which use UnicodeString will obviously be the correct
encoding on each platform too. So again by default, simply using
UnicodeString will not require any encoding conversions.

This will work for 99% of all applications. Only when the developer
wants to do something specific with a specific encoding does that developer
need to define a new string type, as follows:
eg:
type
MyCP850StringType = UnicodeString(cp850);

That developer can then use that string knowing the encoding is CP850.
But this would be very seldom needed.

As for the LCL or any libraries. They could simply use UnicodeString
everywhere (thus my recommendation that String = UnicodeString in the
future so our code wouldn't need much changing), and it will work just
fine under all platforms, because UnicodeString will default to each
platforms default encoding.



Regards,
- Graeme -

--
fpGUI Toolkit - a cross-platform GUI toolkit using Free Pascal
http://fpgui.sourceforge.net/


--
Sven Barth
2011-02-16 08:28:34 UTC
Permalink
Am 16.02.2011 08:30, schrieb Graeme Geldenhuys:
> Op 2011-02-16 09:12, Jürgen Hestermann het geskryf:
>>
>> That sounds like yet another approach. So up to now I see 3 models of how
>> strings could be handled:
>
> UnicodeString should become the default. I would even go as far as
> recommending that String = UnicodeString in a newer FPC version.
>

I personally would prefer the code page aware string to be a new string
type (e.g. CodePageString) instead of abusing UnicodeString here (it
should be kept as a 2-byte string for backwards compatibility).

But I wouldn't mind if String were then by default an alias to this
CodePageString type (or only in certain modes or with certain switches,
similar to {$H+/-}, as this would decrease the changes/review needed
for old code).

Regards,
Sven




--
Graeme Geldenhuys
2011-02-16 09:20:38 UTC
Permalink
Op 2011-02-16 10:28, Sven Barth het geskryf:
>
> I personally would prefer the code page aware string to be a new string
> type (e.g. CodePageString) instead of abusing UnicodeString here (it

I think FPC (and Delphi) already have enough string types, so I would
personally vote for NOT adding yet another string type. I also think
UnicodeString = UTF16 (as in the case of Delphi) is just wrong - Delphi
should rather have introduced a new type called UTF16String. The
Unicode Standard defines 8 or so valid encodings described as Unicode,
so how can Embarcadero make the assumption that Unicode = UTF-16 only?


> But I wouldn't mind if then String is by default an alias to this
> CodePageString type (or only in certain modes or with certain switches
> similar to {$H+/-} as this would decrease the changes/overview needed
> for old code).

+1
Let's not make the same mistake as Delphi. And I do realize it will not
magically work for every application, because some code might assume
Char = 1 byte, so such code would still have to be fixed. But it would be
much less effort than renaming ALL String references in our code to
UnicodeString.



Regards,
- Graeme -

--
fpGUI Toolkit - a cross-platform GUI toolkit using Free Pascal
http://fpgui.sourceforge.net/


--
Hans-Peter Diettrich
2011-02-16 11:04:24 UTC
Permalink
Graeme Geldenhuys schrieb:

> Will will work for 99% all all applications. Only when the developer
> wants to do something specific with a specific encoding, that developer
> can simply define a new string type as follows:
> eg:
> type
> MyCP850StringType = UnicodeString(cp850);
>
> That developer can then use that string knowing the encoding is CP850.
> But this would be very seldom needed.

I'm pretty sure that such strings will be widely used, by people that
prefer to use string indexing with a fixed character size.

In any other model there is no use for any char type (except perhaps
UTF32char); instead only (sub)strings should be used in code. IMO it was
a wise decision that Pascal literals make no difference between char and
string literals.

DoDi


--
Michael Schnell
2011-02-16 12:25:12 UTC
Permalink
On 02/16/2011 12:04 PM, Hans-Peter Diettrich wrote:
>
> I'm pretty sure that such strings will be widely used, by people that
> prefer to use string indexing with a fixed character size.
Yep. But as discussed here earlier, Length() - which needs to be done
with the same paradigm as mystr[i] - is a problem. When using e.g.
stream I/O, people will want it to be the byte count; when doing a loop
along a string, they want it to be the Unicode character count.

-Michael

--
Graeme Geldenhuys
2011-02-16 12:59:32 UTC
Permalink
Op 2011-02-16 14:25, Michael Schnell het geskryf:
> Yep. But as discussed here earlier, Length() - which needs to be done
> with the same paradigm as mystr[i] - is a problem. When using e.g.
> stream I/O, people will want it to be the byte count; when doing a loop
> along a string, they want it to be the Unicode character count.

This is why Length() can stay as is - returning a byte count. We can
then introduce a new StringIterator class or something that can return
one Unicode character at a time - for loop purposes. I have already
created such iterators for myself for going through strings or any types
of list classes found in the RTL. Using such iterators instead of dumb
for..loops, I can move forward, backwards, skip elements, filter
elements etc... Such iterators are very powerful and flexible.

eg:

var
  itr: IStringIterator;
  c: UnicodeChar; // or something that can store at least 4 bytes
begin
  itr := TStringHelper.GetUnicodeIterator(MyString);
  while itr.HasNext do
  begin
    c := itr.Next;
    ... now do something with the Unicode character stored in c
  end;
end;

This also reduces confusion. The developer knows what is being returned
simply by looking at the code, unlike the confusion with Length() [is it
number of chars or number of bytes].


Regards,
- Graeme -

--
fpGUI Toolkit - a cross-platform GUI toolkit using Free Pascal
http://fpgui.sourceforge.net/


--
Michael Schnell
2011-02-16 13:59:10 UTC
Permalink
On 02/16/2011 01:59 PM, Graeme Geldenhuys wrote:
>
> while itr.HasNext do
> begin
> c := itr.Next;
> ... now do something with the Unicode character stored in c
> end;
Do we already have / plan a dedicated iterator loop syntax for this?

Delphi Prism would go

for each i in itr do
begin
... now do something with the Unicode character stored in i
end;

-Michael

--
Sven Barth
2011-02-16 14:04:57 UTC
Permalink
Am 16.02.2011 14:59, schrieb Michael Schnell:
> On 02/16/2011 01:59 PM, Graeme Geldenhuys wrote:
>>
>> while itr.HasNext do
>> begin
>> c := itr.Next;
>> ... now do something with the Unicode character stored in c
>> end;
> Do we already have / plan a dedicated iterator loop syntax for this?
>
> Delphi Prism would go
>
> for each i in itr do
> begin
> ... now do something with the Unicode character stored in i
> end;

Take a look at the tforin*.pp tests here:
http://svn.freepascal.org/cgi-bin/viewvc.cgi/trunk/tests/test/

But basically the syntax is

for enumeratedvar in enumerator do
// something

It's the Delphi compatible variant of the Prism syntax. Of course you
might also need to define a suitable enumerator (e.g. for enumerating a
UTF-8 string).
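
Such an enumerator could look roughly like this (an untested sketch: the
names TUTF8Enumerator/TUTF8Chars/UTF8Chars are made up, and it needs a
recent FPC with advanced records):

program ForInUtf8;
{$mode objfpc}{$H+}
{$modeswitch advancedrecords}

type
  TUTF8Enumerator = record
  private
    FText: string;
    FPos: Integer;
    FCurrent: string;
  public
    function MoveNext: Boolean;
    property Current: string read FCurrent;
  end;

  TUTF8Chars = record
    Text: string;
    function GetEnumerator: TUTF8Enumerator;
  end;

function TUTF8Enumerator.MoveNext: Boolean;
var
  len: Integer;
  b: Byte;
begin
  Result := FPos <= Length(FText);
  if not Result then
    Exit;
  b := Ord(FText[FPos]);
  // determine the sequence length from the UTF-8 lead byte
  if b < $80 then
    len := 1
  else if (b and $E0) = $C0 then
    len := 2
  else if (b and $F0) = $E0 then
    len := 3
  else
    len := 4;
  FCurrent := Copy(FText, FPos, len);  // one code point as a small string
  Inc(FPos, len);
end;

function TUTF8Chars.GetEnumerator: TUTF8Enumerator;
begin
  Result.FText := Text;
  Result.FPos := 1;
  Result.FCurrent := '';
end;

function UTF8Chars(const s: string): TUTF8Chars;
begin
  Result.Text := s;
end;

var
  cp: string;
begin
  // each iteration yields one code point, not one byte
  for cp in UTF8Chars('hällo') do
    WriteLn(cp);
end.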

Regards,
Sven

--
Michael Schnell
2011-02-16 15:54:22 UTC
Permalink
On 02/16/2011 03:04 PM, Sven Barth wrote:
>
>
> It's the Delphi compatible variant of the Prism syntax.

Sounds great.
-Michael

--
Graeme Geldenhuys
2011-02-16 15:09:53 UTC
Permalink
Op 2011-02-16 15:59, Michael Schnell het geskryf:
> Do we already have / plan a dedicated iterator loop syntax for this?


A similar syntax is already supported in FPC 2.4.2. See the Language
Reference documentation, section 10.2.5 (pg 115 in the ref.pdf).

I personally prefer the Iterator interface though, because it supports
bi-directional iteration, filtered iteration, skips, resets, etc.
Basically it is a lot more functional for me than FPC's for..in..do syntax.


Regards,
- Graeme -

--
fpGUI Toolkit - a cross-platform GUI toolkit using Free Pascal
http://fpgui.sourceforge.net/


--
Hans-Peter Diettrich
2011-02-16 23:08:55 UTC
Permalink
Michael Schnell schrieb:

> Delphi Prism would go
>
> for each i in itr do
> begin
> ... now do something with the Unicode character stored in i
> end;

Depends on the type of i. Neither Char nor WideChar is capable of
holding every Unicode codepoint (32 bit).

DoDi


--
Michael Schnell
2011-02-17 09:33:12 UTC
Permalink
On 02/17/2011 12:08 AM, Hans-Peter Diettrich wrote:
>
> Depends on the type of i. Neither Char nor WideChar are capable of
> holding every Unicode codepoint (32 bit).
>
I think Graeme wrote:

c: UnicodeChar; // or something that can store at least 4 bytes

(meaning i instead of c)

-Michael

--
Hans-Peter Diettrich
2011-02-16 23:02:11 UTC
Permalink
Michael Schnell schrieb:

>> I'm pretty sure that such strings will be widely used, by people that
>> prefer to use string indexing with a fixed character size.
> Yep. But as discussed here earlier, Length() - which needs to be done
> with the same paradigm as mystr[i] - is a problem. When using e.g.
> stream I/O, people will want it to be the byte count; when doing a loop
> along a string, they want it to be the Unicode character count.

That's why such loops should be disallowed with Unicode strings, as a kind
of low-level string handling. For-each loops may be acceptable as high
level string handling, but with what type of the loop variable???

DoDi


--
Michael Schnell
2011-02-17 09:43:03 UTC
Permalink
On 02/17/2011 12:02 AM, Hans-Peter Diettrich wrote:
>
> That's why such loops should be disallowed with Unicode strings, as
> kind of low level string handling.
Not only this, but the normal user would like to do

MyChar := MyString[Length(MyString)];

to get the last character of a string.

This would only work if Length were counted in Unicode characters,
disallowing the use of Length in stream I/O (such as file read) functions.
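
With the LCL helpers mentioned earlier this can already be written out
explicitly (a sketch; MyChar has to be a string, since one code point
may span several bytes):

// uses LCLProc (or LazUTF8 in newer Lazarus versions)
// last code point, counted in code points; Length() stays byte-oriented
MyChar := UTF8Copy(MyString, UTF8Length(MyString), 1);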

Thus IMHO the Length function name should be dumped and two new
functions (such as CharacterCount and ByteCount) should be introduced.

> For-Each loops may be acceptable as high level string handling, but
> with what type of the loop variable???
>
We obviously would need a UnicodeChar type that holds the 32-bit encoding.

But the said "quirks" can't be handled by this. I up till now don't
understand if - technically - these "quirks" are seen as a single
Unicode character or as a sequence of Unicode Characters. Nor do I
understand how they can be used in a decent way and if they are
necessary or just legacy.

-Michael

--
Michael Schnell
2011-02-16 10:17:54 UTC
Permalink
On 02/16/2011 08:12 AM, Jürgen Hestermann wrote:
>
> 1.) Full programmer responsibility (current model):
This does not really match the current model. For a decent implementation
of this, there would need to be different string types for the different
encodings, treated as incompatible. Assigning would not do
auto-conversion but produce a compile-time error.

-Michael

--
Marco van de Voort
2011-02-16 21:15:00 UTC
Permalink
On Mon, Feb 14, 2011 at 03:35:32PM +0200, Graeme Geldenhuys wrote:

> "alias types" are not really new types, so will not affect var
> parameters. So in my previous example, UTF16String will still be a
> UnicodeString. The only difference will be that UTF16String has its
> internal encoding bit set to UTF-16.

> Here is a test program using latest FPC 2.4.3 showing that TMyString =
> String - the compiler sees no difference between the two types. So the
> same should be valid for UnicodeString vs UnicodeString(...) that have
> their encoding bit set to something other than the platform default.

There is no unicodestring(...). Ansistring(...) and unicodestring are
treated as different types on a Pascal level.

There is a catch-all rawbytestring for deep RTL work, but you don't really use that as a string type.
See earlier discussions.


--
Graeme Geldenhuys
2011-02-17 06:44:33 UTC
Permalink
Op 2011-02-16 23:15, Marco van de Voort het geskryf:
>
> There is no unicodestring(...). Ansistring(...) and

I know there currently isn't, but are you also saying that we can't
extend UnicodeString to support UnicodeString(…) syntax? To me,
AnsiString(…) [like it is done in Delphi now] makes even less sense.

On a side note:
DON'T bother replying if your answer is going to be the lame excuse
"it's Delphi compatible". Because if I hear that one more time, I'll
really go nuts! :) Just because Delphi is brain-dead, doesn't mean FPC
must also be brain-dead.


I guess the most logical would be String(…) syntax - upgrading the
String type to be encoding-aware, similar to what was done when String
was a shortstring and then later became a longstring. Making the String
type encoding-aware would be a natural progression in the language. This
functionality could maybe even be triggered by a new compiler directive,
similar to what {$H+} does, but could be enabled by default if compiler
mode objfpc is used. This means FPC would be one up on Delphi: we
can toggle the unicode functionality, and Delphi can't.



Regards,
- Graeme -

--
fpGUI Toolkit - a cross-platform GUI toolkit using Free Pascal
http://fpgui.sourceforge.net/


--
Marcos Douglas
2011-02-17 10:30:11 UTC
Permalink
On Thu, Feb 17, 2011 at 3:44 AM, Graeme Geldenhuys
<***@gmail.com> wrote:
> I guess the most logical would be String(…) syntax - upgrading the
> String type to be encoding-aware, similar to what was done when String
> was a shortstring and then later became a longstring. Making the String
> type encoding-aware would be a natural progression in the language. This
> functionality could maybe even be triggered by a new compiler directive,
> similar to what {$H+} does, but could be enabled by default if compiler
> mode objfpc is used. This means FPC would be one up on Delphi: we
> can toggle the unicode functionality, and Delphi can't.

+1

Marcos Douglas

--
Marco van de Voort
2011-02-16 21:11:29 UTC
Permalink
On Mon, Feb 14, 2011 at 12:16:55PM +0100, Felipe Monteiro de Carvalho wrote:
> On Sat, Feb 12, 2011 at 6:49 PM, Marco van de Voort <***@stack.nl> wrote:
> > This is all undecided. I lean towards splitting operating system targets
> > into a utf8 and a utf16 one for most platforms(*), since nobody will ever agree
> > on one encoding. Not even per platform.
> >
> > (*) and a legacy "ansi" one if need be.
>
> Why do we need "targets"?
>
> Wouldn't it be better to simply duplicate all string functions for
> utf8 and utf16 and ansi if necessary?

How do you duplicate every usage of "string" in the entire Lazarus tree?


--
Felipe Monteiro de Carvalho
2011-02-17 09:20:38 UTC
Permalink
On Wed, Feb 16, 2011 at 10:11 PM, Marco van de Voort <***@stack.nl> wrote:
> How do you duplicate every usage of "string" in the entire Lazarus tree?

I don't understand your question. I proposed to duplicate RTL string
routines. Lazarus would use only the UTF-8 variant and thus Lazarus
routines don't need to be duplicated.

UTF-16 routines would be used by msegui for example

--
Felipe Monteiro de Carvalho

--
Florian Klämpfl
2011-01-02 22:16:37 UTC
Permalink
Am 02.01.2011 20:33, schrieb Sven Barth:
> On 02.01.2011 18:29, Graeme Geldenhuys wrote:
>> On 2 January 2011 13:47, Sven Barth<***@googlemail.com> wrote:
>>> Casting from AnsiString to UnicodeString invokes the WideString
>>> Manager's
>>> Ansi2UnicodeMoveProc which converts the supplied AnsiString to a correct
>>> UTF16 string.
>>
>> Does that mean FPC and LCL always treats UnicodeString type as a UTF16
>> encoded type? If so, that is a rather odd "type name" then, because
>> "unicode" is NOT just UTF16, it is also UTF8, UTF16, UTF16-LE,
>> UTF16-BE and UTF32. A better, and more correct, type name would then
>> have been UTF16String, just like there is a UTF8String type (though I
>> don't really know how the latter differs from AnsiString, which is
>> basically an array of bytes).
>
> Yes, UnicodeString (and WideString as well) is treated as UTF16 encoded
> string.

This is only a temp. solution (till, at the end of the world, the
cpstrnew branch is ready ;)). The encoding of a unicode string will be
dependent on a variable.

--
Graeme Geldenhuys
2011-01-03 07:41:35 UTC
Permalink
On 3 January 2011 00:16, Florian Klämpfl wrote:
>>
>> Yes, UnicodeString (and WideString as well) is treated as UTF16 encoded
>> string.
>
> This is only a temp. solution. The encoding of a unicode string will be
> dependent on a variable.


That is good to know. Thanks Florian.


--
Regards,
  - Graeme -


_______________________________________________
fpGUI - a cross-platform Free Pascal GUI toolkit
http://fpgui.sourceforge.net

--
Sven Barth
2011-01-03 12:34:15 UTC
Permalink
Am 03.01.2011 08:41, schrieb Graeme Geldenhuys:
> On 3 January 2011 00:16, Florian Klämpfl wrote:
>>>
>>> Yes, UnicodeString (and WideString as well) is treated as UTF16 encoded
>>> string.
>>
>> This is only a temp. solution. The encoding of a unicode string will be
>> dependent on a variable.
>
>
> That is good to know. Thanks Florian.

Yes, thanks for clarification.

Regards,
Sven



--
Marco van de Voort
2011-02-12 17:50:18 UTC
Permalink
On Sun, Jan 02, 2011 at 11:16:37PM +0100, Florian Klämpfl wrote:
>
> This is only a temp. solution (till at the end of the world, the
> cpstrnew branch is ready ;)). The encoding of an unicode string will be
> dependent on a variable.

The encoding of a 1-byte encoding based string.

--
Sven Barth
2011-01-02 11:49:08 UTC
Permalink
On 01.01.2011 21:14, Vladimir Zhirov wrote:
> Sven Barth wrote:
>
>> You need to convert the UTF8 string to a different one, e.g.
>> UTF16:
>>
>> var
>> us: UnicodeString;
>> begin
>> us := UTF8Decode(s);
>> end;
>>
>> Now us[2] will return the a-umlaut.
>
> I would suggest using Utf8Copy(s, 2, 1) instead. It helps
> to avoid conversion and works correctly even for characters
> that take 4 bytes in UnicodeString/WideString (i.e. 2
> wide characters). Utf8Copy is declared in the LCLProc unit.

When using the LCL this is indeed a better way.

Regards,
Sven

--
Bo Berglund
2011-01-02 18:04:25 UTC
Permalink
On Sat, 01 Jan 2011 19:13:26 +0100, Sven Barth
<***@googlemail.com> wrote:

>> Is it converted somehow?
>> The native widget's encoding is either UTF-8 or UTF-16.
>> Is the string actually a Utf8String or Utf16String then?
>> When do I need to pay attention to it?
>
>Currently there is no automatic conversion (it's planned in one of the
>branches of FPC). For now a String (=AnsiString) can be seen as an
>"array of byte". You as a developer are responsible that the string
>contains the correct encoding.
>
>So in your above example the string that is stored in "s" will be UTF8
>encoded, because it comes from the GUI. But if that string contains
>multibyte characters those characters will appear as single "one byte"
>characters if you access the string using [], Pos, Copy, etc.
>
>Example (note: this is not accurate UTF8 encoding, I'm just making that
>up here)
>
>TMemo.Lines[0] contains: 'hä?!' ( h a-umlaut ? ! )
>I now assume that an a-umlaut is encoded as "ae" (which isn't really the
>case, but it's for the sake of an example ^^)
>s now contains: 'h a e ? !'
>
>If you now want to access the second character of s you'd expect that
>you'd get the a-umlaut, but if you do s[2] you'll get an "a". And if you
>access the third one (s[3]) you'll get the "e" instead of "?".
>
>You need to convert the UTF8 string to a different one, e.g. UTF16:
>
>var
> us: UnicodeString;
>begin
> us := UTF8Decode(s);
>end;
>
>Now us[2] will return the a-umlaut.
>
>I hope this example clears that up a bit, if not: just ask more questions ;)
>

I just stumbled across this thread and it worries me a little since
the way Delphi introduced unicode is by ambush....

What they did was to redefine the type string from AnsiString to
something else unicode-ish in Delphi 2009. So all applications doing
some string manipulation on data of type string broke severely. I hope
I will not see the same here in FPC/Lazarus?

My concern is that I am communicating using RS232 and I use string
variables to hold my messages and commands. The protocol used is
defined on a byte by byte level and it will not accept some
"automatic" conversion being forced on the variables.

So, will FPC stay with the current definition of string and let the
developers decide what to handle as unicode strings by using a
different type for these strings? For example "widestring" or
"unicodestring" or the like?


--
Bo Berglund
Developer in Sweden


--
Sven Barth
2011-01-02 19:40:02 UTC
Permalink
On 02.01.2011 19:04, Bo Berglund wrote:
> On Sat, 01 Jan 2011 19:13:26 +0100, Sven Barth
> <***@googlemail.com> wrote:
>
>>> Is it converted somehow?
>>> The native widget's encoding is either UTF-8 or UTF-16.
>>> Is the string actually a Utf8String or Utf16String then?
>>> When do I need to pay attention to it?
>>
>> Currently there is no automatic conversion (it's planned in one of the
>> branches of FPC). For now a String (=AnsiString) can be seen as an
>> "array of byte". You as a developer are responsible that the string
>> contains the correct encoding.
>>
>> So in your above example the string that is stored in "s" will be UTF8
>> encoded, because it comes from the GUI. But if that string contains
>> multibyte characters those characters will appear as single "one byte"
>> characters if you access the string using [], Pos, Copy, etc.
>>
>> Example (note: this is not accurate UTF8 encoding, I'm just making that
>> up here)
>>
>> TMemo.Lines[0] contains: 'hä?!' ( h a-umlaut ? ! )
>> I now assume that an a-umlaut is encoded as "ae" (which isn't really the
>> case, but it's for the sake of an example ^^)
>> s now contains: 'h a e ? !'
>>
>> If you now want to access the second character of s you'd expect that
>> you'd get the a-umlaut, but if you do s[2] you'll get an "a". And if you
>> access the third one (s[3]) you'll get the "e" instead of "?".
>>
>> You need to convert the UTF8 string to a different one, e.g. UTF16:
>>
>> var
>> us: UnicodeString;
>> begin
>> us := UTF8Decode(s);
>> end;
>>
>> Now us[2] will return the a-umlaut.
>>
>> I hope this example clears that up a bit, if not: just ask more questions ;)
>>
>
> I just stumbled across this thread and it worries me a little since
> the way Delphi introduced unicode is by ambush....
>
> What they did was to redefine the type string from AnsiString to
> something else unicode-ish in Delphi 2009. So all applications doing
> some string manipulation on data of type string broke severely. I hope
> I will not see the same here in FPC/Lazarus?
>
> My concern is that I am communicating using RS232 and I use string
> variables to hold my messages and commands. The protocol used is
> defined on a byte by byte level and it will not accept some
> "automatic" conversion being forced on the variables.
>
> So, will FPC stay with the current definition of string and let the
> developers decide what to handle as unicode strings by using a
> different type for these strings? For example "widestring" or
> "unicodestring" or the like?

There is currently a branch of FPC where a "codepage aware" string type
is being developed. I don't know how far the RTL and compiler will be modified
once that finalizes, but I believe that the developers will pay enough
attention to backwards compatibility (as they do now) and that
they'll listen to community input (and "fears") regarding this as well.

Regards,
Sven

--
Michael Schnell
2011-01-03 11:43:16 UTC
Permalink
On 01/02/2011 07:04 PM, Bo Berglund wrote:
>
> What they did was to redefine the type string from AnsiString to
> something else unicode-ish in Delphi 2009.
I don't have D2009+, but from what I've read I understand that it's
possible to force a new Delphi String to a certain encoding (e.g.
codepage-based ANSI), and thus "old" software working on it should just
run. If a library function (for example) is called, the string should
automatically be converted if the callee needs this.

-Michael

--
Vladimir Zhirov
2011-01-01 23:14:46 UTC
Permalink
Juha Manninen wrote:

> What happens when I do:
> var s: string;
> ...
> s := TMemo.Lines[0];
>
> Is it converted somehow?
> The native widget's encoding is either UTF-8 or UTF-16.
> Is the string actually a Utf8String or Utf16String then?
> When do I need to pay attention to it?

The string is Utf8String (= AnsiString for now and in the near
future), as Sven pointed out.

You need to pay attention to it when you cross the boundary
between FPC's RTL/FCL and LCL. LCL uses utf-8 regardless
of underlying widgetset, but RTL/FCL uses platform-specific
encoding. IIRC most if not all encoding dependent routines
of RTL/FCL are related to file operations and I/O.

LCL provides functions for platform-independent conversion
between utf-8 and platform-specific encoding (see FileUtil unit):

1. Utf8ToSys/SysToUtf8 as general-purpose functions;
2. Set of wrappers such as FileExistsUtf8, FindFirstUtf8 etc.
to make our code more readable.
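
E.g. (an untested sketch; fn would typically come straight from an LCL
control and thus be UTF-8 encoded):

program FileDemo;
uses
  SysUtils, FileUtil;
var
  fn: string;
begin
  fn := 'päth.txt';                  // UTF-8 string from the LCL
  if FileExists(Utf8ToSys(fn)) then  // manual conversion for the RTL call...
    WriteLn('found (via Utf8ToSys)');
  if FileExistsUtf8(fn) then         // ...or the ready-made wrapper
    WriteLn('found (via FileExistsUtf8)');
end.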

--
cobines
2011-01-02 17:19:00 UTC
Permalink
2011/1/2 Vladimir Zhirov <***@gmail.com>:
> LCL provides functions for platform-independent conversion
> between utf-8 and platform-specific encoding (see FileUtil unit):
>
> 1. Utf8ToSys/SysToUtf8 as general-purpose functions;

Those only work for characters in the current Ansi code page. It is better
to use UTF8Decode, UTF8Encode.

--
cobines

--
Vladimir Zhirov
2011-01-02 20:34:30 UTC
Permalink
cobines wrote:
> Those only work for characters in the current Ansi code page.

Yes, on Win32. They also work properly for utf-8 characters
on Linux and Mac OS X.

> It is better to use UTF8Decode, UTF8Encode.

UTF8Decode and UTF8Encode do not provide any platform abstraction,
they merely convert a single-byte utf-8 string to a double-byte
utf-16/ucs-2 string.

And since RTL/FCL functions on Win32 use single-byte Ansi strings,
they cannot handle characters outside of the current code page anyway.
I do not see how UTF8Decode/UTF8Encode can help here.
The only way is to use them in combination with the wide version
of the Windows API (functions ending with "W"), but that is not portable.

To make it clear, when I called Utf8ToSys/SysToUtf8 "general purpose"
I did not mean they should be used all over the code. They are only
needed when calling RTL/FCL functions if there is no appropriate
wrapper function like FileExistsUtf8.

--
cobines
2011-01-02 22:24:59 UTC
Permalink
2011/1/2 Vladimir Zhirov <***@gmail.com>:
> cobines wrote:
>> Those only work for characters in the current Ansi code page.
>
> Yes, on Win32. They also work properly for utf-8 characters
> on Linux and Mac OS X.
>
>> It is better to use UTF8Decode, UTF8Encode.
>
> UTF8Decode and UTF8Encode do not provide any platform abstraction,
> they merely convert a single-byte utf-8 string to a double-byte
> utf-16/ucs-2 string.
>
> And since RTL/FCL functions on Win32 use single-byte Ansi strings,
> they cannot handle characters outside of current code page anyway.
> I do not see how UTF8Decode/UTF8Encode can help here.
> The only way is to use them in combination with wide version
> of Windows API (functions ending with "W"), but it is not portable.
>
> To make it clear, when I called Utf8ToSys/SysToUtf8 "general purpose"
> I did not mean they should be used all over the code. They are only
> needed when calling RTL/FCL functions if there is no appropriate
> wrapper function like FileExistsUtf8.

Yes, I was thinking of Win32.

What I wanted to say was that it's not simply a matter of adding a UTF8ToSys call:

FileExists(UTF8ToSys(FileName)).

It might work on Linux, but won't always work on Win32.

As you said, either do FileExistsUtf8, or directly use the unicode OS API,
like on Windows:
Windows.FindFirstFileW(UTF8Decode(FileName))

--
cobines

--