Discussion:
mbstoc16s(), mbstoc32s(), c16stombs(), c32stombs()
Philipp Klaus Krause
2016-11-07 20:13:49 UTC
Permalink
I suggest adding functions for converting strings between the char,
char16_t and char32_t encodings.
These would be similar to the existing functions from 7.22.8 for
conversion between char and wchar_t, and could be added to uchar.h.

C already has quite a few functions for converting between char and
wchar_t, though some of them are thread-unsafe (the 7.22.7 ones) or
inefficient¹ (7.29.6.3, 7.29.6.4). The 7.22.8 ones, however, look fine.

On the other hand, for converting between char, char16_t and char32_t
there are only the functions from 7.28.1. They do not convert whole
strings at a time and are inefficient¹.

So converting strings between char, char16_t and char32_t is a natural
addition, and using an interface similar to 7.22.8 seems like a good
choice to me.

What do you think? A reasonable addition to C? Worth writing a proposal
for C2X? Any better alternatives?

Philipp

¹ Restartable functions can handle partial characters as input, which
places a substantial burden on implementations, affecting both speed
and code size.
James R. Kuyper
2016-11-07 21:07:51 UTC
Permalink
Post by Philipp Klaus Krause
I suggest adding functions for converting strings between the char,
char16_t and char32_t encodings.
These would be similar to the existing functions from 7.22.8 for
conversion between char and wchar_t, and could be added to uchar.h.
C already has quite a few functions for converting between char and
wchar_t, though some of them are thread-unsafe (the 7.22.7 ones) or
inefficient¹ (7.29.6.3, 7.29.6.4). The 7.22.8 ones, however, look fine.
On the other hand, for converting between char, char16_t and char32_t
there are only the functions from 7.28.1. They do not convert whole
strings at a time and are inefficient¹.
So converting strings between char, char16_t and char32_t is a natural
addition, and using an interface similar to 7.22.8 seems like a good
choice to me.
What do you think? A reasonable addition to C? Worth writing a proposal
for C2X? Any better alternatives?
Philipp
¹ Restartable functions can handle partial characters as input, which
places a substantial burden on implementations, affecting both speed
and code size.
It seems reasonable to me. Offhand, it's not obvious why this wasn't
done in C2011.
Tim Rentsch
2016-11-12 16:10:50 UTC
Permalink
Post by Philipp Klaus Krause
I suggest adding functions for converting strings between the char,
char16_t and char32_t encodings.
These would be similar to the existing functions from 7.22.8 for
conversion between char and wchar_t, and could be added to uchar.h.
C already has quite a few functions for converting between char and
wchar_t, though some of them are thread-unsafe (the 7.22.7 ones) or
inefficient[1] (7.29.6.3, 7.29.6.4). The 7.22.8 ones, however, look fine.
On the other hand, for converting between char, char16_t and char32_t
there are only the functions from 7.28.1. They do not convert whole
strings at a time and are inefficient[1].
So converting strings between char, char16_t and char32_t is a natural
addition, and using an interface similar to 7.22.8 seems like a good
choice to me.
What do you think? A reasonable addition to C? Worth writing a proposal
for C2X? Any better alternatives?
Let me offer some questions.

Would the functions you are suggesting provide only a performance
improvement (in both speed and size), or would they offer something
more? If it's something more than just performance, what is that?
If it is only for reasons of speed/size improvement, what sort of
gains can be expected?

Do you have a specific task that you are hoping to accomplish, or
is the motivation just one of general goodness? If there is a
specific task (or tasks plural), what might that/those be?

The wchar_t type is supposed to encode every character in a single
wchar_t element, ie, no multi-element encodings. The char16_t and
char32_t encodings might not have that property. (I think in most
cases char32_t will not have multi-element encodings and char16_t
will have multi-element encodings, but in principle I think both of
them are allowed multi-element encodings.) How does this affect
the behavior of functions like the ones you are suggesting? What
are the implications for return values, error conditions, state
saving, etc?

What function prototypes and semantic descriptions would you
specifically suggest?

I don't know the answers to any of these questions. Can you
provide some? Until I know more I don't feel able to respond
to your questions in any useful way.
j***@verizon.net
2016-11-12 18:17:33 UTC
Permalink
Note: to save space, I'm only going to refer to char16_t; but everything I say about char16_t has an obvious char32_t analog. The only significant asymmetry is that char16_t is guaranteed to have multi-element encodings if __STDC_UTF_16__ is predefined by the implementation, while char32_t will only have multi-element encodings if __STDC_UTF_32__ is NOT predefined.
Subject: mbstoc16s(), mbstoc32s(), c16stombs(), c32stombs()
Since mbrtoc16() and c16rtomb() both exist, while mbtoc16() and c16tomb() do not, I think it would be more appropriate to define char16_t functions analogous to mbsrtowcs() and wcsrtombs() rather than mbstowcs() and wcstombs().
Post by Philipp Klaus Krause
I suggest adding functions for converting strings between the char,
char16_t and char32_t encodings.
These would be similar to the existing functions from 7.22.8 for
conversion between char and wchar_t, and could be added to uchar.h.
C already has quite a few functions for converting between char and
wchar_t, though some of them are thread-unsafe (the 7.22.7 ones) or
inefficient[1] (7.29.6.3, 7.29.6.4). The 7.22.8 ones, however, look fine.
On the other hand, for converting between char, char16_t and char32_t
there are only the functions from 7.28.1. They do not convert whole
strings at a time and are inefficient[1].
So converting strings between char, char16_t and char32_t is a natural
addition, and using an interface similar to 7.22.8 seems like a good
choice to me.
What do you think? A reasonable addition to C? Worth writing a proposal
for C2X? Any better alternatives?
Let me offer some questions.
Would the functions you are suggesting provide only a performance
improvement (in both speed and size), or would they offer something
more?
Performance improvement compared with what? As I understand it, he's asking about functions that do things no existing standard library function currently does for char16_t: handle entire strings rather than single characters.

As I understand his suggestion (as modified by me above), mbsrtoc16s() would have essentially the same relationship to mbrtoc16() that mbsrtowcs() has to mbrtowc(), while c16srtombs() would have essentially the same relationship to c16rtomb() that wcsrtombs() has to wcrtomb().
... If it's something more than just performance, what is that?
If it is only for reasons of speed/size improvement, what sort of
gains can be expected?
Do you have a specific task that you are hoping to accomplish, or
is the motivation just one of general goodness? If there is a
specific task (or tasks plural), what might that/those be?
Convert strings encoded using the encoding associated with char16_t into multi-byte character strings, and vice versa.
The wchar_t type is supposed to encode every character in a single
wchar_t element, ie, no multi-element encodings. The char16_t and
char32_t encodings might not have that property. (I think in most
cases char32_t will not have multi-element encodings and char16_t
will have multi-element encodings, but in principle I think both of
them are allowed multi-element encodings.) How does this affect
the behavior of functions like the ones you are suggesting? What
are the implications for return values, error conditions, state
saving, etc?
The implications are that the string processing functions should handle multi-element encodings correctly on input, and should create multi-element encodings correctly on output. What does "correctly" mean? I'm not entirely sure, which is one reason I'd like to have a standard library function written by someone who does know. Your wording seems to imply that there might be multiple different ways to do this "correctly". Could you describe the possibilities that you see? The descriptions for these functions could mandate one particular choice from among the possibilities, or the functions could take one or more additional arguments to determine which possibility to implement.

Since mbrtoc16() and mbrtoc32() can return more distinct error codes than mbrtowc(), the corresponding string functions will probably also need to report more error conditions. It's Philipp's suggestion; I'll let him do the work of figuring out what they should be.
Tim Rentsch
2016-11-15 19:23:21 UTC
Permalink
Note: to save space, I'm only going to refer to char16_t; but
everything I say about char16_t has an obvious char32_t analog. The
only significant asymmetry is that char16_t is guaranteed to have
multi-element encodings if __STDC_UTF_16__ is predefined by the
implementation, while char32_t will only have multi-element encodings
if __STDC_UTF_32__ is NOT predefined.
Subject: mbstoc16s(), mbstoc32s(), c16stombs(), c32stombs()
Since mbrtoc16() and c16rtomb() both exist, while mbtoc16() and
c16tomb() do not, I think it would be more appropriate to define
char16_t functions analogous to mbsrtowcs() and wcsrtombs() rather
than mbstowcs() and wcstombs().
Post by Philipp Klaus Krause
I suggest adding functions for converting strings between the char,
char16_t and char32_t encodings.
These would be similar to the existing functions from 7.22.8 for
conversion between char and wchar_t, and could be added to uchar.h.
C already has quite a few functions for converting between char and
wchar_t, though some of them are thread-unsafe (the 7.22.7 ones) or
inefficient[1] (7.29.6.3, 7.29.6.4). The 7.22.8 ones, however, look fine.
On the other hand, for converting between char, char16_t and char32_t
there are only the functions from 7.28.1. They do not convert whole
strings at a time and are inefficient[1].
So converting strings between char, char16_t and char32_t is a natural
addition, and using an interface similar to 7.22.8 seems like a good
choice to me.
What do you think? A reasonable addition to C? Worth writing a proposal
for C2X? Any better alternatives?
Let me offer some questions.
Would the functions you are suggesting provide only a performance
improvement (in both speed and size), or would they offer something
more?
Performance improvement compared with what? [...]
Compared to providing the desired functionality in portable
C using the already existing standard functions (which I
assume is possible, but I haven't checked carefully which
is partly why I asked the question).
As I understand his suggestion (as modified by me above),
mbsrtoc16s() would have essentially the same relationship to
mbrtoc16() that mbsrtowcs() has to mbrtowc(), while c16srtombs() would
have essentially the same relationship to c16rtomb() that wcsrtombs()
has to wcrtomb().
... If it's something more than just performance, what is that?
If it is only for reasons of speed/size improvement, what sort of
gains can be expected?
Do you have a specific task that you are hoping to accomplish, or
is the motivation just one of general goodness? If there is a
specific task (or tasks plural), what might that/those be?
Convert strings encoded using the encoding associated with char16_t
into multi-byte character strings, and vice versa.
I knew that. The question was meant in the sense of what
applications of those functions are expected.
The wchar_t type is supposed to encode every character in a single
wchar_t element, ie, no multi-element encodings. The char16_t and
char32_t encodings might not have that property. (I think in most
cases char32_t will not have multi-element encodings and char16_t
will have multi-element encodings, but in principle I think both of
them are allowed multi-element encodings.) How does this affect
the behavior of functions like the ones you are suggesting? What
are the implications for return values, error conditions, state
saving, etc?
The implications are that the string processing functions should
handle multi-element encodings correctly on input, and should create
multi-element encodings correctly on output. What does "correctly"
mean? I'm not entirely sure, which is one reason I'd like to have a
standard library function written by someone who does know. Your
wording seems to imply that there might be multiple different ways to
do this "correctly". Could you describe the possibilities that you
see? The descriptions for these functions could mandate one
particular choice from among the possibilities, or the functions
could take one or more additional arguments to determine which
possibility to implement.
I was assuming, without really thinking about it deeply, that the
result should be "as if" an input string were converted to a
null-terminated wchar_t array, and the null-terminated wchar_t
array were then converted to the output type, and that this
transformation is unambiguous. Furthermore I think the two
string conversions should be equivalent to converting character
by character, using the already existing standard conversion
functions. Here again I'm not sure these assumptions are held
to be correct, which is partly why I ask the questions I did.
Since mbrtoc16() and mbrtoc32() can return more distinct error codes
than mbrtowc(), the corresponding string functions will probably also
need to report more error conditions. It's Philipp's suggestion,
I'll let him do the work of figuring out what they should be.
Yes; since I was replying to his posting, I was expecting that he
would be the one answering the questions.
j***@verizon.net
2016-11-15 20:20:57 UTC
Permalink
Post by Tim Rentsch
Note: to save space, I'm only going to refer to char16_t; but
everything I say about char16_t has an obvious char32_t analog. The
...
...
Post by Tim Rentsch
Post by Tim Rentsch
Would the functions you are suggesting provide only a performance
improvement (in both speed and size), or would they offer something
more?
Performance improvement compared with what? [...]
Compared to providing the desired functionality in portable
C using the already existing standard functions (which I
assume is possible, but I haven't checked carefully which
is partly why I asked the question).
As I understand his suggestion (as modified by me above),
mbsrtoc16s() would have essentially the same relationship to
mbrtoc16() that mbsrtowcs() has to mbrtowc(), while c16srtombs() would
have essentially the same relationship to c16rtomb() that wcsrtombs()
has to wcrtomb().
The justification for defining mbsrtoc16s(), despite the fact that mbrtoc16()
already exists, is the convenience factor. That justification has to be at
least as good as the justification for defining mbsrtowcs(), despite the fact
that mbrtowc() exists, since mbrtoc16() is more complicated to use than
mbrtowc(). (Similarly for the reverse functions).

...
Post by Tim Rentsch
Convert strings encoded using the encoding associated with char16_t
into multi-byte character strings, and vice versa.
I knew that. The question was meant in the sense of what
applications of those functions are expected.
I've no specific idea, but multi-byte character strings are ubiquitous, and
16-bit encodings are not unheard of (I gather that Windows uses UTF-16).
Therefore, the possibility that there's a shortage of applications which have a
need to convert strings between such encodings is not one I'm willing to bother
worrying about. YMMV. Since these functions don't exist yet, obviously any such
application is currently using some other method for performing such
conversions - however, I'd expect at least some of the developers of such code
to be happy to switch to a C standard library function, as soon as they became
sufficiently widely available.

...
Post by Tim Rentsch
The implications are that the string processing functions should
handle multi-element encodings correctly on input, and should create
multi-element encodings correctly on output. What does "correctly"
mean? I'm not entirely sure, which is one reason I'd like to have a
standard library function written by someone who does know. Your
wording seems to imply that there might be multiple different ways to
do this "correctly". Could you describe the possibilities that you
see? The descriptions for these functions could mandate one
particular choice from among the possibilities, or the functions
could take one or more additional arguments to determine which
possibility to implement.
I was assuming, without really thinking about it deeply, that the
result should be "as if" an input string were converted to a
null-terminated wchar_t array, and the null-terminated wchar_t
array were then converted to the output type, and that this
transformation is unambiguous.
I'd been thinking in terms of a direct conversion, but I suppose using wchar_t
as an intermediary might have advantages. However, if that's the case, then the
conversion routines between wchar_t and char16_t should be added to the
standard library.
Post by Tim Rentsch
... Furthermore I think the two
string conversions should be equivalent to converting character
by character, using the already existing standard conversion
functions. Here again I'm not sure these assumptions are held
to be correct, which is partly why I ask the questions I did.
I'm not entirely clear how to use the char16_t functions either, despite having
carefully read their complete description. That's part of the reason why I
wouldn't mind having string-oriented versions written by the library
implementor, rather than having to write the equivalent code myself.
Jakob Bohm
2016-11-16 15:16:12 UTC
Permalink
Post by j***@verizon.net
Post by Tim Rentsch
Note: to save space, I'm only going to refer to char16_t; but
everything I say about char16_t has an obvious char32_t analog. The
...
...
Post by Tim Rentsch
Post by Tim Rentsch
Would the functions you are suggesting provide only a performance
improvement (in both speed and size), or would they offer something
more?
Performance improvement compared with what? [...]
Compared to providing the desired functionality in portable
C using the already existing standard functions (which I
assume is possible, but I haven't checked carefully which
is partly why I asked the question).
As I understand his suggestion (as modified by me above),
mbsrtoc16s() would have essentially the same relationship to
mbrtoc16() that mbsrtowcs() has to mbrtowc(), while c16srtombs() would
have essentially the same relationship to c16rtomb() that wcsrtombs()
has to wcrtomb().
The justification for defining mbsrtoc16s(), despite the fact that mbrtoc16()
already exists, is the convenience factor. That justification has to be at
least as good as the justification for defining mbsrtowcs(), despite the fact
that mbrtowc() exists, since mbrtoc16() is more complicated to use than
mbrtowc(). (Similarly for the reverse functions).
There is also the issue of achieving thread safety without the
overhead of per-thread state buffers inside the implementation.

This is because char-by-char functions need internal state to track
the various multi-element cases (such as UTF-16 surrogate pairs and
Unicode combining accent marks), whereas a whole-string function can
rely on receiving all elements of such items in one call and thus
needs only ordinary local variables.
Post by j***@verizon.net
...
Post by Tim Rentsch
Convert strings encoded using the encoding associated with char16_t
into multi-byte character strings, and vice versa.
I knew that. The question was meant in the sense of what
applications of those functions are expected.
I've no specific idea, but multi-byte character strings are ubiquitous, and
16-bit encodings are not unheard of (I gather that Windows uses UTF-16).
Win32, Win64 and the (now rare) WinCE platforms prefer/require the
UTF-16 encoding of Unicode because the oldest release of their common
API (developed 1989 to 1992) was originally designed for the UCS-2
character set.

Sun/Oracle/OpenJDK Java calls to and from C (the JNI API) use UTF-16
and/or "modified UTF-8", which is a naive mapping of individual
char16_t elements to their (sometimes invalid) UTF-8 encodings, plus a
special encoding of (char16_t)0 string elements.

The (now defunct) Symbian OS also used UTF-16 as its system character
set, but that OS required native programs to be in C++, not C.
Post by j***@verizon.net
Therefore, the possibility that there's a shortage of applications which have a
need to convert strings between such encodings is not one I'm willing to bother
worrying about. YMMV. Since these functions don't exist yet, obviously any such
application is currently using some other method for performing such
conversions - however, I'd expect at least some of the developers of such code
to be happy to switch to a C standard library function, as soon as they became
sufficiently widely available.
...
Post by Tim Rentsch
The implications are that the string processing functions should
handle multi-element encodings correctly on input, and should create
multi-element encodings correctly on output. What does "correctly"
mean? I'm not entirely sure, which is one reason I'd like to have a
standard library function written by someone who does know. Your
wording seems to imply that there might be multiple different ways to
do this "correctly". Could you describe the possibilities that you
see? The descriptions for these functions could mandate one
particular choice from among the possibilities, or the functions
could take one or more additional arguments to determine which
possibility to implement.
I was assuming, without really thinking about it deeply, that the
result should be "as if" an input string were converted to a
null-terminated wchar_t array, and the null-terminated wchar_t
array were then converted to the output type, and that this
transformation is unambiguous.
I'd been thinking in terms of a direct conversion, but I suppose using wchar_t
as an intermediary might have advantages. However, if that's the case, then the
conversion routines between wchar_t and char16_t should be added to the
standard library.
Note that whether wchar_t strings can represent all the values of
char32_t strings is probably implementation-defined. Though many
platforms probably define their wchar_t as identical to char16_t or
char32_t, I don't think this is required behavior.
Post by j***@verizon.net
Post by Tim Rentsch
... Furthermore I think the two
string conversions should be equivalent to converting character
by character, using the already existing standard conversion
functions. Here again I'm not sure these assumptions are held
to be correct, which is partly why I ask the questions I did.
I'm not entirely clear how to use the char16_t functions either, despite having
carefully read their complete description. That's part of the reason why I
wouldn't mind having string-oriented versions written by the library
implementor, rather than having to write the equivalent code myself.
Enjoy

Jakob
--
Jakob Bohm, CIO, Partner, WiseMo A/S. https://www.wisemo.com
Transformervej 29, 2860 Søborg, Denmark. Direct +45 31 13 16 10
This public discussion message is non-binding and may contain errors.
WiseMo - Remote Service Management for PCs, Phones and Embedded
Tim Rentsch
2016-11-22 10:31:08 UTC
Permalink
Post by j***@verizon.net
Post by Tim Rentsch
Note: to save space, I'm only going to refer to char16_t; but
everything I say about char16_t has an obvious char32_t analog. The
...
...
Post by Tim Rentsch
Post by Tim Rentsch
Would the functions you are suggesting provide only a performance
improvement (in both speed and size), or would they offer something
more?
Performance improvement compared with what? [...]
Compared to providing the desired functionality in portable
C using the already existing standard functions (which I
assume is possible, but I haven't checked carefully which
is partly why I asked the question).
As I understand his suggestion (as modified by me above),
mbsrtoc16s() would have essentially the same relationship to
mbrtoc16() that mbsrtowcs() has to mbrtowc(), while c16srtombs() would
have essentially the same relationship to c16rtomb() that wcsrtombs()
has to wcrtomb().
The justification for defining mbsrtoc16s(), despite the fact that
mbrtoc16() already exists, is the convenience factor. [...]
That is a plausible motivating factor. It is not however the only
such factor, and it may or may not be one as far as M. Krause is
concerned. I still am interested to hear his answer.

As to determining suitability, IMO saying some new feature would be
convenient is not by itself sufficient justification to warrant its
inclusion in the Standard. There really should be some further
motivation beyond that.

Note by the way that the use cases for, eg, mbstowcs and wcstombs
may not carry over to the char16_t/char32_t types.

I should add a clarifying remark here. ISTM you are mostly
interested in making a case that some new functions should be
added, whereas I am mostly interested in gaining a better
understanding of what is being proposed exactly, and why. At this
point I am not taking a stance either way as to whether some
functions along these lines would be worth adding to the standard
library.
Post by j***@verizon.net
Post by Tim Rentsch
Convert strings encoded using the encoding associated with char16_t
into multi-byte character strings, and vice versa.
I knew that. The question was meant in the sense of what
applications of those functions are expected.
I've no specific idea, but multi-byte character strings are
ubiquitous, and 16-bit encodings are not unheard of (I gather that
Windows uses UTF-16). Therefore, the possibility that there's a
shortage of applications which have a need to convert strings
between such encodings is not one I'm willing to bother worrying
about. YMMV. Since these functions don't exist yet, obviously
any such application is currently using some other method for
performing such conversions - however, I'd expect at least some of
the developers of such code to be happy to switch to a C standard
library function, as soon as they became sufficiently widely
available.
So, the bottom line is you really don't know?
Post by j***@verizon.net
Post by Tim Rentsch
The implications are that the string processing functions should
handle multi-element encodings correctly on input, and should
create multi-element encodings correctly on output. What does
"correctly" mean? I'm not entirely sure, which is one reason
I'd like to have a standard library function written by someone
who does know. Your wording seems to imply that there might be
multiple different ways to do this "correctly". Could you
describe the possibilities that you see? The descriptions for
these functions could mandate one particular choice from among
the possibilities, or the functions could take one or more
additional arguments to determine which possibility to
implement.
I was assuming, without really thinking about it deeply, that the
result should be "as if" an input string were converted to a
null-terminated wchar_t array, and the null-terminated wchar_t
array were then converted to the output type, and that this
transformation is unambiguous.
I'd been thinking in terms of a direct conversion, but I suppose
using wchar_t as an intermediary might have advantages. However,
if that's the case, then the conversion routines between wchar_t
and char16_t should be added to the standard library.
What I was trying to do is make sure the semantics are well-defined
and also consistent with a wchar_t representation, not describe an
implementation. I think it suffices to translate a multi-byte
string to a wchar_t string, then translate one multi-byte character
at a time, first from wchar_t to multi-byte, then from multi-byte
to char16_t (or char32_t).

Note by the way that translating from multi-byte to wchar_t and
then from wchar_t back to multi-byte is not guaranteed to be an
identity mapping.

The other direction depends on the resolution of the Defect Report
on c16rtomb, which I think is still open.
Post by j***@verizon.net
Post by Tim Rentsch
... Furthermore I think the two
string conversions should be equivalent to converting character
by character, using the already existing standard conversion
functions. Here again I'm not sure these assumptions are held
to be correct, which is partly why I ask the questions I did.
I'm not entirely clear how to use the char16_t functions either,
despite having carefully read their complete description. That's
part of the reason why I wouldn't mind having string-oriented
versions written by the library implementor, rather than having to
write the equivalent code myself.
This seems like an odd thing to say. If you aren't sure how the
*c16* functions work, how can you evaluate whether some additional
functions should be added? Furthermore, if the already existing
multi-byte string conversion functions are any indication, new
functions for dealing with charXX_t strings will be defined in
terms of the more elementary charXX_t character conversion
functions. So if you don't yet understand the existing char16_t
conversion functions, there's a good chance that would carry over
to new char16_t string conversion functions that make use of
them (in the as-if sense, I mean).
j***@verizon.net
2016-11-22 17:27:36 UTC
Permalink
Post by Tim Rentsch
Post by j***@verizon.net
Post by Tim Rentsch
Note: to save space, I'm only going to refer to char16_t; but
everything I say about char16_t has an obvious char32_t analog. The
...
Post by Tim Rentsch
Post by j***@verizon.net
The justification for defining mbsrtoc16s(), despite the fact that
mbrtoc16() already exists, is the convenience factor. [...]
That is a plausible motivating factor. It is not however the only
such factor, and it may or may not be one as far as M. Krause is
concerned. I still am interested to hear his answer.
As to determining suitability, IMO saying some new feature would be
convenient is not by itself sufficient justification to warrant its
inclusion in the Standard. There really should be some further
motivation beyond that.
It would be helpful to know what the "further motivation" was that, in your
opinion, justified the addition of the corresponding wide character string
functions - or, is it your opinion that there was no such motivation, and that
they should therefore be dropped? To my mind, convenience would seem to be the
only justification for those functions, and it also seems a sufficient
justification, so I've never worried about whether there's any further
motivation.
Post by Tim Rentsch
Post by j***@verizon.net
Post by Tim Rentsch
I knew that. The question was meant in the sense of what
applications of those functions are expected.
I've no specific idea, but multi-byte character strings are
ubiquitous, and 16-bit encodings are not unheard of (I gather that
Windows uses UTF-16). Therefore, the possibility that there's a
shortage of applications which have a need to convert strings
between such encodings is not one I'm willing to bother worrying
about. YMMV. Since these functions don't exist yet, obviously
any such application is currently using some other method for
performing such conversions - however, I'd expect at least some of
the developers of such code to be happy to switch to a C standard
library function, as soon as they became sufficiently widely
available.
So, the bottom line is you really don't know?
I don't "know" anything about reality; all I have is varying degrees of
certainty about various statements about reality, which is never either exactly
0% or exactly 100%. I'm sufficiently sure, for the reasons given above, that
the need exists, that I'm not going to bother worrying about the possibility
that it doesn't. If those reasons aren't sufficient for you, that's fine - you
should investigate further - but I see no need to do so.
Post by Tim Rentsch
Post by j***@verizon.net
I'd been thinking in terms of a direct conversion, but I suppose
using wchar_t as an intermediary might have advantages. However,
if that's the case, then the conversion routines between wchar_t
and char16_t should be added to the standard library.
What I was trying to do is make sure the semantics are well-defined
and also consistent with a wchar_t representation, not describe an
implementation. ...
The standard imposes some requirements on the representation of multi-byte
characters, wchar_t, char16_t or char32_t, but not enough to mandate that
conversions between any pair of those types are invertible. If any of those
conversions is not invertible, forcing the translation between any two of those
types to go through a particular third type might make the conversion
unnecessarily lossy. I wouldn't mind it if the standard added words requiring
that some or all of those conversions be invertible.

...
Post by Tim Rentsch
Post by j***@verizon.net
I'm not entirely clear how to use the char16_t functions either,
despite having carefully read their complete description. That's
part of the reason why I wouldn't mind having string-oriented
versions written by the library implementor, rather than having to
write the equivalent code myself.
This seems like an odd thing to say. If you aren't sure how the
*c16* functions work, how can you evaluate whether some additional
functions should be added?
I think they should be added, precisely because I don't know how the single
character functions should be used, despite having read those descriptions.
That means that those descriptions are at the very least, obscure, so I'm
probably not the only person unsure about the matter. Anyone who implements the
single-character functions must understand how they are to be used, and should
therefore be capable of implementing the string-related functions better than I
could. I might not be able to evaluate whether they did it right, but I could
at least choose to trust that they've done so.
Post by Tim Rentsch
... Furthermore, if the already existing
multi-byte string conversion functions are any indication, new
functions for dealing with charXX_t strings will be defined in
terms of the more elementary charXX_t character conversion
functions. So if you don't yet understand the existing char16_t
conversion functions, there's a good chance that would carry over
to new char16_t string conversion functions that make use of
them (in the as-if sense, I mean).
Not necessarily - the definition by the standard of how those string functions
make use of the single character functions might, if sufficiently well written,
resolve my current uncertainties about how they should be used. The explanation
should be sufficiently well-written to allow implementors to implement it
correctly, which should be good enough for me to understand it.

In particular, if one of the single-character functions is currently defined in
a way that makes it impossible to use it while implementing the corresponding
string function (a possibility that is within the range of my current
uncertainty about them), then that would, in my opinion, be a defect in the
current standard. If that is the case, being forced to write up a description
of the string functions would allow the committee to realize that there was a
defect in the description of the corresponding single character function, and
correct it.
Jakob Bohm
2016-11-22 17:55:43 UTC
Post by j***@verizon.net
Post by Tim Rentsch
Post by j***@verizon.net
Post by Tim Rentsch
Note: to save space, I'm only going to refer to char16_t; but
everything I say about char16_t has an obvious char32_t analog. The
...
Post by Tim Rentsch
Post by j***@verizon.net
The justification for defining mbsrtoc16s(), despite the fact that
mbrtoc16() already exists, is the convenience factor. [...]
That is a plausible motivating factor. It is not however the only
such factor, and it may or may not be one as far as M. Krause is
concerned. I still am interested to hear his answer.
As to determining suitability, IMO saying some new feature would be
convenient is not by itself sufficient justification to warrant its
inclusion in the Standard. There really should be some further
motivation beyond that.
It would be helpful to know what the "further motivation" was that, in your
opinion, justified the addition of the corresponding wide character string
functions - or, is it your opinion that there was no such motivation, and that
they should therefore be dropped? To my mind, convenience would seem to be the
only justification for those functions, and it also seems a sufficient
justification, so I've never worried about whether there's any further
motivation.
Post by Tim Rentsch
Post by j***@verizon.net
Post by Tim Rentsch
I knew that. The question was meant in the sense of what
applications of those functions are expected.
I've no specific idea, but multi-byte character strings are
ubiquitous, and 16-bit encodings are not unheard of (I gather that
Windows uses UTF-16). Therefore, the possibility that there's a
shortage of applications which have a need to convert strings
between such encodings is not one I'm willing to bother worrying
about. YMMV. Since these functions don't exist yet, obviously
any such application is currently using some other method for
performing such conversions - however, I'd expect at least some of
the developers of such code to be happy to switch to a C standard
library function, as soon as they became sufficiently widely
available.
So, the bottom line is you really don't know?
I don't "know" anything about reality; all I have is varying degrees of
certainty about various statements about reality, which is never either exactly
0% or exactly 100%. I'm sufficiently sure, for the reasons given above, that
the need exists, that I'm not going to bother worrying about the possibility
that it doesn't. If those reasons aren't sufficient for you, that's fine - you
should investigate further - but I see no need to do so.
Post by Tim Rentsch
Post by j***@verizon.net
I'd been thinking in terms of a direct conversion, but I suppose
using wchar_t as an intermediary might have advantages. However,
if that's the case, then the conversion routines between wchar_t
and char16_t should be added to the standard library.
What I was trying to do is make sure the semantics are well-defined
and also consistent with a wchar_t representation, not describe an
implementation. ...
The standard imposes some requirements on the representation of multi-byte
characters, wchar_t, char16_t or char32_t, but not enough to mandate that
conversions between any pair of those types are invertible. If any of those
conversions is not invertible, forcing the translation between any two of those
types to go through a particular third type might make the conversion
unnecessarily lossy. I wouldn't mind it if the standard added words requiring
that some or all of those conversions be invertible.
...
Post by Tim Rentsch
Post by j***@verizon.net
I'm not entirely clear how to use the char16_t functions either,
despite having carefully read their complete description. That's
part of the reason why I wouldn't mind having string-oriented
versions written by the library implementor, rather than having to
write the equivalent code myself.
This seems like an odd thing to say. If you aren't sure how the
*c16* functions work, how can you evaluate whether some additional
functions should be added?
I think they should be added, precisely because I don't know how the single
character functions should be used, despite having read those descriptions.
That means that those descriptions are at the very least, obscure, so I'm
probably not the only person unsure about the matter. Anyone who implements the
single-character functions must understand how they are to be used, and should
therefore be capable of implementing the string-related functions better than I
could. I might not be able to evaluate whether they did it right, but I could
at least choose to trust that they've done so.
Post by Tim Rentsch
... Furthermore, if the already existing
multi-byte string conversion functions are any indication, new
functions for dealing with charXX_t strings will be defined in
terms of the more elementary charXX_t character conversion
functions. So if you don't yet understand the existing char16_t
conversion functions, there's a good chance that would carry over
to new char16_t string conversion functions that make use of
them (in the as-if sense, I mean).
Not necessarily - the definition by the standard of how those string functions
make use of the single character functions might, if sufficiently well written,
resolve my current uncertainties about how they should be used. The explanation
should be sufficiently well-written to allow implementors to implement it
correctly, which should be good enough for me to understand it.
In particular, if one of the single-character functions is currently defined in
a way that makes it impossible to use it while implementing the corresponding
string function (a possibility that is within the range of my current
uncertainty about them), then that would, in my opinion, be a defect in the
current standard. If that is the case, being forced to write up a description
of the string functions would allow the committee to realize that there was a
defect in the description of the corresponding single character function, and
correct it.
If (as I suspect) the single character functions keep internal state
when passed multi-element logical characters, then those single
character functions will be less thread-safe/reentrancy-safe than
all-in-one-invocation string functions. I could imagine the single
character functions taking some opaque state variable as an in-out
argument instead, which would make them logically suitable for a
thread-safe implementation of all-at-once string functions, though
inlining their implementations will probably be more efficient.

Additionally, when defining new string conversion functions, it might
be useful to define them in a way that can also handle 0 valued
characters that don't terminate the string, as otherwise code that
needs to handle those would need to do some weird gymnastics just to
work around the standard functions truncating at 0-bytes.

As to round-tripping through wchar_t, this would fail miserably for the
following common case:

char32_t == UCS-4 UNICODE
char16_t == UTF-16 UNICODE
wchar_t == char16_t

UCS-4 UNICODE can store character points all the way up to U+7FFFFFFF,
but only those up to U+0010FFFF can be encoded as UTF-16 sequences of
char16_t. While the current version of the UNICODE standard assigns no
character outside the UTF-16-compatible range (and even contains
verbiage to ban such use), someday they are going to run out of room
and start using more of the UCS-4 values.

Similarly, a 36 bit machine might, for implementation reasons, define
wchar_t as a 36 bit value, which thus cannot be round-tripped via
UCS-4 char32_t.



Enjoy

Jakob
--
Jakob Bohm, CIO, Partner, WiseMo A/S. https://www.wisemo.com
Transformervej 29, 2860 Søborg, Denmark. Direct +45 31 13 16 10
This public discussion message is non-binding and may contain errors.
WiseMo - Remote Service Management for PCs, Phones and Embedded
Tim Rentsch
2016-11-23 17:38:10 UTC
Post by Jakob Bohm
Post by j***@verizon.net
Post by Tim Rentsch
Post by j***@verizon.net
Post by Tim Rentsch
Note: to save space, I'm only going to refer to char16_t; but
everything I say about char16_t has an obvious char32_t analog. The
...
Post by Tim Rentsch
Post by j***@verizon.net
The justification for defining mbsrtoc16s(), despite the fact that
mbrtoc16() already exists, is the convenience factor. [...]
That is a plausible motivating factor. It is not however the only
such factor, and it may or may not be one as far as M. Krause is
concerned. I still am interested to hear his answer.
As to determining suitability, IMO saying some new feature would be
convenient is not by itself sufficient justification to warrant its
inclusion in the Standard. There really should be some further
motivation beyond that.
It would be helpful to know what the "further motivation" was that,
in your opinion, justified the addition of the corresponding wide
character string functions - or, is it your opinion that there was
no such motivation, and that they should therefore be dropped? To
my mind, convenience would seem to be the only justification for
those functions, and it also seems a sufficient justification, so
I've never worried about whether there's any further motivation.
Post by Tim Rentsch
Post by j***@verizon.net
Post by Tim Rentsch
I knew that. The question was meant in the sense of what
applications of those functions are expected.
I've no specific idea, but multi-byte character strings are
ubiquitous, and 16-bit encodings are not unheard of (I gather
that Windows uses UTF-16). Therefore, the possibility that
there's a shortage of applications which have a need to convert
strings between such encodings is not one I'm willing to bother
worrying about. YMMV. Since these functions don't exist yet,
obviously any such application is currently using some other
method for performing such conversions - however, I'd expect at
least some of the developers of such code to be happy to switch
to a C standard library function, as soon as they became
sufficiently widely available.
So, the bottom line is you really don't know?
I don't "know" anything about reality; all I have is varying
degrees of certainty about various statements about reality, which
is never either exactly 0% or exactly 100%. I'm sufficiently sure,
for the reasons given above, that the need exists, that I'm not
going to bother worrying about the possibility that it doesn't. If
those reasons aren't sufficient for you, that's fine - you should
investigate further - but I see no need to do so.
Post by Tim Rentsch
Post by j***@verizon.net
I'd been thinking in terms of a direct conversion, but I suppose
using wchar_t as an intermediary might have advantages. However,
if that's the case, then the conversion routines between wchar_t
and char16_t should be added to the standard library.
What I was trying to do is make sure the semantics are well-defined
and also consistent with a wchar_t representation, not describe an
implementation. ...
The standard imposes some requirements on the representation of
multi-byte characters, wchar_t, char16_t or char32_t, but not
enough to mandate that conversions between any pair of those types
are invertible. If any of those conversions is not invertible,
forcing the translation between any two of those types to go
through a particular third type might make the conversion
unnecessarily lossy. I wouldn't mind it if the standard added
words requiring that some or all of those conversions be
invertible.
...
Post by Tim Rentsch
Post by j***@verizon.net
I'm not entirely clear how to use the char16_t functions either,
despite having carefully read their complete description. That's
part of the reason why I wouldn't mind having string-oriented
versions written by the library implementor, rather than having to
write the equivalent code myself.
This seems like an odd thing to say. If you aren't sure how the
*c16* functions work, how can you evaluate whether some additional
functions should be added?
I think they should be added, precisely because I don't know how
the single character functions should be used, despite having read
those descriptions. That means that those descriptions are at the
very least, obscure, so I'm probably not the only person unsure
about the matter. Anyone who implements the single-character
functions must understand how they are to be used, and should
therefore be capable of implementing the string-related functions
better than I could. I might not be able to evaluate whether they
did it right, but I could at least choose to trust that they've
done so.
Post by Tim Rentsch
... Furthermore, if the already existing
multi-byte string conversion functions are any indication, new
functions for dealing with charXX_t strings will be defined in
terms of the more elementary charXX_t character conversion
functions. So if you don't yet understand the existing char16_t
conversion functions, there's a good chance that would carry over
to new char16_t string conversion functions that make use of
them (in the as-if sense, I mean).
Not necessarily - the definition by the standard of how those
string functions make use of the single character functions might,
if sufficiently well written, resolve my current uncertainties
about how they should be used. The explanation should be
sufficiently well-written to allow implementors to implement it
correctly, which should be good enough for me to understand it.
In particular, if one of the single-character functions is
currently defined in a way that makes it impossible to use it while
implementing the corresponding string function (a possibility that
is within the range of my current uncertainty about them), then
that would, in my opinion, be a defect in the current standard. If
that is the case, being forced to write up a description of the
string functions would allow the committee to realize that there
was a defect in the description of the corresponding single
character function, and correct it.
If (as I suspect) the single character functions keep internal state
when passed multi-element logical characters, [...]
None of the char16_t/char32_t conversion functions have that
property. Intermediate state is kept in an mbstate_t object,
a pointer to which is an argument in each call.
Post by Jakob Bohm
Additionally, when defining new string conversion functions, it
might be useful to define them in a way that can also handle 0
valued characters that don't terminate the string, as otherwise code
that needs to handle those would need to do some weird gymnastics
just to work around the standard functions truncating at 0-bytes.
The Standard explicitly disallows any such encoding, so there is
no motivation to define any standard library function to work
around it.
Post by Jakob Bohm
As to round-tripping through wchar_t, this would fail miserably for
char32_t == UCS-4 UNICODE
char16_t == UTF-16 UNICODE
wchar_t == char16_t
UCS-4 UNICODE can store character points all the way up to U+7FFFFFFF,
but only those up to U+0010FFFF can be encoded as UTF-16 sequences of
char16_t. [...]
If such an implementation claims to support the UCS-4 character
set then it is non-conforming. The wchar_t type must be able to
"represent distinct codes for all members of the largest extended
character set specified among the supported locales". Obviously
it can't do this for 0x7FFFFFFF values if it has only 16 bits to
represent them.

IIUC the Microsoft C implementation uses UTF-16 for wchar_t, but
it sidesteps the conformance issues by having its documentation
claim to support only those code pages whose character sets are
in the sixteen bit space. A prime example of weasel-conformance.

A round trip through wchar_t is guaranteed to work for any
documentation-supported code point. And if such a transformation
were made part of C semantics in some places, maybe that would be
enough to get Microsoft off its corporate a** and fix their
wchar_t stupidity.
Philipp Klaus Krause
2016-11-23 21:50:57 UTC
Post by Tim Rentsch
IIUC the Microsoft C implementation uses UTF-16 for wchar_t, but
it sidesteps the conformance issues by having its documentation
claim to support only those code pages whose character sets are
in the sixteen bit space. A prime example of weasel-conformance.
SDCC now has a 32-bit wchar_t. But up to 3.5.0 it only supported 8-bit
character sets, and had an 8-bit wchar_t.

Philipp
Tim Rentsch
2016-12-01 18:25:59 UTC
Post by Tim Rentsch
IIUC the Microsoft C implementation uses UTF-16 for wchar_t, but
it sidesteps the conformance issues by having its documentation
claim to support only those code pages whose character sets are
in the sixteen bit space. A prime example of weasel-conformance.
SDCC now has a 32-bit wchar_t. But up to 3.5.0 it only supported 8-bit
character sets, and had an 8-bit wchar_t.
I have no problem with implementations that are clearly one way
or the other, which includes SDCC 3.5.0 as you have described it.
My complaint with the Microsoft implementation is they want it
both ways - they claim to support only 16-bit character sets, but
they supply locales that provide a larger range, and the compiler
will happily translate wide string constants with individual
character values > 65535 (and which produce more than one wchar_t
in the array). In my book that counts as weasel-conformance.
Jakob Bohm
2016-12-01 18:40:24 UTC
Post by Tim Rentsch
Post by Tim Rentsch
IIUC the Microsoft C implementation uses UTF-16 for wchar_t, but
it sidesteps the conformance issues by having its documentation
claim to support only those code pages whose character sets are
in the sixteen bit space. A prime example of weasel-conformance.
SDCC now has a 32-bit wchar_t. But up to 3.5.0 it only supported 8-bit
character sets, and had an 8-bit wchar_t.
I have no problem with implementations that are clearly one way
or the other, which includes SDCC 3.5.0 as you have described it.
My complaint with the Microsoft implementation is they want it
both ways - they claim to support only 16-bit character sets, but
they supply locales that provide a larger range, and the compiler
will happily translate wide string constants with individual
character values > 65535 (and which produce more than one wchar_t
in the array). In my book that counts as weasel-conformance.
As I have said before, Microsoft seems to be the only company that
actually uses wchar_t for something other than silly wrappers around
8-bit syscalls. They *cannot* change wchar_t to char32_t without
breaking every single piece of wchar_t-using 3rd party source code
written for their OS since 1992.

So they have no real choice but to ignore any additional requirements
imposed by C committee members with no understanding of reality. In
fact, they should have also stuck to their well-thought out 1992
definition of how L"%s" and L"%c" function in format strings rather
than bowing to the arbitrary nonsense that the C committee defined.

Enjoy

Jakob
--
Jakob Bohm, CIO, Partner, WiseMo A/S. https://www.wisemo.com
Transformervej 29, 2860 Søborg, Denmark. Direct +45 31 13 16 10
This public discussion message is non-binding and may contain errors.
WiseMo - Remote Service Management for PCs, Phones and Embedded
Kaz Kylheku
2016-12-01 20:50:55 UTC
Post by Jakob Bohm
Post by Tim Rentsch
Post by Tim Rentsch
IIUC the Microsoft C implementation uses UTF-16 for wchar_t, but
it sidesteps the conformance issues by having its documentation
claim to support only those code pages whose character sets are
in the sixteen bit space. A prime example of weasel-conformance.
SDCC now has a 32-bit wchar_t. But up to 3.5.0 it only supported 8-bit
character sets, and had an 8-bit wchar_t.
I have no problem with implementations that are clearly one way
or the other, which includes SDCC 3.5.0 as you have described it.
My complaint with the Microsoft implementation is they want it
both ways - they claim to support only 16-bit character sets, but
they supply locales that provide a larger range, and the compiler
will happily translate wide string constants with individual
character values > 65535 (and which produce more than one wchar_t
in the array). In my book that counts as weasel-conformance.
As I have said before, Microsoft seems to be the only company that
actually uses wchar_t for something other than silly wrappers around
8-bit syscalls.
Yes, something far stupider: silly wrappers for 16 bit syscalls.

32 bit wchar_t internally representing code points, over top of UTF-8
syscalls and I/O is far more intelligent (and worldly) than 16 bit
wchar_t storing UTF-16 code *units*, over UTF-16 syscalls and I/O.
Jakob Bohm
2016-11-24 00:01:47 UTC
Post by Tim Rentsch
Post by Jakob Bohm
Post by j***@verizon.net
Post by Tim Rentsch
Post by j***@verizon.net
Post by Tim Rentsch
Note: to save space, I'm only going to refer to char16_t; but
everything I say about char16_t has an obvious char32_t analog. The
...
Post by Tim Rentsch
Post by j***@verizon.net
The justification for defining mbsrtoc16s(), despite the fact that
mbrtoc16() already exists, is the convenience factor. [...]
That is a plausible motivating factor. It is not however the only
such factor, and it may or may not be one as far as M. Krause is
concerned. I still am interested to hear his answer.
As to determining suitability, IMO saying some new feature would be
convenient is not by itself sufficient justification to warrant its
inclusion in the Standard. There really should be some further
motivation beyond that.
It would be helpful to know what the "further motivation" was that,
in your opinion, justified the addition of the corresponding wide
character string functions - or, is it your opinion that there was
no such motivation, and that they should therefore be dropped? To
my mind, convenience would seem to be the only justification for
those functions, and it also seems a sufficient justification, so
I've never worried about whether there's any further motivation.
Post by Tim Rentsch
Post by j***@verizon.net
Post by Tim Rentsch
I knew that. The question was meant in the sense of what
applications of those functions are expected.
I've no specific idea, but multi-byte character strings are
ubiquitous, and 16-bit encodings are not unheard of (I gather
that Windows uses UTF-16). Therefore, the possibility that
there's a shortage of applications which have a need to convert
strings between such encodings is not one I'm willing to bother
worrying about. YMMV. Since these functions don't exist yet,
obviously any such application is currently using some other
method for performing such conversions - however, I'd expect at
least some of the developers of such code to be happy to switch
to a C standard library function, as soon as they became
sufficiently widely available.
So, the bottom line is you really don't know?
I don't "know" anything about reality; all I have is varying
degrees of certainty about various statements about reality, which
is never either exactly 0% or exactly 100%. I'm sufficiently sure,
for the reasons given above, that the need exists, that I'm not
going to bother worrying about the possibility that it doesn't. If
those reasons aren't sufficient for you, that's fine - you should
investigate further - but I see no need to do so.
Post by Tim Rentsch
Post by j***@verizon.net
I'd been thinking in terms of a direct conversion, but I suppose
using wchar_t as an intermediary might have advantages. However,
if that's the case, then the conversion routines between wchar_t
and char16_t should be added to the standard library.
What I was trying to do is make sure the semantics are well-defined
and also consistent with a wchar_t representation, not describe an
implementation. ...
The standard imposes some requirements on the representation of
multi-byte characters, wchar_t, char16_t or char32_t, but not
enough to mandate that conversions between any pair of those types
are invertible. If any of those conversions is not invertible,
forcing the translation between any two of those types to go
through a particular third type might make the conversion
unnecessarily lossy. I wouldn't mind it if the standard added
words requiring that some or all of those conversions be
invertible.
...
Post by Tim Rentsch
Post by j***@verizon.net
I'm not entirely clear how to use the char16_t functions either,
despite having carefully read their complete description. That's
part of the reason why I wouldn't mind having string-oriented
versions written by the library implementor, rather than having to
write the equivalent code myself.
This seems like an odd thing to say. If you aren't sure how the
*c16* functions work, how can you evaluate whether some additional
functions should be added?
I think they should be added, precisely because I don't know how
the single character functions should be used, despite having read
those descriptions. That means that those descriptions are at the
very least, obscure, so I'm probably not the only person unsure
about the matter. Anyone who implements the single-character
functions must understand how they are to be used, and should
therefore be capable of implementing the string-related functions
better than I could. I might not be able to evaluate whether they
did it right, but I could at least choose to trust that they've
done so.
Post by Tim Rentsch
... Furthermore, if the already existing
multi-byte string conversion functions are any indication, new
functions for dealing with charXX_t strings will be defined in
terms of the more elementary charXX_t character conversion
functions. So if you don't yet understand the existing char16_t
conversion functions, there's a good chance that would carry over
to new char16_t string conversion functions that make use of
them (in the as-if sense, I mean).
Not necessarily - the definition by the standard of how those
string functions make use of the single character functions might,
if sufficiently well written, resolve my current uncertainties
about how they should be used. The explanation should be
sufficiently well-written to allow implementors to implement it
correctly, which should be good enough for me to understand it.
In particular, if one of the single-character functions is
currently defined in a way that makes it impossible to use it while
implementing the corresponding string function (a possibility that
is within the range of my current uncertainty about them), then
that would, in my opinion, be a defect in the current standard. If
that is the case, being forced to write up a description of the
string functions would allow the committee to realize that there
was a defect in the description of the corresponding single
character function, and correct it.
If (as I suspect) the single character functions keep internal state
when passed multi-element logical characters, [...]
None of the char16_t/char32_t conversion functions have that
property. Intermediate state is kept in an mbstate_t object,
a pointer to which is an argument in each call.
Good
Post by Tim Rentsch
Post by Jakob Bohm
Additionally, when defining new string conversion functions, it
might be useful to define them in a way that can also handle 0
valued characters that don't terminate the string, as otherwise code
that needs to handle those would need to do some weird gymnastics
just to work around the standard functions truncating at 0-bytes.
The Standard explicitly disallows any such encoding, so there is
no motivation to define any standard library function to work
around it.
The C standard cannot disallow the existence of applications/libraries
that allow 0 characters in structures such as "pascal strings", "C++
strings", "I/O buffers represented as strings", etc. That is the
context in which I was referring to 0 chars.
Post by Tim Rentsch
Post by Jakob Bohm
As to round-tripping through wchar_t, this would fail miserably for
char32_t == UCS-4 UNICODE
char16_t == UTF-16 UNICODE
wchar_t == char16_t
UCS-4 UNICODE can store character points all the way up to U+7FFFFFFF,
but only those up to U+0010FFFF can be encoded as UTF-16 sequences of
char16_t. [...]
If such an implementation claims to support the UCS-4 character
set then it is non-conforming. The wchar_t type must be able to
"represent distinct codes for all members of the largest extended
character set specified among the supported locales". Obviously
it can't do this for 0x7FFFFFFF values if it has only 16 bits to
represent them.
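The limit can be made concrete with a sketch of UTF-16 encoding (the helper name below is my own invention, not a standard function): everything up to U+10FFFF fits in one or two 16-bit units, and nothing above it has any encoding at all.

```c
#include <stdint.h>

/* Illustrative: encode one code point as UTF-16. Returns the number of
 * 16-bit units written (1 or 2), or 0 if the code point is outside the
 * range UTF-16 can represent (above U+10FFFF, or a surrogate value). */
static int utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp < 0xD800 || (cp >= 0xE000 && cp <= 0xFFFF)) {
        out[0] = (uint16_t)cp;          /* Basic Multilingual Plane */
        return 1;
    }
    if (cp >= 0x10000 && cp <= 0x10FFFF) {
        cp -= 0x10000;                  /* 20 bits split over two units */
        out[0] = (uint16_t)(0xD800 | (cp >> 10));   /* high surrogate */
        out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF)); /* low surrogate */
        return 2;
    }
    return 0;   /* UCS-4 values above U+10FFFF have no UTF-16 form */
}
```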
IIUC the Microsoft C implementation uses UTF-16 for wchar_t, but
it sidesteps the conformance issues by having its documentation
claim to support only those code pages whose character sets are
in the sixteen bit space. A prime example of weasel-conformance.
A round trip through wchar_t is guaranteed to work for any
documentation-supported code point. And if such a transformation
were made part of C semantics in some places, maybe that would be
enough to get Microsoft off its corporate a** and fix their
wchar_t stupidity.
The story is somewhat different: Microsoft built the core of their OS
based on an early UNICODE draft. The switch from UCS-2 to UCS-4 as the
basis of the standard happened too late for Microsoft to update their
API, ABI, file format etc. specifications accordingly, thus *all*
compilers targeting the Win32/Win64/WinCE API need (for API/ABI
reasons) to use the wchar_t == char16_t definition, regardless of the
much later introduction of char32_t as a C type.

The situation for Sun's Java is somewhat similar, although Sun doesn't
have a good excuse for not knowing about UCS-4.

Thus as a practical matter, weaseling around a contradictory standard
requirement (that obviously didn't consider the situation on the most
widely used wchar_t using APIs/ABIs) is the only alternative to
removing that requirement from the standard.

Here is a challenge: Name me an OS which has an actual use for the
wchar_t type in its API/ABI as anything other than a token gesture to
its presence in the C standard, and which uses a 32 bit wchar_t for
that?


Enjoy

Jakob
--
Jakob Bohm, CIO, Partner, WiseMo A/S. https://www.wisemo.com
Transformervej 29, 2860 Søborg, Denmark. Direct +45 31 13 16 10
This public discussion message is non-binding and may contain errors.
WiseMo - Remote Service Management for PCs, Phones and Embedded
James R. Kuyper
2016-11-24 00:37:26 UTC
Permalink
...
Post by Jakob Bohm
Post by Tim Rentsch
Post by Jakob Bohm
Additionally, when defining new string conversion functions, it
might be useful to define them in a way that can also handle 0
valued characters that don't terminate the string, as otherwise code
that needs to handle those would need to do some weird gymnastics
just to work around the standard functions truncating at 0-bytes.
The Standard explicitly disallows any such encoding, so there is
no motivation to define any standard library function to work
around it.
The C standard cannot disallow the existence of application/libraries
that allow 0 characters in structures such as "pascal strings", "C++
strings", "I/O buffers represented as strings" etc. That is the
context in which I was referring to 0 chars.
The standard can, however, refuse to identify such structures as
strings. And it does:

The standard specifies that, for multibyte characters, "A byte with all
bits zero shall be interpreted as a null character independent of shift
state. Such a byte shall not occur as part of any other multibyte
character." (5.2.1.2p1)

"A string is a contiguous sequence of characters terminated by and
including the first null character. The term multibyte string is
sometimes used instead to emphasize special processing given to
multibyte characters contained in the string or to avoid confusion
with a wide string." (7.1.1p1)

Therefore, a function that takes an argument described as pointing at a
string should not read past the first null character; one described as
writing a string should write at least one terminating null character.

You can propose functions whose description indicates that they do not
process strings, but rather some other data structure. Those functions
can use any method you deem appropriate to determine where the end of
the data to be processed is - but they are not string functions as far
as the C standard is concerned.
Jakob Bohm
2016-11-24 19:46:52 UTC
Permalink
Post by James R. Kuyper
...
Post by Jakob Bohm
Post by Tim Rentsch
Post by Jakob Bohm
Additionally, when defining new string conversion functions, it
might be useful to define them in a way that can also handle 0
valued characters that don't terminate the string, as otherwise code
that needs to handle those would need to do some weird gymnastics
just to work around the standard functions truncating at 0-bytes.
The Standard explicitly disallows any such encoding, so there is
no motivation to define any standard library function to work
around it.
The C standard cannot disallow the existence of application/libraries
that allow 0 characters in structures such as "pascal strings", "C++
strings", "I/O buffers represented as strings" etc. That is the
context in which I was referring to 0 chars.
The standard can, however, refuse to identify such structures as
The standard specifies that, for multibyte characters, "A byte with all
bits zero shall be interpreted as a null character independent of shift
state. Such a byte shall not occur as part of any other multibyte
character." (5.2.1.2p1)
"A string is a contiguous sequence of characters terminated by and
including the first null character. The term multibyte string is
sometimes used instead to emphasize special processing given to
multibyte characters contained in the string or to avoid confusion
with a wide string." (7.1.1p1)
Therefore, a function that takes an argument described as pointing at a
string should not read past the first null character; one described as
writing a string should write at least one terminating null character.
You can propose functions whose description indicates that they do not
process strings, but rather some other data structure. Those functions
can use any method you deem appropriate to determine where the end of
the data to be processed is - but they are not string functions as far
as the C standard is concerned.
I was merely suggesting that if functions of the kind discussed in
this thread were added to the next version of the standard, it might
be useful to make them usable with length-based <string.h> functions
such as memchr(), not just with null-terminated ones such as strchr().
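The difference being drawn here is visible with an embedded 0 byte; a minimal illustration using only <string.h>:

```c
#include <string.h>

/* A buffer with an embedded 0 byte: the str* functions stop at the
 * first 0, while the mem* functions take an explicit length and can
 * look past it. */
static const char buf[5] = { 'a', 'b', 0, 'c', 'd' };

static size_t str_view_len(void)    { return strlen(buf); }      /* stops at 0 */
static const char *find_c_str(void) { return strchr(buf, 'c'); } /* never found */
static const char *find_c_mem(void) { return memchr(buf, 'c', sizeof buf); }
```

A conversion function defined with an explicit input length (in the memchr() style) could process such a buffer in one call; a string-style interface would have to be called once per 0-delimited segment.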

Enjoy

Jakob
Tim Rentsch
2016-12-01 19:09:13 UTC
Permalink
Post by Jakob Bohm
Post by Tim Rentsch
Post by Jakob Bohm
[...]
Additionally, when defining new string conversion functions, it
might be useful to define them in a way that can also handle 0
valued characters that don't terminate the string, as otherwise code
that needs to handle those would need to do some weird gymnastics
just to work around the standard functions truncating at 0-bytes.
The Standard explicitly disallows any such encoding, so there is
no motivation to define any standard library function to work
around it.
The C standard cannot disallow the existence of application/libraries
that allow 0 characters in structures such as "pascal strings", "C++
strings", "I/O buffers represented as strings" etc. That is the
context in which I was referring to 0 chars.
It still is true that there is no motivation to define any string
conversion functions in the standard library to work on such things.
If applications want to do so, fine, but they clearly are working
outside the domain of what the Standard considers usual.
Post by Jakob Bohm
Post by Tim Rentsch
Post by Jakob Bohm
As to round-tripping through wchar_t, this would fail miserably for
char32_t == UCS-4 UNICODE
char16_t == UTF-16 UNICODE
wchar_t == char16_t
UCS-4 UNICODE can store character points all the way up to U+7FFFFFFF,
but only those up to U+0010FFFF can be encoded as UTF-16 sequences of
char16_t. [...]
If such an implementation claims to support the UCS-4 character
set then it is non-conforming. The wchar_t type must be able to
"represent distinct codes for all members of the largest extended
character set specified among the supported locales". Obviously
it can't do this for 0x7FFFFFFF values if it has only 16 bits to
represent them.
IIUC the Microsoft C implementation uses UTF-16 for wchar_t, but
it sidesteps the conformance issues by having its documentation
claim to support only those code pages whose character sets are
in the sixteen bit space. A prime example of weasel-conformance.
A round trip through wchar_t is guaranteed to work for any
documentation-supported code point. And if such a transformation
were made part of C semantics in some places, maybe that would be
enough to get Microsoft off its corporate a** and fix their
wchar_t stupidity.
The story is somewhat different, Microsoft built the core of their OS
based on an early UNICODE draft. The switch from UCS-2 to UCS-4 as the
basis of the standard happened too late for Microsoft to update their
API, ABI, file format etc. specifications accordingly, thus *all*
compilers targeting the Win32/Win64/WinCE API need (for API/ABI
reasons) to use the wchar_t == char16_t definition, regardless of the
much later introduction of char32_t as a C type.
I understand why they made the choice they did 20 years ago. My
complaint is that they haven't fixed it in the 20 years since then.
Post by Jakob Bohm
The situation for Sun's Java is somewhat similar, although Sun doesn't
have a good excuse for not knowing about UCS-4.
AFAIK the requirements for Java are not written in terms of C types,
so any comment about Java is not relevant.
Post by Jakob Bohm
Thus as a practical matter, weaseling around a contradictory standard
requirement (that obviously didn't consider the situation on the most
widely used wchar_t using APIs/ABIs) is the only alternative to
removing that requirement from the standard.
Hogwash. The type wchar_t is part of C89/C90. It was obvious from
the get-go (before MS Windows 95, remember) that it would evolve as
time went on, so making changes to that should have been anticipated
before choosing UCS-2. Even if it wasn't anticipated early on, MS
has had 20 years to devise an evolutionary path to upgrade to 32-bit
wchar_t (with, eg, parallel libraries for the two choices). If
nothing else it could have been done as part of the transition to
64-bit architectures. What we have instead is Microsoft as usual
thumbing its nose at the ISO C standard.
Post by Jakob Bohm
Here is a challenge: Name me an OS which has an actual use for the
wchar_t type in its API/ABI as anything other than a token gesture to
its presence in the C standard, and which uses a 32 bit wchar_t for
that?
If an OS mostly doesn't use wchar_t in its API or ABI, that makes it
EASIER to change the representation of wchar_t without having to
modify the OS. That's a good design decision, not a bad one. If
Microsoft chose to lock a significant part of their API/ABI to a
wchar_t fixed at 16-bits, they made a bad design decision. Okay,
at the time maybe that's understandable, but that's no excuse for
not addressing the problem in the two decades since then.
Jakob Bohm
2016-12-02 03:10:07 UTC
Permalink
Post by Tim Rentsch
Post by Jakob Bohm
Post by Tim Rentsch
Post by Jakob Bohm
[...]
Additionally, when defining new string conversion functions, it
might be useful to define them in a way that can also handle 0
valued characters that don't terminate the string, as otherwise code
that needs to handle those would need to do some weird gymnastics
just to work around the standard functions truncating at 0-bytes.
The Standard explicitly disallows any such encoding, so there is
no motivation to define any standard library function to work
around it.
The C standard cannot disallow the existence of application/libraries
that allow 0 characters in structures such as "pascal strings", "C++
strings", "I/O buffers represented as strings" etc. That is the
context in which I was referring to 0 chars.
It still is true that there is no motivation to define any string
conversion functions in the standard library to work on such things.
If applications want to do so, fine, but they clearly are working
outside the domain of what the Standard considers usual.
Post by Jakob Bohm
Post by Tim Rentsch
Post by Jakob Bohm
As to round-tripping through wchar_t, this would fail miserably for
char32_t == UCS-4 UNICODE
char16_t == UTF-16 UNICODE
wchar_t == char16_t
UCS-4 UNICODE can store character points all the way up to U+7FFFFFFF,
but only those up to U+0010FFFF can be encoded as UTF-16 sequences of
char16_t. [...]
If such an implementation claims to support the UCS-4 character
set then it is non-conforming. The wchar_t type must be able to
"represent distinct codes for all members of the largest extended
character set specified among the supported locales". Obviously
it can't do this for 0x7FFFFFFF values if it has only 16 bits to
represent them.
IIUC the Microsoft C implementation uses UTF-16 for wchar_t, but
it sidesteps the conformance issues by having its documentation
claim to support only those code pages whose character sets are
in the sixteen bit space. A prime example of weasel-conformance.
A round trip through wchar_t is guaranteed to work for any
documentation-supported code point. And if such a transformation
were made part of C semantics in some places, maybe that would be
enough to get Microsoft off its corporate a** and fix their
wchar_t stupidity.
The story is somewhat different, Microsoft built the core of their OS
based on an early UNICODE draft. The switch from UCS-2 to UCS-4 as the
basis of the standard happened too late for Microsoft to update their
API, ABI, file format etc. specifications accordingly, thus *all*
compilers targeting the Win32/Win64/WinCE API need (for API/ABI
reasons) to use the wchar_t == char16_t definition, regardless of the
much later introduction of char32_t as a C type.
I understand why they made the choice they did 20 years ago. My
complaint is that they haven't fixed it in the 20 years since then.
Post by Jakob Bohm
The situation for Sun's Java is somewhat similar, although Sun doesn't
have a good excuse for not knowing about UCS-4.
AFAIK the requirements for Java are not written in terms of C types,
so any comment about Java is not relevant.
The important Java *implementations* and most native-code Java
extensions are written in the C language.
Post by Tim Rentsch
Post by Jakob Bohm
Thus as a practical matter, weaseling around a contradictory standard
requirement (that obviously didn't consider the situation on the most
widely used wchar_t using APIs/ABIs) is the only alternative to
removing that requirement from the standard.
Hogwash. The type wchar_t is part of C89/C90. It was obvious from
the get-go (before MS Windows 95, remember) that it would evolve as
time went on, so making changes to that should have been anticipated
before choosing UCS-2. Even if it wasn't anticipated early on, MS
has had 20 years to devise an evolutionary path to upgrade to 32-bit
wchar_t (with, eg, parallel libraries for the two choices). If
nothing else it could have been done as part of the transition to
64-bit architectures. What we have instead is Microsoft as usual
thumbing its nose at the ISO C standard.
Post by Jakob Bohm
Here is a challenge: Name me an OS which has an actual use for the
wchar_t type in its API/ABI as anything other than a token gesture to
its presence in the C standard, and which uses a 32 bit wchar_t for
that?
If an OS mostly doesn't use wchar_t in its API or ABI, that makes it
EASIER to change the representation of wchar_t without having to
modify the OS.
That is exactly why those are the operating systems that should be
forced to change (if any must), not the ones that would have a serious
problem changing a decision that predates the contradictory decisions
of a committee dominated by their competitors.
Post by Tim Rentsch
That's a good design decision, not a bad one. If
Microsoft chose to lock a significant part of their API/ABI to a
wchar_t fixed at 16-bits, they made a bad design decision. Okay,
at the time maybe that's understandable, but that's no excuse for
not addressing the problem in the two decades since then.
Microsoft spent the 1990s trying to get rid of 8 bit char in system
calls (except as unsigned bytes in streams etc.), their reduced Windows
CE API actually removed the system calls that took char* strings
completely, and in mainstream Win32 prior to Windows Vista (or maybe an
even later version), locales with no standard single byte or double
byte char character sets (such as some Indian dialects) would simply
return error when trying to use char* calls rather than using UTF-8,
probably because there were ancient ABI guarantees (dating back to the
first Japanese version of MS-DOS) in the form of locale information
functions that would tell applications which char values indicated that
the encoded character was 2 bytes and which ones indicated 1 byte, with
no way to tell callers that some char values would actually indicate 3
or more bytes, this applied only to the locale-default character set
and not to character sets that could be specified explicitly to
conversion functions, which is where UTF-8 etc. were already supported.

Prior to the introduction of char16_t and char32_t, wchar_t was a
vaguely defined type whose width implementations could choose freely,
so until then there was no reason to anticipate that later versions of
the standard would demand 32 bits to support the rarer Unicode code
points.

And in case you missed it, UTF-32 stored as char32_t does *not*
represent all logical characters as single values either, because of
the various modifier code points in UNICODE, such as accents etc.
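For example, "é" has two equivalent UTF-32 spellings, one of which takes two char32_t values for a single logical character; a minimal sketch:

```c
#include <uchar.h>

/* "é" in two equivalent UTF-32 spellings: one precomposed code point,
 * or a base letter plus a combining accent (U+0301). Even with 32-bit
 * units, one logical character may occupy several char32_t values. */
static const char32_t precomposed[] = { 0x00E9, 0 };         /* U+00E9      */
static const char32_t combining[]   = { 0x0065, 0x0301, 0 }; /* 'e' + acute */

/* Count char32_t units before the terminator. */
static size_t c32len(const char32_t *s)
{
    size_t n = 0;
    while (s[n] != 0)
        n++;
    return n;
}
```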

Also the very introduction of char16_t and char32_t makes sense
only in the context of interoperating with systems that are
internally tied to wchar_t being one or the other, so an interpretation
that wchar_t==char16_t==UTF-16 is a formal violation makes no sense.

P.S.

Microsoft's newer managed runtime (.NET) likewise uses UTF-16 for its strings.


Enjoy

Jakob
Tim Rentsch
2016-12-10 06:33:26 UTC
Permalink
Post by Jakob Bohm
[decoupling OS API from wchar_t]
That's a good design decision, not a bad one. If
Microsoft chose to lock a significant part of their API/ABI to a
wchar_t fixed at 16-bits, they made a bad design decision. Okay,
at the time maybe that's understandable, but that's no excuse for
not addressing the problem in the two decades since then.
Microsoft spent the 1990s trying to get rid of 8 bit char in system
calls (except as unsigned bytes in streams etc.), their reduced Windows
CE API actually removed the system calls that took char* strings
completely, and in mainstream Win32 prior to Windows Vista (or maybe an
even later version), locales with no standard single byte or double
byte char character sets (such as some Indian dialects) would simply
return error when trying to use char* calls rather than using UTF-8,
probably because there were ancient ABI guarantees (dating back to the
first Japanese version of MS-DOS) in the form of locale information
functions that would tell applications which char values indicated that
the encoded character was 2 bytes and which ones indicated 1 byte, with
no way to tell callers that some char values would actually indicate 3
or more bytes, this applied only to the locale-default character set
and not to character sets that could be specified explicitly to
conversion functions, which is where UTF-8 etc. were already supported.
A long (172 words!) rambling single sentence that apparently
has nothing to do with wchar_t.
Post by Jakob Bohm
Prior to the introduction of char16_t and char32_t, wchar_t was a
vaguely defined type which implementations could choose to be any
number of bits, [...]
Not so. Even in 1990 wchar_t was required to be large enough to
"represent distinct codes for all members of the largest extended
character set specified among the supported locales". It didn't
take a genius to see that this set would change and grow over time.
Post by Jakob Bohm
And in case you missed it, UTF-32 stored as char32_t does *not*
represent all logical characters as single values either, because of
the various modifier code points in UNICODE, such as accents etc.
Irrelevant to what is being discussed.
Post by Jakob Bohm
Also the very introduction of char16_t and char32_t makes sense
only in the context of interoperating with systems that are
internally tied to wchar_t being one or the other, so an interpretation
that wchar_t==char16_t==UTF-16 is a formal violation makes no sense.
I disagree with the premise, but in any case the point is
irrelevant to the discussion. Among other things, char16_t (and
char32_t) didn't even enter the language until 2011. The
requirements for wchar_t had been in place more than 20 years by
then, so an argument based on char16_t simply doesn't hold water.
Jakob Bohm
2016-12-13 16:33:34 UTC
Permalink
Post by Tim Rentsch
Post by Jakob Bohm
[decoupling OS API from wchar_t]
That's a good design decision, not a bad one. If
Microsoft chose to lock a significant part of their API/ABI to a
wchar_t fixed at 16-bits, they made a bad design decision. Okay,
at the time maybe that's understandable, but that's no excuse for
not addressing the problem in the two decades since then.
Microsoft spent the 1990s trying to get rid of 8 bit char in system
calls (except as unsigned bytes in streams etc.), their reduced Windows
CE API actually removed the system calls that took char* strings
completely, and in mainstream Win32 prior to Windows Vista (or maybe an
even later version), locales with no standard single byte or double
byte char character sets (such as some Indian dialects) would simply
return error when trying to use char* calls rather than using UTF-8,
probably because there were ancient ABI guarantees (dating back to the
first Japanese version of MS-DOS) in the form of locale information
functions that would tell applications which char values indicated that
the encoded character was 2 bytes and which ones indicated 1 byte, with
no way to tell callers that some char values would actually indicate 3
or more bytes, this applied only to the locale-default character set
and not to character sets that could be specified explicitly to
conversion functions, which is where UTF-8 etc. were already supported.
A long (172 words!) rambling single sentence that apparently
has nothing to do with wchar_t.
I was trying to counter your completely misguided slander of one of the
few places where Microsoft has been doing the right thing (until they
were pressured into breaking compatibility of wprintf() to cater to the
wrong definition). It is also a description of Microsoft's attempts at
the transition plan you asked for.
Post by Tim Rentsch
Post by Jakob Bohm
Prior to the introduction of char16_t and char32_t, wchar_t was a
vaguely defined type which implementations could choose to be any
number of bits, [...]
Not so. Even in 1990 wchar_t was required to be large enough to
"represent distinct codes for all members of the largest extended
character set specified among the supported locales". It didn't
take a genius to see that this set would change and grow over time.
That phrasing can be (and was) easily read as requiring only that no
supported characters be omitted, for example by providing room for
every 16-bit UTF-16 code unit on a system where none of the locale-
specific char character codes come anywhere near the full UNICODE
repertoire (because none of them are UTF-8 or its alternatives).
Reading it as a prohibition against using UTF-16 to add higher UNICODE
codepoints to a system designed for UCS-2 is maliciously holding past
decisions against new standards (interpretations).
Post by Tim Rentsch
Post by Jakob Bohm
And in case you missed it, UTF-32 stored as char32_t does *not*
represent all logical characters as single values either, because of
the various modifier code points in UNICODE, such as accents etc.
Irrelevant to what is being discussed.
It shows that UTF-32 wchar_t does not satisfy your supposed
interpretation, and an implementation would have to come up with its
own unique type and encoding for representing all the composite UNICODE
values as single wchar_t values.



Enjoy

Jakob
Tim Rentsch
2016-12-25 18:02:33 UTC
Permalink
Merry Christmas...

Keith Thompson
2016-11-23 19:42:39 UTC
Permalink
Jakob Bohm <jb-***@wisemo.com> writes:
[...]
Post by Jakob Bohm
UCS-4 UNICODE can store character points all the way up to U+7FFFFFFF,
but only those up to U+0010FFFF can be encoded as UTF-16 sequences of
char16_t. While the current version of the UNICODE standard uses no
character outside the UTF-16 compatible range (and even contains
verbiage to ban such use), someday they are going to run out of room
and start using more of the UCS-4 values.
According to Wikipedia, UCS-4 was part of the original ISO 10646
standard, but the character set was restricted in 2003, effectively
making UCS-4 identical to UTF-32. The working group has
a policy that all future assignments will be restricted to the
Unicode range of 0 to U+10FFFF. (Yes, policies can change, but
they've been pretty clear about this one.)

https://en.wikipedia.org/wiki/UTF-32#History

Certainly a C implementation *could* use a character encoding that
supports code points above 0x10FFFF, but such an encoding would
not be Unicode.
--
Keith Thompson (The_Other_Keith) kst-***@mib.org <http://www.ghoti.net/~kst>
Working, but not speaking, for JetHead Development, Inc.
"We must do something. This is something. Therefore, we must do this."
-- Antony Jay and Jonathan Lynn, "Yes Minister"
Tim Rentsch
2016-11-23 19:00:54 UTC
Permalink
Post by j***@verizon.net
Post by Tim Rentsch
Post by j***@verizon.net
Post by Tim Rentsch
Note: to save space, I'm only going to refer to char16_t; but
everything I say about char16_t has an obvious char32_t analog. The
...
Post by Tim Rentsch
Post by j***@verizon.net
The justification for defining mbsrtoc16s(), despite the fact that
mbrtoc16() already exists, is the convenience factor. [...]
That is a plausible motivating factor. It is not however the only
such factor, and it may or may not be one as far as M. Krause is
concerned. I still am interested to hear his answer.
As to determining suitability, IMO saying some new feature would be
convenient is not by itself sufficient justification to warrant its
inclusion in the Standard. There really should be some further
motivation beyond that.
It would be helpful to know what the "further motivation" was that,
in your opinion, justified the addition of the corresponding wide
character string functions - or, is it your opinion that there was
no such motivation, and that they should therefore be dropped? To
my mind, convenience would seem to be the only justification for
those functions, and it also seems a sufficient justification, so
I've never worried about whether there's any further motivation.
I have no idea what arguments were offered to motivate those
functions, or indeed if there were any such arguments at all.
Since I don't know what arguments were offered, I'm not in a
position to say that the same arguments apply - maybe they
do, and maybe they don't, but either way I don't know.
Post by j***@verizon.net
Post by Tim Rentsch
Post by j***@verizon.net
Post by Tim Rentsch
I knew that. The question was meant in the sense of what
applications of those functions are expected.
I've no specific idea, but multi-byte character strings are
ubiquitous, and 16-bit encodings are not unheard of (I gather that
Windows uses UTF-16). Therefore, the possibility that there's a
shortage of applications which have a need to convert strings
between such encodings is not one I'm willing to bother worrying
about. YMMV. Since these functions don't exist yet, obviously
any such application is currently using some other method for
performing such conversions - however, I'd expect at least some of
the developers of such code to be happy to switch to a C standard
library function, as soon as they became sufficiently widely
available.
So, the bottom line is you really don't know?
I don't "know" anything about reality; all I have is varying
degrees of certainty about various statements about reality, which
is never either exactly 0% or exactly 100%. I'm sufficiently sure,
for the reasons given above, that the need exists, that I'm not
going to bother worrying about the possibility that it doesn't. If
those reasons aren't sufficient for you, that's fine - you should
investigate further - but I see no need to do so.
Let me put my question differently. Am I right in saying that
your earlier comments are just speculation, in the sense that
you don't have any concrete evidence or examples to offer?
Post by j***@verizon.net
Post by Tim Rentsch
Post by j***@verizon.net
I'd been thinking in terms of a direct conversion, but I suppose
using wchar_t as an intermediary might have advantages. However,
if that's the case, then the conversion routines between wchar_t
and char16_t should be added to the standard library.
What I was trying to do is make sure the semantics are well-defined
and also consistent with a wchar_t representation, not describe an
implementation. ...
The standard imposes some requirements on the representation of
multi-byte characters, wchar_t, char16_t or char32_t, but not enough
to mandate that conversions between any pair of those types are
invertible. If any of those conversions is not invertible, forcing
the translation between any two of those types to go through a
particular third type might make the conversion unnecessarily lossy.
I wouldn't mind it if the standard added words requiring that some
or all of those conversions be invertible.
What I think are the important round trips, ie, those starting
and ending with multi-byte characters, cannot be made invertible
because some encodings (that the Standard wants to allow) are
inherently potentially redundant. But it might be enough to
say that a round-trip operation must be idempotent, ie, applying
it twice is the same as applying it once.
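The distinction can be sketched with a toy redundant encoding (entirely made up for illustration): the round trip is lossy, hence not invertible, but applying it a second time changes nothing further.

```c
/* Toy redundant encoding: code units 0x00..0x7F denote characters, and
 * adding 0x80 gives an alternate spelling of the same character. The
 * decode step is therefore not invertible (two inputs, one character),
 * but the decode-then-encode round trip is idempotent. */
static unsigned char toy_decode(unsigned char u) { return u & 0x7F; }
static unsigned char toy_encode(unsigned char c) { return c; } /* canonical */
static unsigned char toy_roundtrip(unsigned char u)
{
    return toy_encode(toy_decode(u));
}
```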
Post by j***@verizon.net
Post by Tim Rentsch
Post by j***@verizon.net
I'm not entirely clear how to use the char16_t functions either,
despite having carefully read their complete description. That's
part of the reason why I wouldn't mind having string-oriented
versions written by the library implementor, rather than having to
write the equivalent code myself.
This seems like an odd thing to say. If you aren't sure how the
*c16* functions work, how can you evaluate whether some additional
functions should be added?
I think they should be added, precisely because I don't know how the
single character functions should be used, despite having read those
descriptions.
Have you tried to write any code that uses them? If you did that
might alleviate some of your uncertainty.
Post by j***@verizon.net
That means that those descriptions are, at the very least, obscure,
so I'm probably not the only person unsure about the matter.
understand how they are to be used, and should therefore be capable
of implementing the string-related functions better than I could. I
might not be able to evaluate whether they did it right, but I could
at least choose to trust that they've done so.
I don't think it follows necessarily that the functions are hard
to understand. It may be simply that you are distracted by other
things (eg, your kids) and haven't had time to look at them
carefully. My guess is that in fact you would have no trouble
if you could take some time to look at them and perhaps if it were
important to do so, eg, as part of a work assignment. I agree
the functions are a little weird but they are not that difficult.
Post by j***@verizon.net
Post by Tim Rentsch
... Furthermore, if the already existing
multi-byte string conversion functions are any indication, new
functions for dealing with charXX_t strings will be defined in
terms of the more elementary charXX_t character conversion
functions. So if you don't yet understand the existing char16_t
conversion functions, there's a good chance that would carry over
to new char16_t string conversion functions that make use of
them (in the as-if sense, I mean).
Not necessarily - the definition by the standard of how those string
functions make use of the single character functions might, if
sufficiently well written, resolve my current uncertainties about
how they should be used. The explanation should be sufficiently
well-written to allow implementors to implement it correctly, which
should be good enough for me to understand it.
We aren't saying anything different here. If there (only) is a
good chance that X is true, then it is not necessarily so that
X is true.
Post by j***@verizon.net
In particular, if one of the single-character functions is currently
defined in a way that makes it impossible to use it while
implementing the corresponding string function (a possibility that
is within the range of my current uncertainty about them), then that
would, in my opinion, be a defect in the current standard.
There is an open Defect Report on a question related to that.
Post by j***@verizon.net
If that
is the case, being forced to write up a description of the string
functions would allow the committee to realize that there was a
defect in the description of the corresponding single character
function, and correct it.
To me this seems a bit bass-ackwards. If the Standard has a
potential defect (as indeed has already been identified), it
should be fairly easy to determine whether there is in fact
a defect based on already known use cases. I fully expect
that here there will be no problem in identifying a defect,
either one of how the documentation is written or one of
how the semantics are defined. Trying to write a description
for some new functionality would only muddy the waters.

Much of the discussion we've had has been fairly abstract. I
think it would be good to make it more concrete. So here are
definitions for two of the functions you alluded to above:

/* needs <assert.h>, <uchar.h> */
size_t
mbsrtoc16s( char16_t *out, const char **in, size_t n, mbstate_t *state ){
    size_t m = 0;
    while( m < n ){
        size_t k = mbrtoc16( out+m, *in, 1, state );
        /**/ if( k == 0 )          return m;            /* null character converted */
        else if( k < (size_t)-3 )  m++, *in += k;       /* k bytes consumed, one unit stored */
        else if( k == (size_t)-3 ) m++;                 /* low surrogate from earlier bytes */
        else if( k == (size_t)-2 ) *in += 1;            /* incomplete character so far */
        else if( k == (size_t)-1 ) return (size_t)-1;   /* encoding error */
        else assert(0);
    }
    return m;
}

/* needs <limits.h>, <string.h>, <uchar.h> */
size_t
c16srtombs( char *out, const char16_t **in, size_t n, mbstate_t *state ){
    mbstate_t r = *state;
    char bytes[ MB_LEN_MAX ];
    size_t m = 0;

    do {
        size_t k = c16rtomb( bytes, **in, &r );
        if( k == (size_t)-1 ) return (size_t)-1;   /* encoding error */
        if( m+k > n ) return m;                    /* output buffer full */
        memcpy( out+m, bytes, k );
        m += k;
        *in += 1;
        *state = r;    /* commit the state only after a complete write */
    } while( m < 1 || out[m-1] != 0 );

    return m-1;        /* length not counting the terminating null */
}

A few comments:

(1) I didn't implement the special functionality for when 'out'
is null. It should be easy to add this if anyone wants it.

(2) It assumes the open DR for c16rtomb has been addressed
appropriately. More specifically, it makes use of a modified
c16rtomb() that handles surrogate pairs correctly.

(3) Obviously there are several performance improvements that
might be made. I wrote the code just very straightforwardly,
with no attention given to performance concerns.

(4) The code shown is tested and working, although it was not
tested as thoroughly as my normal process would call for. I
did test round trips for every UTF-16 code point.
Philipp Klaus Krause
2016-11-23 21:59:35 UTC
Permalink
You need to have a static mbstate_t in each function to use in case
state is 0.

Philipp
Tim Rentsch
2016-12-01 19:10:25 UTC
Permalink
Post by Philipp Klaus Krause
You need to have a static mbstate_t in each function to use in case
state is 0.
Yes, I should have mentioned that in my comments. Thank you
for remarking on it.
Philipp Klaus Krause
2016-11-23 22:13:42 UTC
Permalink
Post by j***@verizon.net
Note: to save space, I'm only going to refer to char16_t; but
everything I say about char16_t has an obvious char32_t analog. The
only significant asymmetry is that char16_t is guaranteed to have
multi-element encodings if __STDC_UTF_16__ is pre#defined by the
implementation, while char32_t will only have multi-element encodings
if __STDC_UTF_32__ is NOT pre#defined.
Subject: mbstoc16s(), mdbstoc32s(), c16stombs(), c32stombs()
Since mbrtoc16() and c16rtomb() both exist, while mbtoc16() and
c16tomb() do not, I think it would be more appropriate to define
char16_t functions analogous to mbsrtowcs() and wcsrtombs() rather
than mbstowcs() and wcstombs().
Unfortunately, the restartable functions are a lot heavier than the
non-restartable ones.
E.g. in SDCC for STM8 mbrtowc() has twice the code size (437 vs. 227
bytes) and three times the stack memory consumption (51 vs 17 bytes) of
mbtowc(). Since SDCC targets tiny 8-bit systems this difference matters
a lot.

But even for big systems there is a problem: At the WG14 meeting in
London some other compiler developers told me that the performance of
the restartable functions is unacceptable for some of their customers,
and that they recommend use of the non-restartable functions instead.

Philipp
Tim Rentsch
2016-12-01 19:16:34 UTC
Permalink
Post by Philipp Klaus Krause
Note: to save space, I'm only going to refer to char16_t; but
everything I say about char16_t has an obvious char32_t analog. The
only significant asymmetry is that char16_t is guaranteed to have
multi-element encodings if __STDC_UTF_16__ is pre#defined by the
implementation, while char32_t will only have multi-element encodings
if __STDC_UTF_32__ is NOT pre#defined.
Subject: mbstoc16s(), mdbstoc32s(), c16stombs(), c32stombs()
Since mbrtoc16() and c16rtomb() both exist, while mbtoc16() and
c16tomb() do not, I think it would be more appropriate to define
char16_t functions analogous to mbsrtowcs() and wcsrtombs() rather
than mbstowcs() and wcstombs().
Unfortunately, the restartable functions are a lot heavier than the
non-restartable ones.
E.g. in SDCC for STM8 mbrtowc() has twice the code size (437 vs. 227
bytes) and three times the stack memory consumption (51 vs 17 bytes) of
mbtowc(). Since SDCC targets tiny 8-bit systems this difference matters
a lot.
But even for big systems there is a problem: At the WG14 meeting in
London some other compiler developers told me that the performance of
the restartable functions is unacceptable for some of their customers,
and that they recommend use of the non-restartable functions instead.
I see how that would be true for character-at-a-time functions.
But surely the string-level functions could be optimized for
common cases so that most of the time they would run nearly as
fast as non-restartable ones. Do you agree with that, or is
there some fundamental reason why that cannot be so?
Philipp Klaus Krause
2016-11-23 22:16:28 UTC
Permalink
Post by Tim Rentsch
Would the functions you are suggesting provide only a performance
improvement (in both speed and size), or would they offer something
more? If it's something more than just performance, what is that?
If it is only for reasons of speed/size improvement, what sort of
gains can be expected?
The new functions would

1) Provide better performance and code size by being non-restartable
2) Be more convenient, by allowing conversion of whole strings at a time

Philipp
Tim Rentsch
2016-12-01 19:19:22 UTC
Permalink
Post by Philipp Klaus Krause
Post by Tim Rentsch
Would the functions you are suggesting provide only a performance
improvement (in both speed and size), or would they offer something
more? If it's something more than just performance, what is that?
If it is only for reasons of speed/size improvement, what sort of
gains can be expected?
The new functions would
1) Provide better performance and code size by being non-restartable
2) Be more convenient, by allowing conversion of whole strings at a time
Given these motivations I would want to see a detailed
description of the function(s) behavior before offering
any more definite opinion.
Philipp Klaus Krause
2016-12-09 14:43:09 UTC
Permalink
Post by Tim Rentsch
Given these motivations I would want to see a detailed
description of the function(s) behavior before offering
any more definite opinion.
http://colecovision.eu/stuff/proposal-mbstoc16s

Philipp
Tim Rentsch
2016-12-10 06:35:13 UTC
Permalink
Post by Philipp Klaus Krause
Post by Tim Rentsch
Given these motivations I would want to see a detailed
description of the function(s) behavior before offering
any more definite opinion.
http://colecovision.eu/stuff/proposal-mbstoc16s
I took a copy of this webpage and read through it.
More comments in next followup.

You might want to fix up the formatting a bit.
Philipp Klaus Krause
2016-11-23 22:33:15 UTC
Permalink
Post by Tim Rentsch
Do you have a specific task that you are hoping to accomplish, or
is the motivation just one of general goodness? If there is a
specific task (or tasks plural), what might that/those be?
SDCC is used to program 8-bit systems, including USB devices. Some
protocols supported on some 8-bit devices, such as USB, use UTF-16. I
want to provide a good way of dealing with UTF-16 for the users of SDCC.
IMO, convenient and efficient (code, size, memory usage, performance)
conversion functions from and to multibyte strings would be useful there.

Philipp
James R. Kuyper
2016-11-23 23:04:25 UTC
Permalink
Post by Philipp Klaus Krause
Post by Tim Rentsch
Do you have a specific task that you are hoping to accomplish, or
is the motivation just one of general goodness? If there is a
specific task (or tasks plural), what might that/those be?
SDCC is used to program 8-bit systems, including USB devices. Some
protocols supported on some 8-bit devices, such as USB, use UTF-16. I
want to provide a good way of dealing with UTF-16 for the users of SDCC.
IMO, convenient and efficient (code, size, memory usage, performance)
conversion functions from and to multibyte strings would be useful there.
A need to provide a way to deal with UTF-16 is sufficient to justify
creation of interfaces using char16_t; it is not sufficient to justify
needing routines to convert between char16_t strings and multi-byte
character strings. I'm not saying that you can't justify it - I expect
that to be easy - but you haven't done so yet. What kind of information
do you expect SDCC users to receive in mbs format, and need to convert
to c16s format, or vice versa? Why couldn't it be wcs rather than mbs?
Philipp Klaus Krause
2016-11-24 08:57:30 UTC
Permalink
Post by James R. Kuyper
Post by Philipp Klaus Krause
Post by Tim Rentsch
Do you have a specific task that you are hoping to accomplish, or
is the motivation just one of general goodness? If there is a
specific task (or tasks plural), what might that/those be?
SDCC is used to program 8-bit systems, including USB devices. Some
protocols supported on some 8-bit devices, such as USB, use UTF-16. I
want to provide a good way of dealing with UTF-16 for the users of SDCC.
IMO, convenient and efficient (code, size, memory usage, performance)
conversion functions from and to multibyte strings would be useful there.
A need to provide a way to deal with UTF-16 is sufficient to justify
creation of interfaces using char16_t; it is not sufficient to justify
needing routines to convert between char16_t strings and multi-byte
character strings. I'm not saying that you can't justify it - I expect
that to be easy - but you haven't done so yet. What kind of information
do you expect SDCC users to receive in mbs format, and need to convert
to c16s format, or vice versa? Why couldn't it be wcs rather than mbs?
mbs is what typically needs the least amount of memory, what developers
understand best, and what is best supported by the standard library and
third-party libraries. Having good conversion functions between mbs and
UTF-16 easily lets developers do most of their processing in mbs and
then just convert to / from UTF-16 where UTF-16 is needed.

Philipp
Tim Rentsch
2016-12-01 19:33:19 UTC
Permalink
Post by Philipp Klaus Krause
Post by James R. Kuyper
Post by Tim Rentsch
Do you have a specific task that you are hoping to accomplish, or
is the motivation just one of general goodness? If there is a
specific task (or tasks plural), what might that/those be?
SDCC is used to program 8-bit systems, including USB devices. Some
protocols supported on some 8-bit devices, such as USB, use UTF-16. I
want to provide a good way of dealing with UTF-16 for the users of SDCC.
IMO, convenient and efficient (code, size, memory usage, performance)
conversion functions from and to multibyte strings would be useful there.
A need to provide a way to deal with UTF-16 is sufficient to justify
creation of interfaces using char16_t; it is not sufficient to justify
needing routines to convert between char16_t strings and multi-byte
character strings. I'm not saying that you can't justify it - I expect
that to be easy - but you haven't done so yet. What kind of information
do you expect SDCC users to receive in mbs format, and need to convert
to c16s format, or vice versa? Why couldn't it be wcs rather than mbs?
mbs is what typically needs the least amount of memory, what developers
understand best, and what is best supported by the standard library and
third-party libraries. Having good conversion functions between mbs and
UTF-16 easily lets developers do most of their processing in mbs and
then just convert to / from UTF-16 where UTF-16 is needed.
ISTM that having a function for converting between wchar_t and
UTF-16 might be a reasonable design choice here. Obviously there
are tradeoffs between speed, transient space usage, long-term
space usage, and library size. What is important to optimize?
Tim Rentsch
2016-12-01 19:24:54 UTC
Permalink
Post by Tim Rentsch
Do you have a specific task that you are hoping to accomplish, or
is the motivation just one of general goodness? If there is a
specific task (or tasks plural), what might that/those be?
SDCC is used to program 8-bit systems, including USB devices. Some
protocols supported on some 8-bit devices, such as USB, use UTF-16. I
want to provide a good way of dealing with UTF-16 for the users of SDCC.
IMO, convenient and efficient (code, size, memory usage, performance)
conversion functions from and to multibyte strings would be useful there.
Is it important for the conversion functions you want to be
locale-dependent? Is there some reason you couldn't just write a
function (either locale-dependent or not) as part of a library
extension and advise your users to use that?
Philipp Klaus Krause
2016-12-09 14:39:45 UTC
Permalink
Post by Tim Rentsch
Post by Tim Rentsch
Do you have a specific task that you are hoping to accomplish, or
is the motivation just one of general goodness? If there is a
specific task (or tasks plural), what might that/those be?
SDCC is used to program 8-bit systems, including USB devices. Some
protocols supported on some 8-bit devices, such as USB, use UTF-16. I
want to provide a good way of dealing with UTF-16 for the users of SDCC.
IMO, convenient and efficient (code, size, memory usage, performance)
conversion functions from and to multibyte strings would be useful there.
Is it important for the conversion functions you want to be
locale-dependent? Is there some reason you couldn't just write a
function (either locale-dependent or not) as part of a library
extension and advise your users to use that?
There are no locale-dependencies in SDCC at all. But the new functions
for char16_t and char32_t would nicely fit into the C standard library,
as the standard library already has similar functions for wchar_t.

Philipp
Tim Rentsch
2016-12-10 07:08:53 UTC
Permalink
Post by Philipp Klaus Krause
Post by Tim Rentsch
Post by Tim Rentsch
Do you have a specific task that you are hoping to accomplish, or
is the motivation just one of general goodness? If there is a
specific task (or tasks plural), what might that/those be?
SDCC is used to program 8-bit systems, including USB devices. Some
protocols supported on some 8-bit devices, such as USB, use UTF-16. I
want to provide a good way of dealing with UTF-16 for the users of SDCC.
IMO, convenient and efficient (code, size, memory usage, performance)
conversion functions from and to multibyte strings would be useful there.
Is it important for the conversion functions you want to be
locale-dependent? Is there some reason you couldn't just write a
function (either locale-dependent or not) as part of a library
extension and advise your users to use that?
There are no locale-dependencies in SDCC at all.
It isn't obvious to me that this is even conforming, assuming
that the "" locale is UTF-8 or something like that. Isn't it the
case that the "C" locale is supposed to be a minimal environment
for C translation? In an operating environment where files
have 8-bit bytes, I would expect the C locale would be limited
to 8-bit characters.
Post by Philipp Klaus Krause
But the new functions
for char16_t and char32_t would nicely fit into the C standard library,
as the standard library already has similar functions for wchar_t.
Here are my reservations.

1. I'm not sure they do. The behavior for these types (and
char16_t in particular) may be markedly different than that
of wchar_t (eg, having multi-unit characters).

2. The functions you're proposing are meant to fill an
application need, but it isn't obvious what those needs
actually are. Any proposal to add such functions to the
Standard should include use cases from actual applications to
show their appropriateness and effectiveness.

3. I didn't see in your write-up any mention of what happens
when shifted states or multi-unit characters (eg in char16_t)
come up, especially near the end of a buffer. The concerns
don't arise with wchar_t but they do with char16_t (and in
principle with char32_t) so they must be addressed in the
semantic descriptions.

By the way, is it not the case that your implementation is
a free-standing implementation, not a hosted one? If that
is true you could just add whatever functions you want to
the header(s) in question. In fact doing that may be the
best way to start down the path of getting these functions
accepted in a future Standard.

Don't get me wrong, I'm not saying what you're suggesting is a
bad idea. But what I've seen doesn't yet convince me it's a
good idea either, which means the proposal needs more work.
My intention is to identify some areas that are weak and in
need of stronger arguments.
Philipp Klaus Krause
2016-12-09 11:20:25 UTC
Permalink
Here is a first proposal for the semantics of these functions:

http://colecovision.eu/stuff/proposal-mbstoc16s

Philipp