Post by j***@verizon.net
Post by Tim Rentsch
Post by j***@verizon.net
Post by Tim Rentsch
Note: to save space, I'm only going to refer to char16_t; but
everything I say about char16_t has an obvious char32_t analog. The
...
Post by Tim Rentsch
Post by j***@verizon.net
The justification for defining mbsrtoc16s(), despite the fact that
mbrtoc16() already exists, is the convenience factor. [...]
That is a plausible motivating factor. It is not however the only
such factor, and it may or may not be one as far as M. Krause is
concerned. I still am interested to hear his answer.
As to determining suitability, IMO saying some new feature would be
convenient is not by itself sufficient justification to warrant its
inclusion in the Standard. There really should be some further
motivation beyond that.
It would be helpful to know what the "further motivation" was that,
in your opinion, justified the addition of the corresponding wide
character string functions - or, is it your opinion that there was
no such motivation, and that they should therefore be dropped? To
my mind, convenience would seem to be the only justification for
those functions, and it also seems a sufficient justification, so
I've never worried about whether there's any further motivation.
I have no idea what arguments were offered to motivate those
functions, or indeed if there were any such arguments at all.
Since I don't know what arguments were offered, I'm not in a
position to say that the same arguments apply - maybe they
do, and maybe they don't, but either way I don't know.
Post by j***@verizon.net
Post by Tim Rentsch
Post by j***@verizon.net
Post by Tim Rentsch
I knew that. The question was meant in the sense of what
applications of those functions are expected.
I've no specific idea, but multi-byte character strings are
ubiquitous, and 16-bit encodings are not unheard of (I gather that
Windows uses UTF-16). Therefore, the possibility that there's a
shortage of applications which have a need to convert strings
between such encodings is not one I'm willing to bother worrying
about. YMMV. Since these functions don't exist yet, obviously
any such application is currently using some other method for
performing such conversions - however, I'd expect at least some of
the developers of such code to be happy to switch to C standard
library functions, as soon as those become sufficiently widely
available.
So, the bottom line is you really don't know?
I don't "know" anything about reality; all I have are varying
degrees of certainty about various statements about reality, none
of which is ever exactly 0% or exactly 100%. I'm sufficiently sure,
for the reasons given above, that the need exists, that I'm not
going to bother worrying about the possibility that it doesn't. If
those reasons aren't sufficient for you, that's fine - you should
investigate further - but I see no need to do so.
Let me put my question differently. Am I right in saying that
your earlier comments are just speculation, in the sense that
you don't have any concrete evidence or examples to offer?
Post by j***@verizon.net
Post by Tim Rentsch
Post by j***@verizon.net
I'd been thinking in terms of a direct conversion, but I suppose
using wchar_t as an intermediary might have advantages. However,
if that's the case, then the conversion routines between wchar_t
and char16_t should be added to the standard library.
What I was trying to do is make sure the semantics are well-defined
and also consistent with a wchar_t representation, not describe an
implementation. ...
The standard imposes some requirements on the representation of
multi-byte characters, wchar_t, char16_t or char32_t, but not enough
to mandate that conversions between any pair of those types are
invertible. If any of those conversions is not invertible, forcing
the translation between any two of those types to go through a
particular third type might make the conversion unnecessarily lossy.
I wouldn't mind it if the standard added words requiring that some
or all of those conversions be invertible.
What I think are the important round trips, i.e., those starting
and ending with multi-byte characters, cannot be made invertible
because some encodings (that the Standard wants to allow) are
inherently potentially redundant. But it might be enough to
say that a round-trip operation must be idempotent, i.e., applying
it twice is the same as applying it once.
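The idempotence idea can be made concrete with a small check. What follows is only a sketch under stated assumptions: round_trip() and is_idempotent() are hypothetical helper names, not functions from the Standard or from this thread, and the code relies on the C11 <uchar.h> functions mbrtoc16() and c16rtomb() behaving as described in the Standard for the current locale.

```c
#include <assert.h>
#include <limits.h>
#include <string.h>
#include <uchar.h>

/* Hypothetical helper (not from the thread): convert the multibyte string
   src to char16_t units and straight back, storing the terminated result
   in dst (capacity cap bytes). Returns 0 on success, -1 on failure. */
static int round_trip(const char *src, char *dst, size_t cap)
{
    mbstate_t in_state, out_state;
    size_t left, used = 0;

    assert(src != NULL && dst != NULL);
    left = strlen(src) + 1;             /* include the terminator */
    memset(&in_state, 0, sizeof in_state);
    memset(&out_state, 0, sizeof out_state);

    for (;;) {
        char16_t unit;
        char bytes[MB_LEN_MAX];
        size_t k, w;

        k = mbrtoc16(&unit, src, left, &in_state);
        if (k == (size_t)-1 || k == (size_t)-2)
            return -1;                  /* invalid or truncated input */
        if (k != (size_t)-3 && k != 0) {
            src += k;                   /* -3 consumes no input bytes */
            left -= k;
        }
        w = c16rtomb(bytes, unit, &out_state);
        if (w == (size_t)-1 || used + w > cap)
            return -1;                  /* unencodable unit or dst too small */
        memcpy(dst + used, bytes, w);
        used += w;
        if (k == 0)
            return 0;                   /* terminator has been written */
    }
}

/* Idempotence: one round trip and two round trips give the same bytes.
   The 256-byte buffers are an arbitrary limit chosen for the sketch. */
static int is_idempotent(const char *s)
{
    char once[256], twice[256];

    if (round_trip(s, once, sizeof once) != 0) return 0;
    if (round_trip(once, twice, sizeof twice) != 0) return 0;
    return strcmp(once, twice) == 0;
}
```

Note that is_idempotent() compares the first round trip with the second, not with the original input, so a redundant encoding that gets normalized on the first pass would still pass the check.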
Post by j***@verizon.net
Post by Tim Rentsch
Post by j***@verizon.net
I'm not entirely clear how to use the char16_t functions either,
despite having carefully read their complete description. That's
part of the reason why I wouldn't mind having string-oriented
versions written by the library implementor, rather than having to
write the equivalent code myself.
This seems like an odd thing to say. If you aren't sure how the
*c16* functions work, how can you evaluate whether some additional
functions should be added?
I think they should be added, precisely because I don't know how the
single character functions should be used, despite having read those
descriptions.
Have you tried to write any code that uses them? If you did that
might alleviate some of your uncertainty.
Post by j***@verizon.net
That means that those descriptions are at the very
least, obscure, so I'm probably not the only person unsure about the
matter. Anyone who implements the single-character functions must
understand how they are to be used, and should therefore be capable
of implementing the string-related functions better than I could. I
might not be able to evaluate whether they did it right, but I could
at least choose to trust that they've done so.
I don't think it follows necessarily that the functions are hard
to understand. It may be simply that you are distracted by other
things (e.g., your kids) and haven't had time to look at them
carefully. My guess is that in fact you would have no trouble
if you could take some time to look at them, perhaps if it were
important to do so, e.g., as part of a work assignment. I agree
the functions are a little weird but they are not that difficult.
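For readers following along, most of the weirdness lies in how mbrtoc16() overloads its size_t return value. The sketch below spells out the five cases as the C11 description gives them. The enum, classify() and decode_first() are names made up for illustration; only mbrtoc16() itself is a standard function.

```c
#include <assert.h>
#include <string.h>
#include <uchar.h>

/* Illustrative names (assumptions, not standard identifiers). */
enum mb16_result {
    MB16_NUL,           /* 0: the null character was decoded and stored     */
    MB16_UNIT,          /* 1..n: that many bytes consumed, one unit stored  */
    MB16_PENDING_UNIT,  /* (size_t)-3: unit stored, no input bytes consumed */
    MB16_INCOMPLETE,    /* (size_t)-2: input ended in mid-character         */
    MB16_INVALID        /* (size_t)-1: encoding error                       */
};

/* Map a raw mbrtoc16() return value onto the cases above. */
static enum mb16_result classify(size_t k)
{
    if (k == 0)          return MB16_NUL;
    if (k == (size_t)-1) return MB16_INVALID;
    if (k == (size_t)-2) return MB16_INCOMPLETE;
    if (k == (size_t)-3) return MB16_PENDING_UNIT;
    return MB16_UNIT;
}

/* Decode the first unit of the n-byte sequence s, starting from the
   initial conversion state, and report which case occurred. */
static enum mb16_result decode_first(const char *s, size_t n, char16_t *unit)
{
    mbstate_t state;

    assert(s != NULL && unit != NULL);
    memset(&state, 0, sizeof state);
    return classify(mbrtoc16(unit, s, n, &state));
}
```

The (size_t)-3 case is the one with no counterpart in mbrtowc(): it reports the second char16_t of a character that needs two units, produced from state alone.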
Post by j***@verizon.net
Post by Tim Rentsch
... Furthermore, if the already existing
multi-byte string conversion functions are any indication, new
functions for dealing with charXX_t strings will be defined in
terms of the more elementary charXX_t character conversion
functions. So if you don't yet understand the existing char16_t
conversion functions, there's a good chance that would carry over
to new char16_t string conversion functions that make use of
them (in the as-if sense, I mean).
Not necessarily - the definition by the standard of how those string
functions make use of the single character functions might, if
sufficiently well written, resolve my current uncertainties about
how they should be used. The explanation should be sufficiently
well-written to allow implementors to implement it correctly, which
should be good enough for me to understand it.
We aren't saying anything different here. If there (only) is a
good chance that X is true, then it is not necessarily so that
X is true.
Post by j***@verizon.net
In particular, if one of the single-character functions is currently
defined in a way that makes it impossible to use it while
implementing the corresponding string function (a possibility that
is within the range of my current uncertainty about them), then that
would, in my opinion, be a defect in the current standard.
There is an open Defect Report on a question related to that.
Post by j***@verizon.net
If that
is the case, being forced to write up a description of the string
functions would allow the committee to realize that there was a
defect in the description of the corresponding single character
function, and correct it.
To me this seems a bit bass-ackwards. If the Standard has a
potential defect (as indeed has already been identified), it
should be fairly easy to determine whether there is in fact
a defect based on already known use cases. I fully expect
that here there will be no problem in identifying a defect,
either one of how the documentation is written or one of
how the semantics are defined. Trying to write a description
for some new functionality would only muddy the waters.
Much of the discussion we've had has been fairly abstract. I
think it would be good to make it more concrete. So here are
definitions for two of the functions you alluded to above:
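#include <assert.h>
#include <limits.h>
#include <string.h>
#include <uchar.h>

size_t
mbsrtoc16s( char16_t *out, const char **in, size_t n, mbstate_t *state ){
    size_t m = 0;
    while( m < n ){
        /* feed one byte per call; -1, -2, -3 below are converted to size_t */
        size_t k = mbrtoc16( out+m, *in, 1, state );
        /**/ if( k ==  0 )  return m;           /* terminator stored, not counted */
        else if( k <  -3 )  m++, *in += k;      /* unit stored, k bytes consumed */
        else if( k == -3 )  m++;                /* second unit, no bytes consumed */
        else if( k == -2 )  *in += 1;           /* byte absorbed into *state */
        else if( k == -1 )  return -1;          /* encoding error */
        else                assert(0);
    }
    return m;
}

size_t
c16srtombs( char *out, const char16_t **in, size_t n, mbstate_t *state ){
    mbstate_t r = *state;               /* scratch state, committed per unit */
    char bytes[ MB_LEN_MAX ];
    size_t m = 0;
    do {
        size_t k = c16rtomb( bytes, **in, &r );
        if( k == -1 )  return -1;       /* unit cannot be encoded */
        if( m+k > n )  return m;        /* no room; *in and *state unchanged */
        memcpy( out+m, bytes, k );
        m += k;
        *in += 1;
        *state = r;
    } while( m < 1  ||  out[m-1] != 0 );
    return m-1;                         /* terminator is not counted */
}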
A few comments:
(1) I didn't implement the special functionality for when 'out'
is null. It should be easy to add this if anyone wants it.
(2) It assumes the open DR for c16rtomb has been addressed
appropriately. More specifically, it makes use of a modified
c16rtomb() that handles surrogate pairs correctly.
(3) Obviously there are several performance improvements that
might be made. I wrote the code just very straightforwardly,
with no attention given to performance concerns.
(4) The code shown is tested and working, although it was not
tested as thoroughly as my normal process would call for. I
did test round trips for every UTF-16 code point.
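Regarding comment (1), here is one way the null 'out' case for mbsrtoc16s() might look, written as a separate counting function. This is a sketch and not part of the posted code: the name mbsrtoc16s_count is hypothetical, and the decision to leave both *in and *state untouched is an assumption modeled loosely on how mbsrtowcs() treats a null destination, not something settled in this thread.

```c
#include <assert.h>
#include <string.h>
#include <uchar.h>

/* Hypothetical counting variant: returns the number of char16_t units the
   conversion of *in would produce (terminator not counted), or (size_t)-1
   on an encoding error. Neither *in nor *state is modified (assumption). */
static size_t mbsrtoc16s_count(const char **in, const mbstate_t *state)
{
    mbstate_t s;
    const char *p;
    size_t m = 0;

    assert(in != NULL && *in != NULL && state != NULL);
    s = *state;                         /* work on a private copy */
    p = *in;

    for (;;) {
        char16_t unit;                  /* result is discarded */
        size_t k = mbrtoc16(&unit, p, 1, &s);

        if (k == 0)          return m;            /* reached the terminator */
        if (k == (size_t)-1) return (size_t)-1;   /* encoding error */
        if (k == (size_t)-2) { p += 1; continue; }/* byte absorbed in state */
        if (k != (size_t)-3) p += k;              /* -3 consumes no bytes */
        m++;                                      /* one unit produced */
    }
}
```

A caller could use the count to size a buffer of count+1 units and then call the converting function with the original pointer and state.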