Discussion:
mbstoc16s(), mbstoc32s(), c16stombs(), c32stombs()
Philipp Klaus Krause
2016-11-07 20:13:49 UTC
I suggest adding functions for converting strings between char, char16_t
and char32_t encodings.
These would be similar to the existing functions from 7.22.8 for
conversion between char and wchar_t, and could be added to uchar.h.

C has quite a few functions for converting between char and wchar_t,
though some of them are thread-unsafe (the 7.22.7 ones) or inefficient¹
(7.29.6.3, 7.29.6.4). The 7.22.8 ones look OK.

On the other hand for converting between char, char16_t, char32_t there
are only the functions from 7.28.1. They do not convert whole strings at
a time and are inefficient¹.
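
As an illustration of what the 7.28.1 functions require today, here is a minimal sketch of a whole-string conversion built on mbrtoc16(). The helper name mbs_to_c16s_sketch and its truncation and error conventions are assumptions made for this example; they are not part of any proposal or of the standard.

```c
#include <stddef.h>
#include <string.h>
#include <uchar.h>

/* Hypothetical helper, not a standard function.
   Converts the multibyte string src (in the current locale's encoding)
   into at most dstlen char16_t elements.
   Returns the number of elements stored, not counting the terminating
   null, or (size_t)-1 if the input is invalid or ends mid-character.
   If dstlen is too small, the output is truncated and not terminated. */
size_t mbs_to_c16s_sketch(char16_t *dst, const char *src, size_t dstlen)
{
    mbstate_t st;
    memset(&st, 0, sizeof st);          /* caller-local initial state */
    size_t remaining = strlen(src) + 1; /* include the terminating null */
    size_t written = 0;

    while (written < dstlen) {
        char16_t c;
        size_t r = mbrtoc16(&c, src, remaining, &st);

        if (r == (size_t)-1 || r == (size_t)-2)
            return (size_t)-1;          /* invalid or incomplete sequence */

        if (r == (size_t)-3) {
            /* A further element of the previous character (for example
               a low surrogate); no input bytes were consumed. */
            dst[written++] = c;
            continue;
        }

        dst[written] = c;
        if (r == 0)
            return written;             /* terminating null was stored */

        written++;
        src += r;
        remaining -= r;
    }
    return written;                     /* out of room, not terminated */
}
```

Every call site has to repeat this handling of the (size_t)-1, (size_t)-2 and (size_t)-3 return values, which is the kind of boilerplate a string-level function would absorb.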

So converting strings between char, char16_t and char32_t is a natural
addition, and using an interface similar to 7.22.8 seems like a good
choice to me.

What do you think? A reasonable addition to C? Worth writing a proposal
for C2X? Any better alternatives?

Philipp

¹ Restartable functions can handle partial characters as input, which
places a substantial burden on implementations, affecting both speed
and code size.
James R. Kuyper
2016-11-07 21:07:51 UTC
Post by Philipp Klaus Krause
I suggest to add functions for converting strings between char, char16_t
and char32_t encodings.
These would be similar to the existing functions from 7.22.8 for
conversion between char and wchar_t, and could be added to uchar.h.
C has quite some functions for converting between char and wchar_t.
Though some of them are thread-unsafe (the 7.22.7 ones) or inefficient¹
(7.29.6.3, 7.29.6.4). But the 7.22.8 ones look ok.
On the other hand for converting between char, char16_t, char32_t there
are only the functions from 7.28.1. They do not convert whole strings at
a time and are inefficient¹.
So converting strings between char, char16_t and char32_t is a natural
addition, and using an interface similar to 7.22.8 seems like a good
choice to me.
What do you think? A reasonable addition to C? Worth writing a proposal
for C2X? Any better alternatives?
Philipp
¹ Restartable functions can handle partial characters as input which
comes with a substantial burden on implementations, affecting both speed
and code size substantially.
It seems reasonable to me. Offhand, it's not obvious why this wasn't
done in C2011.
Tim Rentsch
2016-11-12 16:10:50 UTC
Post by Philipp Klaus Krause
I suggest to add functions for converting strings between char, char16_t
and char32_t encodings.
These would be similar to the existing functions from 7.22.8 for
conversion between char and wchar_t, and could be added to uchar.h.
C has quite some functions for converting between char and wchar_t.
Though some of them are thread-unsafe (the 7.22.7 ones) or inefficient[1]
(7.29.6.3, 7.29.6.4). But the 7.22.8 ones look ok.
On the other hand for converting between char, char16_t, char32_t there
are only the functions from 7.28.1. They do not convert whole strings at
a time and are inefficient[1].
So converting strings between char, char16_t and char32_t is a natural
addition, and using an interface similar to 7.22.8 seems like a good
choice to me.
What do you think? A reasonable addition to C? Worth writing a proposal
for C2X? Any better alternatives?
Let me offer some questions.

Would the functions you are suggesting provide only a performance
improvement (in both speed and size), or would they offer something
more? If it's something more than just performance, what is that?
If it is only for reasons of speed/size improvement, what sort of
gains can be expected?

Do you have a specific task that you are hoping to accomplish, or
is the motivation just one of general goodness? If there is a
specific task (or tasks plural), what might that/those be?

The wchar_t type is supposed to encode every character in a single
wchar_t element, ie, no multi-element encodings. The char16_t and
char32_t encodings might not have that property. (I think in most
cases char32_t will not have multi-element encodings and char16_t
will have multi-element encodings, but in principle I think both of
them are allowed multi-element encodings.) How does this affect
the behavior of functions like the ones you are suggesting? What
are the implications for return values, error conditions, state
saving, etc?

What function prototypes and semantic descriptions would you
specifically suggest?

I don't know the answers to any of these questions. Can you
provide some? Until I know more I don't feel able to respond
to your questions in any useful way.
j***@verizon.net
2016-11-12 18:17:33 UTC
Note: to save space, I'm only going to refer to char16_t; but everything I say about char16_t has an obvious char32_t analog. The only significant asymmetry is that char16_t is guaranteed to have multi-element encodings if __STDC_UTF_16__ is pre#defined by the implementation, while char32_t will only have multi-element encodings if __STDC_UTF_32__ is NOT pre#defined.
Subject: mbstoc16s(), mbstoc32s(), c16stombs(), c32stombs()
Since mbrtoc16() and c16rtomb() both exist, while mbtoc16() and c16tomb() do not, I think it would be more appropriate to define char16_t functions analogous to mbsrtowcs() and wcsrtombs() rather than mbstowcs() and wcstombs().
Post by Philipp Klaus Krause
I suggest to add functions for converting strings between char, char16_t
and char32_t encodings.
These would be similar to the existing functions from 7.22.8 for
conversion between char and wchar_t, and could be added to uchar.h.
C has quite some functions for converting between char and wchar_t.
Though some of them are thread-unsafe (the 7.22.7 ones) or inefficient[1]
(7.29.6.3, 7.29.6.4). But the 7.22.8 ones look ok.
On the other hand for converting between char, char16_t, char32_t there
are only the functions from 7.28.1. They do not convert whole strings at
a time and are inefficient[1].
So converting strings between char, char16_t and char32_t is a natural
addition, and using an interface similar to 7.22.8 seems like a good
choice to me.
What do you think? A reasonable addition to C? Worth writing a proposal
for C2X? Any better alternatives?
Let me offer some questions.
Would the functions you are suggesting provide only a performance
improvement (in both speed and size), or would they offer something
more?
Performance improvement compared with what? As I understand it, he's asking about functions that do things no existing standard library function currently does for char16_t: handle entire strings rather than single characters.

As I understand his suggestion (as modified by me above), mbsrtoc16s() would have essentially the same relationship to mbrtoc16() that mbsrtowcs() has to mbrtowc(), while c16srtombs() would have essentially the same relationship to c16rtomb() that wcsrtombs() has to wcrtomb().
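
To make that relationship concrete for the reverse direction, here is a rough sketch of a string-level wrapper around c16rtomb(). The name c16s_to_mbs_sketch, its three-parameter signature and its error convention are assumptions for illustration only; a real c16srtombs() modelled on wcsrtombs() would presumably also take a const char16_t ** source and an mbstate_t *.

```c
#include <limits.h>
#include <stddef.h>
#include <string.h>
#include <uchar.h>

/* Hypothetical helper, not a standard function.
   Converts the null-terminated char16_t string src into a multibyte
   string in dst, which has room for dstlen bytes.
   Returns the number of bytes stored, not counting the terminating
   null, or (size_t)-1 on an encoding error or if dst is too small. */
size_t c16s_to_mbs_sketch(char *dst, const char16_t *src, size_t dstlen)
{
    mbstate_t st;
    memset(&st, 0, sizeof st);       /* caller-local initial state */
    size_t written = 0;

    for (;;) {
        char buf[MB_LEN_MAX];        /* large enough for one c16rtomb() call */
        size_t r = c16rtomb(buf, *src, &st);

        if (r == (size_t)-1)
            return (size_t)-1;       /* element not representable */
        if (r > dstlen - written)
            return (size_t)-1;       /* output buffer too small */

        /* r may be 0 when the element was only absorbed into the state,
           which is how some implementations treat a high surrogate. */
        memcpy(dst + written, buf, r);

        if (*src == 0)
            return written + r - 1;  /* exclude the terminating null byte */

        written += r;
        src++;
    }
}
```

The loop relies on the conversion state to carry a pending element from one call to the next; whether c16rtomb() actually behaves that way for surrogate pairs depends on the implementation.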
... If it's something more than just performance, what is that?
If it is only for reasons of speed/size improvement, what sort of
gains can be expected?
Do you have a specific task that you are hoping to accomplish, or
is the motivation just one of general goodness? If there is a
specific task (or tasks plural), what might that/those be?
Convert strings encoded using the encodings associated with char16_t into multi-byte character strings, and vice versa.
The wchar_t type is supposed to encode every character in a single
wchar_t element, ie, no multi-element encodings. The char16_t and
char32_t encodings might not have that property. (I think in most
cases char32_t will not have multi-element encodings and char16_t
will have multi-element encodings, but in principle I think both of
them are allowed multi-element encodings.) How does this affect
the behavior of functions like the ones you are suggesting? What
are the implications for return values, error conditions, state
saving, etc?
The implications are that the string processing functions should handle multi-element encodings correctly on input, and should create multi-element encodings correctly on output. What does "correctly" mean? I'm not entirely sure, which is one reason I'd like to have a standard library function written by someone who does know. Your wording seems to imply that there might be multiple different ways to do this "correctly". Could you describe the possibilities that you see? The descriptions for these functions could mandate one particular choice from among the possibilities, or the functions could take one or more additional arguments to determine which possibility to implement.

Since mbrtoc16() and mbrtoc32() return more different error codes than mbrtowc(), the corresponding string functions will probably also need to report more error conditions. It's Philipp's suggestion, I'll let him do the work of figuring out what they should be.
Tim Rentsch
2016-11-15 19:23:21 UTC
Note: to save space, I'm only going to refer to char16_t; but
everything I say about char16_t has an obvious char32_t analog. The
only significant asymmetry is that char16_t is guaranteed to have
multi-element encodings if __STDC_UTF_16__ is pre#defined by the
implementation, while char32_t will only have multi-element encodings
if __STDC_UTF_32__ is NOT pre#defined.
Subject: mbstoc16s(), mbstoc32s(), c16stombs(), c32stombs()
Since mbrtoc16() and c16rtomb() both exist, while mbtoc16() and
c16tomb() do not, I think it would be more appropriate to define
char16_t functions analogous to mbsrtowcs() and wcsrtombs() rather
than mbstowcs() and wcstombs().
Post by Philipp Klaus Krause
I suggest to add functions for converting strings between char, char16_t
and char32_t encodings.
These would be similar to the existing functions from 7.22.8 for
conversion between char and wchar_t, and could be added to uchar.h.
C has quite some functions for converting between char and wchar_t.
Though some of them are thread-unsafe (the 7.22.7 ones) or inefficient[1]
(7.29.6.3, 7.29.6.4). But the 7.22.8 ones look ok.
On the other hand for converting between char, char16_t, char32_t there
are only the functions from 7.28.1. They do not convert whole strings at
a time and are inefficient[1].
So converting strings between char, char16_t and char32_t is a natural
addition, and using an interface similar to 7.22.8 seems like a good
choice to me.
What do you think? A reasonable addition to C? Worth writing a proposal
for C2X? Any better alternatives?
Let me offer some questions.
Would the functions you are suggesting provide only a performance
improvement (in both speed and size), or would they offer something
more?
Performance improvement compared with what? [...]
Compared to providing the desired functionality in portable
C using the already existing standard functions (which I
assume is possible, but I haven't checked carefully which
is partly why I asked the question).
As I understand his suggestion (as modified by me above),
mbsrtoc16s() would have essentially the same relationship to
mbrtoc16() that mbsrtowcs() has to mbrtowc(), while c16srtombs() would
have essentially the same relationship to c16rtomb() that wcsrtombs()
has to wcrtomb().
... If it's something more than just performance, what is that?
If it is only for reasons of speed/size improvement, what sort of
gains can be expected?
Do you have a specific task that you are hoping to accomplish, or
is the motivation just one of general goodness? If there is a
specific task (or tasks plural), what might that/those be?
Convert strings encoded using the encodings associated with char16_t
into multi-byte character strings, and vice versa.
I knew that. The question was meant in the sense of what
applications of those functions are expected.
The wchar_t type is supposed to encode every character in a single
wchar_t element, ie, no multi-element encodings. The char16_t and
char32_t encodings might not have that property. (I think in most
cases char32_t will not have multi-element encodings and char16_t
will have multi-element encodings, but in principle I think both of
them are allowed multi-element encodings.) How does this affect
the behavior of functions like the ones you are suggesting? What
are the implications for return values, error conditions, state
saving, etc?
The implications are that the string processing functions should
handle multi-element encodings correctly on input, and should create
multi-element encodings correctly on output. What does "correctly"
mean? I'm not entirely sure, which is one reason I'd like to have a
standard library function written by someone who does know. Your
wording seems to imply that there might be multiple different ways to
do this "correctly". Could you describe the possibilities that you
see? The descriptions for these functions could mandate one
particular choice from among the possibilities, or the functions
could take one or more additional arguments to determine which
possibility to implement.
I was assuming, without really thinking about it deeply, that the
result should be "as if" an input string were converted to a
null-terminated wchar_t array, and the null-terminated wchar_t
array were then converted to the output type, and that this
transformation is unambiguous. Furthermore I think the two
string conversions should be equivalent to converting character
by character, using the already existing standard conversion
functions. Here again I'm not sure these assumptions are held
to be correct, which is partly why I ask the questions I did.
Since mbrtoc16() and mbrtoc32() return more different error codes
than mbrtowc(), the corresponding string functions will probably also
need to report more error conditions. It's Philipp's suggestion,
I'll let him do the work of figuring out what they should be.
Yes, I was expecting that since I was replying to his posting
that he would be answering the questions.
j***@verizon.net
2016-11-15 20:20:57 UTC
Post by Tim Rentsch
Note: to save space, I'm only going to refer to char16_t; but
everything I say about char16_t has an obvious char32_t analog. The
...
...
Post by Tim Rentsch
Post by Tim Rentsch
Would the functions you are suggesting provide only a performance
improvement (in both speed and size), or would they offer something
more?
Performance improvement compared with what? [...]
Compared to providing the desired functionality in portable
C using the already existing standard functions (which I
assume is possible, but I haven't checked carefully which
is partly why I asked the question).
As I understand his suggestion (as modified by me above),
mbsrtoc16s() would have essentially the same relationship to
mbrtoc16() that mbsrtowcs() has to mbrtowc(), while c16srtombs() would
have essentially the same relationship to c16rtomb() that wcsrtombs()
has to wcrtomb().
The justification for defining mbsrtoc16s(), despite the fact that mbrtoc16()
already exists, is the convenience factor. That justification has to be at
least as good as the justification for defining mbsrtowcs(), despite the fact
that mbrtowc() exists, since mbrtoc16() is more complicated to use than
mbrtowc(). (Similarly for the reverse functions).

...
Post by Tim Rentsch
Convert strings encoded using the encoding associated with char16_t
into multi-byte character strings, and vice versa.
I knew that. The question was meant in the sense of what
applications of those functions are expected.
I've no specific idea, but multi-byte character strings are ubiquitous, and
16-bit encodings are not unheard of (I gather that Windows uses UTF-16).
Therefore, the possibility that there's a shortage of applications which have a
need to convert strings between such encodings is not one I'm willing to bother
worrying about. YMMV. Since these functions don't exist yet, obviously any such
application is currently using some other method for performing such
conversions - however, I'd expect at least some of the developers of such code
to be happy to switch to a C standard library function, as soon as they became
sufficiently widely available.

...
Post by Tim Rentsch
The implications are that the string processing functions should
handle multi-element encodings correctly on input, and should create
multi-element encodings correctly on output. What does "correctly"
mean? I'm not entirely sure, which is one reason I'd like to have a
standard library function written by someone who does know. Your
wording seems to imply that there might be multiple different ways to
do this "correctly". Could you describe the possibilities that you
see? The descriptions for these functions could mandate one
particular choice from among the possibilities, or the functions
could take one or more additional arguments to determine which
possibility to implement.
I was assuming, without really thinking about it deeply, that the
result should be "as if" an input string were converted to a
null-terminated wchar_t array, and the null-terminated wchar_t
array were then converted to the output type, and that this
transformation is unambiguous.
I'd been thinking in terms of a direct conversion, but I suppose using wchar_t
as an intermediary might have advantages. However, if that's the case, then the
conversion routines between wchar_t and char16_t should be added to the
standard library.
Post by Tim Rentsch
... Furthermore I think the two
string conversions should be equivalent to converting character
by character, using the already existing standard conversion
functions. Here again I'm not sure these assumptions are held
to be correct, which is partly why I ask the questions I did.
I'm not entirely clear how to use the char16_t functions either, despite having
carefully read their complete description. That's part of the reason why I
wouldn't mind having string-oriented versions written by the library
implementor, rather than having to write the equivalent code myself.
Jakob Bohm
2016-11-16 15:16:12 UTC
Post by j***@verizon.net
Post by Tim Rentsch
Note: to save space, I'm only going to refer to char16_t; but
everything I say about char16_t has an obvious char32_t analog. The
...
...
Post by Tim Rentsch
Post by Tim Rentsch
Would the functions you are suggesting provide only a performance
improvement (in both speed and size), or would they offer something
more?
Performance improvement compared with what? [...]
Compared to providing the desired functionality in portable
C using the already existing standard functions (which I
assume is possible, but I haven't checked carefully which
is partly why I asked the question).
As I understand his suggestion (as modified by me above),
mbsrtoc16s() would have essentially the same relationship to
mbrtoc16() that mbsrtowcs() has to mbrtowc(), while c16srtombs() would
have essentially the same relationship to c16rtomb() that wcsrtombs()
has to wcrtomb().
The justification for defining mbsrtoc16s(), despite the fact that mbrtoc16()
already exists, is the convenience factor. That justification has to be at
least as good as the justification for defining mbsrtowcs(), despite the fact
that mbrtowc() exists, since mbrtoc16() is more complicated to use than
mbrtowc(). (Similarly for the reverse functions).
There is also the issue of thread safety without the overhead of
setting up per thread state buffers inside the implementation.

This is because char-by-char functions need to use internal state to
track the various multi-element cases (such as UTF-16 surrogates and
UNICODE non-composed accent modifiers). Whereas a whole-string
function can rely on getting all elements of such items in one call,
thus needing only regular local variables.
Post by j***@verizon.net
...
Post by Tim Rentsch
Convert strings encoded using the encoding associated with char16_t
into multi-byte character strings, and vice versa.
I knew that. The question was meant in the sense of what
applications of those functions are expected.
I've no specific idea, but multi-byte character strings are ubiquitous, and
16-bit encodings are not unheard of (I gather that Windows uses UTF-16).
The Win32, Win64 and (now rare) WinCE platforms prefer or require the
UTF-16 encoding of UNICODE because the oldest release of their common
API (developed 1989 to 1992) was originally designed for the UCS-2
character set.

Sun/Oracle/Open Java JDK calls to and from C (the JNI API) use UTF-16
and/or "modified UTF-8", which is a naive mapping of individual
char16_t elements to their (sometimes invalid) UTF-8 encodings, plus a
special encoding of (char16_t)0 string elements.

The (now defunct) Symbian OS also used UTF-16 as its system character
set, but that OS required native programs to be in C++, not C.
Post by j***@verizon.net
Therefore, the possibility that there's a shortage of applications which have a
need to convert strings between such encodings is not one I'm willing to bother
worrying about. YMMV. Since these functions don't exist yet, obviously any such
application is currently using some other method for performing such
conversions - however, I'd expect at least some of the developers of such code
to be happy to switch to a C standard library function, as soon as they became
sufficiently widely available.
...
Post by Tim Rentsch
The implications are that the string processing functions should
handle multi-element encodings correctly on input, and should create
multi-element encodings correctly on output. What does "correctly"
mean? I'm not entirely sure, which is one reason I'd like to have a
standard library function written by someone who does know. Your
wording seems to imply that there might be multiple different ways to
do this "correctly". Could you describe the possibilities that you
see? The descriptions for these functions could mandate one
particular choice from among the possibilities, or the functions
could take one or more additional arguments to determine which
possibility to implement.
I was assuming, without really thinking about it deeply, that the
result should be "as if" an input string were converted to a
null-terminated wchar_t array, and the null-terminated wchar_t
array were then converted to the output type, and that this
transformation is unambiguous.
I'd been thinking in terms of a direct conversion, but I suppose using wchar_t
as an intermediary might have advantages. However, if that's the case, then the
conversion routines between wchar_t and char16_t should be added to the
standard library.
Note that the ability or inability of wchar_t strings to represent all
values of char32_t strings is probably implementation defined.
Though many platforms probably define their wchar_t as identical to
char16_t or char32_t, I don't think this is required behavior.
Post by j***@verizon.net
Post by Tim Rentsch
... Furthermore I think the two
string conversions should be equivalent to converting character
by character, using the already existing standard conversion
functions. Here again I'm not sure these assumptions are held
to be correct, which is partly why I ask the questions I did.
I'm not entirely clear how to use the char16_t functions either, despite having
carefully read their complete description. That's part of the reason why I
wouldn't mind having string-oriented versions written by the library
implementor, rather than having to write the equivalent code myself.
Enjoy

Jakob
--
Jakob Bohm, CIO, Partner, WiseMo A/S. https://www.wisemo.com
Transformervej 29, 2860 Søborg, Denmark. Direct +45 31 13 16 10
This public discussion message is non-binding and may contain errors.
WiseMo - Remote Service Management for PCs, Phones and Embedded
Tim Rentsch
2016-11-22 10:31:08 UTC
Post by j***@verizon.net
Post by Tim Rentsch
Note: to save space, I'm only going to refer to char16_t; but
everything I say about char16_t has an obvious char32_t analog. The
...
...
Post by Tim Rentsch
Post by Tim Rentsch
Would the functions you are suggesting provide only a performance
improvement (in both speed and size), or would they offer something
more?
Performance improvement compared with what? [...]
Compared to providing the desired functionality in portable
C using the already existing standard functions (which I
assume is possible, but I haven't checked carefully which
is partly why I asked the question).
As I understand his suggestion (as modified by me above),
mbsrtoc16s() would have essentially the same relationship to
mbrtoc16() that mbsrtowcs() has to mbrtowc(), while c16srtombs() would
have essentially the same relationship to c16rtomb() that wcsrtombs()
has to wcrtomb().
The justification for defining mbsrtoc16s(), despite the fact that
mbrtoc16() already exists, is the convenience factor. [...]
That is a plausible motivating factor. It is not however the only
such factor, and it may or may not be one as far as M. Krause is
concerned. I still am interested to hear his answer.

As to determining suitability, IMO saying some new feature would be
convenient is not by itself sufficient justification to warrant its
inclusion in the Standard. There really should be some further
motivation beyond that.

Note by the way that the use cases for, eg, mbstowcs and wcstombs
may not carry over to the char16_t/char32_t types.

I should add a clarifying remark here. ISTM you are mostly
interested in making a case that some new functions should be
added, whereas I am mostly interested in gaining a better
understanding of what is being proposed exactly, and why. At this
point I am not taking a stance either way as to whether some
functions along these lines would be worth adding to the standard
library.
Post by j***@verizon.net
Post by Tim Rentsch
Convert strings encoded using the encoding associated with char16_t
into multi-byte character strings, and vice versa.
I knew that. The question was meant in the sense of what
applications of those functions are expected.
I've no specific idea, but multi-byte character strings are
ubiquitous, and 16-bit encodings are not unheard of (I gather that
Windows uses UTF-16). Therefore, the possibility that there's a
shortage of applications which have a need to convert strings
between such encodings is not one I'm willing to bother worrying
about. YMMV. Since these functions don't exist yet, obviously
any such application is currently using some other method for
performing such conversions - however, I'd expect at least some of
the developers of such code to be happy to switch to a C standard
library function, as soon as they became sufficiently widely
available.
So, the bottom line is you really don't know?
Post by j***@verizon.net
Post by Tim Rentsch
The implications are that the string processing functions should
handle multi-element encodings correctly on input, and should
create multi-element encodings correctly on output. What does
"correctly" mean? I'm not entirely sure, which is one reason
I'd like to have a standard library function written by someone
who does know. Your wording seems to imply that there might be
multiple different ways to do this "correctly". Could you
describe the possibilities that you see? The descriptions for
these functions could mandate one particular choice from among
the possibilities, or the functions could take one or more
additional arguments to determine which possibility to
implement.
I was assuming, without really thinking about it deeply, that the
result should be "as if" an input string were converted to a
null-terminated wchar_t array, and the null-terminated wchar_t
array were then converted to the output type, and that this
transformation is unambiguous.
I'd been thinking in terms of a direct conversion, but I suppose
using wchar_t as an intermediary might have advantages. However,
if that's the case, then the conversion routines between wchar_t
and char16_t should be added to the standard library.
What I was trying to do is make sure the semantics are well-defined
and also consistent with a wchar_t representation, not describe an
implementation. I think it suffices to translate a multi-byte
string to a wchar_t string, then translate one multi-byte character
at a time, first from wchar_t to multi-byte, then from multi-byte
to char16_t (or char32_t).

Note by the way that translating from multi-byte to wchar_t and
then from wchar_t back to multi-byte is not guaranteed to be an
identity mapping.
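
A quick way to probe this on a given implementation and locale is to round-trip a string through wchar_t and compare the bytes. The sketch below is only an illustration; the function name mb_round_trip_is_identity is made up for this example. One way the comparison can fail, as far as I can tell, is a state-dependent encoding in which the input contains redundant shift sequences that the reverse conversion does not reproduce.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper, not a standard function.
   Returns true if converting s to a wchar_t string and back yields
   exactly the same bytes in the current locale. Returns false if the
   bytes differ, if s is not a valid multibyte string, or if memory
   allocation fails. */
bool mb_round_trip_is_identity(const char *s)
{
    mbstate_t st;
    const char *p = s;

    memset(&st, 0, sizeof st);
    size_t wlen = mbsrtowcs(NULL, &p, 0, &st);   /* count only */
    if (wlen == (size_t)-1)
        return false;

    wchar_t *w = malloc((wlen + 1) * sizeof *w);
    if (w == NULL)
        return false;

    p = s;
    memset(&st, 0, sizeof st);
    mbsrtowcs(w, &p, wlen + 1, &st);             /* multibyte -> wide */

    const wchar_t *q = w;
    memset(&st, 0, sizeof st);
    size_t blen = wcsrtombs(NULL, &q, 0, &st);   /* count only */

    bool same = false;
    if (blen != (size_t)-1) {
        char *back = malloc(blen + 1);
        if (back != NULL) {
            q = w;
            memset(&st, 0, sizeof st);
            wcsrtombs(back, &q, blen + 1, &st);  /* wide -> multibyte */
            same = strcmp(back, s) == 0;
            free(back);
        }
    }
    free(w);
    return same;
}
```

A true result for particular inputs says nothing about the encoding in general, so this is a spot check rather than a proof of invertibility.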

The other direction depends on the resolution of the Defect Report
on c16rtomb, which I think is still open.
Post by j***@verizon.net
Post by Tim Rentsch
... Furthermore I think the two
string conversions should be equivalent to converting character
by character, using the already existing standard conversion
functions. Here again I'm not sure these assumptions are held
to be correct, which is partly why I ask the questions I did.
I'm not entirely clear how to use the char16_t functions either,
despite having carefully read their complete description. That's
part of the reason why I wouldn't mind having string-oriented
versions written by the library implementor, rather than having to
write the equivalent code myself.
This seems like an odd thing to say. If you aren't sure how the
*c16* functions work, how can you evaluate whether some additional
functions should be added? Furthermore, if the already existing
multi-byte string conversion functions are any indication, new
functions for dealing with charXX_t strings will be defined in
terms of the more elementary charXX_t character conversion
functions. So if you don't yet understand the existing char16_t
conversion functions, there's a good chance that would carry over
to new char16_t string conversion functions that make use of
them (in the as-if sense, I mean).
j***@verizon.net
2016-11-22 17:27:36 UTC
Post by Tim Rentsch
Post by j***@verizon.net
Post by Tim Rentsch
Note: to save space, I'm only going to refer to char16_t; but
everything I say about char16_t has an obvious char32_t analog. The
...
Post by Tim Rentsch
Post by j***@verizon.net
The justification for defining mbsrtoc16s(), despite the fact that
mbrtoc16() already exists, is the convenience factor. [...]
That is a plausible motivating factor. It is not however the only
such factor, and it may or may not be one as far as M. Krause is
concerned. I still am interested to hear his answer.
As to determining suitability, IMO saying some new feature would be
convenient is not by itself sufficient justification to warrant its
inclusion in the Standard. There really should be some further
motivation beyond that.
It would be helpful to know what the "further motivation" was that, in your
opinion, justified the addition of the corresponding wide character string
functions - or, is it your opinion that there was no such motivation, and that
they should therefore be dropped? To my mind, convenience would seem to be the
only justification for those functions, and it also seems a sufficient
justification, so I've never worried about whether there's any further
motivation.
Post by Tim Rentsch
Post by j***@verizon.net
Post by Tim Rentsch
I knew that. The question was meant in the sense of what
applications of those functions are expected.
I've no specific idea, but multi-byte character strings are
ubiquitous, and 16-bit encodings are not unheard of (I gather that
Windows uses UTF-16). Therefore, the possibility that there's a
shortage of applications which have a need to convert strings
between such encodings is not one I'm willing to bother worrying
about. YMMV. Since these functions don't exist yet, obviously
any such application is currently using some other method for
performing such conversions - however, I'd expect at least some of
the developers of such code to be happy to switch to a C standard
library function, as soon as they became sufficiently widely
available.
So, the bottom line is you really don't know?
I don't "know" anything about reality; all I have is varying degrees of
certainty about various statements about reality, which is never either exactly
0% or exactly 100%. I'm sufficiently sure, for the reasons given above, that
the need exists, that I'm not going to bother worrying about the possibility
that it doesn't. If those reasons aren't sufficient for you, that's fine - you
should investigate further - but I see no need to do so.
Post by Tim Rentsch
Post by j***@verizon.net
I'd been thinking in terms of a direct conversion, but I suppose
using wchar_t as an intermediary might have advantages. However,
if that's the case, then the conversion routines between wchar_t
and char16_t should be added to the standard library.
What I was trying to do is make sure the semantics are well-defined
and also consistent with a wchar_t representation, not describe an
implementation. ...
The standard imposes some requirements on the representation of multi-byte
characters, wchar_t, char16_t or char32_t, but not enough to mandate that
conversions between any pair of those types are invertible. If any of those
conversions is not invertible, forcing the translation between any two of those
types to go through a particular third type might make the conversion
unnecessarily lossy. I wouldn't mind it if the standard added words requiring
that some or all of those conversions be invertible.

...
Post by Tim Rentsch
Post by j***@verizon.net
I'm not entirely clear how to use the char16_t functions either,
despite having carefully read their complete description. That's
part of the reason why I wouldn't mind having string-oriented
versions written by the library implementor, rather than having to
write the equivalent code myself.
This seems like an odd thing to say. If you aren't sure how the
*c16* functions work, how can you evaluate whether some additional
functions should be added?
I think they should be added, precisely because I don't know how the single
character functions should be used, despite having read those descriptions.
That means that those descriptions are at the very least, obscure, so I'm
probably not the only person unsure about the matter. Anyone who implements the
single-character functions must understand how they are to be used, and should
therefore be capable of implementing the string-related functions better than I
could. I might not be able to evaluate whether they did it right, but I could
at least choose to trust that they've done so.
Post by Tim Rentsch
... Furthermore, if the already existing
multi-byte string conversion functions are any indication, new
functions for dealing with charXX_t strings will be defined in
terms of the more elementary charXX_t character conversion
functions. So if you don't yet understand the existing char16_t
conversion functions, there's a good chance that would carry over
to new char16_t string conversion functions that make use of
them (in the as-if sense, I mean).
Not necessarily - the definition by the standard of how those string functions
make use of the single character functions might, if sufficiently well written,
resolve my current uncertainties about how they should be used. The explanation
should be sufficiently well-written to allow implementors to implement it
correctly, which should be good enough for me to understand it.

In particular, if one of the single-character functions is currently defined in
a way that makes it impossible to use it while implementing the corresponding
string function (a possibility that is within the range of my current
uncertainty about them), then that would, in my opinion, be a defect in the
current standard. If that is the case, being forced to write up a description
of the string functions would allow the committee to realize that there was a
defect in the description of the corresponding single character function, and
correct it.
Jakob Bohm
2016-11-22 17:55:43 UTC
Post by j***@verizon.net
Post by Tim Rentsch
Post by j***@verizon.net
Post by Tim Rentsch
Note: to save space, I'm only going to refer to char16_t; but
everything I say about char16_t has an obvious char32_t analog. The
...
Post by Tim Rentsch
Post by j***@verizon.net
The justification for defining mbsrtoc16s(), despite the fact that
mbrtoc16() already exists, is the convenience factor. [...]
That is a plausible motivating factor. It is not however the only
such factor, and it may or may not be one as far as M. Krause is
concerned. I still am interested to hear his answer.
As to determining suitability, IMO saying some new feature would be
convenient is not by itself sufficient justification to warrant its
inclusion in the Standard. There really should be some further
motivation beyond that.
It would be helpful to know what the "further motivation" was that, in your
opinion, justified the addition of the corresponding wide character string
functions - or, is it your opinion that there was no such motivation, and that
they should therefore be dropped? To my mind, convenience would seem to be the
only justification for those functions, and it also seems a sufficient
justification, so I've never worried about whether there's any further
motivation.
Post by Tim Rentsch
Post by j***@verizon.net
Post by Tim Rentsch
I knew that. The question was meant in the sense of what
applications of those functions are expected.
I've no specific idea, but multi-byte character strings are
ubiquitous, and 16-bit encodings are not unheard of (I gather that
Windows uses UTF-16). Therefore, the possibility that there's a
shortage of applications which have a need to convert strings
between such encodings is not one I'm willing to bother worrying
about. YMMV. Since these functions don't exist yet, obviously
any such application is currently using some other method for
performing such conversions - however, I'd expect at least some of
the developers of such code to be happy to switch to a C standard
library function, as soon as they became sufficiently widely
available.
So, the bottom line is you really don't know?
I don't "know" anything about reality; all I have is varying degrees of
certainty about various statements about reality, which is never either exactly
0% or exactly 100%. I'm sufficiently sure, for the reasons given above, that
the need exists, that I'm not going to bother worrying about the possibility
that it doesn't. If those reasons aren't sufficient for you, that's fine - you
should investigate further - but I see no need to do so.
Post by Tim Rentsch
Post by j***@verizon.net
I'd been thinking in terms of a direct conversion, but I suppose
using wchar_t as an intermediary might have advantages. However,
if that's the case, then the conversion routines between wchar_t
and char16_t should be added to the standard library.
What I was trying to do is make sure the semantics are well-defined
and also consistent with a wchar_t representation, not describe an
implementation. ...
The standard imposes some requirements on the representation of multi-byte
characters, wchar_t, char16_t or char32_t, but not enough to mandate that
conversions between any pair of those types are invertible. If any of those
conversions is not invertible, forcing the translation between any two of those
types to go through a particular third type might make the conversion
unnecessarily lossy. I wouldn't mind it if the standard added words requiring
that some or all of those conversions be invertible.
...
Post by Tim Rentsch
Post by j***@verizon.net
I'm not entirely clear how to use the char16_t functions either,
despite having carefully read their complete description. That's
part of the reason why I wouldn't mind having string-oriented
versions written by the library implementor, rather than having to
write the equivalent code myself.
This seems like an odd thing to say. If you aren't sure how the
*c16* functions work, how can you evaluate whether some additional
functions should be added?
I think they should be added, precisely because I don't know how the single
character functions should be used, despite having read those descriptions.
That means that those descriptions are at the very least, obscure, so I'm
probably not the only person unsure about the matter. Anyone who implements the
single-character functions must understand how they are to be used, and should
therefore be capable of implementing the string-related functions better than I
could. I might not be able to evaluate whether they did it right, but I could
at least choose to trust that they've done so.
Post by Tim Rentsch
... Furthermore, if the already existing
multi-byte string conversion functions are any indication, new
functions for dealing with charXX_t strings will be defined in
terms of the more elementary charXX_t character conversion
functions. So if you don't yet understand the existing char16_t
conversion functions, there's a good chance that would carry over
to new char16_t string conversion functions that make use of
them (in the as-if sense, I mean).
Not necessarily - the definition by the standard of how those string functions
make use of the single character functions might, if sufficiently well written,
resolve my current uncertainties about how they should be used. The explanation
should be sufficiently well-written to allow implementors to implement it
correctly, which should be good enough for me to understand it.
In particular, if one of the single-character functions is currently defined in
a way that makes it impossible to use it while implementing the corresponding
string function (a possibility that is within the range of my current
uncertainty about them), then that would, in my opinion, be a defect in the
current standard. If that is the case, being forced to write up a description
of the string functions would allow the committee to realize that there was a
defect in the description of the corresponding single character function, and
correct it.
If (as I suspect) the single character functions keep internal state
when passed multi-element logical characters, then those single
character functions will be less thread safe/reentry safe than all-in-
one-invocation string functions. I could imagine the single character
functions taking some opaque state variable as an in-out argument
instead, which would make them logically suitable for thread-safe
implementation of all-at-once string functions, but inlining their
implementations will probably be more efficient.

Additionally, when defining new string conversion functions, it might
be useful to define them in a way that can also handle 0 valued
characters that don't terminate the string, as otherwise code that
needs to handle those would need to do some weird gymnastics just to
work around the standard functions truncating at 0-bytes.

As to round-tripping through wchar_t, this would fail miserably for the
following common case:

char32_t == UCS-4 UNICODE
char16_t == UTF-16 UNICODE
wchar_t == char16_t

UCS-4 UNICODE can store code points all the way up to U+7FFFFFFF,
but only those up to U+0010FFFF can be encoded as UTF-16 sequences of
char16_t. While the current version of the UNICODE standard uses no
character outside the UTF-16 compatible range (and even contains
verbiage to ban such use), someday they are going to run out of room
and start using more of the UCS-4 values.

Similarly, a 36 bit machine might, for implementation reasons, define
wchar_t as a 36 bit value, which thus cannot be round-tripped via
UCS-4 char32_t.
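
The range limit comes straight from the surrogate-pair arithmetic. A minimal sketch, assuming char32_t holds UCS-4/UTF-32 values and char16_t holds UTF-16 code units; the helper name sketch_c32_to_utf16 is hypothetical and not part of any standard library:

```c
#include <stddef.h>
#include <uchar.h>

/* Hypothetical helper: encodes one UCS-4/UTF-32 value as UTF-16.
   Returns the number of char16_t units written (1 or 2), or 0 if the
   value has no UTF-16 representation. */
size_t sketch_c32_to_utf16(char32_t c, char16_t out[2])
{
    if (c >= 0xD800 && c <= 0xDFFF)
        return 0;                     /* reserved for surrogates */
    if (c < 0x10000) {
        out[0] = (char16_t)c;         /* BMP: one code unit */
        return 1;
    }
    if (c > 0x10FFFF)
        return 0;                     /* beyond what two surrogates can carry */
    c -= 0x10000;                     /* 20 bits remain */
    out[0] = (char16_t)(0xD800 + (c >> 10));    /* high surrogate: top 10 bits */
    out[1] = (char16_t)(0xDC00 + (c & 0x3FF));  /* low surrogate: bottom 10 bits */
    return 2;
}
```

Two surrogates carry 10 bits each, so the largest encodable value is 0x10000 + 0xFFFFF = 0x10FFFF; anything above that, up to U+7FFFFFFF, has nowhere to go.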



Enjoy

Jakob
--
Jakob Bohm, CIO, Partner, WiseMo A/S. https://www.wisemo.com
Transformervej 29, 2860 Søborg, Denmark. Direct +45 31 13 16 10
This public discussion message is non-binding and may contain errors.
WiseMo - Remote Service Management for PCs, Phones and Embedded
Tim Rentsch
2016-11-23 17:38:10 UTC
Post by Jakob Bohm
Post by j***@verizon.net
Post by Tim Rentsch
Post by j***@verizon.net
Post by Tim Rentsch
Note: to save space, I'm only going to refer to char16_t; but
everything I say about char16_t has an obvious char32_t analog. The
...
Post by Tim Rentsch
Post by j***@verizon.net
The justification for defining mbsrtoc16s(), despite the fact that
mbrtoc16() already exists, is the convenience factor. [...]
That is a plausible motivating factor. It is not however the only
such factor, and it may or may not be one as far as M. Krause is
concerned. I still am interested to hear his answer.
As to determining suitability, IMO saying some new feature would be
convenient is not by itself sufficient justification to warrant its
inclusion in the Standard. There really should be some further
motivation beyond that.
It would be helpful to know what the "further motivation" was that,
in your opinion, justified the addition of the corresponding wide
character string functions - or, is it your opinion that there was
no such motivation, and that they should therefore be dropped? To
my mind, convenience would seem to be the only justification for
those functions, and it also seems a sufficient justification, so
I've never worried about whether there's any further motivation.
Post by Tim Rentsch
Post by j***@verizon.net
Post by Tim Rentsch
I knew that. The question was meant in the sense of what
applications of those functions are expected.
I've no specific idea, but multi-byte character strings are
ubiquitous, and 16-bit encodings are not unheard of (I gather
that Windows uses UTF-16). Therefore, the possibility that
there's a shortage of applications which have a need to convert
strings between such encodings is not one I'm willing to bother
worrying about. YMMV. Since these functions don't exist yet,
obviously any such application is currently using some other
method for performing such conversions - however, I'd expect at
least some of the developers of such code to be happy to switch
to a C standard library function, as soon as they became
sufficiently widely available.
So, the bottom line is you really don't know?
I don't "know" anything about reality; all I have is varying
degrees of certainty about various statements about reality, which
is never either exactly 0% or exactly 100%. I'm sufficiently sure,
for the reasons given above, that the need exists, that I'm not
going to bother worrying about the possibility that it doesn't. If
those reasons aren't sufficient for you, that's fine - you should
investigate further - but I see no need to do so.
Post by Tim Rentsch
Post by j***@verizon.net
I'd been thinking in terms of a direct conversion, but I suppose
using wchar_t as an intermediary might have advantages. However,
if that's the case, then the conversion routines between wchar_t
and char16_t should be added to the standard library.
What I was trying to do is make sure the semantics are well-defined
and also consistent with a wchar_t representation, not describe an
implementation. ...
The standard imposes some requirements on the representation of
multi-byte characters, wchar_t, char16_t or char32_t, but not
enough to mandate that conversions between any pair of those types
are invertible. If any of those conversions is not invertible,
forcing the translation between any two of those types to go
through a particular third type might make the conversion
unnecessarily lossy. I wouldn't mind it if the standard added
words requiring that some or all of those conversions be
invertible.
...
Post by Tim Rentsch
Post by j***@verizon.net
I'm not entirely clear how to use the char16_t functions either,
despite having carefully read their complete description. That's
part of the reason why I wouldn't mind having string-oriented
versions written by the library implementor, rather than having to
write the equivalent code myself.
This seems like an odd thing to say. If you aren't sure how the
*c16* functions work, how can you evaluate whether some additional
functions should be added?
I think they should be added, precisely because I don't know how
the single character functions should be used, despite having read
those descriptions. That means that those descriptions are at the
very least, obscure, so I'm probably not the only person unsure
about the matter. Anyone who implements the single-character
functions must understand how they are to be used, and should
therefore be capable of implementing the string-related functions
better than I could. I might not be able to evaluate whether they
did it right, but I could at least choose to trust that they've
done so.
Post by Tim Rentsch
... Furthermore, if the already existing
multi-byte string conversion functions are any indication, new
functions for dealing with charXX_t strings will be defined in
terms of the more elementary charXX_t character conversion
functions. So if you don't yet understand the existing char16_t
conversion functions, there's a good chance that would carry over
to new char16_t string conversion functions that make use of
them (in the as-if sense, I mean).
Not necessarily - the definition by the standard of how those
string functions make use of the single character functions might,
if sufficiently well written, resolve my current uncertainties
about how they should be used. The explanation should be
sufficiently well-written to allow implementors to implement it
correctly, which should be good enough for me to understand it.
In particular, if one of the single-character functions is
currently defined in a way that makes it impossible to use it while
implementing the corresponding string function (a possibility that
is within the range of my current uncertainty about them), then
that would, in my opinion, be a defect in the current standard. If
that is the case, being forced to write up a description of the
string functions would allow the committee to realize that there
was a defect in the description of the corresponding single
character function, and correct it.
If (as I suspect) the single character functions keep internal state
when passed multi-element logical characters, [...]
None of the char16_t/char32_t conversion functions have that
property. Intermediate state is kept in an mbstate_t object,
a pointer to which is an argument in each call.
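
For illustration, a minimal sketch of such a call sequence, assuming a hosted C11 implementation that provides <uchar.h>. The helper name sketch_mbstoc16s and its error convention are hypothetical and are not the interface proposed in this thread. All conversion state lives in the caller's mbstate_t object:

```c
#include <stddef.h>
#include <string.h>
#include <uchar.h>

/* Hypothetical helper (not a standard function): converts the multibyte
   string src to char16_t code units with mbrtoc16(), writing at most
   dstlen units including the terminating 0. Returns the number of units
   written, not counting the terminator, or (size_t)-1 on an encoding
   error, an incomplete sequence, or insufficient space. */
size_t sketch_mbstoc16s(char16_t *dst, size_t dstlen, const char *src)
{
    mbstate_t st;
    memset(&st, 0, sizeof st);            /* initial conversion state */
    size_t remaining = strlen(src) + 1;   /* include the terminating null */
    size_t n = 0;

    for (;;) {
        char16_t c16;
        size_t r = mbrtoc16(&c16, src, remaining, &st);
        if (r == (size_t)-1 || r == (size_t)-2)
            return (size_t)-1;            /* invalid or incomplete input */
        if (n >= dstlen)
            return (size_t)-1;            /* no room for this unit */
        dst[n] = c16;
        if (r == 0)
            return n;                     /* terminating null stored */
        n++;
        if (r != (size_t)-3) {            /* -3: unit came from saved state, */
            src += r;                     /* no input bytes were consumed    */
            remaining -= r;
        }
    }
}
```

In a locale whose multibyte encoding can express characters outside the BMP, mbrtoc16() typically returns the byte count along with the first surrogate and (size_t)-3 along with the second, which is why the loop leaves src alone in that case.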
Post by Jakob Bohm
Additionally, when defining new string conversion functions, it
might be useful to define them in a way that can also handle 0
valued characters that don't terminate the string, as otherwise code
that needs to handle those would need to do some weird gymnastics
just to work around the standard functions truncating at 0-bytes.
The Standard explicitly disallows any such encoding, so there is
no motivation to define any standard library function to work
around it.
Post by Jakob Bohm
As to round-tripping through wchar_t, this would fail miserably for
char32_t == UCS-4 UNICODE
char16_t == UTF-16 UNICODE
wchar_t == char16_t
UCS-4 UNICODE can store character points all the way up to U+7FFFFFFF,
but only those up to U+0010FFFF can be encoded as UTF-16 sequences of
char16_t. [...]
If such an implementation claims to support the UCS-4 character
set then it is non-conforming. The wchar_t type must be able to
"represent distinct codes for all members of the largest extended
character set specified among the supported locales". Obviously
it can't do this for 0x7FFFFFFF values if it has only 16 bits to
represent them.

IIUC the Microsoft C implementation uses UTF-16 for wchar_t, but
it sidesteps the conformance issues by having its documentation
claim to support only those code pages whose character sets are
in the sixteen bit space. A prime example of weasel-conformance.

A round trip through wchar_t is guaranteed to work for any
documentation-supported code point. And if such a transformation
were made part of C semantics in some places, maybe that would be
enough to get Microsoft off its corporate a** and fix their
wchar_t stupidity.
Philipp Klaus Krause
2016-11-23 21:50:57 UTC
Post by Tim Rentsch
IIUC the Microsoft C implementation uses UTF-16 for wchar_t, but
it sidesteps the conformance issues by having its documentation
claim to support only those code pages whose character sets are
in the sixteen bit space. A prime example of weasel-conformance.
SDCC now has a 32-bit wchar_t. But up to 3.5.0 it only supported 8-bit
character sets, and had an 8-bit wchar_t.

Philipp
Tim Rentsch
2016-12-01 18:25:59 UTC
Post by Tim Rentsch
IIUC the Microsoft C implementation uses UTF-16 for wchar_t, but
it sidesteps the conformance issues by having its documentation
claim to support only those code pages whose character sets are
in the sixteen bit space. A prime example of weasel-conformance.
SDCC now has a 32-bit wchar_t. But up to 3.5.0 it only supported 8-bit
character sets, and had an 8-bit wchar_t.
I have no problem with implementations that are clearly one way
or the other, which includes SDCC 3.5.0 as you have described it.
My complaint with the Microsoft implementation is they want it
both ways - they claim to support only 16-bit character sets, but
they supply locales that provide a larger range, and the compiler
will happily translate wide string constants with individual
character values > 65535 (and which produce more than one wchar_t
in the array). In my book that counts as weasel-conformance.
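
The "more than one element per character" effect can be reproduced with char16_t as a stand-in for a 16-bit wchar_t. This is only a sketch: it assumes __STDC_UTF_16__ is defined, so that u"" literals are UTF-16, and the function name is hypothetical:

```c
#include <stddef.h>
#include <uchar.h>

/* Number of char16_t elements, including the terminating 0, in a UTF-16
   literal that spells the single character U+1F600. */
size_t sketch_units_in_literal(void)
{
    static const char16_t s[] = u"\U0001F600";
    return sizeof s / sizeof s[0];
}
```

With UTF-16 literals the array holds two surrogates plus the terminator, so one source character occupies two array elements.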
Jakob Bohm
2016-12-01 18:40:24 UTC
Post by Tim Rentsch
Post by Tim Rentsch
IIUC the Microsoft C implementation uses UTF-16 for wchar_t, but
it sidesteps the conformance issues by having its documentation
claim to support only those code pages whose character sets are
in the sixteen bit space. A prime example of weasel-conformance.
SDCC now has a 32-bit wchar_t. But up to 3.5.0 it only supported 8-bit
character sets, and had an 8-bit wchar_t.
I have no problem with implementations that are clearly one way
or the other, which includes SDCC 3.5.0 as you have described it.
My complaint with the Microsoft implementation is they want it
both ways - they claim to support only 16-bit character sets, but
they supply locales that provide a larger range, and the compiler
will happily translate wide string constants with individual
character values > 65535 (and which produce more than one wchar_t
in the array). In my book that counts as weasel-conformance.
As I have said before, Microsoft seems to be the only company that
actually uses wchar_t for something other than silly wrappers around
8-bit syscalls. They *cannot* change wchar_t to char32_t without
breaking every single piece of wchar_t-using 3rd party source code
written for their OS since 1992.

So they have no real choice but to ignore any additional requirements
imposed by C committee members with no understanding of reality. In
fact, they should have also stuck to their well-thought-out 1992
definition of how L"%s" and L"%c" function in format strings rather
than bowing to the arbitrary nonsense that the C committee defined.

Enjoy

Jakob
Kaz Kylheku
2016-12-01 20:50:55 UTC
Post by Jakob Bohm
Post by Tim Rentsch
Post by Tim Rentsch
IIUC the Microsoft C implementation uses UTF-16 for wchar_t, but
it sidesteps the conformance issues by having its documentation
claim to support only those code pages whose character sets are
in the sixteen bit space. A prime example of weasel-conformance.
SDCC now has a 32-bit wchar_t. But up to 3.5.0 it only supported 8-bit
character sets, and had an 8-bit wchar_t.
I have no problem with implementations that are clearly one way
or the other, which includes SDCC 3.5.0 as you have described it.
My complaint with the Microsoft implementation is they want it
both ways - they claim to support only 16-bit character sets, but
they supply locales that provide a larger range, and the compiler
will happily translate wide string constants with individual
character values > 65535 (and which produce more than one wchar_t
in the array). In my book that counts as weasel-conformance.
As I have said before, Microsoft seems to be the only company that
actually uses wchar_t for something other than silly wrappers around
8-bit syscalls.
Yes, something far stupider: silly wrappers for 16 bit syscalls.

32 bit wchar_t internally representing code points, over top of UTF-8
syscalls and I/O is far more intelligent (and worldly) than 16 bit
wchar_t storing UTF-16 code *units*, over UTF-16 syscalls and I/O.
Jakob Bohm
2016-11-24 00:01:47 UTC
Post by Tim Rentsch
Post by Jakob Bohm
Post by j***@verizon.net
Post by Tim Rentsch
Post by j***@verizon.net
Post by Tim Rentsch
Note: to save space, I'm only going to refer to char16_t; but
everything I say about char16_t has an obvious char32_t analog. The
...
Post by Tim Rentsch
Post by j***@verizon.net
The justification for defining mbsrtoc16s(), despite the fact that
mbrtoc16() already exists, is the convenience factor. [...]
That is a plausible motivating factor. It is not however the only
such factor, and it may or may not be one as far as M. Krause is
concerned. I still am interested to hear his answer.
As to determining suitability, IMO saying some new feature would be
convenient is not by itself sufficient justification to warrant its
inclusion in the Standard. There really should be some further
motivation beyond that.
It would be helpful to know what the "further motivation" was that,
in your opinion, justified the addition of the corresponding wide
character string functions - or, is it your opinion that there was
no such motivation, and that they should therefore be dropped? To
my mind, convenience would seem to be the only justification for
those functions, and it also seems a sufficient justification, so
I've never worried about whether there's any further motivation.
Post by Tim Rentsch
Post by j***@verizon.net
Post by Tim Rentsch
I knew that. The question was meant in the sense of what
applications of those functions are expected.
I've no specific idea, but multi-byte character strings are
ubiquitous, and 16-bit encodings are not unheard of (I gather
that Windows uses UTF-16). Therefore, the possibility that
there's a shortage of applications which have a need to convert
strings between such encodings is not one I'm willing to bother
worrying about. YMMV. Since these functions don't exist yet,
obviously any such application is currently using some other
method for performing such conversions - however, I'd expect at
least some of the developers of such code to be happy to switch
to a C standard library function, as soon as they became
sufficiently widely available.
So, the bottom line is you really don't know?
I don't "know" anything about reality; all I have is varying
degrees of certainty about various statements about reality, which
is never either exactly 0% or exactly 100%. I'm sufficiently sure,
for the reasons given above, that the need exists, that I'm not
going to bother worrying about the possibility that it doesn't. If
those reasons aren't sufficient for you, that's fine - you should
investigate further - but I see no need to do so.
Post by Tim Rentsch
Post by j***@verizon.net
I'd been thinking in terms of a direct conversion, but I suppose
using wchar_t as an intermediary might have advantages. However,
if that's the case, then the conversion routines between wchar_t
and char16_t should be added to the standard library.
What I was trying to do is make sure the semantics are well-defined
and also consistent with a wchar_t representation, not describe an
implementation. ...
The standard imposes some requirements on the representation of
multi-byte characters, wchar_t, char16_t or char32_t, but not
enough to mandate that conversions between any pair of those types
are invertible. If any of those conversions is not invertible,
forcing the translation between any two of those types to go
through a particular third type might make the conversion
unnecessarily lossy. I wouldn't mind it if the standard added
words requiring that some or all of those conversions be
invertible.
...
Post by Tim Rentsch
Post by j***@verizon.net
I'm not entirely clear how to use the char16_t functions either,
despite having carefully read their complete description. That's
part of the reason why I wouldn't mind having string-oriented
versions written by the library implementor, rather than having to
write the equivalent code myself.
This seems like an odd thing to say. If you aren't sure how the
*c16* functions work, how can you evaluate whether some additional
functions should be added?
I think they should be added, precisely because I don't know how
the single character functions should be used, despite having read
those descriptions. That means that those descriptions are at the
very least, obscure, so I'm probably not the only person unsure
about the matter. Anyone who implements the single-character
functions must understand how they are to be used, and should
therefore be capable of implementing the string-related functions
better than I could. I might not be able to evaluate whether they
did it right, but I could at least choose to trust that they've
done so.
Post by Tim Rentsch
... Furthermore, if the already existing
multi-byte string conversion functions are any indication, new
functions for dealing with charXX_t strings will be defined in
terms of the more elementary charXX_t character conversion
functions. So if you don't yet understand the existing char16_t
conversion functions, there's a good chance that would carry over
to new char16_t string conversion functions that make use of
them (in the as-if sense, I mean).
Not necessarily - the definition by the standard of how those
string functions make use of the single character functions might,
if sufficiently well written, resolve my current uncertainties
about how they should be used. The explanation should be
sufficiently well-written to allow implementors to implement it
correctly, which should be good enough for me to understand it.
In particular, if one of the single-character functions is
currently defined in a way that makes it impossible to use it while
implementing the corresponding string function (a possibility that
is within the range of my current uncertainty about them), then
that would, in my opinion, be a defect in the current standard. If
that is the case, being forced to write up a description of the
string functions would allow the committee to realize that there
was a defect in the description of the corresponding single
character function, and correct it.
If (as I suspect) the single character functions keep internal state
when passed multi-element logical characters, [...]
None of the char16_t/char32_t conversion functions have that
property. Intermediate state is kept in an mbstate_t object,
a pointer to which is an argument in each call.
Good
Post by Tim Rentsch
Post by Jakob Bohm
Additionally, when defining new string conversion functions, it
might be useful to define them in a way that can also handle 0
valued characters that don't terminate the string, as otherwise code
that needs to handle those would need to do some weird gymnastics
just to work around the standard functions truncating at 0-bytes.
The Standard explicitly disallows any such encoding, so there is
no motivation to define any standard library function to work
around it.
The C standard cannot disallow the existence of applications/libraries
that allow 0 characters in structures such as "pascal strings", "C++
strings", "I/O buffers represented as strings" etc. That is the
context in which I was referring to 0 chars.
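
A rough sketch of the kind of length-based interface I have in mind, built on the existing mbrtoc32(). The name sketch_mbntoc32s and its error convention are hypothetical, and it relies on the null character being encoded as a single 0 byte:

```c
#include <stddef.h>
#include <string.h>
#include <uchar.h>

/* Hypothetical helper: converts exactly srclen bytes of multibyte input
   to char32_t, passing embedded null characters through instead of
   stopping at them. Returns the number of char32_t values written, or
   (size_t)-1 on an encoding error, an incomplete trailing sequence, or
   insufficient space. */
size_t sketch_mbntoc32s(char32_t *dst, size_t dstlen,
                        const char *src, size_t srclen)
{
    mbstate_t st;
    memset(&st, 0, sizeof st);            /* initial conversion state */
    size_t n = 0;

    while (srclen > 0) {
        char32_t c32;
        size_t r = mbrtoc32(&c32, src, srclen, &st);
        if (r == (size_t)-1 || r == (size_t)-2)
            return (size_t)-1;
        if (n >= dstlen)
            return (size_t)-1;
        dst[n++] = c32;
        /* r == 0 means a null character was converted; it occupied one
           byte, so step over it and keep going. (size_t)-3 means nothing
           was consumed by this call. */
        size_t used = (r == (size_t)-3) ? 0 : (r == 0 ? 1 : r);
        src += used;
        srclen -= used;
    }
    return n;
}
```

Called on the 3 bytes of "a\0b", this yields three values ('a', 0, 'b'), where the null-terminated 7.22.8-style functions would stop after the first.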
Post by Tim Rentsch
Post by Jakob Bohm
As to round-tripping through wchar_t, this would fail miserably for
char32_t == UCS-4 UNICODE
char16_t == UTF-16 UNICODE
wchar_t == char16_t
UCS-4 UNICODE can store character points all the way up to U+7FFFFFFF,
but only those up to U+0010FFFF can be encoded as UTF-16 sequences of
char16_t. [...]
If such an implementation claims to support the UCS-4 character
set then it is non-conforming. The wchar_t type must be able to
"represent distinct codes for all members of the largest extended
character set specified among the supported locales". Obviously
it can't do this for 0x7FFFFFFF values if it has only 16 bits to
represent them.
IIUC the Microsoft C implementation uses UTF-16 for wchar_t, but
it sidesteps the conformance issues by having its documentation
claim to support only those code pages whose character sets are
in the sixteen bit space. A prime example of weasel-conformance.
A round trip through wchar_t is guaranteed to work for any
documentation-supported code point. And if such a transformation
were made part of C semantics in some places, maybe that would be
enough to get Microsoft off its corporate a** and fix their
wchar_t stupidity.
The story is somewhat different, Microsoft built the core of their OS
based on an early UNICODE draft. The switch from UCS-2 to UCS-4 as the
basis of the standard happened too late for Microsoft to update their
API, ABI, file format etc. specifications accordingly, thus *all*
compilers targeting the Win32/Win64/WinCE API need (for API/ABI
reasons) to use the wchar_t == char16_t definition, regardless of the
much later introduction of char32_t as a C type.

The situation for Sun's Java is somewhat similar, although Sun doesn't
have a good excuse for not knowing about UCS-4.

Thus as a practical matter, weaseling around a contradictory standard
requirement (that obviously didn't consider the situation on the most
widely used wchar_t using APIs/ABIs) is the only alternative to
removing that requirement from the standard.

Here is a challenge: Name me an OS which has an actual use for the
wchar_t type in its API/ABI as anything other than a token gesture to
its presence in the C standard, and which uses a 32 bit wchar_t for
that?


Enjoy

Jakob
--
Jakob Bohm, CIO, Partner, WiseMo A/S. https://www.wisemo.com
Transformervej 29, 2860 Søborg, Denmark. Direct +45 31 13 16 10
This public discussion message is non-binding and may contain errors.
WiseMo - Remote Service Management for PCs, Phones and Embedded
James R. Kuyper
2016-11-24 00:37:26 UTC
...
Post by Jakob Bohm
Post by Tim Rentsch
Post by Jakob Bohm
Additionally, when defining new string conversion functions, it
might be useful to define them in a way that can also handle 0
valued characters that don't terminate the string, as otherwise code
that needs to handle those would need to do some weird gymnastics
just to work around the standard functions truncating at 0-bytes.
The Standard explicitly disallows any such encoding, so there is
no motivation to define any standard library function to work
around it.
The C standard cannot disallow the existence of application/libraries
that allow 0 characters in structures such as "pascal strings", "C++
strings", "I/O buffers represented as strings" etc. That is the
context in which I was referring to 0 chars.
The standard can, however, refuse to identify such structures as
strings. And it does:

The standard specifies that, for multibyte characters, "A byte with all
bits zero shall be interpreted as a null character independent of shift
state. Such a byte shall not occur as part of any other multibyte
character." (5.2.1.2p1)

"A string is a contiguous sequence of characters terminated by and
including the first null character. The term multibyte string is
sometimes used instead to emphasize special processing given to
multibyte characters contained in the string or to avoid confusion
with a wide string." (7.1.1p1)

Therefore, a function that takes an argument described as pointing at a
string should not read past the first null character; one described as
writing a string should write at least one terminating null character.

You can propose functions whose description indicates that they do not
process strings, but rather some other data structure. Those functions
can use any method you deem appropriate to determine where the end of
the data to be processed is - but they are not string functions as far
as the C standard is concerned.
Jakob Bohm
2016-11-24 19:46:52 UTC
Post by James R. Kuyper
...
Post by Jakob Bohm
Post by Tim Rentsch
Post by Jakob Bohm
Additionally, when defining new string conversion functions, it
might be useful to define them in a way that can also handle 0
valued characters that don't terminate the string, as otherwise code
that needs to handle those would need to do some weird gymnastics
just to work around the standard functions truncating at 0-bytes.
The Standard explicitly disallows any such encoding, so there is
no motivation to define any standard library function to work
around it.
The C standard cannot disallow the existence of application/libraries
that allow 0 characters in structures such as "pascal strings", "C++
strings", "I/O buffers represented as strings" etc. That is the
context in which I was referring to 0 chars.
The standard can, however, refuse to identify such structures as
strings. And it does:
The standard specifies that, for multibyte characters, "A byte with all
bits zero shall be interpreted as a null character independent of shift
state. Such a byte shall not occur as part of any other multibyte
character." (5.2.1.2p1)
"A string is a contiguous sequence of characters terminated by and
including the first null character. The term multibyte string is
sometimes used instead to emphasize special processing given to
multibyte characters contained in the string or to avoid confusion
with a wide string." (7.1.1p1)
Therefore, a function that takes an argument described as pointing at a
string should not read past the first null character; one described as
writing a string should write at least one terminating null character.
You can propose functions whose description indicates that they do not
process strings, but rather some other data structure. Those functions
can use any method you deem appropriate to determine where the end of
the data to be processed is - but they are not string functions as far
as the C standard is concerned.
I was merely suggesting that, if functions of the kind discussed in
this thread were to be added to the next version of the standard, it
might be useful to make them usable with <string.h> functions such as
memchr(), not just <string.h> functions such as strchr().
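
A rough sketch of the "gymnastics" in question, assuming a counted buffer
rather than a string: mbrtoc32() returns 0 when it converts a null
character, so the caller has to step over that one byte and carry on.
The helper name mbbuf_to_c32s_sketch is hypothetical, not a standard or
proposed function.

```c
#include <stddef.h>
#include <string.h>
#include <uchar.h>
#include <wchar.h>   /* mbstate_t */

/* Hypothetical helper: convert srclen bytes of multibyte data (current
 * locale) to char32_t, passing embedded zero bytes through instead of
 * stopping at the first one. Returns the number of units written, or
 * (size_t)-1 on invalid input or if dst is too small. */
size_t mbbuf_to_c32s_sketch(char32_t *dst, size_t dstlen,
                            const char *src, size_t srclen)
{
    mbstate_t st;
    size_t written = 0;

    memset(&st, 0, sizeof st);            /* initial conversion state */
    while (srclen > 0) {
        char32_t c;
        size_t r = mbrtoc32(&c, src, srclen, &st);

        if (r == (size_t)-1 || r == (size_t)-2)
            return (size_t)-1;            /* invalid or truncated input */
        if (written == dstlen)
            return (size_t)-1;            /* output buffer too small */
        dst[written++] = c;
        if (r == (size_t)-3)
            continue;                     /* output only, no input consumed */
        if (r == 0)
            r = 1;                        /* a null character is one zero byte */
        src += r;
        srclen -= r;
    }
    return written;
}
```

Called on the five bytes "ab\0cd" with srclen 5, this should yield five
units with a zero in the middle, which a string-based interface would
truncate after "ab".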

Enjoy

Jakob
Tim Rentsch
2016-12-01 19:09:13 UTC
Post by Jakob Bohm
Post by Tim Rentsch
Post by Jakob Bohm
[...]
Additionally, when defining new string conversion functions, it
might be useful to define them in a way that can also handle 0
valued characters that don't terminate the string, as otherwise code
that needs to handle those would need to do some weird gymnastics
just to work around the standard functions truncating at 0-bytes.
The Standard explicitly disallows any such encoding, so there is
no motivation to define any standard library function to work
around it.
The C standard cannot disallow the existence of application/libraries
that allow 0 characters in structures such as "pascal strings", "C++
strings", "I/O buffers represented as strings" etc. That is the
context in which I was referring to 0 chars.
It still is true that there is no motivation to define any string
conversion functions in the standard library to work on such things.
If applications want to do so, fine, but they clearly are working
outside the domain of what the Standard considers usual.
Post by Jakob Bohm
Post by Tim Rentsch
Post by Jakob Bohm
As to round-tripping through wchar_t, this would fail miserably for
char32_t == UCS-4 UNICODE
char16_t == UTF-16 UNICODE
wchar_t == char16_t
UCS-4 UNICODE can store character points all the way up to U+7FFFFFFF,
but only those up to U+0010FFFF can be encoded as UTF-16 sequences of
char16_t. [...]
If such an implementation claims to support the UCS-4 character
set then it is non-conforming. The wchar_t type must be able to
"represent distinct codes for all members of the largest extended
character set specified among the supported locales". Obviously
it can't do this for 0x7FFFFFFF values if it has only 16 bits to
represent them.
IIUC the Microsoft C implementation uses UTF-16 for wchar_t, but
it sidesteps the conformance issues by having its documentation
claim to support only those code pages whose character sets are
in the sixteen bit space. A prime example of weasel-conformance.
A round trip through wchar_t is guaranteed to work for any
documentation-supported code point. And if such a transformation
were made part of C semantics in some places, maybe that would be
enough to get Microsoft off its corporate a** and fix their
wchar_t stupidity.
The story is somewhat different, Microsoft built the core of their OS
based on an early UNICODE draft. The switch from UCS-2 to UCS-4 as the
basis of the standard happened too late for Microsoft to update their
API, ABI, file format etc. specifications accordingly, thus *all*
compilers targeting the Win32/Win64/WinCE API need (for API/ABI
reasons) to use the wchar_t == char16_t definition, regardless of the
much later introduction of char32_t as a C type.
I understand why they made the choice they did 20 years ago. My
complaint is that they haven't fixed it in the 20 years since then.
Post by Jakob Bohm
The situation for Sun's Java is somewhat similar, although Sun doesn't
have a good excuse for not knowing about UCS-4.
AFAIK the requirements for Java are not written in terms of C types,
so any comment about Java is not relevant.
Post by Jakob Bohm
Thus as a practical matter, weaseling around a contradictory standard
requirement (that obviously didn't consider the situation on the most
widely used wchar_t using APIs/ABIs) is the only alternative to
removing that requirement from the standard.
Hogwash. The type wchar_t is part of C89/C90. It was obvious from
the get-go (before MS Windows 95, remember) that it would evolve as
time went on, so making changes to that should have been anticipated
before choosing UCS-2. Even if it wasn't anticipated early on, MS
has had 20 years to devise an evolutionary path to upgrade to 32-bit
wchar_t (with, eg, parallel libraries for the two choices). If
nothing else it could have been done as part of the transition to
64-bit architectures. What we have instead is Microsoft as usual
thumbing its nose at the ISO C standard.
Post by Jakob Bohm
Here is a challenge: Name me an OS which has an actual use for the
wchar_t type in its API/ABI as anything other than a token gesture to
its presence in the C standard, and which uses a 32 bit wchar_t for
that?
If an OS mostly doesn't use wchar_t in its API or ABI, that makes it
EASIER to change the representation of wchar_t without having to
modify the OS. That's a good design decision, not a bad one. If
Microsoft chose to lock a significant part of their API/ABI to a
wchar_t fixed at 16-bits, they made a bad design decision. Okay,
at the time maybe that's understandable, but that's no excuse for
not addressing the problem in the two decades since then.
Jakob Bohm
2016-12-02 03:10:07 UTC
Post by Tim Rentsch
Post by Jakob Bohm
Post by Tim Rentsch
Post by Jakob Bohm
[...]
Additionally, when defining new string conversion functions, it
might be useful to define them in a way that can also handle 0
valued characters that don't terminate the string, as otherwise code
that needs to handle those would need to do some weird gymnastics
just to work around the standard functions truncating at 0-bytes.
The Standard explicitly disallows any such encoding, so there is
no motivation to define any standard library function to work
around it.
The C standard cannot disallow the existence of application/libraries
that allow 0 characters in structures such as "pascal strings", "C++
strings", "I/O buffers represented as strings" etc. That is the
context in which I was referring to 0 chars.
It still is true that there is no motivation to define any string
conversion functions in the standard library to work on such things.
If applications want to do so, fine, but they clearly are working
outside the domain of what the Standard considers usual.
Post by Jakob Bohm
Post by Tim Rentsch
Post by Jakob Bohm
As to round-tripping through wchar_t, this would fail miserably for
char32_t == UCS-4 UNICODE
char16_t == UTF-16 UNICODE
wchar_t == char16_t
UCS-4 UNICODE can store character points all the way up to U+7FFFFFFF,
but only those up to U+0010FFFF can be encoded as UTF-16 sequences of
char16_t. [...]
If such an implementation claims to support the UCS-4 character
set then it is non-conforming. The wchar_t type must be able to
"represent distinct codes for all members of the largest extended
character set specified among the supported locales". Obviously
it can't do this for 0x7FFFFFFF values if it has only 16 bits to
represent them.
IIUC the Microsoft C implementation uses UTF-16 for wchar_t, but
it sidesteps the conformance issues by having its documentation
claim to support only those code pages whose character sets are
in the sixteen bit space. A prime example of weasel-conformance.
A round trip through wchar_t is guaranteed to work for any
documentation-supported code point. And if such a transformation
were made part of C semantics in some places, maybe that would be
enough to get Microsoft off its corporate a** and fix their
wchar_t stupidity.
The story is somewhat different, Microsoft built the core of their OS
based on an early UNICODE draft. The switch from UCS-2 to UCS-4 as the
basis of the standard happened too late for Microsoft to update their
API, ABI, file format etc. specifications accordingly, thus *all*
compilers targeting the Win32/Win64/WinCE API need (for API/ABI
reasons) to use the wchar_t == char16_t definition, regardless of the
much later introduction of char32_t as a C type.
I understand why they made the choice they did 20 years ago. My
complaint is that they haven't fixed it in the 20 years since then.
Post by Jakob Bohm
The situation for Sun's Java is somewhat similar, although Sun doesn't
have a good excuse for not knowing about UCS-4.
AFAIK the requirements for Java are not written in terms of C types,
so any comment about Java is not relevant.
The important Java *implementations* and most native-code java
extensions are written in the C language.
Post by Tim Rentsch
Post by Jakob Bohm
Thus as a practical matter, weaseling around a contradictory standard
requirement (that obviously didn't consider the situation on the most
widely used wchar_t using APIs/ABIs) is the only alternative to
removing that requirement from the standard.
Hogwash. The type wchar_t is part of C89/C90. It was obvious from
the get-go (before MS Windows 95, remember) that it would evolve as
time went on, so making changes to that should have been anticipated
before choosing UCS-2. Even if it wasn't anticipated early on, MS
has had 20 years to devise an evolutionary path to upgrade to 32-bit
wchar_t (with, eg, parallel libraries for the two choices). If
nothing else it could have been done as part of the transition to
64-bit architectures. What we have instead is Microsoft as usual
thumbing its nose at the ISO C standard.
Post by Jakob Bohm
Here is a challenge: Name me an OS which has an actual use for the
wchar_t type in its API/ABI as anything other than a token gesture to
its presence in the C standard, and which uses a 32 bit wchar_t for
that?
If an OS mostly doesn't use wchar_t in its API or ABI, that makes it
EASIER to change the representation of wchar_t without having to
modify the OS.
Hence why those are the operating systems that should be forced to
change (if anyone), not the ones that would actually have a serious
problem changing from a decision predating contradictory decisions by
a committee dominated by their competitors.
Post by Tim Rentsch
That's a good design decision, not a bad one. If
Microsoft chose to lock a significant part of their API/ABI to a
wchar_t fixed at 16-bits, they made a bad design decision. Okay,
at the time maybe that's understandable, but that's no excuse for
not addressing the problem in the two decades since then.
Microsoft spent the 1990s trying to get rid of 8 bit char in system
calls (except as unsigned bytes in streams etc.), their reduced Windows
CE API actually removed the system calls that took char* strings
completely, and in mainstream Win32 prior to Windows Vista (or maybe an
even later version), locales with no standard single byte or double
byte char character sets (such as some Indian dialects) would simply
return error when trying to use char* calls rather than using UTF-8,
probably because there were ancient ABI guarantees (dating back to the
first Japanese version of MS-DOS) in the form of locale information
functions that would tell applications which char values indicated that
the encoded character was 2 bytes and which ones indicated 1 byte, with
no way to tell callers that some char values would actually indicate 3
or more bytes, this applied only to the locale-default character set
and not to character sets that could be specified explicitly to
conversion functions, which is where UTF-8 etc. were already supported.

Prior to the introduction of char16_t and char32_t, wchar_t was a
vaguely defined type which implementations could choose to be any
number of bits, so until then, there was no reason to anticipate the
later versions of the standard demanding that it had to be 32 bits to
support the rarer Unicode code points.

And in case you missed it, UTF-32 stored as char32_t does *not*
represent all logical characters as single values either, because of
the various modifier code points in UNICODE, such as accents etc.

Also the very introduction of char16_t and char32_t makes sense
only in the context of interoperating with systems that are
internally tied to wchar_t being one or the other, so an interpretation
that wchar_t==char16_t==UTF-16 is a formal violation makes no sense.

P.S.

Microsoft's newer interpreted runtime (.NET) uses UTF-8 exclusively.


Enjoy

Jakob
Tim Rentsch
2016-12-10 06:33:26 UTC
Post by Jakob Bohm
[decoupling OS API from wchar_t]
That's a good design decision, not a bad one. If
Microsoft chose to lock a significant part of their API/ABI to a
wchar_t fixed at 16-bits, they made a bad design decision. Okay,
at the time maybe that's understandable, but that's no excuse for
not addressing the problem in the two decades since then.
Microsoft spent the 1990s trying to get rid of 8 bit char in system
calls (except as unsigned bytes in streams etc.), their reduced Windows
CE API actually removed the system calls that took char* strings
completely, and in mainstream Win32 prior to Windows Vista (or maybe an
even later version), locales with no standard single byte or double
byte char character sets (such as some Indian dialects) would simply
return error when trying to use char* calls rather than using UTF-8,
probably because there were ancient ABI guarantees (dating back to the
first Japanese version of MS-DOS) in the form of locale information
functions that would tell applications which char values indicated that
the encoded character was 2 bytes and which ones indicated 1 byte, with
no way to tell callers that some char values would actually indicate 3
or more bytes, this applied only to the locale-default character set
and not to character sets that could be specified explicitly to
conversion functions, which is where UTF-8 etc. were already supported.
A long (172 words!) rambling single sentence that apparently
has nothing to do with wchar_t.
Post by Jakob Bohm
Prior to the introduction of char16_t and char32_t, wchar_t was a
vaguely defined type which implementations could choose to be any
number of bits, [...]
Not so. Even in 1990 wchar_t was required to be large enough to
"represent distinct codes for all members of the largest extended
character set specified among the supported locales". It didn't
take a genius to see that this set would change and grow over time.
Post by Jakob Bohm
And in case you missed it, UTF-32 stored as char32_t does *not*
represent all logical characters as single values either, because of
the various modifier code points in UNICODE, such as accents etc.
Irrelevant to what is being discussed.
Post by Jakob Bohm
Also the very introduction of char16_t and char32_t makes sense
only in the context of interoperating with systems that are
internally tied to wchar_t being one or the other, so an interpretation
that wchar_t==char16_t==UTF-16 is a formal violation makes no sense.
I disagree with the premise, but in any case the point is
irrelevant to the discussion. Among other things, char16_t (and
char32_t) didn't even enter the language until 2011. The
requirements for wchar_t had been in place more than 20 years by
then, so an argument based on char16_t simply doesn't hold water.
Jakob Bohm
2016-12-13 16:33:34 UTC
Post by Tim Rentsch
Post by Jakob Bohm
[decoupling OS API from wchar_t]
That's a good design decision, not a bad one. If
Microsoft chose to lock a significant part of their API/ABI to a
wchar_t fixed at 16-bits, they made a bad design decision. Okay,
at the time maybe that's understandable, but that's no excuse for
not addressing the problem in the two decades since then.
Microsoft spent the 1990s trying to get rid of 8 bit char in system
calls (except as unsigned bytes in streams etc.), their reduced Windows
CE API actually removed the system calls that took char* strings
completely, and in mainstream Win32 prior to Windows Vista (or maybe an
even later version), locales with no standard single byte or double
byte char character sets (such as some Indian dialects) would simply
return error when trying to use char* calls rather than using UTF-8,
probably because there were ancient ABI guarantees (dating back to the
first Japanese version of MS-DOS) in the form of locale information
functions that would tell applications which char values indicated that
the encoded character was 2 bytes and which ones indicated 1 byte, with
no way to tell callers that some char values would actually indicate 3
or more bytes, this applied only to the locale-default character set
and not to character sets that could be specified explicitly to
conversion functions, which is where UTF-8 etc. were already supported.
A long (172 words!) rambling single sentence that apparently
has nothing to do with wchar_t.
I was trying to counter your completely misguided slander of one of the
few places where Microsoft has been doing the right thing (until they
were pressured into breaking compatibility of wprintf() to cater to the
wrong definition). It is also a description of Microsoft's attempts at
the transition plan you asked for.
Post by Tim Rentsch
Post by Jakob Bohm
Prior to the introduction of char16_t and char32_t, wchar_t was a
vaguely defined type which implementations could choose to be any
number of bits, [...]
Not so. Even in 1990 wchar_t was required to be large enough to
"represent distinct codes for all members of the largest extended
character set specified among the supported locales". It didn't
take a genius to see that this set would change and grow over time.
That phrasing can be (and was) easily read as requiring only that wchar_t
be large enough not to omit any characters, for example by providing room
for every UTF-16 16-bit value on a system where none of the locale-
specific char character codes support anywhere near the full UNICODE
repertoire (because none are UTF-8 or its alternatives). Reading this
as a prohibition against using UTF-16 to add higher UNICODE codepoints
to a system designed for UCS-2 is maliciously holding past decisions
against new standards (interpretations).
Post by Tim Rentsch
Post by Jakob Bohm
And in case you missed it, UTF-32 stored as char32_t does *not*
represent all logical characters as single values either, because of
the various modifier code points in UNICODE, such as accents etc.
Irrelevant to what is being discussed.
It shows that UTF-32 wchar_t does not satisfy your supposed
interpretation, and an implementation would have to come up with its
own unique type and encoding for representing all the composite UNICODE
values as single wchar_t values.
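
To make the combining-character point concrete, here is a small sketch,
assuming a compiler that stores U"" literals as UTF-32 code points. The
names c32_units, precomposed and decomposed are invented for
illustration.

```c
#include <stddef.h>
#include <uchar.h>

/* Count the char32_t units before the terminating zero. */
size_t c32_units(const char32_t *s)
{
    size_t n = 0;
    while (s[n] != 0)
        n++;
    return n;
}

/* The same user-perceived character, "e with acute accent", spelled two ways. */
const char32_t precomposed[] = U"\u00E9";   /* U+00E9 */
const char32_t decomposed[]  = U"e\u0301";  /* U+0065 then combining U+0301 */
```

The first spelling occupies one char32_t unit and the second occupies
two, even though both display as a single accented letter.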



Enjoy

Jakob
Tim Rentsch
2016-12-25 18:02:33 UTC
Merry Christmas...

Keith Thompson
2016-11-23 19:42:39 UTC
Jakob Bohm <jb-***@wisemo.com> writes:
[...]
Post by Jakob Bohm
UCS-4 UNICODE can store character points all the way up to U+7FFFFFFF,
but only those up to U+0010FFFF can be encoded as UTF-16 sequences of
char16_t. While the current version of the UNICODE standard uses no
character outside the UTF-16 compatible range (and even contains
verbiage to ban such use), someday they are going to run out of room
and start using more of the UCS-4 values.
According to Wikipedia, UCS-4 was part of the original ISO 10646
standard, but the character set was restricted in 2003, effectively
making UCS-4 identical to UTF-32. The working group has
a policy that all future assignments will be restricted to the
Unicode range of 0 to U+10FFFF. (Yes, policies can change, but
they've been pretty clear about this one.)

https://en.wikipedia.org/wiki/UTF-32#History

Certainly a C implementation *could* use a character encoding that
supports code points above 0x10FFFF, but such an encoding would
not be Unicode.
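
For reference, the U+10FFFF ceiling follows directly from the surrogate
arithmetic: a pair carries 20 bits on top of the 0x10000 offset. A
minimal sketch of that encoding step, with utf16_encode being a made-up
name rather than any library function:

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one code point as UTF-16.
 * Returns the number of 16-bit units written (1 or 2),
 * or 0 if cp cannot be represented in UTF-16. */
size_t utf16_encode(uint_least32_t cp, uint_least16_t out[2])
{
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;               /* surrogate values are not characters */
    if (cp < 0x10000) {
        out[0] = (uint_least16_t)cp;
        return 1;
    }
    if (cp > 0x10FFFF)
        return 0;               /* beyond the reach of a surrogate pair */
    cp -= 0x10000;              /* 20 bits remain */
    out[0] = (uint_least16_t)(0xD800 + (cp >> 10));    /* high surrogate */
    out[1] = (uint_least16_t)(0xDC00 + (cp & 0x3FF));  /* low surrogate */
    return 2;
}
```

U+10FFFF comes out as the pair 0xDBFF 0xDFFF, the largest pair there is,
so 0x110000 and anything up to 0x7FFFFFFF has no UTF-16 form.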
--
Keith Thompson (The_Other_Keith) kst-***@mib.org <http://www.ghoti.net/~kst>
Working, but not speaking, for JetHead Development, Inc.
"We must do something. This is something. Therefore, we must do this."
-- Antony Jay and Jonathan Lynn, "Yes Minister"
Tim Rentsch
2016-11-23 19:00:54 UTC
Post by j***@verizon.net
Post by Tim Rentsch
Post by j***@verizon.net
Post by Tim Rentsch
Note: to save space, I'm only going to refer to char16_t; but
everything I say about char16_t has an obvious char32_t analog. The
...
Post by Tim Rentsch
Post by j***@verizon.net
The justification for defining mbsrtoc16s(), despite the fact that
mbrtoc16() already exists, is the convenience factor. [...]
That is a plausible motivating factor. It is not however the only
such factor, and it may or may not be one as far as M. Krause is
concerned. I still am interested to hear his answer.
As to determining suitability, IMO saying some new feature would be
convenient is not by itself sufficient justification to warrant its
inclusion in the Standard. There really should be some further
motivation beyond that.
It would be helpful to know what the "further motivation" was that,
in your opinion, justified the addition of the corresponding wide
character string functions - or, is it your opinion that there was
no such motivation, and that they should therefore be dropped? To
my mind, convenience would seem to be the only justification for
those functions, and it also seems a sufficient justification, so
I've never worried about whether there's any further motivation.
I have no idea what arguments were offered to motivate those
functions, or indeed if there were any such arguments at all.
Since I don't know what arguments were offered, I'm not in a
position to say that the same arguments apply - maybe they
do, and maybe they don't, but either way I don't know.
Post by j***@verizon.net
Post by Tim Rentsch
Post by j***@verizon.net
Post by Tim Rentsch
I knew that. The question was meant in the sense of what
applications of those functions are expected.
I've no specific idea, but multi-byte character strings are
ubiquitous, and 16-bit encodings are not unheard of (I gather that
Windows uses UTF-16). Therefore, the possibility that there's a
shortage of applications which have a need to convert strings
between such encodings is not one I'm willing to bother worrying
about. YMMV. Since these functions don't exist yet, obviously
any such application is currently using some other method for
performing such conversions - however, I'd expect at least some of
the developers of such code to be happy to switch to a C standard
library function, as soon as they became sufficiently widely
available.
So, the bottom line is you really don't know?
I don't "know" anything about reality; all I have is varying
degrees of certainty about various statements about reality, which
is never either exactly 0% or exactly 100%. I'm sufficiently sure,
for the reasons given above, that the need exists, that I'm not
going to bother worrying about the possibility that it doesn't. If
those reasons aren't sufficient for you, that's fine - you should
investigate further - but I see no need to do so.
Let me put my question differently. Am I right in saying that
your earlier comments are just speculation, in the sense that
you don't have any concrete evidence or examples to offer?
Post by j***@verizon.net
Post by Tim Rentsch
Post by j***@verizon.net
I'd been thinking in terms of a direct conversion, but I suppose
using wchar_t as an intermediary might have advantages. However,
if that's the case, then the conversion routines between wchar_t
and char16_t should be added to the standard library.
What I was trying to do is make sure the semantics are well-defined
and also consistent with a wchar_t representation, not describe an
implementation. ...
The standard imposes some requirements on the representation of
multi-byte characters, wchar_t, char16_t or char32_t, but not enough
to mandate that conversions between any pair of those types are
invertible. If any of those conversions is not invertible, forcing
the translation between any two of those types to go through a
particular third type might make the conversion unnecessarily lossy.
I wouldn't mind it if the standard added words requiring that some
or all of those conversions be invertible.
What I think are the important round trips, ie, those starting
and ending with multi-byte characters, cannot be made invertible
because some encodings (that the Standard wants to allow) are
inherently potentially redundant. But it might be enough to
say that a round-trip operation must be idempotent, ie, applying
it twice is the same as applying it once.
Post by j***@verizon.net
Post by Tim Rentsch
Post by j***@verizon.net
I'm not entirely clear how to use the char16_t functions either,
despite having carefully read their complete description. That's
part of the reason why I wouldn't mind having string-oriented
versions written by the library implementor, rather than having to
write the equivalent code myself.
This seems like an odd thing to say. If you aren't sure how the
*c16* functions work, how can you evaluate whether some additional
functions should be added?
I think they should be added, precisely because I don't know how the
single character functions should be used, despite having read those
descriptions.
Have you tried to write any code that uses them? If you did, that
might alleviate some of your uncertainty.
Post by j***@verizon.net
That means that those descriptions are at the very
least, obscure, so I'm probably not the only person unsure about the
matter. Anyone who implements the single-character functions must
understand how they are to be used, and should therefore be capable
of implementing the string-related functions better than I could. I
might not be able to evaluate whether they did it right, but I could
at least choose to trust that they've done so.
I don't think it follows necessarily that the functions are hard
to understand. It may be simply that you are distracted by other
things (eg, your kids) and haven't had time to look at them
carefully. My guess is that in fact you would have no trouble
if you could take some time to look at them and perhaps if it were
important to do so, eg, as part of a work assignment. I agree
the functions are a little weird but they are not that difficult.
Post by j***@verizon.net
Post by Tim Rentsch
... Furthermore, if the already existing
multi-byte string conversion functions are any indication, new
functions for dealing with charXX_t strings will be defined in
terms of the more elementary charXX_t character conversion
functions. So if you don't yet understand the existing char16_t
conversion functions, there's a good chance that would carry over
to new char16_t string conversion functions that make use of
them (in the as-if sense, I mean).
Not necessarily - the definition by the standard of how those string
functions make use of the single character functions might, if
sufficiently well written, resolve my current uncertainties about
how they should be used. The explanation should be sufficiently
well-written to allow implementors to implement it correctly, which
should be good enough for me to understand it.
We aren't saying anything different here. If there (only) is a
good chance that X is true, then it is not necessarily so that
X is true.
Post by j***@verizon.net
In particular, if one of the single-character functions is currently
defined in a way that makes it impossible to use it while
implementing the corresponding string function (a possibility that
is within the range of my current uncertainty about them), then that
would, in my opinion, be a defect in the current standard.
There is an open Defect Report on a question related to that.
Post by j***@verizon.net
If that
is the case, being forced to write up a description of the string
functions would allow the committee to realize that there was a
defect in the description of the corresponding single character
function, and correct it.
To me this seems a bit bass-ackwards. If the Standard has a
potential defect (as indeed has already been identified), it
should be fairly easy to determine whether there is in fact
a defect based on already known use cases. I fully expect
that here there will be no problem in identifying a defect,
either one of how the documentation is written or one of
how the semantics are defined. Trying to write a description
for some new functionality would only muddy the waters.

Much of the discussion we've had has been fairly abstract. I
think it would be good to make it more concrete. So here are
definitions for two of the functions you alluded to above:

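size_t
mbsrtoc16s( char16_t *out, const char **in, size_t n, mbstate_t *state ){
    size_t m = 0;
    while( m < n ){
        size_t k = mbrtoc16( out+m, *in, 1, state );  /* feed one byte at a time */
        /**/ if( k == 0 ) return m;                   /* terminator stored, not counted */
        else if( k < -3 ) m++, *in += k;              /* one unit produced from k bytes */
        else if( k == -3 ) m++;                       /* further unit, no input consumed */
        else if( k == -2 ) *in += 1;                  /* incomplete so far, need more bytes */
        else if( k == -1 ) return -1;                 /* encoding error */
        else assert(0);
    }
    return m;
}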

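size_t
c16srtombs( char *out, const char16_t **in, size_t n, mbstate_t *state ){
    mbstate_t r = *state;               /* working copy, committed per unit below */
    char bytes[ MB_LEN_MAX ];
    size_t m = 0;

    do {
        size_t k = c16rtomb( bytes, **in, &r );
        if( k == -1 ) return -1;        /* encoding error */
        if( m+k > n ) return m;         /* no room: stop before this unit */
        memcpy( out+m, bytes, k );
        m += k;
        *in += 1;
        *state = r;
    } while( m < 1 || out[m-1] != 0 );  /* until the last byte written is the terminator */

    return m-1;                         /* terminator is not counted */
}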

A few comments:

(1) I didn't implement the special functionality for when 'out'
is null. It should be easy to add this if anyone wants it.

(2) It assumes the open DR for c16rtomb has been addressed
appropriately. More specifically, it makes use of a modified
c16rtomb() that handles surrogate pairs correctly.

(3) Obviously there are several performance improvements that
might be made. I wrote the code just very straightforwardly,
with no attention given to performance concerns.

(4) The code shown is tested and working, although it was not
tested as thoroughly as my normal process would call for. I
did test round trips for every UTF-16 code point.
Philipp Klaus Krause
2016-11-23 21:59:35 UTC
You need to have a static mbstate_t in each function to use in case
state is 0.

Philipp
Tim Rentsch
2016-12-01 19:10:25 UTC
Post by Philipp Klaus Krause
You need to have a static mbstate_t in each function to use in case
state is 0.
Yes, I should have mentioned that in my comments. Thank you
for remarking on it.
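
A minimal sketch of the fallback Philipp describes, assuming the same loop as the mbsrtoc16s() posted above. The name demo_mbsrtoc16s is hypothetical, and the sketch still assumes 'out', 'in' and '*in' are non-null. It relies on the fact that a zero-valued mbstate_t describes an initial conversion state, which is what an object with static storage duration starts out as.

```c
#include <assert.h>
#include <stddef.h>
#include <uchar.h>
#include <wchar.h>

/* Hypothetical name. Same loop as the mbsrtoc16s() posted above, plus a
   private state that is used when the caller passes a null 'state'. */
size_t
demo_mbsrtoc16s( char16_t *out, const char **in, size_t n, mbstate_t *state ){
    static mbstate_t internal;   /* static storage: starts zero-valued (initial state) */
    size_t m = 0;

    assert( out != NULL && in != NULL && *in != NULL );
    if( state == NULL ) state = &internal;

    while( m < n ){
        size_t k = mbrtoc16( out+m, *in, 1, state );
        if( k == 0 ) return m;                    /* terminator stored, not counted */
        else if( k == (size_t)-1 ) return k;      /* encoding error */
        else if( k == (size_t)-2 ) *in += 1;      /* incomplete so far: feed the next byte */
        else if( k == (size_t)-3 ) m++;           /* further unit, no input consumed */
        else m++, *in += k;                       /* one unit produced from k bytes */
    }
    return m;
}
```

Note that a shared static state makes the null-'state' path non-reentrant, which is the usual trade-off for this convention.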
Philipp Klaus Krause
2016-11-23 22:13:42 UTC
Post by j***@verizon.net
Note: to save space, I'm only going to refer to char16_t; but
everything I say about char16_t has an obvious char32_t analog. The
only significant asymmetry is that char16_t is guaranteed to have
multi-element encodings if __STDC_UTF_16__ is pre#defined by the
implementation, while char32_t will only have multi-element encodings
if __STDC_UTF_32__ is NOT pre#defined.
Subject: mbstoc16s(), mdbstoc32s(), c16stombs(), c32stombs()
Since mbrtoc16() and c16rtomb() both exist, while mbtoc16() and
c16tomb() do not, I think it would be more appropriate to define
char16_t functions analogous to mbsrtowcs() and wcsrtombs() rather
than mbstowcs() and wcstombs().
Unfortunately, the restartable functions are a lot heavier than the
non-restartable ones.
E.g. in SDCC for STM8 mbrtowc() has twice the code size (437 vs. 227
bytes) and three times the stack memory consumption (51 vs 17 bytes) of
mbtowc(). Since SDCC targets tiny 8-bit systems this difference matters
a lot.

But even for big systems there is a problem: At the WG14 meeting in
London some other compiler developers told me that the performance of
the restartable functions is unacceptable for some of their customers,
and that they recommend use of the non-restartable functions instead.

Philipp
Tim Rentsch
2016-12-01 19:16:34 UTC
Post by Philipp Klaus Krause
Note: to save space, I'm only going to refer to char16_t; but
everything I say about char16_t has an obvious char32_t analog. The
only significant asymmetry is that char16_t is guaranteed to have
multi-element encodings if __STDC_UTF_16__ is pre#defined by the
implementation, while char32_t will only have multi-element encodings
if __STDC_UTF_32__ is NOT pre#defined.
Subject: mbstoc16s(), mdbstoc32s(), c16stombs(), c32stombs()
Since mbrtoc16() and c16rtomb() both exist, while mbtoc16() and
c16tomb() do not, I think it would be more appropriate to define
char16_t functions analogous to mbsrtowcs() and wcsrtombs() rather
than mbstowcs() and wcstombs().
Unfortunately, the restartable functions are a lot heavier than the
non-restartable ones.
E.g. in SDCC for STM8 mbrtowc() has twice the code size (437 vs. 227
bytes) and three times the stack memory consumption (51 vs 17 bytes) of
mbtowc(). Since SDCC targets tiny 8-bit systems this difference matters
a lot.
But even for big systems there is a problem: At the WG14 meeting in
London some other compiler developers told me that the performance of
the restartable functions is unacceptable for some of their customers,
and that they recommend use of the non-restartable functions instead.
I see how that would be true for character-at-a-time functions.
But surely the string-level functions could be optimized for
common cases so that most of the time they would run nearly as
fast as non-restartable ones. Do you agree with that, or is
there some fundamental reason why that cannot be so?
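
One way such a common-case optimization might look, as a sketch only. It assumes the multibyte encoding is an ASCII superset without shift states (UTF-8, for example), that char16_t holds UTF-16, and that the conversion state is the initial one on entry. None of that is guaranteed by the Standard. The name demo_ascii_prefix is hypothetical; a real string function would hand the remaining input to the general mbrtoc16() loop.

```c
#include <assert.h>
#include <stddef.h>
#include <uchar.h>

/* Hypothetical helper: copies the leading run of ASCII bytes straight into
   'out', without touching any mbstate_t. Returns the number of units written;
   the caller continues with the general path from in + that count.
   ASSUMPTION: ASCII-compatible, stateless multibyte encoding. */
size_t
demo_ascii_prefix( char16_t *out, const char *in, size_t n ){
    size_t m = 0;

    assert( out != NULL && in != NULL );
    while( m < n && in[m] != '\0' && (unsigned char)in[m] < 0x80 ){
        out[m] = (char16_t)(unsigned char)in[m];   /* ASCII maps 1:1 to UTF-16 */
        m++;
    }
    return m;
}
```

For text that is mostly ASCII this loop does the bulk of the work, and the restartable machinery is only entered for the occasional multi-byte sequence.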
Philipp Klaus Krause
2016-11-23 22:16:28 UTC
Post by Tim Rentsch
Would the functions you are suggesting provide only a performance
improvement (in both speed and size), or would they offer something
more? If it's something more than just performance, what is that?
If it is only for reasons of speed/size improvement, what sort of
gains can be expected?
The new functions would

1) Provide better performance and code size by being non-restartable
2) Be more convenient, by allowing conversion of whole strings at
a time

Philipp
Tim Rentsch
2016-12-01 19:19:22 UTC
Post by Philipp Klaus Krause
Post by Tim Rentsch
Would the functions you are suggesting provide only a performance
improvement (in both speed and size), or would they offer something
more? If it's something more than just performance, what is that?
If it is only for reasons of speed/size improvement, what sort of
gains can be expected?
The new functions would
1) Provide better performance and code size by being non-restartable
2) Be more convenient, by allowing conversion of whole strings at
a time
Given these motivations I would want to see a detailed
description of the function(s) behavior before offering
any more definite opinion.
Philipp Klaus Krause
2016-12-09 14:43:09 UTC
Post by Tim Rentsch
Given these motivations I would want to see a detailed
description of the function(s) behavior before offering
any more definite opinion.
http://colecovision.eu/stuff/proposal-mbstoc16s

Philipp
Tim Rentsch
2016-12-10 06:35:13 UTC
Post by Philipp Klaus Krause
Post by Tim Rentsch
Given these motivations I would want to see a detailed
description of the function(s) behavior before offering
any more definite opinion.
http://colecovision.eu/stuff/proposal-mbstoc16s
I took a copy of this webpage and read through it.
More comments in next followup.

You might want to fix up the formatting a bit.
Philipp Klaus Krause
2016-11-23 22:33:15 UTC
Post by Tim Rentsch
Do you have a specific task that you are hoping to accomplish, or
is the motivation just one of general goodness? If there is a
specific task (or tasks plural), what might that/those be?
SDCC is used to program 8-bit systems, including USB devices. Some
protocols supported on some 8-bit devices, such as USB, use UTF-16. I
want to provide a good way of dealing with UTF-16 for the users of SDCC.
IMO, convenient and efficient (code, size, memory usage, performance)
conversion functions from and to multibyte strings would be useful there.

Philipp
James R. Kuyper
2016-11-23 23:04:25 UTC
Post by Philipp Klaus Krause
Post by Tim Rentsch
Do you have a specific task that you are hoping to accomplish, or
is the motivation just one of general goodness? If there is a
specific task (or tasks plural), what might that/those be?
SDCC is used to program 8-bit systems, including USB devices. Some
protocols supported on some 8-bit devices, such as USB, use UTF-16. I
want to provide a good way of dealing with UTF-16 for the users of SDCC.
IMO, convenient and efficient (code, size, memory usage, performance)
conversion functions from and to multibyte strings would be useful there.
A need to provide a way to deal with UTF-16 is sufficient to justify
creation of interfaces using char16_t; it is not sufficient to justify
needing routines to convert between char16_t strings and multi-byte
character strings. I'm not saying that you can't justify it - I expect
that to be easy - but you haven't done so yet. What kind of information
do you expect SDCC users to receive in mbs format, and need to convert
to c16s format, or vice versa? Why couldn't it be wcs rather than mbs?
Philipp Klaus Krause
2016-11-24 08:57:30 UTC
Post by James R. Kuyper
Post by Philipp Klaus Krause
Post by Tim Rentsch
Do you have a specific task that you are hoping to accomplish, or
is the motivation just one of general goodness? If there is a
specific task (or tasks plural), what might that/those be?
SDCC is used to program 8-bit systems, including USB devices. Some
protocols supported on some 8-bit devices, such as USB, use UTF-16. I
want to provide a good way of dealing with UTF-16 for the users of SDCC.
IMO, convenient and efficient (code, size, memory usage, performance)
conversion functions from and to multibyte strings would be useful there.
A need to provide a way to deal with UTF-16 is sufficient to justify
creation of interfaces using char16_t; it is not sufficient to justify
needing routines to convert between char16_t strings and multi-byte
character strings. I'm not saying that you can't justify it - I expect
that to be easy - but you haven't done so yet. What kind of information
do you expect SDCC users to receive in mbs format, and need to convert
to c16s format, or vice versa? Why couldn't it be wcs rather than mbs?
mbs is what typically needs the least amount of memory, what developers
understand best, and what is best supported by the standard library and
third-party libraries. Having good conversion functions between mbs and
UTF-16 easily lets developers do most of their processing in mbs and
then just convert to / from UTF-16 where UTF-16 is needed.

Philipp
Tim Rentsch
2016-12-01 19:33:19 UTC
Post by Philipp Klaus Krause
Post by James R. Kuyper
Post by Tim Rentsch
Do you have a specific task that you are hoping to accomplish, or
is the motivation just one of general goodness? If there is a
specific task (or tasks plural), what might that/those be?
SDCC is used to program 8-bit systems, including USB devices. Some
protocols supported on some 8-bit devices, such as USB, use UTF-16. I
want to provide a good way of dealing with UTF-16 for the users of SDCC.
IMO, convenient and efficient (code, size, memory usage, performance)
conversion functions from and to multibyte strings would be useful there.
A need to provide a way to deal with UTF-16 is sufficient to justify
creation of interfaces using char16_t; it is not sufficient to justify
needing routines to convert between char16_t strings and multi-byte
character strings. I'm not saying that you can't justify it - I expect
that to be easy - but you haven't done so yet. What kind of information
do you expect SDCC users to receive in mbs format, and need to convert
to c16s format, or vice versa? Why couldn't it be wcs rather than mbs?
mbs is what typically needs the least amount of memory, what developers
understand best, and what is best supported by the standard library and
third-party libraries. Having good conversion functions between mbs and
UTF-16 easily lets developers do most of their processing in mbs and
then just convert to / from UTF-16 where UTF-16 is needed.
ISTM that having a function for converting between wchar_t and
UTF-16 might be a reasonable design choice here. Obviously there
are tradeoffs between speed, transient space usage, long-term
space usage, and library size. What is important to optimize?
Tim Rentsch
2016-12-01 19:24:54 UTC
Post by Tim Rentsch
Do you have a specific task that you are hoping to accomplish, or
is the motivation just one of general goodness? If there is a
specific task (or tasks plural), what might that/those be?
SDCC is used to program 8-bit systems, including USB devices. Some
protocols supported on some 8-bit devices, such as USB, use UTF-16. I
want to provide a good way of dealing with UTF-16 for the users of SDCC.
IMO, convenient and efficient (code, size, memory usage, performance)
conversion functions from and to multibyte strings would be useful there.
Is it important for the conversion functions you want to be
locale-dependent? Is there some reason you couldn't just write a
function (either locale-dependent or not) as part of a library
extension and advise your users to use that?
Philipp Klaus Krause
2016-12-09 14:39:45 UTC
Post by Tim Rentsch
Post by Tim Rentsch
Do you have a specific task that you are hoping to accomplish, or
is the motivation just one of general goodness? If there is a
specific task (or tasks plural), what might that/those be?
SDCC is used to program 8-bit systems, including USB devices. Some
protocols supported on some 8-bit devices, such as USB, use UTF-16. I
want to provide a good way of dealing with UTF-16 for the users of SDCC.
IMO, convenient and efficient (code, size, memory usage, performance)
conversion functions from and to multibyte strings would be useful there.
Is it important for the conversion functions you want to be
locale-dependent? Is there some reason you couldn't just write a
function (either locale-dependent or not) as part of a library
extension and advise your users to use that?
There are no locale-dependencies in SDCC at all. But the new functions
for char16_t and char32_t would nicely fit into the C standard library,
as the standard library already has similar functions for wchar_t.

Philipp
Tim Rentsch
2016-12-10 07:08:53 UTC
Post by Philipp Klaus Krause
Post by Tim Rentsch
Post by Tim Rentsch
Do you have a specific task that you are hoping to accomplish, or
is the motivation just one of general goodness? If there is a
specific task (or tasks plural), what might that/those be?
SDCC is used to program 8-bit systems, including USB devices. Some
protocols supported on some 8-bit devices, such as USB, use UTF-16. I
want to provide a good way of dealing with UTF-16 for the users of SDCC.
IMO, convenient and efficient (code, size, memory usage, performance)
conversion functions from and to multibyte strings would be useful there.
Is it important for the conversion functions you want to be
locale-dependent? Is there some reason you couldn't just write a
function (either locale-dependent or not) as part of a library
extension and advise your users to use that?
There are no locale-dependencies in SDCC at all.
It isn't obvious to me that this is even conforming, assuming
that the "" locale is UTF-8 or something like that. Isn't it the
case that the "C" locale is supposed to be a minimal environment
for C translation? In an operating environment where files
have 8-bit bytes, I would expect the C locale would be limited
to 8-bit characters.
Post by Philipp Klaus Krause
But the new functions
for char16_t and char32_t would nicely fit into the C standard library,
as the standard library already has similar functions for wchar_t.
Here are my reservations.

1. I'm not sure they do. The behavior for these types (and
char16_t in particular) may be markedly different than that
of wchar_t (eg, having multi-unit characters).

2. The functions you're proposing are meant to fill an
application need, but it isn't obvious what those needs
actually are. Any proposal to add such functions to the
Standard should include use cases from actual applications to
show their appropriateness and effectiveness.

3. I didn't see in your write-up any mention of what happens
when shifted states or multi-unit characters (eg in char16_t)
come up, especially near the end of a buffer. The concerns
don't arise with wchar_t but they do with char16_t (and in
principle with char32_t) so they must be addressed in the
semantic descriptions.
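
To make the end-of-buffer concern in point 3 concrete, here is a sketch of the UTF-16 side only, assuming char16_t holds UTF-16 (that is, __STDC_UTF_16__ is defined). The name demo_put_utf16 and its return convention are hypothetical and are not taken from the proposal. The point is that a code point above U+FFFF needs two units, so a string function has to decide what happens when only one unit of room is left. This sketch writes nothing in that case.

```c
#include <assert.h>
#include <stddef.h>
#include <uchar.h>

/* Hypothetical helper: encodes one code point as UTF-16 into 'out', which has
   room for 'room' units. Returns the number of units written (1 or 2),
   0 if the code point does not fit (nothing is written, so a surrogate pair
   is never split), or (size_t)-1 if 'cp' is not a Unicode scalar value. */
size_t
demo_put_utf16( char16_t *out, size_t room, char32_t cp ){
    assert( out != NULL );

    if( cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF) ) return (size_t)-1;

    if( cp < 0x10000 ){
        if( room < 1 ) return 0;
        out[0] = (char16_t)cp;
        return 1;
    }

    if( room < 2 ) return 0;                         /* would split the pair */
    cp -= 0x10000;
    out[0] = (char16_t)( 0xD800 + (cp >> 10) );      /* high surrogate */
    out[1] = (char16_t)( 0xDC00 + (cp & 0x3FF) );    /* low surrogate */
    return 2;
}
```

A semantic description would need to say which of the possible behaviors applies here: stopping early as above, writing a lone high surrogate, or reporting an error.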

By the way, is it not the case that your implementation is
a free-standing implementation, not a hosted one? If that
is true you could just add whatever functions you want to
the header(s) in question. In fact doing that may be the
best way to start down the path of getting these functions
accepted in a future Standard.

Don't get me wrong, I'm not saying what you're suggesting is a
bad idea. But what I've seen doesn't yet convince me it's a
good idea either, which means the proposal needs more work.
My intention is to identify some areas that are weak and in
need of stronger arguments.
Philipp Klaus Krause
2016-12-09 11:20:25 UTC
Here is a first proposal for the semantics of these functions:

http://colecovision.eu/stuff/proposal-mbstoc16s

Philipp