Post by Philipp Klaus Krause
On reading this posting I went back and looked again (and somewhat
more carefully this time) through the descriptions of these
functions. My view has softened a bit. I still think the c16rtomb()
function is meant to work only on single complete characters, and not
encoded sequences of characters. However, I agree that there does
seem to be room for argument, so I can't put my confidence at 100%.
So if someone wants the question resolved definitively, as of right
now I think that means writing a Defect Report to get an official
I intend to submit a defect report; here is the current draft summary:
Section 7.28.1 describes the function c16rtomb(). In particular, it
states "When c16 is not a valid wide character, an encoding error occurs".
"wide character" is defined in section 3.7.3 as "value representable by
an object of type wchar_t, capable of representing any character in the
This wording seems to imply that, e.g. for the common case of UTF-8 char
and UTF-16 char16_t, c16rtomb() will return (size_t)(-1) when it
encounters a character that is encoded as multiple char16_t. In
particular, c16rtomb() will not be able to process strings generated
by mbrtoc16().
On the other hand, the description of mbrtoc16() in section 7.28.1
states "If the function determines that the next multibyte character
is complete and valid, it determines the values of the corresponding
wide characters". So it considers it possible that a single multibyte
character translates into multiple wide characters. So maybe the
meaning of "wide character" in section 7.28.1 is different from the
definition of "wide character" in section 3.7.3.
In either case, the intended behaviour of c16rtomb() for characters
encoded as multiple char16_t is unclear.
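To make "a character encoded as multiple char16_t" concrete, here is a minimal sketch of the UTF-16 arithmetic involved, assuming char16_t holds UTF-16 code units. The helper name sketch_to_utf16() is made up for illustration and is not part of <uchar.h>.

```c
#include <stddef.h>
#include <uchar.h>

/* Hypothetical helper (not a standard function): encode one Unicode
   code point as UTF-16. Returns the number of char16_t units written
   (1 or 2), or 0 if cp is not a valid Unicode scalar value. */
size_t sketch_to_utf16(char32_t cp, char16_t out[2])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;                       /* out of range or a surrogate */
    if (cp < 0x10000) {
        out[0] = (char16_t)cp;          /* fits in a single unit */
        return 1;
    }
    cp -= 0x10000;                      /* 20 bits remain */
    out[0] = (char16_t)(0xD800 + (cp >> 10));    /* high surrogate */
    out[1] = (char16_t)(0xDC00 + (cp & 0x3FF));  /* low surrogate */
    return 2;
}
```

For example, U+1F600 comes out as the two units 0xD83D 0xDE00. Neither unit represents a character on its own, which is why a c16rtomb() that looks at one char16_t at a time has to either remember the first unit or report an error.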
I have two suggestions, both of which are offered to adopt or not
entirely as you see fit.
First, if I were writing this DR myself, I would probably start off
with something like this:
I'm trying to understand the behavior of restartable conversion
functions in <uchar.h>, described in section 7.28.1. My question
is about c16rtomb() but let me start with mbrtoc16(). Suppose
we have an implementation that defines __STDC_UTF_16__, and a
program operating with a current locale that uses UTF-8. For
some UTF-8 sequences, mbrtoc16() will produce two successive 16-bit
characters (so-called "surrogate pairs") for a single multi-byte
input character. What happens when c16rtomb() is called to process
these characters? More specifically, ...etc...
Second, I would try to write an implementation of c16rtomb() that
works in the specific case mentioned in the last paragraph, i.e.,
processes a surrogate pair sequence "correctly", and also complies
(if possible) with the description in the Standard. If I manage to
come up with an implementation that is at least plausibly
conforming, I would present it (possibly in summary) and ask if it
is conforming. If I can't find such an implementation, I would say
something like "I would like to implement c16rtomb() so that it
works on two-character sequences like those mentioned above, but I
don't see how to do that so that it is conforming. Is this what is
intended for this function? Does it only work on single char16_t
that are complete characters by themselves?"
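As a rough illustration of the kind of implementation meant here, the following is a sketch only, assuming UTF-8 as the multibyte encoding and UTF-16 in char16_t. The names sketch_state, sketch_put_utf8() and sketch_c16rtomb() are made up for this example; sketch_state stands in for mbstate_t, whose contents are opaque, and the sketch skips the null-pointer cases that the real function must handle. Whether returning 0 for the first half of a pair is conforming is exactly the open question.

```c
#include <stddef.h>
#include <uchar.h>

/* Hypothetical stand-in for mbstate_t: remembers a pending high surrogate. */
typedef struct { char16_t pending_high; } sketch_state;

/* Encode one code point as UTF-8; returns the number of bytes written. */
static size_t sketch_put_utf8(char *s, char32_t cp)
{
    if (cp < 0x80) { s[0] = (char)cp; return 1; }
    if (cp < 0x800) {
        s[0] = (char)(0xC0 | (cp >> 6));
        s[1] = (char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        s[0] = (char)(0xE0 | (cp >> 12));
        s[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
        s[2] = (char)(0x80 | (cp & 0x3F));
        return 3;
    }
    s[0] = (char)(0xF0 | (cp >> 18));
    s[1] = (char)(0x80 | ((cp >> 12) & 0x3F));
    s[2] = (char)(0x80 | ((cp >> 6) & 0x3F));
    s[3] = (char)(0x80 | (cp & 0x3F));
    return 4;
}

/* Sketch of a surrogate-aware c16rtomb(): returns the number of bytes
   stored in s, 0 when a high surrogate was only recorded in *ps, or
   (size_t)-1 on an ill-formed sequence. s must have room for 4 bytes. */
size_t sketch_c16rtomb(char *s, char16_t c16, sketch_state *ps)
{
    if (ps->pending_high != 0) {
        char32_t hi = ps->pending_high;
        ps->pending_high = 0;
        if (c16 < 0xDC00 || c16 > 0xDFFF)
            return (size_t)-1;          /* high surrogate without a low one */
        return sketch_put_utf8(s, 0x10000 + ((hi - 0xD800) << 10)
                                          + (c16 - 0xDC00));
    }
    if (c16 >= 0xD800 && c16 <= 0xDBFF) {
        ps->pending_high = c16;         /* wait for the second half */
        return 0;
    }
    if (c16 >= 0xDC00 && c16 <= 0xDFFF)
        return (size_t)-1;              /* low surrogate with nothing pending */
    return sketch_put_utf8(s, c16);
}
```

With a zero-initialized state, feeding 0xD83D stores nothing and returns 0, and then feeding 0xDE00 stores the four UTF-8 bytes of U+1F600. The difficulty for the DR is that the Standard's wording seems to require an encoding error for the first call, since 0xD83D by itself is not a valid character.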
So, for what it's worth, those are my ideas and suggestions.