Discussion:
A function to convert char16_t strings to char strings
Philipp Klaus Krause
2015-11-14 14:25:39 UTC
For the usual scenario of UTF-16 char16_t strings and UTF-32 char32_t
strings, the C11 standard has functions to convert

char to char16_t : mbrtoc16()
char to char32_t : mbrtoc32()
char32_t to char : c32rtomb()

However what is missing is a function to convert from char16_t to char.
The function c16rtomb() cannot convert characters consisting of more
than one char16_t to char, so for Unicode it can only handle characters
in the basic multilingual plane.

I suggest adding a function to convert char16_t to char.
And maybe also a function to convert between char16_t and char32_t.

Philipp
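
For reference, the char16_t to char32_t direction needs no locale support when char16_t holds UTF-16. The following is a minimal sketch under that assumption; utf16_decode is a hypothetical name, not a C11 function, and the error convention (returning 0) is an arbitrary choice for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <uchar.h>

/* Hypothetical helper (not part of C11): decode one code point from a
 * UTF-16 sequence. Returns the number of char16_t units consumed (1 or 2),
 * or 0 if the input is empty or starts with an ill-formed surrogate. */
size_t utf16_decode(const char16_t *src, size_t len, char32_t *out)
{
    assert(src != NULL && out != NULL);
    if (len == 0)
        return 0;

    char16_t hi = src[0];
    if (hi < 0xD800 || hi > 0xDFFF) {   /* BMP code point: one unit */
        *out = hi;
        return 1;
    }
    if (hi >= 0xDC00 || len < 2)        /* lone low surrogate or truncated pair */
        return 0;

    char16_t lo = src[1];
    if (lo < 0xDC00 || lo > 0xDFFF)     /* high surrogate not followed by a low one */
        return 0;

    /* Each surrogate carries 10 bits of the code point minus 0x10000. */
    *out = 0x10000 + (((char32_t)hi - 0xD800) << 10) + ((char32_t)lo - 0xDC00);
    return 2;
}
```

The result could then be passed to c32rtomb(), which takes a whole code point per call and so avoids the ambiguity discussed in the rest of the thread.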
Tijl Coosemans
2015-11-15 14:12:06 UTC
Post by Philipp Klaus Krause
For the usual scenario of UTF-16 char16_t strings and UTF-32 char32_t
strings, the C11 standard has functions to convert
char to char16_t : mbrtoc16()
char to char32_t : mbrtoc32()
char32_t to char : c32rtomb()
However what is missing is a function to convert from char16_t to char.
The function c16rtomb() cannot convert characters consisting of more
than one char16_t to char, so for Unicode it can only handle characters
in the basic multilingual plane.
I suggest to add a function to convert char16_t to char.
And maybe also function to convert between char16_t and char32_t.
If two char16_t are needed to complete one multibyte character then
you need to call c16rtomb() twice. The first call only updates the
mbstate and does not produce any output.
James Kuyper
2015-11-15 14:51:21 UTC
Post by Tijl Coosemans
Post by Philipp Klaus Krause
For the usual scenario of UTF-16 char16_t strings and UTF-32 char32_t
strings, the C11 standard has functions to convert
char to char16_t : mbrtoc16()
char to char32_t : mbrtoc32()
char32_t to char : c32rtomb()
However what is missing is a function to convert from char16_t to char.
The function c16rtomb() cannot convert characters consisting of more
than one char16_t to char, so for Unicode it can only handle characters
in the basic multilingual plane.
I suggest to add a function to convert char16_t to char.
And maybe also function to convert between char16_t and char32_t.
If two char16_t are needed to complete one multibyte character then
you need to call c16rtomb() twice. The first call only updates the
mbstate and does not produce any output.
There is wording to that effect in the description of mbrtoc16():
"If the function determines that the next multibyte character is
complete and valid, it determines the values of the corresponding wide
characters and then, if pc16 is not a null pointer, stores the value of
the first (or only) such character in the object pointed to by pc16.
Subsequent calls will store successive wide characters without consuming
any additional input until all the characters have been stored."
(7.28.1.1p3)

Note the use of "values" and "wide characters" and the entire sentence
starting with "Subsequent". The value returned by mbrtoc16() gives you
the information that you need to determine whether additional calls are
needed.

There is no such wording in the description of c16rtomb():
"If s is not a null pointer, the c16rtomb function determines the number
of bytes needed to represent the multibyte character that corresponds to
the wide character given by c16 (including any shift sequences), and
stores the multibyte character representation in the array whose first
element is pointed to by s." (7.28.1.2p3)

Note, in particular, that it refers to "... THE wide character ..."
(emphasis added), and that the interface to c16rtomb() only gives it
access to a single char16_t value. The value returned by c16rtomb()
doesn't give you any clue about whether additional char16_t values need
to be processed in order to complete a single character.
--
James Kuyper
Philipp Klaus Krause
2015-11-16 10:35:01 UTC
Post by James Kuyper
[…]
Note, in particular, that it refers to "... THE wide character ..."
(emphasis added), and that the interface to c16rtomb() only gives it
access to a single char16_t value. The value returned by c16rtomb()
doesn't give you any clue about whether additional char16_t values need
to be processed in order to complete a single character.
But a wide character, by definition 3.7.3 is "[…] capable of
representing any character in the current locale".

Philipp
James Kuyper
2015-11-16 13:36:32 UTC
Post by Philipp Klaus Krause
Post by James Kuyper
[…]
Note, in particular, that it refers to "... THE wide character ..."
(emphasis added), and that the interface to c16rtomb() only gives it
access to a single char16_t value. The value returned by c16rtomb()
doesn't give you any clue about whether additional char16_t values need
to be processed in order to complete a single character.
But a wide character, by definition 3.7.3 is "[…] capable of
representing any character in the current locale".
That is immediately preceded by "value representable by an object of
type wchar_t,". Given that the standard very specifically allows an
implementation to pre#define __STDC_UTF_16__, with the meaning that
char16_t uses UTF-16 representation, 3.7.3 cannot be taken as requiring
that a single char16_t can represent all of the values that wchar_t is
required to represent.
Note that when 7.28.1.2p3 talks about "the wide character", it continues
by saying "given by c16". c16 is, in this context, a single char16_t
value. If c16rtomb() is intended to be able to deal with UTF-16
surrogate pairs, that's not at all clear from the description.
--
James Kuyper
Philipp Klaus Krause
2015-11-15 16:37:05 UTC
Post by Tijl Coosemans
Post by Philipp Klaus Krause
For the usual scenario of UTF-16 char16_t strings and UTF-32 char32_t
strings, the C11 standard has functions to convert
char to char16_t : mbrtoc16()
char to char32_t : mbrtoc32()
char32_t to char : c32rtomb()
However what is missing is a function to convert from char16_t to char.
The function c16rtomb() cannot convert characters consisting of more
than one char16_t to char, so for Unicode it can only handle characters
in the basic multilingual plane.
I suggest to add a function to convert char16_t to char.
And maybe also function to convert between char16_t and char32_t.
If two char16_t are needed to complete one multibyte character then
you need to call c16rtomb() twice. The first call only updates the
mbstate and does not produce any output.
But (from the publicly available C11 draft, description of c16rtomb()):

"When c16 is not a valid wide character, an encoding error occurs: the
function stores the value of the macro EILSEQ in errno and returns
(size_t)(-1); the conversion state is unspecified."

To me, this seems to imply that the char16_t passed to c16rtomb() needs
to be a valid character by itself, not just part of some sequence
encoding a character.

Philipp
Tijl Coosemans
2015-11-15 22:34:39 UTC
Post by Philipp Klaus Krause
Post by Tijl Coosemans
Post by Philipp Klaus Krause
For the usual scenario of UTF-16 char16_t strings and UTF-32 char32_t
strings, the C11 standard has functions to convert
char to char16_t : mbrtoc16()
char to char32_t : mbrtoc32()
char32_t to char : c32rtomb()
However what is missing is a function to convert from char16_t to char.
The function c16rtomb() cannot convert characters consisting of more
than one char16_t to char, so for Unicode it can only handle characters
in the basic multilingual plane.
I suggest to add a function to convert char16_t to char.
And maybe also function to convert between char16_t and char32_t.
If two char16_t are needed to complete one multibyte character then
you need to call c16rtomb() twice. The first call only updates the
mbstate and does not produce any output.
"When c16 is not a valid wide character, an encoding error occurs: the
function stores the value of the macro EILSEQ in errno and returns
(size_t)(-1); the conversion state is unspecified."
To me, this seems to imply that the char16_t passed to c16rtomb() needs
to be a valid character by itself, not just part of some sequence
encoding a character.
The word character is used with two different meanings here I believe.
There's the common definition of character as in a character of a
character set and there's the term wide character which is defined in
3.7 as a wchar_t value. There's no mention of charXX_t in 3.7 so the
meaning of wide character in the description of the unicode functions
is not entirely clear but it seems to me that it is meant to mean
charXX_t value and not a character from a character set. Otherwise
the description of mbrtocXX() doesn't make any sense because it says
that one multibyte character (a character from a character set) can
produce multiple wide characters as output (charXX_t values which
together encode a character from a character set).
James Kuyper
2015-11-15 23:08:15 UTC
Post by Tijl Coosemans
Post by Philipp Klaus Krause
Post by Tijl Coosemans
Post by Philipp Klaus Krause
For the usual scenario of UTF-16 char16_t strings and UTF-32 char32_t
strings, the C11 standard has functions to convert
char to char16_t : mbrtoc16()
char to char32_t : mbrtoc32()
char32_t to char : c32rtomb()
However what is missing is a function to convert from char16_t to char.
The function c16rtomb() cannot convert characters consisting of more
than one char16_t to char, so for Unicode it can only handle characters
in the basic multilingual plane.
I suggest to add a function to convert char16_t to char.
And maybe also function to convert between char16_t and char32_t.
If two char16_t are needed to complete one multibyte character then
you need to call c16rtomb() twice. The first call only updates the
mbstate and does not produce any output.
"When c16 is not a valid wide character, an encoding error occurs: the
function stores the value of the macro EILSEQ in errno and returns
(size_t)(-1); the conversion state is unspecified."
To me, this seems to imply that the char16_t passed to c16rtomb() needs
to be a valid character by itself, not just part of some sequence
encoding a character.
The word character is used with two different meanings here I believe.
There's the common definition of character as in a character of a
character set and there's the term wide character which is defined in
3.7 as a wchar_t value. There's no mention of charXX_t in 3.7 so the
meaning of wide character in the description of the unicode functions
is not entirely clear but it seems to me that it is meant to mean
charXX_t value and not a character from a character set. Otherwise
the description of mbrtocXX() doesn't make any sense because it says
that one multibyte character (a character from a character set) can
produce multiple wide characters as output (charXX_t values which
together encode a character from a character set).
What makes you think that this is not in fact the case? The standard
explicitly allows an implementation to pre#define __STDC_UTF_16__,
indicating that char16_t uses UTF-16 encoding, which means that all
Unicode code points from the supplementary planes are encoded using
surrogate pairs.

What would you expect the following code to do on a platform that
pre#defines __STDC_UTF_16__?

mbstate_t state = {0};
char mb[] = "\U00010437"; /* \U takes exactly eight hex digits */
char16_t c16[2];

mbrtoc16(&c16, mb, sizeof mb - 1, &state);
mbrtoc16(c16+1, mb, sizeof mb - 1, &state);

If I understand 7.28.1.1 correctly (which is not guaranteed), I would
expect c16[0] == 0xD801 && c16[1] == 0xDC37, based upon
<https://en.wikipedia.org/wiki/UTF-16#Examples>. I'd expect both calls
to return "sizeof mb - 1".
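
The expected values 0xD801 and 0xDC37 follow from the UTF-16 encoding rule and can be checked independently of any library. A small sketch, assuming char16_t holds UTF-16; utf16_encode is a hypothetical name used only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <uchar.h>

/* Hypothetical helper: encode one code point as UTF-16.
 * Returns the number of char16_t units written (1 or 2), or 0 if cp is
 * a surrogate value or lies beyond U+10FFFF. */
size_t utf16_encode(char32_t cp, char16_t out[2])
{
    assert(out != NULL);
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;                               /* surrogates are not code points to encode */
    if (cp > 0x10FFFF)
        return 0;                               /* outside the Unicode range */
    if (cp < 0x10000) {
        out[0] = (char16_t)cp;                  /* BMP: one unit */
        return 1;
    }
    cp -= 0x10000;                              /* 20 bits remain */
    out[0] = (char16_t)(0xD800 + (cp >> 10));   /* high (leading) surrogate */
    out[1] = (char16_t)(0xDC00 + (cp & 0x3FF)); /* low (trailing) surrogate */
    return 2;
}
```

For U+10437: 0x10437 - 0x10000 = 0x0437, whose upper ten bits are 0x001 and lower ten bits are 0x037, giving 0xD801 and 0xDC37.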
James Kuyper
2015-11-16 02:14:07 UTC
On 11/15/2015 06:08 PM, James Kuyper wrote:
...
Post by James Kuyper
char16_t c16[2];
mbrtoc16(&c16, mb, sizeof mb - 1, &state);
The '&' is, of course, an error.
--
James Kuyper
Tijl Coosemans
2015-11-16 09:12:16 UTC
Post by James Kuyper
Post by Tijl Coosemans
Post by Philipp Klaus Krause
Post by Tijl Coosemans
Post by Philipp Klaus Krause
For the usual scenario of UTF-16 char16_t strings and UTF-32 char32_t
strings, the C11 standard has functions to convert
char to char16_t : mbrtoc16()
char to char32_t : mbrtoc32()
char32_t to char : c32rtomb()
However what is missing is a function to convert from char16_t to char.
The function c16rtomb() cannot convert characters consisting of more
than one char16_t to char, so for Unicode it can only handle characters
in the basic multilingual plane.
I suggest to add a function to convert char16_t to char.
And maybe also function to convert between char16_t and char32_t.
If two char16_t are needed to complete one multibyte character then
you need to call c16rtomb() twice. The first call only updates the
mbstate and does not produce any output.
"When c16 is not a valid wide character, an encoding error occurs: the
function stores the value of the macro EILSEQ in errno and returns
(size_t)(-1); the conversion state is unspecified."
To me, this seems to imply that the char16_t passed to c16rtomb() needs
to be a valid character by itself, not just part of some sequence
encoding a character.
The word character is used with two different meanings here I believe.
There's the common definition of character as in a character of a
character set and there's the term wide character which is defined in
3.7 as a wchar_t value. There's no mention of charXX_t in 3.7 so the
meaning of wide character in the description of the unicode functions
is not entirely clear but it seems to me that it is meant to mean
charXX_t value and not a character from a character set. Otherwise
the description of mbrtocXX() doesn't make any sense because it says
that one multibyte character (a character from a character set) can
produce multiple wide characters as output (charXX_t values which
together encode a character from a character set).
What makes you think that this is not in fact the case?
I do think this is the case.
Post by James Kuyper
What would you expect the following code to do on a platform that
pre#defines __STDC_UTF_16__?
mbstate_t state = {0};
mbstate_t isn't necessarily a struct. You need to initialise it to zero
using memset.
Post by James Kuyper
char mb[] = "\U00010437";
char16_t c16[2];
mbrtoc16(&c16, mb, sizeof mb - 1, &state);
mbrtoc16(c16+1, mb, sizeof mb - 1, &state);
If I understand 7.28.1.1 correctly (which is not guaranteed), I would
expect c16[0] == 0xD801 && c16[1] == 0xDC37, based upon
<https://en.wikipedia.org/wiki/UTF-16#Examples>. I'd expect both calls
to return "sizeof mb - 1".
The first call reads sizeof(mb)-1 bytes that together form one multibyte
character (here character means member of a character set). Then it
determines how many wide characters (here character means char16_t object)
are needed to encode this character (from a character set). It stores
the first wide character in c16[0] and returns sizeof(mb)-1 (the number
of bytes read). The second call sees that there are more wide characters
in mbstate and stores the next one in c16[1]. Then it returns -3.
m***@yahoo.co.uk
2015-11-16 12:36:54 UTC
Post by Tijl Coosemans
Post by James Kuyper
mbstate_t state = {0};
mbstate_t isn't necessarily a struct. You need to initialise it to zero
using memset.
Hmm. memset is problematic too. The standard doesn't guarantee that it
sets pointers (or doubles) to anything meaningful. (I have used a
machine where memset'ing a pointer would not set it to NULL). Does the
standard guarantee that mbstate_t won't contain pointers? (Or does it
guarantee that memset'ing it will work?)
James Kuyper
2015-11-16 13:33:08 UTC
Post by m***@yahoo.co.uk
Post by Tijl Coosemans
Post by James Kuyper
mbstate_t state = {0};
mbstate_t isn't necessarily a struct. You need to initialise it to zero
using memset.
Hmm. memset is problematic too. The standard doesn't guarantee that it
sets pointers (or doubles) to anything meaningful. (I have used a
machine where memset'ing a pointer would not set it to NULL). Does the
standard guarantee that mbstate_t won't contain pointers? (Or does it
guarantee that memset'ing it will work?)
No, it does not. Which is one reason why initializing with {0} is better.
--
James Kuyper
Kaz Kylheku
2015-11-16 20:31:10 UTC
Post by James Kuyper
Post by m***@yahoo.co.uk
Post by Tijl Coosemans
Post by James Kuyper
mbstate_t state = {0};
mbstate_t isn't necessarily a struct. You need to initialise it to zero
using memset.
Hmm. memset is problematic too. The standard doesn't guarantee that it
sets pointers (or doubles) to anything meaningful. (I have used a
machine where memset'ing a pointer would not set it to NULL). Does the
standard guarantee that mbstate_t won't contain pointers? (Or does it
guarantee that memset'ing it will work?)
No, it does not. Which is one reason why initializing with {0} is better.
{ 0 } sometimes elicits warnings about the wrong nesting of initializers
and some such, because the struct begins with a struct member.

Over the past decade, perhaps longer, I've been using this trick,
taking advantage of the semantics of a static object being initialized
to all zero values without an explicit initializer:

{
static const some_type_t blank;
some_type_t instance = blank;

(The blank could be at file scope or in a block scope, depending on
where it is needed.)
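
Applied to mbstate_t, the trick described above might look like the following sketch. The names blank_state and fresh_state are hypothetical; the only standard facilities used are mbstate_t and mbsinit() from <wchar.h>.

```c
#include <assert.h>
#include <wchar.h>

/* A static object with no initializer is zero-initialized: arithmetic
 * members become 0 and pointer members become null pointers, whatever
 * bit patterns those values happen to have. */
static const mbstate_t blank_state;

/* Hypothetical helper: return a fresh conversion state by copying the
 * blank object. No braces (so no nesting warnings) and no memset. */
mbstate_t fresh_state(void)
{
    mbstate_t st = blank_state;
    return st;
}
```

Because a zero-valued mbstate_t describes an initial conversion state, mbsinit() should report nonzero for the copy.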
--
Music DIY Mailing List: http://www.kylheku.com/diy
ADA MP-1 Mailing List: http://www.kylheku.com/mp1
Keith Thompson
2015-11-16 20:48:59 UTC
Kaz Kylheku <***@kylheku.com> writes:
[...]
Post by Kaz Kylheku
{ 0 } sometimes elicits warnings about the wrong nesting of initializers
and some such, because the struct begins with a struct member.
[...]

Yes, it does.

gcc up to 4.9 will warn about this with -Wall. gcc 5.2.0 does not;
apparently the gcc maintainers decided it's a sufficiently common idiom.
I don't know the situation with other C compilers.

(The Linux distribution I use doesn't provide gcc-5.x in its package
management system; I've installed it myself from source.)
--
Keith Thompson (The_Other_Keith) kst-***@mib.org <http://www.ghoti.net/~kst>
Working, but not speaking, for JetHead Development, Inc.
"We must do something. This is something. Therefore, we must do this."
-- Antony Jay and Jonathan Lynn, "Yes Minister"
Tijl Coosemans
2015-11-16 15:16:52 UTC
Post by m***@yahoo.co.uk
Post by Tijl Coosemans
Post by James Kuyper
mbstate_t state = {0};
mbstate_t isn't necessarily a struct. You need to initialise it to zero
using memset.
Hmm. memset is problematic too. The standard doesn't guarantee that it
sets pointers (or doubles) to anything meaningful. (I have used a
machine where memset'ing a pointer would not set it to NULL). Does the
standard guarantee that mbstate_t won't contain pointers? (Or does it
guarantee that memset'ing it will work?)
The standard says that a zero valued mbstate_t is one of the ways to
describe the initial state. Zero valued doesn't imply that pointers
compare equal to NULL.

But anyway, initialising with {0} works too, so memset isn't needed here.
Keith Thompson
2015-11-16 18:50:26 UTC
Tijl Coosemans <***@coosemans.org> writes:
[...]
Post by Tijl Coosemans
The standard says that a zero valued mbstate_t is one of the ways to
describe the initial state. Zero valued doesn't imply that pointers
compare equal to NULL.
The phrase used by the standard is "initialized to zero". That doesn't
mean all-bits-zero, which can differ from the value set by initializing
to {0}.
Post by Tijl Coosemans
But anyway, initialising with {0} works too, so memset isn't needed here.
If a null pointer is all-bits-zero, and if mbstate_t contains one or
more pointer members, then memset will not necessarily work.
James Kuyper
2015-11-16 18:59:53 UTC
...
Post by Keith Thompson
Post by Tijl Coosemans
But anyway, initialising with {0} works too, so memset isn't needed here.
If a null pointer is all-bits-zero, and if mbstate_t contains one or
more pointer members, then memset will not necessarily work.
Is there a 'not' missing there?
Keith Thompson
2015-11-16 19:08:27 UTC
Post by James Kuyper
...
Post by Keith Thompson
Post by Tijl Coosemans
But anyway, initialising with {0} works too, so memset isn't needed here.
If a null pointer is all-bits-zero, and if mbstate_t contains one or
more pointer members, then memset will not necessarily work.
Is there a 'not' missing there?
Yes.

If a null pointer is *not* all-bits-zero ...
James Kuyper
2015-11-16 13:31:59 UTC
...
Post by Tijl Coosemans
Post by James Kuyper
Post by Tijl Coosemans
The word character is used with two different meanings here I believe.
There's the common definition of character as in a character of a
character set and there's the term wide character which is defined in
3.7 as a wchar_t value. There's no mention of charXX_t in 3.7 so the
meaning of wide character in the description of the unicode functions
is not entirely clear but it seems to me that it is meant to mean
charXX_t value and not a character from a character set. Otherwise
the description of mbrtocXX() doesn't make any sense because it says
that one multibyte character (a character from a character set) can
produce multiple wide characters as output (charXX_t values which
together encode a character from a character set).
What makes you think that this is not in fact the case?
I do think this is the case.
Sorry - I guess I got confused as to what you were claiming.
Post by Tijl Coosemans
Post by James Kuyper
What would you expect the following code to do on a platform that
pre#defines __STDC_UTF_16__?
mbstate_t state = {0};
mbstate_t isn't necessarily a struct. You need to initialise it to zero
using memset.
What data type could mbstate_t be that it can't be zero-initialized by
using {0}? mbstate_t is required to be "a complete object type other
than an array type" (7.29.1p2). 0 is a permitted initializer for any
arithmetic or pointer type. Braces are optionally permitted around the
initializer for any scalar type (6.7.9p11). mbstate_t is not allowed to
be an array type, but if it were, {0} would be allowed for that too.

The only types for which {0} is not a permitted initializer are, as far
as I can figure out, incomplete types and function types - mbstate_t is
not allowed to be either of those.

Note that the standard frequently uses the phrase "an mbstate_t object
initialized to zero". If this could not be achieved by initialization, but
only by a call to memset(), that phrase wouldn't seem appropriate.
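
The argument above can be illustrated with a short sketch: {0} is accepted for scalar, pointer and structure types alike, so it works for mbstate_t whatever its underlying type. The names struct nested and zero_init_demo are hypothetical, and some compilers may warn about missing braces on the nested structure while still accepting it.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical structure whose first member is itself a structure. */
struct nested {
    struct { int a; double d; } inner;
    char *p;
};

/* Returns 1 if every object initialized with {0} compares equal to zero
 * (or to a null pointer) and the mbstate_t is in its initial state. */
int zero_init_demo(void)
{
    int           i  = {0};   /* braces are optional around a scalar initializer */
    double        d  = {0};
    char         *p  = {0};   /* 0 is a null pointer constant */
    struct nested n  = {0};   /* remaining members are zero-initialized */
    mbstate_t     st = {0};

    return i == 0 && d == 0.0 && p == NULL
        && n.inner.a == 0 && n.inner.d == 0.0 && n.p == NULL
        && mbsinit(&st) != 0;
}
```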
Post by Tijl Coosemans
Post by James Kuyper
char mb[] = "\U00010437";
char16_t c16[2];
mbrtoc16(&c16, mb, sizeof mb - 1, &state);
mbrtoc16(c16+1, mb, sizeof mb - 1, &state);
If I understand 7.28.1.1 correctly (which is not guaranteed), I would
expect c16[0] == 0xD801 && c16[1] == 0xDC37, based upon
<https://en.wikipedia.org/wiki/UTF-16#Examples>. I'd expect both calls
to return "sizeof mb - 1".
The first call reads sizeof(mb)-1 bytes that together form one multibyte
character (here character means member of a character set). Then it
determines how many wide characters (here character means char16_t object)
are needed to encode this character (from a character set). It stores
the first wide character in c16[0] and returns sizeof(mb)-1 (the number
of bytes read). The second call sees that there are more wide characters
in mbstate and stores the next one in c16[1]. Then it returns -3.
Not quite: it returns (size_t)(-3), or in other words, SIZE_MAX-2.
I was confused by the description of the return values from mbrtoc16(),
but your assertion that the second call returns -3 seems to make sense
(after correction).

I'm still unclear about how you're supposed to know whether a second
call to mbrtoc16() with the same input will be needed.
--
James Kuyper
Tijl Coosemans
2015-11-16 15:03:45 UTC
Post by James Kuyper
Post by Tijl Coosemans
Post by James Kuyper
mbstate_t state = {0};
mbstate_t isn't necessarily a struct. You need to initialise it to zero
using memset.
What data type could mbstate_t be that it can't be zero-initialized by
using {0}? mbstate_t is required to be "a complete object type other
than an array type" (7.29.1p2). 0 is a permitted initializer for any
arithmetic or pointer type. Braces are optionally permitted around the
initializer for any scalar type (6.7.9p11).
Ah, I wasn't aware of that.
Post by James Kuyper
Post by Tijl Coosemans
Post by James Kuyper
char mb[] = "\U00010437";
char16_t c16[2];
mbrtoc16(&c16, mb, sizeof mb - 1, &state);
mbrtoc16(c16+1, mb, sizeof mb - 1, &state);
If I understand 7.28.1.1 correctly (which is not guaranteed), I would
expect c16[0] == 0xD801 && c16[1] == 0xDC37, based upon
<https://en.wikipedia.org/wiki/UTF-16#Examples>. I'd expect both calls
to return "sizeof mb - 1".
The first call reads sizeof(mb)-1 bytes that together form one multibyte
character (here character means member of a character set). Then it
determines how many wide characters (here character means char16_t object)
are needed to encode this character (from a character set). It stores
the first wide character in c16[0] and returns sizeof(mb)-1 (the number
of bytes read). The second call sees that there are more wide characters
in mbstate and stores the next one in c16[1]. Then it returns -3.
Not quite: it returns (size_t)(-3), or in other words, SIZE_MAX-2.
I was confused by the description of the return values from mbrtoc16(),
but your assertion that second call returns -3 seems to make sense
(after correction).
I'm still unclear about how you're supposed to know whether a second
call to mbrtoc16() with the same input will be needed.
You need to keep calling mbrtoc16 until it returns 0, -1, or -2. If it
returns a value from 1 to n you should update the string pointer and
remaining size with it. If it returns -3 you call the function again
with the same string pointer and size.
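
The loop described above can be sketched as follows. This is one possible shape, not a definitive implementation: mbs_to_c16s is a hypothetical name, the source is assumed to be a null-terminated string in the current locale's encoding, and the error convention is an arbitrary choice.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <uchar.h>
#include <wchar.h>

/* Hypothetical helper: convert the multibyte string src to char16_t
 * units. Returns the number of units written, or (size_t)-1 on an
 * encoding error, an incomplete sequence, or insufficient room in dst.
 * No terminator is stored in dst. */
size_t mbs_to_c16s(const char *src, char16_t *dst, size_t cap)
{
    assert(src != NULL && dst != NULL);

    mbstate_t st = {0};
    size_t left = strlen(src) + 1;   /* include the terminating null */
    size_t n = 0;

    for (;;) {
        char16_t c;
        size_t r = mbrtoc16(&c, src, left, &st);

        if (r == 0)                                 /* reached the null character */
            break;
        if (r == (size_t)-1 || r == (size_t)-2)     /* invalid or incomplete input */
            return (size_t)-1;
        if (n == cap)                               /* no room for this unit */
            return (size_t)-1;

        dst[n++] = c;
        if (r != (size_t)-3) {       /* 1..left: that many bytes were consumed */
            src += r;
            left -= r;
        }
        /* (size_t)-3: a second unit was delivered from the state object
         * and no input was consumed, so call again with the same src. */
    }
    return n;
}
```

Note that the comparisons use (size_t)-1, (size_t)-2 and (size_t)-3, since the return type is unsigned.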
Martin Str|mberg
2015-11-16 10:49:48 UTC
Post by Philipp Klaus Krause
Post by Tijl Coosemans
If two char16_t are needed to complete one multibyte character then
you need to call c16rtomb() twice. The first call only updates the
mbstate and does not produce any output.
"When c16 is not a valid wide character, an encoding error occurs: the
function stores the value of the macro EILSEQ in errno and returns
(size_t)(-1); the conversion state is unspecified."
To me, this seems to imply that the char16_t passed to c16rtomb() needs
to be a valid character by itself, not just part of some sequence
encoding a character.
I agree. As written in n1570 if c16rtomb() is for converting UTF-16 it
can only encode characters in BMP, not surrogate pairs. If c16 is a
surrogate it's required to return -1.
--
MartinS
Tijl Coosemans
2015-11-16 15:20:22 UTC
Post by Martin Str|mberg
Post by Philipp Klaus Krause
Post by Tijl Coosemans
If two char16_t are needed to complete one multibyte character then
you need to call c16rtomb() twice. The first call only updates the
mbstate and does not produce any output.
"When c16 is not a valid wide character, an encoding error occurs: the
function stores the value of the macro EILSEQ in errno and returns
(size_t)(-1); the conversion state is unspecified."
To me, this seems to imply that the char16_t passed to c16rtomb() needs
to be a valid character by itself, not just part of some sequence
encoding a character.
I agree. As written in n1570 if c16rtomb() is for converting UTF-16 it
can only encode characters in BMP, not surrogate pairs. If c16 is a
surrogate it's required to return -1.
I disagree. It should store the first part of the surrogate in mbstate
and return 0. The term "valid wide character" does not imply a complete
character. It simply means a valid value depending on the current
mbstate. If mbstate is in the initial state then the first half of a
surrogate is a valid value.
Martin Str|mberg
2015-11-16 16:03:46 UTC
Post by Tijl Coosemans
Post by Martin Str|mberg
Post by Philipp Klaus Krause
"When c16 is not a valid wide character, an encoding error occurs: the
function stores the value of the macro EILSEQ in errno and returns
(size_t)(-1); the conversion state is unspecified."
To me, this seems to imply that the char16_t passed to c16rtomb() needs
to be a valid character by itself, not just part of some sequence
encoding a character.
I agree. As written in n1570 if c16rtomb() is for converting UTF-16 it
can only encode characters in BMP, not surrogate pairs. If c16 is a
surrogate it's required to return -1.
I disagree. It should store the first part of the surrogate in mbstate
and return 0. The term "valid wide character" does not imply a complete
character. It simply means a valid value depending on the current
mbstate. If mbstate is in the initial state then the first half of a
surrogate is a valid value.
Perhaps the real standard has improved their writing (I hope so), but
it says clearly in n1570 (as quoted above) that a not valid character
results in an encoding error. In UTF-16 one half of a surrogate pair is
not a valid character.

So has the C standard invented yet another meaning of character?
--
MartinS
Tijl Coosemans
2015-11-17 23:01:08 UTC
Post by Martin Str|mberg
Post by Tijl Coosemans
Post by Martin Str|mberg
Post by Philipp Klaus Krause
"When c16 is not a valid wide character, an encoding error occurs: the
function stores the value of the macro EILSEQ in errno and returns
(size_t)(-1); the conversion state is unspecified."
To me, this seems to imply that the char16_t passed to c16rtomb() needs
to be a valid character by itself, not just part of some sequence
encoding a character.
I agree. As written in n1570 if c16rtomb() is for converting UTF-16 it
can only encode characters in BMP, not surrogate pairs. If c16 is a
surrogate it's required to return -1.
I disagree. It should store the first part of the surrogate in mbstate
and return 0. The term "valid wide character" does not imply a complete
character. It simply means a valid value depending on the current
mbstate. If mbstate is in the initial state then the first half of a
surrogate is a valid value.
Perhaps the real standard has improved their writing (I hope so), but
it says clearly in n1570 (as quoted above) that a not valid character
results in an encoding error. In UTF-16 one half of a surrogate pair is
not a valid character.
So has the C standard invented yet another meaning of character?
The description of mbrtoc16 contains the following sentence:

"If the function determines that the next multibyte character is complete
and valid, it determines the values of the corresponding wide characters"

The uses of "character" in this sentence cannot have the same meaning
because the first is singular and the second is plural. Both "multibyte
character" and "wide character" are defined in 3.7. The problem is that
"wide character" is defined in terms of wchar_t so its meaning in the
description of the unicode functions is unclear.

The above sentence does make clear that multiple wide characters may be
needed to represent one multibyte character and that therefore a wide
character is not actually a character, but just an instance of char16_t
(which is similar to the definition in 3.7.3 which says it is an instance
of wchar_t).

Surely the meaning of "wide character" in the description of mbrtoc16
and that of c16rtomb is the same. Given the above then I would say that
the first half of a surrogate pair is a valid "wide character" when the
mbstate_t object is in the initial state and that c16rtomb should accept
it and return 0.
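
As a rough illustration of the reading argued above, here is a minimal sketch of a stateful converter that accepts the first half of a surrogate pair, stores it, and returns 0. It is not the standard c16rtomb(): the names sketch_state and sketch_c16_to_utf8 are hypothetical, the state is a plain struct rather than mbstate_t, and the output encoding is assumed to be UTF-8.

```c
#include <stddef.h>
#include <uchar.h>

/* Hypothetical conversion state: a pending high surrogate, or 0 if none. */
typedef struct { char16_t pending_high; } sketch_state;

/* Feed one UTF-16 code unit; out must have room for 4 bytes.
   Returns the number of UTF-8 bytes written, 0 if c16 was a high
   surrogate that was only recorded in *st, or (size_t)-1 if the
   sequence is invalid. */
size_t sketch_c16_to_utf8(char *out, char16_t c16, sketch_state *st)
{
    unsigned long cp;

    if (st->pending_high != 0) {
        char16_t hi = st->pending_high;
        st->pending_high = 0;
        if (c16 < 0xDC00 || c16 > 0xDFFF)
            return (size_t)-1;      /* high surrogate not followed by a low one */
        cp = 0x10000UL + ((unsigned long)(hi - 0xD800) << 10)
                       + (unsigned long)(c16 - 0xDC00);
    } else if (c16 >= 0xD800 && c16 <= 0xDBFF) {
        st->pending_high = c16;     /* first half: store it, produce no output */
        return 0;
    } else if (c16 >= 0xDC00 && c16 <= 0xDFFF) {
        return (size_t)-1;          /* low surrogate with nothing pending */
    } else {
        cp = c16;                   /* BMP character, complete on its own */
    }

    if (cp < 0x80) {
        out[0] = (char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (char)(0xC0 | (cp >> 6));
        out[1] = (char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (char)(0xE0 | (cp >> 12));
        out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (char)(0xF0 | (cp >> 18));
    out[1] = (char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (char)(0x80 | (cp & 0x3F));
    return 4;
}
```

Because a null code unit still produces one byte of output, a return value of 0 in this sketch can only mean "stored, call again with the next code unit".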
Martin Strömberg
2015-11-19 06:22:39 UTC
Post by Tijl Coosemans
"If the function determines that the next multibyte character is complete
and valid, it determines the values of the corresponding wide characters"
The uses of "character" in this sentence cannot have the same meaning
because the first is singular and the second is plural. Both "multibyte
character" and "wide character" are defined in 3.7. The problem is that
"wide character" is defined in terms of wchar_t so its meaning in the
description of the unicode functions is unclear.
The above sentence does make clear that multiple wide characters may be
needed to represent one multibyte character and that therefore a wide
character is not actually a character, but just an instance of char16_t
(which is similar to the definition in 3.7.3 which says it is an instance
of wchar_t).
Surely the meaning of "wide character" in the description of mbrtoc16
and that of c16rtomb is the same. Given the above then I would say that
the first half of a surrogate pair is a valid "wide character" when the
mbstate_t object is in the initial state and that c16rtomb should accept
it and return 0.
You've convinced me. If I read c16rtomb() together with mbrtoc16()
(which seems like a sensible thing to do) evidently c16rtomb() can
need several calls to complete the multibyte character.

However if C11 is exactly the same as n1570, there is room for
significant improvement. It should make clear that if c16rtomb()
returns 0 (this is what I think it returns when it needs more input),
it needs to be called again with more char16_t characters. The list of
return values for mbrtoc16() is good. There should be a similar list
for c16rtomb() or such.


Now if I look at the paragraph preceding mbrtoc16() (in n1570 it's
7.28.1:1) it only says "These functions have a parameter, ps, of type
pointer to mbstate_t that points to an object that can completely
describe the current conversion state of the associated multibyte
character sequence, which the functions alter as necessary".

Note the lack of "wide characters state". If this paragraph made clear
that ps might contain the wide character state as well, I would have more
readily accepted that c16rtomb() might accept UTF-16 surrogate
characters, as then I would have seen where to put the intermediate
value. As I read it, it was ok for multibyte character state to be in
it, but not wide character state.
--
MartinS
Tim Rentsch
2015-11-20 18:30:17 UTC
Post by Tijl Coosemans
"If the function determines that the next multibyte character is complete
and valid, it determines the values of the corresponding wide characters"
The uses of "character" in this sentence cannot have the same meaning
because the first is singular and the second is plural. Both "multibyte
character" and "wide character" are defined in 3.7. The problem is that
"wide character" is defined in terms of wchar_t so its meaning in the
description of the unicode functions is unclear.
The above sentence does make clear that multiple wide characters may be
needed to represent one multibyte character and that therefore a wide
character is not actually a character, but just an instance of char16_t
(which is similar to the definition in 3.7.3 which says it is an instance
of wchar_t).
Surely the meaning of "wide character" in the description of mbrtoc16
and that of c16rtomb is the same. Given the above then I would say that
the first half of a surrogate pair is a valid "wide character" when the
mbstate_t object is in the initial state and that c16rtomb should accept
it and return 0.
You've convinced me. [snip elaboration]
On reading this posting I went back and looked again (and
somewhat more carefully this time) through the descriptions of
these functions. My view has softened a bit. I still think the
c16rtomb() function is meant to work only on single complete
characters, and not encoded sequences of characters. However, I
agree that there does seem to be room for argument, so I can't
put my confidence at 100%. So if someone wants the question
resolved definitively, as of right now I think that means writing
a Defect Report to get an official answer.
Martin Strömberg
2015-11-21 10:04:41 UTC
Post by Tim Rentsch
I still think the
c16rtomb() function is meant to work only on single complete
characters, and not encoded sequences of characters.
So in your opinion you don't think c16rtomb() should be able to
reverse what mbrtoc16() does?

I.e., assuming char16_t contains UTF-16 encoded characters:

1. Start with a string with some multibyte characters.
2. Call mbrtoc16() X times until it returns 0, which means that the
multibyte character string has been encoded.
3. Call c16rtomb() Y times, which should be until after the call where
you passed UTF-16 '\0'.
4. The newly encoded multibyte character string is "equal" to the
original multibyte character string. ("equal" because perhaps there
are several equivalent encodings for certain multibyte character
sets.)
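
The steps above can be sketched with the real <uchar.h> functions roughly as follows. This is only an outline under stated assumptions: the helper name roundtrip_sketch and the fixed 64-unit buffer are made up for illustration, the current locale is whatever the caller has set, and whether step 3 succeeds for surrogate pairs is exactly the point in dispute, so the helper simply reports failure if c16rtomb() returns (size_t)-1.

```c
#include <limits.h>
#include <string.h>
#include <uchar.h>

#define SKETCH_MAX_UNITS 64   /* arbitrary limit chosen for this sketch */

/* Converts in to char16_t and back into out. Returns 0 on success and
   -1 on any conversion error or lack of space. */
int roundtrip_sketch(const char *in, char *out, size_t outsz)
{
    char16_t units[SKETCH_MAX_UNITS];
    size_t n = 0, used = 0;
    size_t left = strlen(in) + 1;        /* include the terminating null */
    mbstate_t st;

    /* Step 2: call mbrtoc16() until it returns 0. */
    memset(&st, 0, sizeof st);
    for (;;) {
        char16_t c;
        size_t r = mbrtoc16(&c, in, left, &st);
        if (r == (size_t)-1 || r == (size_t)-2 || n == SKETCH_MAX_UNITS)
            return -1;
        units[n++] = c;
        if (r == 0)
            break;                       /* the null character was stored */
        if (r != (size_t)-3) {           /* (size_t)-3: no input consumed */
            in += r;
            left -= r;
        }
    }

    /* Step 3: call c16rtomb() for every unit, including the final null. */
    memset(&st, 0, sizeof st);
    for (size_t i = 0; i < n; i++) {
        char buf[MB_LEN_MAX];
        size_t r = c16rtomb(buf, units[i], &st);
        if (r == (size_t)-1 || used + r > outsz)
            return -1;
        memcpy(out + used, buf, r);
        used += r;
    }
    return 0;                            /* step 4: caller compares in and out */
}
```

For plain ASCII input the result should compare equal to the original under any reading of the standard; the interesting case is input outside the BMP in a UTF-8 locale.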
--
MartinS
Tim Rentsch
2015-11-22 21:16:09 UTC
Post by Martin Strömberg
Post by Tim Rentsch
I still think the
c16rtomb() function is meant to work only on single complete
characters, and not encoded sequences of characters.
So in your opinion you don't think c16rtomb() should be able to
reverse what mbrtoc16() does?
Oh, I didn't say that. I think c16rtomb() /should/ be able to
reverse what mbrtoc16() does. But I think what the Standard
says, or at least is meant to say, is that c16rtomb() /is not/
able to reverse what mbrtoc16() does, in those cases where
mbrtoc16() produces more than one (16-bit) character of output.
And to repeat myself, I think the point is open to argument. I
don't think the Standard is clear enough to be sure what it
means, only that this interpretation seems the most likely; but
I can't rule out the possibility that your view is what was
intended either.
Post by Martin Strömberg
1. Start with a string with some multibyte characters.
2. Call mbrtoc16() X times until it returns 0, which means that the
multibyte character string has been encoded.
3. Call c16rtomb() Y times, which should be until after the call where
you passed UTF-16 '\0'.
4. The newly encoded multibyte character string is "equal" to the
original multibyte character string. ("equal" because perhaps there
are several equivalent encodings for certain multibyte character
sets.)
Being a curious sort of character, I did some browsing to find
sources for the glibc c16rtomb() function, and some related code.
I looked at several pages, of which these two seem the most
relevant:

http://code.woboq.org/userspace/glibc/wcsmbs/c16rtomb.c.html
http://code.woboq.org/userspace/glibc/wcsmbs/wcrtomb.c.html

Based on what I read it looks like glibc takes the same view that
I have expressed above. Note in particular the behavior when the
condition 'status == __GCONV_INCOMPLETE_INPUT' holds true.

In a way I'd like to be proven wrong, because reversible behavior
seems better than non-reversible behavior, not to mention less
surprising. But when I read what the Standard actually says and
try to discern what the authors meant, that isn't how it comes
out.
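
For anyone who wants to check what their own C library does rather than read its sources, a small probe along these lines may help. It is only a sketch: probe_high_surrogate is a made-up helper name, and the result depends on the library version and the current locale, so no particular outcome is claimed here.

```c
#include <limits.h>
#include <string.h>
#include <uchar.h>

/* Passes a lone high surrogate (the first half of U+1F600) to c16rtomb()
   starting from the initial conversion state and returns the raw result. */
size_t probe_high_surrogate(void)
{
    char buf[MB_LEN_MAX];
    mbstate_t st;

    memset(&st, 0, sizeof st);      /* initial conversion state */
    return c16rtomb(buf, 0xD83D, &st);
}
```

A result of 0 would suggest the library stored the surrogate and is waiting for the second half, (size_t)-1 would suggest it treats the lone surrogate as an encoding error, and any other value means it wrote that many bytes.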
Martin Strömberg
2015-11-23 20:16:52 UTC
Post by Tim Rentsch
In a way I'd like to be proven wrong, because reversible behavior
seems better than non-reversible behavior, not to mention less
surprising. But when I read what the Standard actually says and
try to discern what the authors meant, that isn't how it comes
out.
We are furiously in agreement here.

You shouldn't need to read the description of another function to
_perhaps_ understand what the first one does.
--
MartinS
Philipp Klaus Krause
2015-11-24 12:48:36 UTC
Post by Martin Strömberg
Post by Tim Rentsch
In a way I'd like to be proven wrong, because reversible behavior
seems better than non-reversible behavior, not to mention less
surprising. But when I read what the Standard actually says and
try to discern what the authors meant, that isn't how it comes
out.
We are furiously in agreement here.
You shouldn't need to read the description of another function to
_perhaps_ understand what the first one does.
Who can write defect reports to the C standard?

Philipp
Tim Rentsch
2015-11-27 17:44:07 UTC
Post by Philipp Klaus Krause
Post by Martin Strömberg
Post by Tim Rentsch
In a way I'd like to be proven wrong, because reversible behavior
seems better than non-reversible behavior, not to mention less
surprising. But when I read what the Standard actually says and
try to discern what the authors meant, that isn't how it comes
out.
We are furiously in agreement here.
You shouldn't need to read the description of another function to
_perhaps_ understand what the first one does.
Who can write defect reports to the C standard?
I'm pretty sure you can if you want. Send an email to one
of the people on the contacts page -

http://www.open-std.org/jtc1/sc22/wg14/www/contacts

explaining that you want to submit a defect report and ask
what the right procedure for doing that is.
Philipp Klaus Krause
2015-12-04 22:30:32 UTC
On reading this posting I went back and looked again (and somewhat
more carefully this time) through the descriptions of these
functions. My view has softened a bit. I still think the c16rtomb()
function is meant to work only on single complete characters, and not
encoded sequences of characters. However, I agree that there does
seem to be room for argument, so I can't put my confidence at 100%.
So if someone wants the question resolved definitively, as of right
now I think that means writing a Defect Report to get an official
answer.
I intend to submit a defect report, here is the current draft summary
(suggestions for improvements are welcome):

Section 7.28.1 describes the function c16rtomb(). In particular, it
states "When c16 is not a valid wide character, an encoding error occurs".
"wide character" is defined in section 3.7.3 as "value representable by
an object of type wchar_t, capable of representing any character in the
current locale".
This wording seems to imply that, e.g. for the common case of UTF-8 char
and UTF-16 char16_t, c16rtomb() will return -1 when it encounters a
character that is encoded as multiple char16_t. In particular,
c16rtomb() will not be able to process strings generated by mbrtoc16().
On the other hand, the description of mbrtoc16() in section
7.28.1 states "If the function determines that the next multibyte
character is complete and valid, it determines the values of the
corresponding wide characters". So it considers it possible that a single
multibyte character translates into multiple wide characters. So maybe the
meaning of "wide character" in section 7.28.1 is different from the
definition of "wide character" in section 3.7.3.
In either case, the intended behaviour of c16rtomb() for characters
encoded as multiple char16_t is unclear.

Philipp
Philipp Klaus Krause
2015-12-04 22:49:05 UTC
And here is a draft of the suggested change:

There are two possible options

* State clearly that passing a char16_t that is not a valid character
on its own to c16rtomb() is an error. In this case, another function to
convert char16_t strings to char strings should be provided by the
standard. The term "wide character" should then not be used in the
description of mbrtoc16() the wa it currently is
* State clearly that c16rtomb() handles characters consisting of
multiple char16_t. For such characters the first call would return 0, and
only once all char16_t encoding the character have been seen,
c16rtomb() could write the character as a multibyte character.
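
To make the first option more concrete, the "another function" it asks for might look roughly like the sketch below, which converts a whole null-terminated char16_t string at once. The name sketch_c16s_to_utf8 and its signature are hypothetical and not part of any standard or proposal, and the output is assumed to be UTF-8 rather than the locale's multibyte encoding.

```c
#include <stddef.h>
#include <uchar.h>

/* Converts the null-terminated UTF-16 string src to UTF-8 in dst.
   Returns the number of bytes written, not counting the terminating
   null, or (size_t)-1 on an unpaired surrogate or if dst is too small. */
size_t sketch_c16s_to_utf8(char *dst, size_t dstsz, const char16_t *src)
{
    static const unsigned char lead[] = { 0, 0, 0xC0, 0xE0, 0xF0 };
    size_t n = 0;

    if (dstsz == 0)
        return (size_t)-1;

    while (*src != 0) {
        unsigned long cp = *src++;
        size_t len, i;

        if (cp >= 0xD800 && cp <= 0xDBFF) {
            /* High surrogate: the next unit must be a low surrogate. */
            if (*src < 0xDC00 || *src > 0xDFFF)
                return (size_t)-1;
            cp = 0x10000UL + ((cp - 0xD800) << 10)
                           + (unsigned long)(*src++ - 0xDC00);
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            return (size_t)-1;          /* low surrogate without a high one */
        }

        len = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
        if (n + len >= dstsz)
            return (size_t)-1;          /* keep room for the terminating null */

        if (len == 1) {
            dst[n] = (char)cp;
        } else {
            /* Lead byte carries the top bits, then 6 bits per trailing byte. */
            dst[n] = (char)(lead[len] | (cp >> (6 * (len - 1))));
            for (i = 1; i < len; i++)
                dst[n + i] = (char)(0x80 | ((cp >> (6 * (len - 1 - i))) & 0x3F));
        }
        n += len;
    }
    dst[n] = '\0';
    return n;
}
```

Working on whole strings sidesteps the question of what a single call should return for half of a surrogate pair, which is the appeal of this option.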

Philipp
Jakob Bohm
2015-12-07 13:01:29 UTC
Post by Philipp Klaus Krause
There are two possible options
* State clearly that passing a char16_t that is not a valid character
on its own to c16rtomb() is an error. In this case, another function to
convert char16_t strings to char strings should be provided by the
standard. The term "wide character" should then not be used in the
description of mbrtoc16() the wa it currently is
*way*
Post by Philipp Klaus Krause
* State clearly that c16rtomb() handles characters consisting of
multiple char16_t. For such characters the first call would return 0, and
only once all char16_t encoding the character have been seen,
c16rtomb() could write the character as a multibyte character.
This description would exclude reasonable implementations that output
the first constituent char values as soon as they are fully known while
retaining enough state to continue the conversion of the remainder
output char values. Such implementations would obviously need the
option of later rejecting an invalid multi-char16_t input character
after having output initial char values based on an assumption of valid
input.
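
For the UTF-16 to UTF-8 case such an implementation is easy to sketch, because the high surrogate alone already determines the first two of the four output bytes. The code below is a hedged illustration only: eager_state and eager_c16_to_utf8 are made-up names, the state is a plain struct rather than mbstate_t, and UTF-8 output is assumed.

```c
#include <stddef.h>
#include <uchar.h>

/* Hypothetical state: the high surrogate whose first two output bytes
   have already been emitted, or 0 if none is pending. */
typedef struct { char16_t high; } eager_state;

/* Feed one UTF-16 code unit; out must have room for 3 bytes.
   Returns the number of bytes written, or (size_t)-1 on invalid input. */
size_t eager_c16_to_utf8(char *out, char16_t c16, eager_state *st)
{
    if (st->high != 0) {
        char16_t hi = st->high;
        st->high = 0;
        if (c16 < 0xDC00 || c16 > 0xDFFF)
            return (size_t)-1;  /* late rejection: two bytes are already out */
        out[0] = (char)(0x80 | ((hi & 0x3) << 4) | ((c16 >> 6) & 0xF));
        out[1] = (char)(0x80 | (c16 & 0x3F));
        return 2;
    }
    if (c16 >= 0xD800 && c16 <= 0xDBFF) {
        /* Bits 16..20 of the code point: top four payload bits plus one. */
        unsigned u = ((c16 >> 6) & 0xF) + 1;
        out[0] = (char)(0xF0 | (u >> 2));
        out[1] = (char)(0x80 | ((u & 0x3) << 4) | ((c16 >> 2) & 0xF));
        st->high = c16;         /* remember it for the second half */
        return 2;
    }
    if (c16 >= 0xDC00 && c16 <= 0xDFFF)
        return (size_t)-1;      /* low surrogate with nothing pending */
    if (c16 < 0x80) {
        out[0] = (char)c16;
        return 1;
    }
    if (c16 < 0x800) {
        out[0] = (char)(0xC0 | (c16 >> 6));
        out[1] = (char)(0x80 | (c16 & 0x3F));
        return 2;
    }
    out[0] = (char)(0xE0 | (c16 >> 12));
    out[1] = (char)(0x80 | ((c16 >> 6) & 0x3F));
    out[2] = (char)(0x80 | (c16 & 0x3F));
    return 3;
}
```

Note that here the first call for a surrogate pair returns 2 rather than 0, so wording that requires a return value of 0 would rule this design out, and the caller has to cope with an error arriving after part of the character has been written.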

In general, though, the entire conversion function family should have
identical behavior conventions, so that any future need can be handled
the same way, e.g. characters encoded as a sequence of multiple char32_t
values (the obvious case would be to accept "Normalization form A"
UTF-16/UCS-4 as input and output to a char encoding that has precomposed
characters or encodes modifier characters in a different order than
Unicode).


Enjoy

Jakob
--
Jakob Bohm, CIO, Partner, WiseMo A/S. https://www.wisemo.com
Transformervej 29, 2860 Søborg, Denmark. Direct +45 31 13 16 10
This public discussion message is non-binding and may contain errors.
WiseMo - Remote Service Management for PCs, Phones and Embedded
Tim Rentsch
2015-12-09 18:25:57 UTC
Post by Philipp Klaus Krause
On reading this posting I went back and looked again (and somewhat
more carefully this time) through the descriptions of these
functions. My view has softened a bit. I still think the c16rtomb()
function is meant to work only on single complete characters, and not
encoded sequences of characters. However, I agree that there does
seem to be room for argument, so I can't put my confidence at 100%.
So if someone wants the question resolved definitively, as of right
now I think that means writing a Defect Report to get an official
answer.
I intend to submit a defect report, here is the current draft summary
Section 7.28.1 describes the function c16rtomb(). In particular, it
states "When c16 is not a valid wide character, an encoding error occurs".
"wide character" is defined in section 3.7.3 as "value representable by
an object of type wchar_t, capable of representing any character in the
current locale".
This wording seems to imply that, e.g. for the common case of UTF-8 char
and UTF-16 char16_t, c16rtomb() will return -1 when it encounters a
character that is encoded as multiple char16_t. In particular,
c16rtomb() will not be able to process strings generated by mbrtoc16().
On the other hand, the description of mbrtoc16() in section
7.28.1 states "If the function determines that the next multibyte
character is complete and valid, it determines the values of the
corresponding wide characters". So it considers it possible that a single
multibyte character translates into multiple wide characters. So maybe the
meaning of "wide character" in section 7.28.1 is different from the
definition of "wide character" in section 3.7.3.
In either case, the intended behaviour of c16rtomb() for characters
encoded as multiple char16_t is unclear.
I have two suggestions, both of which are offered to adopt or not
entirely as you see fit.

First, if I were writing this DR myself, I would probably start off
with something like this:

I'm trying to understand the behavior of restartable conversion
functions in <uchar.h>, described in section 7.28.1. My question
is about c16rtomb() but let me start with mbrtoc16(). Suppose
we have an implementation that defines __STDC_UTF_16__, and a
program operating with a current locale that uses UTF8. For
some UTF8 sequences, mbrtoc16() will produce two successive 16-bit
characters (so-called "surrogate pairs") for a single multi-byte
input character. What happens when c16rtomb() is called to process
these characters? More specifically, ...etc...

Second, I would try to write an implementation of c16rtomb() that
works in the specific case mentioned in the last paragraph, ie,
processes a surrogate pair sequence "correctly", and also complies
(if possible) with the description in the Standard. If I manage to
come up with an implementation that is at least plausibly
conforming, I would present it (possibly in summary) and ask if it
is conforming. If I can't find such an implementation, I would say
something like "I would like to implement c16rtomb() so that it
works on two character sequences like those mentioned above, but I
don't see how to do that so that it is conforming. Is this what is
intended for this function? Does it only work on single char16_t
that are complete characters by themselves?"

So, for what it's worth, those are my ideas and suggestions.
Philipp Klaus Krause
2015-12-10 13:51:25 UTC
Thanks, your comments helped me in improving the defect report
yesterday, which I then submitted (N1991).

Philipp
Tim Rentsch
2016-01-27 19:36:18 UTC
Post by Philipp Klaus Krause
Thanks, your comments helped me in improving the defect report
yesterday, which I then submitted (N1991).
Oh, that's great. I hope you get a good response.

Tim Rentsch
2015-11-16 15:15:23 UTC
Post by Tijl Coosemans
Post by Philipp Klaus Krause
For the usual scenario of UTF-16 char16_t strings and UTF-32 char32_t
strings, the C11 standard has functions to convert
char to char16_t : mbrtoc16()
char to char32_t : mbrtoc32()
char32_t to char : c32rtomb()
However what is missing is a function to convert from char16_t to char.
The function c16rtomb() cannot convert characters consisting of more
than one char16_t to char, so for Unicode it can only handle characters
in the basic multilingual plane.
I suggest to add a function to convert char16_t to char.
And maybe also function to convert between char16_t and char32_t.
If two char16_t are needed to complete one multibyte character then
you need to call c16rtomb() twice. The first call only updates the
mbstate and does not produce any output.
FWIW I had a similar reaction when I first looked into this. On
further reading however I came around to the point of view that
others in this thread espouse. In particular James Kuyper's
analysis appears to be sound and sufficient to be convincing.