Discussion:
An environment macro for UTF-8?
Philipp Klaus Krause
2016-11-29 09:18:20 UTC
We have macros for the encodings typically used by char16_t, char32_t,
wchar_t.
But not for char. How about introducing __STDC_UTF_8__ to indicate that
values of char are UTF-8 encoded?

These days encodings other than UTF-8 are becoming uncommon for char, so
I'd assume that many implementations will support UTF-8 locales only.
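
A minimal sketch, assuming the proposed macro existed and following the
pattern of the existing __STDC_UTF_16__ / __STDC_UTF_32__ macros; the
function is purely illustrative:

#include <string.h>

/* Count code points if char strings are known to be UTF-8,
   otherwise fall back to counting bytes. */
size_t count_units(const char *s)
{
#if defined(__STDC_UTF_8__) && __STDC_UTF_8__ == 1
    size_t n = 0;
    for (; *s; ++s)
        if (((unsigned char)*s & 0xC0) != 0x80) /* skip UTF-8 continuation bytes */
            ++n;
    return n;
#else
    return strlen(s);
#endif
}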

Philipp
Keith Thompson
2016-11-29 17:45:50 UTC
Post by Philipp Klaus Krause
We have macros for the encodings typically used by char16_t, char32_t,
wchar_t.
But not for char. How about introducing __STDC_UTF_8__ to indicate that
values of char are UTF-8 encoded?
These days encodings other than UTF-8 are becoming uncommon for char, so
I'd assume that many implementations will support UTF-8 locales only.
Values of type char can encode anything you like in the range
CHAR_MIN..CHAR_MAX. Any interpretation as UTF-8 is specific to any
functions that operate on arrays of char. Most of the functions
declared in <string.h> (e.g. strlen()) specifically do *not* pay
attention to UTF-8 encoding.

Other functions, such as strcoll(), pay attention to the current
locale, which can't be specified via a macro since it can change
at run time.
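
A minimal sketch of that point: the result of strcoll() depends on the
locale selected at run time, which no compile-time macro can describe.

#include <locale.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* In the default "C" locale, "B" sorts before "a" (0x42 < 0x61);
       many national locales order them the other way round. */
    printf("before setlocale: %d\n", strcoll("a", "B"));

    setlocale(LC_COLLATE, "");  /* locale chosen by the environment at run time */
    printf("after setlocale:  %d\n", strcoll("a", "B"));
    return 0;
}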
--
Keith Thompson (The_Other_Keith) kst-***@mib.org <http://www.ghoti.net/~kst>
Working, but not speaking, for JetHead Development, Inc.
"We must do something. This is something. Therefore, we must do this."
-- Antony Jay and Jonathan Lynn, "Yes Minister"
s***@casperkitty.com
2016-11-29 18:32:37 UTC
Post by Keith Thompson
Values of type char can encode anything you like in the range
CHAR_MIN..CHAR_MAX. Any interpretation as UTF-8 is specific to any
functions that operate on arrays of char. Most of the functions
declared in <string.h> (e.g. strlen()) specifically do *not* pay
attention to UTF-8 encoding.
If someone is writing software which needs to output the byte sequence
0x48, 0x65, 0x6C, 0x6C, 0x6F, does it make more sense to write out the
string as "\x48\x65\x6C\x6C\x6F" (which would be portable to all platforms)
or to write it as "Hello" and require that the software only be used on
platforms whose native character set is ASCII or a superset thereof? I
would suggest the latter would make more sense. Is there any reason the
same principle shouldn't be equally applicable for programs that may
contain characters beyond the ASCII range?

Perhaps it might be useful to allow a compile-time equality comparison
between string literals, such that code could then say something like:

#if "Hello" != "\x48\x65\x6C\x6C\x6F"
#error ASCII required
#endif

but allowing the programmer to include any characters of interest within the
comparison. If the comparison includes non-ASCII characters, the programmer
could then verify not only how the compiler was processing the source, but
also the behavior of any translations that occurred to the source before it
was compiled. For example, if a file was loaded into an editor which auto-
detected that the file used a certain 8-bit character set but then wrote out
UTF-8 when it saved the file, it might not look as though anything had
changed, but a comparison between a string literal using the 8-bit character
set and the \x-escaped representation of it would catch the change.
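
The closest thing available today is a sketch along these lines: character
constants (unlike string literals) are already permitted in #if expressions,
so individual code points can be checked at translation time, though the
standard leaves it implementation-defined whether their values there match
the execution character set, and such a check cannot catch the transcoding
scenario described above, where one byte becomes a multi-byte sequence.

#if 'H' != 0x48 || 'e' != 0x65 || 'l' != 0x6C || 'o' != 0x6F
#error "ASCII (or a superset) execution character set required"
#endif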
Richard Bos
2016-11-29 19:14:54 UTC
Post by s***@casperkitty.com
Post by Keith Thompson
Values of type char can encode anything you like in the range
CHAR_MIN..CHAR_MAX. Any interpretation as UTF-8 is specific to any
functions that operate on arrays of char. Most of the functions
declared in <string.h> (e.g. strlen()) specifically do *not* pay
attention to UTF-8 encoding.
If someone is writing software which needs to output the byte sequence
0x48, 0x65, 0x6C, 0x6C, 0x6F, does it make more sense to write out the
string as "\x48\x65\x6C\x6C\x6F" (which would be portable to all platforms)
or to write it as "Hello" and require that the software only be used on
platforms whose native character set is ASCII or a superset thereof?
The former, obviously.
Post by s***@casperkitty.com
I would suggest the latter would make more sense.
I cannot conceive of a reason why.

If you want the _values_ 0x48, 0x65, 0x6c, 0x6c and 0x6f, you output
those values, exactly as you want them. This can be done, portably, on
any platform regardless of character encoding, by outputting either a
series of individual bytes, or the string "\x48\x65\x6c\x6c\x6f". The
same code will work on any computer.
There could even be a good reason for wanting to do so. For instance,
you could be writing an HTML editor for a Burroughs MCP. In that case,
limiting your code to run on ASCII systems would be counterproductive;
what you need is code that, on an EBCDIC system, outputs ASCII codes. In
other words, you want the values, not the text as it would be expressed
natively.

If you want the _text_ "Hello", you output that string, exactly as it is
written here. This, too, can be done using the same code on any
platform, with complete disregard of the character set.
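
A minimal sketch of the two intents distinguished above; the function
names are invented for illustration only:

#include <stdio.h>

/* Emit the fixed byte values 0x48 0x65 0x6C 0x6C 0x6F, regardless of
   the host character set (the same bytes come out on an EBCDIC machine). */
static void emit_ascii_hello(FILE *out)
{
    fputs("\x48\x65\x6C\x6C\x6F", out);
}

/* Emit the text "Hello" in whatever the native execution character
   set happens to be. */
static void emit_native_hello(FILE *out)
{
    fputs("Hello", out);
}

int main(void)
{
    emit_ascii_hello(stdout);
    putc('\n', stdout);
    emit_native_hello(stdout);
    putc('\n', stdout);
    return 0;
}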

Only if you want to output the native string "Hello" _and_ demand that
this string equates to the values 0x48, 0x65, 0x6c, 0x6c and 0x6f does
it make sense to artificially limit your code to ASCII systems. But
frankly, I can't think of a reasonable argument for wanting that.

Richard
s***@casperkitty.com
2016-11-29 20:28:39 UTC
Post by Richard Bos
Post by s***@casperkitty.com
If someone is writing software which needs to output the byte sequence
0x48, 0x65, 0x6C, 0x6C, 0x6F, does it make more sense to write out the
string as "\x48\x65\x6C\x6C\x6F" (which would be portable to all platforms)
or to write it as "Hello" and require that the software only be used on
platforms whose native character set is ASCII or a superset thereof?
The former, obviously.
Post by s***@casperkitty.com
I would suggest the latter would make more sense.
I cannot conceive of a reason why.
Perhaps because the bytes may be interpreted as characters "Hello"
regardless of how the machine doing the translation sees them. They may
be going to an external device, or they may be rendered as shapes using
data tables stored within the code (and which are set up for ASCII), etc.

The string "Hello" is a lot easier for humans to write or read than the
string "\x48\x65\x6C\x6C\x6F", and a lot of code will never be used on
systems that use anything other than ASCII (or a superset thereof). If
the string *means* "Hello", but needs to be represented as a particular
sequence of bytes, I'd say that being able to ensure that an implementation
will process characters as expected would be rather more helpful than
having to explicitly spell things out as character codes.
Post by Richard Bos
There could even be a good reason for wanting to do so. For instance,
you could be writing an HTML editor for a Burroughs MCP. In that case,
limiting your code to run on ASCII systems would be counterproductive;
what you need is code that, on an EBCDIC system, outputs ASCII codes. In
other words, you want the values, not the text as it would be expressed
natively.
If there is a need to have code which handles ASCII received from external
sources but can run on an EBCDIC implementation, then using hex escapes
in string literals would make sense. What fraction of code that handles
ASCII text received from elsewhere would ever plausibly have to run on an
EBCDIC system?
Post by Richard Bos
If you want the _text_ "Hello", you output that string, exactly as it is
written here. This, too, can be done using the same code on any
platform, with complete disregard of the character set.
In many cases, code exists for the purpose of exchanging information with
other systems. If another system needs to be sent what *it* will regard as
the text "Hello", being able to write that text as "Hello" in source would
seem rather convenient, would it not?
Post by Richard Bos
Only if you want to output the native string "Hello" _and_ demand that
this string equates to the values 0x48, 0x65, 0x6c, 0x6c and 0x6f does
it make sense to artificially limit your code to ASCII systems. But
frankly, I can't think of a reasonable argument for wanting that.
What fraction of C99 implementations don't use ASCII? What fraction of C
code would serve any purpose on such systems?
Philipp Klaus Krause
2016-11-29 18:34:41 UTC
Post by Keith Thompson
Post by Philipp Klaus Krause
We have macros for the encodings typically used by char16_t, char32_t,
wchar_t.
But not for char. How about introducing __STDC_UTF_8__ to indicate that
values of char are UTF-8 encoded?
These days encodings other than UTF-8 are becoming uncommon for char, so
I'd assume that many implementations will support UTF-8 locales only.
Values of type char can encode anything you like in the range
CHAR_MIN..CHAR_MAX. Any interpretation as UTF-8 is specific to any
functions that operate on arrays of char. Most of the functions
declared in <string.h> (e.g. strlen()) specifically do *not* pay
attention to UTF-8 encoding.
Other functions, such as strcoll(), pay attention to the current
locale, which can't be specified via a macro since it can change
at run time.
While that is true, it holds equally for char, char16_t, char32_t and
wchar_t (except that char16_t and char32_t lack the string functions that
exist for char and wchar_t). But char is the only one among them that
doesn't have such a macro.
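
For reference, these are the existing macros being compared against; char
has no analogue:

#if defined(__STDC_UTF_16__)
/* char16_t values are UTF-16 encoded (C11) */
#endif
#if defined(__STDC_UTF_32__)
/* char32_t values are UTF-32 encoded (C11) */
#endif
#if defined(__STDC_ISO_10646__)
/* wchar_t values are ISO/IEC 10646 code points; the macro expands to a
   yyyymmL date naming the supported revision */
#endif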

Philipp
Tim Rentsch
2016-12-10 07:13:13 UTC
Post by Philipp Klaus Krause
Post by Keith Thompson
Post by Philipp Klaus Krause
We have macros for the encodings typically used by char16_t, char32_t,
wchar_t.
But not for char. How about introducing __STDC_UTF_8__ to indicate that
values of char are UTF-8 encoded?
These days encodings other than UTF-8 are becoming uncommon for char, so
I'd assume that many implementations will support UTF-8 locales only.
Values of type char can encode anything you like in the range
CHAR_MIN..CHAR_MAX. Any interpretation as UTF-8 is specific to any
functions that operate on arrays of char. Most of the functions
declared in <string.h> (e.g. strlen()) specifically do *not* pay
attention to UTF-8 encoding.
Other functions, such as strcoll(), pay attention to the current
locale, which can't be specified via a macro since it can change
at run time.
While that is true, it holds equally for char, char16_t, char32_t and
wchar_t (except that char16_t and char32_t lack the string functions that
exist for char and wchar_t). But char is the only one among them that
doesn't have such a macro.
Isn't char the only one whose encoding is locale specific? ISTM
that having a macro doesn't make sense if the encoding used isn't
necessarily fixed, and that is always the case, even in implementations
whose locales all currently use the same encoding.
Jakob Bohm
2016-12-13 16:19:18 UTC
Post by Philipp Klaus Krause
We have macros for the encodings typically used by char16_t, char32_t,
wchar_t.
But not for char. How about introducing __STDC_UTF_8__ to indicate that
values of char are UTF-8 encoded?
These days encodings other than UTF-8 are becoming uncommon for char, so
I'd assume that many implementations will support UTF-8 locales only.
Philipp
Having thought about this for some time, I would suggest something like
the following

__STDC_EXAMPLE_IMPL_CHAR_UTF8__

This would be:

= Undefined if an implementation has not yet provided this feature,
in which case a conforming program may define it (correctly) based
on implementation documentation etc.
= 0 if an implementation makes no promise as to whether current or future
locales will use UTF-8 for the char type.
= 1 if an implementation promises that all current and future locales
will use only UTF-8 for the char type.
= -1 if an implementation promises that no current or future locale
will use UTF-8 for the char type.

For example, a hosted implementation based on the current GNU C library
with its large collection of locales, some UTF-8, some not, would
define this as 0.

C implementations exclusively for older versions of Microsoft Windows
would define this as -1, as those older host environments guarantee that
no character is encoded as more than 2 chars (1 if _SBCS_ is defined in
compiler options), which rules out UTF-8.

Hosted implementations for some recent systems that use UTF-8 for
all locales (I think Android and Google NaCl are among them) would
define this as 1.

To facilitate separation of the C library and the compiler, these shall be
defined by an implementation after <ctype.h> has been included, but may
also be defined even when it has not been. Conforming programs may not
test for their undefinedness, nor define them locally before including
said standard header.
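
A minimal sketch of how a program might consume the proposed tri-state
macro; the macro name and its -1/0/1 values are taken from the proposal
above, the NARROW_IS_UTF8 name is invented, and none of this is current
standard C. For an implementation predating the feature, the value would
come from the build system based on the implementation's documentation:

#include <ctype.h>   /* per the proposal, the macro is defined after this header */

#if __STDC_EXAMPLE_IMPL_CHAR_UTF8__ == 1
  /* Promise: every current and future locale encodes char as UTF-8. */
  #define NARROW_IS_UTF8 1
#elif __STDC_EXAMPLE_IMPL_CHAR_UTF8__ == -1
  /* Promise: no locale uses UTF-8 (e.g. classic Windows code pages). */
  #define NARROW_IS_UTF8 0
#else
  /* Value 0: no promise; the program must query the locale at run time. */
  #define NARROW_IS_UTF8 (-1)
#endif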

Other similar defines could be:

..._CHAR_ASCII__ (superset of ASCII, UTF-8 qualifies).
..._CHAR_ISO???__ (superset of the ASCII subset shared with the derived
national 7-bit character sets from the 1970s and before; ASCII and
UTF-8 qualify).
..._CHAR_ISO8859__ (any of the ISO8859-x series of character encodings)
..._CHAR_ISO8859_1__ (the specific encoding where each byte is the same
numbered UNICODE codepoint, thus translation to UNICODE is simple zero
extension).
..._CHAR_ISO???_PURE__ (_ISO??? and none of those char values occur
inside longer logical chars, example UTF-8)
..._CHAR_ASCII_PURE__ (ASCII and no char in 0..127 occurs inside longer
logical chars, example UTF-8).
..._CHAR_1CHAR__ (each char is its own logical character, like in
ISO8859-x, but also many others, implies all higher).
..._CHAR_2CHAR__ (each logical char is at most 2 char long).
..._CHAR_3CHAR__ (...., example UTF-8 of BMP only).
..._CHAR_4CHAR__ (...., example UTF-8 of current UNICODE).
..._CHAR_5CHAR__ (...., placeholder for consistency)
..._CHAR_6CHAR__ (...., example UTF-8 of full 31 bit UNICODE).
..._SOURCE_CHAR__ (the char type represents the actual source character
set of the implementation, even for character values whose meaning may
be locale dependent, in other words, the compiler passes through the
encoding of characters found in string and character constants).

..._WCHAR_CHAR16__ (wchar_t is same size and same encoding as char16_t,
for all locales).
..._WCHAR_CHAR32__ (wchar_t is same size and same encoding as char32_t,
for all locales).
..._CHAR16_UCS2__ (char16_t is straight encoding of the first 64K
UNICODE characters, no UTF-16 surrogates allowed).

There could also be dynamic functions to get this value for the current
locale (where the define is 0, otherwise those would return the compile
time constant).
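
One way such a run-time query might look, sketched on top of the POSIX
nl_langinfo() facility rather than anything in standard C; the function
name is invented for illustration:

#include <langinfo.h>   /* POSIX, not standard C */
#include <string.h>

/* Returns 1 if the current LC_CTYPE locale encodes char as UTF-8,
   -1 otherwise (mirroring the proposed 1 / -1 macro values).
   Assumes the caller has already selected a locale with setlocale(). */
int impl_char_is_utf8_now(void)
{
    const char *cs = nl_langinfo(CODESET);
    return (strcmp(cs, "UTF-8") == 0 || strcmp(cs, "utf8") == 0) ? 1 : -1;
}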

Many of these have one-way implication relationships with each other:
_CHAR_UTF8__ == 1 implies _CHAR_6CHAR__, _CHAR_ASCII_PURE__,
_CHAR_ISO???_PURE__, _CHAR_ASCII__ and _CHAR_ISO???__, with the -1
values having the opposite implication chains.


Enjoy

Jakob
--
Jakob Bohm, CIO, Partner, WiseMo A/S. https://www.wisemo.com
Transformervej 29, 2860 Søborg, Denmark. Direct +45 31 13 16 10
This public discussion message is non-binding and may contain errors.
WiseMo - Remote Service Management for PCs, Phones and Embedded