Post by Philipp Klaus Krause
We have macros for the encodings typically used by char16_t, char32_t,
But not for char. How about introducing __STDC_UTF_8__ to indicate that
values of char are UTF-8 encoded?
These days encodings other than UTF-8 are becoming uncommon for char, so
I'd assume that many implementations will support UTF-8 locales only.
Having thought about this for some time, I would suggest something like
This would be:
= Undefined if an implementation has not yet provided this feature,
in which case a conforming program may define it (correctly) based
on implementation documentation etc.
= 0 if an implementation makes no promise as to whether current or
future locales will use UTF-8 for the char type.
= 1 if an implementation promises that all current and future locales
will use only UTF-8 for the char type.
= -1 if an implementation promises that no current or future locale
will use UTF-8 for the char type.
For example, a hosted implementation based on the current GNU C library
with its large collection of locales, some UTF-8, some not, would
define this as 0.
C implementations exclusively for older versions of Microsoft Windows
would define this as -1, as those older host environments guarantee that
no character is encoded as more than 2 chars (1 if _SBCS_ is defined in
compiler options), which rules out UTF-8.
Hosted implementations for some recent systems that use UTF-8 for
all locales (I think Android and Google NaCl are among them) would
define this as 1.
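
As a rough sketch of how a program might consume such a three-valued
macro: the name HYPOTHETICAL_CHAR_UTF8 below is a made-up stand-in, not a
proposed or existing identifier, and the enum and function names are
likewise only for illustration. The #ifndef guard is there only so the
stand-in can be overridden with -D when experimenting.

```c
#include <assert.h>

/* HYPOTHETICAL_CHAR_UTF8 is an invented stand-in for the macro under
   discussion. 0 is the value suggested above for a glibc-based
   implementation. */
#ifndef HYPOTHETICAL_CHAR_UTF8
#define HYPOTHETICAL_CHAR_UTF8 0
#endif

/* Only the three values described above are meaningful. */
static_assert(HYPOTHETICAL_CHAR_UTF8 >= -1 && HYPOTHETICAL_CHAR_UTF8 <= 1,
              "stand-in macro must be -1, 0 or 1");

/* Hypothetical names: what the program should do about UTF-8 in char. */
enum utf8_policy {
    UTF8_NEVER,            /* -1: no locale will ever use UTF-8     */
    UTF8_CHECK_AT_RUNTIME, /*  0: no promise, ask the locale        */
    UTF8_ALWAYS            /*  1: every locale uses UTF-8           */
};

static enum utf8_policy char_utf8_policy(void)
{
#if HYPOTHETICAL_CHAR_UTF8 == 1
    return UTF8_ALWAYS;
#elif HYPOTHETICAL_CHAR_UTF8 == -1
    return UTF8_NEVER;
#else
    return UTF8_CHECK_AT_RUNTIME;
#endif
}
```

With the 1 and -1 values a program can drop one of its code paths at
compile time; only the 0 case needs a run-time check.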
To facilitate separation of C library and compiler, these shall be
defined by an implementation after including <ctype.h>, but may be
defined even if not. Conforming programs may not test for their
undefinedness, nor define them locally before including said standard
header.
Other similar defines could be:
..._CHAR_ASCII__ (superset of ASCII, UTF-8 qualifies).
..._CHAR_ISO???__ (superset of the ASCII subset shared with derived
national 7 bit character sets from the 1970s and before, ASCII and
..._CHAR_ISO8859__ (any of the ISO8859-x series of character encodings)
..._CHAR_ISO8859_1__ (the specific encoding where each byte is the same
numbered UNICODE codepoint, thus translation to UNICODE is simple zero
extension).
..._CHAR_ISO???_PURE__ (_ISO??? and none of those char values occur
inside longer logical chars, example UTF-8)
..._CHAR_ASCII_PURE__ (ASCII and no char in 0..127 occur inside longer
logical chars, example UTF-8).
..._CHAR_1CHAR__ (each char is its own logical character, like in
ISO8859-x, but also many others, implies all higher).
..._CHAR_2CHAR__ (each logical char is at most 2 char long).
..._CHAR_3CHAR__ (...., example UTF-8 of BMP only).
..._CHAR_4CHAR__ (...., example UTF-8 of current UNICODE).
..._CHAR_5CHAR__ (...., placeholder for consistency)
..._CHAR_6CHAR__ (...., example UTF-8 of full 31 bit UNICODE).
..._SOURCE_CHAR__ (the char type represents the actual source character
set of the implementation, even for character values whose meaning may
be locale dependent, in other words, the compiler passes through the
encoding of characters found in string and character constants).
..._WCHAR_CHAR16__ (wchar_t is same size and same encoding as char16_t,
for all locales).
..._WCHAR_CHAR32__ (wchar_t is same size and same encoding as char32_t,
for all locales).
..._CHAR16_UCS2__ (char16_t is straight encoding of the first 64K
UNICODE characters, no UTF-16 surrogates allowed).
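
To make the 3CHAR / 4CHAR / 6CHAR examples in the list above concrete,
here is a small sketch of how many chars one code point needs in UTF-8,
assuming the original scheme that extends to 31 bits. The function name
is my own, and it does not reject surrogate code points.

```c
#include <assert.h>
#include <stdint.h>

/* Number of chars needed to encode code point cp in UTF-8 as originally
   defined (up to 31 bits). Returns 0 if cp does not fit in 31 bits.
   Hypothetical helper name, for illustration only. */
static int utf8_encoded_length(uint32_t cp)
{
    if (cp < 0x80u)       return 1; /* ASCII range                      */
    if (cp < 0x800u)      return 2;
    if (cp < 0x10000u)    return 3; /* rest of the BMP                  */
    if (cp < 0x200000u)   return 4; /* covers current Unicode, U+10FFFF */
    if (cp < 0x4000000u)  return 5;
    if (cp < 0x80000000u) return 6; /* full 31 bit range                */
    return 0;
}

/* Self-check matching the examples given in the list. */
static void utf8_length_self_check(void)
{
    assert(utf8_encoded_length(0xFFFFu) == 3);     /* last BMP code point */
    assert(utf8_encoded_length(0x10FFFFu) == 4);   /* last Unicode one    */
    assert(utf8_encoded_length(0x7FFFFFFFu) == 6); /* last 31 bit value   */
}
```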
There could also be dynamic functions to get this value for the current
locale (where the define is 0, otherwise those would return the
compile-time value).
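
One possible shape for such a dynamic query, as a sketch only. It relies
on POSIX nl_langinfo(CODESET), which is not part of standard C, and it
assumes the codeset is reported with the spelling "UTF-8".
HYPOTHETICAL_CHAR_UTF8 and both function names are invented for the
example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <langinfo.h>
#include <string.h>

/* Invented stand-in for the compile-time macro; 0 means "no promise". */
#ifndef HYPOTHETICAL_CHAR_UTF8
#define HYPOTHETICAL_CHAR_UTF8 0
#endif

/* Returns 1 if the current LC_CTYPE locale reports UTF-8, else 0.
   Assumption: the codeset name is spelled exactly "UTF-8". */
static int current_locale_is_utf8(void)
{
    const char *codeset = nl_langinfo(CODESET);
    assert(codeset != NULL);
    return strcmp(codeset, "UTF-8") == 0;
}

/* Returns 1 or -1 for the current locale. When the compile-time value
   is already 1 or -1, that value is returned without a run-time query. */
static int char_utf8_for_current_locale(void)
{
#if HYPOTHETICAL_CHAR_UTF8 == 1
    return 1;
#elif HYPOTHETICAL_CHAR_UTF8 == -1
    return -1;
#else
    return current_locale_is_utf8() ? 1 : -1;
#endif
}
```

The caller is expected to have called setlocale() for LC_CTYPE first;
without that, the program is still in the "C" locale and the answer
describes that locale.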
Many of these have one way implication relationships with each other:
_CHAR_UTF8__ == 1 implies _CHAR_6CHAR__, _CHAR_ASCII_PURE__,
_CHAR_ISO???_PURE__, _CHAR_ASCII__ and _CHAR_ISO???__ . With the -1
values having opposite implication chains.
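
A sketch of these implications as a consistency check over a set of
values. The struct and function names are hypothetical, and treating the
-1 chains as the reverse direction (a -1 on any implied property forces
-1 for UTF-8) is my reading of the paragraph above, not something it
spells out.

```c
#include <assert.h>
#include <stddef.h>

/* One tri-state value (-1, 0 or 1) per property named above.
   Hypothetical type, for illustration only. */
struct char_props {
    int utf8;       /* ..._CHAR_UTF8__       */
    int six_char;   /* ..._CHAR_6CHAR__      */
    int ascii_pure; /* ..._CHAR_ASCII_PURE__ */
    int iso_pure;   /* ..._CHAR_ISO???_PURE__ */
    int ascii;      /* ..._CHAR_ASCII__      */
    int iso;        /* ..._CHAR_ISO???__     */
};

/* Returns 1 if the values respect the implication chains, else 0. */
static int char_props_consistent(const struct char_props *p)
{
    assert(p != NULL);
    const int implied[] = {
        p->six_char, p->ascii_pure, p->iso_pure, p->ascii, p->iso
    };
    for (size_t i = 0; i < sizeof implied / sizeof implied[0]; i++) {
        /* utf8 == 1 promises every implied property as well. */
        if (p->utf8 == 1 && implied[i] != 1)
            return 0;
        /* Reverse chain: ruling out an implied property rules out UTF-8. */
        if (implied[i] == -1 && p->utf8 != -1)
            return 0;
    }
    return 1;
}
```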
Jakob Bohm, CIO, Partner, WiseMo A/S. https://www.wisemo.com
Transformervej 29, 2860 Søborg, Denmark. Direct +45 31 13 16 10
This public discussion message is non-binding and may contain errors.
WiseMo - Remote Service Management for PCs, Phones and Embedded