Discussion:
What does it mean "to ignore case" in Korean, Chinese or Hebrew?
jacob navia
2015-03-18 23:25:59 UTC
The strtold function should look for a sequence of letters "NAN" or
"INF" and return a corresponding float IGNORING CASE.

But what does that mean in languages that do not have "case"?

Are those 3-letter sequences hardwired to English? (Then the correct way
would be to translate with wcstombs into a buffer, then compare ignoring
case.) But that supposes that programmers' keyboards in Korea can write
"NAN" and "INF" and that all supporting software accepts Latin letters
and displays them correctly, etc.

Shouldn't those letters be part of a "locale" convention?
Kaz Kylheku
2015-03-19 00:39:18 UTC
Post by jacob navia
The strtold function should look for a sequence of letters "NAN" or
"INF" and return a corresponding float IGNORING CASE.
But what does that mean in languages that do not have "case"?
"INF" and "NAN" have case even if written by a Korean person in Korea.
Post by jacob navia
Are those 3-letter sequences hardwired to English? (Then the correct way
would be to translate with wcstombs into a buffer, then compare ignoring
case.) But that supposes that programmers' keyboards in Korea can write
"NAN" and "INF" and that all supporting software accepts Latin letters
and displays them correctly, etc.
I haven't seen any syntactic description where it is allowed that NaN
and Inf can be translated to other languages.

Keywords like while, for, if, and switch also don't get translated to Korean;
nor do standard function names like main or printf. Why worry specifically about
inf and NaN?

(Also, what of the E in floating-point exponents?)
Post by jacob navia
Shouldn't those letters be part of a "locale" convention?
Inf and NaN are specialized jargon related to numeric analysis.

They are not comparable to cultural matters like what symbol to use to separate
digits (and in what groups), what the currency symbol is, or how dates are
written.

An end-user other than a scientist or engineer should ideally never see Inf or
NaN in a user interface.
m***@yahoo.co.uk
2015-03-19 14:17:37 UTC
Post by jacob navia
The strtold function should look for a sequence of letters "NAN" or
"INF" and return a corresponding float IGNORING CASE.
But what does that mean in languages that do not have "case"?
Worse. What about languages that *do* have case, but where upper and lower
case are not the same as in English?

I am thinking specifically of Turkish, where the lower case version
of "INF" would be <LATIN SMALL LETTER DOTLESS I><LATIN SMALL LETTER N><LATIN SMALL LETTER F>

(and the upper case version of 'i' would be <LATIN CAPITAL LETTER I
WITH DOT ABOVE>)
Keith Thompson
2015-03-19 15:34:41 UTC
Post by m***@yahoo.co.uk
Post by jacob navia
The strtold function should look for a sequence of letters "NAN" or
"INF" and return a corresponding float IGNORING CASE.
But what does that mean in languages that do not have "case"?
Worse. What about languages that *do* have case, but where upper and lower
case are not the same as in English?
I am thinking specifically of Turkish, where the lower case version
of "INF" would be <LATIN SMALL LETTER DOTLESS I><LATIN SMALL LETTER N><LATIN SMALL LETTER F>
(and the upper case version of 'i' would be <LATIN CAPITAL LETTER I
WITH DOT ABOVE>)
That's not relevant. The 'I' in "INF" is a <LATIN CAPITAL LETTER
I>, not a <LATIN SMALL LETTER DOTLESS I>. Its lowercase version
is <LATIN SMALL LETTER I> and nothing else.

There are a number of other cases where letters are used in numeric
literals: 'e' to introduce an exponent, 'p' to introduce a binary
exponent, 'x' (in 0x) to introduce a hexadecimal constant, 'a'..'f' as
hexadecimal digits, 'u', 'l', and 'll' suffixes on integer constants,
and 'f' and 'l' suffixes on floating-point constants.

In all these cases, the letters are specifically Latin letters that
can be used in either lower or upper case.
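
A minimal sketch of what that looks like for strtold() itself, assuming
an implementation with IEEE-style infinities and NaNs. The helper name
classify_special is made up for illustration; it only reports whether
strtold() consumed the whole string and what kind of value came back.

```c
#include <math.h>
#include <stdlib.h>

/* Hypothetical helper (not a standard function): returns 1 if strtold()
   consumed all of s and produced a NaN, 2 if it consumed all of s and
   produced an infinity, and 0 in every other case. */
int classify_special(const char *s)
{
    char *end;
    long double v = strtold(s, &end);

    if (end == s || *end != '\0')
        return 0;               /* nothing parsed, or trailing characters */
    if (isnan(v))
        return 1;
    if (isinf(v))
        return 2;
    return 0;                   /* an ordinary finite number */
}
```

With this, "nan", "NAN" and "nAn" should all classify as 1, and "inf",
"INF" and "-iNfInItY" as 2, whatever locale is in effect.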

And don't forget that "NAN" can optionally be followed by a
parenthesized n-char-sequence (zero or more uppercase or lowercase
letters, digits, or underscores between '(' and ')'), and "INF" can
also be written as "INFINITY". N1570 7.22.1.3p3.
--
Keith Thompson (The_Other_Keith) kst-***@mib.org <http://www.ghoti.net/~kst>
Working, but not speaking, for JetHead Development, Inc.
"We must do something. This is something. Therefore, we must do this."
-- Antony Jay and Jonathan Lynn, "Yes Minister"
James Kuyper
2015-03-19 16:34:45 UTC
Post by jacob navia
The strtold function should look for a sequence of letters "NAN" or
"INF" and return a corresponding float IGNORING CASE.
But what does that mean in languages that do not have "case"?
C doesn't care about human languages - the closest it comes to doing so
is caring about the locale.
No matter what the locale is, the characters "AFINTYafinty" are part of
the basic character set, and must therefore be representable in both the
source character set and the execution character set. (5.2.1p1) Korean,
Chinese, and Hebrew characters can only be part of the extended
character set, not the basic character set.

The terms "uppercase letter" and "lowercase letter" are defined for the
Latin alphabet in 5.2.1p3, and they apply to each of the letters in
"AFINTY" and "afinty", respectively, regardless of locale. The behavior
of isupper() (7.4.1.11), islower() (7.4.1.7), toupper() (7.4.2.2) and
tolower() (7.4.2.1) is tied to those definitions.
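
As a rough sketch of a comparison that ignores case for exactly those
basic Latin letters and nothing else, something like the following would
do. The names fold_basic and has_prefix_nocase are hypothetical, and the
lookup strings are used only because the standard does not promise that
letters have contiguous codes.

```c
#include <stddef.h>
#include <string.h>

/* Fold one basic Latin uppercase letter to lowercase without consulting
   the locale. Any other character is returned unchanged. */
static char fold_basic(char c)
{
    static const char upper[] = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    static const char lower[] = "abcdefghijklmnopqrstuvwxyz";
    /* strchr() would match the terminator for c == '\0', so skip that. */
    const char *p = (c != '\0') ? strchr(upper, c) : NULL;

    return p ? lower[p - upper] : c;
}

/* Hypothetical helper: returns 1 if s begins with word, ignoring case
   for basic Latin letters only, and 0 otherwise. */
int has_prefix_nocase(const char *s, const char *word)
{
    for (; *word != '\0'; ++s, ++word) {
        if (fold_basic(*s) != fold_basic(*word))
            return 0;           /* mismatch, or s ended too early */
    }
    return 1;
}
```

Because nothing here looks at the locale, a string starting with a
multibyte character such as a UTF-8 dotless i simply fails to match
"inf", in Turkish locales as much as anywhere else.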
Post by jacob navia
Are those 3-letter sequences hardwired to English? (Then the correct way
would be to translate with wcstombs into a buffer, then compare ignoring
case.) But that supposes that programmers' keyboards in Korea can write
"NAN" and "INF" and that all supporting software accepts Latin letters
and displays them correctly, etc.
Any platform where you can create C code must have those capabilities.
That doesn't rule out the possibility of cross-compiling for a platform
lacking those capabilities. However, I think it's not an issue worth
worrying about. A platform where none of those characters can be typed
couldn't even handle base==11, which presents an even more fundamental
problem for strtold().
Post by jacob navia
Shouldn't those letters be part of a "locale" convention?
If base == 36, all of the letters from 'a'-'z' and 'A'-'Z' are
supposed to be recognized as valid parts of the subject sequence. If
base == 16, 'x' and 'X' must be recognized if part of the hexadecimal
prefix, and 'e', 'E', 'p' and 'P' must be recognized as distinguishing
the exponent. None of those letters are supposed to be interpreted in a
locale-dependent fashion, so I don't see a need to treat NAN or INF
differently.
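
A small sketch of the letters-as-digits point. Note that the base
argument exists on strtol() and its relatives rather than on strtold(),
so the sketch assumes that family is what is meant and uses strtol().
The helper parse_all is hypothetical and uses -1 as a failure marker, so
it is only meant for non-negative demo inputs.

```c
#include <stdlib.h>

/* Hypothetical helper: parse the whole of s as an integer in the given
   base. Returns the value, or -1 if s is not entirely a valid number in
   that base. */
long parse_all(const char *s, int base)
{
    char *end;
    long v = strtol(s, &end, base);

    return (end != s && *end == '\0') ? v : -1;
}
```

Here parse_all("z", 36) and parse_all("Z", 36) both give 35,
parse_all("0XfF", 16) gives 255, and parse_all("A", 11) gives 10, while
parse_all("b", 11) fails because 'b' is not a digit in base 11.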
Jakob Bohm
2015-03-20 13:52:25 UTC
Post by James Kuyper
Post by jacob navia
Shouldn't those letters be part of a "locale" convention?
If base == 36, all of the letters from 'a'-'z' and 'A'-'Z' are
supposed to be recognized as valid parts of the subject sequence. If
base == 16, 'x' and 'X' must be recognized if part of the hexadecimal
prefix, and 'e', 'E', 'p' and 'P' must be recognized as distinguishing
the exponent. None of those letters are supposed to be interpreted in a
locale-dependent fashion, so I don't see a need to treat NAN or INF
differently.
If applicable, one might (in a future standard edition) extend this by
referring to the case-mapping work done by the Unicode committee and
standard, with the additional caveat that where available, such a future
C standard should explicitly require those functions to recognize
"lowercase dotless i" and "uppercase I with dot above" as
case-independent equivalents of "Latin letter I" even outside the
Turkish locales. Similarly, such a future standard might require all
implementation parts to recognize the special "single-width" and
"double-width" code points for each character of the basic character
set, if those code points exist in the extended character set.

Such a future standard might also require implementations to recognize
any special extended character set code points that directly represent
INF or NAN, such as the traditional infinity mathematical symbol which
exists in many extended character sets.

This would all involve a balance between keeping those functions small
and simple and catering to extended character set support. Many C
programs implicitly rely on those functions to *only* accept the
ASCII code points when called with untrusted outside data, and might
suffer unpleasant failures if invalid machine-machine messages are
suddenly recognized as something else.

Thus, in general, locale-sensitive or otherwise extended-char-accepting
runtime function variants should preferably be given new names such as
loc_strtold() or uni_strtold() (with the former being locale dependent,
and the latter using an internationally consistent Unicode
interpretation chosen by the C standard).
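
Purely as an illustration of the separate-name idea, here is a sketch of
what a uni_strtold() might look like. No such function exists in any
standard. The sketch assumes the input is UTF-8, recognizes only the
infinity sign U+221E with an optional leading sign, does not skip
leading whitespace before it, and hands everything else to the ordinary
strtold().

```c
#include <math.h>
#include <stdlib.h>
#include <string.h>

/* Sketch only: a hypothetical extended variant of strtold(). */
long double uni_strtold(const char *s, char **endptr)
{
    static const char inf_utf8[] = "\xE2\x88\x9E";  /* U+221E in UTF-8 */
    const char *p = s;
    int negative = 0;

    if (*p == '+' || *p == '-') {
        negative = (*p == '-');
        ++p;
    }
    if (strncmp(p, inf_utf8, 3) == 0) {
        if (endptr != NULL)
            *endptr = (char *)(p + 3);      /* just past the symbol */
        return negative ? -HUGE_VALL : HUGE_VALL;
    }
    /* Everything else keeps the standard, ASCII-only behavior. */
    return strtold(s, endptr);
}
```

Keeping the extension under its own name leaves strtold() as strict as
before, so code that relies on it rejecting non-ASCII input from
untrusted sources is unaffected.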

Enjoy

Jakob
--
Jakob Bohm, CIO, Partner, WiseMo A/S. https://www.wisemo.com
Transformervej 29, 2860 Søborg, Denmark. Direct +45 31 13 16 10
This public discussion message is non-binding and may contain errors.
WiseMo - Remote Service Management for PCs, Phones and Embedded