...
Post by Vincent Lefevre
Post by James Kuyper
Post by Vincent Lefevre
Post by James Kuyper
I wasn't talking about overflow for its own sake, but only in the
context of what the standard says about the value of floating point
constants. What value does that constant have? Is it one of the three
values permitted by 6.4.4.2p4? Is it, in particular, the value required
by IEEE 754? If the answers to both questions are yes, it's consistent
with everything I said.
The second answer is not "yes", in case nextdown(DBL_MAX) would be
returned.
I'm asking what value you observed - was it nextdown(DBL_MAX), DBL_MAX,
+infinity, or something else? The first three are permitted by the C
standard, the second one is mandated by IEEE 754, so I would expect an
implementation that claimed conformance to both standards to choose
DBL_MAX, and NOT nextdown(DBL_MAX). So - which value did you see?
The issue is not what one can observe on a subset of implementations,
but what is possible.
Why does it matter to you that such implementations are possible? No
such implementation can qualify as conforming to IEEE 754 - so what? The
C standard very deliberately does NOT require conformance to IEEE 754,
and what it requires in areas that are also covered by IEEE 754 is
deliberately more lenient than what IEEE 754 requires, precisely so C
can be implemented on platforms where floating point hardware that can't
meet IEEE 754's accuracy is installed. That's why the __STDC_IEC_*
macros exist - to allow a program to determine whether an implementation
claims to conform to some or all of the requirements of IEC 60559
(==IEEE 754). That's why those macros are described in the section
titled "Conditional feature macros."
Two standards do not (as you claim in the Subject: header of this
thread) contradict each other just because they say different things
about the same situation. If one standard provides a set containing one
or more options, and the other standard provides a different set of one
or more options, the two standards contradict each other only if there's
no overlap between the two sets of options. So long as there is at least
one option that meets the requirements of both standards, they don't
contradict each other.
People do not create full implementations of C just for the fun of it
(well, most people don't). In particular, they don't create an
implementation that conforms to the C standard but not to IEC 60559 by
accident or laziness. In general, you can safely assume that any such
implementation did so because there was some inconvenience associated
with conforming to IEC 60559 that they wished to avoid. If the C
standard were changed to mandate conformance with IEC 60559, some of
those implementations might change to conform with that standard, but
many (possibly most) such implementations would respond by deciding to
not bother conforming to that version of the C standard, because
conforming would be too inconvenient.
Post by Vincent Lefevre
... The value nextdown(DBL_MAX) does not make much
sense when the implementation *knows* that the value is larger than
DBL_MAX because it exceeds the range (there is a diagnostic to tell
that to the user because of 6.4.4p2).
You misunderstand the purpose of the specification in 6.4.4.2p4. It was
not intended that a floating point implementation would generate the
nearest representable value, and that the implementation of C would then
arbitrarily choose one of the other two adjacent representable
values. The reason was to accommodate floating point implementations
that couldn't meet the accuracy requirements of IEC 60559. The
implementation asks the floating point hardware to calculate what the
value is, the hardware does its best to accurately calculate the value,
but its best isn't good enough to qualify as conforming to IEC 60559.
It might take some shortcuts or simplifications that make it faster or
simpler than an IEC 60559-conforming one, at the cost of being less
accurate. It
returns a value that, incorrectly, is not greater than DBL_MAX, and the
wording in 6.4.4.2p4 gives the implementation permission to use that
incorrect number, so long as it isn't smaller than nextdown(DBL_MAX).
...
Post by Vincent Lefevre
Actually it is when the mathematical result exceeds the range. 6.5p5
says: "If an /exceptional condition/ occurs during the evaluation of
an expression (that is, if the result is not mathematically defined or
not in the range of representable values for its type), the behavior
is undefined." So this appears to be an issue when infinity is not
supported.
Conversion of a floating point constant into a floating point value is
not "evaluation of an expression", and therefore is not covered by
6.5p5. Such conversions are required to occur "as-if at translation
time", and exceptional conditions are explicitly prohibited.
Post by Vincent Lefevre
I suppose that when the standard defines something, it assumes the
case where such an exceptional condition does not occur, unless
explicitly said otherwise (that's the whole point of 6.5p5). And in
the definitions concerning floating-point expressions, the standard
never distinguishes between an exceptional condition or not. For
instance, for addition, the standard just says "The result of the
binary + operator is the sum of the operands." (on the real numbers,
this operation is always mathematically well-defined, so the only
issue is results that exceed the range, introduced by 6.5p5).
The standard is FAR more lenient with regard to floating point
operations than it is for floating point constants:
"The accuracy of the floating-point operations ( + , - , * , / ) and of
the library functions in <math.h> and <complex.h> that return
floating-point results is implementation-defined, as is the accuracy of
the conversion between floating-point internal representations and
string representations performed by the library functions in <stdio.h> ,
<stdlib.h> , and <wchar.h> . The implementation may state that the
accuracy is unknown." (5.2.4.2.2p8).
That wording allows an implementation to implement floating point
arithmetic so inaccurately that it can conclude that the expression
LDBL_MAX - LDBL_MIN < LDBL_MIN - LDBL_MAX is true. Note: the comparison
operators (== != < > <= >=) are not covered by 5.2.4.2.2p8, but the
subtraction operator is.
I don't approve of this situation; I can't imagine any good reason for
implementing floating point operations as inaccurately as the standard
allows them to be implemented. The standard should provide some more
meaningful requirements. They don't have to be very strong - they could
be weak enough that every known serious floating point implementation
could meet them, and still be immensely stronger than the current
requirements. Any platform where floating point isn't actually needed
should simply be allowed to opt out of supporting floating point
entirely, rather than being required to support it but allowed to
implement it that badly. That would be safer for all concerned.
However, those incredibly loose requirements are what the standard
actually says.
...
Post by Vincent Lefevre
Post by James Kuyper
Post by Vincent Lefevre
Post by James Kuyper
For an implementation that supports infinities (in other words, an
implementation where infinities are representable), how do infinities
fail to qualify as being within the range of representable values? Where
is that exclusion specified?
5.2.4.2.2p5. Note that it seems that it is intended to exclude
some representable values from the range. Otherwise such a long
specification of the range would not be needed.
That clause correctly states that infinities do NOT qualify as
floating point numbers.
Note that there are inconsistencies in the standard about what
it means by "floating-point numbers". It is sometimes used to
mean the value of a floating type. For instance, the standard
says for fabs: "The fabs functions compute the absolute value
of a floating-point number x." But I really don't think that
this function is undefined on infinities.
If __STDC_IEC_60559_BFP__ is pre#defined by the implementation, F.10.4.3
not only allows fabs(±∞), it explicitly mandates that it return +∞.
(Note: if you see odd symbols on the previous line, they were supposed
to be infinities.)
Post by Vincent Lefevre
Post by James Kuyper
However, it also correctly refers to them as values. The relevant
clauses refer to the range of representable values, not the range of
representable floating point numbers. On such an implementation,
infinities are representable and they are values.
My point is that it says *real* numbers. And infinities are not
real numbers.
In n2731.pdf, 5.2.4.2.2p5 says "An implementation may give zero and
values that are not floating-point numbers (such as infinities
and NaNs) a sign or may leave them unsigned. Wherever such values are
unsigned, any requirement in this document to retrieve the sign shall
produce an unspecified sign, and any requirement to set the sign shall
be ignored."
Nowhere in that clause does it use the term "real".
Are you perhaps referring to 5.2.4.2.2p7?
...
Post by Vincent Lefevre
Post by James Kuyper
Post by Vincent Lefevre
Post by James Kuyper
Post by Vincent Lefevre
Post by James Kuyper
However, I'm confused about how this connects to the standard's
definition of normalized floating-point numbers: "f_1 > 0"
(5.2.4.2.2p4). It seems to me that, even for the pair-of-doubles format,
LDBL_MAX is represented by a value with f_1 = 1, and therefore is a
normalized floating point number that is larger than LDBL_NORM_MAX,
which strikes me as a contradiction.
Note that there is a requirement on the exponent: e ≤ e_max.
Yes, and DBL_MAX has e==e_max.
No, not necessarily. DBL_NORM_MAX has e == e_max. But DBL_MAX is
defined as the maximum representable finite floating-point number; if
that number is normalized, its value is (1 − b^(−p)) b^(e_max).
So, what is the value of e for LDBL_MAX in the pair-of-doubles format?
It should be DBL_MAX_EXP. What happens with double-double is that
for the maximum exponent of double, not all precision-p numbers
are representable (here, p = 106 = 2 * 53 historically, though
107 could actually be used thanks to the constraint below and the
limitation on the exponent discussed here).
The reason is that there is a constraint on the format in order
to make the double-double algorithms fast enough: if (x1,x2) is
a valid double-double number, then x1 must be equal to x1 + x2
rounded to nearest. Hence the maximum representable value is
.111...1110111...111 * 2^(DBL_MAX_EXP)
where both sequences 111...111 have 53 bits. Values above this
number would increase the exponent of x1 to DBL_MAX_EXP + 1,
which is above the maximum exponent for double; thus such values
are not representable.
The consequence is that e_max < DBL_MAX_EXP.
Post by James KuyperWhat is the value of e_max?
DBL_MAX_EXP - 1
Post by James KuyperIf LDBL_MAX does not have e==e_max,
(LDBL_MAX has exponent e = e_max + 1.)
That doesn't work. 5.2.4.2.2p2 and p3 both specify that floating point
numbers must have e_min <= e && e <= e_max. LDBL_MAX is defined as the
"maximum finite floating point number". A value for which e > e_max
can't qualify as a floating point number, and therefore in particular
can't qualify as the maximum finite floating point number. An
implementation that uses the sum-of-pair-of-doubles floating point
format has two options: increase e_max high enough to include the value
you specify for LDBL_MAX, or decrease LDBL_MAX to a value low enough to
have e<=e_max.
Key point: most items in 5.2.4.2.2 have two parts: a description, and an
expression involving the parameters of the floating point format. For
formats that are a good fit to the C standard's floating point model,
those formulas give the exactly correct result. For other formats, the
description is what specifies what the result must be, the formula
should be treated only as an example that might not apply.
Those formulas were written on an implicit assumption that becomes
obvious only when you try to apply them to a format that violates the
assumption: every base_b digit from f_1 to f_p can freely be set to any
value from 0 to b-1. In particular, the formula for LDBL_MAX was based
upon the assumption that all of those values were set to b-1, and e was
set to e_max.
A pair-of-doubles format could fit that assumption if a restriction were
imposed that says that a pair (x1, x2) is allowed only if x2 == 0 || (
1ulp on x1 > x2 && x2 >= 0.5 ulp on x1). (that condition needs to be
modified to give the right requirements for negative numbers). Such an
implementation could, with perfect accuracy, be described using
LDBL_MANT_DIG == 2*DBL_MANT_DIG and LDBL_MAX_EXP == DBL_MAX_EXP.
However, the pair-of-doubles format you've described doesn't impose such
requirements. The value of p must be high enough that, for any pair (x1,
x2) where x1 is finite and x2 is non-zero which is meant to qualify as
representing a floating point number, p covers both the most significant
digit of x1, and the least significant digit of any non-zero x2, no
matter how large the ratio x1/x2 is. Whenever that ratio is high enough,
f_k for most values of k can only be 0. As a result, one of the
assumptions behind the formulas in 5.2.4.2.2 isn't met, so those
formulas aren't always valid for such a format - but the descriptions
still apply.