Discussion:
Does reading an uninitialized object have undefined behavior?
Keith Thompson
2023-07-21 05:16:01 UTC
N3096 is the last public draft of the upcoming C23 standard.

N3096 J.2 says:

The behavior is undefined in the following circumstances:
[...]
(11) The value of an object with automatic storage duration is
used while the object has an indeterminate representation
(6.2.4, 6.7.10, 6.8).

I'll use an `int` object in my example.

Reading an object that holds a non-value representation has undefined
behavior, but not all integer types have non-value representations
-- and if an implementation has certain characteristics, we can
reliably infer that int has no non-value representations (called
"trap representations" in C99, C11, and C17).

Consider this program:
```
#include <limits.h>
int main(void) {
    int foo;
    if (sizeof (int) == 4 &&
        CHAR_BIT == 8 &&
        INT_MAX == 2147483647 &&
        INT_MIN == -INT_MAX-1)
    {
        int bar = foo;
    }
}
```

If the condition is true (as it is for many real-world
implementations), then int has no padding bits and no trap
representations: sizeof (int) == 4 and CHAR_BIT == 8 give 32 bits,
and representing the range [INT_MIN, INT_MAX] = [-2^31, 2^31-1]
requires all 32 of them (31 value bits plus one sign bit), leaving
none for padding. The object `foo` has an indeterminate representation
when it's used to initialize `bar`. Since it cannot have a non-value
representation, it has an unspecified value.

If J.2(11) is correct, then the use of the value results in undefined
behavior.

But Annex J is non-normative, and as far as I can tell there is no
normative text in the standard that says the behavior is undefined.

6.2.4 discusses storage duration.

6.7.10 discusses initialization; p11 implies that the representation of
`foo` is indeterminate. It does not say anything about what happens
when the value of such an object is used.

6.8 discusses statements and blocks, and repeats that "the
representation of objects without an initializer becomes
indeterminate".

None of these discuss what happens when the value of an object with
an indeterminate representation is used -- nor does any other text
I found by searching the standard for "indeterminate representation".

I see no relevant changes between C11 and C23 (except that C23 changes
the term "trap representation" to "non-value representation").

I suggest there are three possible resolutions:

1. J.2(11) is correct and I've missed something (always a possibility,
but so far nobody in comp.lang.c has come up with anything).

2. J.2(11) reflects the intent, and normative text somewhere else
in the standard needs to be updated or added to make it clear
that using the value of an object with automatic storage duration
while the object has an indeterminate representation has undefined
behavior.

3. J.2(11) is incorrect and needs to be modified or deleted.
(This would also imply that compilers may not perform certain
optimizations. I have no idea whether any compilers would actually
be affected.)
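As an illustration of what is at stake in resolution 3, here is a
hypothetical sketch (mine, not from the standard): if the read merely
yields an unspecified value, the program below must behave like an
ordinary program; if it is undefined behavior, the compiler may assume
the branch is unreachable and delete it.

```
#include <stdio.h>

int main(void) {
    int foo;            /* never initialized; assume int has no
                           non-value representations here */
    int bar = foo;

    /* If reading foo yields only an unspecified value, bar holds some
       ordinary int, this condition is a tautology, and the message must
       be printed. If the read is undefined behavior, anything goes. */
    if (bar >= 0 || bar < 0)
        puts("bar holds an ordinary int value");
    return 0;
}
```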

I'm going to post this to comp.std.c and email it to the C23 editors.
--
Keith Thompson (The_Other_Keith) Keith.S.Thompson+***@gmail.com
Will write code for food.
void Void(void) { Void(); } /* The recursive call of the void */
Ben Bacarisse
2023-07-21 15:33:53 UTC
Post by Keith Thompson
N3096 is the last public draft of the upcoming C23 standard.
[...]
(11) The value of an object with automatic storage duration is
used while the object has an indeterminate representation
(6.2.4, 6.7.10, 6.8).
I'll use an `int` object in my example.
Reading an object that holds a non-value representation has undefined
behavior, but not all integer types have non-value representations
-- and if an implementation has certain characteristics, we can
reliably infer that int has no non-value representations (called
"trap representations" in C99, C11, and C17).
```
#include <limits.h>
int main(void) {
    int foo;
    if (sizeof (int) == 4 &&
        CHAR_BIT == 8 &&
        INT_MAX == 2147483647 &&
        INT_MIN == -INT_MAX-1)
    {
        int bar = foo;
    }
}
```
If the condition is true (as it is for many real-world
implementations), then int has no padding bits and no trap
representations. The object `foo` has an indeterminate representation
when it's used to initialize `bar`. Since it cannot have a non-value
representation, it has an unspecified value.
If J.2(11) is correct, then the use of the value results in undefined
behavior.
But Annex J is non-normative, and as far as I can tell there is no
normative text in the standard that says the behavior is undefined.
6.3.2.1 p2:

"[...] If the lvalue designates an object of automatic storage
duration that could have been declared with the register storage class
(never had its address taken), and that object is uninitialized (not
declared with an initializer and no assignment to it has been
performed prior to use), the behavior is undefined."

seems to cover it. The restriction on not having its address taken
seems odd.
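A minimal sketch of the distinction that wording draws (illustrative
only; the variable names are invented):

```
void demo(void) {
    int a;          /* never initialized, address never taken: "could
                       have been declared with the register storage
                       class" */
    int b;          /* never initialized */
    int *p = &b;    /* b's address is taken, so 6.3.2.1 p2 no longer
                       applies to it */

    int x = a;      /* undefined behavior per 6.3.2.1 p2 */
    int y = b;      /* not covered by 6.3.2.1 p2; only Annex J calls
                       this undefined */
    (void)p; (void)x; (void)y;
}
```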
--
Ben.
Keith Thompson
2023-07-21 18:56:00 UTC
Post by Ben Bacarisse
Post by Keith Thompson
N3096 is the last public draft of the upcoming C23 standard.
[...]
(11) The value of an object with automatic storage duration is
used while the object has an indeterminate representation
(6.2.4, 6.7.10, 6.8).
I'll use an `int` object in my example.
Reading an object that holds a non-value representation has undefined
behavior, but not all integer types have non-value representations
-- and if an implementation has certain characteristics, we can
reliably infer that int has no non-value representations (called
"trap representations" in C99, C11, and C17).
```
#include <limits.h>
int main(void) {
    int foo;
    if (sizeof (int) == 4 &&
        CHAR_BIT == 8 &&
        INT_MAX == 2147483647 &&
        INT_MIN == -INT_MAX-1)
    {
        int bar = foo;
    }
}
```
If the condition is true (as it is for many real-world
implementations), then int has no padding bits and no trap
representations. The object `foo` has an indeterminate representation
when it's used to initialize `bar`. Since it cannot have a non-value
representation, it has an unspecified value.
If J.2(11) is correct, then the use of the value results in undefined
behavior.
But Annex J is non-normative, and as far as I can tell there is no
normative text in the standard that says the behavior is undefined.
"[...] If the lvalue designates an object of automatic storage
duration that could have been declared with the register storage class
(never had its address taken), and that object is uninitialized (not
declared with an initializer and no assignment to it has been
performed prior to use), the behavior is undefined."
seems to cover it. The restriction on not having its address taken
seems odd.
Good find.

That sentence was added in C11 (it doesn't appear in C99 or in
N1256, which consists of C99 plus the three Technical Corrigenda)
in response to DR #338. Since the wording in Annex J goes back to
C99 in its current form, and to C90 in a slightly different form,
that can't be what Annex J is referring to. And the statement
in Annex J is more general, so we can't quite use 6.3.2.1p2 as a
retroactive justification.

https://www.open-std.org/jtc1/sc22/wg14/www/docs/dr_338.htm

Yes, that restriction does seem strange. It was inspired by the
IA64 (Itanium) architecture, which has an extra trap bit in each
CPU register (NaT, "not a thing"). The "could have been declared
with the register storage class" wording is there because the IA64
NaT bit exists only in CPU registers, not in memory.

An object with automatic storage duration might be stored in an IA64
CPU register. If the object is not initialized, the register's
NaT bit would be set. Any attempt to read it would cause a trap.
Writing it would clear the NaT bit.

Which means that a hypothetical CPU with something like a NaT bit
on each word of memory (iAPX 432? i960?) might cause a trap in
circumstances not covered by that wording -- but it *is* covered
by the wording in Annex J.

(Normally, an object whose address is taken can still be stored in
a CPU register for part of its lifetime. The effect is to forbid
certain optimizations on IA64-like systems.)

It's tempting to conclude that reading an uninitialized automatic
object whose address is taken is *not* undefined behavior
(https://en.wikipedia.org/wiki/Exception_that_proves_the_rule),
but the standard doesn't say so.

C90's Annex G (renamed to Annex J in later editions) says:

The behavior in the following circumstances is undefined:
[...]
- The value of an uninitialized object that has automatic storage
duration is used before a value is assigned (6.5.7).

6.5.7 discusses initialization, but doesn't say that reading an
uninitialized object has undefined behavior, so the issue is an old one.
--
Keith Thompson (The_Other_Keith) Keith.S.Thompson+***@gmail.com
Will write code for food.
void Void(void) { Void(); } /* The recursive call of the void */
Ben Bacarisse
2023-07-21 19:54:36 UTC
Post by Keith Thompson
Post by Ben Bacarisse
Post by Keith Thompson
N3096 is the last public draft of the upcoming C23 standard.
[...]
(11) The value of an object with automatic storage duration is
used while the object has an indeterminate representation
(6.2.4, 6.7.10, 6.8).
I'll use an `int` object in my example.
Reading an object that holds a non-value representation has undefined
behavior, but not all integer types have non-value representations
-- and if an implementation has certain characteristics, we can
reliably infer that int has no non-value representations (called
"trap representations" in C99, C11, and C17).
```
#include <limits.h>
int main(void) {
    int foo;
    if (sizeof (int) == 4 &&
        CHAR_BIT == 8 &&
        INT_MAX == 2147483647 &&
        INT_MIN == -INT_MAX-1)
    {
        int bar = foo;
    }
}
```
If the condition is true (as it is for many real-world
implementations), then int has no padding bits and no trap
representations. The object `foo` has an indeterminate representation
when it's used to initialize `bar`. Since it cannot have a non-value
representation, it has an unspecified value.
If J.2(11) is correct, then the use of the value results in undefined
behavior.
But Annex J is non-normative, and as far as I can tell there is no
normative text in the standard that says the behavior is undefined.
"[...] If the lvalue designates an object of automatic storage
duration that could have been declared with the register storage class
(never had its address taken), and that object is uninitialized (not
declared with an initializer and no assignment to it has been
performed prior to use), the behavior is undefined."
seems to cover it. The restriction on not having its address taken
seems odd.
Good find.
That sentence was added in C11 (it doesn't appear in C99 or in
N1256, which consists of C99 plus the three Technical Corrigenda)
in response to DR #338. Since the wording in Annex J goes back to
C99 in its current form, and to C90 in a slightly different form,
that can't be what Annex J is referring to. And the statement
in Annex J is more general, so we can't quite use 6.3.2.1p2 as a
retroactive justification.
Thanks for looking into the history. I was going to do that when I had
some time.

There are three relevant clauses in Annex J, and I think we should keep
them all in mind. Sadly, they are not numbered (until C23) so I've
given them 'UB' numbers taken from the similar wording in C23.

— The value of an object with automatic storage duration is used while
it is indeterminate (6.2.4, 6.7.9, 6.8). [UB-11]

— A trap representation is read by an lvalue expression that does not
have character type (6.2.6.1). [UB-12]

— An lvalue designating an object of automatic storage duration that
could have been declared with the register storage class is used in
a context that requires the value of the designated object, but the
object is uninitialized. (6.3.2.1). [UB-20]

Clearly, UB-20 is explained by the quote I posted, but UB-11 (the one we
are talking about) is there as well and, as you say, can't be fully
explained by that normative quote.
Post by Keith Thompson
https://www.open-std.org/jtc1/sc22/wg14/www/docs/dr_338.htm
Yes, that restriction does seem strange. It was inspired by the
IA64 (Itanium) architecture, which has an extra trap bit in each
CPU register (NaT, "not a thing"). The "could have been declared
with the register storage class" wording is there because the IA64
NaT bit exists only in CPU registers, not in memory.
Thanks. I wondered if it might have been some hardware consideration...
Post by Keith Thompson
An object with automatic storage duration might be stored in an IA64
CPU register. If the object is not initialized, the register's
NaT bit would be set. Any attempt to read it would cause a trap.
Writing it would clear the NaT bit.
Which means that a hypothetical CPU with something like a NaT bit
on each word of memory (iAPX 432? i960?) might cause a trap in
circumstances not covered by that wording -- but it *is* covered
by the wording in Annex J.
It's covered by UB-12 and that's backed up by normative text,
specifically paragraph 5 of the section cited in UB-12.
Post by Keith Thompson
(Normally, an object whose address is taken can still be stored in
a CPU register for part of its lifetime. The effect is to forbid
certain optimizations on IA64-like systems.)
It's tempting to conclude that reading an uninitialized automatic
object whose address is taken is *not* undefined behavior
(https://en.wikipedia.org/wiki/Exception_that_proves_the_rule),
but the standard doesn't say so.
But it doesn't say that it is UB either, does it? That case is excluded
in 6.3.2.1 p2, but there's nothing else covering it but the non-normative
Annex J.
Post by Keith Thompson
[...]
- The value of an uninitialized object that has automatic storage
duration is used before a value is assigned (6.5.7).
6.5.7 discusses initialization, but doesn't say that reading an
uninitialized object has undefined behavior, so the issue is an old one.
--
Ben.
Keith Thompson
2023-07-21 21:26:20 UTC
[...]
Post by Ben Bacarisse
There are three relevant clauses in Annex J, and I think we should keep
them all in mind. Sadly, they are not numbered (until C23) so I've
given them 'UB' numbers taken from the similar wording in C23.
— The value of an object with automatic storage duration is used while
it is indeterminate (6.2.4, 6.7.9, 6.8). [UB-11]
— A trap representation is read by an lvalue expression that does not
have character type (6.2.6.1). [UB-12]
— An lvalue designating an object of automatic storage duration that
could have been declared with the register storage class is used in
a context that requires the value of the designated object, but the
object is uninitialized. (6.3.2.1). [UB-20]
[...]
Post by Ben Bacarisse
Post by Keith Thompson
An object with automatic storage duration might be stored in an IA64
CPU register. If the object is not initialized, the register's
NaT bit would be set. Any attempt to read it would cause a trap.
Writing it would clear the NaT bit.
Which means that a hypothetical CPU with something like a NaT bit
on each word of memory (iAPX 432? i960?) might cause a trap in
circumstances not covered by that wording -- but it *is* covered
by the wording in Annex J.
It's covered by UB-12 and that's backed up by normative text,
specifically paragraph 5 of the section cited in UB-12.
I don't think so. A "non-value representation" (formerly a "trap
representation") is determined by the bits making up the representation
of an object. For an integer type, such a representation can occur only
if the type has padding bits. The IA64 NaT bit is not part of the
representation; it's neither a value bit nor a padding bit.

For a 64-bit integer type, given CHAR_BIT==8, its *representation* is
defined as a set of 8 bytes that can be copied into an object of type
`unsigned char[8]`. The NaT bit does not contribute to the size of the
object.

I think the right way for C to permit NaT-like bits is, as Kaz
suggested, to define "indeterminate value" in terms of provenance,
not just the bits that make up its current representation.
An automatic object with no initialization, or a malloc()ed object,
starts with an indeterminate value, and accessing that value
(other than as an array of characters) has undefined behavior.
(This is a proposal, not what the standard currently says.)
IA64 happens to have a way of (partially) representing that
provenance in hardware, outside the object in question. Other or
future architectures might do a more complete job.

[...]
--
Keith Thompson (The_Other_Keith) Keith.S.Thompson+***@gmail.com
Will write code for food.
void Void(void) { Void(); } /* The recursive call of the void */
Ben Bacarisse
2023-07-21 22:39:42 UTC
Post by Keith Thompson
[...]
Post by Ben Bacarisse
There are three relevant clauses in Annex J, and I think we should keep
them all in mind. Sadly, they are not numbered (until C23) so I've
given them 'UB' numbers taken from the similar wording in C23.
— The value of an object with automatic storage duration is used while
it is indeterminate (6.2.4, 6.7.9, 6.8). [UB-11]
— A trap representation is read by an lvalue expression that does not
have character type (6.2.6.1). [UB-12]
— An lvalue designating an object of automatic storage duration that
could have been declared with the register storage class is used in
a context that requires the value of the designated object, but the
object is uninitialized. (6.3.2.1). [UB-20]
[...]
Post by Ben Bacarisse
Post by Keith Thompson
An object with automatic storage duration might be stored in an IA64
CPU register. If the object is not initialized, the register's
NaT bit would be set. Any attempt to read it would cause a trap.
Writing it would clear the NaT bit.
Which means that a hypothetical CPU with something like a NaT bit
on each word of memory (iAPX 432? i960?) might cause a trap in
circumstances not covered by that wording -- but it *is* covered
by the wording in Annex J.
It's covered by UB-12 and that's backed up by normative text,
specifically paragraph 5 of the section cited in UB-12.
I don't think so. A "non-value representation" (formerly a "trap
representation") is determined by the bits making up the representation
of an object. For an integer type, such a representation can occur only
if the type has padding bits. The IA64 NaT bit is not part of the
representation; it's neither a value bit nor a padding bit.
For a 64-bit integer type, given CHAR_BIT==8, its *representation* is
defined as a set of 8 bytes that can be copied into an object of type
`unsigned char[8]`. The NaT bit does not contribute to the size of the
object.
Ah, right. I thought you were including it as a padding bit.
Post by Keith Thompson
I think the right way for C to permit NaT-like bits is, as Kaz
suggested, to define "indeterminate value" in terms of provenance,
not just the bits that make up its current representation.
An automatic object with no initialization, or a malloc()ed object,
starts with an indeterminate value, and accessing that value
(other than as an array of characters) has undefined behavior.
(This is a proposal, not what the standard currently says.)
IA64 happens to have a way of (partially) representing that
provenance in hardware, outside the object in question. Other or
future architectures might do a more complete job.
[...]
That would work.
--
Ben.
Tim Rentsch
2023-08-13 00:00:40 UTC
Post by Keith Thompson
I think the right way for C to permit NaT-like bits is, as Kaz
suggested, to define "indeterminate value" in terms of provenance,
not just the bits that make up its current representation. [...]
This idea is fundamentally wrong. NaT bits are associated with
particular areas of memory, which is to say objects. The point
of provenance is that non-viability is associated with /values/,
not with objects. Once an area of memory acquires an object
representation, the NaT bit or NaT bits for that memory are set
to zero, end of story. Note also that NaT bits are independent
of what type is used to access an object - if the NaT bit is set
then any access is illegal, no matter what type is used to do the
access. By contrast, provenance is used in situations where
non-viability is associated with values, not with objects. But
values are always type dependent; a pointer object that holds
a value that has been passed to free() is "indeterminate" when
accessed as a pointer type, but perfectly okay to access as an
unsigned char type. The two kinds of situations are essentially
different, and the theoretical models used to characterize the
rules in the two kinds of situations should therefore be
correspondingly essentially different.
Martin Uecker
2023-08-14 06:41:06 UTC
Post by Keith Thompson
I think the right way for C to permit NaT-like bits is, as Kaz
suggested, to define "indeterminate value" in terms of provenance,
not just the bits that make up its current representation. [...]
This idea is fundamentally wrong. NaT bits are associated with
particular areas of memory, which is to say objects. The point
of provenance is that non-viability is associated with /values/,
not with objects. Once an area of memory acquires an object
representation, the NaT bit or NaT bits for that memory are set
to zero, end of story. Note also that NaT bits are independent
of what type is used to access an object - if the NaT bit is set
then any access is illegal, no matter what type is used to do the
access. By contrast, provenance is used in situations where
non-viability is associated with values, not with objects. But
values are always type dependent; a pointer object that holds
a value that has been passed to free() is "indeterminate" when
accessed as a pointer type, but perfectly okay to access as an
unsigned char type. The two kinds of situations are essentially
different, and the theoretical models used to characterize the
rules in the two kinds of situations should therefore be
correspondingly essentially different.
One could still consider the idea that "indeterminate" is an
abstract property that yields UB during read even for types
that do not have trap representations. There is no wording
in the C standard to support this, but I would not call this
idea "fundamentally wrong". You are right that this is different
to pointer provenance, which is about values. What it would
have in common with pointer provenance is that there is hidden
state in the abstract machine associated with memory that
is not part of the representation. With effective types there
is another example of this.
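A sketch of the effective-type example (illustrative; assumes int and
float have the same size, as on common ABIs). The hidden state is not
recoverable from the bytes themselves:

```
#include <stdlib.h>

float effective_type_demo(void) {
    void *p = malloc(sizeof(int));
    if (!p) return 0.0f;
    *(int *)p = 1;          /* the allocation's effective type is now int */
    float f = *(float *)p;  /* UB: accessed with a non-compatible type,
                               even though the bytes are unremarkable */
    free(p);
    return f;
}
```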

Martin
Tim Rentsch
2023-08-16 04:06:37 UTC
Post by Martin Uecker
Post by Tim Rentsch
Post by Keith Thompson
I think the right way for C to permit NaT-like bits is, as Kaz
suggested, to define "indeterminate value" in terms of provenance,
not just the bits that make up its current representation. [...]
This idea is fundamentally wrong. NaT bits are associated with
particular areas of memory, which is to say objects. The point
of provenance is that non-viability is associated with /values/,
not with objects. Once an area of memory acquires an object
representation, the NaT bit or NaT bits for that memory are set
to zero, end of story. Note also that NaT bits are independent
of what type is used to access an object - if the NaT bit is set
then any access is illegal, no matter what type is used to do the
access. By contrast, provenance is used in situations where
non-viability is associated with values, not with objects. But
values are always type dependent; a pointer object that holds
a value that has been passed to free() is "indeterminate" when
accessed as a pointer type, but perfectly okay to access as an
unsigned char type. The two kinds of situations are essentially
different, and the theoretical models used to characterize the
rules in the two kinds of situations should therefore be
correspondingly essentially different.
One could still consider the idea that "indeterminate" is an
abstract property that yields UB during read even for types
that do not have trap representations. There is no wording
in the C standard to support this, but I would not call this
idea "fundamentally wrong". You are right that this is different
to pointer provenance, which is about values. What it would
have in common with pointer provenance is that there is hidden
state in the abstract machine associated with memory that
is not part of the representation. With effective types there
is another example of this.
My preceding comments were meant to be only about NaT bits (or
NaT-like bits) and provenance. There is an inherent mismatch
between the two, as I have tried to explain. It is only the idea
that provenence would provide a good foundation for defining the
semantics of "NaT everywhere" that I am saying is fundamentally
wrong.

I understand that you want to consider a broader topic, and that,
in the realm of that broader topic, something like provenance
could have a role to play. I think it is worth responding to
that thesis, and am expecting to do so in a separate reply (or
new thread?) although probably not right away.
Martin Uecker
2023-08-16 05:40:37 UTC
Post by Tim Rentsch
Post by Martin Uecker
Post by Keith Thompson
I think the right way for C to permit NaT-like bits is, as Kaz
suggested, to define "indeterminate value" in terms of provenance,
not just the bits that make up its current representation. [...]
This idea is fundamentally wrong. NaT bits are associated with
particular areas of memory, which is to say objects. The point
of provenance is that non-viability is associated with /values/,
not with objects. Once an area of memory acquires an object
representation, the NaT bit or NaT bits for that memory are set
to zero, end of story. Note also that NaT bits are independent
of what type is used to access an object - if the NaT bit is set
then any access is illegal, no matter what type is used to do the
access. By contrast, provenance is used in situations where
non-viability is associated with values, not with objects. But
values are always type dependent; a pointer object that holds
a value that has been passed to free() is "indeterminate" when
accessed as a pointer type, but perfectly okay to access as an
unsigned char type. The two kinds of situations are essentially
different, and the theoretical models used to characterize the
rules in the two kinds of situations should therefore be
correspondingly essentially different.
One could still consider the idea that "indeterminate" is an
abstract property that yields UB during read even for types
that do not have trap representations. There is no wording
in the C standard to support this, but I would not call this
idea "fundamentally wrong". You are right that this is different
to pointer provenance, which is about values. What it would
have in common with pointer provenance is that there is hidden
state in the abstract machine associated with memory that
is not part of the representation. With effective types there
is another example of this.
My preceding comments were meant to be only about NaT bits (or
NaT-like bits) and provenance. There is an inherent mismatch
between the two, as I have tried to explain. It is only the idea
that provenance would provide a good foundation for defining the
semantics of "NaT everywhere" that I am saying is fundamentally
wrong.
I understand that you want to consider a broader topic, and that,
in the realm of that broader topic, something like provenance
could have a role to play. I think it is worth responding to
that thesis, and am expecting to do so in a separate reply (or
new thread?) although probably not right away.
I would love to hear your comments, because some people
want to have such an abstract notion of "indeterminate" and
some already believe that this is how the standard should
be understood today.

Martin
Tim Rentsch
2023-08-17 06:13:03 UTC
Martin Uecker <***@gmail.com> writes:

[some unrelated passages removed]
[...]
Post by Martin Uecker
Post by Tim Rentsch
Post by Martin Uecker
One could still consider the idea that "indeterminate" is an
abstract property that yields UB during read even for types
that do not have trap representations. There is no wording
in the C standard to support this, but I would not call this
idea "fundamentally wrong". You are right that this is different
to pointer provenance, which is about values. What it would
have in common with pointer provenance is that there is hidden
state in the abstract machine associated with memory that
is not part of the representation. With effective types there
is another example of this.
I understand that you want to consider a broader topic, and that,
in the realm of that broader topic, something like provenance
could have a role to play. I think it is worth responding to
that thesis, and am expecting to do so in a separate reply (or
new thread?) although probably not right away.
I would love to hear your comments, because some people
want to have such an abstract notion of "indeterminate" and
some already believe that this is how the standard should
be understood today.
I've been thinking about this, and am close (I think) to having
something to say in response. Before I do that, thought, let me
ask this: what problem or problems are motivating the question?
What problems do you (or "some people") want to solve? I don't
want just examples here; I'm hoping to get a full list.
Kaz Kylheku
2023-08-17 07:08:45 UTC
Post by Tim Rentsch
[some unrelated passages removed]
[...]
Post by Martin Uecker
Post by Tim Rentsch
Post by Martin Uecker
One could still consider the idea that "indeterminate" is an
abstract property that yields UB during read even for types
that do not have trap representations. There is no wording
in the C standard to support this, but I would not call this
idea "fundamentally wrong". You are right that this is different
to pointer provenance, which is about values. What it would
have in common with pointer provenance is that there is hidden
state in the abstract machine associated with memory that
is not part of the representation. With effective types there
is another example of this.
I understand that you want to consider a broader topic, and that,
in the realm of that broader topic, something like provenance
could have a role to play. I think it is worth responding to
that thesis, and am expecting to do so in a separate reply (or
new thread?) although probably not right away.
I would love to hear your comments, because some people
want to have such an abstract notion of "indeterminate" and
some already believe that this is how the standard should
be understood today.
I've been thinking about this, and am close (I think) to having
something to say in response. Before I do that, though, let me
ask this: what problem or problems are motivating the question?
What problems do you (or "some people") want to solve? I don't
want just examples here; I'm hoping to get a full list.
I'm all about the diagnosis. Even on machines in which all
representations are values, and therefore safe, a program whose external
effect or output depends on uninitialized data, and is therefore
nondeterministic (a bad form of nondeterminism), is a repugnant
program.

I'd like to have clear rules which allow an implementation to
go to great depths to diagnose all such situations, while
remaining conforming. (The language agrees that those situations
are erroneous, granting the tools license to diagnose.)

At the same time, certain situations in which uninitialized data are
used in ways that don't have a visible effect would be a nuisance if they
generated diagnostics, the primary example being the copying of objects.
I would like it so that memcpy isn't magic. I want it so that the
programmer can write a bytewise memcpy which doesn't violate the
rules even if it moves uninitialized data.
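The kind of copy in question, sketched (an illustration of the
guarantee being asked for, not normative text):

```
#include <stddef.h>

/* A bytewise copy: accesses through unsigned char are exempt from the
   non-value-representation rule, so moving uninitialized bytes this way
   should not itself be undefined behavior. */
void *byte_copy(void *dst, const void *src, size_t n) {
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n--)
        *d++ = *s++;
    return dst;
}
```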

I would like a model of uninitialized data which usefully lends itself
to different depths with different trade-offs, like complexity of
analysis and use of run-time resources. Limits should be imposed by
implementations (what cases they want to diagnose) rather than by the
model.
--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @***@mstdn.ca
Martin Uecker
2023-08-18 19:44:11 UTC
Post by Kaz Kylheku
Post by Tim Rentsch
[some unrelated passages removed]
[...]
Post by Martin Uecker
Post by Tim Rentsch
Post by Martin Uecker
One could still consider the idea that "indeterminate" is an
abstract property that yields UB during read even for types
that do not have trap representations. There is no wording
in the C standard to support this, but I would not call this
idea "fundamentally wrong". You are right that this is different
to pointer provenance, which is about values. What it would
have in common with pointer provenance is that there is hidden
state in the abstract machine associated with memory that
is not part of the representation. With effective types there
is another example of this.
I understand that you want to consider a broader topic, and that,
in the realm of that broader topic, something like provenance
could have a role to play. I think it is worth responding to
that thesis, and am expecting to do so in a separate reply (or
new thread?) although probably not right away.
I would love to hear your comments, because some people
want to have such an abstract notion of "indeterminate" and
some already believe that this is how the standard should
be understood today.
I've been thinking about this, and am close (I think) to having
something to say in response. Before I do that, though, let me
ask this: what problem or problems are motivating the question?
What problems do you (or "some people") want to solve? I don't
want just examples here; I'm hoping to get a full list.
I'm all about the diagnosis. Even on machines in which all
representations are values, and therefore safe,
I do not agree with the idea that "absence of UB = safe".
Post by Kaz Kylheku
a program whose external
effect or output depends on uninitialized data, and is therefore
nondeterministic (a bad form of nondeterminism), is a repugnant
program.
I would expect a debugger to output the memory as it is seen
by the CPU. But yes, it would not be a strictly conforming program.
Post by Kaz Kylheku
I'd like to have clear rules which allow an implementation to
go to great depths to diagnose all such situations, while
remaining conforming. (The language agrees that those situations
are erroneous, granting the tools license to diagnose.)
An implementation does not need a license from the standard
to diagnose anything. I can already diagnose whatever seems
useful and this does not affect conformance at all.

But it becomes easier to usefully diagnose behavior which is
undefined, because then one can expect that in portable C it
is not used intentionally.
Post by Kaz Kylheku
At the same time, certain situations in which uninitialized data are
used in ways that don't have a visible effect would be a nuisance if they
generated diagnostics, the primary example being the copying of objects.
I would like it so that memcpy isn't magic. I want it so that the
programmer can write a bytewise memcpy which doesn't violate the
rules even if it moves uninitialized data.
Yes, I think for C this is rather important.
Post by Kaz Kylheku
I would like a model of uninitialized data which usefully lends itself
to different depths with different trade-offs, like complexity of
analysis and use of run-time resources. Limits should be imposed by
implementations (what cases they want to diagnose) rather than by the
model.
Tools can already do complex analysis and track down use of
uninitialized variables. But with respect to conformance, I think
the current standard has very good rules: memcpy/memcmp
and similar code works as expected. Locally, where a compiler
can be expected to give good diagnostics via static analysis
the use of uninitialized variables is UB. But this does not
spread via pointers elsewhere, where useful diagnostics
are unlikely and optimizer induced problems based on UB
might be far more difficult to debug.

Martin
Kaz Kylheku
2023-08-19 05:04:06 UTC
Post by Martin Uecker
An implementation does not need a license from the standard
to diagnose anything. I can already diagnose whatever seems
useful and this does not affect conformance at all.
That's true about diagnostics at translation time. It's not clear
about those that happen at run time and are indistinguishable from the
program's output on stdout or stderr.

Also, it might be desirable for it to be conforming to terminate the
program if it has run afoul of the rules.
Post by Martin Uecker
Post by Kaz Kylheku
I would like a model of uninitialized data which usefully lends itself
to different depths with different trade-offs, like complexity of
analysis and use of run-time resources. Limits should be imposed by
implementations (what cases they want to diagnose) rather than by the
model.
Tools can already do complex analysis and track down use of
uninitialized variables. But with respect to conformance, I think
the current standard has very good rules: memcpy/memcmp
and similar code works as expected. Locally, where a compiler
can be expected to give good diagnostics via static analysis
the use of uninitialized variables is UB. But this does not
spread via pointers elsewhere, where useful diagnostics
are unlikely and optimizer induced problems based on UB
might be far more difficult to debug.
Dynamic instrumentation and tracking makes it possible
for that information to follow pointer data flows, globally
in the program.

E.g. under the Valgrind tool, if one module passes an uninitialized
object into another, and that other one relies on it to make
a conditional branch, it will be diagnosed. You can get the
backtrace of where that object was created as well as where
the use took place.
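A tiny example of the kind of case memcheck flags (my sketch; running
under `valgrind --track-origins=yes` also reports where the
uninitialized memory was allocated):

```
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    int *p = malloc(sizeof *p);   /* *p starts out uninitialized */
    if (p && *p > 0)              /* memcheck: conditional jump depends
                                     on uninitialised value(s) */
        puts("positive");
    free(p);
    return 0;
}
```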
--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @***@mstdn.ca
Martin Uecker
2023-08-19 08:36:23 UTC
Post by Kaz Kylheku
Post by Martin Uecker
An implementation does not need a license from the standard
to diagnose anything. I can already diagnose whatever seems
useful and this does not affect conformance at all.
That's true about diagnostics at translation time. It's not clear
about those that happen at run time and are indistinguishable from the
program's output on stdout or stderr.
The observable behavior has to stay the same, so yes, it could
not output to stdout or stderr. But there is nothing stopping it
from logging debugging information somewhere else, where it could
be accessed.
Post by Kaz Kylheku
Also, it might be desirable for it to be conforming to terminate the
program if it has run afoul of the rules.
Yes, this is one main reason to make certain things UB. But
then it can have false positives and needs to be backward
compatible, which limits what is possible.
Post by Kaz Kylheku
Post by Martin Uecker
Post by Kaz Kylheku
I would like a model of uninitialized data which usefully lends itself
to different depths with different trade-offs, like complexity of
analysis and use of run-time resources. Limits should be imposed by
implementations (what cases they want to diagnose) rather than by the
model.
Tools can already do complex analysis and track down use of
uninitialized variables. But with respect to conformance, I think
the current standard has very good rules: memcpy/memcmp
and similar code works as expected. Locally, where a compiler
can be expected to give good diagnostics via static analysis
the use of uninitialized variables is UB. But this does not
spread via pointers elsewhere, where useful diagnostics
are unlikely and optimizer induced problems based on UB
might be far more difficult to debug.
Dynamic instrumentation and tracking makes it possible
for that information to follow pointer data flows, globally
in the program.
E.g. under the Valgrind tool, if one module passes an uninitialized
object into another, and that other one relies on it to make
a conditional branch, it will be diagnosed. You can get the
backtrace of where that object was created as well as where
the use took place.
And valgrind exists and is a useful tool (I use it myself)
even though not everything it diagnoses is UB. But it also has
false positives, so using the same rules for deciding what
should be UB in the standard as valgrind uses seems difficult.

Also note that if the output of a program relies on
unspecified values, then it is already not strictly conforming
even when the behavior itself is not undefined. So if an
implementation is smart enough to see this, it could already
reject the program.

Making the mere use of unspecified values in conditional
branches UB seems problematic. E.g. you could not
compute a hash over data structures with padding and
then compare it later to see whether something has
changed (taking into account false positives). This seems
similar to the memcpy / memcmp case but involves conditions,
and such techniques would become non-conforming.
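A sketch of the technique described above (the struct layout and hash
are arbitrary choices, for illustration only):

```
#include <stdint.h>
#include <string.h>

struct rec { char tag; int value; };   /* likely has padding after tag */

/* FNV-1a over the object representation, padding bytes included.
   Reading the bytes through unsigned char is fine under current rules. */
static uint32_t hash_bytes(const void *p, size_t n) {
    const unsigned char *b = p;
    uint32_t h = 2166136261u;
    while (n--) { h ^= *b++; h *= 16777619u; }
    return h;
}

/* The branch below depends on unspecified padding values, so a caller
   must tolerate false positives; making such a use UB outright would
   turn this technique into non-conforming code. */
int rec_changed(const struct rec *r, uint32_t old_hash) {
    return hash_bytes(r, sizeof *r) != old_hash;
}
```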

Martin
Richard Damon
2023-08-19 13:18:17 UTC
Post by Martin Uecker
Post by Kaz Kylheku
Post by Martin Uecker
An implementation does not need a license from the standard
to diagnose anything. I can already diagnose whatever seems
useful and this does not affect conformance at all.
That's true about diagnostics at translation time. It's not clear
about those that happen at run time and are indistinguishable from the
program's output on stdout or stderr.
The observable behavior has to stay the same, so yes, it could
not output to stdout or stderr. But there is nothing stopping it
from logging debugging information somewhere else, where it could
be accessed.
Post by Kaz Kylheku
Also, it might be desirable for it to be conforming to terminate the
program if it has run afoul of the rules.
Yes, this is one main reason to make certain things UB. But
then it can have false positives and needs to be backward
compatible, which limits what is possible.
Post by Kaz Kylheku
Post by Martin Uecker
Post by Kaz Kylheku
I would like a model of uninitialized data which usefully lends itself
to different depths with different trade-offs, like complexity of
analysis and use of run-time resources. Limits should be imposed by
implementations (what cases they want to diagnose) rather than by the
model.
Tools can already do complex analysis and track down use of
uninitialized variables. But with respect to conformance, I think
the current standard has very good rules: memcpy/memcmp
and similar code works as expected. Locally, where a compiler
can be expected to give good diagnostics via static analysis
the use of uninitialized variables is UB. But this does not
spread via pointers elsewhere, where useful diagnostics
are unlikely and optimizer induced problems based on UB
might be far more difficult to debug.
Dynamic instrumentation and tracking makes it possible
for that information to follow pointer data flows, globally
in the program.
E.g. under the Valgrind tool, if one module passes an uninitialized
object into another, and that other one relies on it to make
a conditional branch, it will be diagnosed. You can get the
backtrace of where that object was created as well as where
the use took place.
And valgrind exists and is a useful tool (I use it myself)
even though not everything it diagnoses is UB. But it also has
false positives, so using the same rules for deciding what
should be UB in the standard as valgrind uses seems difficult.
Also note that if the output of a program relies on
unspecified values, then it is already not strictly conforming
even when the behavior itself is not undefined. So if an
implementation is smart enough to see this, it could already
reject the program.
Making the mere use of unspecified values in conditional
branches UB seems problematic. E.g. you could not
compute a hash over data structures with padding and
then compare it later to see whether something has
changed (taking into account false positives). This seems
similar to the memcpy / memcmp case but involves conditions,
and such techniques would become non-conforming.
Martin
My understanding is that there is no requirement that the values of the
padding bytes remain constant over time. I can't imagine a case where
they will just change at an arbitrary time, but setting a member of the
structure to a value (even if it is the same value it had) might easily
affect the value of the padding bytes, so the hash changes.
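A common partial workaround (my sketch, not from the post) is to give
the padding a known value up front, though as noted any later member
store may leave it unspecified again:

```
#include <string.h>

struct rec { char tag; int value; };

void init_rec(struct rec *r, char tag, int value) {
    memset(r, 0, sizeof *r);  /* padding bytes now have a known value... */
    r->tag = tag;
    r->value = value;         /* ...but each store to a member permits the
                                 padding bytes to take unspecified values */
}
```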
Martin Uecker
2023-08-19 18:12:53 UTC
Post by Richard Damon
Post by Martin Uecker
Post by Kaz Kylheku
Post by Martin Uecker
An implementation does not need a license from the standard
to diagnose anything. I can already diagnose whatever seems
useful and this does not affect conformance at all.
That's true about diagnostics at translation time. It's not clear
about those that happen at run time and are indistinguishable from the
program's output on stdout or stderr.
The observable behavior has to stay the same, so yes, it could
not output to stdout or stderr. But there is nothing stopping it
from logging debugging information somewhere else, where it could
be accessed.
Post by Kaz Kylheku
Also, it might be desirable for it to be conforming to terminate the
program if it has run afoul of the rules.
Yes, this is one main reason to make certain things UB. But
then it can have false positives and needs to be backward
compatible, which limits what is possible.
Post by Kaz Kylheku
Post by Martin Uecker
Post by Kaz Kylheku
I would like a model of uninitialized data which usefully lends itself
to different depths with different trade-offs, like complexity of
analysis and use of run-time resources. Limits should be imposed by
implementations (what cases they want to diagnose) rather than by the
model.
Tools can already do complex analysis and track down use of
uninitialized variables. But with respect to conformance, I think
the current standard has very good rules: memcpy/memcmp
and similar code works as expected. Locally, where a compiler
can be expected to give good diagnostics via static analysis
the use of uninitialized variables is UB. But this does not
spread via pointers elsewhere, where useful diagnostics
are unlikely and optimizer induced problems based on UB
might be far more difficult to debug.
Dynamic instrumentation and tracking makes it possible
for that information to follow pointer data flows, globally
in the program.
E.g. under the Valgrind tool, if one module passes an uninitialized
object into another, and that other one relies on it to make
a conditional branch, it will be diagnosed. You can get the
backtrace of where that object was created as well as where
the use took place.
And valgrind exists and is a useful tool (I use it myself)
even though not everything it diagnoses is UB. But it also has
false positives, so using the same rules for deciding what
should be UB in the standard as valgrind uses seems difficult.
Also note that if the output of a program relies on
unspecified values, then it is already not strictly conforming
even when the behavior itself is not undefined. So if an
implementation is smart enough to see this, it could already
reject the program.
Making the mere use of unspecified values in conditional
branches UB seems problematic. E.g. you could not
compute a hash over data structures with padding and
then compare it later to see whether something has
changed (taking into account false positives). This seems
similar to the memcpy / memcmp case but involves conditions,
and such techniques would become non-conforming.
Martin
My understanding is that there is no requirement that the values of the
padding bytes remain constant over time.
The C standard specifies when they can change:

"When a value is stored in an object of structure or union type,
including in a member object, the bytes of the object representation
that correspond to any padding bytes take unspecified values"
Post by Richard Damon
I can't imagine a case where
they will just change at an arbitrary time, but setting a member of the
structure to a value (even if it is the same value it had) might easily
affect the value of the padding bytes, so the hash changes.
Sure, writing to an object may change the padding and then the
hash changes. This is why I mentioned false positives.

Martin
Tim Rentsch
2023-08-19 03:20:05 UTC
I'm all about the diagnosis. Even on machines in which all
representations are values, and therefore safe, a program whose
external effect or output depends on uninitialized data, and is
therefore nondeterministic (a bad form of nondeterminism), is a
repugnant program.
I'd like to have clear rules which allow an implementation to
go to great depths to diagnose all such situations, while remaining
conforming. (The language agrees that those situations are
erroneous, granting the tools license to diagnose.)
The C standard allows compilers to do whatever analysis they
want and to issue diagnostics for whatever conditions or
circumstances they choose. What you want is orthogonal to
what is being discussed.
Kaz Kylheku
2023-08-19 05:23:29 UTC
Post by Tim Rentsch
I'm all about the diagnosis. Even on machines in which all
representations are values, and therefore safe, a program whose
external effect or output depends on uninitialized data, and is
therefore nondeterministic (a bad form of nondeterminism), is a
repugnant program.
I'd like to have clear rules which allow an implementation to
go to great depths to diagnose all such situations, while remaining
conforming. (The language agrees that those situations are
erroneous, granting the tools license to diagnose.)
The C standard allows compilers to do whatever analysis they
want and to issue diagnostics for whatever conditions or
circumstances they choose.
And stop translating? If some use of an uninitialized object
isn't undefined, and you make the diagnostic a fatal error,
then you don't have a conforming compiler at that point.
Post by Tim Rentsch
What you want is orthogonal to what is being discussed.
I'm mainly concerned about run-time.

If the program hasn't invoked undefined behavior, I don't think it's
conforming to inject gratuitous diagnostics into the program's run-time,
such that they appear as if they were its output on stderr or stdout.
Those diagnostics have to go to some special debug port.

Also, it's not conforming to arbitrarily terminate the program. (Other
than in some weaselly language-lawyering way, by declaring that it
has exceeded an implementation limit or something.)
--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @***@mstdn.ca
Tim Rentsch
2023-08-19 05:56:34 UTC
[...]
Post by Tim Rentsch
The C standard allows compilers to do whatever analysis they
want and to issue diagnostics for whatever conditions or
circumstances they choose.
And stop translating? If some use of an uninitialized object
isn't undefined, and you make the diagnostic a fatal error,
then you don't have a conforming compiler at that point.
[also]
If the program hasn't invoked undefined behavior, I don't think
it's conforming to inject gratuitous diagnostics [..or..]
to arbitrarily terminate the program. [...]
You need to learn how to say what you mean. Your earlier
posting didn't say anything about failing to compile
or altering program behavior. If you can't learn how
to say what you mean then there is roughly a 1e-29 percent
chance that you'll get what you want.
Martin Uecker
2023-08-18 19:52:42 UTC
Post by Tim Rentsch
[some unrelated passages removed]
[...]
Post by Martin Uecker
Post by Tim Rentsch
Post by Martin Uecker
One could still consider the idea that "indeterminate" is an
abstract property that yields UB during read even for types
that do not have trap representations. There is no wording
in the C standard to support this, but I would not call this
idea "fundamentally wrong". You are right that this is different
to pointer provenance, which is about values. What it would
have in common with pointer provenance is that there is hidden
state in the abstract machine associated with memory that
is not part of the representation. With effective types there
is another example of this.
I understand that you want to consider a broader topic, and that,
in the realm of that broader topic, something like provenance
could have a role to play. I think it is worth responding to
that thesis, and am expecting to do so in a separate reply (or
new thread?) although probably not right away.
I would love to hear your comments, because some people
want to have such an abstract notion of "indeterminate" and
some already believe that this is how the standard should
be understood today.
I've been thinking about this, and am close (I think) to having
something to say in response. Before I do that, though, let me
ask this: what problem or problems are motivating the question?
What problems do you (or "some people") want to solve? I don't
want just examples here; I'm hoping to get a full list.
There are essentially two main interests driving this. First, there
is some interest in precisely formulating the semantics of C.
The provenance proposal came out of this.

Second, there is the issue of safety problems caused by
uninitialized reads, together with compiler support for zero
initialization etc. So there are various people who want to
change the semantics for uninitialized variables completely
in the interest of safety.

So far, there has been no consensus in WG14 that the rules should
be changed or what the new rules should be.

Martin
Tim Rentsch
2023-08-27 02:25:55 UTC
Post by Tim Rentsch
[some unrelated passages removed]
[...]
Post by Martin Uecker
Post by Tim Rentsch
Post by Martin Uecker
One could still consider the idea that "indeterminate" is an
abstract property that yields UB during read even for types
that do not have trap representations. There is no wording
in the C standard to support this, but I would not call this
idea "fundamentally wrong". You are right that this is different
to pointer provenance, which is about values. What it would
have in common with pointer provenance is that there is hidden
state in the abstract machine associated with memory that
is not part of the representation. With effective types there
is another example of this.
I understand that you want to consider a broader topic, and that,
in the realm of that broader topic, something like provenance
could have a role to play. I think it is worth responding to
that thesis, and am expecting to do so in a separate reply (or
new thread?) although probably not right away.
I would love to hear your comments, because some people
want to have such an abstract notion of "indeterminate" and
some already believe that this is how the standard should
be understood today.
I've been thinking about this, and am close (I think) to having
something to say in response. Before I do that, though, let me
ask this: what problem or problems are motivating the question?
What problems do you (or "some people") want to solve? I don't
want just examples here; I'm hoping to get a full list.
There are essentially two main interests driving this. First,
there is some interest in precisely formulating the semantics of C.
The provenance proposal came out of this.
Second, there is the issue of safety problems caused by
uninitialized reads, together with compiler support for zero
initialization etc. So there are various people who want to
change the semantics for uninitialized variables completely
in the interest of safety.
This response doesn't answer my question. What are the problems,
specifically, that people want to solve? If there isn't a good
understanding of what the problem is, there is little hope of
finding a solution, let alone reaching agreement on whether a
proposed change does in fact solve the problem. If we don't know
where we're going, any choice of road is equally good.

That said, I understand that you are asking not on your own behalf
but on behalf (perhaps indirectly) of others, and the others might
not know what the problem(s) are that they want to solve. I think
it's worth asking the question explicitly, What is the problem
that we want to solve here? Start by simply trying to write a
clear statement of what the problem is; proceed on to looking for
a solution only after there is agreement (and I don't mean just a
majority vote) about what problem it is the group wants to solve.

(Note added after writing: I didn't realize when I started how
difficult this subject is and how much there is to say about it.
I hope readers will appreciate the amount of effort that has
been invested, and get some value out of what has been produced,
even if it spends too much time on some less important issues.)

(Also, after having written the whole posting, I see that there
are some aspects that I didn't relate to the indeterminate
question and so didn't address. If you want me to say more about
formalizing semantics or the issue of safety for uninitialized
variables, I really need some specifics before I can talk about
those.)

(One further thought: on reading through my comments one last
time, I may have more to say about uninitialized variables. But
I am deferring that for now, to get this beast out the door.)
So far, there was no consensus in WG14 that the rules should
be changed or what the new rules should be.
That's because they don't know what problem it is that they want
to solve.

Consider the question of what happens with padding bits/bytes,
and unnamed members, in structs (unions too of course, but for
now we consider only structs). The C standard says these bits of
memory take unspecified values whenever there is a store to any
member of the struct (and maybe also at other times, but let's
ignore that). I understand why this decision was made, namely,
to give more freedom to implementations as to how such operations
are actualized. But it leaves behind a problem. Speaking as a
developer, I want the values of these bits to be stable, at least
in certain cases (and I want to be able to choose which cases
those are). The C language doesn't give me any way to do that,
at least not one that isn't horribly inconvenient. In making the
decision about padding bits/bytes, the C committee answered the
/question/ but didn't address the /problem/. I expect that
something similar is going on with the current discussions.
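
To make the problem concrete, here is a minimal sketch; the struct
layout described in the comments is typical rather than guaranteed:

```
#include <string.h>

struct pair { char c; int i; };  /* commonly 3 padding bytes after 'c' */

/* Comparing whole object representations includes the padding bytes,
   whose values are unspecified and may change whenever a member is
   stored, so this can report two "equal" structs as different. */
int same(const struct pair *a, const struct pair *b)
{
    return memcmp(a, b, sizeof *a) == 0;  /* unreliable as written */
}

/* The inconvenient workaround: zero the whole object first so the
   padding starts out with a known value. */
void make_pair(struct pair *p, char c, int i)
{
    memset(p, 0, sizeof *p);
    p->c = c;   /* in principle, any later member store may make the
                   padding unspecified again */
    p->i = i;
}
```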

To better understand the landscape, let's look at three different
kinds of undefined behavior. The illustrating constructions are
signed integer arithmetic, obsolete pointer values, and violating
effective type rules.

Situations where arithmetic on signed integers overflows might be
called /practical/ undefined behavior. Certainly it would be
possible to require a better-defined semantics (such as giving an
unspecified result), but presumably overflow doesn't come up very
often, it's not clear how useful the "better" result would be,
and the cost in some hardware environments might be prohibitive.
Furthermore there is a fairly easy workaround to avoid overflow:
simply convert to unsigned types, do the operations, and then
convert back. Overflow being undefined behavior isn't absolutely
necessary but in practical terms it's acceptable. (I acknowledge
that some people have different views on that last statement.)
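
A minimal sketch of that workaround; the conversion back to int of an
out-of-range value is implementation-defined rather than undefined,
which is the point of the detour:

```
/* Wrapping addition without signed-overflow UB: do the arithmetic in
   unsigned, where wraparound is fully defined, then convert back.
   The final conversion is implementation-defined for out-of-range
   values, and on the usual two's complement implementations it
   simply wraps. */
int add_wrap(int a, int b)
{
    return (int)((unsigned)a + (unsigned)b);
}
```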

An obsolete pointer value is a pointer to an object after the end
of the object's lifetime. Attempting to make use of an obsolete
pointer value, in any way whatsoever including simply loading it
by means of lvalue conversion, is undefined behavior. We can
imagine narrowing the scope a bit so simply loading an obsolete
pointer value or comparing one for equality could be better
defined, but any attempt to dereference an obsolete pointer value
is what might be called an /essential/ undefined behavior. The
problem here is both practical and theoretical: there is no way
to be sure the underlying hardware will be able to carry out the
asked-for operation (without a machine check, etc), and even if
there were, there is no way to describe what happens in a way
that can be expressed (usefully) in terms that relate to what's
going on in the abstract machine. There simply is no practical,
useful, sensible way to define the behavior of dereferencing an
obsolete pointer value.
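
A minimal sketch; note that it is the lvalue conversion in the printf
call, not any dereference, that is already the undefined use:

```
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int *p = malloc(sizeof *p);
    if (!p) return 1;
    free(p);                     /* lifetime of the object ends here */
    printf("%p\n", (void *)p);   /* merely loading the now-obsolete
                                    pointer value is undefined */
    return 0;
}
```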

At the other end of the spectrum, violating effective type rules is
what might be called /gratuitous/ undefined behavior. There is no
particular hardware motivation for choosing UB. And there is no
problem defining the semantics of a cross-type access, which can be
done definedly in the same way as accessing union members. So there
is no reason to think that adding cross-type restrictions is
necessary. An argument can be made that cross-type restrictions
are /desirable/, because they allow code transformations that
improve performance in some cases.
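
A sketch of the kind of access at issue, assuming sizeof(float) ==
sizeof(uint32_t):

```
#include <stdint.h>
#include <stdlib.h>

/* The store gives the allocated object an effective type of float;
   the uint32_t read then violates the effective type rules and is
   undefined -- not because any hardware requires it, but because the
   rules say so.  Nothing prevents defining this access the same way
   a union member access is defined. */
uint32_t bits_of(float x)
{
    float *fp = malloc(sizeof *fp);
    if (!fp) abort();
    *fp = x;
    uint32_t u = *(uint32_t *)fp;   /* UB under the current rules */
    free(fp);
    return u;
}
```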

Incidentally, it might seem like effective type rules are similar in
some way to NaT bits or pointer provenance. They aren't. NaT bits
are hardware indicators that actually exist, and pointer provenances
are attached to values, not to objects. Neither of those conditions
holds for effective types. The seeming similarity to hidden memory
bits is a red herring.

(Also, effective type rules are a lot more complicated than they
seem at first blush, and have some peculiar properties as a result.
They seem to work okay if not looked at too closely, but a closer
look shows some serious shortcomings. But I digress.)

There are two significant problems with undefined behavior. The
smaller of the two is that there are no distinctions between the
different classes of undefined behavior. There is no way around
having some sort of undefined behavior for obsolete pointer values,
but cross-typing rules are a completely different story. Yet the C
standard puts all the different kinds of undefined behaviors into
the same absolute category. Sometimes people use compiler options
to turn off, for example, so-called "strict aliasing", and of course
the C standard allows us to do that. But compilers aren't required
to provide such an option, and if they do the option may not do
exactly what we expect it to do, because there is no standard
specification for it. The C standard should define officially
sanctioned mechanisms -- as for example standard #pragma's -- to
give standard-defined semantics to certain constructs of undefined
behavior that resemble, eg, -fno-strict-aliasing.

(Let me add in passing that this should be done for some cases of
unspecified behavior as well. To give one example, the C standard
should provide a way to direct a C compiler to maintain the values
of padding bits and bytes and unnamed members, taking away the
freedom for such things to assume unspecified values.)

The second problem is basically The Law of Unintended Consequences
smashing into The Law of Least Astonishment. As compiler writers
have gotten more and more clever at exploiting the implications of
"undefined behavior", we see more and more cases of code that looks
reasonable being turned into mush by overly clever "optimizing"
compilers. There is obviously something wrong with the way this
trend is going -- ever more clever "optimizations", followed by ever
more arcane compiler options to work around the problems caused by
the too-clever compilers. This problem must be addressed by the C
standard, for if it is not the ecosystem will transform into a
confused state that is exactly what the C standard was put in place
to avoid. (I do have some ideas about how to address this issue,
but I want to make sure everyone appreciates the extent of the
problem before we start talking about solutions.)

Before leaving the sub-topic of undefined behavior, let me mention
two success stories. The first is 'restrict': the performance
implications are local, the choice is under control of the program
(and programmer), and the default choice is to play safe. Good
show. The second is the improved sequencing rules introduced in
C11. A thorny problem, and one that C11 handled very deftly. These
parts of the C language and C standard should be held up as examples
when considering how to go forward on other problems.

And now on to the question of "indeterminate". Following that, a
somewhat philosophical perspective concerning the nature of the C
standard and the people who work on it.

First an observation. The idea of "indeterminate values" is
actually two ideas in one: non-valid abstract /values/ (like
obsolete pointers), and "uninitialized" /objects/ (in quotes
because in some circumstances objects can become "uninitialized"
even after they have been stored into.) The word "indeterminate"
isn't really right for either of these ideas. I understand why it
was used in the first C standard, and in that context it seems
okay, but going forward a better word (or words) should be found.
I will keep using it here but please don't get overly attached to
the word, lest it confuse the discussion.

My very strong sense is that some general notion of indeterminate
values (or objects) is a solution in search of a problem. Let's
look at some different kinds of undefined behavior, while also
considering the lens of "indeterminate values (or objects)".

One: signed integer overflow. Could this situation somehow produce
an "indeterminate value" that could be stored so it could wreak
havoc later? Two problems: no sensible developer is going to want
the bad behavior deferred rather than happening right away, and
besides anything an "indeterminate value" could do can already be
done by virtue of the generating condition being undefined itself.

Two: obsolete pointers. These values are not indeterminate. They
start off as valid, become obsolete when their pointed-to object
ends its lifetime, and are always obsolete thereafter. It isn't
hard to make a formal model for "obsoleteness" (ignoring problems
such as converting pointers to and from integers, and other C-isms).
Of course the formal model doesn't map nicely onto real computer
hardware, because pointers would have far too many bits (and maybe
other problems as well, but let's ignore that). So we pretend the
extra bits are there, even though they aren't, with a strange
consequence that two pointer objects can have the same object
representation but still be different in that one is obsolete and
the other isn't. Also a pointer can start off with a non-valid
value, meaning "not null and points to no object". Here again the
badness remains until a valid pointer value is put into the object;
a pointer object with a non-valid value doesn't ever magically
become valid without having been assigned or stored into. (Note
that the same formal model for obsolete pointers can accommodate
non-valid pointers, which are simply obsolete at the start.)

Three: effective type rules. Broken. One of the weakest areas of
the C standard. This framework may have started off as not a bad
idea in C90, but looking at it now it's clear that we've gotten
ahead of our skis, sorely in need of a top-to-bottom reformulation,
similar at least in spirit to what was done with sequencing rules
in C11. Also there should be a standard-defined way of allowing
cross-type interference, with defined behavior, like what was
explained above. I expect a well-done reformulation of cross-type
(non-)interference rules would have no notion of assigning "magic
state" to objects, and so have no need of any idea of "indeterminate
objects (or values)".

Four: uninitialized objects. Here we have a question: Why? What
problem are we hoping to solve? Presumably the point of having
uninitialized objects be "indeterminate" is so that reading them
is undefined behavior. Let's explore that.

I realize of course that any object having a trap representation
(called a non-value representation in the C23 draft) causes
undefined behavior if read using a type in which the object
representation corresponds to a trap representation. Obviously
there is good reason to say trying to read a trap representation
is undefined behavior. Some types, notably unsigned char, don't
have any trap representations. Should reading an uninitialized
object using such a type be undefined behavior? Speaking as a
developer, I don't see any benefit. An implementation would have
to go out of its way to do anything other than deliver a valid
unspecified value; if there is to be undefined behavior, it is
/contrived/ undefined behavior. Consider:

A: such UB could allow trapping on any use of an uninitialized
object. But UB does not guarantee that, and if someone wants
it there are tools like valgrind to get it (and without any
special language support needed to do so).

B: such UB could allow "optimizations" by clever compiler writers.
The result would be more unexpected code scramblings and more
arcane compiler options to disable them. A better way to provide
such imagined benefits is by adding one or more new language
constructs, along lines similar to the 'restrict' qualifier, to
selectively enable such performance changes.

C: future hardware developments might need or take advantage of
such UB. If and when such things happen it's better to add
specific wording to reflect the new hardware behaviors. The last
sentence of 6.3.2.1 p2, added in C11, provides an excellent example
of how to accommodate such new hardware developments.
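
To make the scenario concrete, a minimal sketch of the read in
question:

```
/* unsigned char has no trap (non-value) representations, so the only
   natural result of this read is an unspecified value; an
   implementation would have to go out of its way to do anything
   else.  Under a J.2(11)-style rule it is nonetheless undefined
   behavior. */
unsigned char first_byte(void)
{
    unsigned char c;   /* uninitialized, address never taken */
    return c;
}
```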

Indeterminate objects is a solution in search of a problem. To
make progress, first agree on a particular problem. Only after
that point should possible solutions be considered; I would be
surprised if some general notion of indeterminateness ever turned
out to be the solution of choice.

Now I would like to offer a perspective on how to view work that
is done in writing the C standard.

In some respects the ISO C committee resembles the US Supreme
Court. They consider issues, draw conclusions, and ultimately
issue "rulings" in the form of ISO-approved standards documents.
Like the Supreme Court, their decisions are final and cannot be
appealed.

However, the Supreme Court ultimately draws its authority from
how the public views its rulings. If the rulings get too far out
of line with what the general public believes, confidence in the
Court will decline and its opinions will carry less weight. (I
don't mean to make a political statement here - I am simply
repeating some analysis I have read recently regarding current
attitudes towards the Court.)

The same is true of the ISO C committee. They can make whatever
decisions they want, and those decisions will end up being what
goes into the C standard. At the same time, it's important - I
would say very important - to keep the confidence of people for
whom the C standard is regarded as an important document. If
that confidence is lost then the C standard will be on its way
to becoming irrelevant.

Unfortunately I have the sense that this trend has already
started. The most important constituency for the C language (and
so for the C standard) is developers. Many developers, but in
particular and very especially C developers, want stability. I
understand the desire to "improve" the language. Getting
agreement on a change has to mean more than a majority vote -- it
needs to be not just accepted but enthusiastically approved and
with overwhelming support. Too much of what is planned for C23
is coming from the implementation community without regard for
what is beneficial to the development community. I see the
reported desire for general "indeterminate"-ness as part of this
trend. It is my hope that those people who are part of the ISO C
committee reflect on this perspective and reconsider where the C
language should go for the next C standard.
Spiros Bousbouras
2023-08-27 08:31:26 UTC
Permalink
On Sat, 26 Aug 2023 19:25:55 -0700
Post by Tim Rentsch
Sometimes people use compiler options
to turn off, for example, so-called "strict aliasing", and of course
the C standard allows us to do that. But compilers aren't required
to provide such an option, and if they do the option may not do
exactly what we expect it to do, because there is no standard
specification for it. The C standard should define officially
sanctioned mechanisms -- as for example standard #pragma's -- to
give standard-defined semantics to certain constructs of undefined
behavior that resemble, eg, -fno-strict-aliasing.
Surely the starting point for this should be the documentation of the
compilers to specify precisely what -fno-strict-aliasing does. If
a consensus emerges out of these precise specifications or C programmers
indicate that they prefer the specification of some particular compiler
then this can become part of the standard. Adding a relevant #pragma
should be trivial.
Post by Tim Rentsch
The second problem is basically The Law of Unintended Consequences
smashing into The Law of Least Astonishment. As compiler writers
have gotten more and more clever at exploiting the implications of
"undefined behavior", we see more and more cases of code that looks
reasonable being turned into mush by overly clever "optimizing"
compilers. There is obviously something wrong with the way this
trend is going -- ever more clever "optimizations", followed by ever
more arcane compiler options to work around the problems caused by
the too-clever compilers. This problem must be addressed by the C
standard, for if it is not the ecosystem will transform into a
confused state that is exactly what the C standard was put in place
to avoid. (I do have some ideas about how to address this issue,
but I want to make sure everyone appreciates the extent of the
problem before we start talking about solutions.)
Without specific examples , it's impossible to comment on this. Why did
the "reasonable" code have the undefined behaviour ? Could the result
the programmer was aiming for have been achieved with defined behaviour
? For example it has been pointed out on comp.lang.c that it's
impossible to write a malloc() implementation in conforming C. This is
certainly a weakness which should be addressed with some appropriate
#pragma .
Post by Tim Rentsch
Before leaving the sub-topic of undefined behavior, let me mention
two success stories. The first is 'restrict': the performance
implications are local, the choice is under control of the program
(and programmer), and the default choice is to play safe. Good
show.
From my point of view , restrict is not a success because the
specification of restrict is the one part of the C1999 standard I have
given up trying to understand. I understand the underlying idea but the
specifics elude me. I remember many years ago someone asked on this
group about some code involving restrict; a member of the standard
committee replied, and I found the reply counterintuitive. So I have
decided not to use restrict in my own code, taking also into account
that I don't need the microoptimisations which restrict is intended to
allow. But for all I know , people who do need these optimisations find
the specification of restrict in the standard perfectly adequate.
--
It is not widely known that the "CPC" in "Amstrad CPC" actually stands
for "cool people club".
Tim Rentsch
2023-08-29 11:35:40 UTC
Permalink
Post by Spiros Bousbouras
On Sat, 26 Aug 2023 19:25:55 -0700
Sometimes people use compiler options to turn off, for example,
so-called "strict aliasing", and of course the C standard allows
us to do that. But compilers aren't required to provide such an
option, and if they do the option may not do exactly what we
expect it to do, because there is no standard specification for
it. The C standard should define officially sanctioned
mechanisms -- as for example standard #pragma's -- to give
standard-defined semantics to certain constructs of undefined
behavior that resemble, eg, -fno-strict-aliasing.
Surely the starting point for this should be the documentation of
the compilers to specify precisely what -fno-strict-aliasing does.
[...]
Not at all. It's easy to write a specification that says what we
want to do, along similar lines to what is said in the footnote
about union member access in section 6.5.2.3

If the member used to access the contents of a union object
is not the same as the member last used to store a value in
the object, the appropriate part of the object representation
of the value is reinterpreted as an object representation in
the new type as described in 6.2.6 (a process sometimes called
"type punning"). This might be a trap representation.

That behavior should be the default, for all accesses. For cases
where a developer wants to give permission to the compiler to
optimize based on cross-type non-interference assumptions, there
should be a #pragma to do something similar to what effective type
rules do now. The effective type rules are in need of re-writing
anyway, and making type punning be the default doesn't break any
programs, because compilers are already free to ignore the
implications of violating effective type conditions.
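
A minimal sketch of the access that footnote describes, assuming (as
on common platforms) 32-bit IEEE-754 floats:

```
#include <inttypes.h>
#include <stdio.h>

/* Store through one union member, read through another: the object
   representation is simply reinterpreted in the new type. */
union pun { float f; uint32_t u; };

int main(void)
{
    union pun p;
    p.f = 1.0f;
    printf("%08" PRIx32 "\n", p.u);   /* 3f800000 on IEEE-754 */
    return 0;
}
```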
Post by Spiros Bousbouras
The second problem is basically The Law of Unintended Consequences
smashing into The Law of Least Astonishment. As compiler writers
have gotten more and more clever at exploiting the implications of
"undefined behavior", we see more and more cases of code that looks
reasonable being turned into mush by overly clever "optimizing"
compilers. There is obviously something wrong with the way this
trend is going -- ever more clever "optimizations", followed by
ever more arcane compiler options to work around the problems
caused by the too-clever compilers. This problem must be addressed
by the C standard, for if it is not the ecosystem will transform
into a confused state that is exactly what the C standard was put
in place to avoid. (I do have some ideas about how to address this
issue, but I want to make sure everyone appreciates the extent of
the problem before we start talking about solutions.)
Without specific examples , it's impossible to comment on this.
[...]
I feel that so much has been written about this issue that it
isn't necessary for me to elaborate.
Post by Spiros Bousbouras
For example it has been pointed out on comp.lang.c that it's
impossible to write a malloc() implementation in conforming
C. This is certainly a weakness which should be addressed with
some appropriate #pragma .
There isn't any reason to think malloc() should be writable in
completely portable C. That's the point of putting malloc() in
the system library in the first place. By the way, with type
punning semantics mentioned above being the default, and with the
alignment features added in C11, I think it is possible to write
malloc() in portable C without needing any additional language
changes. But even if it isn't that is no cause for concern; one
of the principal reasons for having a system library is to
provide functionality that the core language cannot express (or
cannot express conveniently).
Post by Spiros Bousbouras
Before leaving the sub-topic of undefined behavior, let me mention
two success stories. The first is 'restrict': the performance
implications are local, the choice is under control of the program
(and programmer), and the default choice is to play safe. Good
show.
From my point of view , restrict is not a success because the
specification of restrict is the one part of the C1999 standard I
have given up trying to understand. I understand the underlying
idea but the specifics elude me. [...]
I agree the formal definition of restrict is rather daunting. In
practice though I think using restrict with confidence is not
overly difficult. My working model for restrict is something
like this:

1. Use restrict only in the declarations of function
parameters.

2. For a declaration like const T *restrict foo ,
the compiler may assume that any objects that can be
accessed through 'foo' will not be modified.

3. For a declaration like T *restrict bas ,
the compiler may assume that any changes to objects
that can be accessed through 'bas' will be done
using 'bas' or a pointer value derived from 'bas'
(and in particular that no changes will happen
other than through 'bas' or 'bas'-derived pointer
values).
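
A small illustration of the model in use; the function is made up
for the purpose:

```
/* Under rule 3 the compiler may assume the elements of dst[] are
   modified only through 'dst'; under rule 2 it may assume objects
   reachable through 'src' are not modified at all.  Together these
   let it reorder or vectorize the loop without worrying that the
   two arrays overlap. */
void scale(float *restrict dst, const float *restrict src,
           float k, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = k * src[i];
}
```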

Is this summary description helpful?
Spiros Bousbouras
2023-08-30 19:53:40 UTC
Permalink
On Tue, 29 Aug 2023 04:35:40 -0700
Post by Tim Rentsch
Post by Spiros Bousbouras
On Sat, 26 Aug 2023 19:25:55 -0700
Sometimes people use compiler options to turn off, for example,
so-called "strict aliasing", and of course the C standard allows
us to do that. But compilers aren't required to provide such an
option, and if they do the option may not do exactly what we
expect it to do, because there is no standard specification for
it. The C standard should define officially sanctioned
mechanisms -- as for example standard #pragma's -- to give
standard-defined semantics to certain constructs of undefined
behavior that resemble, eg, -fno-strict-aliasing.
Surely the starting point for this should be the documentation of
the compilers to specify precisely what -fno-strict-aliasing does.
[...]
Not at all. It's easy to write a specification that says what we
want to do, along similar lines to what is said in the footnote
about union member access in section 6.5.2.3
If the member used to access the contents of a union object
is not the same as the member last used to store a value in
the object, the appropriate part of the object representation
of the value is reinterpreted as an object representation in
the new type as described in 6.2.6 (a process sometimes called
"type punning"). This might be a trap representation.
Works for me but it would be good to know that this is how compiler
writers actually understand -fno-strict-aliasing . Is there any compiler
documentation which says something like this ?
Post by Tim Rentsch
That behavior should be the default, for all accesses. For cases
where a developer wants to give permission to the compiler to
optimize based on cross-type non-interference assumptions, there
should be a #pragma to do something similar to what effective type
rules do now. The effective type rules are in need of re-writing
anyway, and making type punning be the default doesn't break any
programs, because compilers are already free to ignore the
implications of violating effective type conditions.
[...]
Post by Tim Rentsch
Post by Spiros Bousbouras
For example it has been pointed out on comp.lang.c that it's
impossible to write a malloc() implementation in conforming
C. This is certainly a weakness which should be addressed with
some appropriate #pragma .
There isn't any reason to think malloc() should be writable in
completely portable C. That's the point of putting malloc() in
the system library in the first place. By the way, with type
punning semantics mentioned above being the default, and with the
alignment features added in C11, I think it is possible to write
malloc() in portable C without needing any additional language
changes. But even if it isn't that is no cause for concern; one
of the principal reasons for having a system library is to
provide functionality that the core language cannot express (or
cannot express conveniently).
One might want to experiment with different allocation algorithms
and it seems to me that this sort of thing is within the "remit" of
C. So ideally one should be able to write it in C and prove , starting
from the standard or precise specifications in compiler documentation ,
that it works correctly. I don't necessarily mean prove the correctness
of the whole code but certain key parts.

Another application I have in mind is languages which get translated
to C and support garbage collection. Again one might want to use the
standard malloc() to allocate a large block of memory and use different
parts of this memory for different types of objects.

If with the semantics you propose these things are possible , I'm happy.
I'm not bothered which is the default as long as there is a precise
specification from which you can reason that you get the desired behaviour.
Post by Tim Rentsch
Post by Spiros Bousbouras
Before leaving the sub-topic of undefined behavior, let me mention
two success stories. The first is 'restrict': the performance
implications are local, the choice is under control of the program
(and programmer), and the default choice is to play safe. Good
show.
From my point of view , restrict is not a success because the
specification of restrict is the one part of the C1999 standard I
have given up trying to understand. I understand the underlying
idea but the specifics elude me. [...]
I agree the formal definition of restrict is rather daunting. In
practice though I think using restrict with confidence is not
overly difficult. My working model for restrict is something like this:
1. Use restrict only in the declarations of function
parameters.
2. For a declaration like const T *restrict foo ,
the compiler may assume that any objects that can be
accessed through 'foo' will not be modified.
Wouldn't that also be the case with just const T * foo ?
Post by Tim Rentsch
3. For a declaration like T *restrict bas ,
the compiler may assume that any changes to objects
that can be accessed through 'bas' will be done
using 'bas' or a pointer value derived from 'bas'
(and in particular that no changes will happen
other than through 'bas' or 'bas'-derived pointer
values).
Is this summary description helpful?
It seems clear enough but , as I've said , I don't have any use for
restrict anyway and it's not worth it for me to expend the additional
mental effort to confirm that my code obeys the additional restrictions
of restrict . If I call a function with a preexisting interface which
involves restrict then it seems easy enough to obey the restrictions.
--
Carrie also narrates the film, providing useful guidelines for those
challenged by its intricacies. Sample: "Later that day, Big and I
arrived home."
http://www.rogerebert.com/reviews/sex-and-the-city-2-2010
Tim Rentsch
2023-08-31 00:40:52 UTC
Permalink
Post by Spiros Bousbouras
On Tue, 29 Aug 2023 04:35:40 -0700
Post by Tim Rentsch
Post by Spiros Bousbouras
On Sat, 26 Aug 2023 19:25:55 -0700
Sometimes people use compiler options to turn off, for example,
so-called "strict aliasing", and of course the C standard allows
us to do that. But compilers aren't required to provide such an
option, and if they do the option may not do exactly what we
expect it to do, because there is no standard specification for
it. The C standard should define officially sanctioned
mechanisms -- as for example standard #pragma's -- to give
standard-defined semantics to certain constructs of undefined
behavior that resemble, eg, -fno-strict-aliasing.
Surely the starting point for this should be the documentation of
the compilers to specify precisely what -fno-strict-aliasing does.
[...]
Not at all. It's easy to write a specification that says what we
want to do, along similar lines to what is said in the footnote
about union member access in section 6.5.2.3
If the member used to access the contents of a union object
is not the same as the member last used to store a value in
the object, the appropriate part of the object representation
of the value is reinterpreted as an object representation in
the new type as described in 6.2.6 (a process sometimes called
"type punning"). This might be a trap representation.
Works for me but it would be good to know that this is how compiler
writers actually understand -fno-strict-aliasing . [...]
No, it wouldn't. Implementations follow the C standard, not
the other way around. Looking at what implementations do for
the -fno-strict-aliasing flag is worse than a waste of time.
Post by Spiros Bousbouras
Post by Tim Rentsch
Post by Spiros Bousbouras
For example it has been pointed out on comp.lang.c that it's
impossible to write a malloc() implementation in conforming
C. This is certainly a weakness which should be addressed with
some appropriate #pragma .
There isn't any reason to think malloc() should be writable in
completely portable C. That's the point of putting malloc() in
the system library in the first place. By the way, with type
punning semantics mentioned above being the default, and with the
alignment features added in C11, I think it is possible to write
malloc() in portable C without needing any additional language
changes. But even if it isn't that is no cause for concern; one
of the principal reasons for having a system library is to
provide functionality that the core language cannot express (or
cannot express conveniently).
One might want to experiment with different allocation algorithms
and it seems to me that this sort of thing is within the "remit" of
C. So ideally one should be able to write it in C [...]
You're conflating writing something in C and writing something
in completely portable C. It's already possible to do these
things writing in C.
Post by Spiros Bousbouras
Post by Tim Rentsch
Post by Spiros Bousbouras
From my point of view , restrict is not a success because the
specification of restrict is the one part of the C1999 standard I
have given up trying to understand. I understand the underlying
idea but the specifics elude me. [...]
I agree the formal definition of restrict is rather daunting. In
practice though I think using restrict with confidence is not
overly difficult. My working model for restrict is something like this:
1. Use restrict only in the declarations of function
parameters.
2. For a declaration like const T *restrict foo ,
the compiler may assume that any objects that can be
accessed through 'foo' will not be modified.
Wouldn't that also be the case with just const T * foo ?
No.
Post by Spiros Bousbouras
Post by Tim Rentsch
3. For a declaration like T *restrict bas ,
the compiler may assume that any changes to objects
that can be accessed through 'bas' will be done
using 'bas' or a pointer value derived from 'bas'
(and in particular that no changes will happen
other than through 'bas' or 'bas'-derived pointer
values).
Is this summary description helpful?
It seems clear enough but , as I've said , I don't have any use
for restrict anyway and it's not worth it for me to expend the
additional mental effort to confirm that my code obeys the
additional restrictions of restrict. [...]
If you don't want to use restrict that is quite okay. Part of
why I call restrict a success is that it can be ignored, with
only minimal effort, by any developer who doesn't want to use it.
Spiros Bousbouras
2023-08-31 18:18:59 UTC
Permalink
On Wed, 30 Aug 2023 17:40:52 -0700
Post by Tim Rentsch
Post by Spiros Bousbouras
On Tue, 29 Aug 2023 04:35:40 -0700
[...]
Post by Tim Rentsch
Post by Spiros Bousbouras
Post by Tim Rentsch
Not at all. It's easy to write a specification that says what we
want to do, along similar lines to what is said in the footnote
about union member access in section 6.5.2.3
If the member used to access the contents of a union object
is not the same as the member last used to store a value in
the object, the appropriate part of the object representation
of the value is reinterpreted as an object representation in
the new type as described in 6.2.6 (a process sometimes called
"type punning"). This might be a trap representation.
Works for me but it would be good to know that this is how compiler
writers actually understand -fno-strict-aliasing . [...]
No, it wouldn't. Implementations follow the C standard, not
the other way around. Looking at what implementations do for
the -fno-strict-aliasing flag is worse than a waste of time.
Actually the influence goes in both directions. In theory the standard is the
ultimate authority , in practice it is whatever C compilers one has access to. For
now the standard doesn't have something like -fno-strict-aliasing so if one
needs it then looking at what implementations do is the only option. But even
the standard committee should look at it, and at whether C programmers find it
useful, to decide what along such lines (if anything) should go into the
standard.
Post by Tim Rentsch
Post by Spiros Bousbouras
Post by Tim Rentsch
There isn't any reason to think malloc() should be writable in
completely portable C. That's the point of putting malloc() in
the system library in the first place. By the way, with type
punning semantics mentioned above being the default, and with the
alignment features added in C11, I think it is possible to write
malloc() in portable C without needing any additional language
changes. But even if it isn't that is no cause for concern; one
of the principal reasons for having a system library is to
provide functionality that the core language cannot express (or
cannot express conveniently).
One might want to experiment with different allocation algorithms
and it seems to me that this sort of thing is within the "remit" of
C. So ideally one should be able to write it in C [...]
You're conflating writing something in C and writing something
in completely portable C. It's already possible to do these
things writing in C.
I wrote

One might want to experiment with different allocation algorithms and it
seems to me that this sort of thing is within the "remit" of C. So
ideally one should be able to write it in C and prove , starting from the
standard or precise specifications in compiler documentation , that it
works correctly. I don't necessarily mean prove the correctness of the
whole code but certain key parts.

This doesn't conflate anything. One can do the writing but can one do the
proving or something close ?
--
vlaho.ninja/prog
Tim Rentsch
2023-09-05 12:39:57 UTC
Permalink
Post by Spiros Bousbouras
On Wed, 30 Aug 2023 17:40:52 -0700
[...]
Post by Spiros Bousbouras
Post by Tim Rentsch
You're conflating writing something in C and writing something
in completely portable C. It's already possible to do these
things writing in C.
One might want to experiment with different allocation
algorithms and it seems to me that this sort of thing is
within the "remit" of C. So ideally one should be able to
write it in C and prove , starting from the standard or
precise specifications in compiler documentation , that it
works correctly. I don't necessarily mean prove the
correctness of the whole code but certain key parts.
This doesn't conflate anything. One can do the writing but
can one do the proving or something close ?
A substitute for malloc()/free() can be written in standard C.

A substitute for malloc()/free() can not be written in completely
portable standard C.

I hope this clarifies my earlier comments.
Tim Rentsch
2023-09-06 00:03:46 UTC
Permalink
Martin Uecker <***@gmail.com> writes:

[...]
There are essentially two main interests driving this. First,
there is some interest to precisely formulate the semantics for
C. The provenance proposal came out of this.
Second, there is the issue of safety problems caused by
uninitialized reads, together with compiler support for zero
initialization etc. So there are various people who want to
change the semantics for uninitialized variables completely
in the interest of safety.
So far, there was no consensus in WG14 that the rules should
be changed or what the new rules should be.
I have a second reply here, which I hope will come closer to
being relevant to the issues of interest.

What I think is being looked for is a way to describe the
language semantics in areas such as cross-type interference and
what is meant when an uninitialized object is read. I thought
about this question both while I was writing the longer earlier
reply and then more deeply afterwards.

What I think is most important is that these areas in particular
are not about language semantics in the same way as, for example,
array indexing. Rather they are about what transformations a
compiler is allowed to do in the presence of various combinations
of program constructs. That difference means the C standard
should express the rules in a way that more directly reflects
what's going on. More specifically, the standard should say or
explain what can be done, not by describing language semantics
(which is indirect), but explicitly in terms of what compiler
transformations are allowed (which is direct). Note that there
is precedent for this idea, in how the C standard talks about
looping constructs and when they may be assumed to terminate.
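
A sketch of the shape of code that precedent (6.8.5 p6 in C11)
covers:

```
#include <stdio.h>

/* Because the controlling expression of the while loop is not a
   constant expression, the implementation is permitted to assume
   the loop terminates -- a rule stated directly as an allowed
   compiler assumption rather than as observable semantics. */
static unsigned collatz_steps(unsigned n)
{
    unsigned count = 0;
    while (n != 1) {
        n = (n % 2) ? 3 * n + 1 : n / 2;
        count++;
    }
    return count;
}

int main(void)
{
    printf("%u\n", collatz_steps(27));   /* prints 111 */
    return 0;
}
```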

To give an example, take uninitialized objects, either automatic
variables without an initializer, or memory allocated by malloc or
added by realloc. The most natural semantics for such situations
is to say that newly "created" memory gets an unspecified object
representation at the start of its lifetime. (Yes I know that C
in its current form lets automatic objects be "uninitialized"
whenever their declaration points are reached, but let's ignore
that for now.) Now suppose a program has a read access where it
is easy to deduce that the object being read is still in the
"unspecified object representation" initial state. To simplify
the discussion, suppose the type of the access is a pointer type,
and so is known to have trap representations (the name is changed
in the C23 draft, but the idea is what's important).
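
A minimal sketch of the situation being described:

```
#include <stdio.h>

void demo(void)
{
    int *p;      /* automatic, no initializer: the object starts with
                    an unspecified representation, which for a pointer
                    type may be a trap (non-value) representation */
    printf("%p\n", (void *)p);   /* a read access where the initial
                                    state is trivially deducible */
}
```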

What is a compiler allowed to do in such circumstances? One thing
it might reasonably be allowed to do is to cause the program to be
terminated if it ever reaches such an access. Or there might be
an option to initialize the pointer to NULL. Or, if a suitable
compiler option were invoked, the construct might be flagged with
a fatal error (or of course a warning). There are all sorts of
actions a developer might want the compiler to take, and a
compiler could offer many of those options, as choices selected
under control of command line switches (or equivalent). I think a
few points are worth making.

One, there must be some sort of default action that all compilers
have to support. The default action in this case might be to
issue a non-fatal diagnostic.

Two, there must be a way for the developer to tell the compiler to
"proceed blindly" - saying, in effect, I accept that the compiled
code might misbehave, but let me take that risk, and generate code
like it's going to work. (In other words, for the read access, go
ahead and load whatever unspecified object representation happens
to be there.) A "proceed blindly" choice probably shouldn't be
the default, but it must be available.

Three, the consequence must never be "undefined behavior", unless
there is an explicit stipulation to that effect. The stipulation
might take the form of a #pragma, or a compiler option, or a code
decoration using "attribute" (whatever the syntax for such things
is).

I know my comments here are somewhat sketchy, but hopefully a
general sense of the ideas gets across. The suggestions should at
least serve to stimulate further discussion.
Jakob Bohm
2023-09-07 15:09:56 UTC
Permalink
Post by Tim Rentsch
[...]
There are essentially two main interests driving this. First,
there is some interest to precisely formulate the semantics for
C. The provenance proposal came out of this.
Second, there is the issue of safety problems caused by
uninitialized reads, together with compiler support for zero
initialization etc. So there are various people who want to
change the semantics for uninitialized variables completely
in the interest of safety.
So far, there was no consensus in WG14 that the rules should
be changed or what the new rules should be.
I have a second reply here, which I hope will come closer to
being relevant to the issues of interest.
What I think is being looked for is a way to describe the
language semantics in areas such as cross-type interference and
what is meant when an uninitialized object is read. I thought
about this question both while I was writing the longer earlier
reply and then more deeply afterwards.
What I think is most important is that these areas in particular
are not about language semantics in the same way as, for example,
array indexing. Rather they are about what transformations a
compiler is allowed to do in the presence of various combinations
of program constructs. That difference means the C standard
should express the rules in a way that more directly reflects
what's going on. More specifically, the standard should say or
explain what can be done, not by describing language semantics
(which is indirect), but explicitly in terms of what compiler
transformations are allowed (which is direct). Note that there
is precedent for this idea, in how the C standard talks about
looping constructs and when they may be assumed to terminate.
To give an example, take uninitialized objects, either automatic
variables without an initializer, or memory allocated by malloc or
added by realloc. The most natural semantics for such situations
is to say that newly "created" memory gets an unspecified object
representation at the start of its lifetime. (Yes I know that C
in its current form lets automatic objects be "uninitialized"
whenever their declaration points are reached, but let's ignore
that for now.) Now suppose a program has a read access where it
is easy to deduce that the object being read is still in the
"unspecified object representation" initial state. To simplify
the discussion, suppose the type of the access is a pointer type,
and so is known to have trap representations (the name is changed
in the C23 draft, but the idea is what's important).
What is a compiler allowed to do in such circumstances? One thing
it might reasonably be allowed to do is to cause the program to be
terminated if it ever reaches such an access. Or there might be
an option to initialize the pointer to NULL. Or, if a suitable
compiler option were invoked, the construct might be flagged with
a fatal error (or of course a warning). There are all sorts of
actions a developer might want the compiler to take, and a
compiler could offer many of those options, as choices selected
under control of command line switches (or equivalent). I think a
few points are worth making.
One, there must be some sort of default action that all compilers
have to support. The default action in this case might be to
issue a non-fatal diagnostic.
Two, there must be a way for the developer to tell the compiler to
"proceed blindly" - saying, in effect, I accept that the compiled
code might misbehave, but let me take that risk, and generate code
like it's going to work. (In other words, for the read access, go
ahead and load whatever unspecified object representation happens
to be there.) A "proceed blindly" choice probably shouldn't be
the default, but it must be available.
Three, the consequence must never be "undefined behavior", unless
there is an explicit stipulation to that effect. The stipulation
might take the form of a #pragma, or a compiler option, or a code
decoration using "attribute" (whatever the syntax for such things
is).
Agreed so far!

As a developer of programs in C with practical but not infinite
portability, I very much abhor the mad optimizations that use
language lawyering to state that any code path that might,
hypothetically, exceed the boundaries of standard-enforced behavior
is allowed to be arbitrarily mangled to get a faster bad result.

For example, I have one function which intentionally reads an
uninitialized variable to get a somewhat arbitrary value of a type
with no known trap representation. I have a number of other
programs which extensively process a block of data before deciding
in some other way if the data is garbage or useful. This is done
for sound technical reasons but requires that the compiler doesn't
plant landmines all over virgin land.

As another example, I have speed critical code that relies on running
on 2s complement machines with wraparound on signed integer overflow,
and that code is being very clear and explicit in doing so, but there
is no C90 notation to tell all ISO-C implementations that this is the
intention, thus it is explicit only in comments, not in the tokens
passed to the C compiler.
Post by Tim Rentsch
I know my comments here are somewhat sketchy, but hopefully a
general sense of the ideas gets across. The suggestions should at
least serve to stimulate further discussion.
I am writing from a similar perspective.

Enjoy

Jakob
--
Jakob Bohm, CIO, Partner, WiseMo A/S. https://www.wisemo.com
Transformervej 29, 2860 Søborg, Denmark. Direct +45 31 13 16 10
This public discussion message is non-binding and may contain errors.
WiseMo - Remote Service Management for PCs, Phones and Embedded
Ben Bacarisse
2023-09-07 16:19:56 UTC
Permalink
Post by Jakob Bohm
As another example, I have speed critical code that relies on running
on 2s complement machines with wraparound on signed integer overflow, and
that code is being very clear and explicit in doing so, but there
is no C90 notation to tell all ISO-C implementations that this is the
intention, thus it is explicit only in comments, not in the tokens
passed to the C compiler.
You can tell the compiler you want 2s complement by using the intN_t
types if you can find one that suits your portability requirements.

And can you not use unsigned arithmetic, re-interpreting as signed for
those places where it matters? The "overflow" can only happen in
the arithmetic, not in the re-interpretation.
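
A sketch of that approach using the fixed-width types; int32_t, where
it exists, is guaranteed to be two's complement with no padding:

```
#include <stdint.h>
#include <string.h>

/* The arithmetic is done in uint32_t, where wraparound is defined;
   the result is then re-interpreted as int32_t.  memcpy does the
   re-interpretation because a plain cast of an out-of-range value is
   implementation-defined; every 32-bit pattern is a valid int32_t,
   so the copy itself cannot trap. */
int32_t wrap_add(int32_t a, int32_t b)
{
    uint32_t sum = (uint32_t)a + (uint32_t)b;
    int32_t result;
    memcpy(&result, &sum, sizeof result);
    return result;
}
```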

I know this is a deviation from the topic, so feel free to ignore if you
don't want to get into it.
--
Ben.
Jakob Bohm
2023-09-08 21:12:00 UTC
Permalink
Post by Ben Bacarisse
Post by Jakob Bohm
As another example, I have speed critical code that relies on running
on 2s complement machines with wraparound on signed integer overflow, and
that code is being very clear and explicit in doing so, but there
is no C90 notation to tell all ISO-C implementations that this is the
intention, thus it is explicit only in comments, not in the tokens
passed to the C compiler.
You can tell the compiler you want 2s complement by using the intN_t
types if you can find one that suits your portability requirements.
And can you not use unsigned arithmetic, re-interpreting as signed for
those places where it matters? The "overflow" can only happen in
the arithmetic, not in the re-interpretation.
I know this is a deviation from the topic, so feel free to ignore if you
don't want to get into it.
The code in question has as an explicit design condition that the compiler
implements signed versions with wraparound for each unsigned integer type.

The code cannot rely on the intN_t types because they were not part of
C90 and thus do not exist as separate types in some targeted compilers.

In the world of C90 compilers, stdint.h was a non-standard system header
that provided convenience names for the most closely matching C90 types
on the platform, and some platforms simply didn't provide that header,
instead documenting how each C90 type mapped to data sizes.

Excessive casting where directly using the desired type seems possible
is highly counter-intuitive and thus it is inherently wrong for an
optimizer to presume the right to mangle code using types such as "int",
"short int", "long int" and "signed char".

Once again this comes down to a language drift from "undefined" meaning
"not defined by this standard" to "An extremely toxic trap condition" .


Enjoy

Jakob
--
Jakob Bohm, CIO, Partner, WiseMo A/S. https://www.wisemo.com
Transformervej 29, 2860 Søborg, Denmark. Direct +45 31 13 16 10
This public discussion message is non-binding and may contain errors.
WiseMo - Remote Service Management for PCs, Phones and Embedded
Ben Bacarisse
2023-09-08 21:31:04 UTC
Permalink
Post by Jakob Bohm
Post by Ben Bacarisse
Post by Jakob Bohm
As another example, I have speed critical code that relies on running
on 2s complement machines with wraparound on signed integer overflow, and
that code is being very clear and explicit in doing so, but there
is no C90 notation to tell all ISO-C implementations that this is the
intention, thus it is explicit only in comments, not in the tokens
passed to the C compiler.
You can tell the compiler you want 2s complement by using the intN_t
types if you can find one that suits your portability requirements.
And can you not use unsigned arithmetic, re-interpreting as signed for
those places where it matters? The "overflow" can only happen in
the arithmetic, not in the re-interpretation.
I know this is a deviation from the topic, so feel free to ignore if you
don't want to get into it.
The code in question has as an explicit design condition that the compiler
implements signed versions with wraparound for each unsigned integer type.
The code cannot rely on the intN_t types because they were not part of
C90 and thus do not exist as separate types in some targeted
compilers.
Ah, I didn't know targeting C90 was still a thing. I've been out of
the business for many years.
Post by Jakob Bohm
Excessive casting where directly using the desired type seems possible
is highly counter-intuitive and thus it is inherently wrong for an
optimizer to presume the right to mangle code using types such as "int",
"short int", "long int" and "signed char".
I wasn't suggesting casts as they don't remove the undefined behaviour.
But you have a design that suits your needs so it's all good.
--
Ben.
Kaz Kylheku
2023-07-22 06:40:39 UTC
Permalink
Post by Ben Bacarisse
"[...] If the lvalue designates an object of automatic storage
duration that could have been declared with the register storage class
(never had its address taken), and that object is uninitialized (not
declared with an initializer and no assignment to it has been
performed prior to use), the behavior is undefined."
seems to cover it. The restriction on not having its address taken
seems odd.
Wording like that looks like someone's solo documentation effort,
not peer-reviewed by an expert committee.

That looks as if the intent is to allow some diagnoses of uses of
uninitialized variables, while discouraging others.

However, it doesn't seem a good idea to be constraining
implementations in how clever they can be in identifying
an erroneous situation.
--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @***@mstdn.ca
Martin Uecker
2023-07-22 13:03:53 UTC
Permalink
Post by Kaz Kylheku
Post by Ben Bacarisse
"[...] If the lvalue designates an object of automatic storage
duration that could have been declared with the register storage class
(never had its address taken), and that object is uninitialized (not
declared with an initializer and no assignment to it has been
performed prior to use), the behavior is undefined."
seems to cover it. The restriction on not having it's address taken
seems odd.
Wording like that looks like someone's solo documentation effort,
not peer-reviewed by an expert committee.
That looks as if the intent is to allow some diagnoses of uses of
uninitialized variables, while discouraging others.
However, it doesn't seem a good idea to be constraining
implementations in how clever they can be in identifying
an erroneous situation.
I personally like this rule (but I am speaking only for myself; there is
no full consensus about the exact interpretation of the standard
nor about what it should say). I will try to explain why.

In C, we can also access objects using character pointers. This
should work in all cases, even for non-value (trap) representations,
and is also used in practice a lot to copy uninitialized or partially
initialized objects. If one makes all reads of objects with
indeterminate representation have undefined behavior, then
this would not work anymore.

If one wants to allow this (and a lot of real-world programs rely
on this), then one has to invent rules for how this works with an
abstract (provenance-based) notion of indeterminate values.
This turns out to be difficult.

But if we keep this rule, it becomes very simple: On the one
hand, all reads of uninitialized automatic variables whose
address is not taken are undefined behavior. This is the most
useful behavior for detecting bugs and/or optimization.

On the other hand, taking an address and working with a character
pointer to copy or manipulate an object is always defined; one
simply gets unspecified representation bytes (which may form
a non-value representation for some type, and it is UB to
read them using an lvalue of that type). So low-level operations
with partially initialized objects work as expected without having
to introduce complicated rules.
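
A minimal sketch of the distinction this rule draws:

```
int f(void)
{
    int x;        /* never has its address taken: register-eligible */
    return x;     /* undefined behavior under the quoted rule */
}

int g(void)
{
    int y;                                   /* address taken below */
    unsigned char *p = (unsigned char *)&y;
    return p[0];  /* defined: yields an unspecified byte, and unsigned
                     char has no non-value representations */
}
```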

It will cost a tiny bit of optimization opportunity, but avoid
a lot of trouble.

Martin
Tim Rentsch
2023-07-26 04:53:06 UTC
Permalink
Post by Ben Bacarisse
"[...] If the lvalue designates an object of automatic storage
duration that could have been declared with the register storage
class (never had its address taken), and that object is
uninitialized (not declared with an initializer and no
assignment to it has been performed prior to use), the behavior
is undefined."
seems to cover it. The restriction on not having its address
taken seems odd.
[...]
I personally like this rule (but I am speaking only for myself; there is
no full consensus about the exact interpretation of the standard
nor about what it should say). I will try to explain why. [...]
It's a good rule. I agree with your comments. I guess it's
possible the wording could be improved, but compared to other
parts of the C standard the clarity of this passage is closer to
the top than it is to the bottom.
Tim Rentsch
2023-08-16 18:11:41 UTC
Permalink
Post by Kaz Kylheku
Post by Ben Bacarisse
"[...] If the lvalue designates an object of automatic storage
duration that could have been declared with the register storage class
(never had its address taken), and that object is uninitialized (not
declared with an initializer and no assignment to it has been
performed prior to use), the behavior is undefined."
seems to cover it. The restriction on not having its address taken
seems odd.
Wording like that looks like someone's solo documentation effort,
not peer-reviewed by an expert committee.
That looks as if the intent is to allow some diagnoses of uses of
uninitialized variables, while discouraging others.
That isn't at all what this passage is about.
Kaz Kylheku
2023-07-21 17:42:23 UTC
Permalink
Post by Keith Thompson
N3096 is the last public draft of the upcoming C23 standard.
[...]
(11) The value of an object with automatic storage duration is
used while the object has an indeterminate representation
(6.2.4, 6.7.10, 6.8).
Personally, I think that the root cause of this whole issue is
the defective definition of indeterminate value.

Indeterminacy must be an abstract concept that is not encoded
in the bits of the object; it is a matter of provenance.

An indeterminate integer could have a valid bit pattern,
such as all zero, yet the implementation should be free to terminate
with a diagnostic (or behave in other ways) when it is accessed.

It should not be possible to tell whether an object is indeterminate
by looking at its bits.

An implementation can track this with meta data. Translation time
flow-analysis data can catch some uses of uninitialized objects;
that's how we get classic uninitialized variable warnings.

An implementation can track uninitialized bits at run-time with
hidden meta-data. The Valgrind debugging tool does this; for
every bit, whose value is necessarily always 0 or 1, it tracks
whether the bit is initialized.

That poor definition of indeterminate value should go.

Otherwise the standard is contradicting itself and doing
silly things like asserting that using an indeterminate value
is undefined behavior if it is a local variable with automatic
storage.

A reasonable definition of indeterminate might be:

indeterminate

an abstract status indicating that a value is invalid,
irrespective of the content of the bits which constitute
that value.

An improperly obtained value is indeterminate(1).

A previously valid value may lapse into indeterminate status.(2)

Any use of an indeterminate value is undefined behavior.

--
(1) For example, a value obtained by accessing an uninitialized
object defined in automatic storage, or an uninitialized
region of memory obtained from malloc

(2) For example, a pointer to an object becomes indeterminate
if that object is deallocated.
--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @***@mstdn.ca
Jakob Bohm
2023-07-24 05:53:59 UTC
Permalink
Post by Kaz Kylheku
Post by Keith Thompson
N3096 is the last public draft of the upcoming C23 standard.
[...]
(11) The value of an object with automatic storage duration is
used while the object has an indeterminate representation
(6.2.4, 6.7.10, 6.8).
Personally, I think that the root cause of this whole issue is
the defective definition of indeterminate value.
The problem is much deeper than that. It all boils down to the
obsession in the official C community with abusing the concept of
"undefined" to cover everything from "arbitrary natural semantics
of the hardware" to "optimizing away code unexpectedly". It would
be highly beneficial for a cleanup in C30, or even a corrective TR,
to split the concept into explicit cases that vary for each
situation. For example, runtime error reporting should be very
different from optimizing away code that may encounter runtime
errors on hardware other than the one it actually runs on.

From a simplified conceptual machine model that resembles a modern
von Neumann architecture with only floating point types having
actual trap representations, a lot of rules that have at various
times been rephrased using the word "undefined" seem utterly absurd,
and applying the current meaning of "undefined" back to the
actual machines that inspired them will tend to cause even more absurdities.

For example, the ability of IA64 CPUs to raise an actual trap
exception in response to reading an uninitialized register is very
different from aggressively optimizing away code that might use an
unknown stray value, especially under the aggressive optimization
settings required by the IA64 Explicitly Parallel design.


Some of the things that "undefined" in the current text could map
to:

- anyof(A,B,C) = An implementation-specific and possibly uncontrolled
choice between A, B and C (with no others permitted).
- Continuing as if nothing happened
- Aborting execution, possibly with an error indication.
- raise(X) where X is specified in the standard.
- An implementation-specific value to be listed in the
implementation documentation.
- A standard-specified value.
- Executing machine code at a specified memory address in accordance
with the actual machine behavior (this is common for calling
a function pointer that isn't set to a C function of proper type).
- Causing the code to be eliminated (think assume(0);)
- Reserved for future standardization in future editions.
- Reserved for standardization in other ISO documents (such as POSIX
or C++).
- Reserved for implementation-specific behavior to be listed in the
implementation documentation.

For example, the effect of calling assert() with a false value is
"anyof(continuing as if nothing happened, aborting with an error)",
with the standard specifying that the choice is controlled by
whether the NDEBUG macro is defined when <assert.h> is included.
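
A trivial sketch of that standard mechanism:

    /* With NDEBUG defined before <assert.h> is included, assert(expr)
       expands to ((void)0); without NDEBUG, a failing assertion prints
       a diagnostic and calls abort(). */
    #define NDEBUG
    #include <assert.h>

    int main(void)
    {
        assert(0);   /* no effect here; without NDEBUG this would abort */
        return 0;
    }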


There should also be a way for limits.h (one of the few headers
required in freestanding implementations) to specify via new
standard defines whether the implementation conforms to common sets
of implementation-specific behaviors such as "two's complement int
with wraparound", "one's complement int with wraparound", "sign
and magnitude int with wraparound", "unsigned with wraparound",
"IEEE nnnn floating point with/without overflow exceptions",
"negative int division by positive int rounds towards zero"
(and the other possibilities for division special cases) etc. etc.
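
A sketch of what using such a define might look like; the macro
name here is invented and exists in no standard or implementation:

    #include <limits.h>

    /* Hypothetical feature-test macro in the spirit of the proposal. */
    int increment_wrapping(int x)
    {
    #if defined(__STDC_INT_TWOS_COMPLEMENT_WRAPAROUND__)   /* invented */
        return x + 1;                     /* may rely on wraparound */
    #else
        return x == INT_MAX ? INT_MIN : x + 1;  /* avoid signed overflow */
    #endif
    }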


Enjoy

Jakob
--
Jakob Bohm, CIO, Partner, WiseMo A/S. https://www.wisemo.com
Transformervej 29, 2860 Søborg, Denmark. Direct +45 31 13 16 10
This public discussion message is non-binding and may contain errors.
WiseMo - Remote Service Management for PCs, Phones and Embedded
Tim Rentsch
2023-07-26 04:57:20 UTC
Permalink
Post by Jakob Bohm
Post by Kaz Kylheku
Post by Keith Thompson
N3096 is the last public draft of the upcoming C23 standard.
[...]
(11) The value of an object with automatic storage duration is
used while the object has an indeterminate representation
(6.2.4, 6.7.10, 6.8).
Personally, I think that the root cause of this whole issue is
the defective definition of indeterminate value.
The problem is much deeper than that. It all boils down to the
obsession in the official C community to abuse the concept of
"undefined" to cover everything from "arbitrary natural semantics
of the hardware" to "optimizing away code unexpectedly" . [...]
This discussion looks interesting but it seems better that
there be a separate thread to take it up.
Tim Rentsch
2023-08-03 20:13:26 UTC
Permalink
Repeating the question stated in the Subject line:

Does reading an uninitialized object [always] have undefined
behavior?

Background: Annex J part 2 says (in various phrasings in
different revisions of the C standard, with the one below
being taken from C90):

The value of an uninitialized object that has automatic
storage duration is used before a value is assigned [is
undefined behavior] (6.5.7)

Remembering that Annex J is informative rather than normative,
is this statement right even for a type that has no trap
representations? To ask that question another way, is this
statement always right or is it just a (perhaps useful)
approximation?

I think this question can be answered convincingly by reviewing
the subject's history in each revision of the ISO C standard.


We start in C90.

In C90 reading the value of an uninitialized object is always
undefined behavior (and that includes malloc()ed storage as well
as automatic storage duration objects). The C90 standard says,
in 6.5.7:

If an object that has automatic storage duration is not
initialized explicitly, its value is indeterminate.

and in 7.10.3.3:

The malloc function allocates space for an object whose size
is specified by size and whose value is indeterminate.

The term "indeterminate" is not defined in C90, but accessing
storage that is indeterminate is explicitly undefined behavior.
Indeed such uses are part of the /definition/ of undefined
behavior - C90 says in 3.16 (which is an entry in Definitions):

undefined behavior: Behavior, upon use of a nonportable or
erroneous program construct, of erroneous data, or of
indeterminately valued objects, for which this International
Standard imposes no requirements.

So for C90 we have a clear answer: always undefined behavior for
accessing any uninitialized object.

Unfortunately the C90 scheme has some serious issues. There is
no exception for reading using a character type. More seriously,
although C90 gives some situations that cause values to be
indeterminate, it doesn't say anything about making them /not/
be indeterminate. We can guess (but only guess) that assigning
a value to the object as a whole removes indeterminate-ness, but
what about these cases (and other similar ones):

int x;
*(char*)&x = 0;
// is the value of x now indeterminate or not?

struct { int x, y; } s;
s.x = 0;
// is the value of s now indeterminate or not?

Again, we can make guesses about what these answers should be,
but the C90 standard doesn't say. Clearly C90 has some
significant deficiencies.


Next we look at C99.

(Actually, before we do that, I should mention that C90 was
amended and corrected in 1994, 1995, and 1996, by the three
intermediate documents ISO/IEC 9899/COR1, ISO/IEC 9899/AMD1, and
ISO/IEC 9899/COR2. As far as I am aware these revisions have no
bearing on the matter at hand.)

The C99 standard represents a substantial revision and expansion
of the C90 standard. The relationship between uninitialized
memory and undefined behavior is nearly completely rewritten, and
also made more concrete. There's lots to look at here. Starting
at the top, the definition of undefined behavior is revised not
to give any mention of indeterminately valued objects. Here is
section 3.4.3 paragraph 1:

undefined behavior
behavior, upon use of a nonportable or erroneous program
construct or of erroneous data, for which this International
Standard imposes no requirements

(Incidentally the section and paragraph references given in this
part of the discussion are relative to the ISO N1256 document.)

The next most prominent change is that "indeterminate value" is
explicitly defined, in section 3.17.2 paragraph 1:

indeterminate value
either an unspecified value or a trap representation

This definition makes use of two new terms, "unspecified value"
and "trap representation", that were not used in C90. The term
unspecified value is defined immediately following, in 3.17.3 p1:

unspecified value
valid value of the relevant type where this International
Standard imposes no requirements on which value is chosen in
any instance

There is also an informative note in p2:

NOTE An unspecified value cannot be a trap representation.

The term "trap representation" is defined in 6.2.6.1 p5:

Certain object representations need not represent a value of
the object type. If the stored value of an object has such a
representation and is read by an lvalue expression that does
not have character type, the behavior is undefined. If such
a representation is produced by a side effect that modifies
all or any part of the object by an lvalue expression that
does not have character type, the behavior is undefined.41)
Such a representation is called a /trap representation/.

The slant characters around "trap representation" indicate
italics, which the C standard uses to denote a term being
defined. Also there is a '41)' footnote reference:

41) Thus, an automatic variable can be initialized to a trap
representation without causing undefined behavior, but the
value of the variable cannot be used until a proper value is
stored in it.

which underscores the non-undefined-behavior aspect of using
character types to change the object representation (and hence
the value) of an object.

The C99 text doesn't use the term "trap representation" very
often. There are several cases where certain types are ruled out
from having trap representations; a few cases where a result
/might be/ a trap representation; and a case involving integer
types where there is an implementation-defined choice as to
whether a specific combination of value bits is a valid value or
a trap representation. Also, in Annex J part 2, the list of
undefined behaviors, there are these summary items:

A trap representation is read by an lvalue expression that
does not have character type (6.2.6.1).

A trap representation is produced by a side effect that
modifies any part of the object using an lvalue expression
that does not have character type (6.2.6.1).

which of course correspond directly to what is said in the
definition of trap representation. Based on various passages in
section 6.2.6, which describes the representation of types, we
can deduce that for some integer types all bit combinations must
be a valid value, and so no trap representations are possible for
those types. Such types always include 'unsigned char', and may
also include other integer types depending on the size of the
type, the value of CHAR_BIT, and the values given in <limits.h>
for the range of the type in question. (More concretely, if the
set of distinct values for type T has 2**(sizeof(T)*CHAR_BIT)
elements, then all object representations are valid values, and
thus type T cannot have any trap representations.)
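
That deduction can be checked at run time. Here is a small sketch
(my illustration, not from the standard) that counts the value bits
of unsigned int and compares them against the width of its object
representation:

    #include <limits.h>
    #include <stdio.h>

    int main(void)
    {
        /* Count the value bits of unsigned int by scanning UINT_MAX. */
        unsigned value_bits = 0;
        for (unsigned m = UINT_MAX; m != 0; m >>= 1)
            value_bits++;

        unsigned object_bits = (unsigned)(sizeof(unsigned) * CHAR_BIT);
        if (value_bits == object_bits)
            puts("unsigned int: no padding bits, hence no trap representations");
        else
            printf("unsigned int: %u padding bits\n", object_bits - value_bits);
        return 0;
    }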

There are three points worth mentioning regarding unspecified
values and trap representations. One is that unspecified values
are always valid values, and never by themselves cause undefined
behavior. Two is that the distinction between an unspecified
value and a trap representation depends on the type used to
access the object. Three is that, once we know the type of an
access, whether a given object holds a valid value or a trap
representation depends only on the bits and bytes that make up
the object representation of the object, and in particular not on
any hidden "magic" state associated with the object. (There is
one case though that deserves a closer look, which is explained
further on.)

The rule for trap representations is simple and clear: any
access of an object whose object representation is a trap
representation of the access's type is undefined behavior, and
this consequence is accurately portrayed in Annex J part 2.

Having settled the question for trap representations, how about
indeterminate values?

Ruling out the definition and an entry in the index, the term
"indeterminate value" (or values plural) appears in just six
places in the C99 standard: three in informative passages
(usually examples), and three normative passages, those being
6.7.8 paragraph 9 (about unnamed members), 6.8 paragraph 3 (about
declarations for objects with automatic storage duration), and
7.20.3.4 paragraph 2 (about bytes added by a call to realloc()).
The sentence in 6.8 paragraph 3 deserves quoting:

The initializers of objects that have automatic storage
duration, and the variable length array declarators of
ordinary identifiers with block scope, are evaluated and the
values are stored in the objects (including storing an
indeterminate value in objects without an initializer) each
time the declaration is reached in the order of execution, as
if it were a statement, and within each declaration in the
order that declarators appear.

Section 7 has many places where the word "indeterminate" appears
without being followed by "value". I think most of these can be
safely skipped over, but the description of malloc() deserves
quoting (it is 7.20.3.3 paragraph 2):

The malloc function allocates space for an object whose size
is specified by size and whose value is indeterminate.

Presumably the sentence here is meant to express the same idea
as the parallel passage describing the results from realloc(),
which says (in 7.20.3.4 paragraph 2):

Any bytes in the new object beyond the size of the old object
have indeterminate values.

The word "indeterminate" without being followed by "value"
is used in just six other places in the standard: five in the
main body (all of which are part of normative text), plus one
entry in Annex J part 2 (which is of course informative). The
normative uses may be seen to be in two categories, as follows.

Four of the five normative uses are basically restatements of the
long sentence from 6.8 paragraph 3; they are in 6.2.4 paragraph 5
(two uses) and paragraph 6, and 6.7.8 paragraph 10. Here are
excerpts showing these four occurrences (all of which refer to
objects with automatic storage duration):

The initial value of the object is indeterminate.

[if an object had no initializer] the value becomes
indeterminate each time the declaration is reached.

The initial value of the object is indeterminate.

If an object that has automatic storage duration is not
initialized explicitly, its value is indeterminate.

Although these passages use different phrasing, it seems clear
they are meant to mirror the parenthetical phrase in 6.8 p3,
"storing an indeterminate value in objects without an
initializer"; presumably the difference in phrasing simply
reflects the styles of the respective sections: 6.8 gives an
imperative description, whereas 6.2.4 and 6.7 tend to be more
declarative in style. (The last of these excerpts matches
word-for-word with the analogous sentence in C90.) That the C99
standard considers these five passages as expressing the same
idea can be seen by them all being referenced in a single entry
given in Annex J part 2:

The value of an object with automatic storage duration is
used while it is indeterminate (6.2.4, 6.7.8, 6.8).

Compare this text with the corresponding entry in C90. One
reason for the difference is that in C99, unlike in C90, an
object can become "unassigned" after it is first assigned (which
is a consequence in C99 of being able to mix declarations and
statements). So rather than say "before a value is assigned"
the C99 standard says "while it is indeterminate".

The one other place where the word "indeterminate" is used
without being followed by "value" is in 6.2.4 paragraph 2:

The value of a pointer becomes indeterminate when the object
it points to reaches the end of its lifetime.

(The analogous sentence in C90 says basically the same but using
different phrasing, partly because C90 doesn't have any explicit
definition of "lifetime", which of course C99 does.)

There is a corresponding entry for this passage in Annex J part 2
(and which actually doesn't use the word indeterminate):

The value of a pointer to an object whose lifetime has ended
is used (6.2.4).

There is a subtle but important difference between this rule and
the other passages mentioned above. In all of the other cases
there is a specific object being referenced. In the rule here,
we aren't talking about a particular object, nor even just one
object necessarily (there could be many), but possibly about
values that aren't in an object at all. Consider this code
fragment:

char *p = malloc( 1 );
char *q = p + (free(p),0);

It seems clear that the second line is meant to be undefined
behavior /even if the (leftmost) access of p has already taken
place before the call to free() is done/. It isn't an access to
an object (whether indeterminate or not) that is causing the
problem. Rather, it is the use of a value -- valid at the time
the value was obtained -- that has been rendered /invalid/
between the time the value was loaded from p and the time the
value is used in a '+' operation.

Of course, we all understand what's really going on here. In
real computer hardware, the bits of a pointer value don't
magically change when a free() is done (or when an object goes
out of scope and its lifetime ends, etc). Instead, the bits stay
the same, but whether the bits are meaningful or not (or whether
they have the same meaning as before) depends on the state of the
"memory system" as a whole. The term "memory system" is in
quotes because it is meant to include not just state in the
actual hardware but also assumptions made by the compiled code;
a pointer to memory in a departed stack frame may be perfectly
fine as far as the hardware is concerned, but it violates an
assumption made by the compiler that the associated memory may
be (or already have been) reused for another purpose.

One problem with this understanding is that it isn't amenable to
being expressed in the language of the abstract machine. So C99
glosses over the problem by saying "the value of a pointer
becomes indeterminate when ...", disregards what the definition
of "indeterminate value" says, and then pretends (in Annex J.2)
that using any such value is undefined behavior. The text in the
standard is very clear: reading a trap representation is always
undefined behavior (unless accessed using a character type).
There is nothing in the normative text of the standard that says
accessing an indeterminate value is undefined behavior. In fact,
if we take the text of the standard at its word, /every/ object
has an indeterminate value, because every object representation
is either a valid value or a trap representation.

If we ignore pointer types we have an answer to our question:
any type that has no trap representations never causes undefined
behavior by being accessed. Then why does the entry in Annex J.2
give a blanket statement that any use is undefined behavior? A
reasonable guess is that entries in Annex J are meant to provide
useful shorthands without necessarily being completely accurate
(consider for example that the exception for access done using a
character type is not mentioned in the Annex J.2 entry -- a clear
omission).

There is more to say about pointer types. Considering how long
this memo is already it seems better to defer that to a separate
posting.


Next we look at C11.

With respect to the question being considered, the C11 standard
is almost exactly the same as the C99 standard. There are two
differences. First, there is a cosmetic change in that the term
"trap representation" is given a summary definition in section
3.19.4; the paragraph in 6.2.6 where "trap representation" was
previously defined in C99 is unchanged except that in C11 there
are no italics.

The second difference is not a revision but an addition. In
section 6.3.2.1 paragraph 2, talking about lvalue conversion, one
sentence has been added at the end of the paragraph:

If the lvalue designates an object of automatic storage
duration that could have been declared with the register
storage class (never had its address taken), and that object
is uninitialized (not declared with an initializer and no
assignment to it has been performed prior to use), the
behavior is undefined.

Naturally there is a corresponding entry that has been added to
Annex J.2:

An lvalue designating an object of automatic storage
duration that could have been declared with the register
storage class is used in a context that requires the value
of the designated object, but the object is uninitialized.
(6.3.2.1).

The motivation for this new rule reportedly reflects hardware
behavior, on some more recent chips, for some stack-allocated
variables. The added text has several points worth noting.

One, the rule adds a specific, narrow case of undefined behavior
that is simple and clearly delineated.

Two, it does not use the term "indeterminate" or "indeterminate
value". Instead the rule is written in terms of initialization
and assignment. By avoiding "indeterminate", it avoids any
uncertainty about whether undefined behavior must result from
using an indeterminate value.

Three, it provides indirect evidence that use of an indeterminate
value is not necessarily undefined behavior, because if it were
then this new rule would not be necessary.

Four, the condition of undefined behavior is expressed using
imperative phrasing: what matters is what has been done, or not
done, to the object in question. This choice makes this rule a
supplement, not a replacement, for 6.8 p3 et al. Consider this
example function definition:

double
example( double in ){
    unsigned yet = 0;
redux: ;
    double d;
    if( !yet ){
        d = in;
        yet++;
        goto redux;
    }
    return d;
}

The use of 'd' in 'return d;' might give undefined behavior,
because 'd' may have a trap representation under 6.8 p3. But
the code doesn't violate the conditions of 6.3.2.1 p2, because
an assignment has been done before the lvalue conversion in the
final statement; the intervening evaluation of 'double d;'
doesn't change that. Note also that the clause in 6.8 p3 for
such declarations, "storing an indeterminate value in objects
without an initializer", does not interfere with the application
of the rule in 6.3.2.1 p2, because that rule is written in terms
of assignment, and not in terms of storing a value (which may
have been done because of the parenthetical phrase in 6.8 p3).
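
By contrast, a minimal example (mine, not from the original post)
that does violate the conditions of 6.3.2.1 p2:

    int bad(void)
    {
        int x;        /* no initializer, no assignment, and its
                         address is never taken */
        return x;     /* lvalue conversion: undefined behavior
                         under C11 6.3.2.1 p2 */
    }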


After C11

I have not taken the time to review the C17 standard or the
C23 draft standard while researching the topic here. I see that
some changes have been made (such as "non-value representation"
for "trap representation"), but to the best of my knowledge none
of the key passages are substantively different. I may check on
that later (but no promises on when or whether).


Summary: my reading is that accessing an object that has not
been explicitly stored into since its declaration was evaluated
is necessarily undefined behavior in C90, but not necessarily
undefined behavior in C99 and C11 (and AFAIAA also in C17 and
the upcoming C23). My reasoning is given in detail above.


Postscript: this commentary has taken much longer to write than
I thought it would, for the most part because I made an early
decision to be systematic and thorough. I hope the effort has
helped the readers gain confidence in the explanations and
conclusions stated. I may return to the deferred topic about
pointer types but have no plans at present about when that might
be.
Keith Thompson
2023-08-03 22:20:14 UTC
Permalink
Post by Tim Rentsch
Does reading an uninitialized object [always] have undefined
behavior?
Background: Annex J part 2 says (in various phrasings in
different revisions of the C standard, with the one below being taken from C90):
The value of an uninitialized object that has automatic
storage duration is used before a value is assigned [is
undefined behavior] (6.5.7)
Remembering that Annex J is informative rather than normative,
is this statement right even for a type that has no trap
representations? To ask that question another way, is this
statement always right or is it just a (perhaps useful)
approximation?
[400+ lines deleted]
Post by Tim Rentsch
Summary: my reading is that accessing an object that has not
been explicitly stored into since its declaration was evaluated
is necessarily undefined behavior in C90, but not necessarily
undefined behavior in C99 and C11 (and AFAIAA also in C17 and
the upcoming C23). My reasoning is given in detail above.
Postscript: this commentary has taken much longer to write than
I thought it would, for the most part because I made an early
decision to be systematic and thorough. I hope the effort has
helped the readers gain confidence in the explanations and
conclusions stated. I may return to the deferred topic about
pointer types but have no plans at present about when that might
be.
Thank you for taking the time to write that.

I'd like to offer a brief summary of the points you made. Please let me
know if my summary is incorrect.

- An "indeterminate value" is by definition either an "unspecified
value" or a "trap representation".

- In C90 (which did not yet define all these terms), accessing the value
of an uninitialized object explicitly has undefined behavior.

- In C99 and later, J.2 (which is *not* normative) states that using the
value of an object with automatic storage duration while it is
indeterminate has undefined behavior. This implies that:
    int main(void) {
        int n;
        n;
    }
has undefined behavior, even if int has no trap representations.

- Statements in J.2 *should* be supported by normative text.

- There is no normative text in any post-C90 edition of the C
standard that supports the claim that reading an uninitialized
int object actually has undefined behavior if it does not hold
a trap representation. (Pointers raise other issues, which I'll
ignore for now.)

- The cited statement in J.2 is incorrect, or at least imprecise.

I agree with you on all the above points.

There is one point on which I think we disagree. It is a matter
of opinion, not of fact. You wrote:

Remembering that Annex J is informative rather than normative,
is this statement right even for a type that has no trap
representations? To ask that question another way, is this
statement always right or is it just a (perhaps useful)
approximation?

The statement in N1570 J.2 is:

The behavior is undefined in the following circumstances:
[...]
- The value of an object with automatic storage duration is used
while it is indeterminate (6.2.4, 6.7.9, 6.8).

I get the impression that you're not particularly bothered by the fact
that the statement in J.2 is merely an "approximation". In my opinion,
the statement in J.2 is simply incorrect, and should be fixed. (That's
unlikely to be possible at this stage of the C23 process.) The fact
that Annex J is, to quote the standard's foreword, "for information
only", is not an excuse to ignore factual errors. Readers of the
standard rely on the informative annexes to provide correct information.
This particular text is not just a "(perhaps useful) approximation"; it
is actively misleading.

I'm not criticizing the author of the standard for making this mistake.
Stuff happens. It was likely a result of an oversight during the
transition from C90 to C99.
--
Keith Thompson (The_Other_Keith) Keith.S.Thompson+***@gmail.com
Will write code for food.
void Void(void) { Void(); } /* The recursive call of the void */
Martin Uecker
2023-08-05 08:15:46 UTC
Permalink
Post by Keith Thompson
Post by Tim Rentsch
Does reading an uninitialized object [always] have undefined
behavior?
Background: Annex J part 2 says (in various phrasings in
different revisions of the C standard, with the one below being taken from C90):
The value of an uninitialized object that has automatic
storage duration is used before a value is assigned [is
undefined behavior] (6.5.7)
Remembering that Annex J is informative rather than normative,
is this statement right even for a type that has no trap
representations? To ask that question another way, is this
statement always right or is it just a (perhaps useful)
approximation?
[400+ lines deleted]
Post by Tim Rentsch
Summary: my reading is that accessing an object that has not
been explicitly stored into since its declaration was evaluated
is necessarily undefined behavior in C90, but not necessarily
undefined behavior in C99 and C11 (and AFAIAA also in C17 and
the upcoming C23). My reasoning is given in detail above.
Postscript: this commentary has taken much longer to write than
I thought it would, for the most part because I made an early
decision to be systematic and thorough. I hope the effort has
helped the readers gain confidence in the explanations and
conclusions stated. I may return to the deferred topic about
pointer types but have no plans at present about when that might
be.
Thank you for taking the time to write that.
I'd like to offer a brief summary of the points you made. Please let me
know if my summary is incorrect.
- An "indeterminate value" is by definition either an "unspecified
value" or a "trap representation".
- In C90 (which did not yet define all these terms), accessing the value
of an uninitialized object explicitly has undefined behavior.
- In C99 and later, J.2 (which is *not* normative) states that using the
value of an object with automatic storage duration while it is
indeterminate has undefined behavior. This implies that:
    int main(void) {
        int n;
        n;
    }
has undefined behavior, even if int has no trap representations.
- Statements in J.2 *should* be supported by normative text.
- There is no normative text in any post-C90 edition of the C
standard that supports the claim that reading an uninitialized
int object actually has undefined behavior if it does not hold
a trap representation. (Pointers raise other issues, which I'll
ignore for now.)
- The cited statement in J.2 is incorrect, or at least imprecise.
I agree with you on all the above points.
There is one point on which I think we disagree. It is a matter of opinion, not of fact. You wrote:
Remembering that Annex J is informative rather than normative,
is this statement right even for a type that has no trap
representations? To ask that question another way, is this
statement always right or is it just a (perhaps useful)
approximation?
[...]
- The value of an object with automatic storage duration is used
while it is indeterminate (6.2.4, 6.7.9, 6.8).
I get the impression that you're not particularly bothered by the fact
that the statement in J.2 is merely an "approximation". In my opinion,
the statement in J.2 is simply incorrect, and should be fixed. (That's
unlikely to be possible at this stage of the C23 process.) The fact
that Annex J is, to quote the standard's foreword, "for information
only", is not an excuse to ignore factual errors. Readers of the
standard rely on the informative annexes to provide correct information.
This particular text is not just a "(perhaps useful) approximation"; it
is actively misleading.
I'm not criticizing the author of the standard for making this mistake.
Stuff happens. It was likely a result of an oversight during the
transition from C90 to C99.
I personally agree with this analysis and also with the need to fix J.2.
Pointers seem to fit into this scheme if you think about the valid
addresses of objects + null pointers as the set of valid values
for a pointer. Any representation not corresponding to such an
address is then a non-value representation.

But note that there are many people who believe that "indeterminate"
should be understood as an abstract property, propagated similarly
to pointer provenance, which can act as an abstract non-value
representation even for types that do not have room for such
representations.

For C23 the rules stay the same. We changed the term "trap
representation" to "non-value representation" because people were
often confused: reading a non-value representation in lvalue
conversion is UB, but this does not necessarily imply a trap. On
the other hand, a trap might be defined behavior caused by a valid
value of a type.

The term "indeterminate value" was changed to "indeterminate
representation" because the wording "an indeterminate value is
either an unspecified value or a trap representation" does not
much sense because value and representation are different
things. Also some compilers and also C++ have indeterminate
values with different semantics, which caused confusion, i.e.
in C++ you can copy indeterminate values from an uninitialized
object to another and this is not UB. In C you either directly
have UB or you copy an unspecified value which is valid, so
there are no indeterminate values as such.


Martin
Tim Rentsch
2023-08-16 16:19:10 UTC
Permalink
Post by Keith Thompson
Post by Tim Rentsch
Does reading an uninitialized object [always] have undefined
behavior?
Background: Annex J part 2 says (in various phrasings in
different revisions of the C standard, with the one below being taken from C90):
The value of an uninitialized object that has automatic
storage duration is used before a value is assigned [is
undefined behavior] (6.5.7)
Remembering that Annex J is informative rather than normative,
is this statement right even for a type that has no trap
representations? To ask that question another way, is this
statement always right or is it just a (perhaps useful)
approximation?
[400+ lines deleted]
Post by Tim Rentsch
Summary: my reading is that accessing an object that has not
been explicitly stored into since its declaration was evaluated
is necessarily undefined behavior in C90, but not necessarily
undefined behavior in C99 and C11 (and AFAIAA also in C17 and
the upcoming C23). My reasoning is given in detail above.
Postscript: this commentary has taken much longer to write than
I thought it would, for the most part because I made an early
decision to be systematic and thorough. I hope the effort has
helped the readers gain confidence in the explanations and
conclusions stated. I may return to the deferred topic about
pointer types but have no plans at present about when that might
be.
Thank you for taking the time to write that.
It's nice to be appreciated. Thank you.
Post by Keith Thompson
I'd like to offer a brief summary of the points you made. Please let me
know if my summary is incorrect.
Excellent. I am writing a reaction directly after each item.
Post by Keith Thompson
- An "indeterminate value" is by definition either an "unspecified
value" or a "trap representation".
Yes.
Post by Keith Thompson
- In C90 (which did not yet define all these terms), accessing the value
of an uninitialized object explicitly has undefined behavior.
C90 made "use [...] of indeterminately valued objects" part of the
definition of undefined behavior. To connect the dots we need to
know that "If an object that has automatic storage duration is not
initialized explicitly, its value is indeterminate." These two
normative items are combined into one in J.2: "The value of an
uninitialized object that has automatic storage duration is used
before a value is assigned".
Post by Keith Thompson
- In C99 and later, J.2 (which is *not* normative) states that using the
value of an object with automatic storage duration while it is
indeterminate has undefined behavior. This implies that:
    int main(void) {
        int n;
        n;
    }
has undefined behavior, even if int has no trap representations.
For the J.2 summary, yes. I don't think I gave the implied
conclusion, but I agree with you that the J.2 entry does seem to
imply this.
Post by Keith Thompson
- Statements in J.2 *should* be supported by normative text.
I don't think I said this at all. At least for now I offer
no opinion on this recommendation.
Post by Keith Thompson
- There is no normative text in any post-C90 edition of the C
standard that supports the claim that reading an uninitialized
int object actually has undefined behavior if it does not hold
a trap representation. (Pointers raise other issues, which I'll
ignore for now.)
Yes, with a very minor correction that it is C99 and later, because
I haven't looked at the editions of the C standard after C90 but
before C99.
Post by Keith Thompson
- The cited statement in J.2 is incorrect, or at least imprecise.
I don't think I said this exactly. I did say or at least imply
that the quoted entry in J.2 is not completely accurate. Certainly
it allows conclusions that are not supported by normative text, and
looked at from that point of view it is "wrong".
Post by Keith Thompson
I agree with you on all the above points.
There is one point on which I think we disagree. It is a matter of opinion, not of fact. You wrote:
Remembering that Annex J is informative rather than normative,
is this statement right even for a type that has no trap
representations? To ask that question another way, is this
statement always right or is it just a (perhaps useful)
approximation?
[...]
- The value of an object with automatic storage duration is used
while it is indeterminate (6.2.4, 6.7.9, 6.8).
I get the impression that you're not particularly bothered by the fact
that the statement in J.2 is merely an "approximation". In my opinion,
the statement in J.2 is simply incorrect, and should be fixed. (That's
unlikely to be possible at this stage of the C23 process.) The fact
that Annex J is, to quote the standard's foreword, "for information
only", is not an excuse to ignore factual errors. Readers of the
standard rely on the informative annexes to provide correct information.
This particular text is not just a "(perhaps useful) approximation"; it
is actively misleading.
Like I said before, for now I offer no opinion on this question. I
wouldn't mind if a footnote were added to help mitigate the problem.
Post by Keith Thompson
I'm not criticizing the author of the standard for making this mistake.
Stuff happens. It was likely a result of an oversight during the
transition from C90 to C99.
After reading the various standards carefully, I believe the wording
in the J.2 entry was not just an oversight. I suspect there is
something deeper going on. In neither case, however, does it prompt
any specific reaction (i.e., in myself) as to what to do about it (if
anything).
Kaz Kylheku
2023-08-16 20:03:54 UTC
Permalink
Post by Keith Thompson
Post by Tim Rentsch
Does reading an uninitialized object [always] have undefined
behavior?
Thank you for taking the time to write that.
[ ... ]
Post by Keith Thompson
I'm not criticizing the author of the standard for making this mistake.
Stuff happens. It was likely a result of an oversight during the
transition from C90 to C99.
[Supersede attempt to reduce quoted material.]

I would be in favor of a formal model of what "uninitialized" means
which could be summarized as below.

Implementors wishing to develop tooling to catch uses of uninitialized
data can refer to the model; if their tooling diagnoses only
what the model deems undefined, then the tool can be integrated
into a conforming implementation.

- Certain objects are uninitialized, like auto variables without
an initializer, or new bytes coming from malloc or realloc.

- What is undefined behavior is when an uninitialized value is used
to make a control-flow decision, or when it is output, or otherwise
passed to the host environment.

- The formal model defines "uninitialized" in terms of there being,
in the abstract semantics, a "shadow value" corresponding to every
byte of a value, and that shadow value indicates whether the
corresponding byte is initialized or not.

- Shadow values propagate across copies, accesses and calculations.

- No special exception is needed for unsigned, other than that
it doesn't have trap representations.

- This would be undefined:

{
    int uninited;
    int *p = &uninited;
    int v = * (unsigned char *) p;

    if (v) ...          // undefined here

    printf("%d\n", v);  // undefined
}

No special blessing is required for unsigned char to access
the object. The resulting value keeps carrying the shadow byte
which indicates that it is uninitialized, and so when it is output,
or used for a control flow decision, the behavior is undefined.

memcpy can be written without outputting the bytes being copied,
and without allowing their values to control flow.

If a structure is copied with memcpy, and has uninitialized padding,
the shadow value model says that the destination object now
has uninitialized padding.

- When a value is obtained by accessing an object which has one
or more uninitialized bytes, the corresponding bytes of the
value are uninitialized.

- When a calculation has any operands that have one or more
uninitialized bytes, all bytes of the resulting value
are uninitialized.

E.g. if there is an int *p, which is used to access a value *p
where the low-order byte is initialized, then the low-order
byte of *p is initialized; the other bytes are uninitialized.
But in the value *p + 0, the entire value is uninitialized.
Implementations following the model don't have to track individual
bits or bytes through calculations. This could apply to type
conversions too: e.g. if *p is of type unsigned char, and
refers to an uninitialized byte, then the entire promoted
int (or possibly unsigned int) value is uninitialized:
all four bytes (or however many) of it.
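
Here is a hedged sketch of how such shadow-value bookkeeping could
look (all names are invented; this illustrates the model itself, not
Valgrind's or any compiler's actual implementation):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* A tracked int: one shadow flag per byte of the value. */
    typedef struct {
        int value;
        unsigned char init[sizeof(int)];   /* 1 = byte initialized */
    } shadow_int;

    static shadow_int make_uninit(void)
    {
        shadow_int s;
        s.value = 0;                   /* some bits; the flags say "uninit" */
        memset(s.init, 0, sizeof s.init);
        return s;
    }

    static shadow_int make_init(int v)
    {
        shadow_int s;
        s.value = v;
        memset(s.init, 1, sizeof s.init);
        return s;
    }

    /* Arithmetic with any uninitialized operand byte taints the result. */
    static shadow_int add(shadow_int a, shadow_int b)
    {
        shadow_int r = make_init(a.value + b.value);
        for (size_t i = 0; i < sizeof(int); i++)
            if (!a.init[i] || !b.init[i])
                memset(r.init, 0, sizeof r.init);
        return r;
    }

    /* "Use" points -- control flow, output -- are where UB is flagged. */
    static int use(shadow_int s, const char *where)
    {
        for (size_t i = 0; i < sizeof(int); i++)
            if (!s.init[i]) {
                fprintf(stderr, "undefined: uninitialized value at %s\n", where);
                exit(EXIT_FAILURE);
            }
        return s.value;
    }

    int main(void)
    {
        shadow_int u = make_uninit();
        shadow_int c = u;                    /* plain copy: flags propagate,
                                                no "use", so no diagnostic */
        shadow_int t = add(c, make_init(1)); /* result is tainted */
        printf("%d\n", use(t, "printf"));    /* diagnosed at the use point */
    }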
--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @***@mstdn.ca
Keith Thompson
2023-08-16 20:43:30 UTC
Permalink
Post by Kaz Kylheku
Post by Keith Thompson
Post by Tim Rentsch
Does reading an uninitialized object [always] have undefined
behavior?
Thank you for taking the time to write that.
[ ... ]
Post by Keith Thompson
I'm not criticizing the author of the standard for making this mistake.
Stuff happens. It was likely a result of an oversight during the
transition from C90 to C99.
[Supersede attempt to reduce quoted material.]
I would be in favor of a formal model of what "uninitialized" means
which could be summarized as below.
Implementors wishing to develop tooling to catch uses of uninitialized
data can refer to the model; if their tooling diagnoses only
what the model deems undefined, then the tool can be integrated
into a conforming implementation.
- Certain objects are uninitialized, like auto variables without
an initializer, or new bytes coming from malloc or realloc.
- What is undefined behavior is when an uninitialized value is used
to make a control-flow decision, or when it is output, or otherwise
passed to the host environment.
Why restrict it to those particular uses, rather than saying that any
attempt to read an uninitialized value has undefined behavior?

For example, something like:
{
    int uninit;
    int copy = uninit + 1;
}
might cause a hardware trap on some systems (for example Itanium if
uninit is stored in a register and the NaT bit is set).

[...]
--
Keith Thompson (The_Other_Keith) Keith.S.Thompson+***@gmail.com
Will write code for food.
void Void(void) { Void(); } /* The recursive call of the void */
Kaz Kylheku
2023-08-16 21:08:19 UTC
Permalink
Post by Keith Thompson
Post by Kaz Kylheku
Post by Keith Thompson
Post by Tim Rentsch
Does reading an uninitialized object [always] have undefined
behavior?
Thank you for taking the time to write that.
[ ... ]
Post by Keith Thompson
I'm not criticizing the author of the standard for making this mistake.
Stuff happens. It was likely a result of an oversight during the
transition from C90 to C99.
[Supersede attempt to reduce quoted material.]
I would be in favor of a formal model of what "uninitialized" means
which could be summarized as below.
Implementors wishing to develop tooling to catch uses of uninitialized
data can refer to the model; if their tooling diagnoses only
what the model deems undefined, then the tool can be integrated
into a conforming implementation.
- Certain objects are uninitialized, like auto variables without
an initializer, or new bytes coming from malloc or realloc.
- What is undefined behavior is when an uninitialized value is used
to make a control-flow decision, or when it is output, or otherwise
passed to the host environment.
Why restrict it to those particular uses, rather than saying that any
attempt to read an uninitialized value has undefined behavior?
Because that then brings back complications like

- unsigned char access has to be exempt

- what happens if we copy through intermediate values:

      int ch = *src++;   // *src is uninitialized, therefore so is ch
      *dst++ = ch;       // ch is uninitialized and not unsigned char

  Is the second access to ch uninitialized?

- structures: when a struct which has uninitialized padding is
  accessed, what happens: we need a rule saying that if those bytes
  are accessed, they are accessed as if by unsigned char.

The idea of trapping only control flow decisions or output is inspired
by Valgrind.

Valgrind does not "spaz out" just because an uninitialized value is
accessed, because it would result in useless false positives.

Not all of the reasoning applies to C; part of it is that Valgrind
works with machine code, with no source-language knowledge. The
basic idea makes sense though.

Valgrind usefully finds uninitialized data bugs, while allowing you to
write your own memcpy which can copy a structure full of uninitialized
bytes: and it does so without knowing anything about unsigned char.

We could make the rule that only visible behavior depending on
an uninitialized byte is undefined; the rule about control flow
makes it a bit tighter, while allowing the copying of uninited
data.
Post by Keith Thompson
{
    int uninit;
    int copy = uninit + 1;
}
might cause a hardware trap on some systems (for example Itanium if
uninit is stored in a register and the NaT bit is set).
Right, so the model above doesn't speak to traps. We still have those.

You can copy an object using unsigned char not because it's specially
blessed for access (other than in regard to aliasing rules), but because
it has no trap representation.

On a machine without traps, the above code would just result
in copy being uninitialized.

If that value isn't printed, or used in if, or switch, then it
doesn't matter.

If the type int has trap representations, then it's undefined on that
implementation; it's basically just a matter of luck whether uninit is a
trap or a value, so it has to be regarded as undefined.

I believe that the model can be used to implement useful diagnostics
even without realizing the actual shadow bytes. A subset of the
bugs can be diagnosed within a lexical scope, like uses of
uninitialized auto locals. When the compiler is doing data flow
analysis, it just propagates that uninited info around the program
graph. If an uninited data flow reaches certain nodes in the program
graph, like where control decisions are made or certain functions
are called that are known to pass the datum to the host environment,
then it can diagnose.
--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @***@mstdn.ca