Discussion:
May a string span multiple, independent objects?
Vincent Lefevre
2024-07-03 14:31:27 UTC
ISO C17 (and C23 draft) 7.1.1 defines a string as follows: "A string
is a contiguous sequence of characters terminated by and including
the first null character."

But may a string span multiple, independent objects that happen
to be contiguous in memory?

For instance, is the following program valid and what does the ISO C
standard say about that?

#include <stdio.h>
#include <string.h>

typedef char *volatile vp;

int main (void)
{
  char a = '\0', b = '\0';
  vp p = &a, q = &b;

  printf ("%p\n", (void *) p);
  printf ("%p\n", (void *) q);
  if (p + 1 == q)
    {
      a = 'x';
      printf ("%zu\n", strlen (p));
    }
  if (q + 1 == p)
    {
      b = 'x';
      printf ("%zu\n", strlen (q));
    }
  return 0;
}

If such a program is valid, would there be issues with using
pointers into such a string, say, dereferencing p[1] in the first "if"
(which is normally UB)?
--
Vincent Lefèvre <***@vinc17.net> - Web: <https://www.vinc17.net/>
100% accessible validated (X)HTML - Blog: <https://www.vinc17.net/blog/>
Work: CR INRIA - computer arithmetic / AriC project (LIP, ENS-Lyon)
Hans-Bernhard Bröker
2024-07-03 15:23:24 UTC
Post by Vincent Lefevre
ISO C17 (and C23 draft) 7.1.1 defines a string as follows: "A string
is a contiguous sequence of characters terminated by and including
the first null character."
But may a string span multiple, independent objects that happen
to be contiguous in memory?
For instance, is the following program valid and what does the ISO C
standard say about that?
Comparing pointers pointing at distinct objects is already invalid (for
some interpretation of "invalid"), so: no. Yes, that means the
implementation of a function like memmove() cannot be fully portable C.

A program whose correctness relies on things "happening to be" just like
so cannot possibly be entirely valid.
Vincent Lefevre
2024-07-03 15:37:49 UTC
Post by Hans-Bernhard Bröker
Comparing pointers pointing at distinct objects is already invalid (for
some interpretation of "invalid"), so: no. Yes, that means the
implementation of a function like memmove() cannot be fully portable C.
This is valid in C17 (6.5.9p6). If this is no longer valid, this
is really bad. GMP and MPFR fully rely on comparison of pointers
that may point to unrelated objects (to know whether arguments
are pointers to identical or different objects), and I do not think
there is another way to avoid useless copies of input
objects.
James Kuyper
2024-07-03 16:11:25 UTC
Post by Hans-Bernhard Bröker
Comparing pointers pointing at distinct objects is already invalid (for
some interpretation of "invalid")
Comparison of valid pointers that point at distinct objects has
well-defined behavior. Such operations would be pretty useless if that
weren't the case, since they only compare the locations of those objects
- if they were allowed only for objects that are not distinct, the
locations would necessarily be the same, so, ==, <= and >= would always
return true, and !=, <, and > would always return false.

Furthermore, comparison for equality (as opposed to comparison for
relative order) is permitted even for objects that aren't sub-objects of
the same larger object. The problems with such code involve
incrementing and dereferencing such pointers, not comparing them.
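
To illustrate the kind of equality test mentioned earlier in the thread for detecting identical arguments, here is a minimal sketch. The type num_t and the function copy_if_distinct are hypothetical names made up for this example; they are not taken from GMP or MPFR. The sketch only relies on == and != being defined for pointers to unrelated objects.

```c
#include <stddef.h>

/* Hypothetical fixed-size number type, for illustration only. */
typedef struct {
    unsigned long limb[4];
} num_t;

/* Copy *src into *dst unless both pointers designate the same object.
   The != comparison is defined even when dst and src point to
   unrelated objects.  Returns 1 if a copy was made, 0 otherwise. */
int copy_if_distinct(num_t *dst, const num_t *src)
{
    if (dst != src) {
        *dst = *src;   /* the arguments are different objects */
        return 1;
    }
    return 0;          /* same object: nothing to copy */
}
```

No pointer arithmetic and no relational operator is involved, so the concerns about incrementing and dereferencing do not arise here.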
Tim Rentsch
2024-08-08 15:51:26 UTC
Post by Hans-Bernhard Bröker
Comparing pointers pointing at distinct objects is already invalid
(for some interpretation of "invalid"), [...].
Using relational operators (<, <=, >, >=) to compare pointers to
distinct objects has defined behavior only if the pointed-to objects
are sub-objects of the same containing object.

Equality operators (==, !=) may be used, with defined behavior,
to compare pointers to any objects, regardless of whether the
pointed-to objects belong to an enclosing containing object.
Post by Hans-Bernhard Bröker
Yes, that means the implementation of a function like memmove()
cannot be fully portable C.
The function memmove() can be defined in fully portable C. It's
just more convenient to write it in a way that takes advantage of
implementation internals.
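
One possible way to do this, shown only as a sketch and not necessarily the approach intended above, is to detect the overlap direction with equality comparisons alone. The name portable_memmove is made up for this example. The scan costs O(n) extra comparisons, which is the price of avoiding a relational comparison between pointers that may belong to different objects.

```c
#include <stddef.h>

/* Sketch of memmove using only equality comparisons between pointers.
   Every s + i with i < n points inside the source, so comparing it
   with d using == is defined even if d belongs to another object. */
void *portable_memmove(void *dest, const void *src, size_t n)
{
    unsigned char *d = dest;
    const unsigned char *s = src;
    int dest_inside_src = 0;

    /* Does dest point at one of the n source bytes? */
    for (size_t i = 0; i < n; i++) {
        if (s + i == d) {
            dest_inside_src = 1;
            break;
        }
    }

    if (dest_inside_src) {
        /* dest starts inside the source: copy backward so that no
           source byte is overwritten before it has been read. */
        for (size_t i = n; i > 0; i--)
            d[i - 1] = s[i - 1];
    } else {
        /* No overlap, or dest lies before the source: copy forward. */
        for (size_t i = 0; i < n; i++)
            d[i] = s[i];
    }
    return dest;
}
```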

James Kuyper
2024-07-03 15:59:06 UTC
Post by Vincent Lefevre
ISO C17 (and C23 draft) 7.1.1 defines a string as follows: "A string
is a contiguous sequence of characters terminated by and including
the first null character."
But may a string span multiple, independent objects that happen
to be contiguous in memory?
If they're truly independent, you cannot portably guarantee that they
are contiguous, but they might happen to be contiguous.

If they happen to be contiguous, they can together qualify as a string,
but there's very little that can usefully be done with such a string.
That's because if you start with a pointer to one array, and increment
it until it points one past the end of that array, it is permitted for
that pointer to be compared for equality to a pointer to the start of
another array, and it will compare true if and only if they are
contiguous. However, it is undefined behavior to dereference such a
pointer, or to increment it even one step further. Therefore, any code
that tries to do anything useful with such an accidental string will
generally have undefined behavior.
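
As a small sketch of the part that is permitted, the following helper (starts_right_after is a made-up name) forms the one-past-the-end pointer and compares it for equality, and does nothing else with it.

```c
#include <stdbool.h>

/* True if `second` points exactly where the one-past pointer of
   `first` points.  Forming first + 1 and comparing it with == are
   both defined.  When `first` points to a lone char object, the
   one-past pointer must not be dereferenced or incremented further,
   and this function does neither. */
bool starts_right_after(const char *first, const char *second)
{
    const char *one_past = first + 1;   /* valid to form */
    return one_past == second;          /* valid to compare */
}
```

A true result says only that the two addresses coincide; it does not make it valid to read the second object through a pointer derived from the first.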

While, in principle, undefined behavior could be arbitrarily bad, in
many cases this will not cause problems except on an implementation that
does run-time bounds checking of pointers, for instance raising a signal
if the behavior is undefined. Run-time bounds checking would be very
slow, so it would probably only be turned on in a debugging mode.

Far more likely is a much more subtle possibility. Any time that code
has undefined behavior, an implementation might perform optimizations
based upon the assumption that you will not write such code.
Specifically, consider two pointers, one of which started out pointing
into one array, but was incremented to the point where the behavior was
undefined, and ended up pointing into a second array. The other pointer
started out pointing into that second array, and still does. They
currently both point at the same location. Because the behavior of such
code is undefined, an implementation is not obliged to make sure that
reads and writes through the two pointers are synchronized. If you have
*p = 'z', there's no guarantee that subsequently *q == 'z', even though
p and q both currently point at the same location. The 'z' might, for
instance, still be stored in a register waiting to be written to the
actual memory location at some later time.
Post by Vincent Lefevre
For instance, is the following program valid and what does the ISO C
standard say about that?
#include <stdio.h>
#include <string.h>
typedef char *volatile vp;
int main (void)
{
char a = '\0', b = '\0';
a and b are not guaranteed to be contiguous.
Post by Vincent Lefevre
vp p = &a, q = &b;
printf ("%p\n", (void *) p);
printf ("%p\n", (void *) q);
if (p + 1 == q)
{
That comparison is legal, and has well-defined behavior. It will be true
only if they are in fact contiguous.
Post by Vincent Lefevre
a = 'x';
printf ("%zd\n", strlen (p));
Because strlen() must take a pointer to 'a' (which is treated, for these
purposes, as an array of char of length 1), and increment it one past the
end of that array, and then dereference that pointer to check whether it
points at a null character, the behavior is undefined.

...
Post by Vincent Lefevre
If such a program is valid, would there be issues by working with
pointers on such a string, say, dereferencing p[1] in the first "if"
(which is normally UB)?
Yes.
Ben Bacarisse
2024-07-03 21:08:39 UTC
Post by James Kuyper
Because strlen() must take a pointer to 'a' (which is treated, for these
purposes, as an array of char of length 1), and increment it one past the
end of that array, and then dereference that pointer to check whether it
points at a null character, the behavior is undefined.
I think this is slightly misleading. It suggests that the UB comes from
something strlen /must/ do, but strlen must be thought of as a black
box. We can't base anything on an assumed implementation.

But our conclusion is correct because there is explicit wording covering
this case. The section on "String function conventions" (7.24.1)
states:

"If an array is accessed beyond the end of an object, the behavior is
undefined."
--
Ben.
James Kuyper
2024-07-03 21:36:09 UTC
...
Post by James Kuyper
Because strlen() must take a pointer to 'a' (which is treated, for these
purposes, as an array of char of length 1), and increment it one past the
end of that array, and then dereference that pointer to check whether it
points at a null character, the behavior is undefined.
Post by Ben Bacarisse
I think this is slightly misleading. It suggests that the UB comes from
something strlen /must/ do, but strlen must be thought of as a black
box. We can't base anything on an assumed implementation.
But our conclusion is correct because there is explicit wording covering
this case. The section on "String function conventions" (7.24.1)
states:
"If an array is accessed beyond the end of an object, the behavior is
undefined."
I'd forgotten about that; in fact, I'm not sure I've ever been aware of
that clause - but it makes sense. While C standard library routines
don't have to be written in C (many of them are in the standard library
precisely because they cannot be written in portable C), most of them
have limitations that make sense if you consider how they would have to
behave if written in C.
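
For illustration only, this is roughly what a length function would look like if written in plain C. It is a sketch under the name naive_strlen, not the code of any actual library, and a real strlen may work quite differently.

```c
#include <stddef.h>

/* A straightforward C rendering of a string-length function.
   The loop dereferences p at every position up to and including the
   terminating null character, so every one of those positions has to
   lie inside the array that s points into. */
size_t naive_strlen(const char *s)
{
    const char *p = s;
    while (*p != '\0')
        p++;
    return (size_t)(p - s);
}
```

Called on a pointer to a lone char that does not hold a null character, such a loop would have to step past the end of that one-element array, which matches the restriction the standard places on the library function.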
Vincent Lefevre
2024-07-04 13:22:26 UTC
Post by Ben Bacarisse
I think this is slightly misleading. It suggests that the UB comes from
something strlen /must/ do, but strlen must be thought of as a black
box. We can't base anything on an assumed implementation.
I agree (and note that strlen is not necessarily written in C).
Post by Ben Bacarisse
But our conclusion is correct because there is explicit wording covering
this case. The section on "String function conventions" (7.24.1)
states:
"If an array is accessed beyond the end of an object, the behavior is
undefined."
Arguments of these functions are either arrays or strings, where a
string is not defined as being an array (or a part of an array). So
I don't see why this text, as written, would apply to strings.

BTW, the definition of an object is rather vague: "region of data
storage in the execution environment, the contents of which can
represent values". But it is not excluded that contiguous areas
can form an object.

Similarly, malloc() is specified as allocating space for an object,
but this does not mean that one initially has an object in the
allocated space, though with the above restriction, this would
be important to be able to use memset() on this storage area.
Ben Bacarisse
2024-07-05 04:14:52 UTC
Post by Vincent Lefevre
Arguments of these functions are either arrays or strings, where a
string is not defined as being an array (or a part of an array). So
I don't see why this text, as written, would apply to strings.
I'd say because an object like a (or b) is considered to be an array of
length one.
Post by Vincent Lefevre
BTW, the definition of an object is rather vague: "region of data
storage in the execution environment, the contents of which can
represent values". But it is not excluded that contiguous areas
can form an object.
Indeed. In fact an array is an object made up of contiguous objects,
but &a points to an array of length one.
Post by Vincent Lefevre
Similarly, malloc() is specified as allocating space for an object,
but this does not mean that one initially has an object in the
allocated space, though with the above restriction, this would
be important to be able to use memset() on this storage area.
I think you have an object as soon as all the storage is allocated.
--
Ben.
James Kuyper
2024-07-05 05:37:35 UTC
Post by Vincent Lefevre
Arguments of these functions are either arrays or strings, where a
string is not defined as being an array (or a part of an array). So
I don't see why this text, as written, would apply to strings.
BTW, the definition of an object is rather vague: "region of data
storage in the execution environment, the contents of which can
represent values". But it is not excluded that contiguous areas
can form an object.
Not everything you need to know about a term defined in the C standard
is included in its definition. Other parts of the standard tell you that
objects are created by declarations of identifiers for those objects
with static, thread_local, or automatic storage duration. Other parts
tell you that anonymous objects can be created by the presence of string
or compound literals. The description of the standard library tells you
that objects with allocated storage duration are created by calling
memory allocation functions.

Nowhere does it say that a larger C object can be created simply by
having two C objects that happen to be adjacent to each other.

The basic rule, even though it is not explicitly part of the definition
of "object", is that you don't have a C object unless some clause of the
C standard tells you that it is an object, and the clauses I've
summarized above are the only ones that do so.

Note: if they don't just "happen" to be adjacent - if the C standard
guarantees that two objects are adjacent to each other by reason of
being sub-objects of some larger object - then the existence of that
larger object is what makes the behavior defined when incrementing a
pointer into the first object through the second.
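
A minimal sketch of that guaranteed case, using a made-up helper named length_within_one_array: the characters are elements of a single array, so a string running across them lies entirely within one object.

```c
#include <stddef.h>
#include <string.h>

/* Both characters are elements of one array object, so the standard
   guarantees they are adjacent, and strlen may walk from buf[0]
   through buf[1] to the terminator at buf[2]. */
size_t length_within_one_array(char first, char second)
{
    char buf[3] = { first, second, '\0' };
    return strlen(buf);
}
```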
Post by Vincent Lefevre
Similarly, malloc() is specified as allocating space for an object,
but this does not mean that one initially has an object in the
Actually, it does. "The lifetime of an allocated object extends from the
allocation until the deallocation." (7.24.3p1). It becomes an object as
soon as allocated.

"The effective type of an object for an access to its stored value is
the declared type of the object, if any." (6.5p6).

Note that allocated memory is the only kind that doesn't start out with
a declared type. That paragraph goes on to say that

"If a value is stored into an object having no declared type through an
lvalue having a type that is not a non-atomic character type, then the
type of the lvalue becomes the effective type of the object for that
access and for subsequent accesses that do not modify the stored value."

Note that this wording describes it as already being an object before
any value has been written into the allocated memory. The second way to
give allocated memory an effective type uses wording with that same
implication:

"If a value is copied into an object having no declared type using
memcpy or memmove, or is copied as an array of character type, then the
effective type of the modified object for that access and for subsequent
accesses that do not modify the value is the effective type of the
object from which the value is copied, if it has one."
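
A short sketch of the memcpy route described in that last quotation, using a hypothetical helper named boxed_double. The allocated storage has no declared type, and copying a double into it with memcpy makes double its effective type for the later read.

```c
#include <stdlib.h>
#include <string.h>

/* Allocate storage and copy a double into it.
   Returns NULL if the allocation fails; the caller frees the result. */
double *boxed_double(double value)
{
    double *box = malloc(sizeof *box);   /* storage with no declared type */
    if (box == NULL)
        return NULL;
    memcpy(box, &value, sizeof value);   /* copied from a double object */
    return box;
}
```

Reading the result back through a double lvalue, as in `*box`, is then an access with the matching effective type.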
Tim Rentsch
2024-08-08 15:35:04 UTC
Post by Vincent Lefevre
Arguments of these functions are either arrays or strings, where
a string is not defined as being an array (or a part of an array).
So I don't see why this text, as written, would apply to strings.
Something that's important to understand is the C standard is not
meant to be read as legalese or mathematicalese. Certainly the
authors are making an effort to be precise, but not always to the
degree that every sentence is entirely correct, or presenting the
whole story, if considered just in isolation. To avoid being led
astray it helps to remember that and try to read holistically in
addition to reading passages individually.

In any case, the question here is easily resolved by noting the
description in paragraph 1 of 7.24.1 "String function conventions",
which says in part

The header <string.h> declares one type and several functions,
and defines one macro useful for manipulating arrays of
character type and other objects treated as arrays of character
type. [...] Various methods are used for determining the
lengths of the arrays, but in all cases a char * or void *
argument points to the initial (lowest addressed) character of
the array.

Note especially the second part of the last sentence, starting with
"but in all cases". Arguments to functions in <string.h> always
refer to arrays, regardless of whether they might also refer to
strings.
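
One way to stay inside the array in question is to pass its size along with the pointer. The sketch below uses a made-up helper named bounded_length built on the standard memchr; the caller is assumed to supply a size that does not exceed the array s points into.

```c
#include <stddef.h>
#include <string.h>

/* Length of the string starting at s, looking at no more than `size`
   characters.  Returns `size` if no null character is found within
   that range. */
size_t bounded_length(const char *s, size_t size)
{
    const char *nul = memchr(s, '\0', size);
    return nul != NULL ? (size_t)(nul - s) : size;
}
```

For a lone char object the size is 1, so the search never looks at whatever happens to follow it in memory.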
Kaz Kylheku
2024-07-05 07:14:43 UTC
Post by Vincent Lefevre
ISO C17 (and C23 draft) 7.1.1 defines a string as follows: "A string
is a contiguous sequence of characters terminated by and including
the first null character."
But may a string span multiple, independent objects that happen
to be contiguous in memory?
It is undefined behavior. Implementations are allowed to track the
provenance of a displaced pointer, and diagnose when it is out of bounds
even if the displaced value points into a valid object, and even if the
program validates that via a well-defined equality test.
Post by Vincent Lefevre
For instance, is the following program valid and what does the ISO C
standard say about that?
#include <stdio.h>
#include <string.h>
typedef char *volatile vp;
int main (void)
{
char a = '\0', b = '\0';
vp p = &a, q = &b;
printf ("%p\n", (void *) p);
printf ("%p\n", (void *) q);
if (p + 1 == q)
{
a = 'x';
printf ("%zd\n", strlen (p));
}
In this situation, the p + 1 expression is well-defined, as is
the p + 1 == q test.

However, while *q is a valid expression that evaluates to zero,
*(p + 1) isn't valid. A pointer value that points one byte past the
object may not be dereferenced.

The equivalence p + 1 == q doesn't save it; p + 1 is displaced from p,
unrelated to q.
Post by Vincent Lefevre
if (q + 1 == p)
{
b = 'x';
printf ("%zd\n", strlen (q));
}
return 0;
}
If such a program is valid, would there be issues by working with
pointers on such a string, say, dereferencing p[1] in the first "if"
(which is normally UB)?
An issue could be that the implementation's optimizer assumes that
p + 1 and q are pointers to distinct objects, even in the middle
of a block of code that is conditional on p + 1 == q.

If the code executes *(p + 1) = 'a', a subsequent evaluation of
*q or b can still produce 0.
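
A sketch of the defined alternative, with a made-up helper named store_if_adjacent: the equality test is used only as a condition, and the store goes through q, the pointer that was derived from the object being written.

```c
#include <stdbool.h>

/* If the one-past pointer of p compares equal to q, store `value`
   into the object q points to.  The access uses q, never p + 1.
   Returns true if the store happened. */
bool store_if_adjacent(char *p, char *q, char value)
{
    if (p + 1 == q) {
        *q = value;   /* q points to the object being modified */
        return true;
    }
    return false;
}
```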
--
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @***@mstdn.ca