Null-Terminated Strings [LWN.net]

Null-Terminated Strings

Posted Nov 17, 2010 4:37 UTC (Wed) by neilbrown (subscriber, #359) [Link] (13 responses)

I actually think nul terminated strings are simple and elegant and work.

The problem is that strcpy, strcat, and sprintf should never have existed. strlcpy etc. are much better interfaces when you have static or preallocated buffers.
If you want dynamic strings, then talloc_strdup and talloc_strdup_append etc (in libtalloc) are probably your friends, though I confess I haven't used them extensively.
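
As a rough sketch of what a bounded copy into a preallocated buffer can look like: strlcpy is not available in every C library, so this version uses the standard snprintf to get similar always-terminate, report-truncation behaviour. The name copy_bounded is hypothetical, not a libc function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not part of libc): copy src into dst, which has room
 * for dst_size bytes including the terminating nul. Returns true if the
 * whole string fit, false if it was truncated. When dst_size > 0, dst is
 * always nul-terminated on return. */
bool copy_bounded(char *dst, size_t dst_size, const char *src)
{
    assert(dst != NULL && src != NULL);
    if (dst_size == 0)
        return false;
    /* snprintf writes at most dst_size bytes, always terminates, and
     * returns the length the complete output would have had. */
    int needed = snprintf(dst, dst_size, "%s", src);
    return needed >= 0 && (size_t)needed < dst_size;
}
```

The caller still has to check the return value; silently ignoring truncation just trades a buffer overflow for a wrong result.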

strlcpy

Posted Nov 17, 2010 6:26 UTC (Wed) by ncm (guest, #165) [Link]

Sorry, strlcpy is a failure. It takes more and uglier code to use it correctly than to use strcpy with the same level of checking. As a consequence, it is rarely used correctly, and unprofitably when it is. This is not to say that one cannot improve on strcpy, just that strlcpy doesn't.

Null-Terminated Strings

Posted Nov 17, 2010 8:26 UTC (Wed) by nix (subscriber, #2304) [Link] (6 responses)

nul terminated strings have a huge problem: you can accidentally overwrite the nul, and then you're dead. But if you're doing that, you can accidentally overwrite bits of the inside of the string as well, and then you have wrong results! Is that better? Probably not. Oh, and overwriting off the start or end can break your memory allocator or stack frame anyway, so you'd be dead in any case, even if not using nul-terminated strings.

And then we have Pascal-layout strings (as opposed to actual Pascal 'strings', a nightmare for other reasons, see Kernighan). They don't fix this problem (you just have to overwrite the start of the string, not its end) and have two much bigger problems: finite string length, and an increase in size of every string. The finite string length means that writing general string-handling algorithms without special cases for the rare event of large strings is impossible, and the increase in size of every string bloats small strings, which are by far the common case. You can patch both of these: the first, by making the finite string length as large as a pointer; and the second, by noting that alignment constraints in existing systems bloat the effective size of strings anyway. But of course this soon turns into a special case of nul-terminated strings: point the pointer at the end of the string, bingo, one rather hard-to-consult nul by any other name.
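
To make the size trade-off concrete, here is a minimal sketch of a Pascal-layout string whose length header is as wide as a pointer-sized integer. The type and function names (struct pstr, pstr_new, pstr_overhead) are made up for illustration and do not come from any existing library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical length-prefixed ("Pascal-layout") string: a size_t header
 * followed directly by the bytes. No terminator is stored. */
struct pstr {
    size_t len;     /* number of bytes in data */
    char   data[];  /* C99 flexible array member */
};

/* Build a pstr from len bytes at src. Returns NULL on allocation failure
 * or size overflow. The caller releases the result with free(). */
struct pstr *pstr_new(const char *src, size_t len)
{
    assert(src != NULL || len == 0);
    if (len > SIZE_MAX - sizeof(struct pstr))
        return NULL;
    struct pstr *s = malloc(sizeof *s + len);
    if (s == NULL)
        return NULL;
    s->len = len;
    if (len > 0)
        memcpy(s->data, src, len);
    return s;
}

/* Per-string header cost, to compare against the single byte that a nul
 * terminator costs. */
size_t pstr_overhead(void)
{
    return sizeof(struct pstr);
}
```

With a header this wide the length limit stops being a practical concern, but every string, however short, pays the full header, which is the bloat described above.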

The biggest downside is probably a long-term ABI problem. The scheme is inflexible. If your Pascal string-length header is too short, however do you expand it? It's wired into every string-using program out there! At least nul-terminated strings need no expansion.

The real solution to string-handling unfortunately requires a VM of some description which can prevent the program from accidentally overwriting fields in aggregates by writes to any other field or variable. Then you can do reliable Pascal strings, separating the length from the content, or reliable null-terminated strings, with the separate compartment containing a pointer into the string. Unfortunately this is incompatible with low-level all-the-world's-a-giant-arena languages like C without very specialized fine-grained MMU hardware.

(I have, like everyone, written my own dynamic string-handling library when younger. It starts out simple but it's amazing how soon you have to introduce extra code to track pointers and make freeing them in error cases less verbose, and extra code to track memory leaks... you need that in C anyway of course but the massive increase in dynamic memory use that dynamically allocating most strings brings tends to force them on you sooner than otherwise.)

Null-Terminated Strings

Posted Nov 17, 2010 12:38 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link] (4 responses)

"The biggest downside is probably a long-term ABI problem. The scheme is inflexible. If your Pascal string-length header is too short, however do you expand it? It's wired into every string-using program out there! At least nul-terminated strings need no expansion."

Come on.

You'd naturally use 32 bits on 32-bit systems for string length. And by a strange coincidence, that's the maximum amount of contiguous RAM that you can address on 32-bit systems. On 64-bit systems, you'd naturally use a 64-bit counter.

Null-Terminated Strings

Posted Nov 18, 2010 16:40 UTC (Thu) by nix (subscriber, #2304) [Link] (3 responses)

Yes, I talked about that as well. Nice to know that reading five paragraphs of text is too much for you.

Null-Terminated Strings

Posted Nov 18, 2010 17:04 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link] (2 responses)

I read it completely. However, your point about: "The biggest downside is probably a long-term ABI problem. The scheme is inflexible. If your Pascal string-length header is too short, however do you expand it? It's wired into every string-using program out there! At least nul-terminated strings need no expansion" is not correct.

Null-Terminated Strings

Posted Nov 25, 2010 13:10 UTC (Thu) by renox (guest, #23785) [Link] (1 responses)

I agree with him that Pascal's strings are inflexible: think about two computers communicating with each other, one with a 32-bit CPU and one with a 64-bit CPU. If you use a word as the length, you have an issue with Pascal's strings, but C strings don't care.

Null-Terminated Strings

Posted Nov 25, 2010 16:22 UTC (Thu) by vonbrand (guest, #4458) [Link]

No, you haven't... (Original) Pascal "strings" were just (packed) arrays of characters of a fixed length.

/me ducks and runs for cover

Null-Terminated Strings

Posted Nov 19, 2010 11:24 UTC (Fri) by job (guest, #670) [Link]

A modern string handling library would have to handle different character sets and different encodings as well, so there's already metadata to be stored with every string.

If memory efficiency is a problem for you, multibyte encodings are a much worse problem than storing string length. But UTF-8/16 is here to stay, there is simply no competition. I think we have to accept it.

Null-Terminated Strings

Posted Nov 18, 2010 9:50 UTC (Thu) by stijn (subscriber, #570) [Link]

Does it not give an easy fuzzing attack? For anything that parses an input stream, the presence of nul bytes in that stream can lead to very unpredictable results unless one is really careful. Additionally, it is painful to have a string version (str) and a byte array version (mem) of everything, especially with a richer API (e.g. splice(), substr(), squash()). I've come to the conclusion that keeping length alongside the array is the only sane solution. Perhaps that already commits it too much down one path, so that it does not properly belong in the C library. By now I think the best is to have a byte-array API, and leave it up to the user of that API whether they want to keep it C-string compatible. If the keeping-length overhead is unacceptable, it is possible to do the string manipulations painlessly with the more generic API, and isolate a classic C-string as the very last step.
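
A small sketch of that idea, assuming a made-up struct bytes type and helper names: the length travels with the data, nul bytes in the input are ordinary data, and a classic C string is only produced as the very last step.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical counted byte buffer (names are illustrative only). */
struct bytes {
    const unsigned char *ptr;
    size_t len;
};

/* Count occurrences of byte c; works for c == 0 as well, because the loop
 * is bounded by len rather than by a terminator. */
size_t bytes_count(struct bytes b, unsigned char c)
{
    size_t n = 0;
    for (size_t i = 0; i < b.len; i++)
        if (b.ptr[i] == c)
            n++;
    return n;
}

/* Last step: isolate a classic C string. Returns NULL if the data has an
 * embedded nul (a C string could not represent it) or if allocation
 * fails. The caller frees the result. */
char *bytes_to_cstr(struct bytes b)
{
    if (b.len > 0 && memchr(b.ptr, 0, b.len) != NULL)
        return NULL;
    if (b.len == SIZE_MAX)
        return NULL;
    char *s = malloc(b.len + 1);
    if (s == NULL)
        return NULL;
    if (b.len > 0)
        memcpy(s, b.ptr, b.len);
    s[b.len] = '\0';
    return s;
}
```

Rejecting embedded nuls at the conversion boundary is one possible policy; the point is that the decision is made once, explicitly, instead of implicitly by every str* call.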

Null-Terminated Strings

Posted Nov 18, 2010 15:38 UTC (Thu) by etienne (guest, #25256) [Link] (3 responses)

> I actually think nul terminated strings are simple and elegant and work.

And you can also combine them to do things like:
enum {lang_english, lang_french, lang_german} current_language = lang_french;
const char mltstr_language[] = "english\0francais\0deutsch\0";
const char *curlang(const char *mltstr)
{
    /* select the right sub-string depending on current_language */
}
void fct(void)
{
    printf("LANG=%s", curlang(mltstr_language));
}
It saves *a lot of space*: having separate strings with (aligned) pointer arrays everywhere, and worse, an (aligned) size field for Pascal strings, easily takes more memory than the program code and data altogether.

Null-Terminated Strings

Posted Nov 18, 2010 17:24 UTC (Thu) by pr1268 (subscriber, #24648) [Link] (2 responses)

I like your code example, but it might only work in C (not C++).

Two cases in point:

Using the enum value as an array index might give unpredictable results since C++ treats enumerations as a distinct type (instead of int as in C)¹
The C++ standard library string can have '\0' characters anywhere inside the string (which may also lead to unpredictable behavior at runtime)². Of course, you're referring to a C-style string, so this may be a moot point.

¹ Stroustrup, B. The C++ Programming Language, Special Edition, p. 77
² Ibid, p. 583

Null-Terminated Strings

Posted Nov 19, 2010 10:58 UTC (Fri) by etienne (guest, #25256) [Link]

The enum is only used like this (so that an empty substring defaults to English), so there is no problem with its C++ size:

const char *curlang(const char *mltstr)
{
    const char *ptr = mltstr;
    for (unsigned cptlang = 0; cptlang < current_language; cptlang++)
        while (*ptr++) {}
    return (*ptr) ? ptr : mltstr;
}

Obviously none of the substrings can have an embedded zero char.

A C++ line of code like:
cout << "The " << (big ? "big " : "small ") << "dog is " << age << " year old.";
needs efficient storage for small strings, even more so in multi-language software.

Null-Terminated Strings

Posted Nov 20, 2010 1:03 UTC (Sat) by cmccabe (guest, #60281) [Link]

> I like your code example, but it might only work in C (not C++).

Sorry, you are confused. It works in both C and C++.

> Using the enum value as an array index might give unpredictable results
> since C++ treats enumerations as a distinct type (instead of int as in C)¹

Nope.

Here the enum is promoted to an integer. C++, like C, promotes a lot of types to integers under the right situations.

> The C++ standard library string can have '\0' characters anywhere inside
> the string (which may also lead to unpredictable behavior at runtime)². Of
> course, you're referring to a C-style string, so this may be a moot point.

There is no std::string in this example. You are confused.

Null-Terminated Strings

Posted Nov 18, 2010 12:44 UTC (Thu) by Kwi (subscriber, #59584) [Link] (2 responses)

A better suggestion might be D-style strings, which are dynamic arrays of char. In D, a dynamic array is a (pointer, length) tuple. This gives you the ability to work on substrings without having to allocate new memory, since a substring is nothing more than a new reference to the same character data.

(Incidentally, Java strings work the same way behind the scenes, but are immutable, which I guess is what you object to when you call them inefficient?)

Of course, one problem with this suggestion is that it doubles the size of a string reference (8 bytes on 32-bit architectures, 16 bytes on 64-bit architectures).
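
For illustration, the (pointer, length) idea can be sketched in C roughly as follows. The struct slice type and its helpers are hypothetical and only modelled on the description above; this is not D's actual runtime representation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical (pointer, length) string reference. */
struct slice {
    const char *ptr;
    size_t len;
};

/* View an existing C string as a slice; nothing is copied. */
struct slice slice_from_cstr(const char *s)
{
    assert(s != NULL);
    struct slice r = { s, strlen(s) };
    return r;
}

/* Substring [start, end) of s. The result shares storage with s, so no
 * allocation happens. Out-of-range bounds are clamped to the slice. */
struct slice slice_sub(struct slice s, size_t start, size_t end)
{
    if (end > s.len)
        end = s.len;
    if (start > end)
        start = end;
    struct slice r = { s.ptr + start, end - start };
    return r;
}
```

Taking a substring is just pointer arithmetic, but each reference now carries two words instead of one, and the result is not nul-terminated, so it cannot be handed directly to functions that expect a C string.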

Null-Terminated Strings

Posted Nov 18, 2010 16:14 UTC (Thu) by pr1268 (subscriber, #24648) [Link] (1 responses)

> (Incidentally, Java strings work the same way behind the scenes, but are immutable, which I guess is what you object to when you call them inefficient?)

Exactly. And my semi-rhetorical question immediately after that ("is there a better way?") raises the question of whether the Sun engineers who developed the Java language imposed that immutability for thread safety (since thread safety was/is a primary goal of the Java language). I don't know for sure; just going off intuition here.

Null-Terminated Strings

Posted Nov 19, 2010 10:07 UTC (Fri) by mfedyk (guest, #55303) [Link]

Python has this immutable storage for values as well, but then they stopped and made the GIL...

Null-Terminated Strings

Posted Nov 18, 2010 18:03 UTC (Thu) by nevyn (guest, #33129) [Link]

I do: http://www.and.org/ustr/

I think it solves almost all the "normal" problems people have with non-nil terminated strings:

1. You can easily allocate them on the stack.

2. You can easily allocate them in constant memory.

3. "" and "x" don't have overhead in the 1,000% range (depending on how you count).

...but still has good solutions to the nil terminated strings problems, in that it allows you to know the allocated size and length used (and put \0 in your string).

Saying that, the solution was far from obvious ... so while I think it would have been usable in the 1970s, using NIL terminated strings was much more obvious.