Null-Terminated Strings
Posted Nov 17, 2010 3:23 UTC (Wed) by pr1268 (subscriber, #24648) In reply to: Null-Terminated Strings by ldo
Parent article: Ghosts of Unix past, part 3: Unfixable designs
Do you have a better suggestion? Pascal-style strings?
While I agree that C-style strings are bothersome at times, there just doesn't seem to be any better alternative. And never mind that Java's Strings are hideously inefficient (but again, is there a better way?).
I don't mean to argue; I'm just playing devil's advocate here. I honestly don't know myself whether there could have been a better way to do character strings way back in the day.
Posted Nov 17, 2010 4:37 UTC (Wed)
by neilbrown (subscriber, #359)
[Link] (13 responses)
The problem is that strcpy, strcat, and sprintf should never have existed. strlcpy etc. are much better interfaces when you have static or preallocated buffers.
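A minimal sketch of the bounded-copy idea, assuming only standard C: since strlcpy is not available in every libc, this uses snprintf to get the same "never overrun, always terminate, report truncation" behaviour. The helper name copy_bounded is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical helper: copy src into dst, which holds dst_size bytes.
 * The result is always NUL-terminated. Returns true if the whole string
 * fit, false if it had to be truncated. */
static bool copy_bounded(char *dst, size_t dst_size, const char *src)
{
    assert(dst != NULL && src != NULL && dst_size > 0);
    /* snprintf writes at most dst_size bytes, terminator included, and
     * returns the length the complete output would have had. */
    int needed = snprintf(dst, dst_size, "%s", src);
    return needed >= 0 && (size_t)needed < dst_size;
}
```

With an 8-byte buffer, copying "short" succeeds, while copying "much too long" leaves "much to" in the buffer and reports truncation, so the caller can decide what to do instead of silently overflowing.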
Posted Nov 17, 2010 6:26 UTC (Wed)
by ncm (guest, #165)
[Link]
Posted Nov 17, 2010 8:26 UTC (Wed)
by nix (subscriber, #2304)
[Link] (6 responses)
And then we have Pascal-layout strings (as opposed to actual Pascal 'strings', a nightmare for other reasons, see Kernighan). They don't fix this problem (you just have to overwrite the start of the string, not its end) and have two much bigger problems: finite string length, and an increase in size of every string. The finite string length means that writing general string-handling algorithms without special cases for the rare event of large strings is impossible, and the increase in size of every string bloats small strings, which are by far the common case. You can patch both of these: the first, by making the finite string length as large as a pointer; and the second, by noting that alignment constraints in existing systems bloat the effective size of strings anyway. But of course this soon turns into a special case of nul-terminated strings: point the pointer at the end of the string, bingo, one rather hard-to-consult nul by any other name.
The biggest downside is probably a long-term ABI problem. The scheme is inflexible. If your Pascal string-length header is too short, however do you expand it? It's wired into every string-using program out there! At least nul-terminated strings need no expansion.
The real solution to string-handling unfortunately requires a VM of some description which can prevent the program from accidentally overwriting fields in aggregates by writes to any other field or variable. Then you can do reliable Pascal strings, separating the length from the content, or reliable null-terminated strings, with the separate compartment containing a pointer into the string. Unfortunately this is incompatible with low-level all-the-world's-a-giant-arena languages like C without very specialized fine-grained MMU hardware.
(I have, like everyone, written my own dynamic string-handling library when younger. It starts out simple but it's amazing how soon you have to introduce extra code to track pointers and make freeing them in error cases less verbose, and extra code to track memory leaks... you need that in C anyway of course but the massive increase in dynamic memory use that dynamically allocating most strings brings tends to force them on you sooner than otherwise.)
Posted Nov 17, 2010 12:38 UTC (Wed)
by Cyberax (✭ supporter ✭, #52523)
[Link] (4 responses)
Come on.
You'd naturally use 32 bits for the string length on 32-bit systems. And by a strange coincidence, that's the maximum amount of contiguous RAM that you can address on 32-bit systems. On 64-bit systems, you'd naturally use a 64-bit counter.
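One way to picture the pointer-width length header discussed here is the sketch below. It assumes a C99 flexible array member, and the names counted_str and counted_str_new are hypothetical, not taken from any library.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical counted string: a size_t length header (32 bits on typical
 * 32-bit systems, 64 bits on typical 64-bit ones) followed by the bytes. */
struct counted_str {
    size_t len;     /* number of bytes in data; no terminator is needed */
    char   data[];  /* C99 flexible array member */
};

/* Allocate a counted string holding a copy of the first len bytes of src.
 * Returns NULL on overflow or allocation failure; release with free(). */
static struct counted_str *counted_str_new(const char *src, size_t len)
{
    assert(src != NULL);
    if (len > SIZE_MAX - sizeof(struct counted_str))
        return NULL;
    struct counted_str *s = malloc(sizeof *s + len);
    if (s == NULL)
        return NULL;
    s->len = len;
    memcpy(s->data, src, len);
    return s;
}
```

The header lets the string contain '\0' bytes and makes its length an O(1) lookup, but every string, including a one-byte one, pays at least sizeof(size_t) bytes for it, which is the size increase raised earlier in the thread.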
Posted Nov 18, 2010 16:40 UTC (Thu)
by nix (subscriber, #2304)
[Link] (3 responses)
Posted Nov 18, 2010 17:04 UTC (Thu)
by Cyberax (✭ supporter ✭, #52523)
[Link] (2 responses)
Posted Nov 25, 2010 13:10 UTC (Thu)
by renox (guest, #23785)
[Link] (1 responses)
Posted Nov 25, 2010 16:22 UTC (Thu)
by vonbrand (guest, #4458)
[Link]
No, you haven't... (Original) Pascal "strings" were just (packed) arrays of characters of a fixed length.
/me ducks and runs for cover
Posted Nov 19, 2010 11:24 UTC (Fri)
by job (guest, #670)
[Link]
If memory efficiency is a problem for you, multibyte encodings are a much worse problem than storing the string length. But UTF-8/16 is here to stay; there is simply no competition. I think we have to accept it.
Posted Nov 18, 2010 9:50 UTC (Thu)
by stijn (subscriber, #570)
[Link]
Posted Nov 18, 2010 15:38 UTC (Thu)
by etienne (guest, #25256)
[Link] (3 responses)
And you can also combine them to do things like:
Posted Nov 18, 2010 17:24 UTC (Thu)
by pr1268 (subscriber, #24648)
[Link] (2 responses)
I like your code example, but it might only work in C (not C++). Two cases in point:
1 Stroustrup, B. The C++ Programming Language, Special Edition, p. 77
Posted Nov 19, 2010 10:58 UTC (Fri)
by etienne (guest, #25256)
[Link]
const char *curlang(const char *mltstr)
Obviously none of the substrings can have an embedded zero char.
A C++ line of code like:
Posted Nov 20, 2010 1:03 UTC (Sat)
by cmccabe (guest, #60281)
[Link]
Sorry, you are confused. It works in both C and C++.
> Using the enum value as an array index might give unpredictable results
Nope.
Here the enum is promoted to an integer. C++, like C, promotes a lot of types to integers under the right situations.
> The C++ standard library string can have '\0' characters anywhere inside
There is no std::string in this example. You are confused.
Posted Nov 18, 2010 12:44 UTC (Thu)
by Kwi (subscriber, #59584)
[Link] (2 responses)
(Incidentally, Java strings work the same way behind the scenes, but are immutable, which I guess is what you object to when you call them inefficient?)
Of course, one problem with this suggestion is that it doubles the size of a string reference (8 bytes on 32-bit architectures, 16 bytes on 64-bit architectures).
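The suggestion itself is not fully quoted here. Assuming it means a string reference made of a pointer plus a length, a rough sketch could look like this (str_ref and its helpers are hypothetical names):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical string reference: a pointer to the bytes plus their count.
 * It is roughly twice the size of a bare char pointer. */
struct str_ref {
    const char *ptr;
    size_t      len;
};

/* Wrap an existing NUL-terminated string without copying it. */
static struct str_ref str_ref_from_cstr(const char *s)
{
    assert(s != NULL);
    return (struct str_ref){ s, strlen(s) };
}

/* Take a substring without copying or writing a terminator. */
static struct str_ref str_ref_slice(struct str_ref r, size_t start, size_t count)
{
    assert(start <= r.len && count <= r.len - start);
    return (struct str_ref){ r.ptr + start, count };
}
```

The extra word per reference buys cheap substrings: slicing "world" out of "hello world" only adjusts the pointer and the length, with no allocation and no modification of the original buffer.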
Posted Nov 18, 2010 16:14 UTC (Thu)
by pr1268 (subscriber, #24648)
[Link] (1 responses)
> (Incidentally, Java strings work the same way behind the scenes, but are immutable, which I guess is what you object to when you call them inefficient?)
Exactly. And, my semi-rhetorical question immediately after that ("is there a better way?") begs the question of whether the Sun engineers who developed the Java language imposed that immutability for thread safety (since thread safety was/is a primary goal of the Java language). I don't know for sure; just going off intuition here.
Posted Nov 19, 2010 10:07 UTC (Fri)
by mfedyk (guest, #55303)
[Link]
Posted Nov 18, 2010 18:03 UTC (Thu)
by nevyn (guest, #33129)
[Link]
I think it solves almost all the "normal" problems people have with non-nil terminated strings:
1. You can easily allocate them on the stack.
2. You can easily allocate them in constant memory.
3. "" and "x" don't have overhead in the 1,000% range (depending on how you count).
...but still has good solutions to the nil terminated strings problems, in that it allows you to know the allocated size and length used (and put \0 in your string).
Saying that, the solution was far from obvious ... so while I think it would have been usable in the 1970s, using NIL terminated strings was much more obvious.
If you want dynamic strings, then talloc_strdup and talloc_strdup_append etc (in libtalloc) are probably your friends, though I confess I haven't used them extensively.
strlcpy
but C-strings don't care..
enum {lang_english, lang_french, lang_german} current_language = lang_french;
/* several strings packed back to back, each ended by its own \0 */
const char mltstr_language[] = "english\0francais\0deutsch\0";
const char *curlang(const char *mltstr)
{
    /* select the right sub-string depending on current_language */
}
void fct(void)
{
    printf("LANG=%s", curlang(mltstr_language));
}
It saves *a lot of space*: having strings and (aligned) pointer arrays everywhere, and worse, an (aligned) size for every Pascal string, easily takes more memory than the program code and data altogether.
2 Ibid., p. 583
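{
    const char *ptr = mltstr;
    /* skip one substring (and its terminating \0) per language index */
    for (unsigned cptlang = 0; cptlang < current_language; cptlang++)
        while (*ptr++) {}
    /* fall back to the first substring if we landed on the end marker */
    return (*ptr) ? ptr : mltstr;
}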
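For reference, the pieces of this example can be put together into one compilable sketch. This is a variant rather than the code as posted: it takes the index as a parameter instead of reading a global, and it stops at the empty string that ends the list, so an out-of-range index cannot walk past the array. The name nth_substring is hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Several NUL-terminated strings packed back to back. The literal's own
 * implicit terminator, right after the last "\0", forms an empty string
 * that marks the end of the list. */
static const char mltstr_language[] = "english\0francais\0deutsch\0";

/* Return the index-th substring of mltstr, or the first one if index is
 * past the end of the list. */
static const char *nth_substring(const char *mltstr, unsigned index)
{
    assert(mltstr != NULL);
    const char *ptr = mltstr;
    while (index > 0 && *ptr != '\0') {
        ptr += strlen(ptr) + 1;  /* skip one substring and its NUL */
        index--;
    }
    return (*ptr != '\0') ? ptr : mltstr;
}
```

The whole table costs one byte of overhead per entry plus one end marker, with no pointer array and no per-string length field, which is the space saving being argued for.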
cout << "The " << (big ? "big " : "small ") << "dog is " << age << " year old.";
needs efficient storage for small strings, even more so in multi-language software.
> since C++ treats enumerations as a distinct type (instead of int as in C)1
> the string (which may also lead to unpredictable behavior at runtime)2. Of
> course, you're referring to a C-style string, so this may be a moot point.