The many ways to copy a string in C

C, unlike most other languages, doesn't provide a string type of any sort. Its low-level nature means that strings are supported via arrays of null-terminated characters. Common string operations are provided by the C standard in the form of functions prefixed by str.

When it comes to string copying, the standard provides strcpy and strncpy. Despite this, BSD provides strlcpy alongside a corresponding strlcat function. The Linux kernel contains strscpy. C11 added strcpy_s, which may get removed. Microsoft have an abundance of options, including lstrcpy and StringCchCopy.

This plethora of string copying routines may pique your interest. Given that there are already two standard functions for copying a string, why is it so often reinvented? The primary reasons relate to safety, particularly when dealing with untrusted sources.

Let's look at each in turn.

strcpy

This is the simplest method to copy a C string. In fact, implementing and understanding strcpy is a good task for any student studying C.

char*
strcpy(char *dst, const char *src)
{
    char *ret = dst; // remember the start; the loop advances dst

    while ((*dst++ = *src++)) {}
    return ret;
}

While the code is simple and in its own way rather elegant, it assumes that the source string is a valid null-terminated C string and that the destination buffer has sufficient space.

If the source parameter is not a valid null-terminated string then the loop will continue until it encounters a zero, possibly reading far forward in memory. This may cause a segmentation fault by accessing invalid memory pages.

A more distressing issue is that strcpy can cause buffer overruns by writing past the end of the destination buffer. Writing past the end of the buffer could, for example, overwrite data on the stack and, if carefully orchestrated, allow execution of arbitrary code. This can occur even when the source is a valid C string, if the string is longer than the destination buffer.

For example, an email address should not exceed 256 characters. Knowing this, a programmer could create a character buffer of size 257 in order to copy an email address using strcpy. If an untrusted input provides an email address longer than 256 then a buffer overrun will occur, providing an attack vector.

char email[257];
strcpy(email, source); // not safe if source invalid or longer than 256

You can read about the recent vulnerability in Microsoft's Equation Editor (CVE-2017-11882) as an example of this exact weakness.

The only way to avoid this security issue while continuing to use strcpy is to make sure the source is a valid C string, its length is known and it's shorter than the destination buffer[1]. However, this may be impractical and better solutions exist in the form of other functions.
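The conditions above can be sketched as a small wrapper. This is only an illustration: copy_if_fits is a hypothetical name, not a standard function, and it still assumes the source is already known to be a valid null-terminated string, because strlen has the same unbounded-read problem as strcpy.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not a standard function): copy src into dst only
 * when the whole string, including its terminator, fits in dstsize bytes.
 * Assumes src is already known to be a valid null-terminated string.
 */
static bool
copy_if_fits(char *dst, size_t dstsize, const char *src)
{
    size_t len = strlen(src); /* only safe on a valid C string */

    if (len >= dstsize) {
        return false; /* no room for the string plus its terminator */
    }

    strcpy(dst, src); /* cannot overrun: len + 1 <= dstsize */
    return true;
}
```

With the email example, copy_if_fits(email, sizeof(email), source) would refuse an over-long address and leave email untouched, rather than overrunning the buffer.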

strncpy

After the issues with strcpy, this standard function offers hope with its extra length parameter. However, it is not a bounded version of strcpy:

strncpy was initially introduced into the C library to deal with fixed-length name fields in structures such as directory entries. Such fields are not used in the same way as strings. The trailing null is unnecessary for a maximum-length field, and setting trailing bytes for shorter names to null assures efficient field-wise comparisons. strncpy is not by origin a "bounded strcpy", and the Committee has preferred to recognize existing practice rather than alter the function to better suit it to such use.
Rationale for ANSI C, 4.11.2.4

strncpy is not designed for use with null-terminated C strings. It's included in this article because its frequent usage with C strings and the common misconception of its safety lead programmers to use it incorrectly and unsafely. The following analyses its usage on C strings.

An example implementation, taken from linux.die.net:

char*
strncpy(char *dest, const char *src, size_t n)
{
    size_t i;

    for (i = 0; i < n && src[i] != '\0'; i++) {
        dest[i] = src[i];
    }

    for ( ; i < n; i++) {
        dest[i] = '\0';
    }

    return dest;
}

The first loop reads as far as the end of the source string or as many as n characters, whichever comes first, copying each to the destination buffer. The second loop continues writing zeros to the destination buffer until exactly n characters have been written.

We can be sure that no more than n characters are read from the source string and that exactly n characters are written to the destination buffer. The two issues with strcpy are solved. However, there are now two different issues.

The most serious issue is also a little subtle. If the source string is n characters or longer then the destination buffer will not be null-terminated. This means that the result may or may not be a valid string, depending on the length of the source. Once aware of this issue, it is possible to work around it:

const size_t size = 256;
char dest[size + 1]; // one extra byte for the null terminator

strncpy(dest, src, size);
dest[size] = 0; // manually ensure a null terminator

This approach, however, is inelegant and error-prone. Even though it is possible, strncpy should not be used on C strings. Let it remain for fixed-length character arrays, as it was designed.

The second issue is not a correctness or security issue but a performance one. strncpy always writes n characters to the destination buffer, regardless of the source size. If used on a large destination buffer inside an inner loop then a lot of time could be wasted writing unnecessary zeros.
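Both issues are easy to observe directly. The following is a minimal sketch; the three helper names are hypothetical and the 8-byte buffer size is an arbitrary choice for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: does buf contain a terminator within size bytes? */
static bool
has_terminator(const char *buf, size_t size)
{
    return memchr(buf, '\0', size) != NULL;
}

/* Hypothetical helper: count the zero bytes at the end of buf. */
static size_t
trailing_zeros(const char *buf, size_t size)
{
    size_t count = 0;

    while (count < size && buf[size - 1 - count] == '\0') {
        count++;
    }
    return count;
}

/* Fill an 8-byte buffer with 'x', then strncpy src over it. */
static void
fill_then_strncpy(char buf[8], const char *src)
{
    memset(buf, 'x', 8);
    strncpy(buf, src, 8);
}
```

After fill_then_strncpy(buf, "ab"), trailing_zeros(buf, 8) is 6: every unused byte was overwritten. After fill_then_strncpy(buf, "abcdefgh"), has_terminator(buf, 8) is false, so passing buf to anything expecting a C string would read past its end.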

The reason strncpy doesn't enforce a null terminator is that when using fixed-length arrays, the maximum length lets you imply a null terminator, thus saving one byte. If you have an array of length sixteen and you reach the sixteenth byte without seeing a null terminator, you still know that you've reached the end of the string. The null terminator is only required if the string is shorter than the array length.

It may be less obvious why the entire destination buffer is null-padded. A consistent value for the unused bytes of the arrays means that memcmp can be used for comparisons, improving efficiency.
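As a sketch of that intended use, consider a record with a fixed-width name field. The struct, the field width of sixteen and the function names are all assumptions made for illustration; they are not taken from any real directory format.

```c
#include <stdbool.h>
#include <string.h>

#define NAME_LEN 16

/* Hypothetical record with a fixed-width, possibly unterminated name. */
struct entry {
    char name[NAME_LEN];
};

/* Store a name: strncpy truncates to NAME_LEN and zero-pads the rest. */
static void
entry_set_name(struct entry *e, const char *name)
{
    strncpy(e->name, name, NAME_LEN);
}

/* Compare whole fields; valid because unused bytes are always zero. */
static bool
entry_same_name(const struct entry *a, const struct entry *b)
{
    return memcmp(a->name, b->name, NAME_LEN) == 0;
}
```

Because a full-width name has no terminator, such a field must never be handed to a function expecting a C string. To print it, bound the read with a precision: printf("%.*s", NAME_LEN, e->name).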

strlcpy

strlcpy is not a standard C function but is present on various UNIX and Unix-like operating systems such as BSD, Solaris, Android and Mac OS X. On Linux, glibc only gained it in version 2.38; on older systems it can be found in libbsd.

Here is the implementation, taken from OpenBSD's CVS:

size_t
strlcpy(char *dst, const char *src, size_t dsize)
{
    const char *osrc = src;
    size_t nleft = dsize;

    /* Copy as many bytes as will fit. */
    if (nleft != 0) {
        while (--nleft != 0) {
            if ((*dst++ = *src++) == '\0') {
                break;
            }
        }
    }

    /* Not enough room in dst, add NUL and traverse rest of src. */
    if (nleft == 0) {
        if (dsize != 0) {
            *dst = '\0'; /* NUL-terminate dst */
        }

        while (*src++) {
            ;
        }
    }

    return(src - osrc - 1); /* count does not include NUL */
}

It has the same parameters as strncpy and solves the issues. The result is always null-terminated (unless the provided length was zero, in which case nothing is written) and there is no unnecessary null-padding.

You may note that the return type is not the same as strncpy. Whereas strcpy and strncpy somewhat unhelpfully return the destination parameter you passed in, strlcpy returns the length of the source string. This means that the source string must be a valid C string, which is a departure from strncpy.

The result value does not include the null-terminator as part of the length. This simplifies the check for truncation:

char dst[100];
if (strlcpy(dst, src, sizeof(dst)) >= sizeof(dst)) {
    // truncation occurred
}
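Because the return value is the full length of the source, it also tells you how large a buffer would have been enough. The sketch below uses that to fall back to the heap. Both function names are hypothetical, and sketch_strlcpy is a simplified stand-in with the same contract so that the example does not depend on the platform providing strlcpy.

```c
#include <stdlib.h>
#include <string.h>

/* Stand-in with strlcpy's contract, for systems that lack the real one. */
static size_t
sketch_strlcpy(char *dst, const char *src, size_t dsize)
{
    size_t srclen = strlen(src); /* src must be a valid C string */

    if (dsize != 0) {
        size_t n = srclen < dsize - 1 ? srclen : dsize - 1;
        memcpy(dst, src, n);
        dst[n] = '\0';
    }
    return srclen;
}

/*
 * Hypothetical helper: try the caller's fixed buffer first and fall back
 * to the heap when the string does not fit. Returns buf, a malloc'd copy
 * (the caller frees it), or NULL if allocation fails.
 */
static char *
copy_or_grow(char *buf, size_t bufsize, const char *src)
{
    size_t needed = sketch_strlcpy(buf, src, bufsize);

    if (needed < bufsize) {
        return buf; /* fitted, nothing else to do */
    }

    char *heap = malloc(needed + 1); /* the return value excludes the NUL */
    if (heap != NULL) {
        sketch_strlcpy(heap, src, needed + 1);
    }
    return heap;
}
```

The caller compares the returned pointer with buf to decide whether it needs to call free.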

Most importantly, the requirement that the source is a valid C string means that you cannot use this function safely on untrusted source strings. If you need a function that offers the same functionality but can operate on untrusted sources, read on.

strscpy

The Linux kernel contains strscpy. It is not a standard function in any respect but the Linux kernel is a notable body of low-level code and it's worth exploring why it has its own string copying routine. Linus gave his rationale, in characteristic style, for not using strlcpy and having an alternative:

But no, strlcpy() is complete garbage, and should never be used. It is truly a shit interface, and anybody who uses it is by definition buggy.

Why? Because the return value of "strlcpy()" is defined to be ignoring the limit, so you FUNDAMENTALLY must not use that thing on untrusted source strings.

But since the whole *point* of people using it is for untrusted sources, it by definition is garbage.

Ergo: don't use strlcpy(). It's unbelievable crap. It's wrong. There's a reason we defined "strscpy()" as the way to do safe copies

— Linus Torvalds

In short, if you have an untrusted source, use strscpy.

strscpy returns a signed value: the number of characters copied, not counting the null terminator, or -E2BIG when truncation occurs. This may be slightly more readable than the equivalent test for strlcpy, though this is subjective.

char dst[100];
if (strscpy(dst, src, sizeof(dst)) == -E2BIG) {
    // truncation occurred
}
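To see why this interface suits untrusted sources, here is a userspace sketch of the same contract. It is not the kernel's implementation, which is more heavily optimised; sketch_strscpy is a hypothetical name, long stands in for the kernel's signed return type, and the INT_MAX guard is my assumption about how oversized counts should be rejected. The point is that the source is never read beyond count bytes.

```c
#include <errno.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>

/*
 * Userspace sketch of strscpy's contract. It reads at most count bytes
 * from src, so src need not be terminated as long as count bytes of it
 * are readable.
 */
static long
sketch_strscpy(char *dst, const char *src, size_t count)
{
    if (count == 0 || count > INT_MAX) {
        return -E2BIG; /* assumed guard: nothing sensible can be done */
    }

    const char *end = memchr(src, '\0', count); /* bounded search */
    if (end != NULL) {
        size_t len = (size_t)(end - src);
        memcpy(dst, src, len + 1); /* the string and its terminator */
        return (long)len;
    }

    memcpy(dst, src, count - 1); /* truncate, leaving room for the NUL */
    dst[count - 1] = '\0';
    return -E2BIG;
}
```

Unlike strlcpy, nothing here needs the length of the whole source, so an unterminated source costs at most count bytes of reading and always yields a terminated destination.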

strcpy_s

TR24731-1 (Extensions to the C Library Part I: Bounds-checking interfaces) proposed more secure versions of common functions, adding the suffix _s. These functions do not prevent or correct any errors but highlight these errors when they occur. If a call violates any runtime constraints, the function invokes the current constraint handler, which the user can set to ignore the violation or react to it.

Given the issues with strcpy, it's hardly surprising that strcpy_s was proposed. It takes a length parameter, unlike strcpy.

errno_t strcpy_s(char *restrict dest, rsize_t destsz, const char *restrict src);

An error is raised if either the source or destination is a null pointer; the source and destination overlap; the length is zero; the length is too large[2] or if truncation would occur. If any of these conditions are detected, the function returns a non-zero error code and in most cases the destination buffer's first element is set to zero.
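Those checks can be sketched in portable C. This is an approximation rather than a conforming implementation: sketch_strcpy_s and SKETCH_RSIZE_MAX are hypothetical names, the function returns an error code directly instead of invoking a constraint handler, the overlap check is left out, and the particular codes EINVAL and ERANGE are an assumption, since the standard only requires a non-zero value.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed limit, following the recommendation in C11 K.3.2. */
#define SKETCH_RSIZE_MAX (SIZE_MAX >> 1)

/*
 * Sketch of the checks strcpy_s performs. It returns an error code
 * directly instead of calling a constraint handler, and it omits the
 * overlap check.
 */
static int
sketch_strcpy_s(char *dest, size_t destsz, const char *src)
{
    if (dest == NULL) {
        return EINVAL; /* nowhere to write, not even a terminator */
    }
    if (destsz == 0 || destsz > SKETCH_RSIZE_MAX) {
        return ERANGE; /* size is unusable, so dest is left alone */
    }
    if (src == NULL) {
        dest[0] = '\0';
        return EINVAL;
    }

    const char *end = memchr(src, '\0', destsz); /* bounded search */
    if (end == NULL) {
        dest[0] = '\0'; /* would truncate: leave an empty string */
        return ERANGE;
    }

    memcpy(dest, src, (size_t)(end - src) + 1); /* string plus NUL */
    return 0;
}
```

Note the difference from strlcpy and strscpy: on overflow the destination is emptied rather than filled with a truncated copy.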

#define __STDC_WANT_LIB_EXT1__ 1 // request the Annex K interfaces
#include <stdlib.h>
#include <string.h>

int main(void)
{
    char buf[10];

    // default constraint handler is implementation-defined
    if (strcpy_s(buf, sizeof(buf), "Works")) {
        // Don't reach here because strcpy_s succeeds and
        // returns zero
    }

    // The handler can be disabled, in effect:
    set_constraint_handler_s(ignore_handler_s);
    if (strcpy_s(buf, sizeof(buf), "Too much text for buf")) {
        // We reach here because strcpy_s returns ERANGE (non zero)
        // to signify overflow. buf[0] is set to \0.
    }

    set_constraint_handler_s(abort_handler_s);
    if (strcpy_s(buf, sizeof(buf), "Too much text for buf")) {
        // We don't reach here as the call to strcpy_s invokes the
        // handler, which aborts the program.
    }
}

Microsoft's secure CRT, which implements strcpy_s, can use C++ templates to automatically supply the length parameter[3], if you compile as C++ and set a define:

#define _CRT_SECURE_CPP_OVERLOAD_SECURE_NAMES 1

char email[257];
strcpy_s(email, source); // is equivalent to strcpy_s(email, 257, source)

TR24731-1 was accepted into the C11 standard in Annex K, which deals with bounds-checking. However, implementation is optional and few have implemented it, effectively preventing portability. Microsoft is the most notable implementor but even theirs doesn't follow the standard in all cases[4].

Furthermore, there is a proposal (n1967) to remove Annex K from the next standard. It draws on field experience to discuss the successes and failures of Annex K in regard to its goals and safety. It concludes:

Therefore, we propose that Annex K be either removed from the next revision of the C standard, or deprecated and then removed.

Summary

strcpy, despite being the standard method for copying a C style string, is particularly problematic. For safe usage the source must be a valid C string, its length must be known and be shorter than the destination buffer. In most cases, you'll need a different, safer function.

strncpy, also a standard function, is not designed for C style strings and this shows because it does not ensure that the result is a valid C string. While this can be worked around, it's best to leave it for fixed-length character arrays as per its design.

strlcpy, a widely-available function, is a good way to copy a string but requires that the source string be valid. It cannot be used on untrusted sources.

strscpy, from the Linux kernel, is much like strlcpy but can be used on untrusted sources.

strcpy_s, an optional part of the C11 standard, would be a great suggestion but its appeal is limited due to the lack of portability and potential removal from the standard. For those who can be sure they only require Windows coverage using MSVC, it may be suitable.

Working with strings in C is tricky and not C's speciality. In many cases, the complexity is incidental and not inherent. You can use a library, such as the better string library, to raise string operations to a slightly higher level.

If you wish to work with UTF-8 text in C, the utf8rewind library may be of interest.

Footnotes

  1. Due to the terminating null character, the destination buffer must be at least one character longer than the length of the source string.

  2. Too large means greater than RSIZE_MAX. RSIZE_MAX may be set to a value smaller than rsize_t's maximum in order to catch potential bugs, such as accidents with signed and unsigned conversions:

    3 Extremely large object sizes are frequently a sign that an object’s size was calculated incorrectly. For example, negative numbers appear as very large positive numbers when converted to an unsigned type like size_t. Also, some implementations do not support objects as large as the maximum value that can be represented by type size_t.
    4 For those reasons, it is sometimes beneficial to restrict the range of object sizes to detect programming errors. For implementations targeting machines with large address spaces, it is recommended that RSIZE_MAX be defined as the smaller of the size of the largest object supported or (SIZE_MAX >> 1), even if this limit is smaller than the size of some legitimate, but very large, objects. Implementations targeting machines with small address spaces may wish to define RSIZE_MAX as SIZE_MAX, which means that there is no object size that is considered a runtime-constraint violation.
    C11, K.3.2

  3. This is the essence of extracting an array size in C++ using templates:

    template <typename T, size_t N>
    size_t array_size(T (&array)[N])
    {
        return N;
    }

    This can only work on an array and not a pointer.

  4. For example, strtok_s and localtime_s differ from the standard.