'C must die'


The C language is one of the most influential programming languages in history. It has become an indispensable tool for operating system development, displacing assembly languages from that pedestal. Learning C is considered essential for any self-respecting programmer. The language is loved for its outward simplicity and hated for its ruthlessness towards errors. Thanks to it, we have the Linux kernel, and thousands of vulnerabilities in it as well. Let's try to understand what this controversial C language is: a blessing or a curse?

The history of the C language begins in the depths of the American company Bell Labs and is closely related to the fate of the UNIX operating system. Its creators, Ken Thompson and Dennis Ritchie, developed their project for PDP-11 computers, and for the first two years their main tool was the assembly language. The labor intensity of writing machine code forced them to seek a replacement, which ultimately became the C language. With its help, the core of the operating system and a large part of the utilities were completely rewritten. The C language allowed for the creation of efficient low-level programs on PDP-11, practically without using assembly language.

Over time, the question of porting UNIX to new hardware platforms arose. The use of the C language significantly simplified this task. After all, if only assembly language had been used in development, the operating system would have had to be rewritten for each computer architecture. On the other hand, the UNIX source code still contained a lot of code specifically created for the PDP-11 computer. Moreover, the C language did not always accurately reflect the features and details of a particular hardware platform. This further complicated the porting process and deprived the language of one of its main advantages - transparent and understandable machine code generation. The more computer architectures C encompassed, the less obvious its connection to low-level programming became.

During the migration of UNIX to new hardware platforms, another problem was discovered. Ported programs in the C language ran slower than could be expected from them. The more the target computer architecture differed from PDP-11, the less efficient the generated code was. To compensate for this drawback, compiler developers increasingly began to apply implicit optimizations. And although this solution improved the performance of the programs themselves, C increasingly distanced itself from the low level. Now it was necessary not only to understand how the language constructs were defined for each of the computer architectures, but also how they were optimized. Of course, any compiler independently decided how to translate the source code for each hardware platform. As a result, writing a low-level program in C that was independent of the compiler used became practically impossible.

It was necessary to understand how to implement the high-level constructs of the C language efficiently while preserving its low-level properties. An attempt to solve this problem was the publication of the first standard of the language in 1989. It is commonly referred to as "ANSI C" or "C89", and it is this version that we will refer to below. The creators of the standard decided to finally sever C's ties to the PDP-11 architecture and make the language completely high-level. The so-called "abstract machine" was introduced - an imaginary executor of C code (section 2.1.2.3, "Program execution"):

Semantic descriptions in this Standard describe the behavior of an abstract machine, where optimization issues are irrelevant.

This means that compiler optimizations will not affect the operation of a program as long as its source text complies with the standard. The abstract machine had to solve two problems at once. First, adherence to the standard allowed for the creation of easily portable programs in C. Second, the abstract machine could give compilers the freedom to optimize. However, a quite reasonable question arises: how, then, does the C language differ from any other compiled high-level language? The answer lies in the text of the standard. To still give programmers the theoretical possibility of writing low-level, and therefore non-portable, procedures, another concept was introduced: undefined behavior (section 1.6, "DEFINITIONS OF TERMS"):

Undefined behavior - behavior, upon use of a nonportable or erroneous program construct, of erroneous data, or of indeterminately valued objects, for which the Standard imposes no requirements. Permissible undefined behavior ranges from ignoring the situation completely with unpredictable results, to behaving during translation or program execution in a documented manner characteristic of the environment (with or without the issuance of a diagnostic message), to terminating a translation or execution (with the issuance of a diagnostic message).

Simply put, undefined behavior consists of gaps intentionally left in the description of the C language's abstract machine. They allow compilers to decide for themselves how to handle the constructs about whose behavior the text of the standard deliberately remains silent. Such constructs can also simply be treated as forbidden in a program's text. Let's take a closer look at undefined behavior with a specific example.

Let's take the following code snippet in C:

int x = 1;
x = x << sizeof(int) * 8;

Let's try to predict the result. Suppose we compiled this code for ARM architecture processors. The bit shift instruction on this hardware platform is defined in such a way that the final value of "x" should be "0". On the other hand, we can translate our program into machine code for the x86 architecture, where the bit shift is implemented in such a way that the value of "x" will not change and will remain "1". We might conclude that the result of this code fragment depends on which hardware platform we compiled it for. But in reality, this is not the case.

In reality, this piece of code can be processed by the compiler in any way imaginable. The reason is as follows: according to the text of the C standard, a bit shift by an amount greater than or equal to the width of the expression in bits is undefined behavior. Thus, there is no guarantee that this piece of code will work at all. Even within the same architecture, the same compiler can generate completely different executable files. Here are examples of compiling and running a program that prints the value of the variable "x". In both cases we use the gcc compiler version 10.2.1 for the x86-64 target architecture.

$ cat test.c
#include <stdio.h>

int main()
{
    int x = 1;
    x = x << sizeof(int) * 8;
    printf("%d\n", x);
    return 0;
}
$ gcc test.c -o test
$ ./test
1
$ gcc -O test.c -o test
$ ./test
0

The flag "-O" allows the gcc compiler to use optimizations of the source code. The specific optimization mechanisms that can be applied, as well as the flags responsible for them, depend on the specific compiler. In general, it is impossible to know how undefined behavior in the program will be handled when translating the source code. Therefore, the only way to write portable programs in the C language is to completely avoid undefined behavior during development.
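The only portable fix is to make the oversized shift impossible in the first place. A minimal sketch (the function name and the choice to return 0 for an oversized shift are ours, not something the standard prescribes):

```c
#include <limits.h>

/* Shifting an N-bit value by N or more bits is undefined behavior,
   so guard the shift count explicitly. Here an oversized shift is
   defined to yield 0 - an arbitrary but documented choice. */
unsigned shift_left(unsigned v, unsigned n)
{
    return n < sizeof(v) * CHAR_BIT ? v << n : 0;
}
```

With such a guard the result no longer depends on the compiler, the optimization flags, or the target architecture.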

Let's consider a slightly more complex example. Another type of undefined behavior is dereferencing a null pointer. A trivial variant of it is the following code fragment:

*(char *) 0;

Of course, no one in their right mind would write something like this in their program. However, it is not at all necessary to dereference a null pointer explicitly to invoke undefined behavior. In the series of articles "What Every C Programmer Should Know About Undefined Behavior" on the site blog.llvm.org, a code snippet is provided that confirms this:

void contains_null_check(int *p)
{
    int dead = *p;
    if(p == 0)
        return;
    *p = 4;
}

The example may seem contrived, but it allows for a better understanding of how the C language compiler works. The latter uses various optimization mechanisms, but here we are only interested in two. One of them removes unnecessary, "dead" code, while the other eliminates useless null pointer checks. If the first optimization mechanism is applied to the above code fragment, it will transform the function as follows:

void contains_null_check(int *p)
{
    if(p == 0)
        return;
    *p = 4;
}

Then the second mechanism will not detect unnecessary checks for a null pointer, and the source code of the function will take its final form. However, in reality, the order of optimizations may be different. For example, the compiler has the right to first eliminate unnecessary checks for a null pointer, and then the function will be transformed as follows:

void contains_null_check(int *p)
{
    int dead = *p;
    if(0)
        return;
    *p = 4;
}

Since we dereference the pointer before checking it, the compiler calmly decides that the pointer itself will never be null. As a result, the comparison "p == 0" is replaced with an expression that always returns false. Then the compiler initiates the first optimization mechanism and removes the "dead" code:

void contains_null_check(int *p)
{
    *p = 4;
}

It is important to emphasize that both of these optimizations are correct. For an insufficiently attentive programmer, the removed check can become an unexpected gift from the compiler developers. Such code can create a vulnerability in programs running without memory protection, e.g., operating system kernels or microcontroller firmware. Undoubtedly, this example contains an error, but the main problem is not the error itself - it is how the compiler processes it.
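The repair itself is trivial: perform the check before the dereference, so neither optimization has anything to exploit:

```c
#include <stddef.h>

void contains_null_check(int *p)
{
    if (p == NULL)   /* check first: the compiler may no longer assume p != 0 */
        return;
    int dead = *p;   /* the dereference is now guarded */
    (void) dead;     /* silence the unused-variable warning */
    *p = 4;
}
```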

Let's assume you have accidentally introduced undefined behavior into your program. At best, you will immediately discover the error and fix it. In less fortunate cases, you will not do so right away. It is much more likely, however, that the compiler will simply not take advantage of your mistake at first. In that case, the undefined behavior will remain in the source code until it manifests itself at the most inopportune moment. Such a moment can occur when you change: the target computer architecture, the compiler or even just its version, the optimization flags, or any other flags at all. Simply put, undefined behavior is a ticking time bomb. When it will go off is unclear, and one can only guess how many interesting surprises are hidden in the source code of thousands of programs.

Compiler optimizations can also affect the functions of the C standard library, including, for example, memset. It is widely and infamously known for the abundance of errors that programmers make when calling it. The function header looks as follows:

void *memset(void *ptr, int value, size_t num);

memset writes "num" bytes with the value "value" at the address "ptr". Although the "value" parameter has the type int, only its least significant byte is actually used. The function is widely used for zeroing large arrays of data, but the compiler itself is also quite fond of inserting calls to it wherever it deems fit - and sometimes where it shouldn't. A curious case of this kind was discussed on April 15, 2018 on the osdev.org forum. A user with the nickname ScropTheOSAdventurer created a topic in which he described the development of his own hobby operating system. To his misfortune, he allowed the compiler to optimize the project's source code, after which the project stopped working. While debugging, the programmer found the error in the following code fragment:

void *memset(void *ptr, int value, size_t num)
{
    unsigned char *ptr_byte = (unsigned char *) ptr;
    for(size_t i = 0; i < num; ptr_byte[i] = (unsigned char) value, i++);
    return ptr;
}

For his operating system, the developer decided to use his own implementation of the memset function. But he did not take into account that during the compilation process, the gcc compiler would discover a very tempting opportunity for optimization in this code. In fact, the function was ultimately transformed into the following form:

void *memset(void *ptr, int value, size_t num)
{
    return memset(ptr, value, num);
}

It is quite likely that among the developers of the gcc compiler were unmatched masters of sophistry. In any case, the compiler's optimization capabilities clearly exceed all limits accessible to the human mind.
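What happened is that gcc recognized the byte-filling loop as the memset idiom and "optimized" it into a call to memset - which here is the very function being compiled, producing infinite recursion. Real kernels typically sidestep this by compiling with -ffreestanding or -fno-builtin; another common sketch (renamed my_memset here so it does not clash with the C library) is to make the stores volatile so the loop can no longer be pattern-matched away:

```c
#include <stddef.h>

/* Volatile stores cannot be coalesced or replaced with a library call,
   so the loop survives optimization. Renamed for illustration only. */
void *my_memset(void *ptr, int value, size_t num)
{
    volatile unsigned char *p = ptr;
    for (size_t i = 0; i < num; i++)
        p[i] = (unsigned char) value;
    return ptr;
}
```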

Let's give another example with the memset function. The compiler is capable not only of creating calls to it out of thin air, but also of removing them from the source code at its own discretion. Thus, in cryptographic programs it is often useful to wipe all data from memory once it is no longer needed. Outside that setting such behavior is usually excessive, but imagine the following situation. Your program works with a database of users that stores their names and passwords, and you have written a function like this:

int check_password(const char *pwd)
{
    char real_pwd[32];
    get_password(real_pwd);
    return !strcmp(pwd, real_pwd);
}

There is only one problem - after calling check_password, a string with the user's real password will remain on the stack. If your program has at least one vulnerability that allows reading data from memory, there is a real chance of stealing the password from the stack. An example of such a vulnerability was the infamous bug "Heartbleed". To reduce possible risks, the easiest way is to clear the stack fragment containing the password:

int check_password(const char *pwd)
{
    int result;
    char real_pwd[32];
    get_password(real_pwd);
    result = !strcmp(pwd, real_pwd);
    memset(real_pwd, 0, sizeof(real_pwd));
    return result;
}

It would seem that a solution has been found, but it is not that simple. A compiler well versed in optimization may decide that the call to memset is unnecessary here and calmly remove it from the function body - indeed, for the operation of the program itself this action is absolutely useless. Worse still, the compiler may generate code in which the password ends up in one of the processor registers, where a vulnerability in the program could make it even easier to obtain. And you cannot even make the compiler clear the contents of the register. More details on this issue can be read at link.
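Where C11's memset_s or platform functions such as explicit_bzero are unavailable, a widely used workaround (a sketch, not a guarantee given by the standard) is to call memset through a volatile function pointer. The compiler cannot prove what such an indirect call does, so it cannot delete it:

```c
#include <string.h>

/* The volatile qualifier forces the pointer to be re-read at the call
   site, so the compiler cannot treat the call as a removable memset. */
static void *(* volatile memset_v)(void *, int, size_t) = memset;

void secure_wipe(void *p, size_t n)
{
    memset_v(p, 0, n);
}
```

This does nothing about copies of the secret left in registers, but it does keep the wipe of the stack buffer in the generated code.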

One of the most insidious varieties of undefined behavior is a violation of so-called strict aliasing. The term has no established translation, so we will use the original English one. The text of the standard provides the following description for it (section 3.3, "EXPRESSIONS"):

An object shall have its stored value accessed only by an lvalue expression that has one of the following types:

- the declared type of the object,
- a qualified version of the declared type of the object,
- a type that is the signed or unsigned type corresponding to the declared type of the object,
- a type that is the signed or unsigned type corresponding to a qualified version of the declared type of the object,
- an aggregate or union type that includes one of the aforementioned types among its members (including, recursively, a member of a subaggregate or contained union),
- a character type.

The easiest way to illustrate strict aliasing is with a specific example:

int x = 42;
float *p = (float *) &x;
*p = 13;

To invoke undefined behavior, it is enough to access a variable through a type incompatible with its declared one. This restriction can be circumvented by using a character type (char), to which the strict aliasing rules do not apply:

int x = 42;
char *p = (char *) &x;
*p = 13;

However, decomposing a variable into individual bytes can be a laborious task: you have to take into account the size of the data as well as the byte order in use. Undefined behavior can also be avoided using unions:

union u {
    int a;
    short b;
};

union u x;
x.a = 42;
x.b = 13;

However, this method is not without drawbacks - the union must contain members of all the types the program will use it with. All of this seriously complicates the use of "type punning" - the intentional circumvention of the type system. This technique is needed for more flexible low-level management of the machine's memory.
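There is a third option that the standard does bless: memcpy. Copying the object representation byte by byte is defined behavior, and modern compilers turn a small fixed-size memcpy into a single register move, so nothing is lost. A sketch (assuming 4-byte float and unsigned types and an IEEE 754 representation):

```c
#include <string.h>

/* Defined-behavior type punning: copy the object representation of a
   float into an unsigned integer of the same size. */
unsigned float_bits(float f)
{
    unsigned u;
    memcpy(&u, &f, sizeof u);
    return u;
}
```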

To illustrate the benefits of type punning, let's consider a small example. Suppose you have read the contents of a file with an image into the program's memory. And now you need to write a function that returns the color of a pixel at the specified coordinates. For simplicity, we will assume that the size of the int type matches the size of the pixel, as well as the byte order of both:

int get_pixel(const char *buf, int width, int x, int y)
{
    buf += get_header_size(buf);
    return ((const int *) buf)[y * width + x];
}

When calling the function, it is passed the address of the memory area holding the file (including its header), the image width, and the coordinates of the pixel whose color should be returned. Instead of int we could have chosen any other type with a known size. None of that matters, though, because the get_pixel function is absolutely incorrect from the standard's point of view: it violates the strict aliasing rules. To use type punning here legally, it would be necessary to rewrite all the code working with the buffer, including the part responsible for reading the file.
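A variant that stays within the rules without rewriting the rest of the buffer code is, once again, memcpy: copy the pixel's bytes out of the char buffer instead of reinterpreting the buffer through an int pointer. (get_header_size is the article's hypothetical helper; it is omitted here, so buf is assumed to point just past the header.)

```c
#include <string.h>

/* Strict-aliasing-safe pixel read: pixels are stored in the char
   buffer as native-endian ints, and memcpy extracts one of them. */
int get_pixel(const char *buf, int width, int x, int y)
{
    int px;
    memcpy(&px, buf + (size_t)(y * width + x) * sizeof px, sizeof px);
    return px;
}
```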

There are a huge number of examples of programs that do not comply with strict aliasing rules. These include the famous function for calculating the fast inverse square root from the game Quake 3:

float FastInvSqrt(float x)
{
    float xhalf = 0.5f * x;
    int i = *(int *) &x;
    i = 0x5f3759df - (i >> 1);  /* What the fuck? */
    x = *(float *) &i;
    x = x * (1.5f - (xhalf * x * x));
    return x;
}

This code computed the inverse square root of a floating-point number several times faster than the arithmetic coprocessor could. However, this masterpiece of black magic does not pass the standard's check - apparently the creator of cult games John Carmack does not understand the C language at all.
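For what it is worth, the trick survives being made standard-compliant: replacing the pointer casts with memcpy expresses the same bit manipulation without violating strict aliasing, and modern compilers emit the same machine code for it (a sketch assuming 32-bit IEEE 754 floats):

```c
#include <stdint.h>
#include <string.h>

float FastInvSqrt(float x)
{
    float xhalf = 0.5f * x;
    uint32_t i;
    memcpy(&i, &x, sizeof i);    /* read the bit pattern legally */
    i = 0x5f3759df - (i >> 1);   /* the magic constant, unchanged */
    memcpy(&x, &i, sizeof x);    /* and write it back */
    x = x * (1.5f - (xhalf * x * x));
    return x;
}
```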

There is one question left - why is this strict aliasing even needed? The point is that it allows compiler creators to apply extremely aggressive optimizations to the source code. The strict aliasing rules apply to accesses to any memory, including dynamic memory. Thus, the standards committee noted that the following code fragment (source):

void f(int *x, double *y)
{
    *x = 0;
    *y = 3.14;
    *x = *x + 2;
}

can be transformed in this way:

void f(int *x, double *y)
{
    *x = 0;
    *y = 3.14;
    *x = 2;
}

According to the rules of strict aliasing, the pointer y cannot hold the address of the same memory location as the pointer x. It is this fact that allows replacing the expression "*x = *x + 2" with "*x = 2". The active use of such optimizations by compilers has broken a huge amount of old code. Thus, in a letter dated July 12, 1998, one of the developers of the gcc compiler, Jeff Law, in response to questions about strict aliasing and related errors, writes (source):

There is a lot of code that violates strict aliasing rules. One such example is the "portable" universal IP checksum function found in the BSD source codes for network operations.

IMHO, such code is becoming less and less - modern compilers have been using strict aliasing in analysis for some time now, resulting in people being forced to fix their code. Of course, this does not apply to Linux and some other free projects, as they only use gcc.

If we start saying that such code is incorrect, it would be better for us to have some plan in case people start asking why their code, which has worked for many years, is now broken.

Point them to the C language standard :-) :-)

The strict aliasing rules for the gcc compiler can be enabled using the flag "-fstrict-aliasing" and disabled with the flag "-fno-strict-aliasing". The latter is recommended if you are unsure whether you are violating the text of the standard - you most likely are. Speaking of the Linux kernel mentioned in the letter, its author Linus Torvalds also gave his assessment of strict aliasing in particular and the work of the committee as a whole. Thus, criticizing the desire of one of the operating system developers to be extra cautious about violating the standard, Linus wrote such a letter (source):

To be honest, all of this seems dubious to me.

And I'm not talking about the changes themselves - I can come to terms with that. But the justification for these very changes is absolute and complete nonsense, and quite dangerous.

The fact is that using unions to implement type punning is a common and STANDARD way to do it. In fact, it is documented for gcc, and is what you use when, being not too smart (original: "f*cking moron"), you applied "-fstrict-aliasing" and now need to undo all the damage this garbage standard imposes.

Andy, what was the reason for all this idiocy? And don't tell me that the text of the standard is "not clear enough". The text of the standard is, quite clearly, a piece of crap (see above about strict aliasing rules), and in such cases, it should be ignored. For this, it is necessary to use compiler tools to avoid damage. Similarly, one should act in situations where there is no complete clarity.

This is why we use "-fwrapv", "-fno-strict-aliasing" and other flags.

I have said this before and I will repeat it again: when the text of the standard contradicts reality - it is just a piece of toilet paper. It has absolutely no importance. In fact, instead of it, I would rather take a roll of real toilet paper - at least I won't have splinters and ink up my arse.

Apparently, Linus Torvalds poorly studied the C language - a real C programmer wouldn't think of such a thing.

However, the standard's undefined behavior does not end with strict aliasing. To invoke it, it is not even necessary to dereference a pointer:

void f(int *p, int *q)
{
    free(p);
    if(p == q)  /* Undefined behaviour! */
        do_something();
}

Using the value of a pointer after the memory it points to has been freed is prohibited by the text of the standard (section 4.10.3, "Memory management functions").
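A cheap defensive idiom is to null the pointer the moment it is freed: comparing or testing a null pointer is always defined. A sketch (the wrapper name is ours):

```c
#include <stdlib.h>

/* Free the object and null the caller's pointer in one step, so the
   indeterminate post-free value can never be used again. */
void release(int **pp)
{
    free(*pp);
    *pp = NULL;
}
```

As a bonus, a second call is harmless, since free(NULL) is a no-op.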

Summary
The C programming language, developed at Bell Labs, is one of the most influential languages in history, essential for operating system development, particularly UNIX. Initially, C replaced assembly language due to its efficiency in creating low-level programs for the PDP-11 computer. As UNIX was ported to new hardware platforms, C simplified this process, but challenges arose, including slower execution of ported programs and the need for compiler optimizations that distanced C from its low-level roots. In 1989, the first standard for C, known as ANSI C or C89, was introduced to address these issues by establishing an 'abstract machine' that allowed for high-level constructs while maintaining low-level properties. This standard aimed to enhance portability and provide compilers with optimization freedom. However, it also introduced the concept of 'undefined behavior,' which allows compilers to handle non-portable or erroneous constructs flexibly. This duality raises the question of how C differentiates itself from other high-level languages, as it retains the potential for low-level programming despite its high-level features.