A stroll down fuzzer optimisation lane and why instrumentation policies matter

David Korczynski
Published in Envoy Proxy
May 14, 2021


This blog post takes an in-depth look at libFuzzer internals and how they affect Envoy fuzzing performance. It is written by David Korczynski from Ada Logics.

During March and April this year, security researchers from Ada Logics worked on the fuzzing infrastructure of Envoy Proxy, sponsored by the Linux Foundation. One of the focus areas was optimisation, and in this blog post we go through some of the findings of this work. The full report is a comprehensive 28-page document that is available here.

Envoy has a large code base comprising roughly 1.3 million lines of code, including auto-generated code (e.g. Protobuf code), testing infrastructure and various important dependencies. To help secure this code we maintain a comprehensive fuzzing suite that continuously analyses Envoy by way of OSS-Fuzz. We follow the principles of an ideal integration, and to this end we have a variety of fuzzers: some resemble unit tests in that they target specific areas of Envoy, such as our json fuzzer, while others are closer to integration tests in that they target end-to-end concepts of Envoy, such as our HTTP2 end-to-end fuzzer.

The issue we had observed was that our end-to-end fuzzers ran at a low execution speed; in particular, the speed was much lower than in our own performance experiments with Envoy Proxy, even after accounting for the slowdown of sanitizers such as AddressSanitizer. Since execution speed is an essential factor for the success of fuzzers, we wanted to resolve this excessive slowdown, and our investigation into it is the focus of this blog post.

Coverage instrumentation and performance in libFuzzer

In order to understand the reason for the slowdown we need to understand how coverage-guided fuzzing works in libFuzzer. We describe this in detail in the full report, so in this blog post we keep it short.

From a simplified perspective, coverage-guided fuzzing with libFuzzer works as follows. First, we compile the target with fuzzer-specific instrumentation, meaning the code is instrumented with coverage-feedback instrumentation. This instrumentation places a counter on each basic block of the target program.
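To make this concrete, the snippet below is a hand-written sketch of the idea behind inline 8-bit counters. It is not the code the compiler actually emits, but it illustrates how each basic block ends up incrementing its own slot in a global counter array:

#include <stdint.h>

/* Hand-written illustration of inline 8-bit counters; the real counters
 * are emitted by the compiler, not written by hand. */
static uint8_t counters[2];   /* one slot per instrumented basic block */

int is_positive(int x) {
  counters[0]++;              /* entry basic block executed */
  if (x > 0) {
    counters[1]++;            /* "then" basic block executed */
    return 1;
  }
  return 0;
}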

Second, when we run the compiled fuzzer the following process happens. At instantiation, the fuzzer identifies the number of counters in the target and creates a “corpus” object that combines all the counters. The fuzzer then continues by performing the following infinite loop (a pseudocode sketch follows the list):

  1. Execute the fuzzer entry point. Basic blocks that are executed trigger an increase of their corresponding counters.
  2. Post-process the execution by going through each of the counters in the program and logging its state. If any counter differs from previous runs, the execution triggered new coverage; in that case, add the test case to the corpus and record the state of the counters.
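The following is a minimal sketch of this loop, with placeholder helpers standing in for mutation and corpus management; it is not libFuzzer's actual implementation:

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Placeholders standing in for libFuzzer internals; these are assumptions
 * made for illustration, not real libFuzzer APIs. */
extern uint8_t counters[];      /* the inline 8-bit counters in the target */
extern size_t num_counters;
extern uint8_t *pick_and_mutate(size_t *size);
extern void add_to_corpus(const uint8_t *data, size_t size);
int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size);

static void fuzz_loop(void) {
  uint8_t *seen = calloc(num_counters, 1);     /* coverage observed so far */
  for (;;) {
    size_t size;
    uint8_t *data = pick_and_mutate(&size);

    memset(counters, 0, num_counters);
    LLVMFuzzerTestOneInput(data, size);        /* step 1: execute the target */

    /* Step 2: walk over every counter, even ones the input never touched. */
    int new_coverage = 0;
    for (size_t i = 0; i < num_counters; i++) {
      if (counters[i] && !seen[i]) {
        seen[i] = 1;
        new_coverage = 1;
      }
    }
    if (new_coverage)
      add_to_corpus(data, size);               /* keep inputs that hit new coverage */
    free(data);
  }
}

Note that step 2 walks over every counter on every iteration, which is exactly where the cost of instrumenting large amounts of irrelevant code shows up.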

Following the description above, we can see that there are performance penalties in two places when using libFuzzer: the execution of the target and the post-processing after each fuzz iteration. The performance penalty during execution of the instrumented target code is fairly obvious even to a non-expert; the penalty of the post-processing is less obvious.

To measure the impact of the post-processing, consider the following simple fuzzer harness:

#include <stdint.h>
#include <string.h>
#include <stdlib.h>
#include "target.h"

int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size) {
  char *new_str = (char *)malloc(size+1);
  if (new_str == NULL) {
    return 0;
  }
  memcpy(new_str, data, size);
  new_str[size] = '\0';
  free(new_str);
  return 0;
}

The above is an empty fuzzer that performs a few operations but essentially explores no code. However, it includes the target.h header file, and by compiling this fuzzer with an empty target.h file we can observe the number of counters the fuzzer instrumentation inserts into the resulting executable:

$ clang -fsanitize=fuzzer ./test.c 
$ ./a.out -runs=0
INFO: Running with entropic power schedule (0xFF, 100).
INFO: Seed: 3040949426
INFO: Loaded 1 modules (3 inline 8-bit counters): 3 [0x6ee0c0, 0x6ee0c3),
...

The output shows us that the fuzzer has 3 inline 8-bit counters. Now, if we compile the exact same fuzzer but with a target.h file defined as follows:

int target(char *addr, size_t size) {
  if (addr[0] == 'A' && size == 0) return 1;
  return 0;
}

and then load the fuzzer again without executing any iterations, we observe the following:

$ clang -fsanitize=fuzzer ./test.c 
$ ./a.out -runs=0
INFO: Running with entropic power schedule (0xFF, 100).
INFO: Seed: 3295510585
INFO: Loaded 1 modules (7 inline 8-bit counters): 7 [0x6ee0c0, 0x6ee0c7),

This time the fuzzer has 7 inline 8-bit counters, even though nothing in the fuzzer-relevant code has changed. That is, the two fuzzers achieve exactly the same coverage; the only difference is that one of them is compiled with additional, albeit unreachable, code, which in turn is instrumented with coverage-feedback instrumentation.

The question is: what is the impact of these inline 8-bit counters if the behaviour of the fuzzer does not change? To find out, we can generate arbitrarily large target.h files and, for each resulting fuzzer, record the number of inline 8-bit counters and the number of executions per second. We used the following small Python script to automatically create a target.h file with an arbitrary number of if-statements similar to the one above, together with a simple bash script:

import os

# Number of if-statements to generate, passed via the MAX_ITER environment variable.
max_iterations = int(os.environ['MAX_ITER'])

func_impl = "int target(char *addr, size_t size) {\n"
for i in range(max_iterations):
    func_impl += "\tif (addr[0] == 'A' && size == %d) return %d;\n" % (i, i)
func_impl += "\treturn 0;\n}\n"

with open("target.h", "w") as ff:
    ff.write(func_impl)

The Python script generates a target.h file with N if-statements, and we then use the following bash script to run experiments with various values of N:

for N in 1 1000 10000 50000 100000 200000 500000 1000000; do
  echo "[+] Starting analysis"
  export MAX_ITER=${N}
  python ./make_p.py
  clang -fsanitize=fuzzer test.c
  ./a.out -runs=0
  ./a.out -max_total_time=60
done

Running this script and noting the counters as well as the execution speed, we get the following data:

We observe that the number of inline 8-bit counters grows linearly with N, which is to be expected. Again, we emphasize that the fuzzer itself is still just a single function and all the fuzzers have the same code coverage; the only difference is the amount of extra code in the final executable that is compiled with fuzzer-specific instrumentation, even though this code is not reachable by the fuzzer.

The next question is how these counters impact the execution speed of the fuzzer. To measure this, we simply ran each fuzzer for 60 seconds (done in the bash script above) and noted the number of executions per second. We observed the following data:

The following Figure visualises the data:

The difference between 7 inline 8-bit counters and 3,000,004 is a slowdown of 2407x, despite the fuzzer having the exact same coverage (namely 2) and essentially executing very little code. The slowdown is entirely due to the post-processing of these counters after each iteration.

Impact on Envoy

So how did this impact the Envoy fuzzers? As noted above, Envoy consists of a lot of code, and the vast majority of this code was built with sanitizer and coverage instrumentation. The HTTP2 end-to-end fuzzer had a staggering 1.3 million inline 8-bit counters, and there is a specific reason for instrumenting our codebase in this manner. Our fuzzing is run by OSS-Fuzz, which requires the fuzzers to be built statically. As such, we build many of our dependencies ourselves rather than using pre-compiled binaries, and all of our instrumentation flags are also used when compiling these dependencies.
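As a simplified, hypothetical illustration (not Envoy's actual build configuration), exporting the instrumentation flag globally means that every dependency compiled along the way contributes counters to the final fuzzer binary:

$ # Hypothetical illustration; not Envoy's actual build setup.
$ export CFLAGS="-fsanitize=fuzzer-no-link"   # instrument, but do not link the libFuzzer runtime
$ clang $CFLAGS -c dependency.c -o dependency.o
$ clang -fsanitize=fuzzer fuzz_target.c dependency.o
$ ./a.out -runs=0   # dependency.o's counters now show up in the "inline 8-bit counters" total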

To get a more meaningful understanding of how much time was spent inside the post-processing, we profiled our system using Prodfiler and observed that 26% of samples fell inside the post-processing logic of our fuzzer, meaning that roughly a quarter of our processing time was spent in this function:

The impact of 1.3 million counters is significant; in the experiment above, 1.5 million inline 8-bit counters cause a slowdown of 1203x on an empty fuzzer. It is also noteworthy that not all of the code in our resulting fuzz binaries is actually reachable by a given fuzzer, so, by default, there is room for improvement simply by avoiding instrumentation of irrelevant code. However, there may also be improvements to be found by limiting coverage instrumentation of code that is reachable by a fuzzer. This observation raised a set of questions:

  1. Should we accept the slowdown of instrumenting all of the code in Envoy with coverage-guided instrumentation and continue in our usual manner?
  2. Assuming we should not instrument all code with coverage-guided instrumentation, which parts should we focus on? Are some parts more relevant than others, e.g. parser-specific code?
  3. Should we push the issue of selectively instrumenting the code of a fuzz target away from the Envoy codebase and instead focus on adding heuristics about this into libFuzzer?

In essence, the choice between (1) and (2) above must be based on empirical evidence. In this context, the empirical evidence is the number of bugs we find, and not necessarily the execution performance of our fuzzers.

Reducing coverage instrumentation reduces the amount of code the fuzzer receives feedback from, which can have the benefit of helping the fuzzer focus on the important parts of the codebase. This seems positive, as we can focus the fuzzer even more on our threat model. However, it could also leave out specific areas of the code, and this might indirectly affect how the fuzzer explores relevant code.

We experimented with reducing coverage instrumentation on certain parts of the code, decreasing the number of inline 8-bit counters from 1.3 million to 270,000. With this reduction we saw a speedup of about 2–3x on our end-to-end fuzzers. This was a quick win in terms of performance, but there is still a lot of room for improvement if we continue to reduce the amount of instrumented code. Unfortunately, this reduction also comes with an increase in the complexity of our build system, which is already fairly complex in and of itself.
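One possible mechanism for this kind of selective instrumentation is Clang's sanitizer-coverage allowlist, which restricts coverage instrumentation to the files or functions matching patterns in a list. The snippet below is a generic sketch with a hypothetical path pattern, not the exact change made to Envoy's build:

$ cat allowlist.txt
# Only instrument source files matching this (hypothetical) pattern.
src:*/source/common/http/*
$ clang -fsanitize=fuzzer -fsanitize-coverage-allowlist=allowlist.txt ./test.c
$ ./a.out -runs=0   # the reported number of inline 8-bit counters drops accordingly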

Conclusions and future work

In an investigation into our fuzzing infrastructure we found that our fuzzers incurred a non-negligible performance penalty due to instrumenting a large codebase with coverage-feedback instrumentation. The post-processing of each fuzz iteration, driven by the number of coverage counters, had a significant impact on overall performance.

We observed that we need a more refined policy for which parts of the codebase to instrument, instead of instrumenting the entire codebase. Previously our goal was to instrument all of the codebase; now we have observed the need to instrument code on a per-fuzzer basis.

In addition to improving our own build setup, we think it is worth exploring improvements to libFuzzer itself. Specifically, it should be possible for the fuzzer to approximate which coverage feedback is relevant for a given fuzz target. The benefit of this is that there are potentially large gains to be achieved across all projects that use libFuzzer.

We thank the Linux Foundation for sponsoring this project as well as the team at Envoy for a fruitful and enjoyable collaboration.


David Korczynski is a software security researcher at Ada Logics, focusing on program analysis and vulnerability analysis.