Thomas Ip

{n} times faster than C, where n = 128

A blog post popped up on the front page of Hacker News recently, titled {n} times faster than C (HN comments), where the author shows that sometimes a human can spot optimization opportunities a compiler can't. By tweaking the assembly of the tight loop in his program, he was able to achieve a 6x speedup over the baseline C program. In a follow-up post, he raised the speedup further to an impressive 13.5x by rewriting the program from scratch in assembly. This certainly proves his point, but can we do even better?

TLDR: Our final program achieved a speedup of 128x (36 GiB/s throughput) by reformulating the problem and leveraging SIMD intrinsics.

We will be using Rust (rustc 1.70.0) for our programs. All benchmarks are done on a MacBook Pro 2021 (Apple M1 Pro). Code is available at the Github repo.

The problem

Consider a random string consisting of the characters s and p. We want to write a function that iterates over the string, incrementing an accumulator when it sees an s and decrementing it when it sees a p. The value of the accumulator at the end is our result.

fn baseline(input: &str) -> i64 {
    let mut res = 0;
    for b in input.bytes() {
        match b {
            b's' => res += 1,
            b'p' => res -= 1,
            _ => continue,
        }
    }
    res
}

Apart from the language choice and the lack of a \0 null terminator check, our function is otherwise identical to the baseline C program from the original blog post. Using the same benchmark input setup [1], our program took 3.3316 ms to execute with a throughput of 286.25 MiB/s. This is in the same ballpark as the author's version, which has a throughput of 126.3 MiB/s on my machine.

| Time      | Throughput   |
|-----------|--------------|
| 3.3316 ms | 286.25 MiB/s |
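For reproducibility, here is a minimal sketch of the measurement setup, assuming the criterion and rand crates; the names gen_input and bench_baseline are illustrative, not taken from the repo:

use criterion::{criterion_group, criterion_main, Criterion, Throughput};
use rand::Rng;

// 1,000,000 random characters, with 's' and 'p' equally likely (see footnote 1)
fn gen_input(len: usize) -> String {
    let mut rng = rand::thread_rng();
    (0..len).map(|_| if rng.gen::<bool>() { 's' } else { 'p' }).collect()
}

fn bench_baseline(c: &mut Criterion) {
    let input = gen_input(1_000_000);
    let mut group = c.benchmark_group("spc");
    // Report throughput (bytes processed per iteration) alongside wall time
    group.throughput(Throughput::Bytes(input.len() as u64));
    group.bench_function("baseline", |b| b.iter(|| baseline(&input)));
    group.finish();
}

criterion_group!(benches, bench_baseline);
criterion_main!(benches);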

Going up the ladder of abstraction

Optimizing programs does not always require dropping down to assembly; on the contrary, expressing a computation more abstractly gives the compiler more freedom to optimize our code.

fn opt1_idiomatic(input: &str) -> i64 {
    input
        .bytes()
        .map(|b| match b {
            b's' => 1,
            b'p' => -1,
            _ => 0,
        })
        .sum()
}

The logic is rewritten using iterator combinators, a way of expressing computation over a stream of elements in a functional style. In theory, iterator combinators are a zero-cost abstraction in Rust and the compiler should be able to generate tight assembly that we can't handwrite any better. Indeed, we got a speedup of 14.7x over baseline just by rewriting our code in idiomatic Rust!

| Time      | Throughput   | Relative speed |
|-----------|--------------|----------------|
| 227.33 µs | 4.0968 GiB/s | 14.7           |

Where did this increase in performance come from? A quick glance over the assembly output reveals that the compiler aggressively unrolled and vectorized our code. Of the 165 lines of assembly, 102 are SIMD instructions. It would be tough to beat this vector monstrosity.
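If you want to glance at the generated assembly yourself, one convenient route (a tooling assumption on my part, not necessarily what was used here) is the cargo-show-asm crate, plus an attribute to keep the function easy to locate:

// Prevent inlining so the function is emitted as its own symbol in the listing
#[inline(never)]
pub fn opt1_idiomatic(input: &str) -> i64 {
    input.bytes().map(|b| match b { b's' => 1, b'p' => -1, _ => 0 }).sum()
}

// Inspect with: cargo install cargo-show-asm
//               cargo asm --lib opt1_idiomatic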

Dropping branches

Continuing at a high level of abstraction, let's see if there are any optimization opportunities that the compiler cannot spot.

What is better than two branches? How about one! According to the problem description, the input string only consists of s and p. This gives us two insights:

  • A character is either s or not s.
  • len(input) == count(input, 's') + count(input, 'p')

The function is therefore equivalent to f(input) = count(input, 's') - (len(input) - count(input, 's')), which simplifies to 2 * count(input, 's') - len(input). In other words, the return value is the count of s minus the count of p, and we only need to count the s characters.

fn opt2_count_s(input: &str) -> i64 {
    let n_s = input.bytes().filter(|&b| b == b's').count();
    (2 * n_s) as i64 - input.len() as i64
}

| Time      | Throughput   | Relative speed |
|-----------|--------------|----------------|
| 152.44 µs | 6.1096 GiB/s | 21.9           |

This algorithmic improvement yielded a 49% improvement over our idiomatic solution. Before moving on, we could further micro-optimize our loop body to be totally branchless. Conceptually, our filter + count combo branches when it sees an s in order to increment an accumulator; in practice, the compiler was able to optimize the branch away. Nonetheless, let's implement this manually. Observe that the ASCII values for s and p are 115 and 112 respectively, or 1110011 and 1110000 in binary. We can convert each byte into an s count by doing a bitwise AND against 0000001 and summing up the resulting 0s and 1s.

fn opt3_count_s_branchless(input: &str) -> i64 {
    let n_s = input.bytes().map(|b| (b & 1) as i64).sum::<i64>();
    (2 * n_s) - input.len() as i64
}

| Time      | Throughput   | Relative speed |
|-----------|--------------|----------------|
| 72.902 µs | 12.775 GiB/s | 45.7           |

We beat the compiler again, and by a wide margin! We took advantage of the domain knowledge that a character can only take on two values, so we do not need an equality comparison at all. While a compare instruction takes no more time to execute than a bitwise AND on the ARM architecture, the former returns a boolean mask with all bits set to 1 for true and 0 for false; additional instructions or branches would be needed to make use of that result.
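As a quick sanity check (a hypothetical test, not from the original post, reusing the gen_input sketch from earlier), all the variants so far should agree with the baseline on a random input:

#[test]
fn variants_agree() {
    let input = gen_input(1_000_000);
    let expected = baseline(&input);
    assert_eq!(opt1_idiomatic(&input), expected);
    assert_eq!(opt2_count_s(&input), expected);
    assert_eq!(opt3_count_s_branchless(&input), expected);
    // The bit trick relies on the input containing only 's' and 'p':
    assert_eq!(b's' & 1, 1); // 1110011 & 0000001
    assert_eq!(b'p' & 1, 0); // 1110000 & 0000001
}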

Into the SIMD-verse

So far we have steered clear of tweaking assembly, and for good reason: the compiler did the heavy lifting of vectorization and loop unrolling for us to great success. Let's now see if we can hand-code our tight loop better than the compiler. Instead of rewriting our program in assembly from scratch or using inline assembly, we will use compiler intrinsics, which let us manually emit vector instructions by way of library functions.

Our code uses platform-specific intrinsics for the aarch64 platform, but the same optimizations can be translated to x64 and other architectures. Apple Silicon processors, which use the ARM architecture, have a SIMD extension known as Neon. Its registers are 128 bits wide, allowing us to fit and operate on 16 8-bit elements at once.

use std::arch::aarch64::*;

fn opt4_simd(input: &str) -> i64 {
    let n = input.len();
    const N_LANES: usize = 16;
    let n_iters = n / N_LANES;
    let n_simd_elems = n_iters * N_LANES;
    let mut res = 0;
    // Intrinsics are unsafe
    unsafe {
        // All the intrinsics operate on the type uint8x16_t
        // Set all lanes to 0
        let mut acc_v = vmovq_n_u8(0);
        // Set all lanes to 1
        let one_v = vmovq_n_u8(1);
        // Process 16 characters per iteration
        for block_i in 0..n_iters {
            // Load 16 bytes from memory to vector register
            let input_v = vld1q_u8(input[block_i * N_LANES..].as_ptr());
            // Vector bitwise and
            let eq_s_v = vandq_u8(input_v, one_v);
            // Vector add
            acc_v = vaddq_u8(acc_v, eq_s_v);
            // Flush accumulator every 128 iterations to avoid overflow
            if block_i % 128 == 127 {
                // Reduce 16 u8 horizontally to a u16 scalar
                res += vaddlvq_u8(acc_v) as i64;
                acc_v = vmovq_n_u8(0);
            }
        }
        res += vaddlvq_u8(acc_v) as i64;
    }
    res = (2 * res) - n_simd_elems as i64;
    // Process leftovers
    res + baseline(&input[n_simd_elems..])
}

The function is logically the same as our previous algorithm, the only change being that we flush our vector accumulator every 128 iterations [2] to prevent overflow: each u8 lane increases by at most 1 per iteration, so an unflushed lane would wrap around after 256 increments. Since every loop iteration consumes 16 characters, there might be some leftover if the length of the input is not a multiple of 16. In that case we simply process the remaining bytes with our baseline function.

| Time      | Throughput   | Relative speed |
|-----------|--------------|----------------|
| 43.131 µs | 21.593 GiB/s | 77.2           |

Somehow our manually vectorized version is 69% faster than the compiler-vectorized version. The assembly listing shows that our hot loop is 18 instructions whilst the automatic one is 34, a 47% reduction. The "branchyness" of our version likely accounts for the remaining difference between runtime and instruction count. The compiler-vectorized inner loop, while fully branchless, appears to aggregate the s count into wider and wider SIMD lanes every iteration to avoid integer overflow. On the other hand, our version only aggregates every 128 iterations, directly into a scalar register.

While we gained better vectorization, the compiler stopped unrolling our loop. No matter, let's do that ourselves. To test different unroll factors, we will use Rust's macros to unroll our loop by various amounts. Feel free to skip over the code.

use std::arch::aarch64::*;
use seq_macro::seq;

macro_rules! simd_unrolled {
    ($func_name:ident, $unroll_factor:literal) => {
         pub fn $func_name(input: &str) -> i64 {
            let n = input.len();
            const N_LANES: usize = 16;
            const N_ELEM_PER_ITER: usize = N_LANES * $unroll_factor;
            let n_iters = n / N_ELEM_PER_ITER;
            let n_simd_elems = n_iters * N_ELEM_PER_ITER;
            let mut res = 0;
            unsafe {
                seq!(I in 0..$unroll_factor {
                    let mut v_acc~I = vmovq_n_u8(0);
                });
                let one_v = vmovq_n_u8(1);
                for block_i in 0..n_iters {
                    let offset = block_i * N_LANES * $unroll_factor;
                    seq!(I in 0..$unroll_factor {
                        let v_input~I = vld1q_u8(input[offset + I * N_LANES..].as_ptr());
                        let v_eq_s~I = vandq_u8(v_input~I, one_v);
                        v_acc~I = vaddq_u8(v_acc~I, v_eq_s~I);
                    });
                    if block_i % 128 == 127 {
                        seq!(I in 0..$unroll_factor {
                            res += vaddlvq_u8(v_acc~I) as i64;
                            v_acc~I = vmovq_n_u8(0);
                        });
                    }
                }
                seq!(I in 0..$unroll_factor {
                    res += vaddlvq_u8(v_acc~I) as i64;
                });
            }
            res = (2 * res) - n_simd_elems as i64;
            res + baseline(&input[n_simd_elems..])
        }
    }
}

simd_unrolled!(opt5_simd_unrolled_2x, 2);
simd_unrolled!(opt5_simd_unrolled_4x, 4);
// ...

| Unroll factor | Time      | Throughput   | Relative speed |
|---------------|-----------|--------------|----------------|
| 2x            | 32.810 µs | 28.385 GiB/s | 101.5          |
| 4x            | 28.524 µs | 32.650 GiB/s | 116.8          |
| 8x            | 26.518 µs | 35.120 GiB/s | 125.6          |
| 10x           | 26.070 µs | 35.724 GiB/s | 127.8 🎉       |
| 12x           | 27.833 µs | 33.461 GiB/s | 119.7          |
| 16x           | 27.157 µs | 34.293 GiB/s | 122.7          |

After experimenting with different unroll factors, we found 10 to be the sweet spot, bringing us to an impressive 127.8x speedup over our starting point. Fun fact: our throughput of 35.724 GiB/s is equivalent to processing ~12 characters per clock cycle, assuming the processor is running at its peak 3.22 GHz clock speed. Not too bad considering we didn't micro-micro-optimize instruction selection and ordering.
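For the curious, the back-of-envelope arithmetic behind that figure (taking GiB as 2^30 bytes and 3.22 GHz as the M1 Pro's peak performance-core clock):

35.724 GiB/s × 2^30 B/GiB ≈ 3.84 × 10^10 bytes/s
3.84 × 10^10 bytes/s ÷ 3.22 × 10^9 cycles/s ≈ 11.9 characters per cycle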

Conclusion

We obtained a 128 times performance improvement by exploiting domain knowledge of the problem as well as fixing suboptimal compiler output. There is no doubt that there are still opportunities left on the table; in fact, it would be trivial to gain up to 4 times more performance by switching to wider vector architectures such as AVX-512 [3]. However, I am satisfied with the amount of speedup achieved while keeping the code relatively readable.


Discussions: reddit, lobsters

After the post was released, some readers managed to come up with more performant versions without using explicit SIMD intrinsics. The best one achieved a speedup of 290x (81 GiB/s) 🚀. See code here.


  1. A 1,000,000-character string generated randomly, with s and p having equal probability.

  2. Flushing every 255 iterations makes more sense intuitively, as it would require the least amount of work; however, empirical testing shows that an aligned value such as 128 provides the best performance. The code listing originally used a value of 256, but this is wrong as it overflows in the case of 256 consecutive s. Credit to @PeterFaiman for pointing this out.

  3. It is a shame that Rust's experimental portable SIMD doesn't quite support all of our use cases. Otherwise we could have trivially supported multiple architectures.