Branch Prediction Friendliness

In this section, we introduce a few tricks to make code friendlier to the branch predictor: sorting the data, replacing branches with bitwise operations, and choosing between && and &.

Sorting

Let us first look at the following program to see branch prediction in action:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int cmpfunc(const void *a, const void *b) { return (*(int *)a - *(int *)b); }

int main() {
    /* Generate data */
    const unsigned arraySize = 32768;
    int data[arraySize];
    srand((int)123);

    for (unsigned c = 0; c < arraySize; ++c)
        data[c] = rand() % 256;

    /* !!! With this, the next loop runs faster */
    /* qsort(data, arraySize, sizeof(int), cmpfunc); */

    /* Test */
    clock_t start = clock();
    long long sum = 0;

    for (unsigned i = 0; i < 100000; ++i) {
        /* Primary loop */
        for (unsigned c = 0; c < arraySize; ++c) {
            if (data[c] >= 128)
                sum += data[c];
        }
    }
    double elapsedTime = ((double)(clock() - start)) / CLOCKS_PER_SEC;

    printf("elapsedTime = %f\n", elapsedTime);
    printf("sum = %llu\n", sum);
}

The following line of code sorts the data array, which has a great impact on the branch miss rate; we can use perf to verify this.

qsort(data, arraySize, sizeof(int), cmpfunc);
  • without sorting:

  • with sorting:

Why does sorting make a difference?

  • At the processor level, an if-statement is compiled into a conditional branch instruction. When the processor encounters a branch, it does not know which way the branch will go until the condition is actually evaluated, so it guesses the direction in advance. This is branch prediction.

    • If it guesses right, execution continues without interruption.

    • If it guesses wrong, the pipeline is flushed and execution rolls back to the branch, then restarts along the other path.

  • If it guesses right every time, the execution will never have to stop. If it guesses wrong too often, the computer spends a lot of time stalling, rolling back, and restarting.

  • To minimize the number of wrong guesses, the processor looks at the branch's past history. If the branch went "left" 99% of the time, it guesses left; if it went "right" 99% of the time, it guesses right. If the branch has gone the same way three times in a row, it will likely guess that way again. In other words, the predictor tries to identify a pattern and follow it; this is more or less how branch predictors work (see the toy sketch after this list).

  • Most applications have well-behaved branches, so modern branch predictors typically achieve hit rates above 90%. But when faced with unpredictable branches that have no recognizable pattern, branch predictors are virtually useless. This is why sorting leads to far fewer wrong guesses: once the data is sorted, data[c] >= 128 is false for a long run of elements and then true for the rest, a pattern the predictor learns easily.
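
As a toy illustration of the "look at the past history" idea (not the actual design of any specific CPU), the sketch below models a classic 2-bit saturating counter predictor: a correct guess pushes the counter further in that direction, a wrong guess moves it one step back, so a single anomaly in a long run does not flip the prediction.

#include <stdio.h>

/* Toy 2-bit saturating counter: states 0..3.
   States 0,1 predict "not taken"; states 2,3 predict "taken". */
static int counter = 2; /* start at "weakly taken" */

static int predict(void) { return counter >= 2; }

static void update(int taken) {
    if (taken && counter < 3) counter++;
    if (!taken && counter > 0) counter--;
}

int main(void) {
    /* A predictable pattern, like the branch over a sorted array:
       a long run of "taken" followed by a long run of "not taken". */
    int pattern[10] = {1, 1, 1, 1, 1, 0, 0, 0, 0, 0};
    int hits = 0;

    for (int i = 0; i < 10; ++i) {
        hits += (predict() == pattern[i]);
        update(pattern[i]);
    }
    printf("correct predictions: %d / 10\n", hits); /* 8 / 10 for this pattern */
    return 0;
}

On random data with no pattern, a predictor like this is right only about half of the time, which is exactly the unsorted case above.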

Binary operations

If the compiler isn't able to optimize the branch into a conditional move, we can try a hack that replaces the branch with bitwise operations:

int t = (data[c] - 128) >> 31;
sum += ~t & data[c];

How does this code work?

  • if data[c] >= 128, then t = 0 and ~t & data[c] = data[c].

  • if data[c] < 128, then t = -1 and ~t & data[c] = 0.

Therefore, the above two lines behave the same as the if-statement. The trick leverages the fact that negative numbers have their most significant bit set (two's complement), so the arithmetic right shift >> 31 leaves t with all bits 1 when data[c] < 128 and all bits 0 otherwise. After ~, the mask becomes all 0s for values below 128, wiping them out, and all 1s for values of 128 and above, letting them pass through unchanged.
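
To make the mask concrete, the small standalone program below (not part of the benchmark above; written just for illustration) prints t and ~t & v for one value on each side of 128. Note that right-shifting a negative int is implementation-defined in C; on common compilers it is an arithmetic shift, which is what this trick relies on.

#include <stdio.h>

int main(void) {
    int samples[2] = {100, 200}; /* one value below 128, one at or above */

    for (int i = 0; i < 2; ++i) {
        int v = samples[i];
        int t = (v - 128) >> 31; /* 0 if v >= 128, -1 (all bits 1) if v < 128 */
        printf("v = %3d  t = %2d  ~t & v = %d\n", v, t, ~t & v);
    }
    return 0;
}

The output shows 0 for 100 (masked away) and 200 for 200 (passed through), matching the behavior of the if-statement.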

Refer to the references if you would like to know the operators (>>, ~, &) in detail.

Now the CPU time with sorting is almost equal to that without sorting:

  • without sorting:

  • with sorting:

You may find that the with-sorting version now has more branch misses; these come from the sorting procedure (qsort) itself, not from the summation loop.
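
If the bit trick feels too obscure, another branch-free formulation is the ternary form below. Many compilers lower it to a conditional move (cmov) at higher optimization levels, but this is not guaranteed, so it is worth checking the generated assembly.

/* Branch-free alternative for the body of the primary loop:
   often compiled to a conditional move instead of a branch,
   depending on the compiler and optimization flags. */
sum += (data[c] >= 128) ? data[c] : 0;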

Short circuiting (Branch prediction with && and &)

Both && and & can be used to combine Boolean expressions (i.e., as a logical AND). In the following pseudocode, statement 1 is executed only when condition1 and condition2 (and any further conditions) are all true; otherwise statement 2 is executed.

if (condition1 && condition2 ...) {
    statement 1;
} else {
    statement 2;
}

if (condition1 & condition2 ...) {
    statement 1;
} else {
    statement 2;
}

However, there is a difference between them. && is "short-circuiting": an expression stops being evaluated as soon as its outcome is determined. So if condition1 is false, condition2 is not evaluated at all, unlike &, which always evaluates both operands, as the small demo below shows.
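
A quick way to see the difference is to give the second condition a visible side effect. The helper name second() below is made up purely for illustration:

#include <stdio.h>

/* Hypothetical helper with a visible side effect. */
static int second(void) {
    printf("second() evaluated\n");
    return 1;
}

int main(void) {
    int condition1 = 0; /* the first condition is false */

    if (condition1 && second()) /* second() is never called: && short-circuits */
        printf("both true\n");

    if (condition1 & second())  /* second() is still called: & evaluates both operands */
        printf("both true\n");

    return 0;
}

Running this prints "second() evaluated" exactly once, from the & version.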

In cases where the first condition is mostly false and evaluating the remaining conditions is costly, this early stop of && can speed up the program.

However, when the conditions are cheap to evaluate (as in the following case), the speedup is not guaranteed.

The following two programs compare the impact of operators && and & on the branch miss rate.

  • operator &&:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

const int n = 100000;

int filter(int *A, int *B, int *C, int a, int b, int c) {
    int i, j = 0;
    for (i = 0; i < n; ++i) {
        if (A[i] > a && B[i] > b && C[i] > c)
            j++;
    }

    return j;
}

int main() {
    srand(time(0));

    int A[100000] = {0};
    int B[100000] = {0};
    int C[100000] = {0};

    for (int i = 0; i < n; ++i) {
        int a = rand() / (RAND_MAX * 1.0) * 10;
        int b = rand() / (RAND_MAX * 1.0) * 10;
        int c = rand() / (RAND_MAX * 1.0) * 10;

        A[i] = a;
        B[i] = b;
        C[i] = c;
    }

    int filNum = filter(A, B, C, 2, 2, 2);

    printf("%d\n", filNum);

    return 0;
}
  • operator &:

The program for this case simply replaces the condition with

A[i] > a & B[i] > b & C[i] > c
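
For completeness, a sketch of the corresponding filter function (mirroring filter() above; the name filter_bitand is made up) might look like:

/* Same as filter() above, but with the non-short-circuiting & operator:
   all three comparisons are always evaluated.
   Relational operators bind tighter than &, so no extra parentheses are
   required, though compilers may warn and suggest them for readability. */
int filter_bitand(int *A, int *B, int *C, int a, int b, int c) {
    int i, j = 0;
    for (i = 0; i < n; ++i) {
        if (A[i] > a & B[i] > b & C[i] > c)
            j++;
    }
    return j;
}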

For if (A[i] > a && B[i] > b && C[i] > c), there are three branches to predict; suppose each is predicted correctly 80% of the time. Then the probability that the CPU makes at least one wrong prediction per iteration is 1 - 0.8 * 0.8 * 0.8 ≈ 0.5.

For if (A[i] > a & B[i] > b & C[i] > c), there is only one branch to predict per iteration; it is taken only when all three comparisons hold, so its prediction success rate is roughly 0.8 * 0.8 * 0.8 ≈ 0.5.

Therefore, the & version executes only one branch per iteration and incurs fewer branch misses overall than the && version (though the performance gain is marginal in this case).

References:
