Branchless programming

Overview

a branch is any conditional statement in code (if, switch, conditional operator, etc.)
- any time the CPU can take one of multiple paths
“branches” are slow in theory:
- usually the CPU reads code in linear fashion and tries to stay ahead of the execution by loading/preparing instructions before they’re run
- “branches” take away the CPU’s ability to “read ahead” because the CPU doesn’t know which code will be executed until the condition is evaluated
- to mitigate this, the CPU guesses which “path” to load
  - if the CPU is correct, there’s no performance hit
  - if the CPU is wrong, the CPU must discard the preparations it made based on its guess and then prepare and run the correct code
  - that discard/reload action slows down the runtime
branchless programming is a technique that minimizes branches in order to improve runtime efficiency

branching “smaller” function:

int smaller(int a, int b)
{
  if (a < b)
    return a;
  else
    return b;
}

branchless version using arithmetic with conditional operators (actually slower, see below):

int smaller_branchless(int a, int b)
{
  return a * (a < b) + b * (b < a);
}

A lesson on premature optimization

assembly generated by the compiler for branching “smaller” function:

mov ebx,dword ptr [b]
cmp dword ptr[a],ebx
cmovl ebx,dword ptr [a]

AKA:

move pointer b into register ebx
compare the value in ebx with a
conditionally move a into ebx if it’s less

In other words, the compiler recognized what we were trying to do and actually wrote a conditional into the assembly code.

If we were to translate our branchless example into straight assembly, it would look something like this:

mov ecx,dword ptr [a]
xor r8d,r8d
mov eas,dword ptr [b]
mov ebx,r8d
cmp eax,ecx
setle bl
imul ebx,eax
cmp ecx,eax
setl r8b
imul r8d,ecx
add ebx,r8d

this is clearly less efficient
branchless programming isn’t always faster or more efficient
compilers are aware of branchless programming and will try to optimize the assembly code
LESSON: no premature optimization, just write natural/simple code and let the compiler do its tricks… then check the result and optimize the compiled code if necessary

An example of performant branchless code

branching toUpper implementation:

void toUpper(char* d, int count)
{
  for (int i = 0; i < count; i++)
    if (d[i] >= 'a' && d[i] <= 'z')
      d[i] -= 32;
}

upon disassembly, this is very slow. Many ja (jump if above) commands.

branchless implementation:

void toUpper_branchless(char* d, int count)
{
  for (int i = 0; i < count; i++)
    d[i] = (d[i] * !(d[i] >= 'a' && d[i] <= 'z')) + (d[i] - 32) * (d[i] >= 'a' && d[i] <= 'z');
}

it’s not pretty and it’s way too clever (not clean code), but it’s 3x faster.

marginally cleaner branchless implementation:

void toUpper_branchless(char* d, int count)
{
  for (int i = 0; i < count; i++)
    d[i] -= 32 * (d[i] >= 'a' && d[i] <= 'z');
}

explanation of this code:
- the code iterates through the chars after the char* and, if the character code falls within [a-z] (lowercase alphabet characters), the paranthesis statement returns 1, which is multiplied by 32 in order to get the new character code of the current character.

Review

Branches are conditionals and potentially slow code down
Branchless programming isn’t a “golden hammer” and shouldn’t always be used
Note that it’s important to check the compiled code in order to ensure the code is actually optimal