Branchless programming - jaredgorski.org

Back to notes

Branchless programming

Keywords: code, programming, techniques, performance
Date:

Overview

  • a branch is any conditional statement in code (if, switch, conditional operator, etc.)
    • any time the CPU can take one of multiple paths
  • "branches" are slow in theory:
    • usually the CPU reads code in linear fashion and tries to stay ahead of the execution by loading/preparing instructions before they're run
    • "branches" take away the CPU's ability to "read ahead" because the CPU doesn't know which code will be executed until the condition is evaluated
    • to mitigate this, the CPU guesses which "path" to load
      • if the CPU is correct, there's no performance hit
      • if the CPU is wrong, the CPU must discard the preparations it made based on its guess and then prepare and run the correct code
      • that discard/reload action slows down the runtime
  • branchless programming is a technique that minimizes branches in order to improve runtime efficiency

branching "smaller" function:

int smaller(int a, int b)
{
  if (a < b)
    return a;
  else
    return b;
}

branchless version using arithmetic with conditional operators (actually slower, see below):

int smaller_branchless(int a, int b)
{
  return a * (a < b) + b * (b < a);
}

A lesson on premature optimization

assembly generated by the compiler for branching "smaller" function:

mov ebx,dword ptr [b]
cmp dword ptr[a],ebx
cmovl ebx,dword ptr [a]

AKA:

  1. move pointer b into register ebx
  2. compare the value in ebx with a
  3. conditionally move a into ebx if it's less

In other words, the compiler recognized what we were trying to do and actually wrote a conditional into the assembly code.

If we were to translate our branchless example into straight assembly, it would look something like this:

mov ecx,dword ptr [a]
xor r8d,r8d
mov eas,dword ptr [b]
mov ebx,r8d
cmp eax,ecx
setle bl
imul ebx,eax
cmp ecx,eax
setl r8b
imul r8d,ecx
add ebx,r8d
  • this is clearly less efficient
  • branchless programming isn't always faster or more efficient
  • compilers are aware of branchless programming and will try to optimize the assembly code
  • LESSON: no premature optimization, just write natural/simple code and let the compiler do its tricks... then check the result and optimize the compiled code if necessary

An example of performant branchless code

branching toUpper implementation:

void toUpper(char* d, int count)
{
  for (int i = 0; i < count; i++)
    if (d[i] >= 'a' && d[i] <= 'z')
      d[i] -= 32;
}
  • upon disassembly, this is very slow. Many ja (jump if above) commands.

branchless implementation:

void toUpper_branchless(char* d, int count)
{
  for (int i = 0; i < count; i++)
    d[i] = (d[i] * !(d[i] >= 'a' && d[i] <= 'z')) + (d[i] - 32) * (d[i] >= 'a' && d[i] <= 'z');
}
  • it's not pretty and it's way too clever (not clean code), but it's 3x faster.

marginally cleaner branchless implementation:

void toUpper_branchless(char* d, int count)
{
  for (int i = 0; i < count; i++)
    d[i] -= 32 * (d[i] >= 'a' && d[i] <= 'z');
}
  • explanation of this code:
    • the code iterates through the chars after the char* and, if the character code falls within [a-z] (lowercase alphabet characters), the paranthesis statement returns 1, which is multiplied by 32 in order to get the new character code of the current character.

Review

  • Branches are conditionals and potentially slow code down
  • Branchless programming isn't a "golden hammer" and shouldn't always be used
  • Note that it's important to check the compiled code in order to ensure the code is actually optimal