Branchless programming
Overview
- a branch is any conditional statement in code (if, switch, conditional operator, etc.)
- any time the CPU can take one of multiple paths
 
- “branches” are slow in theory:
- usually the CPU reads code in linear fashion and tries to stay ahead of the execution by loading/preparing instructions before they’re run
- “branches” take away the CPU’s ability to “read ahead” because the CPU doesn’t know which code will be executed until the condition is evaluated
- to mitigate this, the CPU guesses which “path” to load
- if the CPU is correct, there’s no performance hit
- if the CPU is wrong, the CPU must discard the preparations it made based on its guess and then prepare and run the correct code
- that discard/reload action slows down the runtime
 
 
- branchless programming is a technique that minimizes branches in order to improve runtime efficiency
branching “smaller” function:
int smaller(int a, int b) { if (a < b) return a; else return b; }
branchless version using arithmetic with conditional operators (actually slower, see below):
int smaller_branchless(int a, int b) { return a * (a < b) + b * (b < a); }
A lesson on premature optimization
assembly generated by the compiler for branching “smaller” function:
mov ebx,dword ptr [b] cmp dword ptr[a],ebx cmovl ebx,dword ptr [a]
AKA:
- move pointer binto registerebx
- compare the value in ebxwitha
- conditionally move aintoebxif it’s less
In other words, the compiler recognized what we were trying to do and actually wrote a conditional into the assembly code.
If we were to translate our branchless example into straight assembly, it would look something like this:
mov ecx,dword ptr [a] xor r8d,r8d mov eas,dword ptr [b] mov ebx,r8d cmp eax,ecx setle bl imul ebx,eax cmp ecx,eax setl r8b imul r8d,ecx add ebx,r8d
- this is clearly less efficient
- branchless programming isn’t always faster or more efficient
- compilers are aware of branchless programming and will try to optimize the assembly code
- LESSON: no premature optimization, just write natural/simple code and let the compiler do its tricks… then check the result and optimize the compiled code if necessary
An example of performant branchless code
branching toUpper implementation:
void toUpper(char* d, int count)
{
  for (int i = 0; i < count; i++)
    if (d[i] >= 'a' && d[i] <= 'z')
      d[i] -= 32;
}
- upon disassembly, this is very slow. Many ja(jump if above) commands.
branchless implementation:
void toUpper_branchless(char* d, int count)
{
  for (int i = 0; i < count; i++)
    d[i] = (d[i] * !(d[i] >= 'a' && d[i] <= 'z')) + (d[i] - 32) * (d[i] >= 'a' && d[i] <= 'z');
}
- it’s not pretty and it’s way too clever (not clean code), but it’s 3x faster.
marginally cleaner branchless implementation:
void toUpper_branchless(char* d, int count)
{
  for (int i = 0; i < count; i++)
    d[i] -= 32 * (d[i] >= 'a' && d[i] <= 'z');
}
- explanation of this code:
- the code iterates through the chars after the char* and, if the character code falls within [a-z] (lowercase alphabet characters), the paranthesis statement returns 1, which is multiplied by 32 in order to get the new character code of the current character.
 
- the code iterates through the chars after the char* and, if the character code falls within [a-z] (lowercase alphabet characters), the paranthesis statement returns 
Review
- Branches are conditionals and potentially slow code down
- Branchless programming isn’t a “golden hammer” and shouldn’t always be used
- Note that it’s important to check the compiled code in order to ensure the code is actually optimal