# Round to next Power of 2

The task to optimize today is to calculate the smallest power of two greater than or equal to a given number, i.e. to round a number up to the next power of two. The first step when given such a task is to see if there is an obvious way of doing it. Having correct code is more important than having fast, incorrect code.

The obvious way to do this is something much like:
``````
unsigned next_pow2(unsigned x)
{
    unsigned y = 1;

    if (!x) return 0;

    /* Keep doubling y until it reaches or passes x.  The strict '<'
       ensures exact powers of two round to themselves. */
    while (y < x) y += y;

    return y;
}
``````
Using a simple benchmark program:
``````
int main(void)
{
    unsigned x;

    for (x = 0; x < (1 << 28); x++)
    {
        next_pow2(x);
    }

    return 0;
}
``````

we can test to see how long this takes. Unfortunately, the compiler is a little too smart for us, and inlines the routine we want to benchmark into `main()`. Without the function call, the results are slightly tainted. To fix this, add `__attribute__((noinline))` to the definition of `next_pow2()`. The result is a program that runs in 8.3s on my machine.

To go faster we need a better algorithm. The current algorithm works fairly well if `x` is small. However, the number of loop iterations grows with the bit-length of the input, i.e. logarithmically in `x`, leading to longer times for larger inputs. Is there a constant-time algorithm we could use instead?

The trick is to give the compiler more information. The more information it has, the better it optimizes. The compiler has intrinsic functions that we may use, and one in particular, `__builtin_clz()`, looks promising. This function counts the leading zeros in the binary expansion of a number. A version of `next_pow2()` using this intrinsic is:

``````
static __attribute__((noinline)) unsigned next_pow2(unsigned x)
{
    if (x <= 2) return x;

    return (1ULL << 32) >> __builtin_clz(x - 1);
}
``````

Benchmarking this yields an immediate improvement: the program now takes a total of 0.000s to run. This is too good to be true. The compiler is very smart, and has realized that `next_pow2()` does not modify memory. Since its output is ignored by the benchmarking loop, the loop itself can be elided. The result is the equivalent of just `return 0;` for `main()`.

To prevent this, modify the benchmarking loop into something which does not ignore the results:
``````
#include <stdio.h>

int main(void)
{
    unsigned x;
    unsigned y = 0;

    for (x = 0; x < (1 << 30); x++)
    {
        y += next_pow2(x);
    }

    printf("Total = %u\n", y);

    return 0;
}
``````
This runs in 1.911s. Note that this loop executes four times as many iterations (`1 << 30` rather than `1 << 28`) and also sums the results, so per call this is well over an order of magnitude faster than the original code. Why is this so fast? What is gcc doing for `__builtin_clz()`? The assembly for the function is:
``````
0x00000000004004f0 <next_pow2+0>:       cmp    $0x2,%edi
0x00000000004004f3 <next_pow2+3>:       mov    %edi,%eax
0x00000000004004f5 <next_pow2+5>:       jbe    0x40050d <next_pow2+29>
0x00000000004004f7 <next_pow2+7>:       sub    $0x1,%eax
0x00000000004004fa <next_pow2+10>:      bsr    %eax,%ecx
0x00000000004004fd <next_pow2+13>:      mov    $0x100000000,%rax
0x0000000000400507 <next_pow2+23>:      xor    $0x1f,%ecx
0x000000000040050a <next_pow2+26>:      shr    %cl,%rax
0x000000000040050d <next_pow2+29>:      repz retq
``````

As we can see, gcc is using the `bsr` instruction to find the leading 1, and then manipulating its result to get the count of leading zeros. The problem is that the compiler isn't smart enough to realize that it is really the position of the leading 1 that we are after. To get better code we'll need to use some inline assembly:

``````
static inline unsigned flsu(unsigned x)
{
    unsigned result;

    asm ("bsr %[x], %[result]"
         : [result] "=r" (result)
         : [x] "mr" (x));

    return result;
}

static __attribute__((noinline)) unsigned next_pow2(unsigned x)
{
    unsigned y;

    if (x <= 2) return x;

    y = flsu(x - 1);
    return (2 << y);
}
``````
which results in the compiled code:
``````
0x00000000004004f0 <next_pow2+0>:       cmp    $0x2,%edi
0x00000000004004f3 <next_pow2+3>:       mov    %edi,%eax
0x00000000004004f5 <next_pow2+5>:       jbe    0x400504 <next_pow2+20>
0x00000000004004f7 <next_pow2+7>:       sub    $0x1,%eax
0x00000000004004fa <next_pow2+10>:      bsr    %eax,%ecx
0x00000000004004fd <next_pow2+13>:      mov    $0x2,%eax
0x0000000000400502 <next_pow2+18>:      shl    %cl,%eax
0x0000000000400504 <next_pow2+20>:      repz retq
``````

The resulting code is an instruction shorter. However, it isn't faster, still taking 1.911s. The `mov` and `xor` instructions in the earlier version could overlap in execution, so eliminating one of them gains nothing.

Perhaps we can gain something by going branchless? Mispredicted branches are notoriously slow, so we may be able to go faster by eliminating the check for `x <= 2` somehow. This is possible once we realize that the non-branching part of the code above still gives the correct answer when `x == 0` or `x == 2`, making `x == 1` the only special case. We can handle that case without branching, yielding the code

``````
static __attribute__((noinline)) unsigned next_pow2(unsigned x)
{
    unsigned y = (x != 1);

    return (y + 1) << flsu(x - y);
}
``````
Giving
``````
0x00000000004004f0 <next_pow2+0>:       xor    %eax,%eax
0x00000000004004f2 <next_pow2+2>:       cmp    $0x1,%edi
0x00000000004004f5 <next_pow2+5>:       setne  %al
0x00000000004004f8 <next_pow2+8>:       sub    %eax,%edi
0x00000000004004fd <next_pow2+13>:      bsr    %edi,%ecx
0x0000000000400500 <next_pow2+16>:      shl    %cl,%eax
0x0000000000400502 <next_pow2+18>:      retq
``````

as the function in assembly. This is indeed slightly faster, clocking in at 1.908s, but the difference is well within the noise, so it is debatable whether this trick helped beyond making the generated code two bytes smaller.

Can we go faster? The next trick is to do some research. The book "Hacker's Delight" is full of useful algorithms for problems like this. On page 48 it offers an improvement based on the fact that if you set every bit below the highest set bit of a number, the result is one less than a power of two. For example, smearing the top bit of 36 (`100100` in binary) downwards gives 63 (`111111`), and adding one yields the power of two 64.

``````
static __attribute__((noinline)) unsigned next_pow2(unsigned x)
{
    x -= 1;

    /* Smear the highest set bit into every lower position */
    x |= (x >> 1);
    x |= (x >> 2);
    x |= (x >> 4);
    x |= (x >> 8);
    x |= (x >> 16);

    return x + 1;
}
``````
This runs in 1.591s. The simple instructions used here execute much faster than the more complex `bsr` required by the code above, even though there are many more of them:
``````
0x00000000004004f0 <next_pow2+0>:       sub    $0x1,%edi
0x00000000004004f3 <next_pow2+3>:       mov    %edi,%eax
0x00000000004004f5 <next_pow2+5>:       shr    %eax
0x00000000004004f7 <next_pow2+7>:       or     %edi,%eax
0x00000000004004f9 <next_pow2+9>:       mov    %eax,%edx
0x00000000004004fb <next_pow2+11>:      shr    $0x2,%edx
0x00000000004004fe <next_pow2+14>:      or     %eax,%edx
0x0000000000400500 <next_pow2+16>:      mov    %edx,%eax
0x0000000000400502 <next_pow2+18>:      shr    $0x4,%eax
0x0000000000400505 <next_pow2+21>:      or     %edx,%eax
0x0000000000400507 <next_pow2+23>:      mov    %eax,%edx
0x0000000000400509 <next_pow2+25>:      shr    $0x8,%edx
0x000000000040050c <next_pow2+28>:      or     %eax,%edx
0x000000000040050e <next_pow2+30>:      mov    %edx,%eax
0x0000000000400510 <next_pow2+32>:      shr    $0x10,%eax
0x0000000000400513 <next_pow2+35>:      or     %edx,%eax
0x0000000000400518 <next_pow2+40>:      retq
``````

So is this the fastest code? On this processor, no. It is marginally faster, running in 1.403s, if we use the `unsigned long long` type instead of `unsigned` in the `next_pow2()` function. However, this trick probably only applies to this particular machine. Moreover, the fact that the `bsr` instruction is too slow to help does not carry over to other machines, such as those based on the Core 2 architecture. There, the branchless `bsr`-using function may be the fastest.

In short, even with a task as simple as rounding up to the next power of two, there are multiple solutions, each vying to be the fastest. There may be no clear winner, as in this case, where different machines favor different styles of code. However, we can see that a little optimization makes a huge difference: the final 1.403s run, despite executing four times as many loop iterations, is nearly six times faster than the original routine on this machine.