Implementations for Gray code encoding and decoding
Gray code encoding
Gray code encoding is a bijective function between natural numbers (including 0) in a way that the Gray code of subsequent integers only differ in a single binary digit. This encoding was proved useful for rotary encoders and other kinds of counting where switching multiple bits at once would be problematic.
A 10bit Gray code rotary encoder.
I use the following common definition for Gray code encoding:
\newcommand{\xor}{\wedge} \newcommand{\rshift}{\gg} \renewcommand{\neg}{\mathord{\sim}} \begin{align} &a: \mathcal{N} \rightarrow \mathcal{N} \\ &a_i = i \xor (i \rshift 1), \end{align}where \xor denotes the ,,bitwise exclusive or’’ and \rshift denotes the ,,bitwise right shift’’ operators.
Gray code decoding
I denote the inverse of the function a with b. b_i can be expressed in the following way:
b_i = \begin{cases} 0 & i = 0 \\ i \xor (i\rshift1) \xor (i\rshift2) \xor\dots\xor (i\rshift \lfloor \log_2 i \rfloor) & i > 0 \\ \end{cases}It can be shown that b is indeed the inverse of a using the commutative and associative properties of \xor and the following identities:
\begin{align} x \xor x &= 0 \\ x \xor 0 &= x \\ (x \xor y)\rshift n &= (x \rshift n)\xor(y \rshift n) \\ (x \rshift n) \rshift m &= x \rshift (n+m) \\ x \rshift (\lfloor \log_2 x \rfloor + 1) &= 0 \end{align}Implementations
Implementing Gray code encoding is rather simple:
unsigned gray_encode(unsigned i) {
return i^(i>>1);
}
One can implement Gray code decoding directly from the defining expression of b_i:
#include <stdint.h>
unsigned gray_decode(unsigned i) {
unsigned ret = 0;
while (i != 0) {
ret = (ret ^ i);
i = (i >> 1);
}
return ret;
}
One can actually do better for a known fixed-width integral type. The following algorithm is from Wikipedia:
#include <stdint.h>
uint32_t gray_decode(uint32_t i) {
i ^= (i >> 16);
i ^= (i >> 8);
i ^= (i >> 4);
i ^= (i >> 2);
i ^= (i >> 1);
return i;
}
One can show the correctness of this implementation using the identities I presented before and induction.
I also developed an alternative implementation that makes use of the popcnt
and pdep
x86_64 CPU instructions (part of the POPCNT and BMI2 extensions respectively).
#include <stdint.h>
#include <x86intrin.h>
uint32_t gray_decode(uint32_t i) {
uint32_t evens
= _pdep_u32(0x55555555u,i);
uint32_t odds
= _pdep_u32(0xAAAAAAAAu,i);
uint32_t popcount
= __builtin_popcount(i);
return (-(popcount & 1))
^ ((odds << 1) - (evens << 1));
}
In the following sections I break down how this implementation works. I stick with 32 bit numbers, but my reasoning works for any bitwidth.
If i = \mathbf{x}_{31} \mathbf{x}_{30} \dots \mathbf{x}_1 \mathbf{x}_0 and b_i = \mathbf{y}_{31} \mathbf{y}_{30} \dots \mathbf{y}_1 \mathbf{y}_0 – where the bold letters denote binary digits – then \mathbf{y}_n = \mathbf{x}_n \xor \mathbf{x}_{n+1} \xor \dots \xor \mathbf{x}_{31}. In other words one can get the nth digit in b_i by xoring the digits from the nth to the most significant digit in i. For a given i where the 1 digits are sparsely laid out one can think of b_i in the following way:
\begin{align} i &= \verb|…010…010…010…010…| \\ b_i &= \verb|…011…100…011…100…| \\ \end{align}The leading digits (from the most significant digits) are 0 up until the first 1 digit in i. Then the following digits in b_i are all 1 up until the next 1 digit in i, and so on. The 1 digits in i mark the transitions between strips of 0s and 1s in b_i. For this description it’s not required that the 1s in i are sparsely laid out, it just makes the intuition easier (for me at least).
I define e_i by taking every second 1 binary digit in i starting from the least significant one.
I also define o_i by taking every second 1 binary digit in 1 starting from the second least significant one.
e_i and o_i correspond to the evens
and odds
variables in my gray_decode
function above.
After adding e_i and \neg o_i, where \neg denotes bitwise negation, the result has similar strips of 0s and 1s as b_i:
\begin{align} i &= \verb|…010…010…010…010…| \\ e_i &= \verb|…000…010…000…010…| \\ \neg o_i &= \verb|…101…111…101…111…| \\ e_i+\neg o_i &= \verb|…110…001…110…001…| \\ \end{align}There are two main differences to b_i:
- The transitions between the strips of 0s and 1s start one place to the right compared to b_i.
- The first strips from the left can be all 1s or all 0s depending if there are even or odd number of 1 digits in i.
The first issue can be fixed by shifting e_i and o_i to the left by one place beforehand. The second difference can be addressed by first calculating if there are even or odd number of 1 bits in i. Then one can flip the bits in the result for the even case.
I denote the number of 1 bits in i as p_i, this corresponds to the popcnt
variable in my gray_decode
function.
By applying these fixes we get the following expression for b_i:
The branching between the even and odd cases can also be eliminated by xoring the intermediate result with 00…0
or 11…1
depending on the parity of p_i.
p_i \mathbin{\&} 1 is either 0 or 1 , -(p_i \mathbin{\&} 1) is either 00…0
or 11…11
for the even and odd cases respectively.
Since bitwise negation is required for the even case an additional bitwise negation is necessary:
Update: IJzerbaard pointed out that this expression can be further simplified by using the x \xor (\neg y) = (\neg x) \xor y and \neg(\neg x+y) = x-y identities:
b_i = (-(p_i \mathbin{\&} 1)) \xor ((o_i \ll 1) - (e_i \ll 1))The popcnt
and pdep
CPU instructions can be used to calculate e_i, o_i and p_i efficiently.
The popcnt
instruction takes one source and one destination argument.
It writes the number of 1 binary digits in the source and returns the count in the destination.
p_i can be calculated by applying popcnt
on i.
pdep
takes two source arguments (src
and mask
) and one destination argument (dst
).
It takes the low bits from src
and deposits them in the destination to the corresponding set bit locations in mask
.
One can calculate e_i by using pdep
and setting src
to 0101…01
and mask
to i.
o_i can be calculated similarly with pdep
but with 1010…10
as src
.
This is how evens
and odds
are calculated in my gray_decode
function, the magic numbers are the src
arguments.
Benchmark
I think there is no correct way to objectively benchmark these Gray code decoding functions in isolation. One can call these functions in a tight loop and compare their throughput that way, but one shouldn’t take the resulting numbers too seriously.
I used Google’s benchmark library and the following setup:
static void Generic32(benchmark::State& state) {
uint32_t i = 0;
for (auto _ : state) {
++i;
uint32_t x = gray_decode_generic(i);
benchmark::DoNotOptimize(x);
}
state.SetBytesProcessed(i*sizeof(uint32_t));
}
BENCHMARK(Generic32);
And similarly for all the other functions for both uint32_t
and uint64_t
.
I compiled the benchmark with gcc -O2
.
-------------------------------------------------
Benchmark Time CPU Iterations
-------------------------------------------------
Generic32 2 ns 2 ns 58150989 1.56283GB/s
PDEP32 2 ns 2 ns 82701968 2.18142GB/s
Generic64 2 ns 2 ns 75010161 4.67384GB/s
PDEP64 1 ns 1 ns 110827182 5.80519GB/s
The throughput of the popcnt
and pdep
method is somewhat faster in this specific benchmark.
Notes
- b_i can also be calculated using carry-less multiplying i \ll 1 with
11…1
and fixing the result withpopcnt
. The PCLMUL extension provides this operation, but it can only carry-less multiply 64bit arguments and it still requires XMM registers. The performance of this method wasn’t great the last time I tried. There can easily be a CPU instruction that could help beating my method at least in a streaming case, but I’m not familiar with all of the vector instructions. - Fun fact: b_i \mathbin{\&} 1 is the Thue–Morse sequence
- I use a variation of my algorithm in my fast Hilbert-curve library. For my purposes I don’t need to shift e_i and o_i.
Update2: IJzerbaard discovered an other improvement: the left shifting can be done early before the pdep instructions, this saves a single instruction in the resulting binary when compiled with gcc:
#include <stdint.h>
#include <x86intrin.h>
uint32_t gray_decode(uint32_t i) {
uint32_t evens
= _pdep_u32(0x55555555u,i << 1);
uint32_t odds
= _pdep_u32(0xAAAAAAAAu,i << 1);
uint32_t popcount
= __builtin_popcount(i);
return (-(popcount & 1))
^ (odds - evens);
}