# Implementations for Gray code encoding and decoding

# Gray code encoding

Gray code encoding is a bijective function between natural numbers (including 0) in a way that the Gray code of subsequent integers only differ in a single binary digit. This encoding was proved useful for rotary encoders and other kinds of counting where switching multiple bits at once would be problematic.

A 10bit Gray code rotary encoder.

I use the following common definition for Gray code encoding:

where denotes the ,,bitwise exclusive or’’ and denotes the ,,bitwise right shift’’ operators.

# Gray code decoding

I denote the inverse of the function with . can be expressed in the following way:

It can be shown that is indeed the inverse of using the commutative and associative properties of and the following identities:

# Implementations

Implementing Gray code encoding is rather simple:

```
unsigned gray_encode(unsigned i) {
return i^(i>>1);
}
```

One can implement Gray code decoding directly from the defining expression of :

```
#include <stdint.h>
unsigned gray_decode(unsigned i) {
unsigned ret = 0;
while (i != 0) {
ret = (ret ^ i);
i = (i >> 1);
}
return ret;
}
```

One can actually do better for a known fixed-width integral type. The following algorithm is from Wikipedia:

```
#include <stdint.h>
uint32_t gray_decode(uint32_t i) {
i ^= (i >> 16);
i ^= (i >> 8);
i ^= (i >> 4);
i ^= (i >> 2);
i ^= (i >> 1);
return i;
}
```

One can show the correctness of this implementation using the identities I presented before and induction.

I also developed an alternative implementation that makes use of the `popcnt`

and `pdep`

x86_64 CPU instructions (part of the POPCNT and BMI2 extensions respectively).

```
#include <stdint.h>
#include <x86intrin.h>
uint32_t gray_decode(uint32_t i) {
uint32_t evens
= _pdep_u32(0x55555555u,i);
uint32_t odds
= _pdep_u32(0xAAAAAAAAu,i);
uint32_t popcount
= __builtin_popcount(i);
return (-(popcount & 1))
^ ((odds << 1) - (evens << 1));
}
```

In the following sections I break down how this implementation works. I stick with 32 bit numbers, but my reasoning works for any bitwidth.

If and – where the bold letters denote binary digits – then . In other words one can get the th digit in by xoring the digits from the th to the most significant digit in . For a given where the 1 digits are sparsely laid out one can think of in the following way:

The leading digits (from the most significant digits) are 0 up until the first 1 digit in . Then the following digits in are all 1 up until the next 1 digit in , and so on. The 1 digits in mark the transitions between strips of 0s and 1s in . For this description it’s not required that the 1s in are sparsely laid out, it just makes the intuition easier (for me at least).

I define by taking every second 1 binary digit in starting from the least significant one.
I also define by taking every second 1 binary digit in starting from the second least significant one.
and correspond to the `evens`

and `odds`

variables in my `gray_decode`

function above.

After adding and , where denotes bitwise negation, the result has similar strips of 0s and 1s as :

There are two main differences to :

- The transitions between the strips of 0s and 1s start one place to the right compared to .
- The first strips from the left can be all 1s or all 0s depending if there are even or odd number of 1 digits in .

The first issue can be fixed by shifting and to the left by one place beforehand. The second difference can be addressed by first calculating if there are even or odd number of 1 bits in . Then one can flip the bits in the result for the even case.

I denote the number of 1 bits in as , this corresponds to the `popcnt`

variable in my `gray_decode`

function.
By applying these fixes we get the following expression for :

The branching between the even and odd cases can also be eliminated by xoring the intermediate result with `00…0`

or `11…1`

depending on the parity of .
is either 0 or 1 , is either `00…0`

or `11…11`

for the even and odd cases respectively.
Since bitwise negation is required for the even case an additional bitwise negation is necessary:

**Update:** IJzerbaard pointed out that this expression can be further simplified by using the and identities:

The `popcnt`

and `pdep`

CPU instructions can be used to calculate , and efficiently.

The `popcnt`

instruction takes one source and one destination argument.
It writes the number of 1 binary digits in the source and returns the count in the destination.
can be calculated by applying `popcnt`

on .

`pdep`

takes two source arguments (`src`

and `mask`

) and one destination argument (`dst`

).
It takes the low bits from `src`

and deposits them in the destination to the corresponding set bit locations in `mask`

.

One can calculate by using `pdep`

and setting `src`

to `0101…01`

and `mask`

to .
can be calculated similarly with `pdep`

but with `1010…10`

as `src`

.
This is how `evens`

and `odds`

are calculated in my `gray_decode`

function, the magic numbers are the `src`

arguments.

# Benchmark

I think there is no correct way to objectively benchmark these Gray code decoding functions in isolation. One can call these functions in a tight loop and compare their throughput that way, but one shouldn’t take the resulting numbers too seriously.

I used Google’s benchmark library and the following setup:

```
static void Generic32(benchmark::State& state) {
uint32_t i = 0;
for (auto _ : state) {
++i;
uint32_t x = gray_decode_generic(i);
benchmark::DoNotOptimize(x);
}
state.SetBytesProcessed(i*sizeof(uint32_t));
}
BENCHMARK(Generic32);
```

And similarly for all the other functions for both `uint32_t`

and `uint64_t`

.
I compiled the benchmark with `gcc -O2`

.

```
-------------------------------------------------
Benchmark Time CPU Iterations
-------------------------------------------------
Generic32 2 ns 2 ns 58150989 1.56283GB/s
PDEP32 2 ns 2 ns 82701968 2.18142GB/s
Generic64 2 ns 2 ns 75010161 4.67384GB/s
PDEP64 1 ns 1 ns 110827182 5.80519GB/s
```

The throughput of the `popcnt`

and `pdep`

method is somewhat faster in this specific benchmark.

# Notes

- can also be calculated using carry-less multiplying with
`11…1`

and fixing the result with`popcnt`

. The PCLMUL extension provides this operation, but it can only carry-less multiply 64bit arguments and it still requires XMM registers. The performance of this method wasn’t great the last time I tried. There can easily be a CPU instruction that could help beating my method at least in a streaming case, but I’m not familiar with all of the vector instructions. - Fun fact: is the Thue–Morse sequence
- I use a variation of my algorithm in my fast Hilbert-curve library. For my purposes I don’t need to shift and .

**Update2:** IJzerbaard discovered an other improvement: the left shifting can be done early before the pdep instructions, this saves a single instruction in the resulting binary when compiled with gcc:

```
#include <stdint.h>
#include <x86intrin.h>
uint32_t gray_decode(uint32_t i) {
uint32_t evens
= _pdep_u32(0x55555555u,i << 1);
uint32_t odds
= _pdep_u32(0xAAAAAAAAu,i << 1);
uint32_t popcount
= __builtin_popcount(i);
return (-(popcount & 1))
^ (odds - evens);
}
```