Bit permutations on AVX2

As part of a paper on a high-speed NTRU implementation, we needed a highly optimized routine for permuting vectors of bits on x86 with AVX2 instructions. This tool helps with writing assembly for specific bit permutations by simulating the relevant instructions and displaying intermediate values in a human-readable form. It also comes with two code-generation functions that build on this simulation to find efficient routines for arbitrary bit permutations. Hand-crafting a permutation still gives a performance benefit, but the generated code is in the same ballpark.
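To illustrate what simulating an instruction on labelled bits can look like, here is a minimal Python sketch. It is not the tool's actual interface: the function names, the list-of-labels representation of a 256-bit register, and the use of None for a zero bit are all assumptions made for this example. Only the behaviour of the two modelled AVX2 instructions (VPSLLQ and VPERMQ) follows the instruction set.

```python
LANE = 64    # bits per quadword lane
WIDTH = 256  # bits in a ymm register


def fresh_register():
    """Return a register whose bit i carries the label i (bit 0 is least significant)."""
    return list(range(WIDTH))


def vpsllq(reg, n):
    """Model VPSLLQ: shift every 64-bit lane left by n bits, shifting in zeros (None)."""
    out = []
    for base in range(0, WIDTH, LANE):
        lane = reg[base:base + LANE]
        # Shifting by 64 or more clears the whole lane.
        k = min(n, LANE)
        out.extend([None] * k + lane[:LANE - k])
    return out


def vpermq(reg, imm8):
    """Model VPERMQ: destination quadword i is the source quadword
    selected by the two bits of imm8 at position 2*i."""
    out = []
    for i in range(4):
        src = (imm8 >> (2 * i)) & 0b11
        out.extend(reg[src * LANE:(src + 1) * LANE])
    return out


def show(reg):
    """Print one line per quadword, most significant bit first, '.' for a zero bit."""
    for base in range(0, WIDTH, LANE):
        lane = reg[base:base + LANE]
        print(" ".join("  ." if b is None else f"{b:3d}" for b in reversed(lane)))


if __name__ == "__main__":
    reg = fresh_register()
    # Reverse the order of the four quadwords, then shift each lane left by one bit.
    show(vpsllq(vpermq(reg, 0x1B), 1))
```

Because every bit carries the index it started at, the printed register shows directly where each input bit has ended up after a sequence of instructions, which is the kind of intermediate view described above.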

The code is available on GitHub under the CC0 Public Domain license.