Intel’s Haswell design (expected in 2013) will include a new
instruction set called BMI2 with various fun bit manipulation
instructions. Of particular note are the `pext` and `pdep`
instructions, which are essentially bit-level gather/scatter
operations. Combining two `pext` operations gives the GRP
instruction described in Efficient Permutation Instructions for Fast
Software Cryptography, in which
the authors show how to implement bit-level permutations using a
variety of instructions. Perhaps not coincidentally, at least one of
the authors now works at Intel.

Here are the operations written as C++ templates (Haswell will only
support `pdep` and `pext` with 32- and 64-bit operands, though). As
may be apparent, they are written for clarity rather than speed:

```
#include <cstddef> // for size_t

template<typename T>
T pext(T val, T mask)
   {
   T res = 0;
   size_t off = 0;

   for(size_t bit = 0; bit != sizeof(T)*8; ++bit)
      {
      const bool val_bit = (val >> bit) & 1;
      const bool mask_bit = (mask >> bit) & 1;

      if(mask_bit)
         {
         res |= static_cast<T>(val_bit) << off;
         ++off;
         }
      }

   return res;
   }

template<typename T>
T pdep(T val, T mask)
   {
   T res = 0;
   size_t off = 0;

   for(size_t bit = 0; bit != sizeof(T)*8; ++bit)
      {
      const bool val_bit = (val >> off) & 1;
      const bool mask_bit = (mask >> bit) & 1;

      if(mask_bit)
         {
         res |= static_cast<T>(val_bit) << bit;
         ++off;
         }
      }

   return res;
   }
```

These have been checked fairly heavily (though obviously not exhaustively) against the output of Intel’s emulator.

The paper shows that any arbitrary 32-bit permutation can be done
using 10 `pext`s, 5 ORs, and 5 shifts. For 64-bit permutations,
it’s 12, 6, and 6. A big unknown is what kind of latency and
throughput these instructions will have in Haswell, but `pclmulqdq`,
which seems at least a metaphorically similar operation, takes 14
cycles with at most two instructions outstanding; assuming those
rates for `pext`, a 32-bit permutation might take 80 cycles or so.

Here’s the DES P permutation:

```
uint32_t des_p(uint32_t val)
   {
   val = (pext(val, 0x07137FE0) << 16) | pext(val, 0xF8EC801F);
   val = (pext(val, 0x75196E8C) << 16) | pext(val, 0x8AE69173);
   val = (pext(val, 0x56A3CCE4) << 16) | pext(val, 0xA95C331B);
   val = (pext(val, 0xAA539AC9) << 16) | pext(val, 0x55AC6536);
   val = (pext(val, 0x96665A69) << 16) | pext(val, 0x6999A596);
   return val;
   }
```

And here’s DES’s initial permutation:

```
uint64_t des_ip(uint64_t val)
   {
   val = (pext(val, 0x00FF00FF00FF00FF) << 32) | pext(val, 0xFF00FF00FF00FF00);
   val = (pext(val, 0x00FF00FF00FF00FF) << 32) | pext(val, 0xFF00FF00FF00FF00);
   val = (pext(val, 0x00FF00FF00FF00FF) << 32) | pext(val, 0xFF00FF00FF00FF00);
   val = (pext(val, 0xCCCCCCCCCCCCCCCC) << 32) | pext(val, 0x3333333333333333);
   val = (pext(val, 0xCCCCCCCCCCCCCCCC) << 32) | pext(val, 0x3333333333333333);
   val = (pext(val, 0x5555555555555555) << 32) | pext(val, 0xAAAAAAAAAAAAAAAA);
   return val;
   }
```

The regularity of the IP operation is pretty obvious, and suggests one could get by with a much less powerful operation (and indeed Richard Outerbridge and others have gotten IP down to 30 simple operations).

Finally, PRESENT’s permutation, also a very regular operation:

```
uint64_t present_p(uint64_t val)
   {
   val = (pext(val, 0xF0F0F0F0F0F0F0F0) << 32) | pext(val, 0x0F0F0F0F0F0F0F0F);
   val = (pext(val, 0xF0F0F0F0F0F0F0F0) << 32) | pext(val, 0x0F0F0F0F0F0F0F0F);
   val = (pext(val, 0xF0F0F0F0F0F0F0F0) << 32) | pext(val, 0x0F0F0F0F0F0F0F0F);
   val = (pext(val, 0xF0F0F0F0F0F0F0F0) << 32) | pext(val, 0x0F0F0F0F0F0F0F0F);
   val = (pext(val, 0xAAAAAAAAAAAAAAAA) << 32) | pext(val, 0x5555555555555555);
   val = (pext(val, 0xAAAAAAAAAAAAAAAA) << 32) | pext(val, 0x5555555555555555);
   return val;
   }
```

Each pair of `pext`s in the above examples is actually a GRP:

```
(pext(v, mask) << hamming_weight(mask)) | pext(v, ~mask)
```

which has the effect of moving all the bits of `v` where mask is 1
to the leftmost part of the output, and all the other bits to the
rightmost part.

To generate the sequence of constants, we actually use the same
grouping operation. The input is an array of words specifying where
each bit should end up. For a 32-bit permutation, each target is 5
bits because it is in the range [0,2^{5}), so 160 bits suffices
to specify the permutation. We convert this to five 32-bit values,
with the lowest bit of each target index in `p[0]`, the second bit of
each target index in `p[1]`, and so on. For instance, for DES’s P,
the values turn out to be:

```
p[0] = 0x07137FE0
p[1] = 0x6BD9232C
p[2] = 0xDD230F1C
p[3] = 0x63665639
p[4] = 0xA5A435AE
```

Then we can use 15 GRP operations to perform the permutation:

```
for i in range(0, 5):
    x = GRP(x, p[i])
    for j in range(i+1, 5):
        p[j] = GRP(p[j], p[i])
```

However, you’ll note that almost all of these operations do not depend
on *x*, and so we can precompute them:

```
for i in range(1, 5):
    for j in range(0, i):
        p[i] = GRP(p[i], p[j])
```

producing the values hardcoded into the function above.

This approach works for any arbitrary permutation, which is convenient, but many specific permutations can be done with fewer operations. That matters because needing to perform an arbitrary bit permutation is rare compared to needing a particular fixed one.

There are likely many other useful cryptographic applications of the BMI2 instruction set. Areas that seem particularly ripe include GF(2^n) arithmetic (including GCM’s MAC calculation), the DES round function (likely possible in constant time), perhaps some LFSRs, and a few of the recent hardware-oriented primitives like PRESENT, since in hardware a bit permutation is often free. It will also be interesting to see if, in a few years, new designs start making use of cheap bit permutations, key-dependent permutations, or more elaborate bit operations like bit interleaving of two words, in the same way that some SHA-3 entries took advantage of the cheap AES round function AES-NI made available. There is always a tradeoff, though: running such an algorithm on processors without BMI2 would mean using a loop such as the ones above, which would be much slower, and possibly (as with my templates above) not constant time, opening up a dangerous side channel.

On some processors and some compilers, if you replace the conditional in the templates above with:

```
res |= static_cast<T>(val_bit & mask_bit) << bit; /* << off in pext */
off += mask_bit;
```

(using `off` as the shift amount in `pext` and `bit` in `pdep`), it will run in constant time, but you should definitely verify that is the case by examining the assembly your compiler produces, as a compiler can sometimes surprise you with its choice of instructions, including a conditional jump where you do not expect it.

C++0x, the next major revision of C++, includes a number of new language and library facilities that I am greatly looking forward to, including a standard thread interface. Initially the agenda for C++0x had included facilities built on threads, such as a thread pool, but as part of the so-called ‘Kona compromise’ (named after the location of the committee meeting where the compromise was made) all but the most basic facilities were deferred for a later revision.

However there were many requests for a simple facility for creating
an asynchronous function call, and a function for this, named
`std::async`, was voted in at the last meeting.
`std::async` is a rather blunt tool; it spawns a new thread
(though wording is included which would allow an implementation to
spawn threads in a fixed-size thread pool to eliminate thread creation
overhead and reduce hardware oversubscription) and returns a “future”
representing the return value of the function. A future is a
placeholder for a value which can be passed around the program, and if
and when the value is actually needed, it can be retrieved from the
future; the `get` operation may block if the value has not yet
been computed. In C++0x the future/promise system is primarily
intended for use with threads, but there doesn’t seem to be any reason
a system for distributed RPC (à la E’s Pluribus protocol) could not
provide an interface using the same classes.

An operation which felt like easy low-hanging fruit for parallel
invocation is RSA’s decrypt/sign operation. Mathematically, when one
signs a message using RSA, the message representation (usually a hash
function output plus some specialized padding) is converted to an
integer, and then raised to the power of `d`, the RSA private
key, modulo another number. Both of these numbers are relatively
large, typically 300 to 600 digits long. A well known trick, which
takes advantage of the underlying structure of the numbers, allows one
to instead compute two modular exponentiations, both using numbers
about half the size of `d`, and combine them using the Chinese
Remainder Theorem (thus this optimization is often called
RSA-CRT). The two computations are both still quite intensive, and
since they are independent it seemed reasonable to try computing them
in parallel. Running one of the two exponentiations in a different
thread showed an immediate doubling in speed for RSA signing on a
multicore! Other mathematically intensive algorithms that offer some
amount of parallel computation, including DSA and ElGamal, also showed
nice improvements.

As `std::async` is not included in GCC 4.5, I wrote a simple
clone of it. This version does not offer thread pooling or the option
of telling the runtime to run the function on the same thread; it is
mostly a ‘proof of concept’ version I’m using until GCC includes the
real deal in libstdc++. Here is the code:

```
#include <future>
#include <thread>

template<typename F>
auto std_async(F f) -> std::unique_future<decltype(f())>
   {
   typedef decltype(f()) result_type;

   std::packaged_task<result_type ()> task(std::move(f));
   std::unique_future<result_type> future = task.get_future();

   std::thread thread(std::move(task));
   thread.detach();

   return future;
   }
```

The highly curious `auto` return type of `std_async`
uses C++0x’s new function declaration syntax; ordinarily there is
no reason to use it, but here we want to specify that the function
returns a `unique_future` parameterized by whatever it is
that `f` returns. Since `f` can’t be referred to until
it has been mentioned as the name of an argument, the return type
has to come after the parameter list.

Unlike the version of `std::async` that was finally voted
in, `std_async` assumes its argument takes no arguments (one of
the original proposals for `std::async` used a similar
interface). This would be highly inconvenient except for the
assistance of C++0x’s lambdas, which allow us to pack everything
together. For instance here is the code for RSA signing, which
packages up one half of the computation in a 0-ary lambda
function:

```
auto future_j1 = std_async([&]() { return powermod_d1_p(i); });
BigInt j2 = powermod_d2_q(i);
BigInt j1 = future_j1.get();
// Now combine j1 and j2 using CRT
```

Using C++0x’s `std::bind` instead of a lambda here should
work as well, but I ran into problems with that in the 4.5 snapshot I’m
using; the current implementation follows the TR1 style of requiring
`result_type` typedefs which will not be necessary in C++0x
thanks to `decltype`. Since the actual `std::async` can
take an arbitrary number of arguments, the declaration of
`future_j1` will eventually change to simply:

`auto future_j1 = std::async(powermod_d1_p, i);`

The implementation of `std_async` may strike you as
excessively C++0x-ish, for instance by using `decltype` instead
of TR1’s `result_of` metaprogramming function. Part of this is
due to current limitations of GCC and/or libstdc++; the version of
`result_of` in 4.5’s libstdc++ does not understand lambda
functions (C++0x’s `result_of` is guaranteed to get this right,
because it itself uses `decltype`, but apparently libstdc++
hasn’t changed to use this yet).

Overall I’m pretty happy with C++0x as an evolution of C++98 for systems programming tasks. Though I am certainly interested to see how Thompson and Pike’s Go works out; now that BitC is more or less dead after the departure of its designers to Microsoft, Go seems to be the only game in town in terms of new systems programming languages that might provide a compelling alternative to C++.

I recently packaged botan using InnoSetup, an open source installation creator. Overall I was pretty pleased with it - it seems to do everything I need it to do without much of a hassle, and I’ll probably use it in the future if I need to package other programs or tools for Windows.

After I got the basic package working, a nit I wanted to deal with was converting the line endings of all the header files and plain-text documentation (readme, license file, etc) to use Windows line endings. While many Windows programs, including Wordpad and Visual Studio, can deal with files with Unix line endings, not all do, and it seemed like it would be a nice touch if the files were not completely unreadable if opened in Notepad.

There is no built-in support for this, but InnoSetup includes a scripting facility (using Pascal!), with hooks that can be called at various points in the installation process, including immediately after a file is installed, which suits this sort of problem perfectly. So all that was required was to learn enough Pascal to write the function. I’ve included it below to help anyone who might be searching for a similar facility, since my own searches looking for an example of doing this were fruitless:

```
[Code]
const
  LF = #10;
  CR = #13;
  CRLF = CR + LF;

procedure ConvertLineEndings();
var
  FilePath : String;
  FileContents : String;
begin
  FilePath := ExpandConstant(CurrentFileName);
  LoadStringFromFile(FilePath, FileContents);
  StringChangeEx(FileContents, LF, CRLF, False);
  SaveStringToFile(FilePath, FileContents, False);
end;
```

Adding the hook with `AfterInstall: ConvertLineEndings`
caused this function to run on each of my text and include files.