/ :: bitbashing

All Programs Are Inefficient

I was thinking about the efficiency of programs on our current CPUs and machine architectures, and I've come to the conclusion that they are all inefficient. On one level, that's an incredibly obvious statement - anyone who has done any amount of C and assembly programming knows that a program in Lisp or ML is rather inefficient from the view of machine efficiency. However languages like Lisp and ML also tend to be much more efficient in terms of programmer time, and in most cases that is a more important constraint to optimize for, especially given that the tradeoffs doesn't seem to be linear - taking a slowdown in processor time by a factor of three by writing in Lisp might let you write ten times as much functionality as you could writing the same program in C.

In my experience, the better optimized a program is, the more uniform its operations are. In crypto code implementing RSA, you'll have huge sequences of mostly linear memory accesses that perform integer operations (add, subtract, multiply) with control operations that map down directly to (easy to predict) compares and jumps. In the finance applications I've been working on lately, our computations are long loops over doubles, with the only integers being for loop control. Both are, even when perfectly optimized, black holes of processor time; you would always want your SSL server to be able to service more clients on a given CPU. On the finance side, the faster our computations run, the more stocks we can run calculations on, and the more trades we can make (hopefully this will equate to making more money).

In both cases, and I'm thinking in many others (media codecs, for instance) we're essentially wasting half of the available CPU resources by failing to mix integer and floating point; where a program is looping over a Newton-Raphson calculation memory bandwidth is underutilized; when it is sorting an integer array it's not using the floating point unit at all.

These resources, particularly available execution units, aren't averaged out over multiple processes. To some extent a computers memory, disk, and network bandwidth are, since while one program is in a tight FPU loop the DMA transfers to disk your operating system kicked off for another process will continue to run and make use of what is available. But execution units can't be shared in the same way. One minor exception is when using a system like Intel's Hyperthreading, which allows two threads to share some CPU resources. But because the operating system doesn't know what the threads are or will be executing, it can't schedule them appropriately to minimize resource contention.

So how can we use spare execution resources when they are available? The idea I've come up with, which is probably deeply impractical for any number of reasons, was inspired by Haskell's lazy evaluation and the Itanium's support for speculative execution. Lazy evaluation means that things are only evaluated when and if those values are used. If C was lazily evaluating, you might write

    double x = sqrt(5);
    double x2 = x * x;
    int y = (int)x2 / 0;
    printf("%f\n", y); // evaluated

In the code above, nothing would actually be evaluated until we print y; at that point we would receive a SIGFPE about divide by zero, but if we were to never use the value of y there would be no need to evaluate it (and thus no need to evaulate x or x2). So lazy evaluation is a form of deferring computations until they are needed. As an aside, it would be interesting if any Haskell implementations can 'see', when x2 is evaluated, that the value can be set to five, without having to actually compute the square of the square root.

Intel's Itanium architecture supports several forms of speculative execution. The Itanium is built around the idea of having as many execution units that can sanely be built into a chip, and providing an instruction set that allows code to use as many of them as possible. One way it does that is to allow computations which might not actually be needed to be run; if it turns out they're not needed, they can be thrown away. It is important that errors that occur during a speculative execution are not emitted unless the values are actually used. For expository purposes, consider this code

double f(int);
bool done_p(int, int);

void f_over(double arr[], int len)
   {
   for(int j = 0; done_p(j, len); j++)
      arr[j] = f(j);
   }

An Itanium processor might (conceivably) execute multiple iterations of f() with values writing to past the end of the array, if the value of done_p hadn't been made available yet. However we wouldn't want those to actually have an effect, if it turned out that done_p had returned true three iterations ago. Itanium provides hardware support for reverting back those writes so they won't affect the visible state of the program.

By combining speculative execution (not necessarily processor supported) and lazy evaluation, we can then use all, or nearly all, available resources.

   let x = some_long_fpu_computation(/* inputs */)
   let y = some_long_integer_computation(/* more inputs */)
   output(y);

The programming environment can detect when it is about to enter a sequence of long-running integer operations and speculatively execute some_long_fpu_computation interleaved with the execution of some_long_integer_computation (allowing the code to use the ALU and FPU units of the processor to their fullest extent); if the FPU code turns out to produce _|_ (the Haskell language spec uses this to represent any sort of error, be it divide by zero or an infinite loop), the execution of that function will be backed out without any effect on the state of the process while the integer code carries on. The result is then cached, and if and when the value of the computation is examined, it is available immediately.

My intuition is that on current microarchitectures this just isn't worth it; the bookkeeping overhead will be high, and the amount of resources at stake is relatively small. But as the superscalar execution resources of processors grows, the wins should become larger and larger. This could also be applicable on multicore processors and SMP machines.

Posted 2007/05/21 in programming; no comments

< Was This Obvious To Everyone But Me? | Questions I Will Remember (Next Time) >

Name:


E-mail:


URL:


Comment: