1 Introduction

1.1 Introduction

The modern DSP’s/CPU’s requires a tight interface between compiler and hardware architecture in order to achieve high utilization of the available resources. Compilers have not shown the ability to meet the needs of programmers on critical code. As architecture scales up, it is becoming so complex that human programmers can't deal with the scheduling and tracking of so many registers and execution units. The result may be an architecture that can't be programmed to apply most of its resources on real-life algorithms.

The goal of this project is to develop Code Optimizer ( dco ) - an optimizing code system for the Digital Signal Processor(s) or CPU ( referred as target ). This will be a software package specifically designed to optimize the target code by taking full advantage of the options and features provided by the target microprocessor.

Note, that acquiring our technology will save you tremendous effort in developing house-built solution as well as ensure quick availability of the product.

1.2 General Overview

The dco will be used to optimize code generated by a compiler. The programmer uses a compiler (C/C++, FORTRAN etc.) to translate his code into targets assembly code. This code would be used as an input to dco. The output, generated by the dco, will be a highly optimized targets assembly code that is logically identical to the original one; dco will rearrange the existing code, performing multi-issue optimization, loop unrolling and vectorization, reassigning available registers, etc. To create a final object file the generated code should be assembled.

Note that dco will not require preprocessing or any other involvement from the user. It will be fully automated and it would be possible to incorporate dco into makefiles or other product generation tools.

The use of dco will greatly improve the quality of the generated code. It, therefore, may prove to be a vital contribution to the production of a winning solution for your Digital Signal Processor(s) or CPU.

1.2.1 Features and Description

Takes full advantage of the target
dco will optimize a target machine code by taking full advantage of the options and features provided by the target microprocessor.
Fully automated
dco will not require preprocessing or any other user involvement. It will be fully automated and may be incorporated into makefiles or other product generation tools.
Flexible
dco will have many input options allowing the user to choose the right optimization technique to be performed as well as the level of optimization to be applied.

2 General Information

2.1 Source of optimization

dco will be a software package that optimizes a target code by taking full advantage of the options and features provided by the target microprocessor. The following is a description of some of the optimization techniques that will be provided by the package.

Static optimization: Static optimization performs dead-code elimination, copy propagation, code compression and simplification, common subexpression elimination, use of algebraic identities, etc.
Memory optimization: Memory optimization performs static and dynamic memory aliasing, attempting to resolve dependencies caused by the memory accessing instructions.
Auto parallelization: dco utilizes multiple execution units ( "cores" ) found on many modern processors and offers powerful auto-parallelization that capable to identify code-patterns that are suitable for parallelization, and create the optimized code that will be executed by all the cores available
Loop unrolling: Loop unrolling duplicates the body of a loop specified number of times creating piece of code with more opportunities for optimization.
Software pipelining: Software pipelining overlaps epilog code execution for current loop iteration with prolog code execution for the next loop iteration.
Multi-issuing: Multi-issuing attempts to utilize Instruction Level Parallelism ( ILP ) of the target processor leading to more than one instruction being executed at instruction cycle.
Art substitution: Art substitution replaces parts of the code by the logically equivalent code sequences. The generated code is then evaluated in order to find the optimal one.

By default dco will perform most of the optimizations that are available. It will be possible to enable or disable any number of the optimization techniques.

2.2 dco Code Optimization

This paragraph describes the way dco will treat the input code.

dco will operate on basic block, although most of optimizations and scheduling will be done across the blocks. A basic block is a sequence of a targets non-branching instructions in which flow of control enters at the top. dco will assumes as a basic block any sequence of instructions preceded and/or followed by a label or a branch instruction.

Optimization of the code will done as follows:

perform data flow analysis

extract basic block

perform static optimization

build data flow graph

          art substitution
       perform resource reduction
       extract patterns for scheduler
       perform instructions multi-issuing

After scanning the input targets assembly source, dco will performs comprehensive data flow analysis calculating resources needed and available for execution of the instructions of the code. It then will extract a basic block(s) to be optimized.

When the basic block is selected, the static optimization will be performed and dynamic memory map will be built. The code generated by the static optimization will then be used to build the data flow graph.

From the data flow graph dco will perform art substitutions and resource dependency reduction and generate patterns of code that are organized into multi-instruction units. dco will attempt to generate the fastest assembly code that is logically identical to the input code.

2.3 Target Code Improvement

At this stage it is difficult to come up with the reasonable estimates of the dco performances. However, this package would be based on the technology developed and implemented for the Intel’s i860 RISC processor ( dco860 ), DEC’s Alpha Axp RISC processor ( ago ) , Analog Devices ADSP-2106x ( SHARC ) family of DSP’s ( compactor ), Analog Devices TS00x ( tigerSHARC ) DSP ( dco ), Freescale’s StarCore DSP ( sco ) and recently for the x86 family of processors ( dco ). The modern DSP’s/CPU’s ( as powerful as they are ) contain little of what haven’t been successfully implemented in the series of the optimizers already implemented ( i860, Alpha and SHARC ). All of the supported CPUs provide multifunctional units that allow to execute more than one instruction per cycle ( i860 - 2, Alpha and SHARC - 4 ), Alpha, SHARC and x86 support conditional instruction execution, i860 supports data types contained across two registers, tigerSHARC and x86 support SIMD etc.

All of the implemented code optimizers achieve significant code improvements on variety of applications under different optimizing compilers, as summarized in the following table.

dco860 ( i860 )	30 to 40% on Linpack, Fourier Transform, Convolution
ago ( Alpha )	20 to 70% on Linpack, Fourier Transform, Convolution, Alvinn
compactor ( SHARC )	20 to 60% on Fourier Transform, Convolution, Matrix Multiply, Filters
dco ( tigerSHARC )	30 to 80% on Convolution, Matrix Multiply, Filters
sco ( StarCore )	30 to 50% on Convolution, Matrix Multiply, Filters
dco ( x86 )	up to 80% on numeric double precision applications

Note, that all code improvements are over performance of the code that is maximally optimized by the corresponding compiler. More data is available here or upon request.

Based on this, there is little doubt in the potential of this technology to significantly improve code for your DSP’s/CPU’s.

4 Code samples

This section contains some code samples that illustrate various optimization techniques supported by dco and shows code improvements achieved on different code patterns using various compilers for distinct processors.

See this and this for the samples of code and discussion regarding the new optimization techniques, utilized in our x86 optimizer.

See this and this for the samples of code and discussion regarding the optimization techniques, utilized in our StarCore optimizer.

4.1 Dynamic Memory Disambiguation

Dynamic memory disambiguation is one of the most powerful optimization of the inner loop body supported by dco. This technique allows, at the time of program execution ( dynamically ), resolve memory conflicts of the code. To achieve that, dco generates two versions of the code: one assuming that memory conflicts are not resolved and the second assuming that memory conflicts are resolved ( which is usually much more efficient ). At the run time, depending on the actual setting of the memory pointers, the appropriate code is executed.

As an example, consider the kernel of the linpack benchmark suite ( called daxpy ):

for ( i = 0; i < n; i++ )
{
dy[i] = dy[i] + da*dx[i];
}

When compiled by the Alpha Axp compiler, the following code is generated ( fully optimized ):

$43:
      ldt $f1,0($18)
      mult $f17,$f1,$f1
      ldt $f10,0($20)
      addt $f10,$f1,$f10
      addl $2,1,$2
      addq $18,8,$18
      cmplt $2,$16,$1
      stt $f10,0($20)
      addq $20,8,$20
      bne $1,$43
$41:

Unrolling this code by 2 and performing dynamic memory disambiguation dco produces the following result:

$43:
      lda      $1, 1
      addl      $2, $1, $1
      subl      $1, $16, $1
      blt      $1, .lualpha_41
.align 4
.lualpha_42:
      ldt      $f1, 0($18)
      mult      $f17, $f1, $f1
      ldt      $f10, 0($20)
      addt      $f10, $f1, $f10
      addl      $2, 1, $2
      addq      $18, 8, $18
      cmplt      $2, $16, $1
      stt      $f10, 0($20)
      addq      $20, 8, $20
      bne      $1, .lualpha_42
      br      $31, .lualpha_45
.lualpha_41:
      lda      $1, 15
      addq      $18, $1, $1
      subq      $1, $20, $1
      bge      $1, .mcalpha_48
.align 4
.wlalpha_49:
      ldt      $f12, 0($18)
      ldt      $f1, 8($18)
      mult      $f17, $f12, $f12
      mult      $f17, $f1, $f1
      ldt      $f11, 0($20)
      ldt      $f10, 8($20)
      addt      $f11, $f12, $f11
      addt      $f10, $f1, $f10
      addl      $2, 2, $2
      addl      $2, 1, $1
      addq      $18, 16, $18
      subl      $1, $16, $1
      stt      $f11, 0($20)
      stt      $f10, 8($20)
      addq      $20, 16, $20
      blt      $1, .wlalpha_49
      Br      $31, .mlalpha_54
.mcalpha_48:
      lda      $1, 1
      addq      $18, $1, $1
      subq      $1, $20, $1
      ble      $1, .mcalpha_55
.align 4
.wlalpha_56:
      ldt      $f12, 0($18)
      ldt      $f1, 8($18)
      mult      $f17, $f12, $f12
      mult      $f17, $f1, $f1
      ldt      $f11, 0($20)
      ldt      $f10, 8($20)
      addt      $f11, $f12, $f11
      addt      $f10, $f1, $f10
      addl      $2, 2, $2
      addl      $2, 1, $1
      addq      $18, 16, $18
      subl      $1, $16, $1
      stt      $f11, 0($20)
      stt      $f10, 8($20)
      addq      $20, 16, $20
      blt      $1, .wlalpha_56
      Br      $31, .mlalpha_54
.mcalpha_55:
.align 4
.wlalpha_61:
      ldt      $f12, 0($18)
      mult      $f17, $f12, $f12
      ldt      $f11, 0($20)
      addt      $f11, $f12, $f11
      stt      $f11, 0($20)
      ldt      $f1, 8($18)
      mult      $f17, $f1, $f1
      ldt      $f10, 8($20)
      addt      $f10, $f1, $f10
      addl      $2, 2, $2
      addl      $2, 1, $1
      addq      $18, 16, $18
      subl      $1, $16, $1
      stt      $f10, 8($20)
      addq      $20, 16, $20
      blt      $1, .wlalpha_61
      br      $31, .mlalpha_54
.mlalpha_54:
      addl      $2, $31, $1
      subl      $1, $16, $1
      blt      $1, .lualpha_42
.lualpha_45:
$41:

The unrolled loop:

.wlalpha_61:
      ldt      $f12, 0($18)
      mult      $f17, $f12, $f12
      ldt      $f11, 0($20)
      addt      $f11, $f12, $f11
      stt      $f11, 0($20)
      ldt      $f1, 8($18)
      mult      $f17, $f1, $f1
      ldt      $f10, 8($20)
      addt      $f10, $f1, $f10
      addl      $2, 2, $2
      addl      $2, 1, $1
      addq      $18, 16, $18
      subl      $1, $16, $1
      stt      $f10, 8($20)
      addq      $20, 16, $20
      blt      $1, .wlalpha_61

has memory conflict:

stt $f11, 0($20)
ldt $f1, 8($18)

The generated code solves this conflict by producing version of the loop with the assumption that memory conflict doesn't exist ( loop labeled .wlalpha_49 - note that all the memory reads ( ldt ) are performed before memory writes ( stt ) ) and version of the loop without such an assumption ( loop labeled .wlalpha_61 - note that that order of the instruction of the memory conflict is preserved ( ldt $f1,8($18) is following stt $f11,0($20) )

In most instances of the linpack benchmark execution, code labeled .wlalpha_49: will be executed thus bringing performance improvement over the original code to 40%.

4.2 Software Pipelining

Software Pipelining is another powerful optimization of the inner loop body. It overlaps epilog code execution for current loop iteration with prolog code execution for the next loop iteration. Essentially, the two consecutive loop iterations are fitted in the block of the size of the loop iteration ( of size n ). m instruction are chosen from the bottom of the first iteration and combined with n - m instructions from the top of the second iteration. The generated m + ( n - m ) = n instructions are optimized. This is done for all m from 1 to n-1 and the best resulting code is chosen. Of course, all necessary checkups to preserve logic of the code are performed by dco and resulting code.

As an example, consider the kernel of the dot product

for ( i = 0; i < n; i++ )
{
s += x[i]*y[i];
}

as generated by the SHARC compiler ( fully optimized ):

_L$5:
        lcntr = r1, do _L$3-1 until lce;
        f4=dm(i4,m6),f2=pm(i12,m14);
        f8=f2*f4;
        f12=f12+f8;
_L$6:
_L$3:

Execution of this loop takes 3 clocks/point.

Running vectorization on this code, dco produces the following result:

        r0=r1-1;
        if gt jump (pc,__vr$dco_29);
        f12=f8+f12, f4=dm(i4,m6), f2=PM(i12,m14);
        f8=f2*f4;
        f12=f8+f12;
        jump (PC,__vr$dco_30);
__vr$dco_29:
        f4=dm(i4,m6);
        f2=PM(i12,m14);
        f12=f8+f12;
        f8=f2*f4;
__vr$dco_31:
        lcntr=r0, do __vr$dco_32-1 until lce ;
        f12=f8+f12, f4=dm(i4,m6), f2=PM(i12,m14);
        f8=f2*f4, f12=f8+f12;
__vr$dco_32:
        f12=f8+f12;
__vr$dco_30:
__dole$dco_0:
_L$3:

Optimized loop:

__vr$dco_31:
        lcntr=r0, do __vr$dco_32-1 until lce ;
        f12=f8+f12, f4=dm(i4,m6), f2=PM(i12,m14);
        f8=f2*f4, f12=f8+f12;
__vr$dco_32:

execution takes 2 clocks/point - 33% improvement over the compiler-generated code. Better results may be achieved by combining vectorization with loop unrolling ( also supported by dco ).

4.3 Space optimization

Often reducing the size of the code is as important as improving its execution time. dco provides support for code-size reduction.

The following code is generated by the tigerSHARC compiler ( fully optimized ) and represents a kernel of a DSP benchmark ( code size –39 instructions ):

_L$750011:
        XR15 = [J26 + 60];;
        XR0 = [J26 + 72];;
        l[J26 + 58] = XR5:4;;
        XR21 = [J26 + 74];;
        XR12 = [J26 + 58];;
        XR1 = PASS R21; XR21 = ASHIFT R21 BY -16; l[J26 + 60] = XR7:6;;
        XR22 = [J26 + 61]; XR17 = PASS R12; XR12 = ASHIFT R12 BY -16;;
        XR13 = [J26 + 60]; XR1 = LSHIFT R1 BY 16;;
        XR2 = XR22; XR17 = LSHIFT R17 BY 16;;
        XR18 = PASS R13; XR22 = ASHIFT R22 BY -16; l[J26 + 72] = XR5:4;;
        XR13 = ASHIFT R13 BY -16; l[J26 + 74] = XR7:6;;
        XR23 = R21 * R22 (I); XR2 = LSHIFT R2 BY 16;;
        XR14 = R12 * R13 (I); XR18 = LSHIFT R18 BY 16;;
        XR0 += FDEP R23 BY R16;;
        XR15 += FDEP R14 BY R16;;
        XR1 = ASHIFT R1 BY -16; [J26 + 63] = XR0;;
        XR2 = ASHIFT R2 BY -16; [J26 + 62] = XR15;;
        XR17 = ASHIFT R17 BY -16;;
        XR18 = ASHIFT R18 BY -16; XR3 = R1 * R2 (I);;
        XR19 = R17 * R18 (I);;
        XR0 += FDEP R3 BY R20;;
        XR15 += FDEP R19 BY R20; [J1 += 1] = XR0;;
        [J1 += 1] = XR15;;
        if nlc0e, jump _L$750011 ;;

When instructed to reduce code size, dco generates the following code ( time - 25 clocks, size 25 instructions - 36% code/size reduction ):

      xR8=pass R20; yR16=xR16;;
      yR8=xR20;;
_L$750011:
        yR15=[J26 + 72];;
        xR15=[J26 + 60];;
        L[J26 + 58]=xR5:4;;
        yR12=[J26 + 74];;
        xR12=[J26 + 58];;
        L[J26 + 60]=xR7:6;;
        yR13=[J26 + 61]; xyR10= ashift R12 by -16 ;;
        xR13=[J26 + 60]; xyR17= lshift R12 by 16 ;;
        xyR0= ashift R13 by -16 ; L[J26 + 72]=xR5:4;;
        xyR18= lshift R13 by 16 ; L[J26 + 74]=xR7:6;;
        xyR14= R10 * R0 (I); xyR2= ashift R17 by -16 ;;
        xyR0= ashift R18 by -16 ;;
        xyR15+= fdep R14 by R16 ;;
        xyR19= R2 * R0 (I); [J26 + 63]=yR15;;
        [J26 + 62]=xR15;;
        xyR15+= fdep R19 by R8 ;;
        [J1 += 1]=yR15;;
        if nLC0E, jump _L$750011; [J1 += 1]=xR15;;

When instructed to produce the fastest code, dco generates the following code ( time 23 clocks, size 33 instructions ):

      yR7=xR16;;
        yR6=xR20;;
_L$750011:
        yR4=[J26 + 72];;
        xR15=[J26 + 60];;
        L[J26 + 58]=xR5:4;;
        yR12=[J26 + 74];;
        xR12=[J26 + 58];;
        L[J26 + 60]=xR7:6;;
        yR3=[J26 + 61]; xyR17= lshift R12 by 16 ;;
        xR13=[J26 + 60]; xyR2= ashift R12 by -16 ;;
        yR1= ashift R3 by -16 ; xR17= ashift R17 by -16 ; L[J26 + 72]=xR5:4;;
        yR5= lshift R3 by 16 ; xR8= ashift R13 by -16 ; L[J26 + 74]=xR7:6;;
        xR18= lshift R13 by 16 ; yR3= R2 * R1 (I); yR1= ashift R17 by -16 ;;
        yR2= ashift R5 by -16 ; xR14= R2 * R8 (I);;
        xR18= ashift R18 by -16 ; yR4+= fdep R3 by R7 ;;
        yR0= R1 * R2 (I); xR15+= fdep R14 by R16 ; [J26 + 63]=yR4;;
        xR19= R17 * R18 (I); [J26 + 62]=xR15;;
        yR4+= fdep R0 by R6 ;;
        xR15+= fdep R19 by R20 ; [J1 += 1]=yR4;;
        if nLC0E, jump _L$750011; [J1 += 1]=xR15;;