Many applications spend bulk of their time inside inner loop(s) thus making it extremely important to optimize code of the inner loop. One of the efficient ways to optimize loops is to perform loop unrolling - to duplicate the body of a loop specified number of times ( dco parameter -lu # ). Generating more instructions in the body of the loop often creates more opportunities for code optimizations. However, before using this powerful optimization, it is important to understand what exactly dco performs and how it may affect the performance of the code being optimized.

In this article we will

Understanding Loop Unrolling

The standard double precision dot product loop:

for ( s = .0, i = 0; i < n; i++ )
  {
    s += x[i]*y[i];
  }

is translated by the compiler ( gcc version 4.2.2 invoked with the following list of options:
-S -O3 -fomit-frame-pointer -ffast-math -march=pentium4 -mfpmath=sse -msse2 -mstackrealign ) into:

    pxor %xmm1,%xmm1
    xorl %eax,%eax
.L5:
    movsd    (%ebx,%eax,8), %xmm0
    mulsd    (%esi,%eax,8), %xmm0
    addsd    %xmm0, %xmm1
    addl    $1, %eax
    cmpl    %edx, %eax
    jne    .L5


which dco, when unrolling by 4 ( -lu 4 ), transfers into ( see the left column; the text in the right column of the table bellow describes different functional units in the unrolled code, generated by dco ):

    leal -1(,%edx),%ecx
    xorl %eax,%eax
    pxor %xmm1,%xmm1
    addl $-2,%ecx
    jg ___dcox86_lu_0_

Code marked in this color establishes %ecx as a loop count and checks if there are at least 4 iterations in the loop; if yes - jump to ___dcox86_lu_0_ to execute unrolled loop.
    subl $-3,%ecx
    jmp ___dcox86_wl_4_

Not enough iterations for unrolled loop - readjust the count %ecx and jump to___dcox86_wl_4_ to execute not unrolled loop.
___dcox86_lu_0_:
    pxor %xmm2,%xmm2
    unpcklpd %xmm2,%xmm1
    pxor %xmm0,%xmm0
    movl %ebx,%ebp
    andl $15,%ebp
    jne ___dcox86_wl_3_
___dcox86_wl_5_:
    movsd (%esi,%eax,8),%xmm5
    addl $4,%eax
    movhpd -24(%esi,%eax,8),%xmm5
    addl $-4,%ecx
    movsd -16(%esi,%eax,8),%xmm6
    movhpd -8(%esi,%eax,8),%xmm6
    mulpd -32(%ebx,%eax,8),%xmm5
    mulpd -16(%ebx,%eax,8),%xmm6
    addpd %xmm5,%xmm1
    addpd %xmm6,%xmm0
    jg ___dcox86_wl_5_
    jmp ___dcox86_cc_7_
___dcox86_wl_3_:
    movsd (%ebx,%eax,8),%xmm7
    addl $4,%eax
    movhpd -24(%ebx,%eax,8),%xmm7
    addl $-4,%ecx
    movsd -32(%esi,%eax,8),%xmm3
    movhpd -24(%esi,%eax,8),%xmm3
    movsd -16(%ebx,%eax,8),%xmm4
    movhpd -8(%ebx,%eax,8),%xmm4
    movsd -16(%esi,%eax,8),%xmm6
    movhpd -8(%esi,%eax,8),%xmm6
    mulpd %xmm3,%xmm7
    mulpd %xmm6,%xmm4
    addpd %xmm7,%xmm1
    addpd %xmm4,%xmm0
    jg ___dcox86_wl_3_
Unrolled loop. Note the optimizations performed by dco, such as use of SIMD instructions, conditionally-generated code to use aligned memory access ( e.g.
mulpd -32(%ebx,%eax,8),%xmm5 ).
___dcox86_cc_7_:
    addpd %xmm0,%xmm1
    movhlps %xmm1,%xmm2
    addsd %xmm2,%xmm1
    subl $-3,%ecx
    jle ___dcox86_lu_2_

Exit from the unrolled loop execution. Code marked in this color checks if there are  iterations left for execution ( 3 or less ) - note that %ecx contains a loop count; if no iterations left - jump to ___dcox86_lu_2_.
___dcox86_wl_4_:
    movsd (%ebx,%eax,8),%xmm0
    addl $1,%eax
    mulsd -8(%esi,%eax,8),%xmm0
    addl $-1,%ecx
    addsd %xmm0,%xmm1
    jg ___dcox86_wl_4_

Not unrolled loop - executed if there are iterations left after the execution of the unrolled loop or it was established that number of iterations in the loop is less than unroll count. Note that this code is executed for number of iterations less than unroll count ( 3 or less ).
___dcox86_lu_2_: Exit label.

Note, that a general purpose register was allocated ( %ecx ) for use in the dco generated code. Failure to allocate such a register would prohibit loop unrolling. Note also that in the dco-generated unrolled code values of the registers, used by the loop before the first iteration, are equal to the values of the original ( not unrolled ) loop - this may be important during optimization and is not a case with the unroller provided by gcc.

Optimization Results

We used the C version of the Livermore loops benchmark. The code was modified to eliminate calibration, thus ensuring that on every run the same number of iterations are executed on the same input data. This makes it possible to compare the execution times of the program ( and not the estimate amount of MFlops as in the original implementation ).

All benchmarks were executed under Fedora Linux operating system running on the 2.8GHz Pentium4 computer with 512MB RAM installed. It was ensured that benchmarks run under the same conditions on the system with the minimal possible load.

Every benchmark was executed 3 times with the time reported being neither the best nor the worst.

Speedups of the compiler generated not unrolled code

This page evaluates the speedup achieved by utilization of the loop unrolling, supported by dco, over the compiler generated not unrolled code; see this for comparison between dco-implemented loop unrolling and loop unrolling provided by the gcc.

Jump to Conclusions.

Results

The following table presents collected execution data.
Cases where use of the unroller improved code speed ( by more than 5% ) are shown in this color.

Kernel# gcc 4.2.2 dco dcoU dco/
gcc
dcoU/
gcc
dcoU/
dco
1 5.03 4.05 2.96 19.48% 41.15% 26.91%
2 2.84 2.78 2.78 2.11% 2.11% 0.0%
3 5.06 5.18 4.02 -2.37% 20.55% 22.39%
4 5.33 5.24 5.2 1.69% 2.44% 0.76%
5 5.08 4.99 1.73 1.77% 65.94% 65.33%
6 16.25 16.22 4.08 0.18% 74.89% 74.85%
7 5.55 4.8 4.8 13.51% 13.51% 0.0%
8 3.91 3.9 3.92 0.26% -0.26% -0.51%
9 5.04 5.69 3.72 -12.9% 26.19% 34.62%
10 4.97 4.35 4.53 12.47% 8.85% -4.14%
11 5.13 0.82 0.82 84.02% 84.02% 0.0%
12 4.42 5.58 4.99 -26.24% -12.9% 10.57%
13 4.65 4.66 4.6 -0.22% 1.08% 1.29%
14 4.37 4.35 4.33 0.46% 0.92% 0.46%
15 5.45 5.4 5.4 0.92% 0.92% 0.0%
16 5.5 6.67 6.66 -21.27% -21.09% 0.15%
17 4.87 4.17 4.11 14.37% 14.37% 0.0%
18 4.58 3.64 4.01 20.52% 12.45% -10.46%
19 7.04 3.68 3.7 47.73% 47.44% -0.54%
20 4.84 4.73 4.73 2.27% 2.27% 0.0%
21 7.15 7.11 6.86 0.56% 4.06% 3.52%
22 4.83 4.79 4.78 0.83% 1.04% 0.21%
23 3.67 2.98 2.58 18.8% 29.7% 13.42%
24 4.85 1.19 0.89 75.46% 81.65% 25.21%
Geometric Mean 5.13 4.25 3.62 17.15% 29.56% 14.98%

Conclusions

In 8 ( out of 24 ) cases utilizing dco's unroller proved to be beneficial; for one case ( the kernel 18 ) dco generated faster code without utilizing the unroller. On the average, use of the unroller added 15% of the speed improvement to the dco optimized code - without unroller dco improved gcc generated code by 17%, with the unroller dco improved the same code by 30%.

Speedups of the compiler generated unrolled code

This page
Jump to Conclusions.

Results

The following table presents collected execution data.
Kernel# gccU gccU+dco gcc+dcoU gccU+dco/
gccU
gcc+dcoU/
gcc
gcc+dcoU/
gccU+dco
1 5.03 3.24 2.96 35.59% 41.15% 8.64%
2 2.46 2.36 2.78 4.07% 2.11% -17.8%
3 4.99 2.49 4.02 50.1% 20.55% -61.45%
4 5.04 3.84 5.2 23.81% 2.44% -35.42%
5 5.33 1.76 1.73 66.98% 65.94% 1.7%
6 16.24 5.06 4.08 68.84% 74.89% 19.37%
7 5.16 4.16 4.8 19.38% 13.51% -15.38%
8 3.87 3.91 3.92 -1.03% -0.26% -0.26%
9 4.95 4.01 3.72 18.99% 26.19% 7.23%
10 4.94 3.38 4.53 31.58% 8.85% -34.02%
11 4.93 0.85 0.82 82.76% 84.02% 3.53%
12 5.2 5.52 4.99 -6.15% -12.9% 9.6%
13 4.63 4.66 4.6 -0.65% 1.08% 1.29%
14 4.41 4.23 4.33 4.08% 0.92% -2.36%
15 5.44 4.54 5.4 16.54% 0.92% -18.94%
16 4.86 4.52 6.66 7.0% -21.09% -47.35%
17 4.87 4.17 4.11 14.37% 14.37% 0.0%
18 4.57 3.61 4.01 21.01% 12.45% -11.08%
19 5.82 4.1 3.7 29.55% 47.44% 9.76%
20 4.53 4.43 4.73 2.21% 2.27% -6.77%
21 7.16 8.85 6.86 -23.6% 4.06% 22.49%
22 4.79 4.8 4.78 -0.21% 1.04% 0.42%
23 3.67 2.82 2.58 23.16% 29.7% 8.51%
24 4.85 0.84 0.89 82.68% 81.65% -5.95%
Geometric Mean 5.02 3.44 3.62 31.47% 29.56% -5.12%

Conclusions

On the average, optimizing gcc unrolled code proved to be 5% faster than utilizing dco's unroller. However in 7 ( out of 24 ) cases dco's unroller achieved better results ( cases shown in this color in the above table ).
In some cases dco fails to unroll a loop, while gcc succeeds - this happens due to unavailability of a general purpose register needed by dco's unroller ( see this ); even when dco successfully unrolls the loop, it may generate a slower code ( than gcc unrolled dco optimized code ), because of inability to allocate and use a general purpose register needed to perform other optimizations. For example, in the kernel 3, dco successfully unrolled the loop ( and achieved 21% improvement by doing that ) but failed to performed aligned memory access optimization because a general purpose register, needed to implement this optimization, wasn't available after unrolling the loop - a register, available in the not unrolled loop, was used up by the dco's unroller; a register was available in the gcc unrolled code making this optimization possible ( and achieved 50% improvement by doing that ).

combining dco's loop unroller with loop unroller supported by gcc

In order to generate the fastest code, it might be necessary to combine dco's loop unroller with loop unroller, supported by gcc - to unroll ( using dco ) the code already unrolled by the compiler.

The following table represents optimization results achieved while optimizing
the double precision convolution on arrays of 1000 points ( see this for details ). The code was compiled by the gcc version 4.2.2 compiler and executed on the 64-bit Linux operating system based on the 2.66GHz Core2 processor.

The
column under gcc header presents execution times ( in seconds ) of the gcc generated code. The columns under -lu # header present execution time ( in seconds ) of  the gcc generated code that was optimized by dco with the loop unroller as specified by the header.

In the raw named gcc the code was compiled
with the following compiler options:
-O3 -fomit-frame-pointer -ffast-math -march=nocona -mfpmath=sse -msse3
( note that compilers unroller was not enabled )
In the raw named gccU the code was compiled with the following compiler options:
-O3 -fomit-frame-pointer -funroll-all-loops -ffast-math -march=nocona -mfpmath=sse -msse3
( note that compilers unroller was enabled )


gcc -lu 1 -lu 2 -lu 3 -lu 4
gcc 2.9916 2.9975 1.7947 2.0497 1.7877
gccU 2.9106 1.7927 1.3348 1.8267 1.3758

Conclusions

The fastest code ( shown in this color in the above table ) is 54% faster then the best compiler generated code and was achieved when dco unrolled ( by 2 ) the gcc unrolled code.