|
Many
applications spend bulk of their time inside inner loop(s) thus making
it extremely important to optimize code of the inner loop. One of the
efficient ways to optimize loops is to perform loop unrolling - to
duplicate the body of a loop specified number of times ( dco
parameter -lu #
). Generating more instructions in the body of the loop
often creates more opportunities for code optimizations.
However, before using this powerful optimization, it is important to
understand what exactly dco
performs and how it may affect the performance of the code being
optimized.
In this article we will
Understanding
Loop
UnrollingThe standard
double precision dot product
loop:
for
( s = .0, i = 0; i < n; i++ )
{ s
+= x[i]*y[i];
}
is
translated by the compiler ( gcc
version 4.2.2 invoked with the following list of
options: -S
-O3 -fomit-frame-pointer -ffast-math -march=pentium4
-mfpmath=sse -msse2 -mstackrealign ) into:
pxor
%xmm1,%xmm1 xorl %eax,%eax .L5:
movsd (%ebx,%eax,8), %xmm0
mulsd (%esi,%eax,8), %xmm0
addsd %xmm0, %xmm1
addl $1, %eax
cmpl %edx, %eax
jne .L5
which
dco, when unrolling
by 4 ( -lu 4 ),
transfers into ( see the left column; the
text in the right column of the above table describes different
functional units in the unrolled code, generated by dco ):
leal -1(,%edx),%ecx
xorl
%eax,%eax pxor %xmm1,%xmm1
addl $-2,%ecx jg ___dcox86_lu_0_
| Code
marked in this color
establishes %ecx
as a loop count and checks if there are at least 4 iterations in the
loop; if yes - jump
to ___dcox86_lu_0_
to execute unrolled loop. |
subl $-3,%ecx jmp
___dcox86_wl_4_
| Not enough iterations for unrolled loop -
readjust the count %ecx
and jump to___dcox86_wl_4_
to execute not unrolled loop. |
___dcox86_lu_0_:
pxor %xmm2,%xmm2 unpcklpd
%xmm2,%xmm1 pxor %xmm0,%xmm0
movl %ebx,%ebp andl $15,%ebp
jne ___dcox86_wl_3_ ___dcox86_wl_5_:
movsd (%esi,%eax,8),%xmm5 addl
$4,%eax movhpd
-24(%esi,%eax,8),%xmm5 addl
$-4,%ecx movsd
-16(%esi,%eax,8),%xmm6 movhpd
-8(%esi,%eax,8),%xmm6 mulpd
-32(%ebx,%eax,8),%xmm5 mulpd
-16(%ebx,%eax,8),%xmm6 addpd
%xmm5,%xmm1 addpd %xmm6,%xmm0
jg ___dcox86_wl_5_ jmp
___dcox86_cc_7_ ___dcox86_wl_3_:
movsd (%ebx,%eax,8),%xmm7 addl
$4,%eax movhpd
-24(%ebx,%eax,8),%xmm7 addl
$-4,%ecx movsd
-32(%esi,%eax,8),%xmm3 movhpd
-24(%esi,%eax,8),%xmm3 movsd
-16(%ebx,%eax,8),%xmm4 movhpd
-8(%ebx,%eax,8),%xmm4 movsd
-16(%esi,%eax,8),%xmm6 movhpd
-8(%esi,%eax,8),%xmm6 mulpd
%xmm3,%xmm7 mulpd %xmm6,%xmm4
addpd %xmm7,%xmm1 addpd
%xmm4,%xmm0 jg
___dcox86_wl_3_ | Unrolled loop. Note the optimizations
performed by dco,
such as use of SIMD instructions, conditionally-generated code to use
aligned memory access ( e.g. mulpd -32(%ebx,%eax,8),%xmm5 ). |
___dcox86_cc_7_:
addpd %xmm0,%xmm1 movhlps
%xmm1,%xmm2 addsd %xmm2,%xmm1
subl $-3,%ecx
jle
___dcox86_lu_2_
| Exit
from the unrolled loop execution. Code
marked in this color
checks if there are iterations left for execution
( 3 or less ) - note that %ecx
contains a loop count; if no iterations left - jump
to ___dcox86_lu_2_. |
___dcox86_wl_4_:
movsd (%ebx,%eax,8),%xmm0 addl
$1,%eax mulsd
-8(%esi,%eax,8),%xmm0 addl
$-1,%ecx addsd %xmm0,%xmm1
jg ___dcox86_wl_4_
| Not
unrolled loop - executed if there are iterations left after the
execution of the unrolled loop or it was established that number of
iterations in the loop is less than unroll count. Note that this code
is executed
for number of iterations less than unroll count ( 3 or less ). | |
___dcox86_lu_2_: | Exit label. |
Note, that a general
purpose register was allocated ( %ecx
) for use in the dco
generated code. Failure to allocate such a register would prohibit loop
unrolling. Note also that in the dco-generated
unrolled code values of
the registers, used by the loop before the first iteration, are equal
to
the values of the original ( not unrolled ) loop - this may be
important during
optimization and is not a case with the unroller provided by gcc. Optimization
ResultsWe
used the C version of the Livermore loops benchmark. The code was
modified to eliminate calibration, thus ensuring that on every run the
same number of iterations are executed on the same input data. This
makes it possible to compare the execution times of the program ( and
not the estimate amount of MFlops as in the original implementation ).
All
benchmarks were executed under Fedora
Linux operating system
running on the 2.8GHz Pentium4 computer with 512MB RAM installed. It
was ensured that benchmarks run under the same conditions on the system
with the minimal possible
load.
Every benchmark was executed 3 times with the time reported being
neither the best nor the worst.
Speedups
of the compiler generated not unrolled codeThis page
evaluates the
speedup achieved by utilization of the loop unrolling, supported
by dco,
over the compiler generated not unrolled code; see this
for comparison between dco-implemented
loop unrolling and loop unrolling provided by the gcc.
Jump
to Conclusions.
The following table presents
collected execution data.
- The
column under gcc 4.2.2
header presents
execution times
( in seconds ) achieved by the gcc generated code;gcc version
4.2.2 compiler with the following compiler options:
-O3
-fomit-frame-pointer -ffast-math
-march=pentium4
-mfpmath=sse -msse2
-mstackrealign ( note that compiler's unroller was not
enabled
)
- The
column
under dco header present execution
times
( in seconds ) of the
dco ( version 1.0.4
) optimized code
without unroller being enabled
- The column under dcoU header presents dco
generated code with
loops unrolled by 4 ( -lu 4
command line option )
- The column
under the dco/gcc
header lists the improvements achieved by utilizing dco over the gcc generated code
- The column under the dcoU/gcc
lists the
improvements achieved by utilizing dco
unroller over the gcc generated code
- The column under the dcoU/dco
shows how much faster is dco
unrolled code than
dco
not unrolled code
Cases
where use of the
unroller improved code speed ( by more than 5% ) are shown in this
color.
| Kernel# |
gcc
4.2.2 | dco |
dcoU |
dco/ gcc |
dcoU/ gcc |
dcoU/ dco |
| 1 |
5.03 | 4.05 |
2.96 |
19.48% | 41.15% | 26.91% |
| 2 | 2.84 |
2.78 | 2.78 | 2.11% | 2.11% | 0.0% | | 3 | 5.06 | 5.18 | 4.02 | -2.37% | 20.55% | 22.39% |
| 4 | 5.33 | 5.24 | 5.2 | 1.69% | 2.44% | 0.76% | | 5 | 5.08 | 4.99 |
1.73 |
1.77% | 65.94% | 65.33% |
| 6 | 16.25 | 16.22 |
4.08 |
0.18% | 74.89% | 74.85% | | 7 | 5.55 | 4.8 |
4.8 |
13.51% | 13.51% | 0.0% |
| 8 | 3.91 | 3.9 |
3.92 |
0.26% |
-0.26% |
-0.51% | | 9 | 5.04 | 5.69 |
3.72 |
-12.9% | 26.19% | 34.62% |
| 10 | 4.97 | 4.35 |
4.53 |
12.47% | 8.85% | -4.14% | | 11 | 5.13 | 0.82 |
0.82 | 84.02% | 84.02% | 0.0% |
| 12 | 4.42 | 5.58 |
4.99 |
-26.24% | -12.9% | 10.57% | | 13 | 4.65 | 4.66 |
4.6 | -0.22% | 1.08% | 1.29% |
| 14 | 4.37 | 4.35 |
4.33 |
0.46% | 0.92% | 0.46% | | 15 | 5.45 | 5.4 |
5.4 | 0.92% |
0.92% | 0.0% |
| 16 | 5.5 | 6.67 |
6.66 |
-21.27% | -21.09% | 0.15% | | 17 | 4.87 | 4.17 |
4.11 | 14.37% | 14.37% | 0.0% |
| 18 | 4.58 | 3.64 |
4.01 |
20.52% | 12.45% | -10.46% | | 19 | 7.04 | 3.68 |
3.7 | 47.73% | 47.44% | -0.54% |
| 20 | 4.84 | 4.73 |
4.73 |
2.27% | 2.27% | 0.0% | | 21 | 7.15 | 7.11 |
6.86 |
0.56% | 4.06% | 3.52% |
| 22 |
4.83 |
4.79 |
4.78 |
0.83% |
1.04% |
0.21% |
| 23 |
3.67 |
2.98 |
2.58 |
18.8% |
29.7% |
13.42% |
| 24 |
4.85 |
1.19 |
0.89 |
75.46% |
81.65% |
25.21% |
| Geometric
Mean | 5.13 | 4.25 | 3.62 | 17.15% | 29.56% | 14.98% |
In 8 ( out of
24 ) cases utilizing dco's
unroller proved to be beneficial; for one case ( the kernel
18 ) dco
generated faster code without utilizing the unroller. On the average,
use of the unroller added 15%
of the speed improvement to the dco
optimized code - without unroller dco
improved gcc
generated code by 17%,
with the unroller dco
improved the same code
by 30%.
Speedups
of the compiler generated unrolled codeThis page
- evaluates
the
speedup achieved by optimization
of the compiler generated unrolled code
- compares
between dco-implemented
loop unrolling and loop unrolling provided by the gcc
Jump
to Conclusions.The following table presents
collected execution data.
- The
column under gccU
header presents
execution times
( in seconds ) achieved by the gcc generated code; gcc version
4.2.2 compiler with the following compiler options:
-O3 -fomit-frame-pointer -funroll-all-loops
-ffast-math
-march=pentium4
-mfpmath=sse -msse2
-mstackrealign ( note that compilers unroller was enabled
) - The
column
under gccU+dco header presents execution
times
( in seconds ) of the
dco optimized code
without unroller being enabled
while optimizing gcc
generated unrolled code
- The column under gcc+dcoU header presents dco
generated code with
loops unrolled by 4 ( -lu 4
command line option ) while optimizing gcc
generated not unrolled code
- The column
under the gccU+dco/gccU
header lists the improvements achieved by utilizing dco over the gcc generated unrolled code
- The column under the gcc+dcoU/gcc
lists the
improvements achieved by utilizing dco
unroller over the gcc generated not unrolled code
- The
column under the gcc+dcoU/gccU+dco
shows how much faster is dco
unrolled code (
optimizing
gcc generated
not unrolled code ) than dco
not unrolled code ( optimizing
gcc
generated unrolled code
); cases with improvement by more than 5% are shown
in this
color
| Kernel# |
gccU |
gccU+dco |
gcc+dcoU |
gccU+dco/ gccU |
gcc+dcoU/ gcc |
gcc+dcoU/ gccU+dco |
| 1 |
5.03 | 3.24 |
2.96 |
35.59% | 41.15% | 8.64% |
| 2 | 2.46 |
2.36 | 2.78 | 4.07% | 2.11% | -17.8% | | 3 | 4.99 | 2.49 | 4.02 | 50.1% | 20.55% | -61.45% |
| 4 | 5.04 | 3.84 | 5.2 | 23.81% | 2.44% | -35.42% | | 5 | 5.33 | 1.76 |
1.73 |
66.98% | 65.94% | 1.7% |
| 6 | 16.24 | 5.06 |
4.08 |
68.84% | 74.89% | 19.37% |
| 7 |
5.16 | 4.16 |
4.8 |
19.38% | 13.51% | -15.38% |
| 8 | 3.87 | 3.91 |
3.92 |
-1.03% |
-0.26% |
-0.26% | | 9 | 4.95 | 4.01 |
3.72 |
18.99% | 26.19% | 7.23% |
| 10 | 4.94 |
3.38 |
4.53 |
31.58% | 8.85% | -34.02% | | 11 | 4.93 | 0.85 |
0.82 | 82.76% | 84.02% | 3.53% |
| 12 | 5.2 | 5.52 |
4.99 |
-6.15% | -12.9% | 9.6% |
| 13 | 4.63 | 4.66 |
4.6 | -0.65% | 1.08% | 1.29% |
| 14 | 4.41 | 4.23 |
4.33 |
4.08% | 0.92% | -2.36% | | 15 | 5.44 | 4.54 |
5.4 | 16.54% |
0.92% | -18.94% |
| 16 | 4.86 | 4.52 |
6.66 |
7.0% | -21.09% | -47.35% | | 17 | 4.87 | 4.17 |
4.11 | 14.37% | 14.37% | 0.0% |
| 18 | 4.57 | 3.61 |
4.01 |
21.01% | 12.45% | -11.08% | | 19 | 5.82 | 4.1 |
3.7 | 29.55% | 47.44% | 9.76% |
| 20 | 4.53 |
4.43 |
4.73 |
2.21% | 2.27% | -6.77% | | 21 | 7.16 | 8.85 |
6.86 |
-23.6% | 4.06% | 22.49% |
| 22 |
4.79 |
4.8 |
4.78 |
-0.21% |
1.04% |
0.42% |
| 23 |
3.67 |
2.82 |
2.58 |
23.16% |
29.7% |
8.51% |
| 24 |
4.85 |
0.84 |
0.89 |
82.68% |
81.65% |
-5.95% |
| Geometric
Mean | 5.02 | 3.44 | 3.62 | 31.47% | 29.56% | -5.12% |
On the
average, optimizing gcc
unrolled code proved to be 5%
faster than utilizing dco's
unroller.
However in 7 ( out of
24 ) cases dco's
unroller achieved better results (
cases shown in this
color in the above table ). In some cases dco
fails to unroll a loop, while gcc
succeeds - this happens due to unavailability of a general
purpose register needed by dco's
unroller ( see this
); even when dco
successfully unrolls the loop, it may generate a
slower code ( than gcc
unrolled dco
optimized code ), because of inability
to allocate and use a general purpose register needed to perform other
optimizations. For example, in the kernel 3, dco successfully
unrolled
the loop ( and achieved 21%
improvement by doing that ) but failed to performed aligned memory
access optimization because a general purpose register, needed to
implement this optimization, wasn't available after unrolling the loop
- a register, available in the not unrolled loop, was used up by the
dco's
unroller; a
register was available in the gcc
unrolled code making this optimization possible ( and achieved 50%
improvement by
doing that ).
combining dco's
loop unroller with loop unroller supported by gccIn order to generate the fastest code, it
might be necessary to combine dco's
loop unroller with loop unroller, supported by gcc - to unroll ( using
dco ) the code
already unrolled by the compiler.
The following table represents
optimization results achieved while optimizing the double
precision convolution
on arrays of 1000 points ( see this for details ). The code was compiled by the gcc version 4.2.2
compiler and executed on
the
64-bit Linux operating system
based on the 2.66GHz Core2 processor.
The column
under gcc header presents execution
times
( in seconds ) of the gcc
generated code. The columns under -lu # header present
execution time ( in seconds ) of the gcc generated code that
was optimized by dco
with the loop unroller as specified by the header.
In
the raw named gcc
the code was compiled with the following compiler options:
-O3 -fomit-frame-pointer -ffast-math
-march=nocona -mfpmath=sse -msse3
( note that compilers unroller was not
enabled
) In the raw
named gccU
the code was compiled with the following compiler options:
-O3 -fomit-frame-pointer -funroll-all-loops
-ffast-math
-march=nocona
-mfpmath=sse -msse3
( note that compilers unroller was enabled
)
| gcc | -lu 1 | -lu 2 | -lu 3 | -lu 4 | | gcc | 2.9916 | 2.9975 | 1.7947 | 2.0497 | 1.7877 | | gccU | 2.9106 | 1.7927 | 1.3348 | 1.8267 | 1.3758 |
ConclusionsThe fastest code ( shown in this
color in the above table ) is 54% faster then the best compiler generated code and was
achieved when dco
unrolled ( by 2 ) the gcc
unrolled code.
| |