x86
offers the 'Single Instruction Multiple Data' (SIMD) instructions
which allow simultaneous execution of multiple data elements that must
be properly packed inside ( register ) operands. The part of dco
utilizing these instructions is called SIMDinator ( pronounced seem-d-ney-ter
). While it may be one of the important sources of optimization that
can
potentially improve the execution time of code it is very complicated,
difficult to implement and expensive to use. In
this article we will:
A SIMD instruction functionally replaces a few scalar instructions,
e.g.
execution of addpd
is functionally equivalent to a pair of addsd
instructions
that are being executed in parallel. Before converting scalar
instructions into SIMD instructions, it is necessary to identify them.
While
identifying the sequences of scalar instructions, that have a SIMD
equivalent and may be executed in parallel ( independent of one another
), SIMDinator maximizes the total number of candidates that are to be
packed.
After the instruction sequences were created, the corresponding
instructions from these sequences are moved together, their operands
are
adjusted to satisfy the requirements of x86 SIMD instructions and then
conversion takes place.
Consider the following code ( part of the Livermore kernel benchmark,
which was improved
36% by dco
)
for ( k=0 ; k<n ; k++ ) { x[k] = u[k] + r*( z[k] + r*y[k] ) + t*( u[k+3] + r*( u[k+2] + r*u[k+1] ) + t*( u[k+6] + q*( u[k+5] + q*u[k+4] ) ) ); }
|
.L1012: leal as1+32032(,%edx,8),%eax unpcklpd %xmm3,%xmm3 unpcklpd %xmm4,%xmm4 unpcklpd %xmm5,%xmm5 ___dcox86_wl_0_: movsd -31992(%eax),%xmm2 addl $2,%edx movhpd -32000(%eax),%xmm2 addl $16,%eax cmpl %ecx,%edx movsd -32000(%eax),%xmm7 movhpd -32008(%eax),%xmm7 movsd -32032(%eax),%xmm6 movhpd -32040(%eax),%xmm6 movsd -8(%eax),%xmm1 movhpd -16(%eax),%xmm1 mulpd %xmm4,%xmm2 movsd 8000(%eax),%xmm0 movhpd 7992(%eax),%xmm0 addpd %xmm7,%xmm2 movsd -32024(%eax),%xmm7 movhpd -32032(%eax),%xmm7 mulpd %xmm3,%xmm6 mulpd %xmm4,%xmm2 mulpd %xmm3,%xmm1 addpd %xmm7,%xmm6 movsd -31992(%eax),%xmm7 movhpd -32000(%eax),%xmm7 addpd %xmm0,%xmm1 movsd -32040(%eax),%xmm0 movhpd -32048(%eax),%xmm0 mulpd %xmm3,%xmm6 mulpd %xmm3,%xmm1 addpd %xmm7,%xmm2 movsd -32016(%eax),%xmm7 movhpd -32024(%eax),%xmm7 addpd %xmm0,%xmm1 mulpd %xmm5,%xmm2 addpd %xmm7,%xmm6 addpd %xmm2,%xmm6 mulpd %xmm5,%xmm6 addpd %xmm6,%xmm1 movhpd %xmm1,-8024(%eax) movsd %xmm1,-8016(%eax) jne ___dcox86_wl_0_
|
.L1012: leal 1(%edx), %esi movapd %xmm3, %xmm6 mulsd as1+32032(,%edx,8), %xmm6 addsd as1+40040(,%edx,8), %xmm6 mulsd %xmm3, %xmm6 addsd as1(,%edx,8), %xmm6 movapd %xmm3, %xmm1 mulsd as1(,%esi,8), %xmm1 addsd as1+16(,%edx,8), %xmm1 mulsd %xmm3, %xmm1 addsd as1+24(,%edx,8), %xmm1 movapd %xmm4, %xmm0 mulsd as1+24(,%esi,8), %xmm0 addsd as1+32(,%esi,8), %xmm0 mulsd %xmm4, %xmm0 addsd as1+40(,%esi,8), %xmm0 mulsd %xmm5, %xmm0 addsd %xmm0, %xmm1 mulsd %xmm5, %xmm1 addsd %xmm1, %xmm6 movsd %xmm6, as1+24016(,%esi,8) addl $2, %edx movapd %xmm3, %xmm6 mulsd as1+32032(,%esi,8), %xmm6 addsd as1+40040(,%esi,8), %xmm6 mulsd %xmm3, %xmm6 addsd as1(,%esi,8), %xmm6 movapd %xmm3, %xmm7 mulsd as1(,%edx,8), %xmm7 addsd as1+16(,%esi,8), %xmm7 mulsd %xmm3, %xmm7 addsd as1+24(,%esi,8), %xmm7 movapd %xmm4, %xmm2 mulsd as1+24(,%edx,8), %xmm2 addsd as1+32(,%edx,8), %xmm2 mulsd %xmm4, %xmm2 addsd as1+40(,%edx,8), %xmm2 mulsd %xmm5, %xmm2 addsd %xmm2, %xmm7 mulsd %xmm5, %xmm7 addsd %xmm7, %xmm6 movsd %xmm6, as1+24016(,%edx,8) cmpl %ecx, %edx jne .L1012
| Consider the SIMD instruction mulpd %xmm4,%xmm2
which was generated by the SIMDinator. It's operands are provided by movsd -31992(%eax),%xmm2
and movhpd -32000(%eax),%xmm2
and it's result is consumed by addpd %xmm7,%xmm2.
In the compiler generated code the scalar instructions which are marked
by this
color were
packed to generate this SIMD instruction. Note that their memory
operands were substituted by explicit memory loads ( marked by
this
color in
the dco generated
code ) and
their consumers, marked by this
color, were also packed. The
SIMDinator is enabled by default and is disabled by the -no-packing
option of dco.
The following study shows that, although SIMDinator may cause
significant code speed ups, it is not alway the case and sometimes
code,
generated by the SIMDinator, is slower than compiler generated.
This happens partially due to the requirements imposed on the operands
of SIMD instructions. To satisfy these requirements SIMDinator
is
often forced to use special purpose instructions (
pack/unpack/shuffle/movh ) which are expensive and may cancel the
improvement achieved by the use of the SIMD instructions.
The following table presents the execution data collected while
performing benchmarking of the Livermore loops - see this
for the description of the code that was benchmarked. The columns under gcc, gcc+dco and -np
headers present execution times ( in seconds ) and execution speeds (
in MFlops ) which are achieved by the compiler generated code, dco
optimized code with
the SIMDinator enabled
and dco optimized code with the SIMDinator
disabled
respectively. The column under the gcc+dco/gcc
header lists the
improvement achieved by utilizing dco with the
SIMDinator enabled as compared to
the compiler generated code. The column under the -np/gcc header lists the
improvement achieved by utilizing dco with the
SIMDinator disabled in comparison with
the compiler generated code. The column under the -np/gcc+dco
header shows how much faster the code generated without use of the
SIMDinator when compared to the code generated with the SIMDinator
enabled.
The best results are shown in this
color.
| Kernel# | gcc 4.1.2 | gcc+dco | -np | gcc+dco/gcc | -np/gcc | -np/gcc+dco |
| 1 | 4.96 | 1707.57 |
3.32 |
2551.83 | 4 | 2117.69 | 33.06% |
19.35% |
-20.48% |
| 2 | 2.38 | 1462.97 |
2.32 | 1500.79 | 2.32 | 1500.79 | 2.52% | 2.52% | 0.00% |
| 3 |
5.93 |
1136.8 | 3.55 | 1898.05 | 3.84 | 1754.86 | 40.13% |
35.24% |
-8.17% |
| 4 | 4.66 | 1239.76 |
4.12 |
1401.69 | 3.8 | 1519.63 | 11.59% |
18.45% |
7.77% | | 5 | 5.2 | 183.98 |
2.07 |
462.66 | 2.07 |
462.66 | 60.19% |
60.19% |
0.00% | | 6 |
4.53 |
718.38 |
3.9 |
834.32 | 3.63 | 896.31 | 13.91% |
19.87% |
6.92% | | 7 | 4.87 | 1677.7 |
3.12 |
2619.9 | 3.96 | 2063.61 | 35.93% |
18.69% |
-26.92% |
| 8 | 5 | 991.29 |
5.4 |
917.93 | 3.88 | 1277.1 | -8.00% |
22.40% |
28.15% | | 9 | 4.6 | 1540.48 |
3.95 |
1794.09 | 4.23 | 1675.28 | 14.13% |
8.04% |
-7.09% |
| 10 | 4.94 | 353.19 |
3.87 |
450.96 | 3.38 | 516.43 | 21.66% |
31.58% |
12.66% | | 11 | 5.78 | 96.75 |
1.52 |
368.67 | 1.52 |
368.67 | 73.70% |
73.70% |
0.00% | | 12 | 5.18 | 681.16 |
4.98 |
708.5 | 4.39 | 803.66 | 3.86% | 15.25% |
11.85% | | 13 | 4.57 | 198.85 | 5.56 |
163.47 | 4.58 |
198.42 | -21.66% |
-0.22% |
17.63% |
| 14 | 4.71 | 256.27 |
4.71 |
256.27 | 4.26 | 283.34 | 0.00% | 9.55% | 9.55% | | 15 | 3.72 | 586.53 | 3.72 | 586.53 | 3.72 |
586.53 | 0.00% | 0.00% | 0.00% | | 16 | 5.61 | 698.71 |
5.31 | 738.16 | 5.29 | 740.95 | 5.35% | 5.70% | 0.38% | | 17 | 5.01 | 593.81 | 4.99 | 596.19 | 4.99 |
596.19 | 0.40% |
0.40% |
0.00% | | 18 | 4.7 | 826.95 |
3.96 | 981.54 | 3.95 | 984.03 | 15.74% |
15.96% |
0.25% | | 19 | 5.81 | 372.45 |
4.1 |
527.69 | 4.1 |
527.69 | 29.43% |
29.43% |
0.00% | | 20 | 4.53 | 374.26 |
4.43 | 382.71 | 4.43 | 382.71 | 2.21% | 2.21% | 0.00% |
| 21 |
4.88 | 362.2 |
4.61 |
383.41 | 4.61 |
383.41 | 5.53% |
5.53% |
0.00% | | 22 | 4.88 | 191.36 | 6.21 |
150.35 | 4.86 | 192.14 | -27.25% |
0.41% | 21.74% | | 23 | 4.17 | 681.74 |
4.09 |
695.08 | 4.09 |
695.08 | 1.92% |
1.92% |
0.00% | | 24 | 4.85 | 132.3 |
0.77 | 837.82 | 0.77 | 837.82 | 84.12% |
84.12% |
0.00% |
| Geometric
Mean | 4.75 | | 3.64 | | 3.53 | | 23.37% |
25.68% |
3.02% |
As it can be observed
- SIMDinator-generated code may achieve
spectacular speed ups ( like in kernels 1, 7 )
- SIMDinator-generated may also produce
slower code ( like in kernels 13, 22 )
- there are many cases where dco
improves the code without utilizing the SIMD instructions ( like in
kernels 11, 24 )
- there are cases where SIMDinator does
improve the code but disabling it leads to faster results ( like in
kernels 4, 10 ).
|