x86 offers 'Single Instruction Multiple Data' ( SIMD ) instructions, which operate simultaneously on multiple data elements packed into their ( register ) operands. The part of dco that utilizes these instructions is called the SIMDinator ( pronounced seem-d-ney-ter ). While it may be one of the most important sources of optimization, capable of noticeably improving execution time, it is very complicated, difficult to implement and expensive to use. In this article we will describe how the SIMDinator operates, present an example of the code it generates and analyze the optimizations it achieves.

how the SIMDinator operates

A SIMD instruction functionally replaces several scalar instructions; e.g. executing addpd is functionally equivalent to executing a pair of addsd instructions in parallel. Before scalar instructions can be converted into SIMD instructions, suitable candidates must be identified. While identifying sequences of scalar instructions that have a SIMD equivalent and may be executed in parallel ( i.e. are independent of one another ), the SIMDinator maximizes the total number of candidates to be packed.

After the instruction sequences have been created, the corresponding instructions from these sequences are moved together, their operands are adjusted to satisfy the requirements of the x86 SIMD instructions and then the conversion takes place.

example of the SIMDinator-generated code

Consider the following code ( part of the Livermore kernels benchmark, which dco improved by 36% ):

code to be optimized ( the inner loop )
for ( k=0 ; k<n ; k++ )
{
    x[k] = u[k] + r*( z[k] + r*y[k] ) +
           t*( u[k+3] + r*( u[k+2] + r*u[k+1] ) +
           t*( u[k+6] + q*( u[k+5] + q*u[k+4] ) ) );
}
dco generated code
.L1012:
leal as1+32032(,%edx,8),%eax
unpcklpd %xmm3,%xmm3
unpcklpd %xmm4,%xmm4
unpcklpd %xmm5,%xmm5
___dcox86_wl_0_:
movsd -31992(%eax),%xmm2
addl $2,%edx
movhpd -32000(%eax),%xmm2
addl $16,%eax
cmpl %ecx,%edx
movsd -32000(%eax),%xmm7
movhpd -32008(%eax),%xmm7
movsd -32032(%eax),%xmm6
movhpd -32040(%eax),%xmm6
movsd -8(%eax),%xmm1
movhpd -16(%eax),%xmm1
mulpd %xmm4,%xmm2
movsd 8000(%eax),%xmm0
movhpd 7992(%eax),%xmm0
addpd %xmm7,%xmm2
movsd -32024(%eax),%xmm7
movhpd -32032(%eax),%xmm7
mulpd %xmm3,%xmm6
mulpd %xmm4,%xmm2
mulpd %xmm3,%xmm1
addpd %xmm7,%xmm6
movsd -31992(%eax),%xmm7
movhpd -32000(%eax),%xmm7
addpd %xmm0,%xmm1
movsd -32040(%eax),%xmm0
movhpd -32048(%eax),%xmm0
mulpd %xmm3,%xmm6
mulpd %xmm3,%xmm1
addpd %xmm7,%xmm2
movsd -32016(%eax),%xmm7
movhpd -32024(%eax),%xmm7
addpd %xmm0,%xmm1
mulpd %xmm5,%xmm2
addpd %xmm7,%xmm6
addpd %xmm2,%xmm6
mulpd %xmm5,%xmm6
addpd %xmm6,%xmm1
movhpd %xmm1,-8024(%eax)
movsd %xmm1,-8016(%eax)
jne ___dcox86_wl_0_
compiler generated code
.L1012:
leal 1(%edx), %esi
movapd %xmm3, %xmm6
mulsd as1+32032(,%edx,8), %xmm6
addsd as1+40040(,%edx,8), %xmm6
mulsd %xmm3, %xmm6
addsd as1(,%edx,8), %xmm6
movapd %xmm3, %xmm1
mulsd as1(,%esi,8), %xmm1
addsd as1+16(,%edx,8), %xmm1
mulsd %xmm3, %xmm1
addsd as1+24(,%edx,8), %xmm1
movapd %xmm4, %xmm0
mulsd as1+24(,%esi,8), %xmm0
addsd as1+32(,%esi,8), %xmm0
mulsd %xmm4, %xmm0
addsd as1+40(,%esi,8), %xmm0
mulsd %xmm5, %xmm0
addsd %xmm0, %xmm1
mulsd %xmm5, %xmm1
addsd %xmm1, %xmm6
movsd %xmm6, as1+24016(,%esi,8)
addl $2, %edx
movapd %xmm3, %xmm6
mulsd as1+32032(,%esi,8), %xmm6
addsd as1+40040(,%esi,8), %xmm6
mulsd %xmm3, %xmm6
addsd as1(,%esi,8), %xmm6
movapd %xmm3, %xmm7
mulsd as1(,%edx,8), %xmm7
addsd as1+16(,%esi,8), %xmm7
mulsd %xmm3, %xmm7
addsd as1+24(,%esi,8), %xmm7
movapd %xmm4, %xmm2
mulsd as1+24(,%edx,8), %xmm2
addsd as1+32(,%edx,8), %xmm2
mulsd %xmm4, %xmm2
addsd as1+40(,%edx,8), %xmm2
mulsd %xmm5, %xmm2
addsd %xmm2, %xmm7
mulsd %xmm5, %xmm7
addsd %xmm7, %xmm6
movsd %xmm6, as1+24016(,%edx,8)
cmpl %ecx, %edx
jne .L1012

Consider the SIMD instruction mulpd %xmm4,%xmm2 generated by the SIMDinator. Its operands are provided by movsd -31992(%eax),%xmm2 and movhpd -32000(%eax),%xmm2 and its result is consumed by addpd %xmm7,%xmm2. In the compiler generated code ( which is unrolled by a factor of two ) the two corresponding scalar mulsd instructions, one from each unrolled iteration, were packed to generate this SIMD instruction. Note that their memory operands were replaced by explicit memory loads ( the movsd/movhpd pairs in the dco generated code ) and their consumers were packed as well.

analysis of the optimizations achieved by the SIMDinator

The SIMDinator is enabled by default and can be disabled with the -no-packing option of dco. The following study shows that, although the SIMDinator may deliver significant speedups, this is not always the case: sometimes the code generated by the SIMDinator is slower than the compiler generated code. This happens partially because of the requirements imposed on the operands of SIMD instructions. To satisfy these requirements the SIMDinator is often forced to use special purpose instructions ( pack/unpack/shuffle/movh ), which are expensive and may cancel the improvement achieved by the use of the SIMD instructions.

The following table presents the execution data collected while benchmarking the Livermore loops - see this for the description of the code that was benchmarked. The columns under the gcc, gcc+dco and -np headers present execution times ( in seconds ) and execution speeds ( in MFlops ) achieved by the compiler generated code, by the dco optimized code with the SIMDinator enabled and by the dco optimized code with the SIMDinator disabled, respectively. The column under the gcc+dco/gcc header lists the improvement achieved by dco with the SIMDinator enabled as compared to the compiler generated code. The column under the -np/gcc header lists the improvement achieved by dco with the SIMDinator disabled as compared to the compiler generated code. The column under the -np/gcc+dco header shows how much faster the code generated without the SIMDinator is when compared to the code generated with the SIMDinator enabled.

Kernel#   gcc 4.1.2         gcc+dco           -np               gcc+dco/gcc  -np/gcc  -np/gcc+dco
          sec    MFlops     sec    MFlops     sec    MFlops
1         4.96   1707.57    3.32   2551.83    4      2117.69     33.06%      19.35%   -20.48%
2         2.38   1462.97    2.32   1500.79    2.32   1500.79      2.52%       2.52%     0.00%
3         5.93   1136.8     3.55   1898.05    3.84   1754.86     40.13%      35.24%    -8.17%
4         4.66   1239.76    4.12   1401.69    3.8    1519.63     11.59%      18.45%     7.77%
5         5.2    183.98     2.07   462.66     2.07   462.66      60.19%      60.19%     0.00%
6         4.53   718.38     3.9    834.32     3.63   896.31      13.91%      19.87%     6.92%
7         4.87   1677.7     3.12   2619.9     3.96   2063.61     35.93%      18.69%   -26.92%
8         5      991.29     5.4    917.93     3.88   1277.1      -8.00%      22.40%    28.15%
9         4.6    1540.48    3.95   1794.09    4.23   1675.28     14.13%       8.04%    -7.09%
10        4.94   353.19     3.87   450.96     3.38   516.43      21.66%      31.58%    12.66%
11        5.78   96.75      1.52   368.67     1.52   368.67      73.70%      73.70%     0.00%
12        5.18   681.16     4.98   708.5      4.39   803.66       3.86%      15.25%    11.85%
13        4.57   198.85     5.56   163.47     4.58   198.42     -21.66%      -0.22%    17.63%
14        4.71   256.27     4.71   256.27     4.26   283.34       0.00%       9.55%     9.55%
15        3.72   586.53     3.72   586.53     3.72   586.53       0.00%       0.00%     0.00%
16        5.61   698.71     5.31   738.16     5.29   740.95       5.35%       5.70%     0.38%
17        5.01   593.81     4.99   596.19     4.99   596.19       0.40%       0.40%     0.00%
18        4.7    826.95     3.96   981.54     3.95   984.03      15.74%      15.96%     0.25%
19        5.81   372.45     4.1    527.69     4.1    527.69      29.43%      29.43%     0.00%
20        4.53   374.26     4.43   382.71     4.43   382.71       2.21%       2.21%     0.00%
21        4.88   362.2      4.61   383.41     4.61   383.41       5.53%       5.53%     0.00%
22        4.88   191.36     6.21   150.35     4.86   192.14     -27.25%       0.41%    21.74%
23        4.17   681.74     4.09   695.08     4.09   695.08       1.92%       1.92%     0.00%
24        4.85   132.3      0.77   837.82     0.77   837.82      84.12%      84.12%     0.00%
Geometric
Mean      4.75              3.64              3.53               23.37%      25.68%     3.02%

As can be observed