Modern x86 processors offer multiple execution units ( "cores" ) which allow simultaneous ( "parallel" ) execution of multiple code sequences. dco takes advantage of this functionality and offers powerful automatic parallelization that is capable of identifying code patterns suitable for parallelization and creating optimized code that is executed by all the available cores. The part of dco that implements the auto-parallelizer is called PARALLator ( pronounced paral-a-ter ). In this article we will describe:

how the PARALLator operates

dco's automatic parallelization is suitable for shared memory multi-core systems with uniform memory access ( UMA ): multiple identical processors ( cores ) on a single chip that have equal access, and equal access times, to memory.

dco uses a threading model in which a single process creates and manages multiple threads that are executed in parallel by different cores. This is implemented by calling subroutines from the OpenMP library, which therefore has to be linked in ( one way to achieve that is to specify the -fopenmp option during linking ).

The PARALLator is enabled by the -parallel command line option of dco. It identifies code sequences suitable for parallelization and, while generating the parallel code, performs, as necessary, all of the following actions: resource saving/restoration, memory redirection, code splitting, code modification, thread startup/termination etc.

comparison with OpenMP

OpenMP is an application program interface ( API ) that may be used to explicitly direct multi-threaded, shared memory parallelism. It has become a successful user model for developing parallel applications and has emerged as the de facto standard for shared-memory parallel programming. It is therefore important to distinguish dco from OpenMP and to establish the advantages of using dco over OpenMP.

dco and OpenMP are used very differently. OpenMP requires the user to find the parallelizable code, verify that parallelizing it is safe, and annotate it explicitly.
dco's parallel code generation is fully automated and does not require any involvement from the user.
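
For illustration, here is what the user has to write to parallelize even a simple reduction with OpenMP ( a generic sketch, not taken from any of the benchmarks below ):

    #include <omp.h>

    /* With OpenMP the user must locate the parallelizable loop, verify
       its safety and annotate it by hand; the directive below tells the
       runtime to split the iterations among the threads and to combine
       the partial sums. */
    double dot(const double *a, const double *b, long n)
    {
        double sum = 0.0;
        #pragma omp parallel for reduction(+:sum)
        for (long i = 0; i < n; i++)
            sum += a[i] * b[i];
        return sum;
    }

With dco no such markup exists: the equivalent transformation is derived automatically from the compiler generated assembly code.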

Besides the differences in use, dco is able to parallelize code that OpenMP cannot; see this for an example.

parallel code explained

A dco generated parallel program for a sequential code consists of two parts:
  1. master thread
  2. parallel thread

master thread

parallel thread

The parallel thread acts as a separate subroutine that:
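
A conceptual sketch of this two-part structure is given below ( illustrative C code; dco itself operates at the assembly level and calls the OpenMP runtime subroutines directly ):

    #include <omp.h>

    /* The parallel work is outlined into a separate subroutine that the
       OpenMP runtime executes on every available core; each thread
       processes its own chunk of the iteration space. */
    static void parallel_thread(double *a, const double *b, long n)
    {
        int tid = omp_get_thread_num();
        int nth = omp_get_num_threads();
        long lo = n * tid / nth;
        long hi = n * (tid + 1) / nth;
        for (long i = lo; i < hi; i++)
            a[i] = 2.0 * b[i];
    }

    /* The master thread starts the parallel region on all cores and
       resumes sequential execution when all threads have finished. */
    void master_thread(double *a, const double *b, long n)
    {
        #pragma omp parallel
        parallel_thread(a, b, n);
    }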

overhead of the parallel code

Both the master and the parallel threads use subroutines from the OpenMP library, some of which come at a considerable cost. This overhead may significantly impact the performance of the parallelized code. While analysing the code to be parallelized, dco takes the overhead of the OpenMP subroutines into account and avoids parallelization if it determines that the cost of parallel execution exceeds its benefits.

It was estimated that on the Core2 system the overhead of utilizing the OpenMP subroutines is about 3500 clocks. This means that, for example, on a dual core system parallelization may be beneficial only for sequential loops running for at least 7000 clocks:

if the execution time of a sequential loop is ExecTime clocks, the execution time of the parallel code on a dual core system will be ( roughly ) ExecTime/2 + Overhead.
for parallelization to be beneficial, the execution time of the parallel code should not exceed the execution time of the sequential code:
Overhead + ExecTime/2 < ExecTime
thus Overhead < ExecTime/2, that is
ExecTime > 2*Overhead
with Overhead = 3500 clocks this gives ExecTime > 7000 clocks.

Note that the cost of the OpenMP library routines utilized by dco is an important ( but not the only ) factor that affects the performance of the parallel code.

optimizations achieved by the PARALLator

evaluation of the results of optimization achieved by the PARALLator

The evaluation was performed on 64-bit Linux operating systems running on different x86 CPUs. The gcc version 8.1.1 compiler, used to process the benchmarks, was invoked ( in most cases, see this for an exception and the explanation why it was done differently ) with the following command line options:

-S -O2 -fomit-frame-pointer -ffast-math -mfpmath=sse -msse4.1 -fno-dwarf2-cfi-asm -fno-asynchronous-unwind-tables

The compiler generated assembly file was optimized by dco using the -parallel command line option, and the optimized assembly code was processed ( assembled/linked ) by gcc using the -fopenmp command line option.
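
Put together, the pipeline looks roughly as follows ( the exact dco invocation syntax is assumed here for illustration; only the gcc options are taken from above ):

    # 1. compile the benchmark to assembly
    gcc -S -O2 -fomit-frame-pointer -ffast-math -mfpmath=sse -msse4.1 \
        -fno-dwarf2-cfi-asm -fno-asynchronous-unwind-tables bench.c

    # 2. let dco parallelize the generated assembly ( invocation syntax assumed )
    dco -parallel bench.s

    # 3. assemble/link the optimized assembly, pulling in the OpenMP runtime
    gcc bench.s -o bench -fopenmp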

The generated executables were executed on systems with the minimal possible load. The execution times reported are the elapsed real time between invocation and termination of the program.


samples of the results of optimization achieved by the PARALLator

For every benchmark listed below we include x86 Linux executables created for the code generated by the compiler and by dco. You may download them and execute them on your system. The data is provided as a compressed archive ( 'nameOfExecutable'.zip ) that consists of the following items:
Unzip the provided archive and run the script/executables on a system with the minimal possible load.
Should you run this script, please share with us the execution times obtained on your system. Please note that we intend to eventually publish this data ( comparing optimization results for different x86 processors ); your private information will not be revealed.


results of optimizations achieved by the PARALLator

Here we present the performance results for a set of selected benchmarks. We will only present cases that are interesting and/or important and/or difficult, and avoid providing an endless list of kernels ( dot-product etc. ) that were successfully parallelized by dco. When possible, a sample of the sequential code that was successfully parallelized is shown.
Optimization results for the following benchmarks, which were successfully parallelized by dco, are listed below:

The following table presents the execution data collected so far for the selected benchmarks on different hardware.
Each row represents data for a certain x86 CPU. For each benchmark two values are listed: the execution time ( in seconds ) of the compiler generated code and the execution time of the dco generated code; the percentage is the improvement achieved by dco.
                      not parallel      serial version of    alvinn           Molecular
                      on OpenMP*        the NPB's EP code                     Dynamics

E8600 @ 3.33GHz       4.96 / 3.14       1.85 / 1.35          0.44 / 0.27      54.44 / 32.49
( 2 cores )           ( 37% )           ( 27% )              ( 39% )          ( 40% )

i5-4440 @ 3.10GHz     2.13 / 1.77       -                    -                38.64 / 14.39
( 4 cores )           ( 17% )                                                 ( 63% )

( '-' means that no data was collected for that benchmark on that CPU )

* see this for more discussion regarding this benchmark

not parallel on OpenMP

The following is an example of a recurrence, which cannot be directly parallelized by OpenMP ( an illustrative sketch of the loop in question; the array names are assumptions, ARRAY_DIM is the benchmark's problem size parameter discussed below ):


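    #define ARRAY_DIM 100000000   /* problem size used for the timings below */

    /* Each iteration consumes the value produced by the previous one,
       so a straightforward "#pragma omp parallel for" cannot be applied. */
    void recurrence(double *a, const double *b)
    {
        for (long i = 1; i < ARRAY_DIM; i++)
            a[i] = a[i - 1] + b[i];
    }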
It was successfully parallelized by dco, achieving the expected execution time improvement: the times reported here are for ARRAY_DIM set to 100000000 ( which makes the overhead of the parallel code negligible ) and, considering the cost of executing the split code, are in the range of the improvement predicted by Amdahl's law.
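As a reminder, Amdahl's law states that if a fraction p of the work can be parallelized over N cores, the achievable speedup is bounded by 1 / ( ( 1 - p ) + p/N ); for example, with p = 0.9 and N = 2 the speedup cannot exceed 1.82.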
See this for more discussion regarding this benchmark.

Click this to download the executables for this benchmark. See this for instructions on how to run these executables.

serial version of the NPB's EP code

EP is the "Embarrassingly Parallel" benchmark from the NAS Parallel Benchmarks ( NPB ). It actually appears to be one of the toughest cases of NPB, and there is definitely nothing embarrassing about successfully parallelizing this benchmark.



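The kernel in question is the EP inner loop; a sketch of it, paraphrased from the NPB serial sources ( the line numbering here is illustrative and matches the reference below ), is:

     1      do i = 1, nk
     2         x1 = 2.d0 * x(2*i-1) - 1.d0
     3         x2 = 2.d0 * x(2*i) - 1.d0
     4         t1 = x1 * x1 + x2 * x2
     5         if ( t1 .le. 1.d0 ) then
     6            t2 = sqrt( -2.d0 * log( t1 ) / t1 )
     7            t3 = x1 * t2
     8            t4 = x2 * t2
     9            l  = max( abs( t3 ), abs( t4 ) )
    10            q(l) = q(l) + 1.d0
    11            sx = sx + t3
    12            sy = sy + t4
    13         endif
    14      enddo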
The code, as it is presented in the NPB serial suite, is not parallelizable as is. Line 10, q(l) = q(l) + 1.d0, introduces a data dependency; the way NAS solves it in the NPB OpenMP suite is by significantly altering the source of the benchmark. dco doesn't require the user to change the source code and performs all the necessary alterations automatically. The only change we applied is to replace common/storage/ x(2*nk), q(0:nq-1) by common/storage/ x(2*nk);common/storage1/ q(0:nq-1), creating a different memory segment for q and thus establishing the disambiguation of this memory segment from other memory segments; the data dependency outlined above was not eliminated by this alteration.

We run this benchmark for class W; running for other classes produces different timings, but the ratio between the time of the compiler generated code and the time of the dco generated code remains the same.
See this for execution data achieved.

Click this to download the executables for this benchmark. See this for instructions on how to run these executables.

alvinn

alvinn is a CPU intensive single-precision application from the SPEC floating-point benchmark suite. It represents neural network and robotic applications. Most of the loops of this benchmark are small and executed for 30 iterations only ( which makes the overhead of the parallel code significant ). See this for execution data achieved.

All of the important loops of this benchmark were fully parallelized, for example the body of the routine "input_hidden", which accounts for 64% of the execution time of the sequential version of the code. A sketch of that loop nest follows ( illustrative; the constants and array names are assumptions, not the exact SPEC sources ):

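    #include <math.h>

    #define NUM_INPUTS 1220   /* assumed sizes, for illustration only */
    #define NUM_HIDDEN   30

    /* Feed-forward step: every hidden unit accumulates a weighted sum
       of the input units and passes it through a sigmoid.  The outer
       loop runs for NUM_HIDDEN ( 30 ) iterations only - hence the
       noticeable parallelization overhead mentioned above. */
    void input_hidden(const float input_units[NUM_INPUTS],
                      const float input_weights[NUM_INPUTS][NUM_HIDDEN],
                      float hidden_units[NUM_HIDDEN])
    {
        for (int h = 0; h < NUM_HIDDEN; h++) {
            float sum = 0.0f;
            for (int i = 0; i < NUM_INPUTS; i++)
                sum += input_units[i] * input_weights[i][h];
            hidden_units[h] = 1.0f / (1.0f + expf(-sum));
        }
    }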


Click this to download the executables for this benchmark. See this for instructions on how to run these executables.

Molecular Dynamics

This benchmark is a CPU intensive double-precision application that implements a molecular dynamics simulation using the velocity Verlet time integration scheme. In order to demonstrate the abilities and power of dco, while compiling the code for this benchmark we added the -fno-inline-functions compiler option ( although this doesn't seem to be necessary when compiling with -O2 ); this ensured that parallelization of the function compute requires full data flow analysis, thus making the parallelization task more difficult and challenging.
We run this benchmark with the following parameters:
2 500 500 0.01
See this for execution data achieved.


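The benchmark's listing is not reproduced here; a sketch of a velocity Verlet time step of the kind it performs ( illustrative; the names and layout are assumptions ) is:

    /* One velocity Verlet step over np particles in nd dimensions:
       positions and velocities are advanced using the accelerations of
       the previous step and the forces produced by the function compute,
       whose parallelization is discussed above. */
    void update(int np, int nd, double pos[], double vel[], double acc[],
                const double force[], double mass, double dt)
    {
        for (int i = 0; i < np; i++) {
            for (int d = 0; d < nd; d++) {
                int k = i * nd + d;
                pos[k] += vel[k] * dt + 0.5 * acc[k] * dt * dt;
                vel[k] += 0.5 * dt * (force[k] / mass + acc[k]);
                acc[k]  = force[k] / mass;
            }
        }
    }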

Click this to download the executables for this benchmark. See this for instructions on how to run these executables.