Table of Contents

General Overview

Features and Description

Current Implementation

Source of optimization

Basic Blocks

Calling Conventions

Stack Alignment

Direction flag df

Invocation

Getting help

Code options

Selected code optimization

Call to the special functions

Use of the special comment lines

Changing optimization options

Auto parallelization

Suggestions for using dco

Incorporating with a compiler

Choosing command options

Exact Results

quick optimization

Preparing source code for optimization

Insert calls to the special functions

Insert special comment lines

choosing compiler options

disable generation of debug information

generate optimized code

generate appropriate code

General Overview

Generating optimal code for the x86 architecture requires techniques that were not needed for previous hardware architectures; they are therefore unfamiliar to most programmers and are not incorporated into the x86 tools currently available on the market. As a result, coding for the x86 is inefficient and time-consuming, and the generated code is, in most cases, far from optimal.
We can help you to solve this problem. We developed an optimizing/parallelizing code system for the x86 family of processors. dco is a software package specifically designed to optimize x86 assembly code by taking full advantage of the options and features provided by the x86 processor.
dco is intended to optimize compiler-generated code. The programmer uses a compiler (C, Fortran etc.) to translate source code into x86 assembly code, which is then used as input to dco. The output generated by dco is highly optimized x86 assembly code that is logically identical to the original; dco rearranges the existing code, performing optimizations that take full advantage of the functionality offered by the x86 processor. To create the final object file, the generated code should be assembled.
Note that dco does not require preprocessing or any other involvement from the user. It is fully automated and may be incorporated into makefiles or other product generation tools.
Use of dco will greatly improve the quality of the generated code. It, therefore, may prove to be a vital contribution to the production of a winning x86 solution.

Features and Description

Current Implementation

The currently available implementation of the optimizer should be used only to optimize code generated by the gcc compiler. It accepts x86 assembly code in the so-called AT&T assembler syntax ( see here for some explanation ); so
addsd %xmm1,%xmm2
is interpreted as
%xmm2 = %xmm2 + %xmm1

dco currently supports IA-32 and x86-64 architectures featuring SSE, SSE2, SSE3 and SSE4 extensions.
dco is currently available for Linux. An experimental version of the product for 64-bit Windows is also provided. See this for more details.

Source of optimization

dco is a software package that optimizes x86 code by taking full advantage of the options and features provided by the processor. It implements a great number of optimization techniques, among which the following seem to be of particular importance:
By default dco will perform most of the optimizations that are available. It is possible to enable or disable any number of the optimization techniques.

Basic Blocks

A basic block is a sequence of non-branching x86 instructions in which the flow of control enters at the top. dco treats as a basic block any sequence of instructions preceded and/or followed by a label or a branch instruction.
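
The rule above can be sketched in a few lines of C. This is a simplified illustration and not dco's actual algorithm: the function name count_basic_blocks is hypothetical, a label is assumed to be any line ending in ':', and a branch is assumed to be any instruction whose mnemonic starts with 'j'.

```c
#include <stddef.h>
#include <string.h>

/* Skip leading blanks and tabs. */
static const char *skip_ws(const char *s)
{
    while (*s == ' ' || *s == '\t')
        s++;
    return s;
}

/* Assumption: a label is a non-empty line whose last character is ':'. */
static int is_label(const char *line)
{
    const char *s = skip_ws(line);
    size_t n = strlen(s);
    return n > 0 && s[n - 1] == ':';
}

/* Assumption: a branch is any instruction whose mnemonic starts with 'j'. */
static int is_branch(const char *line)
{
    return *skip_ws(line) == 'j';
}

/* Count maximal runs of non-branching instructions delimited by labels,
   branches, or the ends of the listing. */
size_t count_basic_blocks(const char *const lines[], size_t n)
{
    size_t blocks = 0;
    int in_block = 0;

    for (size_t i = 0; i < n; i++) {
        if (is_label(lines[i]) || is_branch(lines[i])) {
            in_block = 0;           /* a delimiter closes the current run */
        } else if (*skip_ws(lines[i]) != '\0') {
            if (!in_block)
                blocks++;           /* first instruction of a new run */
            in_block = 1;
        }
    }
    return blocks;
}
```

For a listing consisting of comisd, jnb, movsd and a closing label, this sketch reports two basic blocks: the comisd before the branch and the movsd between the branch and the label.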

Calling Conventions

dco determines resource usage of subroutines defined in the code to be optimized and assumes the following x86 programming calling conventions for subroutine calls:

    for the IA-32 ( 32-bit mode ) Linux code:
    for the x86-64 ( 64-bit mode ) Linux code:
    for the x86-64 ( 64-bit mode ) Windows code:

Stack Alignment

Although the stack word size is 8 bytes, it is assumed that the stack pointer is aligned by 16 at ( before ) any call instruction. Consequently we rely on
RSP = 8 modulo 16
at every function entry. It was observed that when compiling with option -O0 ( and, likely, when using -Os or -mpreferred-stack-boundary=2 ) this assumption does not hold, and therefore such code may not be optimized by dco.
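
The link between the two alignment statements is simple arithmetic: in 64-bit mode a call instruction pushes an 8-byte return address, so a stack pointer that is a multiple of 16 before the call equals 8 modulo 16 at function entry. The helper names below are hypothetical and serve only to illustrate the check.

```c
#include <stdint.h>

/* Value of the stack pointer at function entry, given its value just
   before the call: the call instruction pushes an 8-byte return address. */
uint64_t rsp_at_entry(uint64_t rsp_before_call)
{
    return rsp_before_call - 8;
}

/* Nonzero if the entry-time stack pointer satisfies RSP = 8 modulo 16. */
int entry_alignment_ok(uint64_t rsp_entry)
{
    return rsp_entry % 16 == 8;
}
```

For example, a pre-call value of 0x7ffc12345670 is a multiple of 16 and yields a conforming entry value, whereas 0x7ffc12345674 ( aligned only by 4 ) does not.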

Direction flag df

It is assumed that the direction flag df is cleared by default. If the direction flag is set, then it is assumed that it will be cleared again before any call or return. If the direction flag is set by dco, it will be cleared right after its use.

Invocation

The invocation of dco has the form:

dco <parameter list>

The following is a list of the parameters and descriptions of their functionality:

[-i] file_name - specifies an input source file which contains the x86 assembly code to be optimized. It is the only parameter that must be specified.

-o file_name - specifies an output file which will contain the x86 assembly code generated by dco; stdout is used if this parameter is not specified.

-32 - processes code for 32-bit environment ( IA-32 ); this option is available only on Linux.

-64 - processes code for 64-bit environment ( x86-64 ); this option is available only on Linux.

-er - exact results - ensures that optimized code will generate results identical to the original one. See this for more explanation.

-noer - allows optimizations that may generate results not identical to the original one. See this for more explanation.

-packing - enables the SIMDinator that attempts to pack SIMD instructions.

-nopacking - disables the SIMDinator that attempts to pack SIMD instructions; note that SIMD instructions may still be used.

-parallel - enables auto-parallelization. See this for more explanation.

-noparallel - disables auto-parallelization.

-lu # - specifies the loop unrolling parameter ( number of times the loop will be unrolled ). The default is 1 ( no loop unrolling ). See this for more explanation.

-bbs # - basic block size - specifies the maximum number of instructions in a basic block to be processed. The default is 100.

-space - space optimizations, improves code size rather than execution time of the code ( that is a default ).

-nospace - speed optimizations, improves the execution time of the code, possibly making it larger.

-quick - quick optimizations - performs optimizations only for basic blocks that are bodies of loops.

-noquick - performs optimizations for all basic blocks of the code.

-slct - causes only selected areas of code to be optimized. See this for more explanation.

The default invocation parameters are:

-64 -noer -lu 1 -bbs 100 -noparallel -noquick -packing

Getting help

To obtain help information from dco, invoke it without any parameters.
To obtain a brief description of the input options, invoke: dco -h.
To obtain a brief description of the product, invoke: dco -about.

Code options

dco accepts as input x86 assembly files in the so-called AT&T assembler syntax ( see here for details ). All comment lines are ignored except for the exceptions described in the following two sections.

Selected code optimization

dco allows you to optimize only selected portions of your code. To do this you must specify the -slct option during command invocation and select the portions of the code to be optimized. Code selection is done by enclosing the desired portion of the code between markers.

Code selection may be done in the following ways.

Call to the special functions

Code selection can be specified by enclosing the desired portion of the code between calls to the following functions:

[text]dco_start[text]()
and
[text]dco_end[text]()

[text] preceding or following dco_start and dco_end may be used to comment on the portion of the code being selectively optimized, and does not have to be the same for the start and end function names. Note that [text]dco_start[text] and [text]dco_end[text] shall be valid function names.
See this for example of how to use this specification.

Use of the special comment lines

Code selection can be specified by enclosing the desired portion of the code between the comment lines:

#.dco_start <option list>
and
#.dco_end

#.dco_start <option list> indicates the beginning of the portion of the code that should be selectively optimized. The selectively optimized code extends to the end of the input file or to the comment line #.dco_end. <option list> is a list of options that will be in effect while optimizing the selected portion of the code; these options, if specified, alter the default parameters or the parameters specified on the invocation line. The options that may be specified include all the options listed here except -slct, -i and -o.

Note that selection of the code portion shall be done exactly as written above. For example,

# .dco_start

will be considered as just a comment line ( # followed by a space ).
See this for more on selected code optimization.

Changing optimization options

dco allows you to change the options that were in effect at invocation. This is done by specifying the comment line:

#.dco_options <option list>

<option list> is a list of options that will be in effect while optimizing the following portion of the code; these options, if specified, alter the default parameters or the parameters specified on the invocation line. Specifying #.dco_options without <option list> will cause the options to be restored to those of the invocation of dco.

The options that may be specified include all the options listed here except -slct, -i and -o.
Note that the comment for option specification should look exactly as written above. For example,

# .dco_options

will be considered as just a comment line.

Auto parallelization

dco offers powerful auto-parallelization that is capable of identifying code patterns suitable for parallelization and creating optimized code that will be executed by all the available cores. See this for additional information about the auto-parallelization provided by dco.

To enable auto-parallelization you must specify the -parallel option and request the use of the OpenMP library during linkage ( one way to achieve that is to specify the -fopenmp option while linking ). For example, to parallelize 'test.c', creating the executable 'test', do the following:
gcc -S test.c
dco -i test.s -o otest.s -parallel
gcc -o test -fopenmp otest.s
rm otest.s test.s

dco auto-parallelizes code sequences spanning numerous basic blocks that may include function calls. It is assumed that functions from the standard libraries are ISO C and POSIX compliant, satisfying the requirements specified here. Do not use the auto-parallelizer if that is not the case on your development system.
The auto-parallelizer shall not be applied to code that is already parallelized, e.g. by OpenMP or dco.

Suggestions for using dco

This section contains hints and suggestions on using features provided by dco. It should not be considered a comprehensive guide to the usage of dco. As you gain experience using the product, you will develop other techniques which suit your needs and professional habits.

Incorporating with a compiler

dco is designed to work with the gcc compiler on Linux or a port of the gcc compiler on Windows ( e.g. mingw-w64 ), which generates assembly output of the compiled code when the -S option is specified during compilation. Compiler-generated code shall conform to the mode of operation in which dco is invoked:
For Linux:
if -32 is specified - compiler shall generate 32-bit code
if -64 is specified - compiler shall generate 64-bit code
For Windows:
compiler shall always generate 64-bit code

In 64-bit mode on Linux, dco was fully verified to work with the clang compiler version 9.0.1; in the current version of dco, support for clang is provided as "experimental".

Assume that the compiler driver gcc is available on your system. To optimize the file 'test.c' do the following:
gcc -S test.c
dco -i test.s -o otest.s
mv otest.s test.s
gcc -c test.s
rm test.s

gcc -S test.c compiles the input file 'test.c' and generates assembly output file 'test.s'. You may specify other compiler options ( to perform optimization etc.), see this for information about that.

dco -i test.s -o otest.s optimizes the input file 'test.s' generating as an output file 'otest.s'.

mv otest.s test.s renames file 'otest.s' to 'test.s'.

gcc -c test.s assembles file 'test.s' producing as an output object file 'test.o'.

rm test.s deletes file 'test.s'.


The described procedure may be easily incorporated into makefiles, batch files or other product generation tools. For example, makefiles often generate object files by specifying rules that translate a C source file into an object file, e.g.

.c.o:
	$(CC) $(CFLAGS) -c $<

In order to incorporate dco the rule may be rewritten as:

.c.o:
	$(CC) $(CFLAGS) -S $<
	dco -i $*.s -o $*.so $(DCO_OPT)
	mv $*.so $*.s
	$(CC) $(CFLAGS) -c $*.s
	rm $*.s

Choosing command options

This paragraph explains the usage of the basic command options. Note that, in most cases, disabling an optimization option will decrease the quality of resulting code.

Exact Results

dco always produces code that is mathematically equivalent to the original. However, due to the inexact nature of the floating point execution, the results of the optimized code may differ from that of the original code. For example, the original code:

addsd %xmm2,%xmm6
addsd %xmm3,%xmm6
addsd %xmm4,%xmm6

may be substituted by:

addsd %xmm2,%xmm3
addsd %xmm4,%xmm6
addsd %xmm3,%xmm6

which, although being mathematically equivalent to the original, may produce a different value in the register xmm6.

Optimizations that may cause such a behaviour may be disabled by using the parameter -er.
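
The effect of such a reordering can be reproduced in plain C. The sketch below mirrors the two instruction sequences above as two summation orders, with s, a, b and c standing for the initial contents of xmm6, xmm2, xmm3 and xmm4; the function names and the input values mentioned afterwards are chosen for illustration only.

```c
/* Original order: ((s + a) + b) + c, as in the first addsd sequence. */
double sum_original(double s, double a, double b, double c)
{
    volatile double t = s + a;   /* volatile keeps every rounding step */
    t = t + b;
    t = t + c;
    return t;
}

/* Reordered: (s + c) + (b + a), as in the substituted sequence. */
double sum_reordered(double s, double a, double b, double c)
{
    volatile double u = b + a;
    volatile double t = s + c;
    t = t + u;
    return t;
}
```

With s = 1e16, a = b = 1.0 and c = -1e16, sum_original returns 0.0 while sum_reordered returns 2.0 under IEEE 754 double-precision arithmetic, because 1e16 + 1.0 rounds back to 1e16.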

quick optimization

Choosing this option ( -quick ) may significantly reduce the CPU time of the package execution without having great impact on the quality of the produced code.


Preparing source code for optimization

No special preparations are necessary for the source code to be optimized by dco. However, it is strongly suggested to optimize only those portions of the code in which the program spends most of its execution time; to do that, use selected code optimization.
To use selected code optimization and/or to change optimizer options ( as specified here ), the source of the program shall be altered before compilation and/or the assembly input to dco shall be changed before optimization.

Insert calls to the special functions

The following shows how to prepare a block of Fortran code for optimization by dco. Note that the calls to the special functions dco_start and dco_end are compiled conditionally, so the original code can still be compiled without any modification. If dco is to be used, -DDCO shall be specified as a compiler option during compilation.

#ifdef DCO
	call dco_start
#endif
         do 140 i = 1, nk
            x1 = 2.d0 * x(2*i-1) - 1.d0
            x2 = 2.d0 * x(2*i) - 1.d0
            t1 = x1 ** 2 + x2 ** 2
            if (t1 .le. 1.d0) then
               t2   = sqrt(-2.d0 * log(t1) / t1)
               t3   = (x1 * t2)
               t4   = (x2 * t2)
               l    = max(abs(t3), abs(t4))
               q(l) = q(l) + 1.d0
               sx   = sx + t3
               sy   = sy + t4
            endif
 140     continue

#ifdef DCO
	call dco_end
#endif
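
The same technique can be sketched in C. This is an illustration only: the function sum_of_squares is hypothetical, and the plain names dco_start and dco_end are used on the assumption that leaving [text] empty is acceptable. The declarations exist only so that the calls compile; this manual does not show definitions for these functions, and none are given here. As in the Fortran example, the calls are compiled only when -DDCO is specified.

```c
#include <stddef.h>

#ifdef DCO
/* Marker functions looked for by dco -slct; declared so the calls compile. */
void dco_start(void);
void dco_end(void);
#endif

/* Hypothetical hot loop: returns the sum of squares of x[0..n-1]. */
double sum_of_squares(const double *x, size_t n)
{
    double s = 0.0;

#ifdef DCO
    dco_start();            /* begin the region selected for optimization */
#endif
    for (size_t i = 0; i < n; i++)
        s += x[i] * x[i];
#ifdef DCO
    dco_end();              /* end of the selected region */
#endif

    return s;
}
```

Compiled without -DDCO, the function builds and behaves as ordinary C, which keeps the original build path intact.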

Insert special comment lines

gcc provides the asm statement, which allows the special comment lines to be inserted into the C source before compilation, as follows:
	.
	asm( "#.dco_start" );
	
	Code to be optimized by dco
	
	asm( "#.dco_end" );
	.
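
A slightly fuller sketch of the same idea follows. The macro names DCO_BEGIN and DCO_END and the function scale are hypothetical, and -lu 2 is used only as an example of an option list on the #.dco_start line. The markers are emitted only by GNU-compatible compilers targeting x86; elsewhere the macros expand to nothing.

```c
#include <stddef.h>

/* Emit the dco marker comments only where they make sense: a
   GNU-compatible compiler targeting x86. Elsewhere they expand to nothing. */
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#define DCO_BEGIN(opts) __asm__ volatile("#.dco_start " opts)
#define DCO_END()       __asm__ volatile("#.dco_end")
#else
#define DCO_BEGIN(opts) ((void)0)
#define DCO_END()       ((void)0)
#endif

/* Hypothetical kernel: multiply every element of x by k. */
void scale(double *x, size_t n, double k)
{
    DCO_BEGIN("-lu 2");     /* assumed option list: unroll loops twice */
    for (size_t i = 0; i < n; i++)
        x[i] *= k;
    DCO_END();
}
```

gcc does not guarantee that the surrounding code stays between the two asm statements after optimization, so it is worth inspecting the generated .s file to confirm that the markers enclose the intended instructions.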

choosing compiler options

dco expects high-quality optimized code as its input. Therefore it is necessary to use appropriate compiler options to generate such code. To further facilitate the optimization, it is also recommended to include the -fomit-frame-pointer and -fno-optimize-sibling-calls options. The following are the compiler options we used to evaluate dco:
-S -O2 -fomit-frame-pointer -fcf-protection=none -ffast-math -march=x86-64 -m64 -mfpmath=sse -msse2 -msse3 -fno-dwarf2-cfi-asm -fno-asynchronous-unwind-tables -fno-optimize-sibling-calls -freorder-blocks-algorithm=simple

disable generation of debug information

Use the following options to disable debugging information from being generated and/or code from being prepared for debugging:
-fno-asynchronous-unwind-tables
-fno-dwarf2-cfi-asm
-fomit-frame-pointer
-fcf-protection=none

generate optimized code

It is recommended to use the following options to generate appropriate code:
-ffast-math
-march=x86-64
-m64
-mfpmath=sse
-msse2
-msse3
-fno-optimize-sibling-calls

It was observed that some gcc optimizations generate code of dubious quality and uncertain merit ( see the example below ). It is strongly recommended to disable some optimizations, particularly those that affect data flow: use -freorder-blocks-algorithm=simple or, better, -fno-reorder-blocks.
General optimizations should still be used, however: it is strongly recommended to enable compiler optimizations with the -O2 option. Avoid using -O3 or higher.
The following example demonstrates the reason for that:

( portion of the ) compiled code
if ( a[i] > 0.01 )
{
ret = c[i];
}
compiler generated code, -O2 used
	comisd	a(%rdx), %xmm1
	jnb	.L
	movsd	c(%rdx), %xmm0
.L:
    
compiler generated code, -O3 used
	movapd	a(%rax), %xmm6
	cmpltpd	%xmm6, %xmm7
	movapd	%xmm6, %xmm3
	movapd	%xmm1, %xmm6
	movapd	%xmm9, %xmm10
	cmplepd	%xmm1, %xmm3
	cmplepd	%xmm1, %xmm10
	movapd	%xmm13, %xmm14
	cmplepd	%xmm1, %xmm14
	pand	%xmm7, %xmm8
	pandn	%xmm12, %xmm7
	movdqa	%xmm7, %xmm4
	por	%xmm8, %xmm4
	andpd	%xmm3, %xmm15
	movdqa	%xmm0, %xmm12
	andnpd	c-80(%rax), %xmm3
	orpd	%xmm3, %xmm15
	andpd	%xmm10, %xmm15            
    

Although the -O3 generated code uses packed data and eliminates the conditional jump, it does not appear to be more efficient and is much more difficult to process than the -O2 generated code.

generate appropriate code

dco has certain assumptions for the code it processes ( see Stack Alignment ). Avoid using compiler options that may cause generated code not to satisfy these assumptions. Generally avoid using esoteric and not well understood compiler options, such as:
-O0
-Os
-mpreferred-stack-boundary=

Also, if necessary, disable code generation for unsupported extensions ( e.g. AVX extensions ); see this for clarification.
It was observed that in certain cases it is necessary to specify -no-pie option while linking object files created from the dco optimized code.