Optimizations Corner:
Cleaning Memory and Partial Register Stalls in Your Code

Contents

Maximize Your Application's Potential

Understanding Partial Stalls

Detecting Partial Stalls

Detecting Partial Stalls

The first stage in cleaning the partial stalls in the code is their detection. A partial stall is significant to the performance of the application only if it is executed frequently in the code. The best tool to evaluate the significance of the problem and to detect the partial stalls is VTune(TM) Performance Analyzer. Partial stalls can be detected statically by static code analysis and dynamically using event-based sampling. To sample the event, create a test example. The test example should have the following characteristics:
  • Represent the workload of your product and execute the important components of the code.
  • Be repeatable and preferably automated.
  • Run for 10 seconds or more if your code is executed almost 100% of the time. If not, create a longer test. For example, if your code is utilized only 5% of the time, run the test for 200 seconds in order to spend about 10 seconds in your code.
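The run-time rule in the last bullet is simple arithmetic: divide the time you want spent in your code by the fraction of the run that your code accounts for. A minimal sketch in C follows; the function name and parameters are illustrative and not part of any tool.

```c
/* Total test length needed so that roughly `target_seconds` are spent in the
   code of interest, given the fraction of run time that code accounts for.
   Returns a negative value if the fraction is not in (0, 1]. */
double required_test_seconds(double target_seconds, double code_fraction)
{
    if (code_fraction <= 0.0 || code_fraction > 1.0)
        return -1.0;
    return target_seconds / code_fraction;
}
```

With a 10-second target and code that accounts for 5% of the run, required_test_seconds(10.0, 0.05) gives about 200 seconds, which matches the figure in the bullet.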
Sample code:

int a;

int main(void)
{
    int i;

    for (i = 0; i < 100000000; i++)
        a = i | 0xff00;

    return 0;
}

The relevant events are partial stall events and partial stall cycles. Select these events using the menu option Configure | Options | Sampling | Processor Events for EBS. Sample for clock ticks as well. Then look at the sampling session's summary view. You can see the partial stall events and cycles using the VTune Performance Analyzer, shown in Figure 1.

Figure 1. The VTune Performance Analyzer

This example had about 88M partial stalls and as a result lost 594M cycles. The machine was a 300 MHz Pentium II processor. One processor was in idle mode. The code lost about two seconds (594M cycles at 300 MHz) of the 3.4 seconds it took to run.
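The conversion from lost cycles to lost time can be checked with a few lines of C. This is only a sketch of the arithmetic; the function names are hypothetical and the inputs are the rounded counts quoted above.

```c
/* Convert a cycle count to seconds at a given clock frequency in hertz. */
double cycles_to_seconds(double cycles, double clock_hz)
{
    return cycles / clock_hz;
}

/* Average number of cycles lost per stall event. */
double cycles_per_event(double cycles, double events)
{
    return cycles / events;
}
```

For 594M cycles at 300 MHz, cycles_to_seconds(594e6, 300e6) comes to roughly 1.98 seconds, and cycles_per_event(594e6, 88e6) to roughly 6.75 cycles per partial stall.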

If the total number of partial stall cycles is very low, recovering the lost cycles does not yield a significant speedup. If it is higher, you can zoom into the hot spots of partial stalls. The hot spot view appears in Figure 2.


Figure 2. The hot spot view.

Only one hot spot has all the partial stall events. Click on the hot spot to display the offending line of code, in this case:

a = i | 0xff00;

It is not immediately clear why this line causes a partial stall, so examine the mixed assembly/source view, shown in Figure 3.

Figure 3. The mixed assembly/source view.

The compiler elected to "optimize" the bitwise OR (|) operation and use an 8-bit register instead of a 32-bit register. Click on the instruction marked ! to get the full story with context-sensitive help.

In this case the solution is to use the Pentium Pro code generation strategy instead of the default blended strategy in the compiler code generation options. In many other cases changing the compiler flags may not solve the problem. The compiler may generate partial stalls as a result of mixing unsigned chars with integer code amongst other reasons. In most cases, by changing the code you can work around the problem.
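As an illustration of such a code change, the sketch below shows the same byte-summing loop written twice: once with an 8-bit temporary mixed into 32-bit arithmetic, and once with the temporary widened to a 32-bit value. The function names are hypothetical, and whether a particular compiler emits partial-register code for the first variant is an assumption to confirm in the mixed assembly/source view.

```c
#include <stddef.h>

/* Variant A: the loop temporary is an 8-bit type mixed into 32-bit math. */
unsigned int sum_bytes_narrow(const unsigned char *buf, size_t n)
{
    unsigned int sum = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        unsigned char c = buf[i];   /* 8-bit temporary */
        sum += c;
    }
    return sum;
}

/* Variant B: widen once on load, then work only with 32-bit values. */
unsigned int sum_bytes_wide(const unsigned char *buf, size_t n)
{
    unsigned int sum = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        unsigned int c = buf[i];    /* zero-extended to 32 bits */
        sum += c;
    }
    return sum;
}
```

Both variants return the same value; only the width of the temporary differs, which is the kind of minor source change that can steer the compiler away from 8-bit registers.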

Statically Detecting Partial and MOB Stalls

Another simple way to detect partial stalls is to open the static code analysis view, request a detailed view of the functions with code, and click on the warning column to sort the functions by warning. Click on functions with Pentium Pro warnings to jump to the relevant source code. Partial stalls are marked with PPro_Partial_Stall and MOB stalls with PPro_Mem_Stall, so search for the string PPro_ using the Find option. This helps you navigate to all the instructions indicated by the ! mark. For every warning, ensure that the code is active in this area (look for clock tick samples). Optimizing code that is executed infrequently does not help overall performance.

Dynamically Detecting MOB Stalls

Unfortunately, there is no event that counts MOB stalls. Several other events can indicate a performance problem possibly due to a MOB stall. The relevant events are:
  • Resource Related Stalls
  • Store Buffer Block
  • Micro-Ops Retired - Low parallelization

Each one of these events can have a high count as a result of several other performance issues as well as a MOB stall. The Resource Related Stalls and the low parallelization events are general events that indicate a problem. We can use the number of those events as an indication of the cost of the problem. The Resource Related Stalls event counts the number of clock cycles executed while a resource-related stall occurs. This includes stalls due to register renaming buffer entries, memory buffer entries, branch misprediction recovery, and delay in retiring mispredicted branches. If there are no performance problems, resource stalls should not be a concern. However, if the system is not running at full speed, the event may indicate one of several possible problems, including MOB stalls.

The Store Buffer Block event indicates the number of times that a load is blocked and stopped from executing. As such it can be used to count the number of MOB stalls. The reasons for this event include:

The last source of this event is the one created by a MOB stall. The problem with this event is that the first two reasons are more common, and as a result it may require a lot of work to identify the MOB stall cases.

The following example shows a simple loop with a MOB stall that executes 100M times. The C function and the assembly code for this loop are:

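char a;
int b;

int main(void)
{
    int i;

    for (i = 0; i < 100000000; i++)
    {
        a = 2;                 /* 8-bit store to a */
        b = *((int *)&a);      /* 32-bit load from the same address;
                                  deliberately reads past a to show the stall */
    }
    return 0;
}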

loop_top:   mov byte ptr _a, 2        ; 8-bit store to _a
            mov edx, dword ptr _a     ; 32-bit load from _a
            inc eax
            mov dword ptr _b, edx
            jne loop_top

A byte value is written to variable a and in the next instruction a 32-bit value is read from the same variable.
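The access pattern, and one way to avoid it, can be sketched in C with a union. This is an illustration only: the type and function names are hypothetical, it assumes a 32-bit unsigned int, and an optimizing compiler may keep the value in a register so that no store and load are emitted at all.

```c
/* One 32-bit cell whose first byte can also be accessed on its own. */
union cell {
    unsigned char byte;
    unsigned int  word;
};

/* The pattern from the loop above: an 8-bit store followed by a 32-bit
   load that overlaps it. */
unsigned int narrow_store_wide_load(union cell *c)
{
    c->byte = 2;
    return c->word;
}

/* Alternative: merge the byte into the low 8 bits in a register and store
   all 32 bits, so the load that follows matches the width of the store.
   Equivalent to the function above only on a little-endian target such
   as x86, where `byte` overlays bits 0..7 of `word`. */
unsigned int wide_store_wide_load(union cell *c)
{
    c->word = (c->word & ~0xFFu) | 2u;
    return c->word;
}
```

The second function trades one extra AND/OR for a store and a load of the same size, which is the condition under which the value can be read back without waiting.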

After selecting the code area of the loop and using the event ratio icon, you can get the total event table. In this case I got:

Event or Ratio                              Value
Clockticks                                  1104000000
Resource Related Stalls                     785232000
Store Buffer Block                          94374130
Micro-Ops Retired - Low parallelization     398280000

In this example, the Clockticks event gives about 11 cycles per iteration (1,104M clockticks over 100M iterations), and according to this event the MOB stall causes a loss of about 8 cycles. The Micro-Ops Retired - Low parallelization event indicates an average loss of about four cycles per iteration. Resource-related stalls amount to about 7.8 cycles per iteration. The Store Buffer Block event occurs about 0.94 times per iteration. These events indicate the presence of a MOB stall or another store buffer blocking problem.
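The per-iteration figures are each total count divided by the 100M loop iterations. A small C helper makes the arithmetic explicit; the function name is illustrative only.

```c
/* Average number of events (or cycles) per loop iteration. */
double per_iteration(double event_count, double iterations)
{
    return event_count / iterations;
}
```

Applied to the table, per_iteration(1104000000.0, 1e8) is about 11.04, per_iteration(785232000.0, 1e8) about 7.85, per_iteration(94374130.0, 1e8) about 0.94, and per_iteration(398280000.0, 1e8) about 3.98.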

In this example, no pending cache miss writes were waiting in the store buffer. If there was a MOB stall while waiting to empty the write buffer, the penalty could have been tens of cycles or more.

Conclusion

You can improve the performance of your application by detecting and fixing microarchitectural bottlenecks (also known as "glass jaws") such as partial and MOB stalls. Compiler performance problems or faulty assembly coding can cause partial stalls. When the compiler is at fault, using the Pentium Pro code generation strategy instead of the default blended strategy may fix the problem. Otherwise, minor code modifications can help, such as using 32-bit values instead of 8-bit values for the problematic variable. In assembly code, algorithmic reasons such as packing several colors in one pixel register may cause a partial stall. In those cases, use shifts rather than trying to utilize the 8-bit parts of a register.

For example, examine this code:

RGB MACRO red, green, blue
XOR EAX,EAX
MOV AH,blue
SHL EAX,8
MOV AH,green
MOV AL,red
ENDM

Two partial stalls are created; one during the SHL instruction, and the other when using the EAX register as the result of the macro. Change the code to prevent partial stalls as follows:

RGB MACRO red, green, blue
XOR EAX,EAX
MOV EAX,blue
SHL EAX,8
OR EAX,green
SHL EAX,8
OR EAX,red
ENDM
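The same shift-and-OR packing can be written in C. The sketch below mirrors the second macro, with blue in bits 16 to 23, green in bits 8 to 15, and red in bits 0 to 7; the function name is hypothetical.

```c
#include <stdint.h>

/* Pack three 8-bit color components into one 32-bit value using only
   full-width shifts and ORs, never writing part of a register. */
uint32_t pack_rgb(uint8_t red, uint8_t green, uint8_t blue)
{
    return ((uint32_t)blue << 16) | ((uint32_t)green << 8) | (uint32_t)red;
}
```

For example, pack_rgb(0x11, 0x22, 0x33) yields 0x00332211, the same layout the macro leaves in EAX.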

MOB stalls can result from algorithmic needs, such as packing elements together for Streaming SIMD Extensions or MMX instructions. In these cases, try to pack all the elements of the array before using the results. If the distance between writing the value and reading it again is long, we can assume that the value is already in memory. Another source of MOB stalls can be a compiler performance bug. For example, in most compilers the code to convert an unsigned 32-bit value to floating-point looks like this:

MOV dword ptr,0
FILD qword ptr

If you know that the most significant bit of your value is zero, cast the value to int before the conversion. The compiler will use the dword version of FILD instead of the qword version, thus shortening the code and preventing the MOB stall.
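In C, the cast might look like the following sketch. The function name is hypothetical, and the range check is added here only so the function stays correct for every input; if the value is known to be small, the cast alone is enough. Which instruction the compiler actually emits depends on the compiler and target, so treat the comments as assumptions to verify in the generated code.

```c
#include <limits.h>

/* Convert an unsigned value to double, going through int when the most
   significant bit is clear so the compiler can use a signed 32-bit
   integer load for the conversion. */
double unsigned_to_double(unsigned int value)
{
    if (value > (unsigned int)INT_MAX)
        return (double)value;       /* general unsigned conversion */
    return (double)(int)value;      /* signed conversion, same result */
}
```

Both paths return the same number for values up to INT_MAX; only the generated conversion code is expected to differ.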

This article describes ways to detect and eliminate partial stalls. Applying these techniques is not difficult. In most cases, the gain in speed is not dramatic, but the return on investment is relatively high. Future articles in this column will address similar instances. For more information about specific microarchitectural optimizations, see http://developer.intel.com/vtune/cbts/cbts.htm

Haim Barad has a Ph.D. in Electrical Engineering (1987) from the University of Southern California. His areas of concentration are in 3D graphics, video and image processing. Haim was on the Electrical Engineering faculty at Tulane University before joining Intel in 1995. Haim is a staff engineer and currently leads the Media Team at Intel's Israel Design Center (IDC) in Haifa, Israel.
