Optimizations Corner:
Cleaning Memory and Partial Register Stalls in Your Code

Contents

Maximize Your Application's Potential

Understanding Partial Stalls

Detecting Partial Stalls

Understanding Partial Stalls

The Pentium II and Pentium III microarchitectures are based on dynamic data flow analysis. The processor decodes its instructions into simpler micro-ops, called uops (internal units of work). These uops are added to the reorder buffer, and once all their resources are ready they are issued to the execution unit. A resource can be the result of another uop, a real register, or a memory value. When more than one uop generates a resource, the processor cannot tell which uop's result to use, so it must wait for the value of the architectural register. That value becomes available only when the instructions generating the resource retire. For example:
mov al,mem1
mov ah,mem2

add bx,ax
The uop of the last instruction depends on the resources BX and AX. The value in register AX is generated by two uops, so the last instruction can execute only after the first two instructions retire. To formalize the problem: a partial stall occurs when an instruction reads a large register (for example, EAX) after a previous instruction has written to one of its partial registers (for example, AL, AH, or AX). The read stalls until the write retires, even if the two instructions are not adjacent. This stall applies to every pairing of a larger register with any of its sub-components, such as EAX with AX, BX with BL, and ESI with SI. The stall does not occur if the write has already retired when the read begins execution.
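
One way to sidestep the stall in the example above is to make every write full-width. The fragment below is only a sketch and not part of the original example. It assumes that mem1 and mem2 are byte operands and that ECX is free as a scratch register, and it swaps the partial writes to AL and AH for MOVZX loads, which write all 32 bits of their destination.

```asm
; Sketch: build the same 16-bit value without writing AL or AH.
; Assumptions: mem1 and mem2 are byte operands; ECX is a free scratch register.
movzx eax, byte ptr mem1   ; EAX = mem1, zero-extended (full 32-bit write)
movzx ecx, byte ptr mem2   ; ECX = mem2, zero-extended (full 32-bit write)
shl   ecx, 8               ; move mem2 into bits 8..15
or    eax, ecx             ; AX = mem2:mem1, produced by a single full-width write
add   bx, ax               ; reads AX after a write to EAX, not the reverse
```

Unlike the original sequence, this version zeroes the upper 16 bits of EAX and uses more instructions, so it fits only where those bits are not needed.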

A partial stall also occurs in the following cases, because the processor operates on 32 bits internally even when it appears to operate on only 16 bits.

A MOV instruction writes to any partial register, and a subsequent MOVSX (move with sign-extend) or MOVZX (move with zero-extend) instruction reads from the same partial register. For example:

mov ax, 7
movsx ebx,ax

A MOV instruction writes to any partial register, and a subsequent instruction copies the contents to any segment register. For example:

mov ax, 7
mov ss, ax
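
When the value being loaded is a constant, as in the first case above, the partial write can be replaced by a full-width write. The following fragment is a sketch that assumes the upper 16 bits of EAX are not needed afterwards.

```asm
; Sketch: write the full 32-bit register, then read its lower half.
; Assumption: the upper 16 bits of EAX may be overwritten with zero.
mov   eax, 7      ; full-width write to EAX
movsx ebx, ax     ; reads AX after a write to EAX, so no partial write is pending
```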
The actual cycle loss for a partial stall depends on the number of cycles before the source instruction retires. The average cost is 7-10 cycles if the instruction uses the large register immediately after the small register is set. If more instructions execute between the pair, the performance loss is smaller. If more than 40 uops are decoded between the write to the small register and the use of the large one, the instruction that sets the register is certain to have already retired, and the penalty is avoided.

The XOR and SUB instructions can be used to clear the upper bits of a large register before an instruction writes to one of its partial registers. When the upper bits of the larger register are cleared in this way, reading it after writing to one of its partial registers does not cause a stall. Other methods of clearing the upper bits of the large register do not prevent a stall.
 
Original:

mov al, mem8
inc eax

Optimized:

xor eax, eax
mov al, mem8
inc eax

The INC instruction reads the entire EAX register, while the preceding MOV instruction writes only its lower portion, AL. This causes a partial stall. Place the XOR instruction before the write to the partial register to clear all bits in EAX and prevent the stall. If a mispredicted branch or an interrupt occurs between the XOR and the write to the small register, the partial stall is not prevented.
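
Since the text names SUB as an alternative to XOR, the same fix can be sketched with SUB. The fragment reuses the placeholder operand mem8 from the example above and is illustrative only.

```asm
; Sketch: SUB-based variant of the optimized sequence.
sub eax, eax      ; clears all 32 bits of EAX
mov al, mem8      ; partial write to AL
inc eax           ; full-width read of EAX
```

As with XOR, keeping the clearing instruction immediately before the partial write, with no branch between them, reduces the chance that a misprediction cancels the benefit.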

Understanding MOB Stalls

The MOB (memory order buffer) is a memory subsystem that acts as a reservation station and a reorder buffer. It holds suspended loads and stores, and redispatches them when the blocking condition (dependency or resource) disappears. If an instruction needs to read a larger data element after a previous instruction wrote a smaller data element to the same address, the MOB cannot be used to forward the data. As a result, the data must be loaded from memory instead of being forwarded from the buffer.

The goal of the MOB is to prevent loads (LD) from being blocked by stores (ST) in the MOB. The idea is to let a store forward its data to a load instead of blocking the load. Every load blocked by the MOB costs the Pentium II processor six to nine or more clocks. Stores are buffered in the MOB and are placed in memory in the background, using spare cycles, once the store instruction retires. All loads are checked against the earlier stores in the MOB to detect LD/ST conflicts so that memory ordering is maintained. Non-conflicting loads may pass stores.

A conflicting load (that is, a load from the same address as a previous store) may receive its data directly from the store in the MOB, or it may be blocked until the store executes to memory. Certain conditions must be met for a store to forward data to a load. In effect, this mechanism provides memory renaming. Store forwarding is a performance win. If a load is blocked (that is, a conflict is detected but the store cannot forward), there is a significant performance penalty. The MOB allows a store to forward data to a conflicting load only if the following conditions are met:

Otherwise, the MOB will block the conflicting load. To prevent this stall, use the same data sizes and address alignments when you read and write data to the same address or an overlapping range. When you must write and then read differently-sized data elements from the same area of memory, reorder the code so that the read is as far as possible after the write (tens of cycles).
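
The size-matching advice above can be sketched as follows. The fragment is illustrative only. It assumes that ESI points to a 4-byte-aligned buffer and that EAX holds the data to be stored; neither name comes from the original text.

```asm
; Sketch. Assumptions: ESI points to a 4-byte-aligned buffer; EAX holds the data.

; Pattern to avoid: a small store followed by a larger load from the same address.
mov byte ptr [esi], al      ; 8-bit store
mov ebx, dword ptr [esi]    ; 32-bit load overlapping the 8-bit store: load is blocked

; Matching pattern: the load has the same size and address as the store.
mov dword ptr [esi], eax    ; 32-bit store
mov ebx, dword ptr [esi]    ; 32-bit load of the same location: eligible for forwarding
```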

________________________________________________________

Detecting Partial Stalls