Contents

The Pentium III Processor and SIMD Extensions

Motion Compensation Using Streaming SIMD Extensions

Special Memory Instructions

Motion Compensation Using Streaming SIMD Extensions

The next example shows the same implementation using Streaming SIMD Extensions. Because the 'pavgb' instruction averages packed unsigned bytes directly, there is no need to convert the pixels to a wider format first.

 

Motion_Comp_Loop:

Movq mm0,// read eight pixels from one block.
Movq mm2,// next eight pixels.

Movq mm1,// read eight pixels from second block.
Movq mm3,// next eight pixels.

Pavgb mm0,mm1 // calculate the average values.
Pavgb mm2,mm3 // calculate the average values of the next eight pixels.

Movq ,mm0 // store results.
Movq ,mm2

// Increment pointer to the next line.
Jmp back while not end of macro block

Example 2. Motion Compensation Using Streaming SIMD Extensions
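To make the per-pixel arithmetic of Example 2 concrete, here is a minimal scalar sketch in C of what each byte lane of 'pavgb' computes: the average of two unsigned bytes, rounded up. The names avg_u8 and motion_comp_avg16, and the stride parameter, are hypothetical helpers introduced only for illustration; they are not part of the original listing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of one PAVGB lane: rounded average of two unsigned bytes.
   The wider intermediate sum is handled here explicitly; the instruction
   does this internally, so no unpacking is needed in the SIMD code. */
static uint8_t avg_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)(((unsigned)a + (unsigned)b + 1u) >> 1);
}

/* Hypothetical helper: average two 16-pixel-wide blocks row by row.
   stride is the distance in bytes between consecutive rows. */
static void motion_comp_avg16(uint8_t *dst, const uint8_t *a, const uint8_t *b,
                              size_t stride, size_t rows)
{
    assert(dst != NULL && a != NULL && b != NULL);
    for (size_t y = 0; y < rows; ++y) {
        for (size_t x = 0; x < 16; ++x) {
            dst[y * stride + x] = avg_u8(a[y * stride + x], b[y * stride + x]);
        }
    }
}
```

Each pass of the inner loop over 16 pixels corresponds to the two 'pavgb' instructions in the listing, which handle eight pixels each.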

Another basic operation in motion estimation (ME) algorithms is "block matching": taking two blocks and calculating the energy of the difference block.

The following example shows the basic code for block matching using MMX technology:

Motion_Est_Loop:

Movq mm1,// read 8 pixels of ref block.
Movq mm3,// read next 8 pixels of ref block.
Movq mm0,// read 8 pixels of predicted block.
Movq mm2,// read next 8 pixels of predicted block.

Movq mm4,mm0
Psubusb mm0,mm1 // difference between pixels of two blocks.
Psubusb mm1,mm4 // difference other way.
Por mm0,mm1 // absolute difference of pixels in mm0.

Movq mm4,mm2
Psubusb mm2,mm3 // difference between pixels of two blocks.
Psubusb mm3,mm4 // difference other way.
Por mm2,mm3 // absolute difference of pixels in mm2.

// Calculation of the sum of absolute differences.
Movq mm1,mm0
Punpcklbw mm0,mm6 // mm6 was initialized to be zero.
Punpckhbw mm1,mm6 // converts 8 bytes in one mm to 8 shorts.

Movq mm3,mm2
Punpcklbw mm2,mm6
Punpckhbw mm3,mm6 // converts 8 bytes in one mm to 8 shorts.

Paddusw mm0,mm1 // summing them up.
Paddusw mm0,mm2
Paddusw mm0,mm3

IF there_is_threshold
// Need to calculate final sum.
Pmaddwd mm0,MASK64 // mult every word by 1 and sum it up to dwords.
Movq mm1,mm0
Psrlq mm1,32
Paddd mm0,mm1 // final sum of differences.
Paddd mm7,mm0 // mm7 holds 1 total sum of differences.

Movd esi,mm0
Cmp esi,Threshold_Energy
Jge Fast_Out
ELSE
Paddusw mm7,mm0 // mm7 holds 4 partial sums of differences.
END
// Increment pointer to the next line.
Jmp back while not end of macro block

IF not there_is_threshold
// Final sum.
Pmaddwd mm7,MASK64 // mult every word by 1 and sum it up to dwords.
Movq mm1,mm7
Psrlq mm1,32
Paddd mm7,mm1 // final sum of differences.
END

Fast_Out:

Example 3. Block Matching Using MMX Technology
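The absolute-difference step in Example 3 relies on a small trick: two unsigned saturating subtractions in opposite directions, combined with OR. One of the two results is always zero, so the OR yields |a - b|. The following scalar C sketch models that for a single byte and for a row of eight bytes; the function names are hypothetical and chosen only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of one PSUBUSB lane: unsigned subtraction clamped at zero. */
static uint8_t sub_sat_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)(a > b ? a - b : 0);
}

/* |a - b| built from two saturating subtractions and an OR (POR):
   whichever direction would go negative saturates to 0. */
static uint8_t abs_diff_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)(sub_sat_u8(a, b) | sub_sat_u8(b, a));
}

/* Hypothetical helper: the same operation across the eight byte lanes
   that one MMX register holds. */
static void abs_diff8(uint8_t out[8], const uint8_t a[8], const uint8_t b[8])
{
    assert(out != NULL && a != NULL && b != NULL);
    for (size_t i = 0; i < 8; ++i) {
        out[i] = abs_diff_u8(a[i], b[i]);
    }
}
```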

Since MMX technology does not contain a horizontal operation such as a sum of four short elements in one MMX technology register, and since the sum of the absolute differences takes more than eight bits, the implementation of the block matching algorithm must convert the differences to short (16-bit) format and perform three extra adds in each iteration. At the end of the loop, you need to sum all four difference values to produce one final result.
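The widening and the final horizontal sum can be sketched in scalar C as follows. This is only a model under stated assumptions: the words4 type and the function names are hypothetical, and the lanes are assumed to stay below 32768 so that the signed multiply-add ('pmaddwd' with a vector of ones) behaves like a plain addition. For a 16x16 block each lane accumulates at most 64 differences of at most 255, that is 16320, so the assumption holds.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the four 16-bit lanes of one MMX register. */
typedef struct { uint16_t lane[4]; } words4;

/* Scalar model of one PADDUSW lane: unsigned add clamped at 0xFFFF. */
static uint16_t add_sat_u16(uint16_t a, uint16_t b)
{
    uint32_t s = (uint32_t)a + (uint32_t)b;
    return (uint16_t)(s > 0xFFFFu ? 0xFFFFu : s);
}

/* Widen eight byte differences to words (punpcklbw / punpckhbw against
   zero) and add both halves into the four word accumulators. */
static void accumulate_row(words4 *acc, const uint8_t diff[8])
{
    assert(acc != NULL && diff != NULL);
    for (size_t i = 0; i < 4; ++i) {
        acc->lane[i] = add_sat_u16(acc->lane[i], diff[i]);      /* low half  */
        acc->lane[i] = add_sat_u16(acc->lane[i], diff[i + 4]);  /* high half */
    }
}

/* Horizontal sum: pair-wise add into two dwords (pmaddwd with ones),
   then fold the high dword onto the low one (psrlq 32 + paddd). */
static uint32_t horizontal_sum(const words4 *acc)
{
    assert(acc != NULL);
    uint32_t lo = (uint32_t)acc->lane[0] + acc->lane[1];
    uint32_t hi = (uint32_t)acc->lane[2] + acc->lane[3];
    return lo + hi;
}
```

With all eight differences at 255, a single row already sums to 2040, which is why byte-wide accumulators are not sufficient.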

Moreover, when a threshold energy is used to avoid unnecessary calculations (which is typically the case in ME algorithms), the overhead is large: with there_is_threshold=TRUE you must calculate the final sum in every iteration in order to compare it with the threshold energy. The 'psadbw' instruction enables a quick and efficient comparison at each iteration.

The following example shows the same implementation using the 'psadbw' instruction, which is specifically designed to solve these problems.

Motion_Est_Loop:

Movq mm1,// read 8 pixels of ref block.
Movq mm3,// read next 8 pixels of ref block.
Movq mm0,// read 8 pixels of ref2 block.
Movq mm2,// read next 8 pixels of ref2 block.

Psadbw mm1,mm0 // mm1 = sum of absolute difference of 8 pixels.
Psadbw mm3,mm2 // mm3 = sum of absolute difference of 8 pixels.

Paddd mm7,mm1
Paddd mm7,mm3 // 1 total sum of differences.
IF there_is_threshold
Movd esi,mm7
Cmp esi,Threshold_Energy
Jge Fast_Out
END
// Increment pointer to the next line.
Jmp back while not end of macro block

Fast_Out:

Example 4. Block Matching Using Streaming SIMD Extensions
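As a rough scalar model of Example 4, the C sketch below computes what 'psadbw' produces for eight pixels (the sum of their absolute differences as a single integer) and uses it in a row loop with an early exit. The names sad8 and block_sad16, the stride parameter, and the convention of passing UINT32_MAX to disable the threshold are assumptions made for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of PSADBW: sum of absolute differences of eight
   unsigned bytes, returned as one integer (at most 8 * 255 = 2040). */
static uint32_t sad8(const uint8_t *a, const uint8_t *b)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < 8; ++i) {
        sum += (uint32_t)(a[i] > b[i] ? a[i] - b[i] : b[i] - a[i]);
    }
    return sum;
}

/* Hypothetical helper: SAD of two 16-pixel-wide blocks with early exit.
   After each row the running total is compared with threshold, like the
   Cmp/Jge pair in the listing. Pass UINT32_MAX to disable the early exit.
   Returns the total accumulated up to the point where the loop stopped. */
static uint32_t block_sad16(const uint8_t *ref, const uint8_t *cand,
                            size_t stride, size_t rows, uint32_t threshold)
{
    assert(ref != NULL && cand != NULL);
    uint32_t total = 0;
    for (size_t y = 0; y < rows; ++y) {
        total += sad8(ref + y * stride, cand + y * stride);         /* first 8 */
        total += sad8(ref + y * stride + 8, cand + y * stride + 8); /* next 8  */
        if (total >= threshold) {
            break; /* candidate is already too different: fast out */
        }
    }
    return total;
}
```

Because the per-row result is already a single scalar, the threshold check costs only one move and one compare per iteration, with no horizontal reduction step.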

Table 1 shows the performance gains possible with Streaming SIMD Extensions. The measurements assume a block size of 16x16 pixels and a hot cache.

Operation | Total Cycles | Total µops
Motion estimation with MMX technology | 308 | 432
Motion estimation with Streaming SIMD Extensions | 208 -> 48% | 272 -> 58%
Motion compensation with MMX technology | 298 | 528
Motion compensation with Streaming SIMD Extensions | 134 -> 122% | 192 -> 175%

Table 1. MMX Technology vs. Streaming SIMD Extensions Implementation for ME & MC
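The percentages in Table 1 appear to be relative gains, computed as (MMX count / Streaming SIMD Extensions count) - 1. Under that reading, which is an interpretation rather than something the table states, the short C sketch below recomputes them from the raw counts. The function name gain_percent is hypothetical.

```c
#include <assert.h>

/* Relative gain in percent when a count drops from before to after:
   (before / after - 1) * 100. */
static double gain_percent(double before, double after)
{
    assert(after > 0.0);
    return (before / after - 1.0) * 100.0;
}
```

For the raw counts in the table this gives roughly 48% (308 vs. 208 cycles), 59% (432 vs. 272 µops), 122% (298 vs. 134 cycles), and 175% (528 vs. 192 µops).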


Streaming SIMD Extensions include more instructions that can improve performance of integer-based algorithms. For MMX technology developers, these extensions can be easily integrated into previous implementations.

Special Memory Instructions