Diving Deeper: SSE4 Performance

One of Penryn's real strengths is its support for SSE4, which has the potential to provide a tremendous performance advantage for some time to come. Unfortunately, as is usually the case with new instructions, it's going to take a while for applications to actually use them. For now, the only SSE4 benchmarks we're able to bring you come directly from Intel, but thankfully they reflect real-world usage models. We've shown you both tests before, during Intel's own sanctioned Penryn previews, and both involve some form of encoding.

The most important test is a DivX encode using VirtualDub 1.7.6 and DivX 6.7. SSE4 comes into play if you enable a new full search algorithm for motion estimation, which is accelerated by two SSE4 instructions: MPSADBW and PHMINPOSUW. Motion estimation (working out where each block of pixels has moved between frames of video) requires computing a lot of sums of absolute differences, as well as finding the minimum among the results of those computations. The SSE2 instruction PSADBW takes a pair of 16-byte vectors of unsigned 8-bit integers and produces two sums of absolute differences, one per 8-byte half; the SSE4 instruction MPSADBW produces eight, each over a group of four bytes.
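To make the instructions concrete, here is a minimal scalar sketch in Python of what PSADBW, MPSADBW and PHMINPOSUW compute. It models results only, not speed. The function names and the `window_offset` / `block_offset` parameters are illustrative stand-ins for the instruction's immediate operand; they are not part of DivX or of any Intel library.

```python
# Scalar reference models (assumed for illustration, not Intel's code).
# Inputs are plain lists of unsigned 8-bit or 16-bit values.

def psadbw(a, b):
    """SSE2 PSADBW on two 16-byte vectors: one SAD per 8-byte half."""
    assert len(a) == len(b) == 16
    diffs = [abs(x - y) for x, y in zip(a, b)]
    return [sum(diffs[0:8]), sum(diffs[8:16])]


def mpsadbw(window, block, window_offset=0, block_offset=0):
    """SSE4.1 MPSADBW: eight SADs of one 4-byte group from `block`
    against eight overlapping 4-byte groups taken from `window`."""
    assert len(window) == len(block) == 16
    assert window_offset in (0, 4)           # one bit of the immediate
    assert block_offset in (0, 4, 8, 12)     # two bits of the immediate
    group = block[block_offset:block_offset + 4]
    sums = []
    for j in range(8):                       # slide one byte at a time
        start = window_offset + j
        candidate = window[start:start + 4]
        sums.append(sum(abs(x - y) for x, y in zip(candidate, group)))
    return sums


def phminposuw(words):
    """SSE4.1 PHMINPOSUW: minimum of eight unsigned 16-bit values and
    the index of its first occurrence."""
    assert len(words) == 8
    lowest = min(words)
    return lowest, words.index(lowest)


if __name__ == "__main__":
    window = list(range(16))
    block = [2, 3, 4, 5] + [0] * 12
    sads = mpsadbw(window, block)
    print(sads, phminposuw(sads))
```

The pairing is what matters for motion estimation: one MPSADBW yields eight candidate costs, and one PHMINPOSUW picks the cheapest candidate and its position.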

According to Intel's own research on motion estimation with SSE4, the same search algorithm can take 71 cycles per 16x16 pixel block using the SSE2 SAD (sum of absolute differences) instruction, compared to only 26 cycles using the SSE4 version. That reduction in cycles results in an obvious performance increase.
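For readers unfamiliar with full search, the sketch below shows the general idea in Python: try every offset inside a search window and keep the one with the lowest SAD for a 16x16 block. This is a generic textbook form, assumed purely for illustration. It is not DivX's actual implementation, and `sad_block`, `full_search` and `radius` are hypothetical names.

```python
def sad_block(cur, ref, cx, cy, rx, ry, size=16):
    """Sum of absolute differences between a size x size block of `cur`
    at (cx, cy) and a block of `ref` at (rx, ry). Frames are lists of rows."""
    total = 0
    for y in range(size):
        for x in range(size):
            total += abs(cur[cy + y][cx + x] - ref[ry + y][rx + x])
    return total


def full_search(cur, ref, cx, cy, radius, size=16):
    """Exhaustively test every (dx, dy) with |dx|, |dy| <= radius and
    return (cost, dx, dy) for the lowest-cost match found first."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            rx, ry = cx + dx, cy + dy
            # Skip candidates that would fall outside the reference frame.
            if rx < 0 or ry < 0 or rx + size > width or ry + size > height:
                continue
            cost = sad_block(cur, ref, cx, cy, rx, ry, size)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best
```

Every candidate offset costs a full block SAD, and the winner is a minimum over all of them, which is why faster SAD and minimum-finding instructions pay off in this kind of search.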

We used VirtualDub 1.7.6 and DivX 6.7 with SSE4 Full Search enabled to measure the impact of this motion estimation optimization. Note that the full search is more accurate than the default DivX setting, so both the SSE4 and SSE2 versions of the algorithm are slower (but produce better quality) than an encode with full search disabled.

CPU                                    SSE2 Search     SSE4 Search
Intel Core 2 Extreme QX9650 (3.0GHz)   21.9 seconds    15.1 seconds
Intel Core 2 Extreme QX6850 (3.0GHz)   35.2 seconds    N/A

On our QX9650, the full search with SSE4 enabled runs about 45% faster than with SSE2 only - impressive! Note also that the Penryn QX9650 offers better SSE2 performance in this test as well, coming in about 61% faster than the QX6850. The total performance increase from QX6850 SSE2 to QX9650 SSE4 in this test is an incredible 133%. Obviously, this is not going to be the norm in many other applications, but there's definitely some potential for meaningful optimizations in certain applications.
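The percentages above follow directly from the encode times. As a quick check, this short Python sketch recomputes them, treating "X% faster" as the ratio of the slower time to the faster time, minus one (the helper name `percent_faster` is ours):

```python
def percent_faster(slow_seconds, fast_seconds):
    """How much faster the quicker run is, as a percentage."""
    return (slow_seconds / fast_seconds - 1.0) * 100.0


qx9650_sse2 = 21.9  # seconds, from the table above
qx9650_sse4 = 15.1
qx6850_sse2 = 35.2

print(round(percent_faster(qx9650_sse2, qx9650_sse4)))  # 45: SSE4 vs SSE2 on QX9650
print(round(percent_faster(qx6850_sse2, qx9650_sse2)))  # 61: QX9650 vs QX6850, both SSE2
print(round(percent_faster(qx6850_sse2, qx9650_sse4)))  # 133: QX6850 SSE2 to QX9650 SSE4
```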

It's important to note that the PHMINPOSUW instruction doesn't appear to be in AMD's proposed SSE5 specification, although MPSADBW looks like it'll make it. AMD will eventually add full SSE4 support to its processors, but not until the 2009/2010 time frame from what we've heard.

Our second benchmark from Intel is an MPEG-2 encode of an HD video using TMPGEnc 4.0.

CPU                                    TMPGEnc 4.0
Intel Core 2 Extreme QX9650 (3.0GHz)   103 seconds
Intel Core 2 Extreme QX6850 (3.0GHz)   135 seconds

The performance difference is a little less significant here, with the SSE4-less QX6850 taking about 31% more time to encode the input file than the QX9650.
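The same gap can be stated from either chip's point of view, and the two numbers differ. A small Python sketch (variable names are ours) makes the distinction explicit:

```python
qx9650_seconds = 103  # TMPGEnc 4.0 encode times from the table above
qx6850_seconds = 135

# Extra time the QX6850 needs, relative to the QX9650.
more_time = (qx6850_seconds / qx9650_seconds - 1.0) * 100.0

# Time the QX9650 saves, relative to the QX6850.
less_time = (1.0 - qx9650_seconds / qx6850_seconds) * 100.0

print(round(more_time))  # 31
print(round(less_time))  # 24
```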

Both of these are very real-world implementations of SSE4; unfortunately, it's tough to say how long it will be before we see widespread use of the new instructions.

16 Comments

  • Canadian87 - Monday, October 29, 2007 - link

    I'd like to point out that someone must have been tired when writing this. The graphs here on page 4 say "QX6950" VS "QX6850", a simple reversal of the numbers, but I'd like to correct it for those that might be confused; it took me a moment to figure out which was which myself. The "QX6950" is meant to be the "QX9650", and obviously the "QX6850" is the correct naming.

    GL HF.
  • GlassHouse69 - Monday, October 29, 2007 - link

    ew.

    intel again ftw. blech. They made a great chip. power usage is fantastic. One could get even lower total wattages (by far) if they concentrated on doing so. a quad core that can be cooled near silently. neat :)

  • sprockkets - Monday, October 29, 2007 - link

    Just a question, what was the difference from Core to Core 2? All I could ever find was that the cache size was increased.

    Now that I'm thinking about it, why not the name Quadro? Oh, nVidia ownz it.
  • defter - Monday, October 29, 2007 - link

    Core Duo (Yonah) was based on Pentium M.

    Core2 (Conroe) is a new architecture.
  • sprockkets - Monday, October 29, 2007 - link

    actually i found a comparison page about it, and core 2 isn't that much different from core. Yes, it updated a lot and gave improved performance. No, it is not a completely new architecture from PM, but you can say a big difference from the P4.

    http://www.anandtech.com/showdoc.aspx?i=2808&p...
  • sprockkets - Monday, October 29, 2007 - link

    On page 9 I believe you are grabbing some old benchmarks, old in the sense of your previous articles. I believe I pointed this out to you as a mistake, and now it is here in the bar graph. Again, how is it that the 2.33ghz C2D outperforms the 3ghz one?
