Diving Deeper: SSE4 Performance

One of Penryn's real strengths is its support for SSE4, which has the potential to provide a tremendous performance advantage for some time to come. Unfortunately, as is usually the case with new instructions, it's going to take a while for applications to actually use them. For now, the only SSE4 benchmarks we're able to bring you come directly from Intel, but thankfully they reflect real-world usage models. We've shown you both tests before, during Intel's own sanctioned Penryn previews, and both involve some sort of encoding.

The most important test is a DivX encode using VirtualDub 1.7.6 and DivX 6.7. SSE4 comes into play if you enable a new full search algorithm for motion estimation, which is accelerated by two SSE4 instructions: MPSADBW and PHMINPOSUW. Motion estimation (figuring out how blocks of pixels move from one frame of video to the next) requires computing a lot of sums of absolute differences, as well as finding the minimum values among the results of those computations. The SSE2 instruction PSADBW computes two sums of absolute differences from a pair of 16-byte operands holding packed unsigned 8-bit integers; the SSE4 instruction MPSADBW computes eight.
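To make the two instructions concrete, here is a minimal scalar sketch in Python of what they compute. It models the documented behavior of the instructions one byte at a time; it is not how DivX implements its search, and the helper names (`psadbw`, `mpsadbw`, `phminposuw`) and the sample pixel values are our own illustrative assumptions.

```python
def psadbw(a, b):
    """Scalar model of SSE2 PSADBW: two SADs, one per 8-byte half."""
    assert len(a) == len(b) == 16
    return [
        sum(abs(x - y) for x, y in zip(a[h:h + 8], b[h:h + 8]))
        for h in (0, 8)
    ]


def mpsadbw(a, b, a_off=0, b_off=0):
    """Scalar model of SSE4.1 MPSADBW: eight 4-byte SADs.

    One 4-byte block taken from b is compared against eight overlapping
    4-byte windows of a (window i starts at byte a_off + i).
    a_off and b_off stand in for the instruction's immediate operand.
    """
    assert len(a) == len(b) == 16
    assert a_off in (0, 4) and b_off in (0, 4, 8, 12)
    block = b[b_off:b_off + 4]
    return [
        sum(abs(a[a_off + i + j] - block[j]) for j in range(4))
        for i in range(8)
    ]


def phminposuw(words):
    """Scalar model of SSE4.1 PHMINPOSUW: (minimum value, index of minimum).

    On a tie, the lowest index wins.
    """
    assert len(words) == 8
    smallest = min(words)
    return smallest, words.index(smallest)


# Hypothetical data: a row of 16 pixels to search, and a 4-pixel block
# whose best match sits at offset 3 in that row.
search_row = [10, 20, 30, 40, 50, 60, 70, 80,
              90, 100, 110, 120, 130, 140, 150, 160]
block = [40, 50, 60, 70] + [0] * 12

sads = mpsadbw(search_row, block)
print(sads)              # [120, 80, 40, 0, 40, 80, 120, 160]
print(phminposuw(sads))  # (0, 3)
```

One MPSADBW scores eight candidate positions at once, and one PHMINPOSUW then picks the best of them, which is exactly the inner loop of a full search.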

According to Intel's own research on motion estimation with SSE4, the same search algorithm takes 71 cycles per 16x16 pixel block using the SSE2 SAD (sum of absolute differences) instruction, compared to only 26 cycles using the SSE4 version, or roughly 2.7x fewer cycles. That reduction results in an obvious performance increase.

We used VirtualDub 1.7.6 and DivX 6.7 with SSE4 Full Search enabled to measure the impact of this motion estimation optimization. Note that the motion estimation taking place here is more accurate than the default DivX setting, so both the SSE4 and SSE2 versions of the algorithm result in slower performance (but better quality) than with full search disabled.

CPU                                    SSE2 Search    SSE4 Search
Intel Core 2 Extreme QX9650 (3.0GHz)   21.9 seconds   15.1 seconds
Intel Core 2 Extreme QX6850 (3.0GHz)   35.2 seconds   N/A

On our QX9650, the full search with SSE4 enabled runs about 45% faster than with SSE2 only - impressive! Note also that the Penryn QX9650 offers better SSE2 performance in this test as well, coming in about 61% faster than the QX6850. The total performance increase from QX6850 SSE2 to QX9650 SSE4 in this test is an incredible 133%. Obviously, this is not going to be the norm in many other applications, but there's definitely some potential for meaningful optimizations in certain applications.
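For anyone who wants to check the percentages, here is a small Python sketch of the arithmetic, using only the encode times from the DivX results above. The helper name `percent_faster` is our own; it treats "X% faster" as the ratio of the slower time to the faster time, minus one.

```python
def percent_faster(slow_seconds, fast_seconds):
    """How much faster the quicker run is, as a percentage.

    Encode time is inversely proportional to speed, so the speedup is
    slow / fast, and "percent faster" is that ratio minus one.
    """
    return (slow_seconds / fast_seconds - 1.0) * 100.0


# Encode times in seconds, taken from the DivX results above.
qx9650_sse2 = 21.9
qx9650_sse4 = 15.1
qx6850_sse2 = 35.2

print(round(percent_faster(qx9650_sse2, qx9650_sse4)))  # 45  (SSE4 vs. SSE2 on QX9650)
print(round(percent_faster(qx6850_sse2, qx9650_sse2)))  # 61  (QX9650 vs. QX6850, both SSE2)
print(round(percent_faster(qx6850_sse2, qx9650_sse4)))  # 133 (QX9650 SSE4 vs. QX6850 SSE2)
```

Note that the gains compound by multiplication rather than addition: 1.45 x 1.61 is about 2.33, which is where the 133% figure comes from.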

It's important to note that the PHMINPOSUW instruction doesn't appear to be in AMD's proposed SSE5 specification, although MPSADBW looks like it'll make it. AMD will eventually add full SSE4 support to its processors, but not until the 2009/2010 time frame from what we've heard.

Our second benchmark from Intel is an MPEG-2 encode of an HD video using TMPGEnc 4.0.

CPU                                    TMPGEnc 4.0
Intel Core 2 Extreme QX9650 (3.0GHz)   103 seconds
Intel Core 2 Extreme QX6850 (3.0GHz)   135 seconds

The performance difference is a little less significant here, with the SSE4-less QX6850 taking about 31% more time to encode the input file than the QX9650.
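The same gap can be quoted two ways depending on which chip is the baseline, so here is a short Python sketch using the two TMPGEnc times above. The helper names are our own illustrative choices.

```python
def percent_more_time(slower_seconds, faster_seconds):
    """Extra time the slower chip needs, relative to the faster chip."""
    return (slower_seconds / faster_seconds - 1.0) * 100.0


def percent_less_time(slower_seconds, faster_seconds):
    """Time the faster chip saves, relative to the slower chip."""
    return (1.0 - faster_seconds / slower_seconds) * 100.0


# Encode times in seconds, taken from the TMPGEnc results above.
qx9650_seconds = 103.0
qx6850_seconds = 135.0

print(f"{percent_more_time(qx6850_seconds, qx9650_seconds):.1f}%")  # 31.1%
print(f"{percent_less_time(qx6850_seconds, qx9650_seconds):.1f}%")  # 23.7%
```

So the QX6850 takes about 31% more time, which is the same thing as the QX9650 finishing in about 24% less time.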

Both of these are very real-world implementations of SSE4; unfortunately, it's tough to say how long it will be before we see widespread use of the new instructions.

Comments

  • emenk - Sunday, January 20, 2008 - link

    From first page (this article): "As we saw in our original Penryn preview, Penryn's cache performance remains unchanged; latencies in our final stepping are identical to Conroe."

    From the original Penryn preview (3rd page):
    "Not only is Wolfdale's L2 cache larger, but it also happens to be slightly faster than its predecessor. Intel has shaved off a single clock cycle from Wolfdale's L2 access time; we're already off to a good start."

    Isn't this a contradiction?

    Ignore this (testing quote tags):
    [quote]Quote goes here.[/quote]
  • IntelUser2000 - Tuesday, October 30, 2007 - link

    You know that will not be true in the true Phenom comparison right Anand?? Take a look here: http://techreport.com/articles.x/8236/14

    Dual Opteron is slower than a Single Opteron, yet you still used Dual Opteron against a single Barcelona. Why?? No really, WHY?!?

    "Because of these limitations we refrained from running any comparative benchmarks to desktop Athlon 64 X2s, instead we chose to run a single quad-core Opteron in our server platform against a pair of dual-core Opterons to simulate Phenom vs. K8 on the desktop."

    You could have took games like Oblivion with Single socket Opteron to see the real advantages. This is the worst comparison, ever. And to make it worse, you put "simulated" benchmarks.
  • victory - Tuesday, October 30, 2007 - link

    Wouldn't Intel be able to take immediate advantage of the new SSE4
    instructions in a new integrated graphics chipset perhaps then
    competing with nVidia as well as beating AMD's integrated chipsets?
  • magreen - Monday, October 29, 2007 - link

    It does 4GHz easily on the stock cooler? So why don't you strap a TR ultra 120 ex on there and tell us what it can really do? Cmon Anand, stop teasing us and tell us what we really want to know!
  • AnnihilatorX - Monday, October 29, 2007 - link

    It's a shame that they delay the release date of more affordable Yorkfields to January, just missed to Christmas sales.

    I am planning to upgrade my computer and not sure whether to wait for Yorkfield or buy a Q6600.
  • idgaf13 - Monday, October 29, 2007 - link

    Intel is trying to suppress Christmas sales and have a negative influence on "other companies" earnings while relieving themselves of Old Inventory.
    45nm process is going to produce so many CPUs per wafer that prices will fall fast or inventory will rise quickly.
    With respect to the traditional cycle of product releases and price changes ,
    A January launch date allows for the longest possible time before prices begin to tumble
    typically after the trade shows in the first two quarters of the year.
    It also allows more time to perfect the production process.
    Question is do you really need to be "the first on the block" to have this CPU ?
    Or can you wait until the price falls by 50% or June/July for the best price?
    Possibly even a faster CPU by then.
  • MGSsancho - Monday, October 29, 2007 - link

    anand, could you be so kind as to point to where you got the info on the new sse4 instructions? the chart would be cool but some pdfs or something from intel would be awesome
  • jsaldate - Friday, November 9, 2007 - link

    Penryn SDK: http://softwarecommunity.intel.com/articles/eng/11...
  • Ryan Smith - Monday, October 29, 2007 - link

    From Intel's website: http://www.intel.com/technology/architecture-silic...
  • MGSsancho - Monday, October 29, 2007 - link

    thanks a lot =)
