102 Comments

  • MrSpadge - Wednesday, January 22, 2014 - link

    Thanks, such articles are really appreciated! Reply
  • Wolfpup - Wednesday, January 22, 2014 - link

    I concur :) Reply
  • ImSpartacus - Wednesday, January 22, 2014 - link

    Thirdeded. Reply
  • GrammarNietzsche - Wednesday, January 22, 2014 - link

    Is this comparison fair from a pricing perspective? This is comparing the less than $200 kaveri to the over $300 i7. Reply
  • rahulgarg - Wednesday, January 22, 2014 - link

    This is an architectural comparison. If you want to compare to a dual-core Haswell, you can just divide the per-cycle CPU numbers by 2 and multiply by frequency to get the theoretical gflops. Reply
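The scaling described above is a single multiplication: theoretical peak GFLOPS is FLOPs per cycle times the clock in GHz. A minimal sketch in Python; the function name and the sample figures (128 FLOPs/cycle for a quad-core part, a 3.0 GHz clock) are illustrative assumptions, not numbers taken from the article's table:

```python
def peak_gflops(flops_per_cycle: float, clock_ghz: float) -> float:
    """Theoretical peak throughput: FLOPs per cycle times clock in GHz."""
    return flops_per_cycle * clock_ghz


# Hypothetical quad-core figure, halved to estimate a dual-core part.
quad_core_flops_per_cycle = 128      # assumed, for illustration only
dual_core_flops_per_cycle = quad_core_flops_per_cycle / 2
assumed_clock_ghz = 3.0              # assumed, for illustration only

print(peak_gflops(dual_core_flops_per_cycle, assumed_clock_ghz))
```

This is a peak figure only; it says nothing about memory bandwidth or how close real code gets to it.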
  • Alexvrb - Thursday, January 23, 2014 - link

    You should also post results with clocks not artificially locked to base CPU and max GPU. One of the benefits of newer generations is improved turbo, for example. Not like it really matters, anyone could tell at a glance that this isn't what they are targeting with Kaveri. With that being said, it was still an interesting read.

    On a semi-related note: What I'd really like to see is Kaveri gaming results with faster memory and an overclocked GPU. I'd bet it outscales Richland.
    Reply
  • aruisdante - Wednesday, January 22, 2014 - link

    It's for HPC. In HPC applications, upfront cost is irrelevant, it's performance per Watt that matters. The longterm costs in electricity/cooling will eclipse the upfront costs of the CPU quite rapidly. And in performance per Watt Intel is literally miles ahead. Reply
  • SunLord - Wednesday, January 22, 2014 - link

    What does HPC have to do with GPU performance though? Especially when you consider the i7-4770R. If one is going to do HPC they would be using an Ivy Bridge-E processor with real GPUs. Reply
  • Klimax - Wednesday, January 22, 2014 - link

    Bit older and IIRC unconfirmed:
    http://www.cpu-world.com//news_2013/2013112001_Bro...
    Reply
  • Klimax - Wednesday, January 22, 2014 - link

    Damnit, replied to wrong comment. Reply
  • MrSpadge - Wednesday, January 22, 2014 - link

    The question the article tries to answer, theoretically, is "how capable is Kaveri for raw number crunching compared to several alternatives". And that's exactly what this could have to do with HPC.. if the numbers were better. At DP with 1/16th SP I don't think Kaveri is going anywhere in classic HPC. Could be used in special SP applications with HSA, though. Reply
  • aruisdante - Wednesday, January 22, 2014 - link

    Yes and no. There are plenty of places such as in academia where you might have computers in rack-mounts without room for dedicated GPU, but are doing HPC-like workloads.

    I agree that the use-case range is small, but that was kind of the conclusion of the article. Even with a (relatively speaking) beefy semi-discrete GPU in it, Kaveri still falls short of the performance you can get out of the Haswells with Iris Pro.
    Reply
  • MySchizoBuddy - Wednesday, January 22, 2014 - link

    you cannot use iris pro for opencl Reply
  • JarredWalton - Wednesday, January 22, 2014 - link

    Yet. Reply
  • BMNify - Wednesday, January 22, 2014 - link

    you cant ?
    http://software.intel.com/en-us/videos/acceleratin...

    http://pocl.sourceforge.net/
    Reply
  • rahulgarg - Wednesday, January 22, 2014 - link

    Iris Pro has an OpenCL driver for Windows. Reply
  • nafhan - Wednesday, January 22, 2014 - link

    1. Intel's top iGPU vs. AMD's - seems reasonable to me.
    2. As you've shown, people who actually buy these things are aware of the price disparity.
    Reply
  • BMNify - Wednesday, January 22, 2014 - link

    Of course, apart from this being an architectural comparison, these Kaveri parts are the best AMD officially makes for the desktop, as are the i7s from Intel. You simply can't buy anything better from AMD for the desktop; if they are not making what you, the end consumer, want to buy, then no sale = no profit for them this time around. Reply
  • Death666Angel - Wednesday, January 22, 2014 - link

    Kaveri are still only the best AMD APUs. A 2 module APU does not translate to the best AMD has to offer for the desktop, FX processors are still much better if you are going to get a dedicated GPU. You wouldn't call the i7 4770K the best Intel has to offer, would you? There is a whole range of 2011 socket CPUs. Reply
  • nathanddrews - Wednesday, January 22, 2014 - link

    Is there any indication that GT3e will trickle down to non-H/R Haswell CPUs? Or that Broadwell will expand GT3e technology (in some form) to the rest of the Intel lineup? As of right now, GT3e may as well be vaporware as you can only get it in expensive, limited configurations. Kaveri spans the whole AMD product line at significantly lower cost, but then gets its butt kicked by similarly-priced Intel+dGPU setups. I would really like it if GT3e gets a lot cheaper and more widespread while Kaveri gets a lot more potent.

    Wake me up in two years.
    Reply
  • tipoo - Wednesday, January 22, 2014 - link

    I sure hope it's more common with Broadwell. GT3E is a decent performer, I do wish it would make its way to 13" laptops. Reply
  • Klimax - Wednesday, January 22, 2014 - link

    Bit older and IIRC unconfirmed:
    http://www.cpu-world.com//news_2013/2013112001_Bro...
    Reply
  • lefty2 - Wednesday, January 22, 2014 - link

    Indeed. The 4770R is only available to OEMs and more or less unobtainable. Even if you could get your hands on one, you wouldn't want to. Firstly, it comes with a huge price tag; secondly, you lose 2MB of cache, which effectively makes it a Core i5. Reply
  • SunLord - Wednesday, January 22, 2014 - link

    It's OEM only because it's totally worthless any other way as it's a BGA only part :( Reply
  • Shadowmaster625 - Wednesday, January 22, 2014 - link

    So that huge GPU in Kaveri can't even outperform an Ivy Bridge, let alone a Haswell, in terms of fp64/cycle. Why/how is AMD still in business? Reply
  • jabber - Wednesday, January 22, 2014 - link

    Oh I can imagine that's always the first question asked when anyone walks into a Best Buy etc. to buy a new PC. I know it keeps me awake at night. Meanwhile back in the real world... Reply
  • nathanddrews - Wednesday, January 22, 2014 - link

    The key to AMD's success with Kaveri will come on budget mobile notebooks and SFF, where the lack of a dGPU would heavily tilt the gaming advantage to AMD. While Intel HD4000/4600 can game pretty well at 768p, Kaveri would steamroll it and be competent up to 900p... assuming Broadwell IGP doesn't greatly improve. Reply
  • YuLeven - Wednesday, January 22, 2014 - link

    I'm not so sure just yet. I'm hoping for a strong Kaveri on laptops, but past experience with Llano, Trinity and Richland showed the clear desktop win for AMD APUs quickly eroding on portables due to power constraints.

    This year, the gap is much smaller, with strong contenders such as HD 5000 and HD 5100 in many laptops. I'm not entirely sure Kaveri's upper hand in graphics performance will be large enough to justify the considerable loss of CPU performance and battery life (assuming that Kaveri will perform as poorly against Haswell as its older brothers did). And then, Kaveri mobile will come just months before Broadwell, which is said to improve GPUs by quite a bit.
    Reply
  • PEJUman - Wednesday, January 22, 2014 - link

    I have an A6-1450 11.6" laptop that is supposed to be a 9W 'SoC' with a 30Wh battery.
    I also have an i3 Ivy Bridge 11.6" tablet that is supposed to be a 17/14W 'SoC' with a 54Wh battery.

    I expected the A6 to get 60-70% of the i3's battery life based on a light-load usage pattern, but got only about 40%.
    A rough calculation puts average power consumption at around 10W on the A6
    and around 5W on the i3.

    Considering I only paid $280 for the A6 and $450 for the i3, I am still quite happy with it.
    but can't help but wonder if AMD's SDP/TDP is very different compared to intel's.

    To my understanding TDP means the max amount of heat you need to dissipate to keep everything running smoothly. Based on that understanding and the 15W power range, you can let the CPU/APU run hotter (thus rejecting 2-3W worth of heat into surrounding pieces: package, motherboard, case, etc) with the same heatsink TDP.
    Reply
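The estimate in the comment above rests on one relation: battery energy (Wh) divided by average power (W) gives runtime in hours, so average draw comes out in watts. A small sketch in Python, using the rounded figures quoted in the comment; the function names are made up for illustration:

```python
def runtime_hours(capacity_wh: float, avg_power_w: float) -> float:
    """Battery life in hours: energy (Wh) divided by average power (W)."""
    return capacity_wh / avg_power_w


def average_power_w(capacity_wh: float, runtime_h: float) -> float:
    """Average draw in watts: energy (Wh) divided by runtime (h)."""
    return capacity_wh / runtime_h


# Figures quoted in the comment above: 30 Wh at ~10 W vs. 54 Wh at ~5 W.
a6_hours = runtime_hours(30, 10)
i3_hours = runtime_hours(54, 5)
print(f"A6: {a6_hours:.1f} h, i3: {i3_hours:.1f} h, ratio: {a6_hours / i3_hours:.0%}")
```

With these rounded inputs the ratio lands somewhat below the roughly 40% reported, which is the kind of gap to expect from a back-of-the-envelope estimate.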
  • Death666Angel - Wednesday, January 22, 2014 - link

    Can't really draw any conclusions from that. The SoC/APU/CPU usually accounts for a very small share of the energy draw in modern laptops/tablets. The display accounts for most of the power usage, and even a small difference in brightness, or indeed in panel manufacturer, can easily account for your scenario. Reply
  • TheinsanegamerN - Wednesday, January 22, 2014 - link

    I've noticed the same thing. Gaming-wise, the desktop A10 Trinity creamed Ivy Bridge. On mobile, though, the performance difference was only 18% in favor of AMD. With Haswell, Intel hits the same performance as mobile Richland A10s in games, and gets better battery life to boot.
    On the other hand, the performance of the 45 watt A8-7600 makes me hopeful that AMD will give us another 45 watt mobile Fusion APU that would be as fast as the desktop version.
    Reply
  • YuLeven - Thursday, January 23, 2014 - link

    I'm dreaming of that too. It would be a shame if mobile Kaveri took the same huge performance hit that its older brothers saw when moving from desktop to mobile.

    If history repeats itself, I think Broadwell will hit Kaveri-M very hard, relegating it to the same shady spot on poor budget designs that Llano, Trinity and Richland were in. I would love to see an AMD APU performing strongly in a good laptop. If Kaveri-M ever threatens Broadwell, at least for gaming-focused folk, it would have the healthy effect that competition has on Intel: lower prices, better parts.
    Reply
  • toyotabedzrock - Friday, January 24, 2014 - link

    If you want to know what is in store for Broadwell you have to watch the Linux kernel mailing lists or read a certain site that watches the video driver commits like a hawk.

    Intel has already been adding support for the Broadwell GPU for some time. For Linux 3.14 they started adding the framework for a new CPU feature in Skylake, as well as Broadwell audio support.

    http://www.phoronix.com/scan.php?page=home
    Reply
  • Bob Todd - Wednesday, January 22, 2014 - link

    Hopefully, but that requires design wins which they have been sorely lacking compared to Intel. And AMD seems practically non-existent in the SFF space. Where is their NUC? Hell, where are their mITX boards? Newegg shows a whopping 3 FM2+ mITX boards and 2 FM2 boards. Intel has 24 just for Haswell, and another 19 for Sandy/Ivy. Reply
  • npz - Wednesday, January 22, 2014 - link

    gaming = fp32, 3D rendering = fp32, multimedia = fp32, CAD = fp32, good-enough precision scientific and engineering = fp32

    This includes SmallLuxGPU, Blender, Sony Vegas, HitFilm, etc

    In any case, the fp64 performance is always artificially capped for consumer GPUs. If they really wanted to, they could just uncap it. Of course they'd be shooting themselves with no reason for pros to buy the much more expensive workstation/compute cards.
    Reply
  • BMNify - Wednesday, January 22, 2014 - link

    90% multimedia = int32, int16,int8 actually Reply
  • wumpus - Saturday, February 08, 2014 - link

    good-enough precision scientific and engineering = fp32

    Be careful there. 32 bits are more than enough for any straightforward calculation. Any calculation that requires multiple iterations or multiple points (especially anything nonlinear or based on boundary conditions) is going to fail badly with 32 bits.

    Double is needed due to accumulated rounding errors. It has nothing to do with significant figures (just how often do you have the 7 or so [decimal] figures a float can have). Try running an audio sample through a 64k point FFT to get a good idea what can happen if you need proof.

    As far as capping for consumer use, I have to wonder about that. Obviously, the 780 is capped (although how useful a Titan is for calculations that require double without the ECC of the even more expensive variety is questionable), but I have to wonder since using some of the weaker GPUs wouldn't be cost effective considering the entire cost of the motherboard slot (motherboard, RAM, CPU, power supply, some sort of boot disk...). I wouldn't be at all surprised if they are cheating a bit on the rounding of the float. Like the double, full IEEE754 isn't remotely useful to consumers (this might barely change with more HSA apps). One of the more painful parts of 754 is that the last bit of a multiply has to be rounded from the entire 106 bits of significand you get when you multiply two 53-bit significands together. Wimping out on float and doing doubles by the book (almost everyone who cares about rounding uses double) could easily make double 1/12 of float.
    Reply
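The point about accumulated rounding error can be seen with a simple repeated sum. The sketch below emulates fp32 in pure Python by rounding through `struct` after every addition; that emulation, the helper names, and the choice of adding 0.1 a million times are assumptions made for illustration, and it says nothing about how any particular GPU rounds:

```python
import struct


def to_f32(x: float) -> float:
    """Round a Python float (fp64) to the nearest fp32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def sum_repeated_f32(value: float, count: int) -> float:
    """Add `value` to an accumulator `count` times, rounding to fp32 each step."""
    step = to_f32(value)
    acc = 0.0
    for _ in range(count):
        acc = to_f32(acc + step)
    return acc


def sum_repeated_f64(value: float, count: int) -> float:
    """Same loop in plain fp64 (Python's native float)."""
    acc = 0.0
    for _ in range(count):
        acc += value
    return acc


n = 1_000_000
f32_total = sum_repeated_f32(0.1, n)
f64_total = sum_repeated_f64(0.1, n)
print(f"fp32: {f32_total!r}  error: {abs(f32_total - 100_000):.3g}")
print(f"fp64: {f64_total!r}  error: {abs(f64_total - 100_000):.3g}")
```

Each individual fp32 addition is accurate to about seven digits, yet the running total drifts by hundreds, because once the accumulator is large the small addend is rounded to a coarse grid on every step. The fp64 loop has the same flaw, but its error stays far below anything visible here.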
  • sanaris - Saturday, March 01, 2014 - link

    Dear noob. Do not mix yourself with scientific community cause you do not belong to it.

    Scientific community does not think it needs any float numbers at all.
    It should be at least doubles, but preferably quad numbers.
    Reply
  • sanaris - Saturday, March 01, 2014 - link

    That was obviously aimed at the post above. 32-bit precision is completely useless in any computation. The most useful is 128-bit. Reply
  • MrSpadge - Wednesday, January 22, 2014 - link

    You know with DP at 1/16th SP they're not even trying. They could easily go up to 1/4th, though. Reply
  • wumpus - Saturday, February 08, 2014 - link

    Maybe, maybe not. I suspect they aren't trying, but I wouldn't write any code that expected strict IEEE754 rounding in single (crypto, perhaps). Strict rounding needs close to 4 times the multiplies that you would need for an unrounded multiply, so they could be wimping out there.

    Personally, I'd rather have more floats that are off by a bit than strict 754 rounding on my floats, but can't see doing it as long as there are claims of "IEEE754" compatibility. Violating 754 has a *long* history (there have been plenty of -754strict compiler flags that kill performance), and there are plenty of ways to weasel a datasheet, but violating a spec is something an engineer *does* *not* *do*. When a careful engineer sees something like this, he won't go near the edge conditions (and rounding and the other 754 nastiness is about as edge condition as you can get).
    Reply
  • KenLuskin - Saturday, January 25, 2014 - link

    Kaveri was NOT created to be high priced chip.

    Kaveri was NOT even created to be a desktop chip.

    Kaveri was designed for laptops, but with enough GPU to run AAA games.

    Kaveri is designed to be AFFORDABLE for very low priced laptops.

    Most people do NOT need any more speed out of their CPU.

    But, they would like the ability to run AAA games.

    Kaveri 45 Watt for $120 blows away an i3 at $130 in graphics!

    And that is without MANTLE!
    Reply
  • sanaris - Saturday, March 01, 2014 - link

    No laptop user needs direct access from shader units to caches.
    Any real laptop user will buy Intel with an NVidia discrete card, because they are supported on Linux.
    Reply
  • toyotabedzrock - Wednesday, January 22, 2014 - link

    I noticed both companies' GPUs are slower than the CPU for fp64. Reply
  • tipoo - Thursday, January 23, 2014 - link

    Probably because DP floating point calculations are crippled, so those who need it buy the full FirePro or Quadro parts. Reply
  • sanaris - Saturday, March 01, 2014 - link

    I have a card with "unbound" DP performance. It is a complete brick. It says it should get 400 gigaflops, but in reality it does Prime95 at about 24 msec/iter, while a 110W Opteron chip does it twice as fast, at about 12 msec.

    All AMD GPU efforts are turned into bricks cause they fail to test their designs with real software.

    Very bad AMD does not move into 32 core opteron chips cause that is what I need now.
    Reply
  • HalloweenJack - Wednesday, January 22, 2014 - link

    Iris Pro - FCBGA ONLY and OEM.... so why are you testing it anyway??? Reply
  • MrSpadge - Wednesday, January 22, 2014 - link

    Because it exists. Don't shout at AT for listing it but rather at Intel for not giving it to us. Reply
  • HalloweenJack - Wednesday, January 22, 2014 - link

    no , its pointless from AT for testing it - might as well get a 16 core G34 opteron for multithread as an `oooh shiny`... Reply
  • DigitalFreak - Wednesday, January 22, 2014 - link

    Why are you AMD fanboys always so butthurt? Reply
  • wumpus - Saturday, February 08, 2014 - link

    To show how badly 8 floating point cores running at half speed will do? I think we know that already. Reply
  • BMNify - Wednesday, January 22, 2014 - link

    you bitching because you cant afford it so dont want it tested or what ?

    as already said it exists https://www.system76.com/laptops/model/galu1
    its been benched for both windows and linux os http://www.phoronix.com/scan.php?page=article&...

    by all accounts its ok, i wish Intel would OC put it's followup on their mainstream mid/high i somethings and also improve its data throughput compared to that above linked test, we shall see when it arrives or not...
    Reply
  • michael2k - Wednesday, January 22, 2014 - link

    Can you compare Kaveri to the Bay Trail parts? The J2850 is a quad core part, though only 2.41GHz. It appears that the BT parts might be more congruent, if weaker GPU wise, in terms of CPU perf:
    http://hothardware.com/Reviews/Betting-On-Bay-Trai...
    Reply
  • rahulgarg - Wednesday, January 22, 2014 - link

    That review has major errors. The AMD APU they are testing (A4-5000) is not Kaveri at all, even though they keep calling it Kaveri. The A4-5000 is actually the low-end Kabini. Kaveri is MUCH faster than Bay Trail. Reply
  • BMNify - Wednesday, January 22, 2014 - link

    Of course, Bay Trail, even the quad-core, has been crippled: it does NOT have AVX/AVX2 SIMD, only SSE4 (SSE4.1 + SSE4.2, Streaming SIMD Extensions 4) at best. Reply
  • ash9 - Wednesday, January 22, 2014 - link

    Turning off turboboost may not compare equally if Kaveri's turbo core attributes 100% towards its productivity. Reply
  • MrSpadge - Wednesday, January 22, 2014 - link

    The problem with Turbo is that you can't be sure about which frequency will be achieved. So on what shall the calculations be based? The base clock is guaranteed, and scaling the result for that number up for higher clocks is trivial. Reply
  • Death666Angel - Wednesday, January 22, 2014 - link

    Is it guaranteed though? Seems like if your cooling is crap, any processor might throttle. And if your cooling is good, any processor might run its turbo 100% of the time. Mine always do anyway (AMD and Intel alike). Reply
  • Hrel - Wednesday, January 22, 2014 - link

    Wow, such speed, much compute.

    Amazing how far Intel has come with their integrated graphics.
    Reply
  • TeXWiller - Wednesday, January 22, 2014 - link

    It feels like the Kaveri execution resources have been scaled to the capacity of the memory interface considering the GPU requirements. Haswell might benefit really nicely from the four-channel DDR4 interface as well. Reply
  • Death666Angel - Wednesday, January 22, 2014 - link

    What 4 channel DDR4 interface? Reply
  • TeXWiller - Wednesday, January 22, 2014 - link

    The memory interface of the Haswell-E. Reply
  • BMNify - Wednesday, January 22, 2014 - link

    That's interesting, and combined with these latest Phoronix tests for variable TDP http://openbenchmarking.org/embed.php?i=1401184-PL... and the related RAM speed tests http://openbenchmarking.org/embed.php?i=1401184-PL... it looks very odd that they didn't add more bandwidth to feed both the CPU and graphics on their best desktop Kaveri APUs. Reply
  • BMNify - Thursday, January 23, 2014 - link

    AMD Kaveri OpenCL Compared To Radeon & GeForce GPUs On Linux

    Published on 23 January 2014
    http://www.phoronix.com/scan.php?page=article&... doesn't look too good either
    http://openbenchmarking.org/embed.php?i=1401193-PL...
    Reply
  • lmcd - Friday, January 24, 2014 - link

    "On Linux" Reply
  • BMNify - Friday, January 24, 2014 - link

    yeah , that kernel thing that runs on all the 1.81 billion mobile phone sales for all of 2013 not counting all of the other android devices today OC. Reply
  • BMNify - Friday, January 24, 2014 - link

    and you are aware that the AMD linux Radeon closed source driver as used here is considered to be on par with the windows driver as they use the same code base, and did you forget that kaveri and it's little slower brothers are supposed to be found in the mobile android devices running that kernel etc some day if they manage to get actual orders there to offset their lower windows PC sales today. Reply
  • moozoo - Wednesday, January 22, 2014 - link

    Thank you for this article.

    The reason the Intel GPU's don't have fp64 under opencl is because the math instruction that includes intrinsics and division doesn't support fp64. see page 134 of Intel Open Source Graphics Programmer's Reference Manual for the 2013 Intel Core Processor Family...: Volume 2b.

    From what I can tell, GPUs have a larger number of intrinsics with greater numerical accuracy than AVX. Intel isn't correcting this until AVX-512 (see chapter 7.2 of the "Intel Architecture Instruction Set Extensions Programming Reference" and note the "less than 2^-23 relative error"). I believe the normal accuracy is 2^-14.
    AVX does not have a native fp64 rsqrt.
    The native log and exp for Hawaii is precise to 1 ULP (http://semiaccurate.com/2013/10/23/long-look-amds-...

    The Intel OpenCL will not generate AVX2 FMA instructions.
    http://software.intel.com/en-us/forums/topic/40116...
    I assume the native AVX2 FMA is not compliant with OpenCL requirements in some way.

    There may be a Workstation version of Kaveri on the way. This might have a better fp64:fp32 ratio than 1:16 (http://semiaccurate.com/2013/06/18/a-glimpse-of-fu...
    Reply
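To put the relative-error bounds mentioned above into more familiar terms, a bound of 2^-k corresponds to roughly k * log10(2) correct decimal digits. A short sketch in Python; the helper name is invented for illustration, and the two exponents are simply the ones quoted in the comment:

```python
import math


def decimal_digits(relative_error: float) -> float:
    """Approximate count of correct decimal digits for a relative error bound."""
    return -math.log10(relative_error)


for exponent in (14, 23):
    bound = 2.0 ** -exponent
    print(f"2^-{exponent} = {bound:.3e} -> about {decimal_digits(bound):.1f} digits")
```

So the gap between the two bounds is roughly the difference between four and seven usable decimal digits in a single approximate operation.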
  • kantian - Thursday, January 23, 2014 - link

    Why don't you specify that CPU fpu64 numbers of Intel are for AVX2 instructions, but not for AVX? In this way you give unjust performance advantage to Intel! Intel CPU fpu64 has about 2x performance advantage over AMD fpu64 only with AVX2 instructions. That's why, your following statement seems quite untrue:

    "As a comparison point, one core in Haswell has the same floating point performance per cycle as two modules (or four cores) in Steamroller."
    Reply
  • kantian - Thursday, January 23, 2014 - link

    You can look at the following chart - http://images.hardwarecanucks.com/image//skymtl/CP... for some comparison numbers and examples. As you can see, the FPU (VP8) results of the Haswell i3-4330 are about 2x those of the Kaveri A10-7850K. However, the results of the older Ivy Bridge i3-3225 are similar to those of the A10-7850K. That's because the new Haswell processors have AVX2 instructions, while the Ivy Bridge ones do not. You can also see that if you compare the VP8 results of the i7-4770K to those of the i7-3770K. That's why the i7-4770K has twice the performance of the i7-3770K. Reply
  • rahulgarg - Thursday, January 23, 2014 - link

    From a floating point perspective, the only difference between AVX and AVX2 is that AVX2 contains FMA instructions while AVX does not. Kaveri/Steamroller do not support full AVX2 but do support FMA instructions. So, from a floating point perspective, Kaveri/Steamroller and Haswell support almost the same instruction set. if you look at the column, AVX with FMA, we already cover this case. Reply
  • kantian - Thursday, January 23, 2014 - link

    Thank you for your clarification! But as far as I know, the Intel Haswell architecture has 256-bit FMA units, compared to Ivy Bridge, Kaveri, etc., which have 128-bit ones. That's the only big architectural advantage Haswell's FPU has over the others. That can explain the doubled performance per FPU module we can observe on the chart I have posted. And as you say, AVX2 includes FMA instructions, which is where the big performance advantage is. However, I cannot understand your table, where the regular AVX instructions have a 4x advantage over Kaveri. As we can see on the chart (http://images.hardwarecanucks.com/image//skymtl/CP...), the practical results show a different picture. Haswell's FPU advantage over Kaveri (counting the same number of FPUs) is about 50% - 60%, but not more. Reply
  • rahulgarg - Thursday, January 23, 2014 - link

    Yes, well, our coverage is more about the theoretical peaks. In practical applications, differences will be smaller.
    About the 4x advantage of AVX over Kaveri, the difference is that each Haswell core has two 256-bit units. Thus, quad-core Haswell has total of eight 256-bit units.
    Steamroller modules only have two 128-bit units per module. Thus, quad-core Steamroller only has four 128-bit units. Thus, Haswell has twice the number of SIMD units and each unit is double the width, hence the 4x difference.
    Reply
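The 4x figure explained above follows from counting SIMD lanes. A rough sketch in Python, taking the unit counts and widths exactly as stated in the comment and assuming every unit can issue one FMA (counted as two floating-point operations per lane) each cycle; the function name and parameters are invented for illustration:

```python
def flops_per_cycle(units: int, unit_bits: int, element_bits: int, fma: bool = True) -> int:
    """Peak FLOPs per cycle for `units` SIMD units of width `unit_bits`.

    An FMA counts as two floating-point operations per lane.
    """
    lanes = unit_bits // element_bits
    ops_per_lane = 2 if fma else 1
    return units * lanes * ops_per_lane


# Unit counts as stated in the comment above.
haswell_quad = flops_per_cycle(units=8, unit_bits=256, element_bits=32)
steamroller_quad = flops_per_cycle(units=4, unit_bits=128, element_bits=32)
print(haswell_quad, steamroller_quad, haswell_quad / steamroller_quad)
```

Twice the units times twice the width gives the factor of four; passing `element_bits=64` halves both totals and leaves the ratio unchanged.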
  • kantian - Thursday, January 23, 2014 - link

    Thank you! I can absolutely agree with your calculations. However, I always thought that it is more accurate to compare the quad-core, two-module Steamroller or Piledriver parts with 2-core, 4-thread i3 processors. Because, as we know, the AMD quad-core processors have only 2 FPUs and 4 integer units. So they are only dual-core as far as the FPU goes, and quad-core for integer. I think the AMD definition of quad-core (or any other number of cores) is not quite correct. But that is another story... Reply
  • kantian - Thursday, January 23, 2014 - link

    And I think that your comment "About the 4x advantage of AVX over Kaveri, the difference is that each Haswell core has two 256-bit units. Thus, quad-core Haswell has total of eight 256-bit units." is just partly correct. Because those units are 256-bit FMA units. And FMA instructions are part of AVX2, but not AVX. That was the subject of my initial comment. Reply
  • rahulgarg - Thursday, January 23, 2014 - link

    AFAIK, the 256-bit units in Haswell can be used for non-FMA AVX ops as well. Reply
  • kantian - Thursday, January 23, 2014 - link

    True. A non-FMA AVX op will provide one 128 bit vector to one 256-bit unit at a time. But is it possible that it can provide two different 128 bit vectors in parallel, in order to take advantage of the full 256-bit unit potential? AFAIK, it is not. Reply
  • rahulgarg - Thursday, January 23, 2014 - link

    AVX includes 256-bit ops for both FMA and non-FMA. So there is a 256-bit add operation for example. Reply
  • kantian - Thursday, January 23, 2014 - link

    In that case you are right. Reply
  • kantian - Friday, January 24, 2014 - link

    Following our discussion so far, I think, you have errors in your numbers for the CPU floating-point peak performance of Ivy Bridge 3770K processor. The AVX FMA units in Ivy Bridge processors are 128 bit. Only Haswell ones are 256 bit, which gives the 4x multiplier to Steamroller numbers. That means the following numbers in the table are not correct:
    - AVX fp32 (/cycle) - 64, correct 32
    - AVX fp64 (/cycle) - 32, correct 16
    - AVX fp32 (gflops), correct 112
    - AVX fp64 (gflops), correct 56
    Reply
  • BMNify - Thursday, January 23, 2014 - link

    I don't see your point! It seems AMD were all over the shop while Intel has made one change so far
    https://en.wikipedia.org/wiki/FMA_instruction_set
    "May 2009: AMD changes the specification of their FMA instructions from the 3-operand DREX form to the 4-operand VEX form, compatible with the April 2008 Intel specification rather than the December 2008 Intel specification.[9]
    October 2011: AMD Bulldozer processor supports FMA4.[10]
    January 2012: AMD announces FMA3 support in future processors codenamed Trinity and Vishera; they are based on the Piledriver architecture.[11]
    May 2012: AMD Piledriver processor supports both FMA3 and FMA4.[10]
    June 2013: Intel Haswell processor supports FMA3.[12]
    It is currently uncertain whether the 3-operand VEX coded form (here called FMA3) or the 4-operand form (FMA4) will be the dominating standard in the future."

    the only thing that really matters OC is the fact that Different compilers provide different levels of support for FMA4:
    GCC supports FMA4 with -mfma4 since version 4.5.0[13] and FMA3 with -mfma since version 4.7.0

    NASM supports FMA3 instructions since version 2.03 and FMA4 instructions since 2.06.
    YAsm supports FMA3 and FMA4 instructions since version 1.1.0.
    Reply
  • kantian - Thursday, January 23, 2014 - link

    The non-FMA AVX ops are currently the most widely used vector instructions in the x86 applications. The newer AVX2 ones are not widely adopted, and thus have just tiny share. The non-FMA AVX 128 bit operands are executed using 256 bit FMA units in Haswell, but take no advantage of those 256 bits, as the 256 bit FMA unit can execute only one 128 bit operand at a time. That's why the 256 bit FMA units in Haswell give performance advantage only for FMA AVX 256 bit ops (AVX2), but not for the widely adopted non-FMA AVX ops. That is what I think and can explain in simple terms. Reply
  • milli - Friday, January 24, 2014 - link

    It's because AMD originally planned to support SSE5 with BD.
    http://en.wikipedia.org/wiki/SSE5
    Reply
  • silverblue - Saturday, January 25, 2014 - link

    Well, AMD drew up SSE5, and instead had to implement it differently in order to offer compatibility with AVX. Has AMD ever created an instruction set that Intel has adopted, besides AMD64? Reply
  • Th-z - Thursday, January 23, 2014 - link

    AMD is shooting itself in the foot if it doesn't have a Kaveri with full GPU FP64 capability similar to 7970. Together with HSA, it should be powerful for a new breed of applications that require FP64. It's a window of opportunity for them to popularize this product in HPC. In gaming, it also requires a "killer app" that utilizes HSA and iGPU to assist new techniques in rendering, e.g renderings that require dependency, compute-based rendering, and interactive GPU physics, and coupled with a dGPU only for rendering. Reply
  • silverblue - Thursday, January 23, 2014 - link

    Yes, but is it required for the target market? Reply
  • jabber - Thursday, January 23, 2014 - link

    Exactly. Thing is as AMD doesn't bother marketing/advertising to the target market, it's kind of a double fail. Reply
  • wumpus - Saturday, February 08, 2014 - link

    Hardly. Building it for the target market would increase power draw by a factor of four (well two since the GPU is half the chip). That would kill mobile sales and likely limit desktop power to Intel levels. Not going to happen.

    FP64 apps tend to be rare and price insensitive. Intel appears to be going there with the knights landing chip and AMD would get killed trying to make a chip that could compete with that *AND* fit in laptops/tablets (it would have enough trouble competing with that on the desktop).
    Reply
  • Gadgety - Friday, January 24, 2014 - link

    Regardless of how well the A10-7850 compares to Intel's offering in terms of fp64, I'm wondering what good the extra 33% Stream Processors are bringing compared to the rest of the Kaveri range, as in the 7700k and the A8-7600? Reply
  • Shadowmaster625 - Friday, January 24, 2014 - link

    That A6 is so weak that it stays pegged at 100% for much longer periods than an i3. The i3 is able to actually enter into low power states more often. Since an i3 will churn through its tasks faster, it can even result in reduced power consumption from the storage device since more I/O operations can be clustered together. Reply
  • twoodrow - Friday, January 24, 2014 - link

    I am a developer who frequently uses OpenCL to accelerate proprietary image processing algorithms. Their code relies on the compiler to vectorize, which, in my experience using AMD's and Intel's OpenCL SDKs, is often a mistake resulting in subpar performance.

    I never really considered the fact that benchmark code would be this naive. I assumed that since its purpose was to give an objective standpoint of realizable performance that they would take all steps to ensure maximal numbers. I won't make that mistake again.
    Reply
  • BMNify - Friday, January 24, 2014 - link

    "Their code relies on the compiler to vectorize": do you also rely on the compiler's ability to vectorize, or do you actually write your code as small independent modules, with both assembly code and C code as a fallback, x264 code style as it were, to maximize your data throughput?

    Where can we find your OpenCL x264 image processing algorithm patches to improve that generic app for 1080P/UHD1 encoding?
    Reply
  • twoodrow - Saturday, January 25, 2014 - link

    I don't understand what you are trying to say. Can you explain it more clearly?

    There are two ways to vectorize execution: explicitly (and there are a few ways to do so), or by letting the compiler figure it out from vector-naive code. The source does not explicitly vectorize by using the vector data types available in OpenCL.
    Reply
  • silverblue - Saturday, January 25, 2014 - link

    I'm confused about FlexFPU. Surely the idea was to allow for two SSE or one AVX instruction per cycle, and considering we're talking four units per dual module/quad core Kaveri, wouldn't that be equivalent to a Phenom II X4/Llano? The unit is supposedly designed to work in a HyperThreaded-style manner, could that be the limitation, or is it for SSE2 only?

    Also, as far as I recall, K10 doesn't support fused instructions. So, it's another reason to be confused about the results.
    Reply
  • kantian - Monday, January 27, 2014 - link

    I think there are mistakes in the table “CPU floating-point peak performance” in the column for the Ivy Bridge i7-3770K processor. The 3770K has 4 cores, each having 1 FPU with 2 128-bit FMA units. That is a total of 8 128-bit FMA units. The Steamroller A10-7850 has 4 cores, each pair sharing 1 FPU with 2 128-bit FMA units. That is 2 FPUs times 2 FMA units, which gives a total of 4 128-bit FMA units. Hence the i7-3770K has twice the AVX peak performance of Steamroller, Richland and Trinity. Therefore the following numbers in the table, which correspond to 4 times the performance, are wrong:
    - i7-3770K, AVX fp32 (/cycle) 64. Should be 32;
    - i7-3770K, AVX fp64 (/cycle) 32. Should be 16;
    - i7-3770K, AVX fp32 (gflops) 224. Should be 112;
    - i7-3770K, AVX fp64 (gflops) 112. Should be 56.
    Reply
  • kantian - Monday, January 27, 2014 - link

    Or, if you prefer, you can calculate the first 2 numbers like this:
    - i7-3770K, AVX fp32 (/cycle) -> 8*128/32 = 32 (8 FMA, 128-bit, fp32)
    - i7-3770K, AVX fp64 (/cycle) -> 8*128/64 = 16 (8 FMA, 128-bit, fp64)
    And the corresponding A10-7850K and i7-4770K numbers are correctly calculated like this:
    - A10-7850K, AVX fp32 (/cycle) -> 4*128/32 = 16 (4 FMA, 128-bit, fp32)
    - A10-7850K, AVX fp64 (/cycle) -> 4*128/64 = 8 (4 FMA, 128-bit, fp64)
    - i7-4770K, AVX fp32 (/cycle) -> 8*256/32 = 64 (8 FMA, 256-bit, fp32)
    - i7-4770K, AVX fp64 (/cycle) -> 8*256/64 = 32 (8 FMA, 256-bit, fp64)
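
The arithmetic in the list above can be sketched as a small Python helper. This is only an illustration of the commenter's formula (units × vector width ÷ element size); the function names `per_cycle` and `table` are made up for this example, and the unit counts are taken as stated in the comment rather than verified. Note that the formula counts one result per lane per cycle; if an FMA is counted as two floating-point operations, the FMA figures would double.

```python
def per_cycle(units, width_bits, element_bits):
    """Results per cycle = units * (vector width / element size)."""
    return units * width_bits // element_bits


# (chip, number of units, unit width in bits), exactly as stated in the comment.
CHIPS = [
    ("i7-3770K", 8, 128),
    ("A10-7850K", 4, 128),
    ("i7-4770K", 8, 256),
]


def table():
    """Map each chip to its (fp32, fp64) per-cycle figures."""
    return {
        name: (per_cycle(units, width, 32), per_cycle(units, width, 64))
        for name, units, width in CHIPS
    }


if __name__ == "__main__":
    for name, (fp32, fp64) in table().items():
        print(f"{name}: fp32/cycle={fp32}, fp64/cycle={fp64}")
```

Running it reproduces the per-cycle figures listed in the comment, which makes it easy to see how each one changes if a different unit count or width is assumed.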
    Reply
  • rahulgarg - Monday, January 27, 2014 - link

    Each Ivy Bridge core has two 256-bit ALU units and no FMA support. Reply
  • kantian - Monday, January 27, 2014 - link

    OK, you are right; I just didn't wish to go into such detail. It doesn't change my calculations, because Intel Sandy Bridge/Ivy Bridge ALUs are used like this:
    - 8 DP FLOPs/cycle: 4-wide AVX addition + 4-wide AVX multiplication
    - 16 SP FLOPs/cycle: 8-wide AVX addition + 8-wide AVX multiplication
    If you multiply those numbers by the number of cores (i.e. 4), you get 4 x 4 = 16 (fp64) and 4 x 8 = 32 (fp32). Those are exactly the numbers in my previous comment.
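
To make the two readings of the Ivy Bridge numbers concrete, here is a hedged Python sketch. It assumes 256-bit AVX vectors, four cores, one add port and one multiply port per core, and a 3.5 GHz clock (the ratio implied by the 224 gflops and 64 per cycle table figures quoted earlier in the thread). `peak_per_cycle` and `gflops` are hypothetical helpers, not anything from the article's benchmark code.

```python
AVX_BITS = 256    # assumption: full-width AVX vector
CORES = 4         # assumption: quad-core part
CLOCK_GHZ = 3.5   # assumption: implied by 224 gflops / 64 per cycle


def peak_per_cycle(element_bits, ports_busy):
    """FLOPs per cycle across all cores.

    ports_busy is 2 when the add and multiply ports both issue in the
    same cycle, and 1 when only one of them does.
    """
    lanes = AVX_BITS // element_bits
    return CORES * lanes * ports_busy


def gflops(flops_per_cycle):
    """Convert a per-cycle figure to gflops at the assumed clock."""
    return flops_per_cycle * CLOCK_GHZ


if __name__ == "__main__":
    for label, ports in (("add + mul each cycle", 2), ("add or mul each cycle", 1)):
        fp32 = peak_per_cycle(32, ports)
        fp64 = peak_per_cycle(64, ports)
        print(f"{label}: fp32={fp32}/cycle ({gflops(fp32)} gflops), "
              f"fp64={fp64}/cycle ({gflops(fp64)} gflops)")
```

With both ports busy the sketch gives 64 (fp32) and 32 (fp64) per cycle, the table's figures; with only one port busy per cycle it gives 32 and 16, the figures argued for in this comment. Which case applies depends on how well the instruction mix balances additions and multiplications.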
    Reply
  • kantian - Monday, January 27, 2014 - link

    Or in other words, the Sandy Bridge/Ivy Bridge ALU can execute either one 256-bit addition or one 256-bit multiplication per cycle per core, while the two 128-bit Steamroller FMA units can group together to execute the same one 256-bit addition or one 256-bit multiplication per cycle per module. Hence, in most cases, 1 Steamroller module should have the same throughput as 1 Ivy Bridge core. As non-FMA AVX multiply and add operations are rarely mixed together, one could not expect many cases where both operations are performed on both 256-bit Ivy Bridge ALUs in the same cycle. In some ideal scenario, one of the Ivy Bridge hyperthreads would provide a 256-bit addition and the other a 256-bit multiplication. I can agree that in those cases the CPU will reach your maximum peak performance numbers. Reply
  • kantian - Monday, January 27, 2014 - link

    "Or in other words, Sandy Bridge/Ivy Bridge FPU ..." above Reply
  • FellTheSky - Thursday, February 06, 2014 - link

    Would GDDR5 and a better memory bus help Kaveri in HSA-enabled applications?

    There are some benchmarks around of opencalc and another app that supports HSA, but they are very simple tests, and I would like to know if memory speed has a direct impact on HSA applications.
    Reply
  • crunchmore - Friday, May 23, 2014 - link

    I'm not sure about my understanding, but maybe the FPU in Bulldozer doesn't work as a single core:
    "What he could tell me was that the 128-bit FP units are symmetrical, and that, on any cycle, either integer core can dispatch a 256-bit AVX instruction (assuming software compiled to support AVX). Or, both integer cores can dispatch a single 128-bit instruction at the same time."

    From: http://www.tomshardware.com/reviews/bulldozer-bobc...
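
As a rough illustration of the quoted description, the Python sketch below models one module's shared FPU as two 128-bit units and counts the lanes issued per cycle in each dispatch mode. The mode names and the `lanes_per_cycle` helper are hypothetical, and the model looks only at vector width, so it says nothing about real scheduling behavior.

```python
UNIT_BITS = 128        # width of each FP unit, per the quote
UNITS_PER_MODULE = 2   # two symmetrical units shared by the module

# Instruction widths (in bits) dispatched to the shared FPU in one cycle.
DISPATCH_MODES = {
    "one_core_256bit_avx": [256],      # either core issues one 256-bit AVX op
    "both_cores_128bit": [128, 128],   # each core issues one 128-bit op
}


def lanes_per_cycle(mode, element_bits=32):
    """Lanes processed per cycle by one module in the given mode."""
    total_bits = sum(DISPATCH_MODES[mode])
    if total_bits > UNIT_BITS * UNITS_PER_MODULE:
        raise ValueError("mode exceeds the module's FP width")
    return total_bits // element_bits


if __name__ == "__main__":
    for mode in DISPATCH_MODES:
        print(mode, lanes_per_cycle(mode, 32), lanes_per_cycle(mode, 64))
```

Under this simplified model both modes fill the same 256 bits per cycle, so the per-module peak is identical either way; only a real microbenchmark could show whether the hardware actually behaves like this.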

    Are there some tests to run to make the situation clear? Thanks.
    Reply
