Agner Fog, a Danish expert in software optimization, is making a plea for an open and standardized procedure for x86 instruction set extensions. At first sight, this may seem like a discussion that does not concern most of us. After all, the poor souls who have to program the insanely complex x86 compilers will take care of the complete chaos called "the x86 ISA", right? Why should the average developer, system administrator or hardware enthusiast care?

Agner goes into great detail about why the incompatible SSE-x.x additions and other ISA extensions were and are a pretty bad idea, but let me summarize it in a few quotes:
  • "The total number of x86 instructions is well above one thousand" (!!)
  • "CPU dispatching ... makes the code bigger, and it is so costly in terms of development time and maintenance costs that it is almost never done in a way that adequately optimizes for all brands of CPUs."
  • "the decoding of instructions can be a serious bottleneck, and it becomes worse the more complicated the instruction codes are"
  • The cost of supporting obsolete instructions is not negligible. You need large execution units to support a large number of instructions. This means more silicon space, longer data paths, more power consumption, and slower execution.
Summarized: Intel and AMD's proprietary x86 additions cost us all money. How much is hard to calculate, but our CPUs are consuming extra energy and underperforming, as decoders and execution units are unnecessarily complicated. The software industry is wasting quite a bit of time and effort supporting different extensions.
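
To make the "CPU dispatching" point above concrete, here is a minimal sketch in C of the usual pattern, assuming GCC's or Clang's __builtin_cpu_supports() builtin; the function names are hypothetical stand-ins, and a real dispatcher has to multiply such paths for every extension and CPU brand it wants to optimize for, which is exactly the maintenance cost Agner describes:

    /* Minimal CPU-dispatching sketch (GCC/Clang). The "fast" path is a
       placeholder; in real code it would use SSE/AVX intrinsics. */
    #include <stdio.h>
    #include <stddef.h>

    /* Baseline path: plain scalar code that runs on any x86 CPU. */
    static float sum_scalar(const float *v, size_t n) {
        float s = 0.0f;
        for (size_t i = 0; i < n; i++)
            s += v[i];
        return s;
    }

    /* Hypothetical optimized path: stands in for an AVX implementation. */
    static float sum_avx(const float *v, size_t n) {
        return sum_scalar(v, n);  /* placeholder body */
    }

    /* Function pointer selected once at startup, based on what the CPU reports. */
    static float (*sum_dispatch)(const float *, size_t) = sum_scalar;

    int main(void) {
        __builtin_cpu_init();               /* populate the CPU feature flags */
        if (__builtin_cpu_supports("avx"))  /* pick the fastest supported path */
            sum_dispatch = sum_avx;

        float data[4] = {1.0f, 2.0f, 3.0f, 4.0f};
        printf("sum = %f\n", sum_dispatch(data, 4));
        return 0;
    }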
 
Not convinced, still thinking that this only concerns the HPC crowd? The virtualization platforms contain up to 8% more code just to support the incompatible virtualization instructions, which offer almost exactly the same features. Each VMM is 4% bigger because of this. So whether you are running Hyper-V, VMware ESX or Xen, you are wasting valuable RAM space. It is not dramatic, of course, but it is unnecessary waste. Much worse is that this unstandardized x86 extension mess has made it a lot harder for datacenters to make the step towards a really dynamic environment where you can load-balance VMs and thus move applications from one server to another on the fly. It is impossible to move (VMotion, live migrate) a VM from Intel to AMD servers, or from newer to (some) older ones, and you need to fiddle with CPU masks in some situations just to make it work (and read complex tech documents). Should 99% of the market lose money and flexibility because 1% of the market might get a performance boost?
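
For readers who have never touched those "CPU masks": conceptually, the hypervisor intercepts CPUID and clears the feature bits that are not available on every host in the migration cluster, so a guest never starts using an instruction that a destination host lacks. Below is a minimal, purely illustrative sketch in C; the bit positions are the architectural CPUID leaf 1 ECX bits, but the masking function itself is hypothetical and not any vendor's actual interface:

    #include <stdint.h>

    /* Architectural feature bits in CPUID leaf 1, register ECX. */
    #define CPUID1_ECX_SSE41  (1u << 19)
    #define CPUID1_ECX_SSE42  (1u << 20)
    #define CPUID1_ECX_AVX    (1u << 28)

    /* Hypothetical helper: before CPUID results are returned to a guest,
       keep only the features present on every host in the cluster, so the
       VM can be live-migrated without suddenly losing an instruction set. */
    static uint32_t mask_guest_cpuid1_ecx(uint32_t host_ecx,
                                          uint32_t cluster_baseline_ecx) {
        return host_ecx & cluster_baseline_ecx;
    }

With a cluster baseline that excludes SSE 4.2, for instance, a guest never sees that bit and can be moved to an older host without risking illegal-instruction faults.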

The reason why Intel and AMD still continue with this is that some people inside feel that it creates a "competitive edge". I believe this "competitive edge" is negligible: how many people have bought an Intel "Nehalem" CPU because it has the new SSE 4.2 instructions? How much software is supporting yet another x86 instruction addition?
 
So I fully support Agner Fog in his quest for a (slightly) less chaotic and more standardized x86 instruction set.
Comments

  • alxx - Wednesday, December 9, 2009 - link

    Sorry, you're a bit wrong there.
    VLIW is still heavily used by TI in their DSP cores.
    Look at their C6000 series and C6400+, and also at the DSP unit in their OMAP cores used in a lot of mobile phones, and in the DSPs used in some base stations and a lot of other comms equipment.

    A more correct statement would be that VLIW failed in general-purpose computing.

    http://www.eetasia.com/ART_8800445205_499489_NP_cb...
    http://www.ece.umass.edu/ece/koren/architecture/VL...
    http://focus.ti.com/paramsearch/docs/parametricsea...

    Interesting book:
    Embedded Computing: A VLIW Approach to Architecture, Compilers & Tools
  • wetwareinterface - Wednesday, December 9, 2009 - link

    You have missed the IA64 mark by a long shot. IA64 doesn't predicate logic in software; it allowed software to handle its own data and instruction width more efficiently. For instance, say you have to compare two 16-bit values and fetch a 32-bit float. On x86, with no dependencies, that's a lot of operations: two fetches for the 16-bit values, a compare, then at least two stores (because of a serious lack of registers), then another fetch. On IA64 it can do all three fetches at once, then store the result of the 16-bit compare locally in a register. That's just one case. There are several instances where IA64 simply kicks the crap out of x86 at doing what CPUs do.

    The VLIW is a means to an end; in VAX's case there weren't enough resources behind the concept to make it worthwhile, while in IA64's case there is an abundance of CPU horsepower to handle the concept of VLIW. The compiler just has the ability to pack more fetches together if it can, and to do the job of the CPU ahead of time by organizing dependencies in some cases. The dependency resolution in the compiler was a bonus to save even further CPU cycles on IA64 code, and it was necessary due to the software x86 emulation. It was only required for x86 emulation because Intel wanted to junk x86 entirely.

    In any system there is a lot of non-dependent data being fetched. The problem with x86 is that you can't get too far ahead of time because of a lack of resources in the CPU and not many means to grab more at once. You can fetch into level 1 or level 2 cache in 64-bit chunks, but because of the crap ISA of x86, taking them into the ALU or registers is a one-after-the-other step. IA64 sought to get rid of the limitations of x86 and go forward with a 64-bit ISA that was new.
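
    A trivial C fragment of the scenario described above (the names are made up for illustration): the two 16-bit loads, the compare and the 32-bit float load are independent of each other, so an explicitly parallel ISA can issue them together in one bundle, while a register-starved ISA tends to serialize the loads, compare and spills:

        #include <stdint.h>

        /* Loads of two 16-bit values plus one 32-bit float fetch;
           only the final select depends on the earlier operations. */
        float pick(const int16_t *a, const int16_t *b, const float *f) {
            int   cmp = (*a > *b);   /* compare two 16-bit values */
            float val = *f;          /* fetch a 32-bit float      */
            return cmp ? val : 0.0f; /* select depends on both    */
        }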

    Motorola/IBM/Apple did the exact same thing moving to PowerPC, and it worked well for them. It meant slow software emulation for older code, but a dramatic improvement with new code and a new, more modern ISA without a lot of garbage they didn't need anymore. Intel was trying to do the same thing, only they didn't have the partner in Microsoft that Motorola and IBM had in Apple, meaning one focused on the mainstream desktop and willing to completely ditch legacy code and start over with a new CPU instruction set. Microsoft had a massively larger user base, and it was extremely varied and couldn't just drop everything the way Apple could.

    HP, on the other hand, in the server space could devote a separate effort to IA64. For HPC, IA64 kicked the crap out of everything that then existed under HP-UX on a per-CPU basis. The ISA was very good even running at a MHz handicap. It took IBM going to POWER5 and ramping up the GHz, and Intel not updating IA64 because they were spending their resources on Core 2, to finally beat it. Make no mistake, Itanium was a monster even at low clock speeds. It just didn't get any software to run on its own ISA except in a few instances, and those were HPC or server roles. You can't judge what IA64 can do with desktop-centric performance benchmarks, because you aren't running any IA64 code at all; you are running a cross-ISA emulator. And give Intel some credit on their JIT compiler, because it rocked. It took a completely foreign instruction set and ran it at nearly the same speed as the CPUs it was designed to run on, but on a CPU foreign to the code. People complained that the speed of IA64 running Office and similar x86 apps under emulation was like an older-generation x86 CPU. Try running PearPC (a PowerPC emulator) and just time the install of OS 8, even today, on a Core i7 920 overclocked to 4GHz, and tell me how bad Intel's Itanium was at x86 emulation.


    Lack of software on IA64 is what killed IA64, not the ISA.

    Also, it was actually Intel's intent to transition the mainstream to IA64's ISA. First came the servers; then workstation Xeon motherboards would take either IA64 or x86 Xeons; then the mainstream parts would come after. AMD threw a monkey wrench into the whole Xeon Itanium/x86 transition with the Opteron/x86-64 move.
  • mgambrell - Friday, December 11, 2009 - link

    I just want to clarify something here. Apple's handful of toady developers can be pushed around, but Microsoft doesn't have that clout over their hundreds of thousands of developers. It isn't even possible. I enjoy watching them try just to kick people off XP, and you think you could get them to ditch x86? Ha.
  • cbemerine - Wednesday, December 30, 2009 - link

    "...but Microsoft doesn't have that clout over their hundreds of thousands of developers. It isn't even possible. I enjoy watch them try just to kick people off XP..."

    I do not know what planet you are living on, but they most certainly do have the clout to push every XP user off of it. While going through the developers is one minor path, over the last 20+ years Microsoft has been more successful at kicking people off older platforms via the following methods: hardware (Intel, Nvidia and others); software (Corel, Novell and others); BIOS vendors (all but Coreboot); and of course their own forced auto-updates and auto-upgrade process.

    It's total vendor lock-in and has been so since midway through Windows 2000. The only way out is not to play: Linux, Unix or Mac OS X.

    You're delusional to ignore past abuses and facts, though you are hardly alone.

    My preferred method is to set a "7 Year Clock": if after 7 years of actions on the part of Microsoft and those they influence, they are being a good corporate citizen and leaving FUD and vendor lock-in tactics behind (based on their ACTIONS, not words), then and only then will I purchase their products. When a vendor causes problems with software/hardware I am running, I do not blame the software, but THAT VENDOR! It really is that simple.
  • yuhong - Sunday, December 6, 2009 - link

    They already did abandon their own SSE5 in favor of AVX.
  • psychobriggsy - Sunday, December 6, 2009 - link

    What about when AMD does the extensions first, and Intel does something different?

    Examples: 3DNow! and AMD's Virtualisation instructions (which were more functional than Intel's, at least early on).

    The sad thing is that it is the broken x86 architecture itself that requires special virtualisation instructions to be present.

    I say that around 2015, 32-bit x86 compatibility should be relegated to a separate 32-bit core in the CPU for all backward compatibility, and the main CPU cores should be 64-bit only, with no backwards compatibility, maybe even with the ISA tweaked to account for this (64-bit instruction prefixes not required, for example).
  • darthscsi - Monday, December 7, 2009 - link

    Intel once thought as you do, and created a processor with a 64-bit instruction set which was incompatible with x86. They wound up with a separate execution unit for x86 initially, but have now dropped that in favor of binary emulation in software. But you don't run an Itanium, do you? You have a processor with extensive backwards compatibility. You want a cleaner ISA? Vote with your dollars. (Yes, I've had several Alphas and have been sad to see that ISA die.)
  • Lucky Stripes 99 - Thursday, December 17, 2009 - link

    Keep in mind that one major benefit of a CISC-based instruction set is that you can theoretically achieve greater code density than with a RISC processor.

    Look at ARM as an example. You need at least one 32-bit op to fetch the data and one 32-bit op to work on it. Under M68K, the whole thing can be done with a single 48-bit op. More complex forms of indirect addressing may require several more ops on ARM in order to compute your offset; under M68K, the offset is just folded into the single op.

    Sure, the ARM solution makes the prefetch and execution circuits much, much easier to implement. However, you end up paying a byte or two of overhead for each instruction versus the M68K. For IA32, which uses an even denser instruction set, the savings can be even greater.
  • Scali - Sunday, December 20, 2009 - link

    I don't think that's a benefit anymore. These days memory and cache are relatively cheap. It's much easier to slap a few extra MB onto a system than it is to improve its performance per instruction.
  • Shadowmaster625 - Monday, December 7, 2009 - link

    He's talking about having a dedicated x86 core to maintain backwards compatibility. This is a no-brainer. New multicore CPUs should only have one or two legacy cores; the rest should be more efficiently designed. I'm sure this will happen eventually, as soon as it becomes cheaper to design multicore CPUs in such an asymmetrical manner.
