One of the main features Intel was promoting at the launch of Haswell was TSX – Transactional Synchronization eXtensions. In our analysis, Johan explains that TSX enables the CPU to process a series of traditionally locked instructions on a dataset in a multithreaded environment without locks, allowing each core to proceed even though it could potentially violate another core’s shared data. If the series of instructions is computed without such a violation, the code passes through at a quicker rate – if an invalid overwrite happens, the transaction is aborted and the code takes the locked route instead. All a developer has to do is link in a TSX library and mark the start and end parts of the code.
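
As a rough illustration of the "try without the lock, fall back to the lock" flow, the C++ sketch below uses the RTM intrinsics `_xbegin`, `_xend` and `_xabort` from `<immintrin.h>`. It is a sketch under stated assumptions, not Intel's recommended implementation: it assumes GCC or Clang on x86, the names `shared_balance`, `fallback_lock`, `cpu_has_rtm`, `add_to_balance` and `smoke_test` are hypothetical and ours, and the transactional path is only compiled when the build enables RTM (for example with `-mrtm`). If CPUID does not report RTM, or a transaction aborts, the code simply takes the locked route.

```cpp
#include <atomic>
#include <cassert>

#if defined(__RTM__)  // defined by GCC/Clang only when RTM is enabled, e.g. -mrtm
#include <cpuid.h>
#include <immintrin.h>
#endif

// Hypothetical shared state, for illustration only.
std::atomic<bool> fallback_lock{false};  // simple spinlock for the locked route
long shared_balance = 0;                 // data normally protected by the lock

#if defined(__RTM__)
// CPUID leaf 7, sub-leaf 0, EBX bit 11 reports RTM support at run time.
bool cpu_has_rtm() {
    unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) return false;
    return (ebx & (1u << 11)) != 0;
}
#endif

void add_to_balance(long amount) {
#if defined(__RTM__)
    static const bool use_rtm = cpu_has_rtm();
    if (use_rtm && _xbegin() == _XBEGIN_STARTED) {
        // Reading the lock adds it to the transaction's read set, so a thread
        // that takes the locked route will abort this transaction.
        if (fallback_lock.load(std::memory_order_relaxed)) _xabort(0xff);
        shared_balance += amount;  // the marked "start to end" region
        _xend();                   // commit: no lock was ever taken
        return;
    }
    // Transaction aborted or RTM unavailable: fall through to the lock.
#endif
    while (fallback_lock.exchange(true, std::memory_order_acquire)) {
        // spin until the lock is free
    }
    shared_balance += amount;
    fallback_lock.store(false, std::memory_order_release);
}

// Single-threaded smoke test; real callers would be concurrent threads.
void smoke_test() {
    const long before = shared_balance;
    add_to_balance(5);
    add_to_balance(7);
    assert(shared_balance == before + 12);
}
```

Checking the fallback lock inside the transaction is what keeps the two routes consistent: a transaction cannot commit while another thread holds the lock. A production version would normally retry an aborted transaction a few times before taking the lock; that is omitted here for brevity.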

News coming from Intel’s briefings in Portland last week boils down to an erratum found with the TSX instructions. Tech Report and David Kanter of Real World Technologies are stating that a software developer outside of Intel discovered the erratum through testing, and subsequently Intel has confirmed its existence. While errata are not new (Intel’s E3-1200 v3 Xeon CPUs already have 140 of them), what is interesting is Intel’s response: to push through new microcode to disable TSX entirely. Normally a microcode update would suggest a workaround, but it would seem that this is a fundamental silicon issue that cannot be designed around, or intercepted at an OS or firmware/BIOS level.

Intel has had numerous issues similar to this in the past, such as the FDIV bug, the f00f bug and more recently, the P67 B2 SATA issues. In each case, the bug was resolved by a new silicon stepping, with certain issues (like FDIV) requiring a recall, similar to recent issues in the car industry. This time there are no recalls; the feature just gets disabled via a microcode update.

The main focus of TSX is in server applications rather than consumer systems. It was introduced primarily to aid database management and other tools more akin to a server environment, which is reflected in the fact that enthusiast-level consumer CPUs have it disabled (except Devil’s Canyon). Now it will come across as disabled for everyone, including the workstation and server platforms. Intel is indicating that programmers who are working on TSX enabled code can still develop in the environment as they are committed to the technology in the long run.

Overall, this issue affects all of the Haswell processors currently in the market, the upcoming Haswell-E processors and the early Broadwell-Y processors under the Core M branding, which are currently in production. The issue has been found too late in the day for a fix to be introduced to these platforms, although we might imagine that the next stepping all around will have a suitable fix. Intel states that its internal designs have already addressed the issue.

Intel is recommending that Xeon users that require TSX enabled code to improve performance should wait until the release of Haswell-EX. This tells us two things about the state of Haswell: for most of the upcoming LGA2011-3 Haswell CPUs, the launch stepping might be the last, and the Haswell-EX CPUs are still being worked on. That being said, if the Haswell-E/EP stepping at launch is not the last one, Intel might not promote the fact – having the fix for TSX could be a selling point for Broadwell-E/EP down the line.

For those that absolutely need TSX, it is being said that TSX can be re-enabled through the BIOS/firmware menu should the motherboard manufacturer decide to expose it to the user. Reading through Intel’s official errata document, we can confirm this.

We are currently asking Intel what the required set of circumstances are to recreate the issue, but the erratum states ‘a complex set of internal timing conditions and system events … may result in unpredictable system behaviour’. There is no word if this means an unrecoverable system state or memory issue, but any issue would not be in the interests of the buyers of Intel’s CPUs who might need it: banks, server farms, governments and scientific institutions.

At the current time there is no road map for when the fix will be in place, and no public date for the Haswell-EX CPU launch. It might not make sense for Intel to re-release the desktop Haswell-E/EP CPUs, and in order to distinguish them it might be better to give them all new CPU names. However, the issue should certainly be fixed with Haswell-EX and desktop Broadwell onwards, given that Intel confirms they have addressed the issue internally.

Source: Twitter, Tech Report

 

62 Comments

  • nbtech - Thursday, August 14, 2014 - link

    I had a similar reaction when I first read the comment.
    The tools for functional verification have improved over the past 10 years (emergence of OVM/UVM), so we'd expect to see a decrease in the occurrence of this type of issue, but the fact is that reaching 100% coverage is difficult given limited time and compute resources, and highly dependent upon writing good testbenches. It's not an easy thing to do.
  • aamartin - Thursday, August 14, 2014 - link

    I agree with nbtech. I would just like to add that debugging and fixing a bug found in silicon is reeeally hard. Narrowing down the sequence of events to make the failure repeatable is an art. Remember, a 3GHz CPU is launching instructions roughly at the rate of 3 billion per second (not even counting multi-core and multiple issue). Software-based simulators and even hardware-based emulators run orders of magnitude slower-- if you can't cause the failure in a couple of seconds, you have to debug on the silicon itself, which has limited visibility of the internal state.
  • yuhong - Tuesday, August 12, 2014 - link

    Personally, I really hope there will be a new stepping of Haswell CPUs with the TSX errata fixed.
  • Gigaplex - Wednesday, August 13, 2014 - link

    Haswell? Unlikely. Broadwell isn't far away.
  • psyq321 - Wednesday, August 13, 2014 - link

    This feature is predominantly intended for the server market.

    It is highly likely that Haswell EP 2S will get a new stepping (C2 for high core versions).
    I suppose 4S models will ship with fixed stepping.

    The question is, will Intel also update the 1S Haswells which are pretty much identical to the desktop versions with enabled support for ECC.

    Desktop/Mobile Haswells, I do not think they'd bother, but if they update 1S server SKUs, I see no reason for Intel not to silently roll out updated steppings for desktop SKUs as well, since it is the same silicon as 1S server SKUs.
  • TerdFerguson - Tuesday, August 12, 2014 - link

    Intel's response here is atrocious. TSX was a selling point for the chip, so they need to make good on it or offer refunds.
  • Gondalf - Wednesday, August 13, 2014 - link

    Really??? Do you utilize TSX??? come on!
  • r3loaded - Wednesday, August 13, 2014 - link

    If you develop or use software that actually utilises TSX, you're probably in the market for the big iron Xeon EX processors. In that case, it's not an issue as Haswell-EX will have working TSX.
  • barleyguy - Wednesday, August 13, 2014 - link

    TSX is essentially a convenience feature, to allow lock free code without as much work from a developer. The same thing can be accomplished by rewriting the code using compare and set instructions instead of blocking locks. So much like many new features, TSX doesn't enable any magic that couldn't be accomplished before, it just saves time on the development end.

    From that perspective, I'm betting nobody will scream about it being missing, especially since the slightest inclination that it's unreliable would keep people from using it anyway.

    When it's done and reliable, then release it. In the meantime, possibly allow an "at your own risk" feature toggle for people that live on the bleeding edge.

    $.02
  • Senti - Wednesday, August 13, 2014 - link

    Let's take some parallels: "hardware AES is just a convenience feature – it just saves developers the work to implement it in software". Sure!
