Readers may have been following Wendell from Level1Tech as he investigated why some benchmarks show regressed performance on quad-die Threadripper 2 compared to dual-die configurations. Through his research, he found that the problem was limited to Windows, as cross-platform software did not show the issue on Linux, and that it was not limited to Threadripper 2: quad-die EPYCs were also affected.

At the time, most journalists and analysts noted that the performance was lower, and that the Linux/Windows differences existed, but pointed the finger at the reduced memory performance of the large Threadripper 2 CPUs. Wendell, however, discovered that removing CPU 0 from the thread pool after the program starts running actually regained all of the performance lost on Windows.

After some discussions about what exactly the issue was, I helped Wendell with some additional testing by running our CPU suite through an affinity mask to remove CPU 0 from the options at runtime. The results were negative, suggesting that the key was actually changing the CPU 0 assignment while the program is running.
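To make the idea of an affinity mask that removes CPU 0 concrete, here is a minimal Python sketch. It only computes the bit pattern and does not apply it to any process; the function names `affinity_mask` and `cpus_in_mask` are hypothetical, and the 64-thread count assumes a 32-core Threadripper 2 with SMT enabled in a single processor group.

```python
def affinity_mask(cpu_count, excluded=(0,)):
    """Build a bitmask with one bit per logical CPU, clearing the excluded ones.

    Bit i set means "this process may run on logical CPU i".
    Hypothetical helper for illustration only.
    """
    if not 1 <= cpu_count <= 64:
        # Assumption: stay within a single 64-bit mask.
        raise ValueError("cpu_count must be between 1 and 64")
    mask = (1 << cpu_count) - 1          # all CPUs allowed
    for cpu in excluded:
        if not 0 <= cpu < cpu_count:
            raise ValueError(f"CPU {cpu} is out of range")
        mask &= ~(1 << cpu)              # clear the bit for this CPU
    if mask == 0:
        raise ValueError("mask would leave no CPU to run on")
    return mask


def cpus_in_mask(mask):
    """List the logical CPU indices whose bit is set in the mask."""
    return [i for i in range(mask.bit_length()) if (mask >> i) & 1]


if __name__ == "__main__":
    mask = affinity_mask(64)             # 64 logical CPUs, CPU 0 removed
    print(hex(mask), len(cpus_in_mask(mask)))
```

A mask like this can be supplied when a program is launched or applied to a process that is already running; the testing described above suggests that only the second case recovered the lost performance.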

After this, Wendell did his testing on an EPYC 7551 processor, one of the big four-die parts, and confirmed this was not limited to just Threadripper – the problem wasn’t memory, it was almost certainly the Windows Scheduler.

'Best NUMA Node' and Windows Hotfix for 2-NUMA

The conclusion was that in a NUMA environment, Windows’ scheduler assigns a ‘best NUMA node’ to each piece of software. The scheduler is programmed to move that software's threads to that node as often as possible, and will kick out other threads with the same ‘best NUMA node’ setting with abandon. When running a single binary that spawns 32 or 64 threads, every thread from that binary is assigned the same ‘best NUMA node’, and these threads are continually pushed onto that node, kicking out threads that already want to be there. This leads to core contention, and a fully multi-threaded program could spend half of its time shuffling threads around to comply with this ‘best NUMA node’ situation.
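The contention described above can be illustrated with a toy model. This is a sketch under stated assumptions, not the Windows scheduler's actual logic: it assumes four NUMA nodes with 16 logical processors each, and the hypothetical function `displaced_threads` simply counts how many threads cannot fit on the node they prefer.

```python
from collections import Counter


def displaced_threads(preferred_nodes, cpus_per_node):
    """Count threads that exceed the capacity of their preferred node.

    preferred_nodes: one entry per thread, giving the node it wants to run on.
    Toy model only; real schedulers are far more involved.
    """
    demand = Counter(preferred_nodes)
    return sum(max(0, wanted - cpus_per_node) for wanted in demand.values())


NODES = 4            # assumed quad-die layout
CPUS_PER_NODE = 16   # assumed logical processors per node
THREADS = 64

# Every thread of one binary shares the same 'best NUMA node' (node 0).
same_node = [0] * THREADS
# Hypothetical alternative: preferences spread evenly across the nodes.
spread = [t % NODES for t in range(THREADS)]

if __name__ == "__main__":
    print(displaced_threads(same_node, CPUS_PER_NODE))
    print(displaced_threads(spread, CPUS_PER_NODE))
```

In this model, a shared preference leaves most threads competing for a node that cannot hold them, while an even spread leaves none displaced.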

This ‘best NUMA node’ behavior was originally meant for running VMs, such that each VM would run in its own runtime and be assigned a different ‘best NUMA node’ depending on what else was currently running on the system.

One would expect this issue to come up in any NUMA environment, such as dual processors or dual-die AMD processors. It turns out that Microsoft has a hotfix in place in Windows for dual-NUMA environments that disables this ‘best NUMA node’ situation. Ultimately at some point there were enough dual-socket workstation platforms on the market that this made sense, pushing the ‘best NUMA node’ implementation down the road to 3+ NUMA environments. This is why we see it in quad-die Threadripper and EPYC, and not dual-die Threadripper.

Wendell has been working with Jeremy from BitSum, creator of the CorePrio software, to develop a way of soft-fixing this issue. The CorePrio software now has an option called ‘NUMA Disassociator’, which probes which software is active every few seconds and adjusts the thread affinity while the software is running (rather than running an affinity mask, which has no effect).

This is a good temporary solution; however, the issue needs to be fixed in the Windows scheduler.

AMD Comments On The Findings

There have been questions about how much AMD/Microsoft know about this issue, who they are in contact with, and what is being done. AMD was happy to make some comments on the record.

AMD stated that it has support and update tickets open with Microsoft’s Windows team on the issue. The company believes it knows what the issue is, and commends Wendell for coming very close to the actual cause (it declined to go into detail). AMD is currently comparing notes with Bitsum, and actually helped Bitsum develop the original tool for affinity masking; the ‘NUMA Disassociator’, however, is new.

The timeline for a fix will depend on a number of factors between AMD and Microsoft; however, there will be announcements when the fix is ready and on exactly how that fix will affect performance. Other improvements to help optimize performance will also be included. AMD is still very pleased with Threadripper 2 performance, and is keen to stress that for the most popular performance-related tests, reviews show rendering performance still well above the competition. The company is also working with software vendors to push that performance even further.

41 Comments

  • colonelclaw - Monday, January 14, 2019

    FWIW I've tested CorePrio with 3DS Max 2019 and V-Ray 4.1, and I can't see a meaningful difference, so it looks like V-Ray is unaffected. Furthermore the 32-core TR scales almost exactly as expected from the 16-core TR (taking into account frequency differences).
    Conclusion: If you render with V-Ray and TR everything's good.
  • IGTrading - Monday, January 14, 2019

    Problem is that Microsoft did this on purpose!

    Please, don't go crazy about anti-AMD conspiracies, because this is not the case at all (while there are clearly some other cases where reasonable doubt is more than resonable :) but not here , not today , please )

    What I mean is this : every project has a PM.

    When Intel told Microsoft they want a patch to fix/improve Xeon 18-core, Microsoft was probably happy to oblige, but when that development was ready, testing is ABSOLUTELY MANDATORY @Microsoft.

    Of course Microsoft WHQL will tests the patch in most x86 platforms, including AMD EPYC and AMD ThreadRipper.

    My question is: how the hell did the PM of that project decide that the side effects that the new Microsoft patch for Intel Xeon is causing are not only acceptable, but several higher-ups on the the approval chain decided to move this into production and release the patch ?!

    How the heck did they decide to force the update on AMD machines as well, clearly knowing the side effects it has on them ?!

    I don't believe this is an anti-AMD Microsoft+Intel coalition at all.

    What I DO suspect might be possible is that some guy/guys from Microsoft which were involved with this patch, received some nice team meetings in beautiful expensive settings or other sort of benefits from their Intel counterparts ... to quicken the release of the new patch and overlook the AMD side effects & any decent minimal mitigation measures.

    Otherwise, moving such a patch into production is absolutely impossible to explain in a company like Microsoft for their most important core product, Windows.
  • HStewart - Monday, January 14, 2019

    This is ridiculous to think Microsoft did this on purpose, after the Xbox uses the AMD APU's in Xbox One - unless you think Microsoft switching process in future generations.

    Also is possible that there is a problem with design of AMD Zen processors that is causing this issue - what is so different with there processors that will make it be a problem?
  • IGTrading - Monday, January 14, 2019

    @HStewart please mate, don't go there. :) This is not "Microsoft" the company, but somebody @Microsoft made the decision to go forward with this crap patch for Intel Xeon.

    It has nothing to do with AMD ThreadRipper architecture.

    It has to do with ignoring the internal quality testing Microsoft has surely done and pushing out a patch which trashes Microsoft clients with AMD ThreadRipper-based and EPYC-based systems.
  • HStewart - Monday, January 14, 2019

    People love to blame Microsoft for chip problems - I would expect if Microsoft has dependencies on AMD that they work with AMD to write specific drivers to support - so why did Threadripper or previous EPYC systems not have this problem - I guess it simpler just to blame Microsoft for bad design in TR 2
  • rahvin - Monday, January 14, 2019

    The Scheduler is not a driver.
  • tamalero - Tuesday, January 15, 2019

    1) this is a windows scheduler issue, not a driver issue
    2) this does NOT happen on linux.
  • lobz - Tuesday, January 15, 2019

    @HStewart man you just love taking sucker punches, don't you? Chip problems? Bad design? Man... you're killing it today :)
  • npz - Tuesday, January 15, 2019

    EPYC and TR *DOES* have this problem from the very beginning. You should read the article and watch the video. It's not too problematic for EPYC because that market segment by far uses Linux.
  • boeush - Monday, January 14, 2019

    If it was a Zen design issue, it wouldn't affect exclusively Windows. Since Linux is unaffected, this is clearly Microsoft's problem, not AMD's.
