The 64 Core Threadripper 3990X CPU Review: In The Midst Of Chaos, AMD Seeks Opportunity
by Dr. Ian Cutress & Gavin Bonshor on February 7, 2020 9:00 AM EST
The Windows and Multithreading Problem (A Must Read)
Unfortunately, not everything is as straightforward as installing Windows 10 and going off on a 128 thread adventure. Most home users that have Windows typically run Windows 10 Home or Windows 10 Pro, which are both fairly ubiquitous even among workstation users. The problem with these operating systems rears its ugly head when we go above 64 threads. Now to be clear, Microsoft never expected home (or even most workstation) systems to go above this amount, and to a certain extent they are correct.
Whenever Windows detects more than 64 threads in a system, it separates those threads into processor groups. The way this is done is very rudimentary: of the enumerated cores and threads, the first 64 go into the first group, the second 64 go into the next group, and so on. This is most easily observed by going into Task Manager and trying to set the affinity of a particular program:
With our 64 core processor, when simultaneous multithreading is enabled, we get a system with 128 threads. This is split into two groups, as shown above.
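As a rough illustration of the grouping described here, the sketch below splits an enumerated list of logical processors into consecutive groups of at most 64. The function name `split_into_processor_groups` and the plain in-order split are assumptions based on the description in this article, not Windows’ actual implementation.

```python
from typing import List

GROUP_SIZE = 64  # maximum logical processors per group, as described in the article


def split_into_processor_groups(logical_processors: int,
                                group_size: int = GROUP_SIZE) -> List[range]:
    """Split logical processor indices 0..N-1 into consecutive groups."""
    if logical_processors < 0 or group_size <= 0:
        raise ValueError("counts must be non-negative and group size positive")
    return [
        range(start, min(start + group_size, logical_processors))
        for start in range(0, logical_processors, group_size)
    ]


# 64 cores with SMT enabled expose 128 logical processors.
groups = split_into_processor_groups(128)
print(len(groups), [len(g) for g in groups])  # 2 groups of 64 each
```

Under this simplified model, any system with 64 or fewer logical processors ends up in a single group.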
When the system is in this mode, it becomes very tricky for most software to operate properly. When a program is launched, it will be pushed into one of the processor groups based on load; if one group is busy, the program will be spawned in the other. Once the program is running inside a group, it can only access other threads in the same group unless it is processor group aware. This means that a multi-threaded program that could use 128 threads, but isn’t built with processor groups in mind, might only spawn with access to 64.
If this sounds somewhat familiar, then you may have heard of NUMA, or non-uniform memory access. This occurs when the CPU cores in the system have different latencies to main memory, such as within a dual socket system: it can be quick for a core to access the memory directly attached to its own CPU, but a lot slower if it needs to access memory attached to the other physical CPU. Processor groups are one way around this, to stop threads jumping from CPU to CPU. The only issue here is that despite having 128 threads on the 3990X, it’s all one CPU!
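One way to picture the 64-thread ceiling: a classic Windows affinity mask is a single pointer-sized integer (64 bits on 64-bit Windows) with one bit per logical processor, so on its own it can only describe processors inside one group. The sketch below is a simplified model of that idea. The helper names `affinity_mask` and `usable_threads` are hypothetical and are not part of any Windows API.

```python
MASK_BITS = 64  # width of a single-group affinity mask on 64-bit Windows


def affinity_mask(cpus_in_group):
    """Build a 64-bit mask with one bit set per group-relative CPU index."""
    mask = 0
    for cpu in cpus_in_group:
        if not 0 <= cpu < MASK_BITS:
            raise ValueError(f"CPU {cpu} does not fit in a {MASK_BITS}-bit mask")
        mask |= 1 << cpu
    return mask


def usable_threads(requested_threads, total_logical, group_aware):
    """Threads that can run at once under this simplified model."""
    # A program that is not group aware is confined to a single group.
    limit = total_logical if group_aware else min(total_logical, MASK_BITS)
    return min(requested_threads, limit)


print(hex(affinity_mask(range(64))))                # all 64 bits set
print(usable_threads(128, 128, group_aware=False))  # confined to one group
print(usable_threads(128, 128, group_aware=True))   # can span both groups
```

In this model, a program that asks for 128 threads on the 3990X gets 64 of them running concurrently unless it opts in to spanning groups, which matches the behaviour described above.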
In Windows 10 Pro, this becomes a problem. We can look directly at Task Manager:
Here we see all 64 cores and 128 threads being loaded up with an artificial load. The important number here though is the socket count. The system thinks that we have two sockets, just because we have a high number of threads in the system. This is a big pain, and the source of a lot of slowdowns in some benchmarks.
(Interestingly enough, Intel’s latest Xeon Phi chips with 72 lightweight cores and 4-way HT for 288 threads show up as five sockets. How’s that for pain!)
Of course, there is a simple solution to avoid all of this – disable simultaneous multithreading. This means we still have 64 cores but now there’s only one processor group.
We still have most of the performance on the chip (as we’ll see later in the benchmarks). However, some of the performance has been lost – if I wanted 64 threads, I’d save some money and get the 32-core! There seems to be no easy way around this.
But then we remember that there are different versions of Windows 10.
From Wikipedia
Microsoft at retail sells Windows 10 Home, Windows 10 Pro, Windows 10 Pro for Workstations, and we can also find keys for Windows 10 Enterprise for sale. Each of these, aside from the usual feature limitations based on the market, also has limitations on processor counts and sockets. In the diagram above, we can see where it says Windows 10 Home is limited to 64 cores (threads), whereas Pro/Education versions go up to 128, and then Workstation/Enterprise to 256. There’s also Windows Server.
Now the thing is, Workstation and Enterprise are built with multiple processor groups in mind, whereas Pro is not. This comes through scheduler adjustments, which aren’t immediately apparent without digging deeper into the finer elements of the design. We saw significant differences in performance.
In order to see the differences, we did the following comparisons:
- 3990X with 64 C / 128 T (SMT On), Win10 Pro vs Win10 Ent
- Win 10 Pro with 3990X, SMT On vs SMT Off
This isn’t just a case of the effect SMT has on overall performance: the way the scheduler and the OS work to make cores available and distribute work is a big factor.
In 3DPM, with standard non-expert code, the difference between SMT on and off is 8.6%; however, moving to Enterprise brings half of it back.
When we move to hand-tuned AVX code, the extra threads can be used and per-thread gets a 2x speed increase. Here the Enterprise version again gets a small lead over the Pro.
DigiCortex is a more memory bound benchmark, and we see here that disabling SMT scores a massive gain as it frees up CPU-to-memory communication. Enterprise claws back half that gain while keeping SMT enabled.
Photoscan is a variable threaded test, but having SMT disabled gives the better performance with each thread having more resources on tap. Again, W10 Enterprise splits the difference between SMT off and on.
Our biggest difference was in our new NAMD testing. Here the code is AVX2 accelerated, and the difference to watch out for is with SMT On: going from W10 Pro to W10 Ent gives a massive 8.3x speed-up. In regular Pro, we noticed that when spawning 128 threads, they would only sit on 16 actual cores or fewer, with the other cores not being utilized. In SMT-Off mode, we saw more of the cores being used, but the score still seemed to be around the same as a 3950X. It wasn’t until we moved to W10 Enterprise that all the threads were actually being used.
On the opposite end of the scale, Corona can actually take advantage of different processor groups. We see the improvement moving from SMT off to SMT On, and then another small jump moving to Enterprise.
Similarly in our Blender test, having processor groups was no problem, and Enterprise gets a small jump.
POV-Ray benefits from having SMT disabled, regardless of OS version.
Whereas Handbrake (due to AVX acceleration) gets a big uplift on Windows 10 Enterprise.
What’s The Verdict?
From our multithreaded test data, there can only be two conclusions. One is to disable SMT, as it seems to get performance uplifts in most benchmarks, given that most benchmarks don’t understand what processor groups are. However, if you absolutely have to have SMT enabled, then don’t use normal Windows 10 Pro: use Pro for Workstations (or Enterprise) instead. At the end of the day, this is the catch in using hardware that's skirting the line of being enterprise-grade: it also skirts the line with triggering enterprise software licensing. Thankfully, workstation software that is outright licensed per core is still almost non-existent, unlike the server realm.
Ultimately this puts us in a bit of a quandary for our CPU-to-CPU comparisons on the following pages. Normally we run our CPUs on W10 Pro with SMT enabled, but it’s clear from these benchmarks that in every multithreaded scenario, we won’t get the best result. We may have to look at how we test processors >16 cores in the future, and run them on Windows 10 Enterprise. Over the following pages, we’ll include W10 Pro and W10 Enterprise data for completeness.
279 Comments
GreenReaper - Saturday, February 8, 2020
64 sockets, 64 cores, 64 threads per CPU - x64 was never intended to surmount these limits. Heck, affinity groups were only introduced in Windows XP and Server 2003.
Unfortunately they hardcoded the 64-CPU limit in by using a DWORD and had to add Processor Groups as a hack added in Win7/2008 R2 for the sake of a stable kernel API.
Linux's sched_setaffinity() had the foresight to use a length parameter and a pointer: https://www.linuxjournal.com/article/6799
I compile my kernels to support a specific number of CPUs, as there are costs to supporting more, albeit relatively small ones (it assumes that you might hot-add them).
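For readers curious about the Linux interface mentioned in the comment above, Python wraps it as `os.sched_getaffinity` and `os.sched_setaffinity`, which take a set of CPU indices of any size rather than a fixed-width mask. The following is a minimal sketch. The helper names `current_affinity` and `pin_to` are hypothetical, and the fallback for platforms without these calls is an assumption made so the example still runs there.

```python
import os


def current_affinity():
    """Return the set of CPU indices this process may run on."""
    if hasattr(os, "sched_getaffinity"):      # available on Linux
        return set(os.sched_getaffinity(0))   # 0 means the calling process
    # Assumed fallback elsewhere: treat every reported CPU as allowed.
    return set(range(os.cpu_count() or 1))


def pin_to(cpus):
    """Restrict this process to the given CPUs where supported."""
    if hasattr(os, "sched_setaffinity"):
        os.sched_setaffinity(0, cpus)
    return current_affinity()


allowed = current_affinity()
print(f"allowed CPUs: {sorted(allowed)}")
print(f"after re-applying the same set: {sorted(pin_to(allowed))}")
```

Because the CPU set is variable length, the same call works whether the machine has 8 logical processors or several hundred.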
Gonemad - Friday, February 7, 2020
Seeing a $4k processor clubbing a $20k processor to death and take its lunch (in more than one metric) is priceless.
If you know what you need, you can save 15 to 16 grand building an AMD machine, and that's incredible.
It shows how greedy and lazy Intel has become.
It may not be the best chip for, say, a gaming machine, but it can beat a 20-grand intel setup, and that ensures a spot for the chip, not being useless.
Khenglish - Friday, February 7, 2020
I doubt that really anyone would practically want to do this, but in Windows 10 if you disable the GPU driver, games and benchmarks will be fully CPU software rendered. I'm curious how this 64 core beast performs as a GPU!
Hulk - Friday, February 7, 2020
Not very well. Modern GPU's have thousands of specialized processors.
Kevin G - Friday, February 7, 2020
The shaders themselves are remarkably programmable. The only thing really missing from them and more traditional CPU's in terms of capability is how they handle interrupts for IO. Otherwise they'd be functionally complete. Granted the per-thread performance would be abysmal compared to modern CPUs which are fully pipelined, OoO monsters. One other difference is that since GPU tasks are embarrassingly parallel by nature, these shaders have hardware thread management to quickly switch between them and partition resources to achieve some fairly high utilization rates.
The real specialization is in the fixed function units for their TMUs and ROPs.
willis936 - Friday, February 7, 2020
Will they really? I don’t think graphics APIs fall back on software rendering for most essential features.
hansmuff - Friday, February 7, 2020
That is incorrect. Software rendering is never done by Windows just because you don't have rendering hardware. Games no longer come with software renderers like they used to many, many moons ago.
Khenglish - Friday, February 7, 2020
I love how everyone had to jump in and said I was wrong without spending 30 seconds to disable their GPU driver and try it themselves and finding they are wrong.
There's a lot of issues with the Win10 software renderer (full screen mode mostly broken, only DX11 seems supported), but it does work. My Ivy Bridge gets fully loaded at 70W+ just to pull off 7 fps at 640x480 in Unigine Heaven, but this is something you can do.
extide - Friday, February 7, 2020
No -- the Windows UI will drop back to software mode but games have not included software renderers for ~two decades.
FunBunny2 - Friday, February 7, 2020
" games have not included software renderers for ~two decades."
which is a deja vu experience: in the beginning DOS was a nice, benign, control program. then Lotus discovered that the only way to run 1-2-3 faster than molasses uphill in winter was to fiddle the hardware directly, which DOS was happy to let it do. it didn't take long for the evil folks to discover that they could too, and virus was born. one has to wonder how much exposure these latest GPU hardware present?