The Windows and Multithreading Problem (A Must Read)

Unfortunately, not everything is as straightforward as installing Windows 10 and going off on a 128-thread adventure. Most home users run Windows 10 Home or Windows 10 Pro, both of which are fairly ubiquitous even among workstation users. The problem with these operating systems rears its ugly head when we go above 64 threads. To be clear, Microsoft never expected home systems (or even most workstations) to go above this amount, and to a certain extent they are correct.

Whenever Windows detects more than 64 threads in a system, it separates those threads into processor groups. The way this is done is very rudimentary: of the enumerated cores and threads, the first 64 go into the first group, the second 64 go into the next group, and so on. This is most easily observed by going into Task Manager and trying to set the affinity of a particular program:

 

With our 64 core processor, when simultaneous multithreading is enabled, we get a system with 128 threads. This is split into two groups, as shown above.
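The grouping rule described above can be sketched in a few lines of Python. This is a minimal illustration that follows the article's simplified description (fill each group with up to 64 logical processors in enumeration order); the function name `split_into_groups` is hypothetical, and the real Windows assignment may differ, for example by taking NUMA nodes into account.

```python
GROUP_SIZE = 64  # a Windows processor group holds at most 64 logical processors


def split_into_groups(logical_processors, group_size=GROUP_SIZE):
    """Return the size of each processor group, filled in enumeration order.

    Simplified model of the article's description, not the actual
    Windows algorithm.
    """
    if logical_processors < 1:
        raise ValueError("need at least one logical processor")
    full_groups, remainder = divmod(logical_processors, group_size)
    return [group_size] * full_groups + ([remainder] if remainder else [])


# Thread counts taken from the article.
for label, threads in [
    ("3990X, SMT on", 128),
    ("3990X, SMT off", 64),
    ("72-core Xeon Phi, 4-way HT", 288),
]:
    print(f"{label}: {threads} threads -> groups {split_into_groups(threads)}")
```

Under this model the 3990X with SMT enabled lands in two groups of 64, and with SMT disabled it fits in a single group.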

When the system is in this mode, it becomes very tricky for most software to operate properly. When a program is launched, it is placed into one of the processor groups based on load: if one group is busy, the program is spawned in the other. Once running inside a group, a program can only access threads in that same group unless it is processor group aware. This means that a multi-threaded program capable of using 128 threads, if it isn’t built with processor groups in mind, might only have access to 64.
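As a rough illustration of what being "processor group aware" involves, the sketch below asks Windows how many groups exist and how many logical processors each one holds, using the Win32 calls `GetActiveProcessorGroupCount` and `GetActiveProcessorCount` through `ctypes`. The helper name `active_processors_per_group` is hypothetical, and the non-Windows fallback is only there so the script runs elsewhere; it is not how any of the benchmarks in this article detect threads.

```python
import ctypes
import os
import sys


def active_processors_per_group():
    """Return a list with the logical processor count of each processor group."""
    if sys.platform != "win32":
        # Assumption for portability: treat other platforms as one group.
        return [os.cpu_count() or 1]

    from ctypes import wintypes

    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    # WORD GetActiveProcessorGroupCount(void)
    kernel32.GetActiveProcessorGroupCount.restype = wintypes.WORD
    # DWORD GetActiveProcessorCount(WORD GroupNumber)
    kernel32.GetActiveProcessorCount.argtypes = [wintypes.WORD]
    kernel32.GetActiveProcessorCount.restype = wintypes.DWORD

    group_count = kernel32.GetActiveProcessorGroupCount()
    return [kernel32.GetActiveProcessorCount(g) for g in range(group_count)]


counts = active_processors_per_group()
print(f"{len(counts)} processor group(s): {counts}")
print(f"total logical processors: {sum(counts)}")
```

A program that only sizes its thread pool from the processors in its own group would stop at 64 on a two-group system; summing across groups is the first step toward using all of them, though threads still need to be assigned to the other group explicitly.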

If this sounds somewhat familiar, then you may have heard of NUMA, or non-uniform memory access. This occurs when the CPU cores in a system have different latencies to main memory, such as in a dual socket system: a core can quickly access memory attached directly to its own CPU, but access is a lot slower when it needs memory attached to the other physical CPU. Processor groups are one way around this, stopping threads from jumping from CPU to CPU. The only issue here is that despite having 128 threads on the 3990X, it’s all one CPU!

In Windows 10 Pro, this becomes a problem. We can look directly at Task Manager:

Here we see all 64 cores and 128 threads being loaded up with an artificial load. The important number here though is the socket count. The system thinks that we have two sockets, just because we have a high number of threads in the system. This is a big pain, and the source of a lot of slowdowns in some benchmarks.

(Interestingly enough, Intel’s latest Xeon Phi chips with 72 lightweight cores and 4-way HT for 288 threads show up as five sockets. How’s that for pain!)

Of course, there is a simple solution to avoid all of this – disable simultaneous multithreading. This means we still have 64 cores but now there’s only one processor group.

We still have most of the performance on the chip (as we’ll see later in the benchmarks). However, some of the performance has been lost – if I wanted 64 threads, I’d save some money and get the 32-core! There seems to be no easy way around this.

But then we remember that there are different versions of Windows 10.


From Wikipedia

At retail, Microsoft sells Windows 10 Home, Windows 10 Pro, and Windows 10 Pro for Workstations, and we can also find keys for Windows 10 Enterprise for sale. Each of these, aside from the usual feature limitations based on the market, also has limitations on processor counts and sockets. In the diagram above, we can see where it says Windows 10 Home is limited to 64 cores (threads), whereas Pro/Education versions go up to 128, and then Workstation/Enterprise to 256. There’s also Windows Server.

Now the thing is, Workstation and Enterprise are built with multiple processor groups in mind, whereas Pro is not. This comes through scheduler adjustments, which aren’t immediately apparent without digging deeper into the finer elements of the design. We saw significant differences in performance.

In order to see the differences, we did the following comparisons:

  • 3990X with 64 C / 128 T (SMT On), Win10 Pro vs Win10 Ent
  • Win 10 Pro with 3990X, SMT On vs SMT Off

This isn’t just a case of the effect SMT has on overall performance – the way the scheduler and the OS work to make cores available and distribute work are big factors.

3D Particle Movement v2.1

In 3DPM, with standard non-expert code, the difference between SMT on and off is 8.6%; however, moving to Enterprise brings half of it back.

3D Particle Movement v2.1 (with AVX)

When we move to hand-tuned AVX code, the extra threads can be used and per-thread performance gets a 2x speed increase. Here the Enterprise version again gets a small lead over the Pro.

DigiCortex 1.20 (32k Neuron, 1.8B Synapse)

DigiCortex is a more memory bound benchmark, and we see here that disabling SMT scores a massive gain as it frees up CPU-to-memory communication. Enterprise claws back half that gain while keeping SMT enabled.

Agisoft Photoscan 1.3.3, Complex Test

Photoscan is a variable threaded test, but having SMT disabled gives the better performance with each thread having more resources on tap. Again, W10 Enterprise splits the difference between SMT off and on.

NAMD 2.31 Molecular Dynamics (ApoA1)

Our biggest difference was in our new NAMD testing. Here the code is AVX2 accelerated, and the difference to watch out for is with SMT On: going from W10 Pro to W10 Ent is a massive 8.3x speed up. In regular Pro, we noticed that when spawning 128 threads, they would only sit on 16 actual cores or fewer, with the other cores not being utilized. In SMT-Off mode, we saw more of the cores being used, but the score still seemed to be around the same as a 3950X. It wasn’t until we moved to W10 Enterprise that all the threads were actually being used.

Corona 1.3 Benchmark

On the opposite end of the scale, Corona can actually take advantage of different processor groups. We see the improvement moving from SMT off to SMT On, and then another small jump moving to Enterprise.

Blender 2.79b bmw27_cpu Benchmark

Similarly in our Blender test, having processor groups was no problem, and Enterprise gets a small jump.

POV-Ray 3.7.1 Benchmark

POV-Ray benefits from having SMT disabled, regardless of OS version.

Handbrake 1.1.0 - 1080p60 HEVC 3500 kbps Fast

Handbrake, by contrast (due to AVX acceleration), gets a big uplift on Windows 10 Enterprise.

What’s The Verdict?

From our multithreaded test data, there can only be two conclusions. One is to disable SMT, as it seems to get performance uplifts in most benchmarks, given that most benchmarks don’t understand what processor groups are. However, if you absolutely have to have SMT enabled, then don’t use normal Windows 10 Pro: use Pro for Workstations (or Enterprise) instead. At the end of the day, this is the catch in using hardware that's skirting the line of being enterprise-grade: it also skirts the line of triggering enterprise software licensing. Thankfully, workstation software that is outright licensed per core is still almost non-existent, unlike in the server realm.

Ultimately this puts us in a bit of a quandary for our CPU-to-CPU comparisons on the following pages. Normally we run our CPUs on W10 Pro with SMT enabled, but it’s clear from these benchmarks that in every multithreaded scenario, we won’t get the best result. We may have to look at how we test processors with more than 16 cores in the future, and run them on Windows 10 Enterprise. Over the following pages, we’ll include W10 Pro and W10 Enterprise data for completeness.


279 Comments


  • Logic28 - Monday, May 11, 2020 - link

    Link or it didn't happen.

    8180 which has only 28 cores has a list price on NewEgg right now of $11000
    vs the 4k 3990X Threadripper....

    I don't get this need to push out information that is clearly not truthful. The price of these procs need to eventually fall, right now Intel is living off the upgrade path many studies are dug in on, and so you have IT trying to justify a much worse cpu so they dont' have to do a bunch of work replacing all the machines currently getting their assets kicked by a consumer cpu, again at a fraction of the cost.
  • sharath.naik - Saturday, February 8, 2020 - link

    Agree, for a 64 core processor to be fully utilized you need more ram capacity. But we do have 64gb rams already available which means that you can go up to 512GB today. It is an unnecessary limitation.
  • antus - Sunday, February 9, 2020 - link

    It still has use for scientific workloads. Its up to the user to decide if this many cores in this configuration at this low price works for them.
    Its a pitty this article centered so much on windows limitations. Sure some people might want this many cores in a HEDT configuration but i'd like to see linux benchmarks due to it being a free OS that can handle this cpu properly and run scientific workloads. It likely would have a place in the racks of university where I work.
  • GreenReaper - Sunday, February 9, 2020 - link

    Ultimately this is a Windows shop, you need to look to Phoronix or ServeTheHome (which did both). Takeaway is the same but they do more traditional server workloads. For parallel sever tasks, it's great. Most people will want to use one of the cut-down CPUs and use the savings on for RAM/storage.
  • alysdexia - Monday, May 4, 2020 - link

    It's, whom, I'd, CPU, should
  • kardonn - Tuesday, February 11, 2020 - link

    I run a very high end VFX studio and do simulation work for big features, high end commercials, and big productions for Amazon/Netflix. I assure you, 256GB RAM is way more than I've ever needed and will easily be futureproof enough until larger UDIMMS become available one day to unlock the 512GB potential.

    All of my current workstations are 128GB of RAM and it's very rare for me to work on jobs that even approach that limit. 256GB is tons for 99% of the work people will be throwing a 3990X at.
  • alysdexia - Monday, May 4, 2020 - link

    its, hick
  • Logic28 - Monday, May 11, 2020 - link

    You guys are flat out wrong about the usefulness in vfx. I work in vfx, Blur used this chip to render Dark Fate - Terminator. And no single render is going over 128GB in more renders. You don't treat this like a standard server where you are running 4-8 frames/jobs on one machine like you would with say a 8280 with 56 cores, and enough ram to give each job 128 GB for instance.
    You instead put this on lighting artists desk, or a Houdini Physics sims, or you can use it as a server, but only pushing through 1-2 frames at a time on it.
    But here is the kicker people need to compare this to.
    This proc is literally priced at 1/7th to 1/10th the price of the Xeon, and it destroys it in rendering speed.
    So you can increase lighting artist working speed by like several orders of magnitude.

    And no you cannot find the Xeon for $4700 that is comparable. What are you guys fake bots pushing intel prop? Seriously just looked on Newegg.com you can get the 8180 which has 28 cores, for $11000. Which is like less then half the speed of the 3990x. Which is $4k. So you need 2 xeons, at $22000 and dual motherboard add another 2k extra for setup costs, etc.

    So what would you have one Xeon 8280 server with 2 process for $24k and 128GB * 6 Ram
    or
    6 full Xeon 3990x Threadrippers servers each with 128-258GB of ram

    Option 2 gives you literally 7-8 times the rendering power for the same price? I mean, seriously.
    No use, you have no idea about hardware if you think that a machine that is destroying a server 3 times the price.

    Yea it has a place, under my bloody desk, or terradici'd from my closest.

    Again, Blur did brilliant work on Dark Fate, a heavy CG movie, no problem with a server room full of these babies.

    And that is not even talking the fact that the upgrade path for the x3990 has much more potential with a x3999 future, vs the Xeon which is basically on a beast of a die that consumes twice the power consumption for less rendering speed.

    Seriously. Even Premiere benchmarks fall to this and the Ryzen 3950X beast as well vs inteal.

    It is amazing how people just refuse to admit AMD is winning...
  • Santoval - Sunday, February 9, 2020 - link

    It depends on how you define "enthusiasts". If you mean enthusiast *creators* who need a workstation for their work then sure, that's the CPU for them. Video editors, photographers, graphics designers, industrial designers, game designers ... these kinds of creators. It's not just for playing games or merely running benchmarks though. Even for a professional musician it might be overkill.
  • WaltC - Friday, February 7, 2020 - link

    I found this article a bit baffling, frankly. I did not understand the "out of chaos" titling at all...;) But anyway--it should be obvious what AMD is doing here--people running desktops for gaming running Win10 home or Pro are *not* the people the CPU is aimed at--the CPU is aimed at Prosumers who would rather not spend $20k for Intel's inferior solutions but would rather spend $4k for a faster cpu solution and save a cool $16k in the bargain and come out with something appreciably faster. Yes, people are going to run this with Enterprise--duh...;) You aren't going to spend money on a 128t cpu and then run it with a 64t OS--don't even know why Win10 and Win10 Pro were mentioned at all--other than to state they shouldn't be used with the CPU--which would take but a single sentence. Then there the handful of benchmarks used here--how many threads do each of these benchmarks support at maximum? Article didn't say--so that was sort of a strike out, etc. I think Anandtech needs to come back and do this review properly--as it stands, this one makes it seem like the only "chaos" involved is the obvious confusion in the minds of the AT reviewers....;) (No offense) Simply put: if Intel couldn't sell $20k cpu systems Intel wouldn't make them--so obviously, there's a market for 128t cpus--again, duh. You can do much better than Intel at a fraction of the cost--and there's your market! No chaos at all. Also: this CPU is very new--there remain the usual AGESA bios improvements that need to be made in the upcoming months, etc. That fact should have garnered at least a sentence, don't you think? In the past I've seen much better reviews than this--especially for the world's first and only 128t single CPU!
