Execution, Load/Store, INT and FP Scheduling

The execution of micro-ops get filters into the Integer (INT) and Floating Point (FP) parts of the core, which each have different pipes and execution ports. First up is the Integer pipe which affords a 168-entry register file which forwards into four arithmetic logic units and two address generation units. This allows the core to schedule six micro-ops/cycle, and each execution port has its own 14-entry schedule queue.

The INT unit can work on two branches per cycle, but it should be noted that not all the ALUs are equal. Only two ALUs are capable of branches, one of the ALUs can perform IMUL operations (signed multiply), and only one can do CRC operations. There are other limitations as well, but broadly we are told that the ALUs are symmetric except for a few focused operations. Exactly what operations will be disclosed closer to the launch date.

The INT pipe will keep track of branching instructions with differential checkpoints, to cut down on storing redundant data between branches (saves queue entries and power), but can also perform Move Elimination. This is where a simple mov command between two registers occurs – instead of inflicting a high energy loop around the core to physically move the single instruction, the core adjusts the pointers to the registers instead and essentially applies a new mapping table, which is a lower power operation.

Both INT and FP units have direct access to the retire queue, which is 192-entry and can retire 8 instructions per cycle. In some previous x86 CPU designs, the retire unit was a limiting factor for extracting peak performance, and so having it retire quicker than dispatch should keep the queue relatively empty and not near the limit.

The Load/Store Units are accessible from both AGUs simultaneously, and will support 72 out-of-order loads. Overall, as mentioned before, the core can perform two 16B loads (2x128-bit) and one 16B store per cycle, with the latter relying on a 44-entry Store queue. The TLB buffer for the L2 cache for already decoded addresses is two level here, with the L1 TLB supporting 64-entry at all page sizes and the L2 TLB going for 1.5K-entry with no 1G pages. The TLB and data pipes are split in this design, which relies on tags to determine if the data is in the cache or to start the data prefetch earlier in the pipeline.

The data cache here also has direct access to the main L2 cache at 32 Bytes/cycle, with the 512 KB 8-way L2 cache being private to the core and inclusive. When data resides back in L1 it can be processed back to either the INT or the FP pipes as required.

Moving onto the floating point part of the core, and the first thing to notice is that there are two scheduling queues here. These are listed as ‘schedulable’ and ‘non-schedulable’ queues with lower power operation when certain micro-ops are in play, but also allows the backup queue to sort out parts of the dispatch in advance via the LDCVT. The register file is 160 entry, with direct FP to INT transfers as required, as well as supporting accelerated recovery on flushes (when data is written to a cache further back in the hierarchy to make room).

The FP Unit uses four pipes rather than three on Excavator, and we are told that the latency in Zen is reduced as well for operations (though more information on this will come at a later date). We have two MUL and two ADD in the FP unit, capable of joining to form two 128-bit FMACs, but not one 256-bit AVX. In order to do AVX, the unit will split the operations accordingly. On the counter side each core will have 2 AES units for cryptography as well as decode support for SSE, AVX1/2, SHA and legacy mmx/x87 compliant code.

Fetch and Decode The Core Complex, Caches, and Fabric
Comments Locked

106 Comments

View All Comments

  • Krysto - Wednesday, August 24, 2016 - link

    I think PCs in general run better on four cores than on two, even if most apps themselves can't take advantage of them, although I think in the next 5 years most new games will take advantage of 8 threads. But otherwise, it's just good for multitasking.
  • tarqsharq - Wednesday, August 24, 2016 - link

    I had an argument with one fellow on the internet regarding i7 being plenty for whatever I was doing in terms of core count. But streaming a show on one monitor while playing Overwatch was hitting 70%+ CPU usage, with all logical cores being 60-70% utilized consistently, with spikes up to 90%+.

    That was on my i7-4770K to be specific, running 1080P on a 144hz monitor for Overwatch, and Crunchyroll for 1080P anime stream on the second monitor.

    So some games combined with slight multitasking is already taxing the 4C/8T environment.
  • galta - Wednesday, August 24, 2016 - link

    And how much multitasking are we really using? If I had to guess, I would say not much, on average.
    You might have some folks here and there using it, but regular users need something between two and four cores, just as you said.
    You have the OS, the software you're using, be it a game or not, plus everything that's running behind the scenes, including Windows ineficiencies, and that's it. But for some weird guy that spends his day on 7zip, more than 4 cores brings no extra power.
    This is the reason why, no matter how excited we might get with 10 cores (I would love one, even if for bragging rights only), our i5s are enough for what we do.
    Maybe in 5 years from now games will be multithreaded, but I'm not holding my breath: something similar was said 5 years ago, and here we are.
    At the end of the day, we still need improvement in per core performance.
  • looncraz - Wednesday, August 24, 2016 - link

    Browsers are becoming better and better at using more cores... and we're all running tens of processes in the background, some of which fire interrupts on a CPU. More cores allows for more going on at the same time without interruptions. You can actually feel this moving to an eight-core FX-8350 from a quad core i5... those eight cores provide a somewhat smoother multi-tasking environment, despite each core being slower and the overall performance being lower.

    Humans are simply sensitive to changes in timing - more cores and more threads reduces the variability in timing, which improves perceived performance.
  • galta - Thursday, August 25, 2016 - link

    Hum....
    I don't know many people who share your opinion about FX-8350 vs i5.
    Anyway, we have been multitasking for a while, a least to some extent: OS, Word, anti-virus, browser. The question is: for this light multitasking, are we better off with several cores with poor performance/core, or with less cores but with great performance/core.
    Reviews and actual people generally prefer the later.
    As of browsers, great news that they are improving, but download/upload speed is by far the most important factor in users experience.
  • Alexvrb - Sunday, August 28, 2016 - link

    Download speed is fine for web browsing if you've got something faster than DSL. How much data exactly do you think you're consuming while browsing the web? Outside of streaming videos you won't use up a ton of bandwidth.
  • Cooe - Thursday, May 6, 2021 - link

    I know this is ANCIENT, but how the hell did you not realize that multi-core optimization was so bad only because nobody could afford greater than >4 core CPU's pre-Zen??? Modern games run freaking TERRIBLE now on 4c/4t i5's.
  • Notmyusualid - Wednesday, August 24, 2016 - link

    No, nope, nej, and nein.

    I see (FEEL) tangible improvements in my computing ever since dropped 2 cores for 4.

    And it looks like others below agree....
  • galta - Thursday, August 25, 2016 - link

    I believe you do, for the sweet spot is now around 4 cores, as I said before.
    The question is: do you believe that your experience will improve significantly if you mo to 6 or 8 cores?
    Probably not, unless you spend your day zipping files or rendering images.
  • Alexvrb - Sunday, August 28, 2016 - link

    They said the same thing about quad cores, and dual cores before that. AMD has to get on top of the curve, not behind it. They'll offer quad cores for more mainstream systems, and 8 for performance rigs. More for servers, and potentially less for low-power and/or low-cost.

Log in

Don't have an account? Sign up now