It Does Multiple Threads Though: The Case for SMT

Even though Atom is a 2-issue design, it's not always easy to execute two instructions from a single thread in parallel because of data dependencies between them. Intel's solution was to enable SMT (Simultaneous Multi-Threading) on Atom (not on all models, unfortunately), allowing up to two threads to execute concurrently. Welcome the return of Hyper-Threading.

Remember the rule of thumb for power/performance tradeoffs? Intel's decision to enable SMT on Atom was the perfect example of just that. SMT increased power consumption by less than 20% on Atom, but it also yielded a 30 - 50% increase in performance on the in-order core. The decision couldn't have been easier.
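To put those two figures side by side, here is a quick back-of-the-envelope sketch. It assumes the worst case quoted above for power (a full 20% increase) and simply divides the performance gain by the power gain; the function name is made up for illustration, and the result is a rough ratio, not a measured number.

```python
def perf_per_watt_gain(perf_gain, power_gain):
    """Relative change in performance per watt.

    perf_gain and power_gain are fractions (0.30 means +30%).
    Hypothetical helper, not something from Intel's documentation.
    """
    return (1.0 + perf_gain) / (1.0 + power_gain) - 1.0


# Assumption: power goes up by the full 20% (the article says "less than 20%").
POWER_GAIN = 0.20

low = perf_per_watt_gain(0.30, POWER_GAIN)   # pessimistic end of the 30 - 50% range
high = perf_per_watt_gain(0.50, POWER_GAIN)  # optimistic end of the 30 - 50% range

print(f"perf/watt improves by roughly {low:.1%} to {high:.1%}")
```

Even with the pessimistic power figure, performance per watt improves at both ends of the range, which is why the tradeoff was so easy to accept.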

The Atom has a 32-entry instruction scheduling queue, but with SMT enabled each thread gets its own 16-entry queue. The scheduler doesn't have to switch between threads every clock; it can do so intelligently. The only limitation is that it can dispatch just 2 ops per clock (since it is a 2-wide machine). If one thread is waiting on data to complete an instruction, on the next clock tick the scheduler can choose to dispatch an op from the other thread that will hopefully be able to execute.
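The dispatch behavior described above can be sketched with a toy model. This is not how Atom's scheduler is actually implemented: it assumes each op is just a count of cycles it must wait on data once it reaches the head of its thread's queue, it ignores the 16-entry queue limit, and it always gives the first thread priority. The only details taken from the article are the in-order issue within a thread and the 2-ops-per-clock limit.

```python
from collections import deque

ISSUE_WIDTH = 2  # from the article: at most 2 ops dispatched per clock


def run(threads):
    """Return the number of clocks needed to dispatch every op.

    Each thread is a list of stall counts: how many clocks the op must
    wait on data once it is at the head of its (in-order) queue.
    Toy model with made-up inputs, not a description of real hardware.
    """
    queues = [deque(t) for t in threads]
    cycles = 0
    while any(queues):
        cycles += 1
        slots = ISSUE_WIDTH
        # In-order per thread: only the head op of each queue may dispatch.
        for q in queues:
            while slots > 0 and q and q[0] == 0:
                q.popleft()
                slots -= 1
        # Any head op still waiting on data gets one clock closer to ready.
        for q in queues:
            if q and q[0] > 0:
                q[0] -= 1
    return cycles


# Hypothetical workloads: each thread hits one 3-clock data stall.
thread_a = [0, 3, 0, 0]
thread_b = [0, 0, 3, 0]

alone = run([thread_a]) + run([thread_b])  # run one after the other
together = run([thread_a, thread_b])       # run with SMT-style sharing
print(alone, together)
```

In this toy model the two threads take 9 clocks back to back but only 6 when they share the dispatch slots, because one thread's ops fill slots that would otherwise sit empty during the other's stall.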

Making Atom multithreaded made perfect sense from a logical standpoint. The downside to an in-order core is that if an instruction is waiting on data before it can begin execution, the rest of the pipeline stalls while that dependency is resolved. It's highly unlikely that two independent instructions from two independent threads will both miss in cache at the same time.
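A rough way to see why: if each thread is stalled on a cache miss some fraction of the time, and the two threads really are independent, the chance that both are stalled at once is the product of the two fractions. The 20% figure below is a made-up illustration, not a measured miss rate for Atom.

```python
def both_stalled(p_thread_a, p_thread_b):
    """Probability that two independent threads are stalled at the same time."""
    return p_thread_a * p_thread_b


# Assumption for illustration only: each thread is stalled 20% of the time.
p_stall = 0.20

one_thread_idle = p_stall                       # single thread: core idles whenever it stalls
two_thread_idle = both_stalled(p_stall, p_stall)  # SMT: core idles only if both stall

print(f"idle without SMT: {one_thread_idle:.0%}, idle with SMT: {two_thread_idle:.0%}")
```

The independence assumption is doing the heavy lifting here; two threads fighting over the same cache would stall together more often than this simple product suggests.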

Execution Units

Atom isn't a superwide processor. With an in-order front end and no on-die memory controller, it's unlikely that we'll see tremendous instruction throughput. Data dependencies would ensure that tons of execution units sat idle, so Atom's designers included only the bare minimum when it came to execution units.

There's no dedicated integer multiplier or divider; these functions are shared with the SIMD FP units. There are two SSE units, and the scheduler can dispatch either a float or an integer SIMD op to both ports in a given clock.

All of the functional units are 64 bits wide, with the exception of the SIMD integer and single-precision FP ADD paths, which support full-width operation.
