The Silvermont Module and Caches

Like AMD’s Bobcat and Jaguar designs, Silvermont is modular. The default Silvermont building block is a two-core/two-thread design. Each core is equally capable and there’s no shared execution hardware. Silvermont supports up to 8-core configurations by placing multiple modules in an SoC.

 

Each module features a shared 1MB L2 cache, a 2x increase over the core:cache ratio of existing Atom based processors. Despite the larger L2, access latency is reduced by 2 clocks. The default module size gives a clear indication of where Intel saw Silvermont being most useful. At the time of its inception, I doubt Intel anticipated such a quick shift to quad-core smartphones; otherwise it might’ve considered a larger default module size.

L1 cache sizes/latencies haven’t changed. Each Silvermont core features a 32KB L1 instruction cache and 24KB L1 data cache.

Silvermont Supports Independent Core Frequencies: Vindication for Qualcomm?

In all Intel Core based microprocessors, all cores are tied to the same frequency - those that aren’t in use are simply shut off (power gated) to save power. Qualcomm’s multi-core architecture has always supported independent frequency planes for all CPUs in the SoC, something that Intel has always insisted was a bad idea. In a strange turn of events, Intel joins Qualcomm in offering the ability to run each core in a Silvermont module at its own independent frequency. You could have one Silvermont core running at 2.4GHz and another one running at 1.2GHz. Unlike Qualcomm’s implementation, Silvermont’s independent frequency planes are optional. In a split frequency case, the shared L2 cache always runs at the higher of the two frequencies. Intel believes the flexibility might be useful in some low cost Silvermont implementations where the OS actively uses core pinning to keep threads parked on specific cores. I doubt we’ll see this on most tablet or smartphone implementations of the design.
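The split frequency rule described above can be written down as a one-line model. This is only an illustrative sketch: the function name and the dictionary of per-core clocks are made up for the example, and the only behavior it encodes is the one stated in the text, that the shared L2 runs at the higher of the two core frequencies.

```python
def l2_frequency_mhz(core_freqs_mhz):
    """Toy model: the shared L2 clock follows the fastest core in the module.

    Hypothetical helper for illustration only; it is not an Intel API.
    """
    freqs = list(core_freqs_mhz)
    if not freqs:
        raise ValueError("a module needs at least one core")
    return max(freqs)


# The example from the text: one core at 2.4GHz, the other at 1.2GHz.
module = {"core0": 2400, "core1": 1200}
print(l2_frequency_mhz(module.values()))
```

With both cores at the same clock the model simply returns that clock, which matches the default (non-split) configuration.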

From FSB to IDI

Atom and all of its derivatives have a nasty secret: they never really got any latency benefits from integrating a memory controller on die. The first implementation of Atom was a 3-chip solution, with the memory controller contained within the North Bridge. The CPU talked to the North Bridge via a low power Front Side Bus implementation. This setup should sound familiar to anyone who remembers Intel architectures from the late 90s up to the mid 2000s. In pursuit of integration, Intel eventually brought the memory controller and graphics onto a single die. Historically, bringing the memory controller onto the same die as the CPU came with a nice reduction in access latency - unfortunately Atom never enjoyed this. The reasoning? Atom never ditched the FSB interface.

Even though Atom integrated a memory controller, the design logically looked like it did before. Integration only saved Intel space and power; it never granted any performance. I suspect Intel did this to keep costs down. I noticed the problem years ago but completely forgot about it since it’s been so long. Thankfully, with Silvermont the FSB interface is completely gone.

Silvermont instead integrates the same in-die interconnect (IDI) that is used in the big Core based processors. Intel’s IDI is a lightweight point to point interface that’s far lower overhead than the old FSB architecture. The move to IDI and the changes to the system fabric are enough to improve single threaded performance by low double digits. The gains are even bigger in heavily threaded scenarios.

Another benefit of moving away from a very old FSB to IDI is increased flexibility in how Silvermont can clock up/down. Previously there were fixed FSB:CPU ratios that had to be maintained at all times, which meant the FSB had to be lowered significantly when the CPU was running at very low frequencies. In Silvermont, the IDI and CPU frequencies are largely decoupled - enabling good bandwidth out of the cores even at low frequency levels.

The System Agent

Silvermont gains an updated system agent (read: North Bridge) that’s much better at allowing access to main memory. In all previous generation Atom architectures, virtually all memory accesses had to happen in-order (Clover Trail had some minor OoO improvements here). Silvermont’s system agent now allows reordering of memory requests coming in from all consumers/producers (e.g. CPU cores, GPU, etc...) to optimize for performance and quality of service (e.g. ensuring graphics demands on memory can regularly pre-empt CPU requests when necessary).
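To make the idea of reordering requests for quality of service concrete, here is a minimal sketch of a priority queue that lets an urgent request jump ahead of earlier ones. Everything in it is an assumption made for illustration: the class name, the `urgent` flag, and the two-level priority scheme are hypothetical and say nothing about how Intel's system agent actually arbitrates.

```python
import heapq
from itertools import count


class ToySystemAgent:
    """Toy out-of-order memory request queue (not Intel's arbitration logic)."""

    def __init__(self):
        self._heap = []
        # Arrival counter used as a tie-breaker so requests of equal
        # priority are still served in the order they arrived.
        self._arrival = count()

    def submit(self, source, address, urgent=False):
        # Assumed two-level scheme: urgent requests (0) beat normal ones (1).
        priority = 0 if urgent else 1
        heapq.heappush(self._heap, (priority, next(self._arrival), source, address))

    def drain(self):
        """Return all pending requests in the order they would be served."""
        order = []
        while self._heap:
            _, _, source, address = heapq.heappop(self._heap)
            order.append((source, address))
        return order


agent = ToySystemAgent()
agent.submit("cpu0", 0x1000)
agent.submit("cpu1", 0x2000)
agent.submit("gpu", 0x8000, urgent=True)  # e.g. a display deadline is close
for source, address in agent.drain():
    print(source, hex(address))
```

In this toy model the GPU request is served first even though it arrived last, while the two CPU requests keep their relative order. A strictly in-order agent, as in earlier Atom designs, would have served all three in arrival order.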


174 Comments


  • tech4real - Tuesday, May 7, 2013

    "Absolute performance"? Do we consider power constraint here at all? Atom is optimized for power-efficiency. All the current information I've seen so far suggests Silvermont will outperform A15 by a large margin in terms of power efficiency. If we throw away power constraint, Intel has Core to take care of that.
  • Wilco1 - Tuesday, May 7, 2013

    I was talking about peak performance, but yes, power consumption matters too. What we've seen so far is Intel marketing suggesting that in 6-9 months time Silvermont will be more efficient than A15 was 12 months earlier. However that's not what Silvermont will have to compete with. At the end of this year A15 will have had 2 process shrinks down to 20nm in addition to a lot of tuning, so it will be far more efficient than it was 12 months ago. And A15 is just one example, Apple, QC and ARM will have new cores as well. It's reasonable to say that Intel will finally be able to compete with Silvermont, but it is far from clear that it is the overall winner like their marketing claims.
  • tech4real - Wednesday, May 8, 2013

    TSMC's 20nm process is still in the works, your Q4'13 volume production estimate seems way too optimistic, especially considering TSMC's pain in 28nm ramp. Also 28nm->20nm shrink without FinFET significantly reduces its benefit.
  • Wilco1 - Wednesday, May 8, 2013

    TSMC have learnt from the 28nm problems. They appear very aggressive this time, and so far the reports are they are 2 months ahead of schedule. Even if it ends up delayed to Q2'14 it's still around the same time Intel is planning to come out with Silvermont phones. The gains are not as large as with FinFETs, but enough to reduce power significantly.
  • tech4real - Wednesday, May 8, 2013

    My understanding is Q2'14 volume production with high yield is almost TSMC 20nm best case scenario. Of course, the term "high yield" is such a subjective thing vendors love to manipulate with almost infinite freedom...
  • zeo - Wednesday, May 8, 2013

    TSMC 20nm isn't set up for such optimization, but rather focused on cost reductions... The number of nodes, variations supported, etc will be fewer than they did with 28nm as they want to avoid the problems that caused the 28nm delays and that has resulted in a much more streamlined setup.

    While power leakage issues increase as FAB size is decreased... So without a solution like FinFET the power efficiency would be increasingly harder to keep where it is, let alone reduce it...

    It's one of the reasons why ARM is trying to push other options like Big.LITTLE to boost operational efficiencies and not rely as much on FAB improvements.

    While it's also why not all ARM SoCs have moved to 28nm yet as for many the power leakage was still too much of an issue for their designs to make the switch right away, so there could be additional delays for 20nm releases.

    Though ARM should get FinFET in time for the 64bit release... but by that time Intel would be on its way to 14nm...
  • Jaybus - Wednesday, May 8, 2013

    Think of it as 2-wide x86 vs. 3-wide RISC. Rather than translating the x86 microcoded instruction into 2 or 3 RISC-like instructions, Intel's decode keeps it a single instruction down the pipeline. The RISC architecture has to decode more instructions, so needs the 3-wide to keep up with the x86 2-wide.

    The point about the frequency scaling is this. The tri-gate design has a gate on top of 2 vertical gates. This gives it 3x the surface area as compared to FinFET. The greater surface area allows more electrons to flow within a given area of the die, and that allows a greater range of voltages and/or frequencies for which it can operate efficiently.
  • Wilco1 - Thursday, May 9, 2013

    Even though macro-ops help decode, they need to be expanded before they are executed. So in terms of execution, macro-ops don't help. Also as I mentioned in an earlier post, most ARMs also support macro-ops, allowing a 2-way ARM to behave like a 4-way. So macro-ops don't give x86 an advantage over RISC.
  • jemima puddle-duck - Monday, May 6, 2013

    Without wishing to be overly cynical, Anandtech has a history of 'NOW Intel will win the mobile war' articles, which get recycled then forgotten in time for the next launch. It's all very clever stuff, but curiously underwhelming also.
  • Roffles12 - Monday, May 6, 2013

    I don't remember reading any 'NOW Intel will win the mobile war' articles on Anandtech. Perhaps your perception is skewed. Intel articles are typically of a technical nature discussing the inner workings of the architecture and fab process or discussing benchmarks. Intel is really the only company so completely open about how their technology works, so why not make it a point of discussion on a website dedicated to the subject? If your head is clouded by FUD from competing companies and the constantly humming rumor mill, maybe you need to back off for a while. At the end of the day, it's up to you to digest this information and form an opinion.
