DDR4

Intel and the DRAM world are switching over to DDR4, and with good reason. DDR4 is a large step forward; the highlights include:

  • Speeds up to 3200 MT/s (1.6 GHz clock, double data rate)
  • Lower DRAM I/O voltage (1.2 V instead of 1.5 V VDDQ)
  • Twice the capacity (using the same DRAM chips)
  • Improved RAS
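
As a rough illustration of what these transfer rates mean, the sketch below converts MT/s into an I/O clock and a theoretical peak bandwidth. It assumes a 64-bit data bus per channel (ECC bits excluded). The function names are ours and purely illustrative, and sustained bandwidth in practice will be lower than this peak.

```python
def io_clock_mhz(mega_transfers_per_s: float) -> float:
    """Double data rate: two transfers per clock, so the clock is half the MT/s."""
    return mega_transfers_per_s / 2


def peak_bandwidth_gb_s(mega_transfers_per_s: float, bus_width_bits: int = 64) -> float:
    """Theoretical peak bandwidth of one channel in GB/s (1 GB = 1e9 bytes).

    bus_width_bits=64 is an assumption: the data width of a standard DIMM
    channel without the extra ECC bits.
    """
    bytes_per_transfer = bus_width_bits / 8
    return mega_transfers_per_s * 1e6 * bytes_per_transfer / 1e9


if __name__ == "__main__":
    for speed in (1866, 2133, 3200):
        print(f"{speed} MT/s: clock {io_clock_mhz(speed):.0f} MHz, "
              f"peak {peak_bandwidth_gb_s(speed):.1f} GB/s per channel")
```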

The improvements start with the internal organization. A DDR3 chip has eight independent banks, while DDR4 comes with 16 banks, organized in a 4x4 configuration: four bank groups of four banks each. More banks mean that more pages can stay open (more page hits, lower latency) at the cost of a small power increase, which is completely offset by a whole range of power efficiency features (see below). The power efficiency gains are rather large. Samsung quantifies them in the slide below.
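
To see why more banks can translate into more page hits, consider the toy model below. It is a deliberately simplified sketch, not how a real memory controller maps addresses: each bank keeps exactly one row open, and each of 12 hypothetical access streams is pinned to bank `stream_id % num_banks`. All names and the access pattern are assumptions made for illustration.

```python
def row_hit_rate(accesses, num_banks):
    """Fraction of accesses that find their row already open.

    accesses: iterable of (stream_id, row) pairs (hypothetical trace format).
    Each bank remembers only its most recently opened row.
    """
    open_rows = {}  # bank index -> currently open row
    hits = 0
    total = 0
    for stream_id, row in accesses:
        bank = stream_id % num_banks  # assumed, simplistic bank mapping
        if open_rows.get(bank) == row:
            hits += 1
        else:
            open_rows[bank] = row  # row miss: close the old row, open the new one
        total += 1
    return hits / total


# 12 streams, each repeatedly touching its own row, interleaved round-robin.
accesses = [(s, 100 + s) for _ in range(10) for s in range(12)]

if __name__ == "__main__":
    print("8 banks :", row_hit_rate(accesses, 8))
    print("16 banks:", row_hit_rate(accesses, 16))
```

With eight banks, some streams share a bank and keep evicting each other's rows; with 16 banks every stream in this trace gets a bank to itself, so only the first access per stream misses.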

Samsung claims about 21% lower power thanks to the drop in operating voltage (1.5 V to 1.2 V). Low Power DDR4 will run at 1.05 V and will lower power usage even further. But there is more to DDR4 than a lower voltage. Samsung claims that, when both are manufactured on the same process technology, DDR4 runs at two-thirds of the power DDR3L needs.

Micron gives a breakdown of the features that make DDR4 more power efficient, beyond the obvious drop in VDDQ.

Note that the total power efficiency increase is 30-35%, and this is not just a result of the VDD reduction (20%). In that sense, DDR4 is a larger step forward than previous DDR technology transitions. Of course, the 30-35% improvement in power efficiency is measured with RAM running at the same speed. It's also possible to run DDR4 at much higher speeds (3200 MT/s vs. 1866 MT/s) while sacrificing some of the power savings. The DDR4 memory that we are using for testing runs at 2100 MT/s, a good compromise between a mild speed increase and power efficiency.
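
The 20% figure lines up with the plain voltage ratio, which the short sketch below makes explicit. It also shows the textbook first-order rule that CMOS switching power scales with the square of the supply voltage. That rule is our addition for context, not something the vendor slides state, and the quoted 21% and 30-35% figures come from the vendors, not from this formula. Function names are illustrative.

```python
def voltage_reduction(v_old: float, v_new: float) -> float:
    """Fractional drop in supply voltage, e.g. 0.2 for a 20% drop."""
    return 1 - v_new / v_old


def dynamic_power_ratio(v_old: float, v_new: float) -> float:
    """First-order CMOS estimate: switching power ~ C * V^2 * f.

    Returns new/old power at equal capacitance and frequency. This ignores
    leakage, termination, and any rail that does not change, so it is only
    a rough reference point, not a prediction for a real DIMM.
    """
    return (v_new / v_old) ** 2


if __name__ == "__main__":
    drop = voltage_reduction(1.5, 1.2)
    ratio = dynamic_power_ratio(1.5, 1.2)
    print(f"Voltage drop 1.5 V -> 1.2 V: {drop:.0%}")
    print(f"V^2 switching-power ratio:   {ratio:.2f}")
```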

A more elaborate discussion will follow in our next server memory article, but each bank also has much smaller rows (four times smaller), so the DRAM can cycle much faster. The result is lower latency.

The improved signal-to-noise ratio and the extra pins for addressing allow DDR4 to support eight DRAM stacks instead of the four that DDR3 supports. As a result, DDR4 can support twice the capacity of DDR3 using the same (4-16Gb) DRAM chips. This will require the use of 3D stacking technology, which will take time to implement. However, since 8Gb chips are now in use, Registered DIMMs of 32GB should soon be a reality, as well as 64GB LRDIMMs. We'll discuss this in more detail on the next page.
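
The 32GB and 64GB figures can be sanity-checked with simple arithmetic. The sketch below is a back-of-the-envelope calculation under our own assumptions: x4 DRAM devices on a 64-bit data bus (ECC chips not counted), a dual-rank RDIMM, and a quad-rank LRDIMM. It leaves 3D stacking out entirely, and the function name is illustrative.

```python
def dimm_capacity_gb(chip_gbit: int, device_width_bits: int, ranks: int,
                     data_bus_bits: int = 64) -> float:
    """Usable DIMM capacity in GB, excluding ECC chips.

    chip_gbit:         density of one DRAM chip in gigabits (e.g. 8)
    device_width_bits: data pins per chip (x4 -> 4, x8 -> 8)
    ranks:             number of ranks on the module (assumed, see lead-in)
    """
    chips_per_rank = data_bus_bits // device_width_bits
    total_gbit = chip_gbit * chips_per_rank * ranks
    return total_gbit / 8  # 8 bits per byte


if __name__ == "__main__":
    print("Dual-rank x4, 8Gb chips:", dimm_capacity_gb(8, 4, ranks=2), "GB")
    print("Quad-rank x4, 8Gb chips:", dimm_capacity_gb(8, 4, ranks=4), "GB")
```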


85 Comments


  • martinpw - Monday, September 8, 2014

    There is a nice tool called i7z (can google it). You need to run it as root to get the live CPU clock display.
  • kepstin - Monday, September 8, 2014

    Most Linux distributions provide a tool called "turbostat" which prints statistical summaries of real clock speeds and C-state usage on Intel CPUs.
  • kepstin - Monday, September 8, 2014

    Note that if turbostat is missing or too old (doesn't support your cpu), you can build it yourself pretty quick - grab the latest linux kernel source, cd to tools/power/x86/turbostat, and type 'make'. It'll build the tool in the current directory.
  • julianb - Monday, September 8, 2014

    Finally the e5-xxx v3s have arrived. I too can't wait for the Cinebench and 3DS Max benchmark results.
    Any idea if now that they are out the e5-xxxx v2s will drop down in price?
    Or Intel doesn't do that...
  • MrSpadge - Tuesday, September 9, 2014

    Correct, Intel does not really lower prices of older CPUs. They just gradually phase out.
  • tromp - Monday, September 8, 2014

    As an additional test of the latency of the DRAM subsystem, could you please run the "make speedup" scaling benchmark of my Cuckoo Cycle proof-of-work system at https://github.com/tromp/cuckoo ?
    That will show if 72 threads (2 cpus with 18 hyperthreaded cores) suffice to saturate the DRAM subsystem with random accesses.

    -John
  • Hulk - Monday, September 8, 2014

    I know this is not the workload these parts are designed for, but just for kicks I'd love to see some media encoding/video editing apps tested. Just to see what this thing can do with a well-coded mainstream application. Or to see where the apps fade out core-wise.
  • Assimilator87 - Monday, September 8, 2014

    Someone benchmark F@H bigadv on these, stat!
  • iwod - Tuesday, September 9, 2014

    I am looking forward to a native 16-core die, 14nm Broadwell next year, and matured DDR4 with much better pricing.
  • Brutalizer - Tuesday, September 9, 2014

    Yawn, the new upcoming SPARC M7 cpu has 32 cores. SPARC has had 16 cores for ages. Since some generations back, the SPARC cores are able to dedicate all resources to one thread if need be. This way the SPARC core can have one very strong thread, or massive throughput (many threads). The SPARC M7 cpu is 10 billion transistors:
    http://www.enterprisetech.com/2014/08/13/oracle-cr...
    and it will be 3-4x faster than the current SPARC M6 (12 cores, 96 threads) which holds several world records today. The largest SPARC M7 server will have 32 sockets, 1024 cores, 64TB RAM and 8,192 threads. One SPARC M7 cpu will be as fast as an entire Sunfire 25K. :)

    The largest Xeon E5 server will top out at 4-sockets probably. I think the Xeon E7 cpus top out at 8-socket servers. So, if you need massive RAM (more than 10TB) and massive performance, you need to venture into Unix server territory, such as SPARC or POWER. Only they have 32-socket servers capable of reaching the highest performance.

    Of course, the SGI Altix/UV2000 servers have 10,000s of cores and 100s of TB of RAM, but they are clusters, like a tiny supercomputer, only doing HPC number crunching workloads. You will never find these large Linux clusters running SAP Enterprise workloads; there are no such SAP benchmarks, because clusters suck at non-HPC workloads.

    -Clusters are typically serving one user who picks which workload to run for the next few days. All SGI benchmarks are HPC; not a single Enterprise benchmark exists, for instance for SAP or other Enterprise systems. They serve one user.

    -Large SMP servers with as many as 32 sockets (or even 64 sockets!!!) are typically serving thousands of users, running Enterprise business workloads, such as SAP.
