TrueCrypt 7.1 Benchmark

TrueCrypt is a software application used for on-the-fly encryption (OTFE). It is free, open source and offers full AES-NI support. The application also features a built-in encryption benchmark that we can use to measure CPU performance. First we test with the AES algorithm (256-bit key, symmetric).

TrueCrypt AES

Core for core, clock for clock, the Xeon E5 - which also supports AES-NI - is about 30% faster than the best Opteron (Xeon E5-2660 vs Opteron 6276). At a similar price point (Opteron 6276 vs Xeon E5-2660 6C), however, the Opteron and Xeon E5 perform more or less the same, with a small advantage for the latter.

We also test with the heaviest combination of the cascaded algorithms available: Serpent-Twofish-AES.

TrueCrypt AES-Twofish-Serpent

The combination benchmark is limited by the slowest algorithms: Twofish and Serpent. This is one of the few benchmarks where the Opteron 6276 is able to keep up with the Xeon E5.

It is important to realize that these benchmarks are synthetic rather than real-world. It would be better to test a website that does some encrypting in the background or a fileserver with encrypted partitions. In that case the encryption software is only a small part of the total code being run, so a large performance (dis)advantage might translate into a much smaller performance (dis)advantage in that real-world situation. For example, eight times faster encryption resulted in a website with 23% higher throughput and 40% faster file encryption (see here).
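
To get a feel for how a large speedup in one component shrinks at the application level, we can plug the website numbers above into Amdahl's law. This is our own back-of-the-envelope sketch, not a measurement: it assumes that 23% higher throughput corresponds to a 1.23x reduction in time per request and that only the encryption portion got faster. The function names are ours.

```python
def overall_speedup(fraction: float, part_speedup: float) -> float:
    """Amdahl's law: total speedup when `fraction` of the run time
    is accelerated by a factor of `part_speedup`."""
    return 1.0 / ((1.0 - fraction) + fraction / part_speedup)


def implied_fraction(overall: float, part_speedup: float) -> float:
    """Invert Amdahl's law: which share of the original run time must the
    accelerated part have had to produce the observed overall speedup?"""
    return (1.0 - 1.0 / overall) / (1.0 - 1.0 / part_speedup)


# Figures quoted in the text: 8x faster encryption, 23% higher throughput.
# Treating "23% higher throughput" as a 1.23x speedup is an assumption.
p = implied_fraction(overall=1.23, part_speedup=8.0)
print(f"Implied share of time spent encrypting: {p:.1%}")

# Sanity check: feeding the fraction back in reproduces the 1.23x figure.
print(f"Overall speedup: {overall_speedup(p, 8.0):.2f}x")
```

Under those assumptions, encryption would have accounted for roughly a fifth of the time per request, which is why an eightfold gain in the algorithm shows up as a much smaller gain for the site as a whole.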

7-Zip 9.2

7-Zip is a file archiver with a high compression ratio. It is open source software, with most of the source code available under the GNU LGPL license.

7-zip

Compression is more CPU intensive than decompression, while the latter depends a little more on memory bandwidth. When it comes to load/stores and memory bandwidth, the Xeon E5-2660 is about 13% faster than AMD's flagship. Compression performance is partly determined by the quality of the branch predictor. The new and improved Sandy Bridge branch predictor is one of the reasons why a 2.2 GHz 6-core 2660 is able to keep up with a 2.93 GHz (!) Xeon 5670, which is also a six-core processor. The Opterons get blown away in the compression benchmark: each core of the Xeon E5 is about twice as efficient in this task. The overall winner is thus once again the Xeon E5.
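
The "keeps up at a lower clock" observation can be turned into a rough per-clock figure. The sketch below is ours and rests on two assumptions: that the two six-core chips score exactly the same (the text only says the 2660 keeps up), and that both run at the base clocks quoted above, ignoring Turbo Boost.

```python
def per_clock_advantage(clock_a_ghz: float, clock_b_ghz: float,
                        score_ratio_a_over_b: float = 1.0) -> float:
    """Work done per clock cycle by chip A relative to chip B,
    for chips with the same number of cores."""
    return score_ratio_a_over_b * clock_b_ghz / clock_a_ghz


# Assumed equal scores: 2.2 GHz Xeon E5-2660 6C vs 2.93 GHz Xeon 5670.
advantage = per_clock_advantage(clock_a_ghz=2.2, clock_b_ghz=2.93)
print(f"Per-clock advantage: {(advantage - 1.0) * 100:.0f}%")
```

If the scores were indeed equal, the Sandy Bridge cores would be doing about a third more compression work per clock cycle. The branch predictor is only one of several possible contributors to that gap.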


65 Comments


  • meloz - Tuesday, March 06, 2012

    I wonder if this Data Direct I/O Technology has any relevance to audio engineering? I know that latency is a big deal for those guys. In past I have read some discussion on latency at gearslutz, but the exact science is beyond me.

    Perhaps future versions of protools and other professional DAWs will make use of Data Direct I/O Technology.
  • Samus - Tuesday, March 06, 2012

    wow. 20MB of on-die cache. thats ridiculous.
  • PwnBroker2 - Tuesday, March 06, 2012

    dont know about the others but not ATT. still using AMD even on the new workstation upgrades but then again IBM does our IT support, so who knows for the future.

    the new xeon's processors are beasts anyways, just wondering what the server price point will be.
  • tipoo - Tuesday, March 06, 2012

    "AMD's engineers probably the dumbest engineers in the world because any data in AMD processor is not processed but only transferred to the chipset."

    ...What?
  • tipoo - Tuesday, March 06, 2012

    Think you've repeated that enough for one article?
  • tipoo - Wednesday, March 07, 2012

    Like the Ivy bridge comments, just for future readers note that this was a reply to a deleted troll and no longer applies.
  • IntelUser2000 - Tuesday, March 06, 2012

    Johan, you got the percentage numbers for LS-Dyna wrong.

    You said for the first one: the Xeon E5-2660 offers 20% better performance, the 2690 is 31% faster. It is interesting to note that LS-Dyna does not scale well with clockspeed: the 32% higher clockspeed of the Xeon E5-2690 results in only a 14% speed increase.

    E5-2690 vs Opteron 6276: +46%(621/426)
    E5-2660 vs Opteron 6276: +26%(621/492)
    E5-2690 vs E5-2660: +15%(492/426)

    In the conclusion you said the E5 2660 is "56% faster than X5650, 21% faster than 6276, and 6C is 8% faster than 6276"

    Actually...

    LS Dyna Neon-

    E5-2660 vs X5650: +77%(872/492)
    E5-2660 vs 6276: +26%(621/492)
    E5-2660 6C vs 6276: +9%(621/570)

    LS Dyna TVC-

    E5-2660 vs X5650: +78%(10833/6072)
    E5-2660 vs 6276: +35%(8181/6072)
    E5-2660 6C vs 6276: +13%(8181/7228)

    It's funny how you got the % numbers for your conclusions. It's merely the ratio of lower number vs higher number multiplied by 100.
  • JohanAnandtech - Wednesday, March 07, 2012

    Argh. You are absolutely right. I reversed all divisions. I am fixing this as we type. Luckily this does not alter the conclusion: LS-DYNA does not scale with clockspeed very well.
  • alpha754293 - Wednesday, March 07, 2012

    I think that I might have an answer for you as to why it might not scale well with clock speed.

    When you start a multiprocessor LS-DYNA run, it goes through a stage where it decomposes the problem (through a process called recursive coordinate bisection (RCB)).

    This decomposition phase is done every time you start the run, and it only runs on a single processor/core. So, suppose that you have a dual-socket server where the processors say...are hitting 4 GHz. That can potentially be faster than say if you had a four-socket server, but each of the processors are only 2.4 GHz.

    In the first case, you have a small number of really fast cores (and so it will decompose the domain very quickly), whereas in the latter, you have a large number of much slower cores, so the decomposition will happen slowly, but it MIGHT be able to solve the rest of it slightly faster (to make up for the difference) just because you're throwing more hardware at it.

    Here's where you can do a little more experimenting if you like.

    Using the pfile (command line option/flag 'p=file'), not only can you control the decomposition method, but you can also tell it to write the decomposition to a file.

    So had you had more time, what I would have probably done is written out the decompositions for all of the various permutations you're going to be running. (n-cores, m-number of files.)

    When you start the run, instead of it having to decompose the problem over and over again each time it starts, you just use the decomposition that it's already done (once) and then that way, you would only be testing PURELY the solving part of the run, rather than from beginning to end. (That isn't to say that the results you've got is bad - it's good data), but that should help to take more variables out of the equation when it comes to why it doesn't scale well with clock speed. (It should).
  • IntelUser2000 - Tuesday, March 06, 2012

    Please refrain from creating flamebait in your posts. Your post is almost like spam, almost no useful information is there. If you are going to love one side, don't hate the other.
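
The percentage correction that IntelUser2000 posts earlier in this thread can be reproduced with a few lines. This is an illustrative sketch: the function names are ours, and it assumes the quoted LS-DYNA figures are elapsed times where lower is better, which is how the comment treats them.

```python
def percent_faster(slower_time: float, faster_time: float) -> float:
    """How much faster the quicker run is: ratio of the times, minus one."""
    return (slower_time / faster_time - 1.0) * 100.0


def percent_less_time(slower_time: float, faster_time: float) -> float:
    """The reversed division: how much less time the quicker run needs.
    This is a valid figure, but it is not 'percent faster'."""
    return (1.0 - faster_time / slower_time) * 100.0


# Elapsed times taken from the comment (first LS-DYNA test).
opteron_6276, xeon_2660, xeon_2690 = 621.0, 492.0, 426.0

pairs = [
    ("E5-2690 vs Opteron 6276", opteron_6276, xeon_2690),
    ("E5-2660 vs Opteron 6276", opteron_6276, xeon_2660),
    ("E5-2690 vs E5-2660", xeon_2660, xeon_2690),
]
for label, slow, fast in pairs:
    print(f"{label}: +{percent_faster(slow, fast):.0f}% faster, "
          f"{percent_less_time(slow, fast):.0f}% less time")
```

For the first pair, the two functions give 46% and 31% respectively, which matches the gap between the corrected figure and the one originally published.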
