Secure Socket Layers RSA Performance

Secure Web communication is possible through the utilization of the Secure Sockets Layer (SSL) protocol. Using the command "openssl speed rsa" we can measure the number of RSA public key operations (signs) that a system can perform per second.

While "openssl speed rsa" is sufficient to test the Xeons and Opterons, the Sun T1 can speed up the Rivest Shamir Adleman (RSA) and Digital Signal Algorithm (DSA) encryption and decryption operations needed for SSL processing, thanks to a modular arithmetic unit (MAU) that supports modular exponentiation and multiplication. Each T1 core has a MAU, thus one 8 core T1 has 8 MAUs. To make use of those 8 MAUs, you have run the SSL calculations through the Solaris Cryptographic Framework (SCF). To test the T1 with the MAU crunching at full speed we used the command: "openssl speed -engine pkcs11 rsa". The Solaris 10 OS also provides in-kernel SSL termination, offering greater security than SSL termination outside the kernel.

We included the HP DL585 to see whether 8 cores of complex general purpose CPUs (Opteron 880) can keep up with the 8 MAU of the Sun T1. If you want to compare Woodcrest and the Opteron, you should check the 2 and 4 concurrency numbers. You can find our 1024-bit numbers in the graph below. One thread per core is optimal, so we tested the DL585 with a maximum of 16 threads, to show you that the peak is attained at 8 threads. The Xeon Irwindale was tested with 8 threads to show you that 4 threads (4 logical cores) is optimal and so on.



Notice that the 8 MAUs of the Sun T1 can only get in full action if we fire off 32 "SSL RSA signing" threads. Once that happens, the little 1 GHz T1 is able to keep up with the massive 2.4 GHz 8 core DL585. Without MAU, the T1 is as fast as a 1.8 GHz Xeon Irwindale. It is thus very important to check that your favorite web server works with SCF if you want to run your secure web services on the Sun T2000.

It looks like we've discovered the first - but rather insignificant to most people - "weakness" of the new Core architecture: decryption and encryption. The Opteron at 2.4 GHz has no trouble keeping up with the 3 GHz Woodcrest. This might be a result of the fact that the Woodcrest can only perform one rotate per cycle, while the Opteron can do 3. Although the RSA algorithm doesn't really use rotations, the hash algorithms needed to sign or encrypt a key make use of rotations. However, the most important reason is probably that the Opteron can sustain 2 ADC (Add with Carry) instructions per clock cycle, while Woodcrest can only do one. As ADC is good for about 17% of the instruction mix of the RSA algorithm, this might be enough to negate the extra integer power (Memory disambiguation, 4 wide decode ...) that the Woodcrest has.

Also notice that the previous NetBurst architecture, represented by the Xeon Irwindale, does very badly. The reason is that the P4 doesn't have a barrel shifter, a circuit in the chip which can shift or rotate any number in one clock cycle. Without this shifter, rotates and shifts take much longer, resulting in high latency. Most x86 code couldn't care less, but most encrypting code makes heavy use of rotates or shifts or both. We also did a quick test with Hyper-Threading on and off. In this case Hyper-Threading sped up the encryption (signs/s) with 20 to 28%.

To end the RSA sign/s benchmark, we'll make a quick comparison between quad core AMD Opteron 2.4 GHz, quad-core Intel Xeon Woodcrest and Sun's T1 with MAU enabled across different RSA bit lengths.

RSA Encryption (Signs/s)
  Opteron 2.4 GHz
4 threads
Xeon 5160 3 GHz
4 threads
SUN T1 with MAU
32 threads
512 bit 19003 21194 35613
1024 bit 6098 6240 10722
2048 bit 1145 1087 1918
4096 bit 185 164 1


Notice that the hardware acceleration of the T1 does not work beyond 2048-bit keys. Considering that most secure applications use 1024-bit and only a few "high security" ones use 2048-bit, this is not an issue.

In case of doing verifies as opposed to signs, the server has to authenticate the identity of the client. This is a lot less intensive, and we'll show you the verifies per second numbers at 2048-bits. At 1024-bits length, both the Woodcrest and Opteron were able to verify more than 50000 keys per core, and that is a hard limit of the OpenSSL benchmark.



Again, the Opteron takes the lead. The Sun T1 even with the 8 MAUs is half as slow as four Opterons or Woodcrests, but this is hardly an issue. Encrypting or signing will slow down a server much quicker than verifying keys.

Both verifies/s and signs/s benchmark are rather synthetic. It is much more realistic to test with a real web server running SSL, and that is what we are currently doing. We followed Sun's instructions to enable RSA hardware acceleration for Apache, but for some reason, the Apache web server is still not making use of the Solaris Cryptographic Framework. So our Web server SSL test is work in progress.

Theoretical Performance Apache/PHP/MySQL Performance
POST A COMMENT

91 Comments

View All Comments

  • blackbrrd - Wednesday, June 07, 2006 - link

    Finally Intel can give AMD some real competition in the two socket server market. This shows why Dell only wanted to go with AMD for 4S and not 2S server systems...

    245w vs 374w and a huge performance lead over the previous Intel generation is a huge leap for Intel.

    It will be interesting to see how much these systems are going to cost:
    1) is the fb-dimm's gonna be expensive?
    2) is the cpu's gonna be expensive?
    3) is the motherboards gonna be expensive?

    For AMD neither the ram nor the motherboards are expensive, so I am curious how this goes..

    If anybody thinks I am an Intel fanboy, I have bought in this sequence: intel amd intel intel, and I would have gotten and amd instead of an intel for the last computer, except I wanted a laptop ;)
    Reply
  • JarredWalton - Wednesday, June 07, 2006 - link

    For enterprise servers, price isn't usually a critical concern. You often buy what runs your company best, though of course there are plenty of corporations that basically say "Buy the fastest Dell" and leave it at that.

    FB-DIMMs should cost slightly more than registered DDR2, but not a huge difference. The CPUs should actually be pretty reasonably priced, at least for the standard models. (There will certainly be models with lots of L3 cache that will cost an arm and a leg, but that's a different target market.)

    Motherboards for 2S/4S are always pretty expensive - especially 4S. I would guess Intel's boards will be a bit more expensive than equivalent AMD boards on average, but nothing critical. (Note the "equivalent" - comparing boards with integrated SCSI and 16 DIMM slots to boards that have 6 DIMM slots is not fair, right?)

    Most companies will just get complete systems anyway, so the individual component costs are only a factor for small businesses that want to take the time to build and support their own hardware.
    Reply
  • blackbrrd - Wednesday, June 07, 2006 - link

    Registered DDR2 is dirt cheap, so if FB-DIMMs are only slightly more expensive thats really good.

    About compareing 6 DIMM slot and 16 DIMM slot motherboards, I agree, you can't do it. The number of banks is also important, we have a motherboard at work with 8 ranks and 6 DIMM slots, so only two of the slots can be filled with the cheapest 2gb dual rank memory. Currently 2gb single ranks modules are 50% more expensive than dual rank modules.

    Which brings another question.. Does FB-DIMM have the same "problem" with rank limit in addition to slot limit? Or does the FB part take care of that?
    Reply
  • BaronMatrix - Wednesday, June 07, 2006 - link

    why are we running servers with only 4GB RAM. I have that in my desktop. Not ot nitpick but I think you shuld load up 16GB and rerun the tests. If not this is a low end test, not HPC. I saw the last Apache comparison and it seems like the benchmark is different. Opteron was winning by 200-400% in those tests. What happened? Reply
  • JohanAnandtech - Wednesday, June 07, 2006 - link

    Feel free to send me 12 GB of FBDIMMs. And it sure isn't a HPC test, it is a server test.

    "I saw the last Apache comparison and it seems like the benchmark is different. Opteron was winning by 200-400% in those tests. What happened? "

    A new Intel architecture called "Core" was introduced :-)
    Reply
  • BaronMatrix - Wednesday, June 07, 2006 - link

    I didn't say the scores, I said the units in the benchmark. I'm not attacking you. It just stuck out in my head that the units didn't seem to be the same as the last test with Paxville. By saying HPC, I mean apps that use 16GB RAM, like Apache/Linux/Solaris. I'm not saying you purposely couldn't get 12 more GB of RAM but all things being equal 16GB would be a better config for both systems.

    I've been looking for that article but couldn't find it.
    Reply
  • JohanAnandtech - Wednesday, June 07, 2006 - link

    No problem. Point is your feedback is rather unclear. AFAIK, I haven't tested with Paxville. Maybe you are referring to my T2000 review, where we used a different LAMP test, as I explained in this article. In this article the LAMP server has a lot more PHP and MySQL work.

    http://www.anandtech.com/IT/showdoc.aspx?i=2772&am...">http://www.anandtech.com/IT/showdoc.aspx?i=2772&am...
    See the first paragraph

    And the 4 GB was simply a matter of the fact that Woodcrest had 4 GB of FB DIMM.
    Reply
  • JarredWalton - Wednesday, June 07, 2006 - link

    Most HPC usage models don't depend on massive amounts of RAM, but rather on data that can be broken down into massively parallel chunks. IBM's BlueGene for example only has 256MB (maybe 512MB?) of RAM per node. When I think of HPC, that's what comes to mind, not 4-way to 16-way servers.

    The amount of memory used in these benchmarks is reasonable, since more RAM only really matters if you have data sets that are too large to fit with the memory. Since our server data sets are (I believe) around 1-2GB, having more than 4GB of RAM won't help matters. Database servers are generally designed to having enough RAM to fit the vast majority of the database into memory, at least where possible.

    If we had 10-14GB databases, we would likely get lower results (more RAM = higher latency among other things), but the fundamental differences between platforms shouldn't change by more than 10%, and probably closer to 5%. Running larger databases with less memory would alter the benchmarks to the point where they would largely be stressing the I/O of the system - meaning the HDD array. Since HDDs are so much slower than RAM (even 15K SCSI models), enterprise servers try to keep as much of the regularly accessed data in memory as possible.

    As for the Paxville article, click on the "IT Computing" link at the top of the website. Paxville is the second article in that list (and it was also linked once or twice within this article). Or http://www.anandtech.com/IT/showdoc.aspx?i=2745">here's the direct link.
    Reply
  • BaronMatrix - Wednesday, June 07, 2006 - link

    Thx for the link, but the test I was looking at was Apache and showed concurrency tests. At any rate, just don't think I was attacking you. I was curious as to the change in units I noticed. Reply
  • MrKaz - Wednesday, June 07, 2006 - link

    How much will it cost?

    If Conroe XE 2.9Ghz is 1000$.
    Then I assume that this will cost more.

    I think looks good, but it will depends a lot of the final price.

    Also does that FBdimm have a premium price over the regular ones?
    Reply

Log in

Don't have an account? Sign up now