The Impact of NCQ on Multitasking Performance

Just under a year ago, we reviewed Maxtor's MaXLine III, a SATA hard drive that boasted two very important features: a 16MB buffer and support for Native Command Queuing (NCQ). The 16MB buffer was interesting, as it was the first time that we had seen a desktop SATA drive with such a large buffer, but what truly intrigued us was the drive's support for NCQ. The explanation of NCQ below is taken from our MaXLine III review from June of 2004:

Hard drives are the slowest things in your PC and they are such mostly because they are the only component in your PC that still relies heavily on mechanics for its normal operation. That being said, there are definite ways of improving disk performance by optimizing the electronics that augment the mechanical functions of a hard drive.

Hard drives work like this: they receive read/write requests from the chipset's I/O controller (e.g. Intel's ICH6) that are then buffered by the disk's on-board memory and carried out by the disk's on-board controller, making the heads move to the correct platter and the right place on the platter to read or write the necessary data. The hard drive is, in fact, a very obedient device; it does exactly what it's told to do, which is a bit unfortunate. Here's why:

It is the hard drive, not the chipset's controller, not the CPU and not the OS that knows where all of the data is laid out across its various platters. So, when it receives requests for data, the requests are not always organized in the best manner for the hard disk to read them. They are organized in the order in which they are dispatched by the chipset's I/O controller.

Native Command Queuing is a technology that allows the hard drive to dynamically reorder its pending requests according to where the requested data sits on the platter. It's like this - say you had to go to the grocery store, the drug store next to it, the mall and then back to the grocery store for something else. Doing it in that order would not make sense; you'd be wasting time and money. You would naturally re-order your errands to grocery store, grocery store, drug store and then the mall in order to improve efficiency. Native Command Queuing does just that for disk accesses.
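To make the errand analogy concrete, here is a minimal Python sketch of the general idea. It is an illustration only, not how any drive's firmware works: the block positions are made up, the function names are hypothetical, and a real NCQ implementation also has to account for rotational position, which this sketch ignores.

```python
def total_seek_distance(start, requests):
    """Sum the head travel needed to service requests in the given order."""
    position = start
    total = 0
    for target in requests:
        total += abs(target - position)
        position = target
    return total


def reorder_requests(start, requests):
    """Service everything at or beyond the head first (ascending),
    then sweep back for whatever lies behind it (descending)."""
    ahead = sorted(r for r in requests if r >= start)
    behind = sorted((r for r in requests if r < start), reverse=True)
    return ahead + behind


# Hypothetical positions: grocery store = 10, drug store = 12, mall = 90.
queue = [10, 12, 90, 10]  # arrival order: grocery, drug store, mall, grocery
head = 0

arrival_cost = total_seek_distance(head, queue)
reordered = reorder_requests(head, queue)
reordered_cost = total_seek_distance(head, reordered)

print("arrival order:", queue, "travel:", arrival_cost)
print("reordered:    ", reordered, "travel:", reordered_cost)
```

With these made-up positions, servicing the queue in arrival order costs 170 units of head travel, while the reordered queue costs 90, because the second trip to the "grocery store" no longer requires coming all the way back from the "mall".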

For most desktop applications, NCQ isn't necessary. Desktop applications are mostly sequential in nature and exhibit a high degree of spatial locality. What this means is that most disk accesses for desktop systems occur around the same basic areas on a platter. Applications store all of their data around the same location on your disk as do games, so loading either one doesn't require many random accesses across the platter - reducing the need for NCQ. Instead, we see that most desktop applications benefit much more from higher platter densities (more data stored in the same physical area on a platter) and larger buffers to improve sequential read/write performance. This is the reason why Western Digital's 10,000 RPM Raptor can barely outperform the best 7200 RPM drives today.

Times are changing, however, and while a single desktop application may be sequential in nature, running two different desktop applications simultaneously changes the dynamics considerably. With Hyper-Threading and multi-core processors being the way of the future, we can expect desktop hard disk access patterns to begin to resemble those of servers, at least slightly - with more random accesses. It is in these true multitasking and multithreading environments that technologies such as NCQ can improve performance.

In the Maxtor MaXLine III review, we looked at NCQ as a feature that truly came to life when working in multitasking scenarios. Unfortunately, finding a benchmark to support this theory was difficult. In fact, only one benchmark (the first Multitasking Business Winstone 2004 test) actually showed a significant performance improvement due to NCQ.

After recovering from Part I and realizing that my nForce4 Intel Edition platform had died, I was hard at work on Part II of the dual core story. For the most part, when someone like AMD, Intel, ATI or NVIDIA launches a new part, they just send that particular product. In the event that the new product requires another one (such as a new motherboard/chipset) to work properly, they will sometimes send both and maybe even throw in some memory if that's also a rarer item. Every now and then, one of these companies will decide to actually build a complete system and ship that for review. For us, that usually means that we get a much larger box and we have to spend a little more time pulling the motherboard out of the case so we can test it out on one of our test benches instead - obviously, we never test a pre-configured system supplied by any manufacturer. This time around, both Intel and NVIDIA sent out fully configured systems for their separate reviews, which left two huge boxes blocking our front door.

When dissecting the Intel system, I noticed something - it used a SATA Seagate Barracuda 7200.7 with NCQ support. Our normal testbed hard drive is a 7200.7 Plus, basically the same drive without NCQ support. I decided to make Part I's system configuration as real world as possible and I used the 7200.7 with NCQ support. So, I used that one 7200.7 NCQ drive for all of the tests for Monday's review. Normally, only being able to run one system at a time would be a limitation. But given how much work I had to put into creating the tests, I wasn't going to be able to run multiple things at the same time while actually using each machine, so this wasn't a major issue. The results turned out as you saw in the first article and I went on with working on Part II.

For Part II, I was planning to create a couple more benchmarks, so I wasn't expecting to be able to compare things directly to Part I. I switched back to our normal testbed HDD, the 7200.7 Plus. Using our normal testbed HDD, I was able to set up more systems in parallel (since I had more HDDs) and thus, testing went a lot quicker. I finished all of the normal single threaded application benchmarks around 3AM (yes, including gaming tests) and I started installing all of the programs for my multitasking scenarios.

When I went to run the first multitasking scenario, I noticed something was very off - the DVD Shrink times were almost twice what they were in Monday's review. I spent more time working with the systems and uncovered that Firefox and iTunes weren't configured identically to the systems in Monday's review, so I fixed those problems and re-ran. Even after re-running, something still wasn't right - the performance was still a lot slower. It was fine in all other applications and tests, just not this one. I even ran the second multitasking scenario from Monday's review and the performance was dead on - something was definitely up. Then it hit me...NCQ.

I ghosted my non-NCQ drive to the NCQ drive and re-ran the test. Yep, same results as Monday. The difference was NCQ! Johan had been pushing me to use a Raptor in the tests to see how much of an impact disk performance had on them, and the Raptor sped things up a bit, but not nearly as much as using the NCQ-enabled 7200.7 did. How much of a performance difference? The following DVD Shrink times use the same configuration from Monday's article, with the only variable being the HDD. I tested on the Athlon 64 FX-55 system:

Seagate Barracuda 7200.7 NCQ - 25.2 minutes
Seagate Barracuda 7200.7 no NCQ - 33.6 minutes
Western Digital Raptor WD740 - 30.9 minutes
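For readers who want the gaps as percentages, the short Python sketch below works them out from the three times listed above. The only inputs are those published numbers; the variable and function names are ours, and treating the non-NCQ 7200.7 as the baseline is our choice for illustration.

```python
# Times in minutes, copied from the results above.
times_min = {
    "Seagate Barracuda 7200.7 NCQ": 25.2,
    "Seagate Barracuda 7200.7 no NCQ": 33.6,
    "Western Digital Raptor WD740": 30.9,
}

# Assumption: use the non-NCQ 7200.7 as the baseline for comparison.
baseline = times_min["Seagate Barracuda 7200.7 no NCQ"]


def time_saved_pct(time_min, baseline_min):
    """Percentage of the baseline run time that a drive saves."""
    return (baseline_min - time_min) / baseline_min * 100


ncq_saving = time_saved_pct(times_min["Seagate Barracuda 7200.7 NCQ"], baseline)
raptor_saving = time_saved_pct(times_min["Western Digital Raptor WD740"], baseline)

print(f"NCQ 7200.7 saves {ncq_saving:.1f}% of the baseline time")
print(f"Raptor saves {raptor_saving:.1f}% of the baseline time")
```

On these numbers, the NCQ drive finishes in about 25% less time than the same drive without NCQ, while the Raptor saves roughly 8%.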

The performance impact of NCQ is huge: the NCQ drive finished the test in 25% less time than the otherwise identical drive without it. But once again, just as in the first NCQ article, this is the only test that I can get to be impacted by NCQ - the other Multitasking Scenarios remain unchanged. Even though these numbers were run on the AMD system, I managed to get similar results out of the Intel platform, although, for whatever reason, the Intel benchmarks weren't nearly as consistent as the AMD benchmarks. Given that we're dealing with different drive controllers and vastly different platforms, there may be many explanations for that.

At first, I thought that this multitasking scenario was the only one where NCQ made an impact, but as you'll find out later on in this article, that's not exactly true.

  • saratoga - Friday, April 8, 2005 - link

    #90:

    HT is the same thing as SMT. You can thank Intel's marketing for that one.
  • Reflex - Friday, April 8, 2005 - link

    #93: Intel has labeled it as SMT; however, there is another name for what they are doing (that I cannot remember at the moment). What they are calling SMT is nowhere even close to solutions like Power.

    That aside, the implementation Intel has chosen is designed to make up for inefficiencies in the Prescott pipeline. Such an implementation would make zero sense on the Athlon architecture, as it does not share the same inefficiencies that the P4 design has. It would actually harm rather than help performance.

    True SMT is not a 'bolt on' feature. It's something that has to be planned for from the very beginning of the CPU design cycle. You could not in any way add it to the current Athlon design and gain any performance. Whatever their next generation is may include it, depending on what direction they decide to go, but you will not see it on the current generation, and that's actually a good thing as it would be purely a marketing move.
  • eeceret - Friday, April 8, 2005 - link

    As always, a very interesting article. One thing comes to mind though... In the gaming multitasking tests you adjusted the priority of the DVD Shrink process to see the effect on gaming performance. What I was wondering is whether you could take a look at what effect explicitly binding the processes to separate cores (processor affinity) has on gaming performance.
  • defter - Friday, April 8, 2005 - link

    Hyperthreading IS SMT. SMT stands for simultaneous multithreading (the ability to run two or more threads at once), and this is exactly what hyperthreading does.

    Of course, CPUs from different manufacturers have vastly different internal structures, thus also the SMT is implemented differently.


    "Intel's next major IA-32 processor release, codenamed Prescott, will include a feature called simultaneous multithreading (SMT)"

    http://arstechnica.com/articles/paedia/cpu/hyperth...
  • tynopik - Friday, April 8, 2005 - link

    and of course that's just the net part, don't want to leave out other background tasks like that resource sucker outlook and playing flac/ape files
  • tynopik - Thursday, April 7, 2005 - link

    to get repeatable multi-tasking/ncq benches, anand is going to have to bite the bullet and set up a full-blown network simulation:

    1. an nntp server
    2. a bittorrent swarm
    3. an irc server

    with this setup, you can test these multi-tasking scenarios that seem more reasonable:
    1. firewall (a pig like zonealarm)
    2. pulling news articles with either 2 clients or 1 client with 2 threads (writing to different places on hd simultaneously)
    3. about 10 torrents where it is BOTH downloading and uploading (so pulling from a gazillion different places on hd at once)
    4. mirc with about 5 open channels and some scripts (like filters). At least one channel should be very high traffic (like #mp3passion on undernet)
    5. icq
    6. running all this with software raid 5

    this would represent a typical background load, and then you can benchmark foreground tasks to see how much they are affected by what's going on in the background (specifically ncq could be tested by seeing how long it takes to copy a file from one partition to another under these circumstances)
  • Reflex - Thursday, April 7, 2005 - link

    Just to be clear: SMT is NOT the same thing as HyperThreading. They go about what they are doing in radically different ways. The only similarity is in the CPU being able to execute two simultaneous threads. How it goes about that, though, is implemented completely differently.
  • Reflex - Thursday, April 7, 2005 - link

    "#47, if HT is simply a "bandaid", then why is AMD the only major CPU vendor not using it? IBM uses it heavily in their Power5, Sun is making their next CPUs (Niagara) very highly SMT (same thing as HT). Arguably, both of those architectures have much more shallow pipelines than the P4, yet see reason to provide SMT. AMD is the only holdout."

    The SMT used in IBM's Power series is completely different from what Intel is doing with the P4 design. The only similarity is the fact that two threads can be run at once; the implementations, however, are nothing alike. I do not have details on Sun's implementation, but I would assume it will be closer to IBM's than Intel's implementation considering the market they are targeting. The Power architecture was designed from the ground up to use SMT, it wasn't a tacked-on feature, and you get considerably more of a performance boost in most scenarios with it than you would ever see with HT on Intel.

    The Athlon64 architecture was not designed with SMT or HT in mind; it was designed around two physical cores. So adding HT to it would do very little, and SSE3 (which mostly optimizes HT style multithreading) does almost nothing on the K8 architecture.

    Not every feature would help every CPU design; it all depends on what was taken into account when the design was made. Power has some limitations you do not see on x86 (in order execution for example), and x86 has challenges you do not see on Power. The multi-threading implementations are similarly different and not comparable. In the x86 world, HT makes sense on Intel in some situations (not always). It makes no sense on AMD and would likely result in performance drops rather than gains. It certainly would not improve performance in any way, as the core does not often have idle units or execution steps due to its design.
  • Icehawk - Thursday, April 7, 2005 - link

    "I'm also curious to see what effects RAID would have on testing striped setups."

    Uh, delete the "striped setups" from the end ;)

    Can we please, please get some kind of short-term editing abilities here?
  • Icehawk - Thursday, April 7, 2005 - link

    So you'd rather wait for information than receive it now?

    Anand clearly shows that dual core is only a good choice now IF you use it in scenarios where it can run multiple applications. Otherwise, single core chips are still the better choice. So I don't see the marketing hype you are referring to. Basically we've been told now for quite a few years that multi-taskers can benefit from multiple CPUs but the costs have been prohibitive. Now it looks like within the next year a 2 CPU machine will cost no more than previous single core processors.

    Thanks Anand for helping us out in planning for the future! The DVDShrink stuff was very interesting to me as was the NCQ information - makes switching to SATA drives a bit more appealing to me considering my usage profile.

    I just recently went from 1.4 k7->3.2 P4 w/HT so I'm pretty happy at this point. It does look like a dual core system *might* allow me to get rid of my second box (the 1.4 K7) which would save me money in the long run - one less PC to power up and cool off. My home office requires year round A/C to cool my 2 21" CRTs and 2 PCs...

    I'm also curious to see what effects RAID would have on testing striped setups. I'm very curious if a RAID5 type of setup with NCQ and a dual-core might make chores like encoding & gaming realistic - it sounds like from the review that at this point I/O may cause hiccups even when the processor still has headroom.
