How We Test PCIe 4.0 Storage: The AnandTech 2021 SSD Benchmark Suite

Author: Billy Tallis

by Billy Tallis on February 1, 2021 1:15 PM EST

Advanced Synthetic Tests

Our benchmark suite includes a variety of tests that are less about replicating any real-world IO patterns, and more about exposing the inner workings of a drive with narrowly-focused tests. Many of these tests will show exaggerated differences between drives, and for the most part that should not be taken as a sign that one drive will be drastically faster for real-world usage. These tests are about satisfying curiosity, and are not good measures of overall drive performance.

Sequential Drive Fill

The main purpose of the sequential drive fill test is to estimate the size of a drive's SLC write cache. This test is also one of the most likely to trigger thermal throttling, because it is the longest-running sustained IO test in our suite. This test performs two passes of writing to the drive. The first is conducted after erasing the drive and giving it a few minutes to cool down and finish any background work. This first pass of sequential writes shows us the best-case SLC cache capacity, since any variable-size cache will be at its largest when starting on an empty drive. The second pass is conducted after giving the drive some idle time and performing some read performance tests. By the time the second write pass begins, the drive should have finished any background work and we should observe the worst-case SLC cache capacity for drives that have a variable-size cache.
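As a rough illustration of how an SLC cache size could be read off a drive fill trace, the Python sketch below looks for the first sustained drop in write throughput. The function name, the 60% drop threshold, the four-sample window and the sample trace are all assumptions made for this example; they are not taken from the actual test suite.

```python
from statistics import median


def estimate_slc_cache_gb(throughput_mbps, gb_per_sample=1.0,
                          drop_ratio=0.6, window=4):
    """Estimate how many GB were written before write speed fell off.

    throughput_mbps: write speed samples, one per `gb_per_sample` GB written.
    The baseline is the median of the first `window` samples. The cache is
    considered exhausted at the first index where `window` consecutive
    samples all fall below `drop_ratio` * baseline (assumed heuristic).
    Returns None when no sustained drop is found.
    """
    if len(throughput_mbps) < 2 * window:
        return None
    threshold = median(throughput_mbps[:window]) * drop_ratio
    for i in range(window, len(throughput_mbps) - window + 1):
        if all(s < threshold for s in throughput_mbps[i:i + window]):
            return i * gb_per_sample
    return None


# Made-up trace: 100 GB at SLC speed, then 60 GB at a lower speed.
trace = [5000] * 100 + [1500] * 60
print(estimate_slc_cache_gb(trace))  # 100.0
```

Running the same function on the traces from the first and second passes would give the best-case and worst-case estimates respectively, under the assumptions above.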

As the second sequential write pass continues, the SLC cache will eventually be filled and even drives that don't use SLC caching will usually show some performance drop. This is pushing the drive well beyond the limits of any real-world consumer workload, so aside from any SLC cache at the beginning, performance during the second pass is irrelevant. However, since this second pass is overwriting data that was also written sequentially, the drive's garbage collection during this process is quite straightforward. Overwriting the drive with random writes instead of sequential writes would be more likely to fill the drive's spare area and induce more severe performance drops.


[Charts: Sequential Drive Fill, Pass 1 and Pass 2; Average Throughput for last 16 GB; Overall Average Throughput]

After both passes of sequential writes are complete, the last 20% of the drive is trimmed and the drive is given plenty of idle time. This prepares the drive for the battery of tests that are conducted on an 80%-full drive: full enough that SLC cache size is significantly reduced, but still leaving some empty space to avoid testing the absolute worst-case scenario of performance on a completely full drive.
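The byte range covered by "the last 20% of the drive" can be worked out with simple arithmetic. The sketch below is only an illustration: the function name, the 4kB alignment and the example capacity are assumptions, and the article does not say which tool issues the TRIM. A range like this could be handed to any utility that accepts an offset and a length.

```python
def last_fraction_range(capacity_bytes, fraction_percent=20, align=4096):
    """Return (offset, length) in bytes for the last part of the drive.

    The start offset is rounded up to a multiple of `align` so that the
    range begins on a 4kB boundary (an assumption for this sketch).
    """
    keep = capacity_bytes * (100 - fraction_percent) // 100
    offset = -(-keep // align) * align  # round up to the alignment
    return offset, capacity_bytes - offset


# Example capacity for a nominal 1TB drive (illustrative value).
capacity = 1_000_204_886_016
offset, length = last_fraction_range(capacity)
print(offset, length)
print(round(100 * length / capacity, 3), "% of the drive")
```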

Working Set Size

This test performs random 4kB reads at queue depth 1 while varying the working set size: the size of the dataset that the random reads are coming from. When the working set size is small, the access pattern has a high degree of spatial locality, and DRAMless drives should have no trouble caching the limited amount of NAND mapping information needed to handle the reads. As the working set size increases, drives with little or no RAM are likely to show reduced performance from an increasing number of FTL cache misses. Often there is a sharp drop in performance that suggests the size of any on-controller SRAM or HMB cache in use. Drives with some DRAM but not the full 1GB per 1TB ratio may be able to handle very large working set sizes with good performance, but typically still show reduced performance when random reads span the entire drive.
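The 1GB per 1TB ratio mentioned above follows from a flat mapping table with one small entry per 4kB of flash. The sketch below assumes 4-byte entries and 4kB mapping units, which is a common simplification rather than a description of any particular controller, and the 64MB cache size is a hypothetical example.

```python
ENTRY_BYTES = 4   # assumed size of one mapping entry
MAP_UNIT = 4096   # assumed granularity of the mapping table, in bytes


def ftl_map_bytes(working_set_bytes, entry_bytes=ENTRY_BYTES, unit=MAP_UNIT):
    """Bytes of mapping data needed to cover a working set."""
    units = -(-working_set_bytes // unit)  # ceiling division
    return units * entry_bytes


def max_cached_working_set(cache_bytes, entry_bytes=ENTRY_BYTES, unit=MAP_UNIT):
    """Largest working set whose mappings fit entirely in the cache."""
    return cache_bytes // entry_bytes * unit


GIB = 2 ** 30
# A 1 TiB working set needs 1 GiB of mapping data under these assumptions.
print(ftl_map_bytes(1024 * GIB) / GIB)
# A hypothetical 64 MiB mapping cache covers a 64 GiB working set.
print(max_cached_working_set(64 * 2 ** 20) / GIB)
```

Under this model, a sharp drop in random read performance at a given working set size hints at a mapping cache of roughly one thousandth of that size.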

This test also provides an opportunity to verify that the TRIM command is working properly: when attempting to read data from a portion of the drive that is empty (or has been trimmed), the drive should return a bunch of zeros as soon as it has looked up the relevant LBAs in the FTL and determined that there isn't actually any real flash memory currently allocated to those addresses. So in addition to running the working set size test on a full drive, we also run it when the drive is 32GB full and 80% full, expecting to see substantially increased performance when many or most of the reads should be handled without actually touching the NAND flash memory. These extra test runs aren't included in the graphs we publish, but we're keeping an eye out for drives that don't behave as expected.
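To get a feel for how many reads should skip the NAND in those extra runs, the sketch below estimates the share of random reads that land on unallocated space and shows a simple all-zero check for a returned block. It assumes reads are spread uniformly over the whole drive and that only the filled portion is mapped; the function names and the 1000GB capacity are made up for the example.

```python
def unmapped_read_fraction(capacity_gb, filled_gb):
    """Fraction of uniformly random reads expected to hit unallocated LBAs."""
    if not 0 <= filled_gb <= capacity_gb:
        raise ValueError("filled_gb must be between 0 and capacity_gb")
    return 1 - filled_gb / capacity_gb


def is_all_zero(block: bytes) -> bool:
    """True when every byte of a read buffer is zero."""
    return not any(block)


# Hypothetical 1000GB drive at the two fill levels used for the extra runs.
print(unmapped_read_fraction(1000, 32))    # about 0.968
print(unmapped_read_fraction(1000, 800))   # about 0.2
print(is_all_zero(bytes(4096)))            # True
```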

Performance vs Block Size

Industry standard practice is to measure random IO performance using 4kB operations and sequential IO performance using 128kB operations. But SSDs permit IOs as small as 512 bytes, and real-world workloads include a wide variety of actual IO block sizes. Our trace-based tests subject drives to IOs of various sizes, but are ill-suited for analyzing how specific block sizes perform.

These tests perform 1GB of IO at each block size, at a queue depth of 1 and with the usual idle time after each step. Like our other synthetic tests, they're performed both with the drive 32GB full and 80% full, to capture any differences due to things like SLC caching. Regular readers may recognize these tests as based on ones we use as part of our enterprise SSD test suite. The principle is the same, but the configuration here has been adjusted to match the rest of our synthetic tests, and we're now testing block sizes up to 2MB. As with some of the other tests, the fact that we're testing under Linux means that IOs larger than 128kB get split up by the OS and issued to the drive as a batch. For example, IO with a 1MB block size ends up looking to the drive like eight operations of 128kB issued at the same time.
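The splitting arithmetic can be sketched in a few lines of Python. This is a simplified model that uses the 128kB limit quoted above; the function names are made up for the example, and the real limit and split boundaries depend on the kernel and the device.

```python
KIB = 1024
MAX_IO = 128 * KIB  # split size quoted in the text


def split_io(offset, length, max_io=MAX_IO):
    """Split one request into (offset, size) pieces no larger than max_io."""
    pieces = []
    end = offset + length
    while offset < end:
        size = min(max_io, end - offset)
        pieces.append((offset, size))
        offset += size
    return pieces


def requests_per_step(total_bytes, block_size):
    """Number of application-level IOs needed to move total_bytes."""
    return total_bytes // block_size


print(len(split_io(0, 1024 * KIB)))          # 8 pieces for a 1MB request
print(len(split_io(0, 2048 * KIB)))          # 16 pieces for a 2MB request
print(requests_per_step(2 ** 30, 512))       # IOs in a 1GB step at 512 bytes
```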


[Charts: Performance vs Block Size for Random Read, Random Write, Sequential Read and Sequential Write]

There are several interesting phenomena to keep an eye out for. With block sizes smaller than 4kB, we generally see performance that is roughly the same IOPS as with a 4kB block size. This is a consequence of the fact that virtually all flash-based SSDs manage the NAND flash memory in 4kB chunks, even when configured to expose a 512-byte LBA size. Some drives exhibit pathologically low performance with sub-4kB block sizes, especially for writes, where a read-modify-write cycle may be necessary for the drive to preserve the data in the rest of the 4kB block.
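A toy model makes the read-modify-write cost easier to see. The sketch below counts how many 4kB mapping units a write covers completely and how many it covers only partially; each partial unit would need its old contents read back before the new data can be merged in. The 4kB unit size comes from the text, while the function name and the example requests are assumptions.

```python
MAP_UNIT = 4096  # management granularity described in the text


def units_touched(offset, length, unit=MAP_UNIT):
    """Return (full, partial) counts of mapping units covered by a write.

    A partially covered unit implies a read-modify-write in this model.
    """
    if length <= 0:
        raise ValueError("length must be positive")
    end = offset + length
    full = partial = 0
    for u in range(offset // unit, (end - 1) // unit + 1):
        covered = min(end, (u + 1) * unit) - max(offset, u * unit)
        if covered == unit:
            full += 1
        else:
            partial += 1
    return full, partial


print(units_touched(0, 512))      # (0, 1): one read-modify-write
print(units_touched(0, 4096))     # (1, 0): clean overwrite
print(units_touched(512, 4096))   # (0, 2): misaligned 4kB write
```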

Sequential IO with small to medium block sizes can also reveal some surprises, such as drives that seem to assume any 4kB access will be a random access and choose not to read and cache the rest of the (typically ~16kB) NAND page. Quite a few drives also show little improvement in sequential throughput with the medium block sizes, but show significant throughput scaling once the block size is well past 128kB. This is part of why we changed our burst sequential IO tests to use 1MB block sizes instead of 128kB.
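The effect of caching the rest of a NAND page can be shown with a small counting model. It assumes a 16kB page, as suggested above, and a controller that either remembers the most recently read page or forgets it after every request. Real controllers read from many dies and planes in parallel, so this is only a sketch with made-up names.

```python
NAND_PAGE = 16 * 1024  # typical page size mentioned in the text


def nand_page_reads(block_size, total_bytes, page=NAND_PAGE, cache_page=True):
    """Count NAND page reads needed to serve a sequential read stream.

    With cache_page=True the most recently read page is kept, so later
    requests that fall inside the same page cost no extra NAND read.
    """
    reads = 0
    cached = None
    for offset in range(0, total_bytes, block_size):
        first = offset // page
        last = (offset + block_size - 1) // page
        for p in range(first, last + 1):
            if cache_page and p == cached:
                continue
            reads += 1
            cached = p
    return reads


MIB = 2 ** 20
print(nand_page_reads(4096, MIB))                     # 64
print(nand_page_reads(4096, MIB, cache_page=False))   # 256
print(nand_page_reads(16384, MIB, cache_page=False))  # 64
```

In this model a drive that treats every 4kB read as random does four times as many page reads for the same sequential data as one that keeps the page around.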


70 Comments

drmaddogs - Saturday, June 19, 2021 - link
Random is measured by Chaos measures. Turing had it best. And AI mimics this like the human brain.
pexxie - Friday, February 12, 2021 - link
I was hoping to hear more from the linux fundi. :-(
I guess criticism is easy, guidance takes effort. :-P
pexxie - Saturday, February 13, 2021 - link
An alternative to this might be a retention or volatility test. So basically hook the SSD up in a way that you can quickly yank out its sata or power cable. Then copy a very big file to it, and immediately after Windows says the copy is done; yank out the data or power cable. Then reboot and do a checksum on the file on the target SSD, and compare to the original, and see if any of them have actually written all the data.
pexxie - Saturday, February 13, 2021 - link
I wish we could edit posts. Grrr.
Otherwise if it's an M.2 slot; hit the reset button on the PC immediately after Windows says the file has finished copying. Then compare checksums.
pexxie - Saturday, February 13, 2021 - link
So basically testing power loss resiliency. There in the 1st world power reliability is of no concern, but it's a big concern here in the 3rd world. Power aint reliable like in America.
pexxie - Saturday, February 13, 2021 - link
You can observe the disk's misconduct with the disk LED on your chassis. The disk LED should stop when the file copy is done, but it doesn't - so it still takes time for it to get it onto non-volatile storage. So the data is still floating around in volatile memory while that LED is still on. I have 4 SSDs - for one of them the LED only stays on like a second after the file copy is "done." The others take 5-ish seconds. They all fail in a power cut test - killing power immediately after the OS says the copy is "done." Checked in Windows and Linux. I suspected this was misconduct by Windows, but since I see it in linux too; I'm more confident about it being disk misconduct.
pexxie - Saturday, February 13, 2021 - link
My bad. Actually this LED thing was because of buffered writes by the OS. Using xcopy in windows with the /J parameter avoids this "misconduct." So it is actually the OS behaving badly. Now to just figure out how to force all writes to be unbuffered....
Even using unbuffered writing; my SSDs still fail my power cut test - parts of the file sit in volatile memory for too long after the copy is "done" and the file gets corrupted on the destination disk.
pexxie - Sunday, February 14, 2021 - link
Woohoo! Finally solved this by mounting partitions in linux using the "sync" option. I knew TLC chips were insanely slow, but damn - less than 1MB/s sequential writing is madness. At least I'm getting 10MB/s sequential with my old MLC chips. So it was the doing of the OS all along. Multiple layers of caching make a tortoise storage medium look like a rabbit.

Won't add any more posts/spam. Just wish I could consolidate into 1.
kpb321 - Monday, February 1, 2021 - link
How is the AMD Ryzen 5 3600X system being run without a GPU? That chip doesn't have integrated video so generally I expect it would fail at Post with beep codes. AFAIK none of the AMD APUs have PCI-E 4 support so I don't think there is a way to use integrated video and support PCI-E 4. I mean it doesn't need much of a video card and the 580 in the other system is probably overkill for storage testing but it seems like it would need something even if it's installed in one of the PCI-e 3 lanes hanging off the chipset instead of the 4.0 lane off the cpu.
frbeckenbauer - Monday, February 1, 2021 - link
You can run Ryzen headless without issues on many motherboards, while some will indeed refuse to boot. MSI apparently provides a BIOS that has the error disabled so it works headless if you ask them.
