Comparison With Other Storage Paradigms

Zoned storage is just one of several efforts to let software make its IO more SSD-friendly and to reduce unnecessary overhead from the block storage abstraction. The NVMe specification already includes features that allow software to issue writes with sizes and alignment appropriate for the SSD, plus features like Streams and NVM Sets to help ensure unrelated IO doesn't land in the same erase block. When supported by the SSD and host software, these features can provide most of the QoS benefits ZNS can achieve, but they aren't as effective at preventing write amplification. Applications that go out of their way to serialize writes (e.g. log-structured databases) can expect low write amplification, but only if the filesystem (or another layer of the IO stack) doesn't introduce fragmentation or reorder commands. Another downside is that these features are individually optional, so applications must be prepared to run on SSDs that support only a subset of the features the application wants.
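
To make this concrete, here is a minimal C sketch of how a Linux application can pass a write-lifetime hint and issue an aligned direct write; the kernel can translate such hints into the Streams directive when the drive supports it. The file path, the 64 KiB write size, and the fallback constant definitions are illustrative assumptions rather than values queried from a real device, and error handling is abbreviated.

/* Minimal sketch: separating data by expected lifetime with Linux
 * write-lifetime hints, which the kernel can map onto the NVMe Streams
 * directive when the drive supports it. Path and sizes are placeholders. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#ifndef F_SET_RW_HINT
#define F_SET_RW_HINT 1036           /* F_LINUX_SPECIFIC_BASE + 12 */
#endif
#ifndef RWH_WRITE_LIFE_SHORT
#define RWH_WRITE_LIFE_SHORT 2
#endif

int main(void)
{
    /* O_DIRECT keeps the page cache from splitting or re-ordering writes. */
    int fd = open("/mnt/nvme/journal.log", O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    /* Tell the kernel this file holds short-lived data (e.g. a log that is
     * soon compacted), so it can be steered away from long-lived data. */
    uint64_t hint = RWH_WRITE_LIFE_SHORT;
    if (fcntl(fd, F_SET_RW_HINT, &hint) < 0)
        perror("F_SET_RW_HINT");     /* the hint is advisory; continue anyway */

    /* Issue writes aligned and sized to a multiple of the drive's preferred
     * write granularity; 64 KiB here is a placeholder, not a queried value. */
    const size_t chunk = 64 * 1024;
    void *buf;
    if (posix_memalign(&buf, 4096, chunk)) { close(fd); return 1; }
    memset(buf, 0xAB, chunk);
    if (pwrite(fd, buf, chunk, 0) != (ssize_t)chunk)
        perror("pwrite");

    free(buf);
    close(fd);
    return 0;
}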

The Open Channel SSD concept has been tried in several forms. Compared to ZNS, Open Channel SSDs place even more responsibilities on the host software (such as wear leveling), which has hindered adoption. However, capable hardware has been available from several vendors. The LightNVM Open Channel SSD specification and associated projects have now been discontinued in favor of ZNS and other standard NVMe features, which can provide all of the benefits and functionality of the Open Channel 2.0 specification while placing fewer requirements on host software (but slightly more on SSD firmware). The other vendor-specific Open Channel SSD specifications will probably be retired when the current hardware implementations reach end of life.

ZNS and Open Channel SSDs can both be seen as modifications to the block storage paradigm, in that they retain the concept of a linear space of fixed-size Logical Block Addresses. Another recently approved NVMe TP adds a command set for Key-Value Namespaces, which abandons the fixed-size LBA concept entirely. Instead, the drive acts as a key-value database, storing objects of potentially variable size identified by keys of a few bytes. This storage abstraction looks nothing like how the underlying flash memory works, but KV databases are very common in the software world, and a KV SSD allows such a database's functionality to be almost completely offloaded from the CPU to the SSD. Implementing a KV database directly in the SSD firmware avoids much of the overhead of running a KV database on top of block storage that in turn sits on top of a traditional Flash Translation Layer, so this is another viable route to making IO more SSD-friendly. KV SSDs don't really have the cost advantages that a ZNS-only SSD can offer, but for some workloads they can provide similar performance and endurance benefits while saving some CPU time and RAM in the process.
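
To make the contrast with block IO concrete, the toy C sketch below mimics the shape of a key-value interface: small binary keys, variable-sized values, and no logical block addresses. Every name in it is invented for illustration, and the in-memory table merely stands in for what a KV SSD's firmware would do on flash; this is not any vendor's actual API, only the rough shape of the NVMe-KV store/retrieve operations.

/* Toy sketch of the key-value abstraction a KV namespace exposes. The
 * "drive" here is an in-memory table standing in for SSD firmware state;
 * the point is the interface shape, not the implementation. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define KV_MAX_KEY_LEN 16      /* keys are a handful of bytes */
#define KV_MAX_OBJECTS 256     /* toy capacity for the stand-in "drive" */

struct kv_object {
    uint8_t key[KV_MAX_KEY_LEN];
    size_t  key_len;
    void   *value;             /* variable-sized, unlike a fixed 512B/4K LBA */
    size_t  value_len;
};

static struct kv_object drive[KV_MAX_OBJECTS];

/* Store a value under a small binary key; the "drive" decides where it lands. */
static int kv_store(const uint8_t *key, size_t key_len,
                    const void *value, size_t value_len)
{
    if (key_len > KV_MAX_KEY_LEN)
        return -1;
    for (size_t i = 0; i < KV_MAX_OBJECTS; i++) {
        if (drive[i].value == NULL) {
            memcpy(drive[i].key, key, key_len);
            drive[i].key_len = key_len;
            drive[i].value = malloc(value_len);
            memcpy(drive[i].value, value, value_len);
            drive[i].value_len = value_len;
            return 0;
        }
    }
    return -1;                 /* namespace full */
}

/* Retrieve a value by key; returns bytes copied, or -1 if the key is absent. */
static long kv_retrieve(const uint8_t *key, size_t key_len,
                        void *buf, size_t buf_len)
{
    for (size_t i = 0; i < KV_MAX_OBJECTS; i++) {
        if (drive[i].value && drive[i].key_len == key_len &&
            memcmp(drive[i].key, key, key_len) == 0) {
            size_t n = drive[i].value_len < buf_len ? drive[i].value_len : buf_len;
            memcpy(buf, drive[i].value, n);
            return (long)n;
        }
    }
    return -1;
}

int main(void)
{
    const uint8_t key[] = { 0xDE, 0xAD, 0xBE, 0xEF };   /* 4-byte key */
    const char *val = "a variable-sized object, not a fixed-size block";
    char out[128] = { 0 };

    kv_store(key, sizeof key, val, strlen(val) + 1);
    if (kv_retrieve(key, sizeof key, out, sizeof out) > 0)
        printf("retrieved: %s\n", out);
    return 0;
}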

Comments

  • Carmen00 - Friday, August 7, 2020 - link

    Fantastic article, both in-depth and accessible, a great primer for what's coming up on the horizon. This is what excellence in tech journalism looks like!
  • Steven Wells - Saturday, August 8, 2020 - link

    Agree with @Carmen00. Super well written. Fingers crossed that one of these “Not a rotating rust emulator” architectures can get airborne. As long as the flash memory chip designers are unconstrained to do great things to reduce cost generation to generation, with the SSD maintaining the fixed abstraction, I’m all for this.
  • Javier Gonzalez - Friday, August 7, 2020 - link

    Great article Billy. A couple of pointers to other parts of the ecosystem that are being upstreamed at the moment are:

    - QEMU support for ZNS emulation (several patches posted in the mailing list)
    - Extensions to fio: Currently posted and waiting for stabilizing support for append in the kernel
    - nvme-cli: Several patches for ZNS management are already merged

    Also, a comment to xZTL is that it is intended to be used on several LSM-based databases. We ported RocksDB as a first step, but other DBs are being ported on top. xZTL gives the necessary abstractions for the DB backend to be pretty thin - you can see the RocksDB HDFS backend as an example.

    Again, great article!
  • Billy Tallis - Friday, August 7, 2020 - link

    Thanks for the feedback, and for your presentations that were a valuable source for this article!
  • Javier Gonzalez - Friday, August 7, 2020 - link

    Happy to hear that it helped.

    Feel free to reach out if you have questions on a follow-up article :)
  • jabber - Friday, August 7, 2020 - link

    And for all that, it will still slow to Kbps and take two hours when copying a 2GB folder full of KB-sized microfiles.

    We now need better, more efficient file systems, not hardware.
  • AntonErtl - Friday, August 7, 2020 - link

    Thank you for this very interesting article.

    It seems to me that ZNS strikes the right abstraction balance:

    It leaves wear leveling to the device, which probably knows more about wear and device characteristics, and the zoned interface makes the job of wear leveling more straightforward than the classic block interface does.

    A key-value interface would cover a significant part of what a file system does, and it seems to me that after all these years, there is still enough going on in this area that we do not want to bake it into drive firmware.
  • Spunjji - Friday, August 7, 2020 - link

    Everything up to the "Supporting Multiple Writers" section seemed pretty universally positive... and then it all got a bit hazy for me. Kinda seems like they introduced a whole new problem, there?

    I guess if this isn't meant to go much further than enterprise hardware then it likely won't be much of an issue, but still, that's a pretty keen limitation.
  • Spunjji - Friday, August 7, 2020 - link

    Great article, by the way. Realised I didn't mention that, but I really appreciate the perspective that's in-depth but not too-in-depth for the average tech-head 😁
  • AntonErtl - Saturday, August 8, 2020 - link

    As long as the zone is not divided between file systems, or direct-access databases, it is natural that writes are synchronized and sequenced. And talking to the SSD through one NVMe/PCIe interface means that all writes (even to multiple zones) are sent to the drive in sequence.

    OTOH, you have software and hardware with synchronous interfaces (waits for some feedback before sending the next request), and in such a setting doing everything through one thread costs throughput.

    So you can either design everything to work with asynchronous interfaces (e.g., SCSI tagged command queuing), at least at all single-thread levels, or you design synchronous interfaces that work with multiple threads. The "write it somewhere, and then tell where you wrote" approach seems to be along the latter lines. What's the status of asynchronous interfaces for NVMe?
