Samsung has announced a new prototype key-value SSD that is compatible with the first industry standard API for key-value storage devices. Earlier this year, the Object Drives working group of the Storage Networking Industry Association (SNIA) published version 1.0 of the Key Value Storage API Specification. Samsung has added support for this new API to their ongoing key-value SSD project.

Most hard drives and SSDs expose their storage capacity through a block storage interface: the drive stores blocks of a fixed size (typically 512 bytes or 4kB), identified by Logical Block Addresses that are usually 48 or 64 bits long. Key-value drives extend that model so that a drive can support variable-sized keys instead of fixed-sized LBAs, and variable-sized values instead of fixed 512-byte or 4kB blocks. This allows a key-value drive to be used more or less as a drop-in replacement for software key-value databases like RocksDB, and as a backend for applications built atop key-value databases.
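
To make the difference concrete, here is a rough sketch of the two models from an application's point of view. The function names and signatures below are illustrative placeholders, not the actual SNIA Key Value Storage API.

```c
#include <stddef.h>
#include <stdint.h>

/* Block device model: fixed-size blocks addressed by a Logical Block
 * Address; updating a 100-byte record still means a read-modify-write
 * of its whole enclosing block.                                        */
int block_read (int dev, uint64_t lba, void *buf, size_t block_size);
int block_write(int dev, uint64_t lba, const void *buf, size_t block_size);

/* Key-value device model: the drive itself maps a variable-sized key
 * to a variable-sized value, so the application does no LBA arithmetic
 * and no block-size padding.                                           */
int kv_store   (int dev, const void *key, size_t key_len,
                const void *val, size_t val_len);
int kv_retrieve(int dev, const void *key, size_t key_len,
                void *buf, size_t buf_len, size_t *val_len_out);
int kv_delete  (int dev, const void *key, size_t key_len);
```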

Key-value SSDs have the potential to offload significant work from a server's CPUs when used to replace a software-based key-value database. More importantly, moving the key-value interface into the SSD itself means it can be tightly integrated with the SSD's flash translation layer, cutting out the overhead of emulating a block storage device and layering a variable-sized storage system on top of that. This means key-value SSDs can operate with much lower write amplification and higher performance than software key-value databases, with only one layer of garbage collection in the stack instead of one in the SSD and one in the database.
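
A back-of-the-envelope illustration of why collapsing those layers matters; the write-amplification factors below are invented for the example, not measured figures.

```c
#include <stdio.h>

int main(void) {
    /* Hypothetical write-amplification factors, chosen only to show
     * that stacked layers multiply; they are not measurements.        */
    double wa_db  = 4.0;  /* e.g. compaction in a software KV database */
    double wa_ftl = 2.5;  /* e.g. garbage collection in the SSD's FTL  */

    /* Stacked: the database rewrites data during compaction, and the
     * FTL rewrites each of those writes again during its own GC.      */
    printf("software KV store on block SSD: %.1fx\n", wa_db * wa_ftl);

    /* KV SSD: a single garbage-collection layer inside the drive.     */
    printf("key-value SSD:                  %.1fx\n", wa_ftl);
    return 0;
}
```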

Samsung has been working on key-value SSDs for quite a while, and they have been publicly developing open-source software to support KV SSDs for over a year, including the basic libraries and drivers needed to access KV SSDs as well as a sample benchmarking tool and a Ceph backend. The prototype drives they have previously discussed have been based on their PM983 datacenter NVMe drives with TLC NAND, using custom firmware to enable the key-value interface. Those drives support key lengths from 4 to 255 bytes and value lengths up to 2MB, and it is likely that Samsung's new prototype is based on the same hardware platform and retains similar size limits.
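
For software targeting these prototypes, the previously published limits amount to a simple request check before issuing a store. The constants below restate the key and value size limits quoted above; the helper function itself is purely illustrative.

```c
#include <stdbool.h>
#include <stddef.h>

/* Limits reported for Samsung's earlier PM983-based KV prototypes. */
#define KV_MIN_KEY_LEN    4
#define KV_MAX_KEY_LEN    255
#define KV_MAX_VALUE_LEN  (2u * 1024 * 1024)  /* 2MB */

/* Hypothetical helper: reject a store request the drive would refuse. */
static bool kv_request_is_valid(size_t key_len, size_t value_len)
{
    return key_len >= KV_MIN_KEY_LEN && key_len <= KV_MAX_KEY_LEN &&
           value_len <= KV_MAX_VALUE_LEN;
}
```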

Samsung's Platform Development Kit software for key-value SSDs originally supported their own software API, but now additionally supports the vendor-neutral SNIA standard API. The prototype drives are currently available to companies interested in developing software that uses KV SSDs. Samsung's KV SSDs probably will not move from prototype status to mass production until the corresponding key-value command set extension to NVMe is finalized, so that KV SSDs can be supported without needing a custom NVMe driver. The SNIA standard API for key-value drives is a high-level, transport-agnostic API that can support drives using NVMe, SAS or SATA interfaces, but each of those protocols needs to be extended with key-value support.


  • FunBunny2 - Friday, September 6, 2019 - link

    "its transactional capabilities work by allowing a (single) writer to write an alternative B+ tree whilst multiple readers use the current B+tree, all of which is safe to do because the copy-on-write semantics isolate the writer entirely from the readers. when the transaction is complete, a new B+-tree "root" is created, which *only new readers* will be allowed to see (not the ones that currently have a read transaction in progress)."

    this is not materially different from MVCC semantics, which have hobbled Oracle et al for a long time. Oracle eats servers for breakfast. sane transaction scoping, sans client intervention, with a locker (traditional) engine, will always be more parsimonious and efficient.
  • lkcl - Friday, September 6, 2019 - link

    you may be interested to know that an oracle employee tracked the LMDB wikipedia page for a long time, continuously trying to get it deleted. when they succeeded, i went into overdrive, spent three weeks rewriting it to a high standard. we have that unethical oracle employee to thank, for that. whoopsie :)

    howard chu's perspective on write-journal transaction logging is both dead-accurate and extremely funny. he says, very simply, "well if you maintain a write journal you just wrote data twice and cut performance in half".

    oracle's approach is *in no way* more efficient. "eats servers for breakfast" - as in, it's so heavy it "consumes" them? as in, it puts such a heavy load onto machines that it destroys them?

    howard's LMDB code is so small it fits into the L1 cache of any modern processor (including smartphone ones). by designing the code to *be* efficient, the server happens to have more time to do things such as "read and write".

    interestingly, as long as fsync is enabled, it is literally impossible to corrupt the database. only if SSDs without power loss protection are used (such that the *SSD* corrupts the data) will the database become corrupted. that's with *no write journal because by design one is not needed*. you might lose some data (if the last write transaction did not get to write the new B+-tree root in time), but you will not *corrupt* the *existing* data due to a power-loss event.

    https://en.wikipedia.org/wiki/Multiversion_concurr...

    no.

    timestamps are *NOT* used in LMDB. there is absolutely no slow-down as the number of readers increases, because there is no actual "locking" on read-access. the only "locking" required is at the point where a writer closes a transaction and needs to update the "root" of the B+-tree.

    so if a reader happens to start a new read transaction just at the point where a writer happens to want to *close* a transaction, there is a *small* window of time in which the reader *might* be blocked (for a few hundred nanoseconds). if it misses that window (early) it gets the *old* root node. if it misses that window (late) it gets the *new* root node. either way there will be no delay.

    there are *NOT* "multiple versions of the same objects", either, because there is only one writer. however there is some garbage-collection going on: old B+-tree roots, pointing to older "versions", will hang around until the very last reader closes the read transaction. at that point, LMDB knows to throw those pages onto the "free" page list.

    so the only reason that "old" objects will be still around is because there's a reader *actually using them*. duh :)

    the difference in the level of efficiency between MVCC and what LMDB does is just enormous, and it's down to the extremely unusual choice to use copy-on-write shared memory semantics.

    you may be interested to know that the BTRFS filesystem is the only other well-known piece of software using copy-on-write. it's just so obscure from a computer science perspective that it was 20 years before i even became aware that shmem had a copy-on-write option!
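
For concreteness, a minimal sketch of the snapshot behaviour described in the comment above, using LMDB's public C API (error handling omitted; MDB_NOTLS is set so a single thread can hold both a read and a write transaction in this toy example):

```c
#include <stdio.h>
#include <string.h>
#include <lmdb.h>

int main(void)
{
    MDB_env *env;
    MDB_dbi dbi;
    MDB_txn *txn, *rtxn;
    MDB_val key, val;

    key.mv_size = strlen("greeting");
    key.mv_data = "greeting";

    mdb_env_create(&env);
    mdb_env_open(env, "./testdb", MDB_NOTLS, 0664);  /* ./testdb must exist */

    /* Writer 1: store an initial value and commit a B+-tree root.         */
    mdb_txn_begin(env, NULL, 0, &txn);
    mdb_dbi_open(txn, NULL, 0, &dbi);
    val.mv_size = strlen("old"); val.mv_data = "old";
    mdb_put(txn, dbi, &key, &val, 0);
    mdb_txn_commit(txn);

    /* Reader: pins the current root; sees a stable snapshot from here on. */
    mdb_txn_begin(env, NULL, MDB_RDONLY, &rtxn);

    /* Writer 2: copy-on-write pages, then a new root on commit.           */
    mdb_txn_begin(env, NULL, 0, &txn);
    val.mv_size = strlen("new"); val.mv_data = "new";
    mdb_put(txn, dbi, &key, &val, 0);
    mdb_txn_commit(txn);                    /* only *new* readers see this */

    /* The existing reader still sees the old root's value ("old").        */
    mdb_get(rtxn, dbi, &key, &val);
    printf("reader snapshot: %.*s\n", (int)val.mv_size, (char *)val.mv_data);
    mdb_txn_abort(rtxn);   /* lets LMDB eventually reclaim the old pages   */

    mdb_env_close(env);
    return 0;
}
```
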
  • lkcl - Friday, September 6, 2019 - link

    re-read that wikipedia page:

    "When an MVCC database needs to update a piece of data, it will not overwrite the original data item with new data, but instead creates a newer version of the data item. Thus there are multiple versions stored. "

    so yes... technically you are correct. however the actual implementation details are chalk and cheese. without using copy-on-write, the performance penalty is just enormous. *with* copy-on-write, MVCC semantics are an accidental beneficial *penalty-less* design side-effect.
  • bcronce - Friday, September 6, 2019 - link

    "you may be interested to know that the BTRFS filesystem is the only other well-known piece of software using copy-on-write"

    ZFS uses a merkle tree, which is also CoW by nature, since an individual branch of a merkle tree is effectively immutable. Any change by definition has to create a whole new branch. Git is another example of this same algorithm. You can't change history, but you can create a new history, and it is perfectly obvious that it is different.
  • mode_13h - Sunday, September 8, 2019 - link

    And Sun/Oracle also wrote ZFS.
    : )
  • mode_13h - Sunday, September 8, 2019 - link

    > i went into overdrive, spent three weeks rewriting it to a high standard

    Thanks for that & your informative post.

    > interestingly, as long as fsync is enabled, it is literally impossible to corrupt the database. only if SSDs without power loss protection are used (such that the *SSD* corrupts the data) will the database become corrupted.

    Even with fsync(), I believe you still need to enable filesystems' write barriers - a common mount option.

    > you may be interested to know that the BTRFS filesystem is the only other well-known piece of software using copy-on-write.

    This is a very strong claim. I know Linux has long used CoW in its VM system to trigger page allocations. I'm not sure if it's still true, but it used to be that malloc() would return memory mapped to a shared page containing zeros. Only when you tried to initialize it would a write-fault cause the kernel to go out and find an exclusive page to put behind it. Now that I think about it, I'm not sure how advantageous that would really be, especially if using 4 kB pages - the overhead from all those page faults would surely add up.

    CoW semantics show up in a lot of places where you have ref-counting. Some libraries force a copy, when you try to write to an object and find that the refcount is > 1 (meaning you don't have exclusive ownership).
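
A bare-bones sketch of that refcount-then-copy pattern (illustrative only, not any particular library's implementation):

```c
#include <stdlib.h>
#include <string.h>

/* Minimal copy-on-write buffer with reference counting. */
typedef struct {
    int    refs;
    size_t len;
    char  *data;
} cow_buf;

/* Sharing is cheap: just bump the reference count. */
cow_buf *cow_share(cow_buf *b) { b->refs++; return b; }

/* Before mutating, make a private copy if anyone else holds a reference. */
cow_buf *cow_make_writable(cow_buf *b)
{
    if (b->refs == 1)
        return b;                       /* exclusive owner: write in place */
    cow_buf *copy = malloc(sizeof *copy);
    copy->refs = 1;
    copy->len  = b->len;
    copy->data = malloc(b->len);
    memcpy(copy->data, b->data, b->len);
    b->refs--;                          /* drop our share of the original  */
    return copy;
}
```
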
  • mode_13h - Sunday, September 8, 2019 - link

    BTW, there's some hint of irony in bashing Oracle on one hand, and then seemingly praising BTRFS - also written by Oracle! That said, big companies are rarely all bad or all good.
  • bcronce - Friday, September 6, 2019 - link

    That is interesting. A single writer DB can be useful for limited applications or heavy-read low-write. But then you also have the issue of needing to partition your data in order to scale up writes.

    Moral of the story. There are many interesting tools to solve many types of problems. Some are fundamentally a poor fit for certain problems, and others could be a good or bad fit depending on the design, implementation, and configuration.

    The most fun thing about document data stores that everyone is using is distributed data stores that need multi-document cross-service transnational consistency is a much more difficult problem than multithreading. At least with multithreading, you're working in the same memory space, no messages are lost, and there are very small upper bounds on delayed messages.

    Engineers treat multithreading as some humanly impossible thing, but then blindly barrel head first into distributed eventually-consistent data storage with no cross-service consistency protocols and assume anyone can write a "microservice" in a vacuum as a one sprint project.
  • lkcl - Friday, September 6, 2019 - link

    yes, you may be interested to know that LMDB was initially designed purely for OpenLDAP (focussing as it does on read performance). howard's technical background would not let him write something that was inefficient, and that, surprisingly, makes it "not bad" (not great, just "not bad") for writes as well.

    OpenLDAP obviously has slurpd and can do data distribution. i agree 100%, it's technically an extremely challenging problem. not helped by developers not appreciating that network connections can disappear on you, and the *entire program* has to be designed around connection recovery and continuation.

    tricky! :)

    howard's comments on eventually-consistent databases are very funny. he says something like, "if you are going to use that [brain-dead technical approach] you might as well just use mongodb and hope that it writes the data", i wish i could remember where it was, it was incredibly funny. basically someone suggested doing something which, technically, was as good as just throwing the data away.

    the last time i did a review of these eventually-consistent "data"bases, i was asked to throw as much data at them as possible in order to check maximum write rates. i was shocked to find that mongodb, after only 3 minutes, paused for around 30 seconds and would not let me write any more data. after another 2 minutes, it paused for 45 seconds. 2 minutes later, it paused for an entire minute. you can guess where that went. we did not go with mongodb.

    btw another really informative article by him: https://www.linkedin.com/pulse/actordb-distributed...
  • mode_13h - Sunday, September 8, 2019 - link

    > The most fun thing about document data stores that everyone is using is distributed data stores that need multi-document cross-service transnational consistency is a much more difficult

    You mean "document data stores that everyone is using *as* distributed data stores that need multi-document cross-service *transactional* consistency" ?

    > assume anyone can write a "microservice" in a vacuum as a one sprint project.

    Lol, true.
