Understanding QoS

While SSDs generally provide good average latency, individual access times can spike to orders of magnitude above the mean. These major spikes typically occur hundreds of times per second, but arguably more problematic are the smaller spikes of 3x to 5x the mean. Although these 3x to 5x spikes may affect fewer than 10% of operations, that still translates into thousands of spikes per second. And that is just for one SSD. The problem compounds in an array of independent SSDs, where each device randomly produces its own latency spikes.
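To make this arithmetic concrete, the following minimal sketch counts spikes per second for a single SSD and estimates how often a request that touches several independent SSDs hits at least one spike. The 50,000 operations per second rate, the 5% spike fraction, and the function names are illustrative assumptions, not measured figures.

```python
def spikes_per_second(iops: float, spike_fraction: float) -> float:
    """Expected number of operations per second that land on a latency spike."""
    return iops * spike_fraction


def prob_any_spike(spike_fraction: float, num_ssds: int) -> float:
    """Probability that a request fanned out to num_ssds independent SSDs
    sees a spike on at least one of them."""
    return 1.0 - (1.0 - spike_fraction) ** num_ssds


if __name__ == "__main__":
    iops = 50_000          # assumed per-SSD operation rate
    spike_fraction = 0.05  # assumed share of operations at 3x to 5x the mean
    print(f"spikes/s per SSD: {spikes_per_second(iops, spike_fraction):.0f}")
    for n in (1, 4, 8, 16):
        p = prob_any_spike(spike_fraction, n)
        print(f"{n:2d} SSDs: P(at least one spike) = {p:.3f}")
```

Under these assumptions a single SSD produces about 2,500 spikes per second, and a request spread across 8 SSDs has roughly a one-in-three chance of being delayed by at least one of them.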

SSD latency spikes occur without warning, and neither their frequency nor their magnitude can be predicted. The impact of any given spike is unknown in advance; the only guarantee is random variability.

“Consistent latency is the goal of every storage solution … the degree of variability is what separates enterprise storage solutions from typical client-side hardware. … The degree of variability is especially pertinent, as many applications can hang or lag as they wait for I/O requests to complete.”
TweakTown SSD Testing Analysis Summary

Data center storage workloads vary, but the most prevalent workloads involve a mix of read and write accesses, with two reads for each write being the most common mix. This is also one of the most difficult workloads on which SSDs must deliver deterministic QoS.

Symphonic achieves an order-of-magnitude improvement in latency QoS for mixed read/write operations. Operating in effective host address space while performing Flash management processes under cooperative host control not only dramatically reduces mean latency but also provides completely deterministic response times. This type of consistent QoS, not possible with FTL SSDs, effectively creates a new Flash storage tier that extends Flash throughout Next Gen data centers.

Why latency QoS has become the dominant metric for data center performance

Applications from VDI to analytics and OLTP all demand low latency, but the most important requirement is that response times be consistent and predictable. Data center applications accessing storage must be able to rely on a predetermined level of responsiveness in order to execute their own operations within defined intervals, and that cannot be achieved by relying on mean latency alone.

“AVE Latency only shows the “average” of the response times recorded. MAX Latency only shows the slowest response time measured during the test period. …Confidence plots show the percentage of all response times that occur at a given time threshold – at what time value will, for example, 99.99% of the IOs occur. A higher percentage at a faster time is better.

Confidence plots can be used as a Quality of Service (QoS) metric to show how many IOs could be dropped if they occur slower than a given time value. The “number of nines” is shorthand for indicating QoS – 99.99% or 99.999% is “4 nines” or “5 nines” of Confidence.”
– SNIA SSSI “SSD Performance, Evaluation, and Test”
Eden Kim, August 2013
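As a minimal sketch of the confidence metric described in the quote, the code below reports the latency at a given confidence level using a nearest-rank percentile. The sample set, the function names, and the choice of the nearest-rank method are assumptions made for illustration; they are not taken from the SNIA material.

```python
import math


def latency_at_confidence(latencies_us, confidence):
    """Nearest-rank percentile: the smallest recorded latency at or below
    which at least `confidence` (a fraction in (0, 1]) of samples completed."""
    if not latencies_us:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_us)
    # round() guards against floating-point error pushing an exact
    # product slightly above an integer before ceil() is applied.
    rank = max(1, math.ceil(round(confidence * len(ordered), 9)))
    return ordered[rank - 1]


def nines(confidence):
    """Shorthand count of nines, e.g. 0.9999 -> 4 and 0.99999 -> 5."""
    return round(-math.log10(1.0 - confidence))


# Hypothetical sample set (microseconds): mostly fast, with a few outliers.
samples = [200] * 9990 + [1000] * 9 + [20000]

if __name__ == "__main__":
    print(f"average: {sum(samples) / len(samples):.1f} us")
    print(f"maximum: {max(samples)} us")
    for c in (0.99, 0.9999, 0.99999):
        print(f"{nines(c)} nines: {latency_at_confidence(samples, c)} us")
```

For this hypothetical data the average is 202.7 us, yet the 4 nines latency is 1,000 us and the 5 nines latency is 20,000 us, which is the gap between average and confidence-based reporting that the quote describes.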

So can a particular application and specific user accept a predetermined latency 95% of the time, meaning that exceeding the timing threshold for 5% of operations is acceptable? Or does it need to be 99.999% of the time? And when the timing threshold is exceeded, by what margin can it be exceeded? These are difficult questions for any individual application, harder still for networked arrays servicing multiple applications, and nearly impossible for multi-tenant cloud environments.
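These questions can be put to a set of measured latencies directly. The sketch below is one hypothetical way to do so: it reports the fraction of operations that met a threshold, whether that fraction satisfies a required level, and how far the worst operation overshot. The function name, the threshold, and the sample values are assumptions chosen for illustration.

```python
def slo_report(latencies_us, threshold_us, required_fraction):
    """Summarize how a set of latency samples fares against a threshold."""
    if not latencies_us:
        raise ValueError("need at least one latency sample")
    total = len(latencies_us)
    over = [x for x in latencies_us if x > threshold_us]
    within_fraction = (total - len(over)) / total
    return {
        # Share of operations that completed within the threshold.
        "within_fraction": within_fraction,
        # True if that share reaches the required level (e.g. 0.95).
        "meets_target": within_fraction >= required_fraction,
        # Worst latency as a multiple of the threshold, or None if no misses.
        "worst_overshoot": max(over) / threshold_us if over else None,
    }


if __name__ == "__main__":
    # Hypothetical samples in microseconds.
    samples = [200] * 95 + [900] * 4 + [10_000]
    for target in (0.95, 0.99999):
        print(target, slo_report(samples, threshold_us=500, required_fraction=target))
```

With these sample values the 95% target is met and the 99.999% target is not, while the worst operation exceeds the 500 us threshold by a factor of 20 in both cases.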

The only practical answer to this issue for service providers is Service Level Agreements (SLAs), where guaranteed levels of performance, capacity, availability and other services are specified in formal contracts. From a cost and management perspective, enterprise environments share many of the same challenges, because resource allocations must be adjusted to meet dynamic business requirements.

For application designers, the variable and unpredictable nature of latency spikes means they cannot know which specific operation will hit one of the thousands of spikes occurring randomly in any given second. They therefore have to design the application around the assumption that all operations will be serviced within some upper-bound latency, e.g., the level met by 95% or more of operations.

Even then, how do application designers factor in the outlier spikes that, while occurring less than 1% of the time, can reach 10 ms, 20 ms, or more? They cannot. The severity of these spikes means that the application can hang or, at a minimum, that the resulting timeout significantly affects the operations that follow.

It is well understood that the frequency and magnitude of latency spikes increase with capacity utilization, so virtually every solution attempting to address Flash QoS is based on some form of overprovisioning.

Enterprise-class SSDs are already overprovisioned by 10% to 30%, with 23% often cited as an average for enterprise-class devices. What is often overlooked is that capacity is also overprovisioned at the system level, typically by 20% to 35%, i.e., systems utilize only 65% to 80% of the advertised storage device capacity. Yet despite how much this overprovisioning increases costs, it still is not enough for most purpose-built storage systems to provide premium QoS. Instead, most new All-Flash Arrays are built on an approach that uses full stripe writes to mitigate latency spikes, which substantially increases system write amplification, wears out the SSDs faster, and requires still more overprovisioning depending on system write rates.
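The combined effect of these two layers can be estimated with simple arithmetic. The sketch below assumes that device overprovisioning is expressed as spare Flash relative to advertised capacity (so raw Flash = advertised capacity x (1 + overprovisioning)); that definition and the function names are assumptions, since the text does not state how the percentages are measured.

```python
def usable_fraction_of_raw(device_op: float, system_utilization: float) -> float:
    """Share of raw Flash that ends up holding user data.

    device_op: spare Flash as a fraction of advertised capacity (assumed
               definition), e.g. 0.23 for 23%.
    system_utilization: share of advertised capacity the system actually
                        uses, e.g. 0.65 to 0.80.
    """
    advertised_per_raw = 1.0 / (1.0 + device_op)
    return advertised_per_raw * system_utilization


def raw_flash_per_usable_unit(device_op: float, system_utilization: float) -> float:
    """Raw Flash that must be purchased for each unit of stored user data."""
    return 1.0 / usable_fraction_of_raw(device_op, system_utilization)


if __name__ == "__main__":
    for utilization in (0.80, 0.65):
        frac = usable_fraction_of_raw(0.23, utilization)
        cost = raw_flash_per_usable_unit(0.23, utilization)
        print(f"utilization {utilization:.0%}: {frac:.1%} of raw Flash usable, "
              f"{cost:.2f}x raw Flash per unit stored")
```

Under this assumed definition, 23% device overprovisioning combined with 65% to 80% system utilization leaves roughly 53% to 65% of the raw Flash holding user data.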

The biggest QoS challenge for SSDs is the most common workload mix, namely when read and write operations are mixed together. While workloads vary considerably, data center applications often mix reads and writes roughly at a ratio of two reads for each write operation. This is especially challenging for SSDs, and in particular for the most critical metric, which is read latency and QoS in the presence of writes.

Most purpose-built systems in data centers today, and even more so advanced and Next Gen systems, employ some form of write-cache or logging at the system level. This architectural approach is chosen because write-caching or logging at the system level provides superior response times and requires less fault-tolerant memory than if it were employed at the device level. In the case of log-structured architectures and related variants (LSM, Append-Only, etc.), which form the basis for the overwhelming majority of purpose-built and advanced systems, the result is that these systems have already serialized system write operations and solved write latency in a manner that is superior to any Flash SSD. But they have done so at a steep cost for random read operations, where read requests are now more likely to be scattered across random locations.

For this reason, random read latency and QoS, in the presence of write operations, generally define storage system performance levels. Unfortunately, this is also the greatest challenge for FTL SSD architectures. NAND memory programming (write) operations typically take over 2 ms, more than 10x the time required for read operations. To achieve good write performance, FTLs must parallelize write operations over multiple dies and channels, thereby achieving good aggregate write bandwidth. However, this approach also increases the likelihood that any read request to the SSD will get stuck behind a write operation. The result is suboptimal read latency and the destruction of read latency QoS: read operations that should average less than 250 µs occasionally get stuck behind write operations, creating latency spikes of over 2 ms without warning.
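A deliberately simplified model illustrates the effect. It assumes a read arrives at a die that is busy programming with probability p_busy, that the arrival falls uniformly within the program operation, and that the read then waits for the remainder of that operation. The 10% busy probability and the function names are assumptions; the model ignores queueing, erase operations, and any program-suspend capability.

```python
def read_latency_model(t_read_us: float, t_program_us: float, p_busy: float):
    """Return (mean, worst_case) read latency in microseconds.

    A blocked read waits, on average, half of the program time because its
    arrival is assumed to be uniform within the program operation.
    """
    mean = t_read_us + p_busy * t_program_us / 2.0
    worst_case = t_read_us + t_program_us
    return mean, worst_case


def fraction_slower_than(limit_us: float, t_read_us: float,
                         t_program_us: float, p_busy: float) -> float:
    """Share of reads whose latency exceeds limit_us under the same model."""
    if limit_us < t_read_us:
        return 1.0  # every read takes at least the unblocked read time
    if limit_us >= t_read_us + t_program_us:
        return 0.0  # no read can wait longer than one full program operation
    # Blocked reads exceed the limit when the remaining program time is
    # longer than (limit_us - t_read_us).
    return p_busy * (1.0 - (limit_us - t_read_us) / t_program_us)


if __name__ == "__main__":
    mean, worst = read_latency_model(250.0, 2000.0, p_busy=0.10)
    print(f"mean read latency: {mean:.0f} us, worst case: {worst:.0f} us")
    share = fraction_slower_than(1250.0, 250.0, 2000.0, p_busy=0.10)
    print(f"reads slower than 1250 us: {share:.1%}")
```

With a 250 us read, a 2,000 us program operation, and a 10% chance of hitting a busy die, the modeled mean rises to 350 us while the worst case reaches 2,250 us, and about 5% of reads take longer than 1,250 us.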

Symphonic Cooperative Flash Management was created to address this architectural challenge while reducing Flash overprovisioning costs.
