Solid State Drives (SSDs) are no longer new technology, but they may still improve as time goes by. SSDs have only been in use for about a decade, whereas Hard Drives have been available for about 50 years as of the date of this article. With that much more usage history behind the SATA Hard Drive, more accurate and reliable Mean Time Between Failures (MTBF) data is available for it. This is not so much the case with SSDs. However, that does not necessarily mean that MTBF data on SSDs is inaccurate or too limited to draw reliable conclusions about their lifespan. Determining which SSD technology is the most reliable is a complex process.
What is important is understanding that the two technologies are quite different from each other. They will behave in different ways when it comes to real-world use. Expectations and justification for the added expense of this newer technology also need to be taken into account.
What has some technology experts in disagreement with each other is which of the three types of SSD technology has longer data retention:
Single-level cell (SLC)
Multi-level cell (MLC)
Triple-level cell (TLC)*.
SLC stores one bit per cell, MLC stores two bits per cell, and TLC stores three bits per cell. Some believe SLC is more durable than the other technologies.
However, this view changed with the release of a six-year study carried out by Google on the most reliable SSD technology available. It concluded that there is, in fact, no real difference in data retention between the three SSD technologies. SLC-based SSDs are no more reliable than the other two technologies. (Note that other factors still do affect an SSD’s lifetime.)
Interesting to note in the Google SSD study is that 20% – 60% of the SSDs tested had at least one uncorrectable read error, resulting in data that simply cannot be recovered. Write errors were far less common, affecting 1.5% – 2.5% of drives. This is because if a write error occurs, the data is written to another part of the drive instead.
What does this mean? If a drive is reporting write errors, it is experiencing a major hardware failure, because it would have to completely run out of healthy space before it could no longer write the data to a non-failed portion of the drive. A write error should always leave the option of simply writing to another drive or location. Read errors are different: when data cannot be read back, it is almost certainly lost.
So What is the Most Reliable SSD Technology Available?
The conclusion in the Google SSD study leans more towards SSD failures resulting from the age of the drive and not the amount of use. “We see no evidence that higher-end SLC drives are more reliable than MLC drives within typical drive lifetimes.”
Heat and electricity are eventually what cause hardware failures in dedicated servers (especially if there is poor cooling and ventilation). Just because there are no moving parts does not mean that the hardware will never fail. Shutting down and rebooting servers also puts a lot of stress on all parts of dedicated server hardware. This will affect statistics on which SSD technology is the most reliable.
All of this can have a big impact on technology firms looking at their budget involving repairs and replacement parts. SLC drives seem to fare no better than the other SSD technologies.
When an SSD fails after developing hundreds of new bad blocks, the likely cause is chip failure. The user usually detects it during read operations, and less often during erase and write operations. Bad blocks are a frequent occurrence: depending on the manufacturer and model, 30-80% of drives will develop bad blocks in real-world use.
Fortunately, most of these drives develop only a few, typically 2-4. However, if an SSD develops more bad blocks than that, it is likely to go on to develop many more, numbering in the hundreds. “After only 2-4 bad blocks on a drive, there is a 50% chance that hundreds of bad blocks will follow.”
Interesting to note is that virtually every single drive has bad blocks to begin with. These are separate from the bad blocks mentioned previously, which develop through usage in a Data Center environment and could eventually lead to chip failure.
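As a rough illustration of how the 2-4 bad block observation could be used in day-to-day monitoring, the Python sketch below flags a drive once the bad blocks it has developed in use exceed four, while ignoring the bad blocks it shipped with. The `DriveReport` fields, the drive names, and the exact threshold are assumptions made for this example; they are not taken from the Google study or from any vendor tool.

```python
from dataclasses import dataclass

# Assumption: threshold derived from the observation quoted above that after
# 2-4 developed bad blocks there is a 50% chance hundreds more will follow.
GROWN_BAD_BLOCK_WARNING = 4


@dataclass
class DriveReport:
    """Hypothetical per-drive counters; the field names are illustrative."""
    name: str
    factory_bad_blocks: int  # already present when the drive shipped
    grown_bad_blocks: int    # developed during use


def needs_attention(report: DriveReport,
                    threshold: int = GROWN_BAD_BLOCK_WARNING) -> bool:
    """Return True once developed bad blocks exceed the threshold.

    Factory bad blocks are deliberately ignored, since nearly every
    drive ships with some.
    """
    return report.grown_bad_blocks > threshold


# Made-up sample data for the sketch.
reports = [
    DriveReport("ssd-a", factory_bad_blocks=120, grown_bad_blocks=0),
    DriveReport("ssd-b", factory_bad_blocks=35, grown_bad_blocks=3),
    DriveReport("ssd-c", factory_bad_blocks=10, grown_bad_blocks=7),
]
flagged = [r.name for r in reports if needs_attention(r)]
print(flagged)
```

Only the drive with seven developed bad blocks is flagged here. A real deployment would read these counters from the drive's own health reporting and would probably tune the threshold per model.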
What sets this study apart from many others is Google’s decision to carry it out in a real Data Center environment instead of a strictly controlled lab environment. Ten different drive models with different flash technologies were tested over a four to six-year period.
SSD Advantages and Reliability
The main advantages SSDs still have over Hard Disk Drives (HDDs) are read and write speeds and (adhering to this article’s topic of reliability) low failure and replacement rates.
However, more than 20% of the flash drives in the Google study developed uncorrectable errors, 30-80% developed bad blocks, and 2-7% developed bad chips. By comparison, reports on HDDs found that only 3.5% developed bad sectors in just under a 3-year period. Flash drives still have a much lower replacement rate within their rated lifetime. In other words, even though replacement rates for flash drives are better, their uncorrectable errors, bad blocks, and bad chips will still lead to problems for the user. This plays a large part in interpreting which SSD technology is the most reliable.
Also worth noting is that SSD manufacturers are able to use a spare chip should one become unusable. This helps SSDs appear better on paper in studies showing which SSD technology is the most reliable. The Google study did not allow for this.
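To put these percentages in budgeting terms, here is a small Python sketch that scales the ranges quoted in this article to a fleet of drives. The fleet size of 500 is purely hypothetical, and the function names are made up for the example. The figures are rates observed over the multi-year study period, not yearly rates, so treat the result as a rough planning aid rather than a forecast.

```python
# Fractions of drives affected, as quoted in this article from the
# Google study (observed over the whole study period, not per year).
STUDY_RANGES = {
    "uncorrectable read error": (0.20, 0.60),
    "bad blocks": (0.30, 0.80),
    "bad chips": (0.02, 0.07),
}


def affected_range(fleet_size: int, low: float, high: float) -> tuple[int, int]:
    """Scale a (low, high) fraction to a whole number of drives."""
    return round(fleet_size * low), round(fleet_size * high)


def fleet_estimate(fleet_size: int) -> dict[str, tuple[int, int]]:
    """Estimate how many drives in a fleet may show each kind of problem."""
    return {
        issue: affected_range(fleet_size, low, high)
        for issue, (low, high) in STUDY_RANGES.items()
    }


# Assumption: a hypothetical fleet of 500 SSDs.
estimate = fleet_estimate(500)
for issue, (low, high) in estimate.items():
    print(f"{issue}: {low}-{high} of 500 drives")
```

For 500 drives this works out to roughly 100-300 drives with at least one uncorrectable read error, 150-400 with bad blocks, and 10-35 with bad chips over the period covered by the study.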
Google also has work on techniques that can predict uncorrectable errors with surprising accuracy, based on factors such as the drive’s age and its prior errors.
For an audio presentation of this study please go here: https://www.usenix.org/conference/fast16/technical-sessions/presentation/schroeder
* Google’s study included MLC, SLC and eMLC (enterprise multi-level cell).
BAIRAVASUNDARAM, L. N., GOODSON, G. R., PASUPATHY, S., AND SCHINDLER, J. An analysis of latent sector errors in disk drives. In Proceedings of the 2007 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems (New York, NY, USA, 2007), SIGMETRICS ’07, ACM, pp. 289–300.