How To Solve SSD Longevity Challenges

There are lots of reasons to use SSD storage devices. SSD devices are lightning fast. You don’t have to wait for the drive to spin up a thin sheet of metal (or polymer). You don’t have to wait for a drive head to get properly positioned over the right physical location on the drive. [Note: This was a real problem with legacy disk storage – until EMC proved that cache is the best friend that any spinning media could have.] Today, solid state storage is demonstrably smaller than any spinning storage media. But in the past few years, SSD longevity has become a serious concern.

Background

Many of us carry a few of these devices with us wherever we go. They are a very durable form of storage. You can drop a 250GB thumb drive (or SSD) into your pocket and be confident that your storage will be unaffected by the motion. If you did the same with a small hard disk, then you might find that data had been lost due to platter and/or R/W head damage.

Similarly, the speed, power, and thermal properties make these devices a fantastic fit for any mobile platform – whether it be a mobile phone, a tablet, or even a laptop. In fact, we just added SSD devices to a number of our systems. With these devices, we have exceptionally good multi-boot options at our disposal. For my personal system, I can boot to the laptop’s main drive (to run an Ubuntu 19.04 system) or I can boot to an external, USB-attached SSD where I have Qubes 4.0 installed.

Whether you want fast data transfer speeds, reduced power needs, or a reduced physical footprint, SSD storage is an excellent solution. But it is not without its own drawbacks.

Disadvantages

No good solution comes without a few drawbacks. SSD is no exception. And its two real drawbacks are cost and longevity. The cost problems are real. But they are diminishing over time. As more phones come with additional storage (e.g., 128GB – 256GB of solid state storage on recent flagship phones), the chip manufacturers have responded with new fabrication facilities. Even more importantly, there is now substantial supply competition. And increased supply results in price reductions.

Just as importantly, device construction is becoming less complex. Manufacturers can stuff an enclosure with power, thermal flow controls, media, rotational controls (e.g., stepper motors, servos), and an assortment of programmable circuits. Or they can just put power and circuits into a chip (or chip array). For things like laptops, this design streamlining is allowing vendors to swap spinning platters for additional antenna arrays. The result is inevitable. Manufacturing is less complex. Integration costs (and testing costs) are lower. This means that the unit costs of manufacturing are declining.

Taken together, increased supply and decreased costs have bent the production function. So SSD is an evolutionary technology that is rapidly displacing spinning media. But there is still one key disadvantage: SSD longevity.

SSD Longevity

In the late eighties, the floppy disk began to be displaced by optical media: the CD-ROM. In the nineties, the CD-ROM gave way to the DVD-ROM. But in both of these transitions, the successor technology had superior durability and longevity. That is not the case for SSD storage. If you were to treat an EEPROM like a CD-ROM or DVD-ROM (i.e., write it once and only read it thereafter), it would probably last for 10+ years. But the cost per write would be immense.

Due to their current costs, no one is using SSD devices for WORM (write once, read many) storage. These devices are just too costly to be used as an analog to tape storage. Instead, SSDs are being used for re-writable storage. And this is where the real issue arises. As you re-write data (via electrical erasure followed by a new write), the specific physical location in the chip becomes somewhat unstable. After numerous cycles, this location can become unusable. So manufacturers now publish the number of program / erase cycles (i.e., P/E cycles) that their devices are rated to deliver.
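To make those endurance ratings concrete, here is a minimal sketch of how you might estimate a drive's working lifespan from its rated write endurance. The capacity, TBW (terabytes written) rating, and daily write volume below are illustrative assumptions, not figures from any particular product.

```python
# Rough endurance estimate from a drive's rated TBW (terabytes written).
# All numbers used in the example are illustrative assumptions.

def estimated_lifetime_years(tbw_rating_tb: float, daily_writes_gb: float) -> float:
    """Years until the rated write endurance would be exhausted."""
    total_writes_gb = tbw_rating_tb * 1000  # TB -> GB
    days = total_writes_gb / daily_writes_gb
    return days / 365

# Example: a hypothetical 1 TB consumer drive rated for 600 TBW,
# written at a steady 50 GB per day.
years = estimated_lifetime_years(600, 50)
print(f"{years:.1f} years")  # roughly 32.9 years at this write rate
```

At typical desktop write rates the rated endurance outlasts the rest of the hardware; the risk concentrates in write-heavy workloads (databases, logging, scratch space), where the same arithmetic yields single-digit years.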

But is there a real risk of exhausting the re-write potential of your SSD device? Yes, there is. And with every new generation of chips, the probability of failure is declining. Nevertheless, probabilities are not your biggest concern. Most CIOs should be concerned with risk. If your data is critical, then the risk is real – regardless of the probability of failure.

Technology Is Not The Answer

Most technologists focus on technology. Most CIOs focus on cost / benefit or risk / reward. While scientific and engineering advances will decrease the probability of SSD failure, these advances won’t really affect the cost (and risks) associated with an inevitable failure. So the only real solutions are ones that mitigate a failure and minimize the cost of recovery. When a failure occurs (and it will occur), how will you recover your data?

Bypass The Problem

One of the simplest things that you can do is to limit the use of your SSD devices. That may sound strange. But consider this: when a failure occurs, your system (OS and device drivers) will mark a “sector” as bad and write the data to an alternate location. If such a location exists, then you continue ahead without incurring any real impact.

The practical upshot of this is that you should always limit how much data is written to the device, ensuring that there is ample space for rewriting data to a known “good” sector. Personally, I’m risk averse. So I usually recommend that you limit SSD usage to ~50% of total space. Some people will let usage run higher, leaving only ~30% unused. But I would only recommend that smaller amount of unused space if your SSD device is rated for higher P/E cycles.
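As a sketch of how you might keep an eye on that limit, the following checks a filesystem's usage against a threshold. The mount point and the 50% threshold are assumptions taken from the guideline above; adjust both for your own device.

```python
# Headroom check matching the "keep SSD usage under ~50%" guideline.
# The mount point and threshold are assumptions; adjust for your setup.
import shutil

def usage_fraction(path: str) -> float:
    """Fraction of the filesystem at `path` currently in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total

def headroom_ok(path: str, max_usage: float = 0.5) -> bool:
    """True if usage is at or below the recommended limit."""
    return usage_fraction(path) <= max_usage

if __name__ == "__main__":
    mount = "/"  # assumed mount point of the SSD
    if headroom_ok(mount):
        print("OK: usage is within the recommended limit")
    else:
        print("WARNING: consider freeing space on the SSD")
```

A check like this can run from cron (or a systemd timer) so that you get a warning well before wear-leveling runs out of spare room.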

Data Backup and Recovery Processes

For most people and most organizations, it takes a lot to recover from a failure. And this is true because most organizations do not have a comprehensive backup and recovery program in place. In case of an SSD failure, you need good backups. And you should continue to perform these backups until the cost of making backups exceeds the cost of recovering from a failure.

For a homeowner who has a bunch of Raspberry Pis running control systems, the cost of doing backups is minimal. You should have good backups for every control system that you operate. For our customers, we recommend that routine backups be conducted for every instance of Home Assistant, OpenHab, and any other control system that the customer operates.
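As a minimal sketch of such a routine backup, the following archives a control system's configuration directory into a timestamped tarball. The source and destination paths shown in the usage comment are hypothetical placeholders, not paths from any particular install.

```python
# Timestamped backup sketch for a small control system's config directory.
# The paths in the usage example are hypothetical placeholders.
import tarfile
import time
from pathlib import Path

def backup_config(src_dir: str, dest_dir: str) -> Path:
    """Create a gzip'd tar archive of src_dir inside dest_dir; return its path."""
    src = Path(src_dir)
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    archive = dest / f"{src.name}-{stamp}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(src, arcname=src.name)
    return archive

# Example usage (placeholder paths):
# backup_config("/home/pi/homeassistant", "/mnt/backup")
```

Crucially, the destination should be a different physical device (a NAS, a second drive, or cloud storage); a backup stored on the same SSD fails along with it.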

For small businesses, we recommend that backup and recovery services be negotiated into any management contract that you have with technology providers. If you have no such contracts, then you must make sure that your “in-house” IT professionals take the job of backup and recovery very seriously.

Of course, we also recommend that there be appropriate asset management, change management, and configuration management protocols in place. While not necessary in a home, these are essential for any and all businesses.

Bottom Line

SSD devices will be part of your IT arsenal. In fact, they probably already are a part of your portfolio – whether you know it or not. And while SSD devices are becoming less costly and more ubiquitous, they are not the same as HDD technology. Their advantages come at a cost: SSD longevity. SSD devices have a higher probability of wear-related failure than already-established storage technologies. Therefore, make sure that you have processes in place to minimize the impact of failures and to minimize the cost of conducting a recovery.