What is hot-swap and why would you need to test it in the first place?
I’ll start by saying I’m not about to delve into the technical intricacies of hot-swapping (our CEO, Mike, goes into greater detail about this as well as the things to consider when designing devices with this capability). Instead I’m stripping it back a bit – highlighting the issues commonly faced with the manual aspect of this operation, and why automating it in a testing environment is the solution.
Put simply, hot-swap is the action of replacing a component of a system while the system is still running – providing the system is hot-swap compatible.
Hot-swap is typically used at enterprise level by those working with high-speed data interfaces (think disk drives, self-driving cars, data centers). It is an essential operation in scenarios such as the following:
- One of the hard drives in your large server array has failed and you need to swap the faulty part without shutting down the whole server. You need to keep downtime to an absolute minimum while you swap it, and keep important customer data intact.
- You’re testing the integrity of a RAID system.
- You manufacture a new hard drive and need to know it meets the industry specifications for hot-swap.
- You need to be able to simulate, detect and report any likely faults in a live system (i.e. ‘debug’) before it gets used for anything involving lots of important data.
Relatively simple to understand. Not so simple, however, is the hot-swap process itself.
The trouble with hot-swap
Typically, hot-swap testing, specifically for drives, has involved removing and replacing components by hand. By nature, this is complex and riddled with issues that stem from the sheer variability introduced by human intervention.
Expect to hit the following obstacles with manual hot-swap:
- It’s inconsistent. In terms of speed and force, it’s impossible for a person to be precise when inserting or removing a cable or drive, so how can each test scenario be made 100% repeatable? Robots are sometimes used for this, but even they aren’t immune to error. That’s not especially helpful when you need an accurate picture of how your device performs in certain environments.
- It’s complicated. Generating and analyzing the results of your hot-swap test would usually involve writing your own Python script.
- It’s time-consuming. Imagine having to go round an entire array by yourself, plugging and unplugging things all day. Oh, and someone needs to be paid for that…
- It’s unhealthy. Granted, plugging and unplugging these devices all day really isn’t great for your hands, but it’s even less great for your sanity.
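To give a feel for the scripting burden mentioned above, here is a rough Python sketch that summarizes a set of manual hot-swap trials: the pass rate, and the spread in how long each insertion took to be detected. The record layout (`Trial`, `detect_ms`, `passed`) and the sample values are invented for this example; they don’t come from a real test log or from any Quarch tool.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Trial:
    """One manual hot-swap attempt (hypothetical record layout)."""
    detect_ms: float  # time from insertion until the host saw the drive
    passed: bool      # whether the drive came back up cleanly


def summarize(trials):
    """Return the pass rate plus the mean and spread of detection times."""
    if not trials:
        raise ValueError("no trials to summarize")
    times = [t.detect_ms for t in trials]
    return {
        "runs": len(trials),
        "pass_rate": sum(t.passed for t in trials) / len(trials),
        "mean_ms": mean(times),
        "spread_ms": pstdev(times),  # population standard deviation
    }


# Made-up sample data, for illustration only.
trials = [
    Trial(100.0, True),
    Trial(140.0, True),
    Trial(90.0, False),
    Trial(110.0, True),
]
summary = summarize(trials)
print(summary)
```

Even this toy version has to decide what to record and how to measure variability, and a large `spread_ms` is exactly the human inconsistency described in the first point.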
We should stress that we know this stuff because we’ve had to do it. Our founder once did this for a living, and if anything should serve as a warning sign against the headache of manual hot-swap, it’s the fact that he was compelled to go and design his own automated solution and make an entire business out of it.
Better to fake it
You read that right. Precision is key to creating scenarios that are repeatable, and repeating those scenarios gives you the best picture of how your device performs. Automation removes the need for human intervention at critical or production stages, thus cutting out a lot of potential errors and dangers (both to the human and the device). That’s what compelled us at Quarch to design a system that creates a hot-swap event, or system failure, without the need to touch anything other than a button.
To do the job, you need the Quarch Torridon system plus our software, such as QPS or Drive Test Suite (an interface that records and displays all your resulting test data in a comprehensive way, removing the need for you to code your own). When plugged into your host system and your drive, the tools perform all the functions of hot-swap while the system is still intact, and they work together seamlessly, cutting out a huge chunk of the expense and time of manual testing. They can also perform many more tests than are achievable manually.
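To make “repeatable” a little more concrete, here is a minimal Python sketch of a scripted hot-swap cycle in which every pull and plug uses the same timing. `SimulatedModule` and its `pull`, `plug` and `drive_present` methods are hypothetical stand-ins written for this example. They are not the Quarch API, and real hardware would take the place of the simulation.

```python
import time


class SimulatedModule:
    """Hypothetical stand-in for a hot-swap module between host and drive."""

    def __init__(self):
        self.connected = True

    def pull(self):
        # Simulate disconnecting the drive from the host.
        self.connected = False

    def plug(self):
        # Simulate reconnecting the drive to the host.
        self.connected = True

    def drive_present(self):
        return self.connected


def run_cycles(module, cycles, off_time_s):
    """Run identical pull/plug cycles and record whether each one passed."""
    results = []
    for _ in range(cycles):
        module.pull()
        removed = not module.drive_present()
        time.sleep(off_time_s)  # the same off-time on every cycle
        module.plug()
        # A cycle passes if the drive disappeared and then came back.
        results.append(removed and module.drive_present())
    return results


results = run_cycles(SimulatedModule(), cycles=5, off_time_s=0.01)
print(f"{sum(results)}/{len(results)} cycles passed")
```

The point of the sketch is the fixed `off_time_s`: a script applies the same delay on every cycle, which a person pulling drives by hand cannot do.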
Quarch hot-swap modules are now considered the ‘gold standard’ for this type of testing and are even mandatory components of testing the new NVMe storage standard in the US.
Feel free, also, to read more about our automated hot-swap and fault injection tools.