Sick But Not Dead in Your SANs
As storage area network (SAN) environments grow and become increasingly centralized, many small and seemingly insignificant problems have the potential to become large-scale, high-impact events unless properly contained or controlled. The ability to predict, understand and mitigate the impact of “sick but not dead” (SBND) conditions in SAN space has become critical to operating a healthy and reliable storage environment.
A single bad initiator (saturated, non-responsive or erroneous) can severely impact its target ports and the paths to those targets. It can also spread the problem to other, unrelated data flows that share the same data path, even in part. The same applies to faults occurring anywhere else within the data path (e.g. on ISLs or on the target ports themselves). These conditions are of little consequence when a given fault causes the port to go offline or lose PLOGI/FLOGI state, as failover occurs and data flow continues. The real issue arises when a faulty link remains online and shows signs of intermittency or lack of responsiveness.
In these cases, we normally have to rely on the intelligence of the multipathing software, the storage controllers’ QoS functionality and the switch port blocking/fencing logic to identify the behavior and redirect traffic to healthy paths while excluding the intermittent ones. Unfortunately, this is an area in which many technology vendors fall short.
On the multipath driver front, most vendors do not perform prolonged link integrity checks before reinstating a data path after failure. Most are also incapable of dealing with faults that manifest only as non-responsiveness (no clear error accumulation). Generally, one of three things happens: 1) the fault is unrecognized and data frames continue to be sent via the faulty/non-responsive path; 2) the fault is recognized and the path is excluded, but it is quickly reinstated upon the first successful TUR (Test Unit Ready command); 3) the path is excluded and remains offline until manually reselected for IO by the user. The first two cases result in an application hang and, depending on the timeout values set, eventual application service failure. In the last case, application IO survives, but link redundancy is compromised (disk IO will fail on the first subsequent fault on the remaining link, since no alternates are available).
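The second failure mode above, reinstatement on the first successful TUR, can be sketched in a few lines. This is a hedged illustration, not any vendor’s actual code; the probe sequence and function name are invented:

```python
# Scripted probe outcomes for an intermittent path: it drops an
# operation every few IOs but still answers most TUR probes.
PROBE_OUTCOMES = [False, True, True, False, True, True]

def naive_reinstate(outcomes):
    """Naive multipath logic: put the path back in service on the
    FIRST successful TUR, ignoring all history of intermittency."""
    for attempt, ok in enumerate(outcomes, start=1):
        if ok:
            return attempt  # path is back in rotation
    return None  # never reinstated

# The clearly sick path is back in rotation after just two probes.
print(naive_reinstate(PROBE_OUTCOMES))  # 2
```

A single good probe tells the driver nothing about intermittency, which is exactly why the fault keeps recurring.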
On the storage controller side, in the case of backpressure/buffer credit starvation, even where QoS is available, most vendors implement it by simply applying throttling at the WWPN/host object/LUN level; they never look within the data flow to identify the initiator(s) responsible for the backpressure. Since no specific workload profile causes buffer credit starvation (it can happen at 15 MBps or at 1600 MBps), most controllers are unable to effectively prevent a single problematic initiator (or initiator group) from continuously and negatively impacting all other healthy workloads sharing the target port.
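To illustrate why per-initiator attribution matters more than throughput-based throttling, here is a minimal sketch. The counter names and threshold are invented for illustration; as noted above, most controllers do not actually expose a per-initiator credit-stall breakdown:

```python
# Hypothetical per-initiator statistics sampled at a shared target port.
# 'zero_credit_ms' = time the port spent credit-starved on behalf of
# that initiator during a one-second sample window (invented counter).
samples = {
    "wwpn_host_a": {"mbps": 15,   "zero_credit_ms": 820},
    "wwpn_host_b": {"mbps": 1600, "zero_credit_ms": 12},
    "wwpn_host_c": {"mbps": 400,  "zero_credit_ms": 9},
}

def slow_drain_suspects(samples, window_ms=1000, stall_pct=0.5):
    """Flag initiators that kept the port credit-starved for more than
    stall_pct of the window. Note that throughput alone reveals nothing:
    the 15 MBps host is the culprit here, not the 1600 MBps one."""
    return [wwpn for wwpn, s in samples.items()
            if s["zero_credit_ms"] / window_ms > stall_pct]

print(slow_drain_suspects(samples))  # ['wwpn_host_a']
```

Throttling by WWPN bandwidth would have punished host_b, the fastest but healthy initiator, while leaving the actual slow-drain device untouched.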
Finally, from a switch perspective, while port fencing and similar approaches exist and are generally effective for treating SBND conditions, they are only as good as the error counters and signal metrics available to the switch for assessment. If a fault falls below the set threshold, or does not generate an identifiable error counter accumulation, the port fencing logic will also ignore it and allow the problem behavior to continue.
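A simplified sketch of the threshold problem: fencing is typically a counter-versus-threshold decision per time window, so a fault that accumulates errors just under the threshold is never acted on. The threshold value below is illustrative, not a vendor default:

```python
def fence_decision(crc_errors_per_min, threshold=25):
    """Simplified port-fencing rule: fence the port when CRC errors
    observed in a one-minute window reach the configured threshold.
    (Threshold is an illustrative assumption, not a vendor default.)"""
    return crc_errors_per_min >= threshold

# A hard fault trips fencing; a sub-threshold intermittent fault --
# still enough to stall sensitive workloads -- sails right under it.
print(fence_decision(300))  # True: port is fenced
print(fence_decision(20))   # False: fault continues unnoticed
```

The same shape of rule applies to any counter the switch exposes; faults that never increment a counter at all are invisible to it entirely.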
Without getting into the topic of application resiliency, and speaking strictly from a SAN perspective, the above means that no single component is capable of handling all SBND conditions alone. All three layers of protection are required, and all three need to be improved further by technology vendors in order to counter SAN SBND conditions effectively.
Identifying faults and recreating conditions in the lab using an error injector
Troubleshooting large-scale SAN performance issues, especially those with roots buried within the FC protocol, is an interesting but lengthy undertaking. Proving to a technology vendor that a bug exists and a fix is required is even more painful and often difficult to achieve. A problem behavior can typically be tracked down to a particular host/workload using conventional performance monitoring utilities (HDS Tuning Manager, IBM VSC, Brocade BNA, etc.). Once identified, either the hostile workload is treated or the faulty component is fixed. The question that remains, however, is how to prevent similar widespread events in the future. Often this leads to the need to recreate a given fault in a lab environment, which is problematic without an appropriate toolset. Workloads can be simulated with ease, but protocol-level issues are not that simple to recreate. For this specific purpose, an in-band protocol analyzer and error injection utility is now an integral part of my troubleshooting arsenal. I selected a device from Teledyne LeCroy’s SierraNet product line. There are others out there with various pros and cons, but for me, SierraNet fit the requirement from all angles (compliments to the SierraNet developers for a well-thought-through product and great support. Well done!).
In-band protocol analyzers give a full view of the FC data exchange; they are essentially “Wireshark” for FC space. Conventional utilities and device logging provide only the select information that a particular vendor chooses to expose. Most of the time this is sufficient, but on some occasions it is not, and looking at protocol-level data in such instances often reveals the one missing piece of evidence that solves the case.
The error injection capability of these tools is of even higher importance to me, however. With an error injector, I can introduce any condition into the data flow: CRC errors, C3 discards, R-RDY removals, flapping links, etc. Any command or frame within an FC exchange is exposed and modifiable. To date, I have used this tool and approach to demonstrate product exposures to a number of vendors, to identify internal design exposures, and to validate and tune remediation approaches prior to their implementation.
Typically, this type of technology is deployed in vendor labs only, where it plays a crucial role in the product development and testing lifecycle; it is rare to see FC error injection technology deployed on end user premises. That said, given the number of product resiliency exposures I see and the benefit of actually testing solutions rather than taking vendors’ word for it, I feel that many large SAN technology consumers could gain from deploying these devices in their labs as well. Not only is this beneficial from a user perspective in terms of risk mitigation, it also benefits the industry as a whole, as more and more customers become capable of identifying and reporting QA mistakes or oversights to SAN technology vendors.
To put the above into context, here is an example of how an error injector fits into a customer’s environment when dealing with a typical SBND fault. An application fails due to a non-responsive disk; it is identified that 1 of 8 paths to the disk is non-responsive due to signal degradation (light leakage); after the faulty port is disabled, traffic is automatically redistributed onto the 7 remaining healthy paths and storage connectivity is restored. This issue raises two questions: could the switch have auto-disabled the degraded port (in this case yes, with the port fencing feature enabled, due to clear port error count accumulation and measurable signal loss), and why didn’t the multipath driver exclude the intermittent path? For the multipathing concern, without the error injector, what would have followed is a lengthy log exchange routine with the multipath driver vendor, discussions with the array and switch vendors, and likely no quick admission of a fault from any of the three parties. With the error injector, I was able to quickly create an error injection scenario in our lab mimicking the event (i.e. corrupt the CRC header every 300 ms on one of x-num links) and prove that said multipath driver does not exclude paths based on such intermittent behavior, resulting in an almost complete IO halt, while an alternative multipath driver handles the issue as expected and excludes the intermittent path. After a quick WebEx demonstration to the multipath vendor’s engineers, a bug was identified and remediation development work started. While this specific fault could be simulated without an error injector (by pulling the LC connector out slightly, for example), that would be far less controlled: you could not specify the frequency of the problem event, which is often the key factor in why intermittency is not addressed by the multipath driver’s failover logic.
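The “frequency is the key factor” point reduces to simple arithmetic: if a hypothetical driver excludes a path only once N errors accumulate within a detection window, an injected error every 300 ms can disrupt IO continuously without ever tripping the rule. The window size and trip count below are assumptions for illustration, not any vendor’s actual values:

```python
def errors_in_window(error_interval_ms, window_ms):
    """Number of injected errors landing inside one detection window,
    for a fault injected at a fixed interval (as with the analyzer's
    'corrupt CRC every 300 ms' rule described above)."""
    return window_ms // error_interval_ms

def driver_trips(error_interval_ms, window_ms=1000, trip_count=10):
    """Hypothetical failover rule: exclude the path only if trip_count
    errors accumulate within one window (illustrative values)."""
    return errors_in_window(error_interval_ms, window_ms) >= trip_count

# An error every 300 ms yields only 3 errors per one-second window:
# the path is never excluded even though IO is continuously disrupted.
print(driver_trips(300))  # False: intermittency evades the rule
print(driver_trips(50))   # True: a 'hard' fault trips it
```

This is precisely the knob an injector gives you that a loose LC connector cannot: the error interval can be tuned to sit just below whatever window a given driver uses.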
The above is just one of many examples where an error injection technique is useful. Recreations can simulate the impact of congested ISLs, erroneous target ports, flapping links and many other conditions. The results are often surprising in terms of how fragile enterprise SAN technology actually is.
Sick but not dead (SBND) conditions are always the most difficult to address, but logic does exist to counter such behaviors. A simple link/path exclusion rule that continuously tests a marked link for integrity for x-num minutes before reinstating it would greatly reduce the number of SBND exposures that exist today. This applies not only to host multipath drivers but also to the similar software deployed within storage controllers themselves. In my opinion, technology vendors should spend more time assessing and developing failover logic to address these intermittent behaviors and thus greatly improve the fault tolerance of their products. From a consumer perspective, the use of error injection techniques can serve as a vital addition to the internal design certification process, as well as significantly reduce troubleshooting hours spent, not to mention prevent downtime.
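The exclusion rule proposed above can be sketched as a small state machine: after a failure, the path must sustain an unbroken run of clean integrity probes (standing in for x-num minutes of testing) before it carries IO again. This is a hedged sketch with invented probe sequences, not a driver implementation:

```python
def probation_reinstate(outcomes, clean_probes_required=5):
    """Proposed rule, sketched: after a failure, the path must pass a
    run of consecutive integrity probes (a stand-in for 'x-num minutes'
    of clean testing) before it is reinstated. Any failure during
    probation resets the clean run to zero."""
    clean_run = 0
    for tick, ok in enumerate(outcomes, start=1):
        clean_run = clean_run + 1 if ok else 0
        if clean_run >= clean_probes_required:
            return tick  # reinstated only after a sustained clean run
    return None  # still intermittent: path stays excluded

# An intermittent link keeps resetting probation and is never reinstated;
# a genuinely recovered link comes back after one sustained clean run.
intermittent = [True, True, False, True, True, True, False, True]
recovered    = [False, True, True, True, True, True]

print(probation_reinstate(intermittent))  # None: stays excluded
print(probation_reinstate(recovered))     # 6: reinstated after probe 6
```

Contrast this with the first-successful-TUR behavior described earlier: the intermittent sequence above would have been reinstated on its very first good probe.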
Author - Armen Veloumian is a storage area network (SAN) engineer currently employed by one of Canada’s leading financial institutions. He has been administering, designing, deploying and validating various storage technologies for over a decade. His current focus is mainly on SAN planning, design and deployment, as well as technology validation and testing.