No Circuit Diversity = SPOF
The death blow of any mission-critical payment switch is a single point of failure (SPOF). There are the obvious ones, like relying on one application server with no high availability or fault tolerance architected into the design. There are also less obvious ones, like a lack of circuit diversity. I’ll pass along some lessons learned over the past week.
We urge our OLS.Switch clients to take a number of steps to maximize the uptime of their payment switch implementations. These include:
- Replicated application nodes with connections to all endpoints from each node (establish this need early with your authorizers)
- Content Service Switch (‘CSS’), aka a “load balancer”, fronting the nodes (and taking this to its logical conclusion, you want two of these)
- Virtual DB clusters
- OLS.Switch DB schema on a SAN
- QMUX configuration with two or more channels in the MUX definition connecting to physically separate lines
- The two lines provisioned by separate carriers – this practice is called ‘circuit diversity’…no sensitivity training required! Hey, it even has its own research initiative.
- HSRP built into the authorizer connections
Furthermore, we appreciate authorizer/endpoints that offer geographical diversity in their data centers, like in AMEX’s nice configuration where one connection goes to Phoenix (their ‘IPC’) and one to Greensboro, NC (their ‘NROC’). You have little control over this from your side, but I like to put this on the table early in planning meetings. If the authorizer doesn’t do it, we go on the record as comparing them to their peers and noting their shortcomings vs. best practices.
You can do all that and still get bitten by an unforeseen SPOF. Earlier this week, one of our clients got bitten, big-time. That ‘circuit diversity’ initiative referenced above? It states in part that “Manual assessment and periodic manual assurance are required to ensure that circuits are diverse and remain diverse over time.” Man, no truer words were ever written. One authorizer had what it thought was a dual-carrier approach, only to find out that both lines traced through the same central office (CO). When the CO tanked, so did 100% of the point-of-sale authorizations serviced by that endpoint…to the tune of more than $1M USD in lost sales. “Ouch” doesn’t do that justice. Now, our client’s excellent network team is working aggressively with this endpoint to engineer the SPOF out of the path.
I write here to help you avoid similar problems. Question your authorizers very carefully about their circuit diversity. Don’t take their word for it – ask them to demonstrate via manual assessment that the circuits are indeed diverse.
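If you can get each carrier or authorizer to hand over the list of facilities a circuit passes through, even a few lines of script will make a shared CO jump out. What follows is only a rough sketch of that idea in Python: the circuit names, the facility names, and the `shared_facilities` helper are all made up for illustration, and the real path data still has to come from the manual assessment itself.

```python
from itertools import combinations

# Hypothetical inventory: each circuit maps to the facilities it traverses,
# as reported by the carrier or authorizer. Every name here is invented.
CIRCUITS = {
    "carrier_a_line": ["client_dc", "co_main_st", "carrier_a_pop", "authorizer_dc"],
    "carrier_b_line": ["client_dc", "co_main_st", "carrier_b_pop", "authorizer_dc"],
}

# Endpoints that both circuits are expected to share; excluded from the check.
EXPECTED_SHARED = {"client_dc", "authorizer_dc"}


def shared_facilities(circuits, expected_shared=frozenset()):
    """Map each overlapping pair of circuits to the sorted facilities they share."""
    overlaps = {}
    for (name_x, path_x), (name_y, path_y) in combinations(sorted(circuits.items()), 2):
        # Anything in both paths that is not an expected endpoint is a potential SPOF.
        common = (set(path_x) & set(path_y)) - set(expected_shared)
        if common:
            overlaps[(name_x, name_y)] = sorted(common)
    return overlaps


if __name__ == "__main__":
    for pair, facilities in shared_facilities(CIRCUITS, EXPECTED_SHARED).items():
        print(f"{pair[0]} and {pair[1]} share: {', '.join(facilities)}")
```

An empty result only means the reported paths don’t overlap; it says nothing about whether the reports are accurate or current, which is why the assessment has to be repeated periodically.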