Yesterday, an Acquirer switch client of ours called to tell me that one of the production channel connections to their ECA provider had been down for the last 24 hours. He wondered why the application syslog showed no history of the outage.
Answer: they hadn't finished configuring the channel. See to the left our jPOS-EE-fueled UI screen depicting the channel configuration. You have to specify Timeout and On Timeout values for the system component to be moved to the proper error status, and you have to specify Max Events for the corresponding status change history to be reflected in the syslog.
I've shown the FDR channel above as an example. I recommended that the ECA channels be set up in a similar manner. My client asked a really good question: "60 seconds... isn't that a little tight for a channel timeout?" The question mixes up two concepts, the active timeout and the passive timeout. Both are in play here.
He was thinking about what I call the 'active timeout.' As described here, this concept is implemented at the channel's socket level and is specified in the deploy file (see image at left; production host:port information is redacted). From that earlier post:
Alejandro advised us to implement a property value that will set a socket-level timeout. The ‘receive’ function in the multiplexer’s ‘Channel Adapter’ will fail (with log event ‘<io_timeout>’) if nothing is received within the specified timeout period. The channel will disconnect and then attempt a reconnection.
We've got FDR set, as you see, to 800000 ms (13 min 20 s). FDR is a high-volume gateway. If we've received nothing of any kind on it for that long, we want to proactively break and cycle the connection. This simple, easy-to-implement concept has greatly improved our up-time and reduced the volume of our support phone calls.
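For readers who want to see the active-timeout idea in miniature, here is a rough sketch in Python. It is not jPOS code and not our production implementation: `receive_loop`, `connect`, `on_message`, and `max_cycles` are hypothetical names I made up for illustration. The only things borrowed from the post are the 800000 ms value and the `io_timeout` event name. The idea is simply that a socket read which sees nothing within the timeout drops the connection and tries again.

```python
import socket

ACTIVE_TIMEOUT_S = 800.0  # 800000 ms = 13 min 20 s, the FDR value above


def receive_loop(connect, on_message, timeout_s=ACTIVE_TIMEOUT_S, max_cycles=1):
    """Read from a channel; after timeout_s of silence, drop and reconnect.

    connect    -- hypothetical factory returning a connected socket
    on_message -- callback invoked with each chunk of received bytes
    max_cycles -- connections to attempt before returning (keeps the sketch finite)
    Returns one event name per connection that ended.
    """
    events = []
    for _ in range(max_cycles):
        sock = connect()
        sock.settimeout(timeout_s)  # socket-level read timeout
        try:
            while True:
                data = sock.recv(4096)
                if not data:  # empty read: the peer closed the connection
                    events.append("peer_closed")
                    break
                on_message(data)
        except socket.timeout:  # nothing received within timeout_s
            events.append("io_timeout")
        finally:
            sock.close()  # rupture the connection; the next loop pass reconnects
    return events
```

In a real channel the loop would run indefinitely and would probably wait a moment before reconnecting; both are left out here to keep the sketch short.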
The 'passive timeout' - implemented in these examples by the 60-second value depicted above - is something quite different but no less important. As described here, in the jPOS-EE framework...
...the status subsystem works in a passive way. Each status entry is 'touched' from the system component that controls it. Each component has an associated timeout. If a status entry has not been touched within the given timeout period, then somebody needs to move that entry from its current ‘OK’ status to the destination error condition (e.g., WARN, CRITICAL, OFF, etc.). When a jPOS-EE Web UI user hits the status page, we run that check on all entries; but if for some reason the users are not looking at that page, we need a way to check all statuses and move timed-out entries to the specified error condition. 02_status_heartbeat is in charge of that!
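To make the passive model concrete, here is a minimal Python sketch. It is not the jPOS-EE status subsystem: `StatusBoard`, `StatusEntry`, `touch`, and `check_all` are hypothetical names, and I am assuming that an unset Timeout (0) means an entry never leaves 'OK' and that an unset Max Events (0) means no change history is kept, which matches the behavior my client ran into.

```python
import time
from collections import deque
from dataclasses import dataclass, field


@dataclass
class StatusEntry:
    timeout_s: float   # "Timeout" in the UI; 0 is assumed to mean "not configured"
    on_timeout: str    # "On Timeout", e.g. "WARN", "CRITICAL", "OFF"
    max_events: int    # "Max Events": number of status changes to keep
    state: str = "OK"
    last_touch: float = 0.0
    history: deque = field(default_factory=deque)


class StatusBoard:
    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.entries = {}

    def register(self, name, timeout_s, on_timeout, max_events):
        self.entries[name] = StatusEntry(
            timeout_s, on_timeout, max_events, last_touch=self.clock()
        )

    def _set_state(self, entry, new_state):
        if entry.state == new_state:
            return
        entry.state = new_state
        if entry.max_events > 0:  # history is recorded only when Max Events is set
            entry.history.append((self.clock(), new_state))
            while len(entry.history) > entry.max_events:
                entry.history.popleft()

    def touch(self, name):
        """Called by the component that owns the entry whenever it is alive."""
        entry = self.entries[name]
        entry.last_touch = self.clock()
        self._set_state(entry, "OK")

    def check_all(self):
        """Move every entry that has not been touched in time to its error state."""
        now = self.clock()
        for entry in self.entries.values():
            if entry.timeout_s > 0 and now - entry.last_touch > entry.timeout_s:
                self._set_state(entry, entry.on_timeout)
```

In this sketch, `check_all` is what would run both when a user opens the status page and on a periodic heartbeat. An entry registered with 60 / "CRITICAL" / 10 flips to CRITICAL and logs the change once 60 seconds pass without a touch, while an entry registered with zeros sits at 'OK' forever and records nothing.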
-- 2009-08-19 Update to Original Post --
Federico asks: Seems kind of redundant to me to have an 'active timeout' and 800s at the same time. Am I missing something?
AAO: Very good comment. The 800-series network messages aren’t foolproof. They’re great at keeping the line active during your volume troughs so that the endpoint doesn’t disconnect and cycle your connection. They’re not great at determining and resolving hung lines. The ‘active’ timeout approach works to resolve the ‘half-baked’ channel connection, i.e., the situation where you think you’re connected, but you’re actually not.