You want a timeout

April 27, 2014 · 3 min read

jPOS project founder

Every single week for the last 14 years, I have had discussions with developers, CTOs and CIOs about channel timeouts.

The discussions usually start with a customer requirement asking us to keep established socket connections open forever.

They say "We want the socket to stay always connected, forever. We don't want to see disconnects. Our systems are very reliable, our remote endpoint partners are very reliable, we don't want a timeout".

So I usually start with the Fallacies of distributed computing but I'm never lucky. I try to explain that I don't want to die, but it just so happens that I will certainly die, sooner or later. It's life.

Disconnections happen. Networking problems happen all the time: routers and firewalls reboot, and, in the most evil situation of all, a paranoid firewall administrator configures very tight timeouts.

When jPOS is the client, and the channel is idle for a long period of time, having no timeout is actually not a big deal. Imagine a situation where the channel has been idle for, say, 5 minutes, but our paranoid firewall administrator has set a timeout of 3 minutes to disconnect the session. jPOS believes we are connected, but we actually are not, so when a real transaction arrives and we try to send it, we find out we are no longer connected. That raises an exception, we reconnect, and we send the message (a few seconds later). So the problem is just a delay that may put us out of the SLA for this particular transaction, but it's still not a big deal; the system will recover nicely.

But when jPOS is the server and we don't have a timeout, the client will establish a new connection while the old one remains connected forever, because the server never finds out it is dead. A few hours or days later, these connections will accumulate and we'll hit the maxSessions of the QServer configuration (see the Programmer's Guide section 8.4). The only way to recover is to restart that particular QServer, something that needs to be done manually.

You can set SO_KEEPALIVE at the channel level in order to detect these broken connections and to prevent some firewalls from disconnecting your session, but the KEEPALIVE time is OS-dependent.
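To make the socket option concrete, here is a minimal sketch using plain java.net rather than any jPOS API. The class and method names are hypothetical, and the throwaway local listener exists only so the example has something to connect to. Note that Java only lets you switch the option on; how often probes are sent is still decided by the operating system.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

public class KeepAliveSketch {
    /** Connects to a throwaway local listener and enables SO_KEEPALIVE on the client socket. */
    static boolean connectWithKeepAlive() throws IOException {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try (ServerSocket listener = new ServerSocket(0, 1, loopback);
             Socket client = new Socket(loopback, listener.getLocalPort())) {
            // Ask the OS to probe the peer when the connection is idle.
            // The probe interval is an OS setting, not something set here.
            client.setKeepAlive(true);
            return client.getKeepAlive();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("SO_KEEPALIVE enabled: " + connectWithKeepAlive());
    }
}
```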

Our recommendation is to send network management messages from time to time (e.g. every 5 minutes) and have a reasonable timeout of, say, 6 minutes.
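The reason the timeout sits slightly above the message interval can be sketched in a few lines. This is only an illustration of the arithmetic, not jPOS code: the class, constants and method below are hypothetical names, and the 5 and 6 minute values are the ones suggested above.

```java
import java.time.Duration;
import java.time.Instant;

public class IdleTimeoutSketch {
    // Values taken from the recommendation above; the names are hypothetical.
    static final Duration ECHO_INTERVAL = Duration.ofMinutes(5);
    static final Duration IDLE_TIMEOUT = Duration.ofMinutes(6);

    /** True when nothing has been received for longer than IDLE_TIMEOUT. */
    static boolean isExpired(Instant lastReceived, Instant now) {
        return Duration.between(lastReceived, now).compareTo(IDLE_TIMEOUT) > 0;
    }

    public static void main(String[] args) {
        Instant lastReceived = Instant.parse("2014-04-27T00:00:00Z");

        // Healthy link: the next network management message arrives on schedule,
        // before the timeout has a chance to fire.
        Instant nextEcho = lastReceived.plus(ECHO_INTERVAL);
        System.out.println("expired at next echo: " + isExpired(lastReceived, nextEcho));

        // Dead link: nothing has arrived for 7 minutes, so the session is dropped.
        Instant later = lastReceived.plus(Duration.ofMinutes(7));
        System.out.println("expired after silence: " + isExpired(lastReceived, later));
    }
}
```

With the timeout one minute longer than the interval, a healthy but otherwise idle connection is never dropped, while a dead one is cleaned up within roughly six minutes of its last message.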

There's another situation where you want a timeout. Imagine an ideal network (I call it 'Disney LAN') where the connection remains ESTABLISHED from a TCP/IP standpoint, but the remote host's application is dead and is not answering your requests. You can of course detect that at the application level (e.g. in the MUX) and proactively initiate a reconnection, but if that logic fails (or you never implemented it), a reasonable timeout will recover automatically from the situation. The remote host doesn't reply, the call to channel receive times out, we reconnect, and with a little bit of luck, we get to connect to a new session that actually works.
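The 'Disney LAN' case can be reproduced with plain sockets. The sketch below does not use the jPOS channel API; it relies on java.net.Socket's read timeout (setSoTimeout) to show the underlying mechanism, and the class and method names are hypothetical. A local listener completes the TCP handshake but never sends anything, standing in for a remote application that is connected but dead.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class ReadTimeoutSketch {
    /** Returns true if a read against a silent peer gives up after timeoutMillis. */
    static boolean timesOutOnSilentPeer(int timeoutMillis) throws IOException {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        // The listener accepts the connection at the TCP level but never writes a byte.
        try (ServerSocket silentHost = new ServerSocket(0, 1, loopback);
             Socket channel = new Socket(loopback, silentHost.getLocalPort())) {
            channel.setSoTimeout(timeoutMillis); // without this, read() would block forever
            try {
                channel.getInputStream().read();
                return false; // the peer answered or closed the connection
            } catch (SocketTimeoutException e) {
                return true;  // no reply in time: the caller would close and reconnect
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("timed out: " + timesOutOnSilentPeer(200));
    }
}
```

The 200 ms value is only there to keep the demonstration quick; in production the timeout would be in the range of minutes, as discussed above.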