
Pause Timeout

Alejandro Revilla
jPOS project founder

Back in 2007, when hardware was not as fast as it is today, we introduced continuations: the ability to PAUSE (or park) a given transaction while it waits for a response from a remote server. Think of it as a kind of reactive programming.

If you have, say, a four-CPU machine and need to process one thousand simultaneous transactions, without continuations you would need to set your TransactionManager sessions to a disproportionately large number (such as 1000). That number bears no relation to your hardware, and it makes the entire system slower because of the context-switching overhead.

With continuations, a small number of sessions (e.g., twice the number of CPUs you have) can handle a large number of transactions, because those waiting for a response are offloaded from the transaction manager until the response arrives.
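To make the idea concrete, here is a toy model written with plain JDK classes. It is only a sketch under stated assumptions: `ContinuationSketch`, `Ctx`, and `run` are hypothetical names, not jPOS classes, and the "remote host" is simulated with a scheduler. It also uses a plain FIFO queue, whereas the real TransactionManager requeues resumed transactions with high priority.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Toy model only: none of these names are jPOS classes.
class ContinuationSketch {
    // A transaction context: an id plus a flag telling the session which phase comes next.
    static final class Ctx {
        final int id;
        volatile boolean responseArrived;
        Ctx(int id) { this.id = id; }
    }

    /** Runs the given number of simulated transactions on a few session threads; returns how many completed. */
    static int run(int sessions, int transactions, long remoteDelayMillis) {
        BlockingQueue<Ctx> queue = new LinkedBlockingQueue<>();
        // Stands in for the remote host plus the MUX callback thread.
        ScheduledExecutorService remote = Executors.newSingleThreadScheduledExecutor();
        CountDownLatch done = new CountDownLatch(transactions);
        AtomicInteger completed = new AtomicInteger();

        Runnable session = () -> {
            try {
                while (true) {
                    Ctx ctx = queue.take();
                    if (!ctx.responseArrived) {
                        // "Pause": fire the request and return at once, freeing this session.
                        remote.schedule(() -> {
                            ctx.responseArrived = true;
                            queue.add(ctx); // "resume": queue the context back
                        }, remoteDelayMillis, TimeUnit.MILLISECONDS);
                    } else {
                        // Second visit: the response is here, finish the transaction.
                        completed.incrementAndGet();
                        done.countDown();
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // session asked to stop
            }
        };

        Thread[] workers = new Thread[sessions];
        for (int i = 0; i < sessions; i++) {
            workers[i] = new Thread(session, "session-" + i);
            workers[i].start();
        }
        for (int i = 0; i < transactions; i++) {
            queue.add(new Ctx(i));
        }

        try {
            done.await(30, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            for (Thread w : workers) {
                w.interrupt();
            }
            remote.shutdownNow();
        }
        return completed.get();
    }

    public static void main(String[] args) {
        // 4 sessions, 1000 in-flight transactions, each waiting about 50 ms for the "remote host".
        System.out.println(run(4, 1000, 50) + " transactions completed");
    }
}
```

No session thread ever blocks while a simulated response is pending, which is why four of them can keep a thousand transactions in flight.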

The typical candidate for continuations is the QueryHost participant, which is implemented like this:

  mux.request(m, t, this, ctx);
  return PREPARED | READONLY | PAUSE | NO_JOIN;

The PAUSE modifier tells the TM to park that particular transaction until the context is resumed as part of its Pausable interface implementation. For the record, Pausable looks like this:

public interface Pausable {
    void setPausedTransaction(PausedTransaction p);
    ...
    ...
    void resume();
}

and the standard TransactionManager’s Context (org.jpos.transaction.Context) implements it.

We call the asynchronous mux.request implementation, passing it an ISOResponseListener (QueryHost implements ISOResponseListener) and the current Context as the handback object. When a response is received or the MUX times out, the transaction is resumed and queued back with high priority, so the next available session in the TransactionManager picks it up.

Futures were not popular in those days. Otherwise, the void request(ISOMsg m, long timeout, ISOResponseListener r, Object handBack) signature of the MUX’s asynchronous call would have probably been implemented using them.
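As an illustration of what a future-based variant could look like, the sketch below wraps a callback-style request in a CompletableFuture. The `ResponseListener` and `AsyncMux` interfaces are hypothetical stand-ins that only mirror the shape of the signature quoted above, with String in place of ISOMsg. They are not the jPOS types, and this is not an API that jPOS provides.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeoutException;

// Hypothetical stand-ins that only mirror the shape of the callback API; not jPOS types.
class FutureAdapterSketch {
    interface ResponseListener {
        void responseReceived(String response, Object handBack);
        void expired(Object handBack);
    }

    interface AsyncMux {
        void request(String message, long timeout, ResponseListener listener, Object handBack);
    }

    /** Wraps the callback-style call so the caller gets a CompletableFuture instead. */
    static CompletableFuture<String> requestAsync(AsyncMux mux, String message, long timeout) {
        CompletableFuture<String> future = new CompletableFuture<>();
        mux.request(message, timeout, new ResponseListener() {
            @Override
            public void responseReceived(String response, Object handBack) {
                future.complete(response);
            }

            @Override
            public void expired(Object handBack) {
                future.completeExceptionally(new TimeoutException("no response"));
            }
        }, null); // no handback needed: the future itself carries the result
        return future;
    }

    public static void main(String[] args) {
        // Fake mux that answers immediately on the calling thread.
        AsyncMux echo = (m, t, l, h) -> l.responseReceived("echo:" + m, h);
        System.out.println(requestAsync(echo, "0800", 1000).join());
    }
}
```

With a future, the handback object becomes unnecessary, because the code that continues the transaction can be chained onto the future directly.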

QueryHost, which uses QMUX, and many other participants such as HttpQuery in jPOS-EE use this facility to handle a large number of simultaneous transactions without requiring a large number of sessions in their transaction manager, each of which would need a platform thread. These participants are guaranteed to resume the context so that the transaction can continue its route within the TransactionManager.

This delicate mechanism works very well at production sites processing billions of transactions. However, we can’t guarantee that every user-implemented participant will always time out reliably, sooner or later, so that the transaction completes and a memory leak is avoided.

And here comes the reason for this post: for situations where participants may pause a transaction and never resume it after a timeout, the TransactionManager has an optional pause-timeout property that you can configure, e.g.:

<property name="pause-timeout" value="300000" />

This is a safety net for those situations. If for some reason your participant doesn’t resume, the TransactionManager will auto-resume it for you after a reasonable amount of time to prevent a memory leak. Contexts are lightweight and consume little memory, so those leaks manifest themselves after several days or even weeks depending on your system load, making 5 minutes a reasonable default.

But the pause-timeout feature comes at a price. For every Context that we pause, a TimerTask is created to cancel the transaction. TimerTasks, popular 20 years ago, are quite brittle: a single platform thread runs every TimerTask Runnable WITHOUT CATCHING unchecked exceptions, so one misbehaving task can cause the entire Timer facility to fail and create significant problems for other components, such as the TSpace’s garbage collector.
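The brittleness is easy to reproduce with the JDK alone. The sketch below uses only java.util.Timer and has nothing jPOS-specific in it; the class and method names are made up for the example. One task throws an unchecked exception, and from then on the shared Timer rejects every new task with an IllegalStateException.

```java
import java.util.Timer;
import java.util.TimerTask;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

class TimerBrittlenessSketch {
    /** Returns true if the shared Timer becomes unusable after one task throws. */
    static boolean timerDiesAfterUncheckedException() {
        Timer timer = new Timer("shared-timer", true); // daemon thread
        CountDownLatch threw = new CountDownLatch(1);

        // One misbehaving task: the Timer thread does not catch this exception.
        timer.schedule(new TimerTask() {
            @Override
            public void run() {
                threw.countDown();
                throw new IllegalStateException("bug in one task");
            }
        }, 0);

        try {
            threw.await(5, TimeUnit.SECONDS);
            // Poll until the Timer thread has finished dying, up to 5 seconds.
            long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(5);
            while (System.nanoTime() < deadline) {
                try {
                    timer.schedule(new TimerTask() {
                        @Override
                        public void run() { /* innocent task from another component */ }
                    }, 0);
                } catch (IllegalStateException e) {
                    return true; // the Timer now rejects every new task
                }
                Thread.sleep(10);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println("timer dead: " + timerDiesAfterUncheckedException());
    }
}
```

Every component that shares that Timer instance is affected, not just the one whose task threw.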

Takeaways

  • The pause-timeout feature can be used if you fear your pausable participants may not resume. It’s usually used for debugging after you see warnings about a high in-transit count. A high in-transit count indicates that at least one transaction got stuck without ever finishing. You can confirm this in the log: the tail ID gets stuck and doesn’t move up.

  • If you decide to use a pause-timeout, remember that this is a safety net that should never be necessary. Setting a short timeout conflicting with the timeout you use at QueryHost (e.g., 15 seconds, or 30 seconds) is not advisable because it increases the chances of a race condition between the normal resume caused by the QueryHost expiration and the resume caused by the TimerTask runnable trying to do the same. If you use a pause-timeout, make it a long one.
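The race described in the second takeaway can be pictured with a small sketch in which two timers try to resume the same paused transaction at nearly the same moment. Everything here is hypothetical (`ResumeOnceSketch`, `PausedTx`, `raceOnce`), and the compare-and-set guard is one common way to make a resume idempotent. It is not a description of how jPOS handles this internally.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration; not how jPOS implements pause-timeout.
class ResumeOnceSketch {
    static final class PausedTx {
        private final AtomicBoolean resumed = new AtomicBoolean();
        private final Runnable continuation;

        PausedTx(Runnable continuation) {
            this.continuation = continuation;
        }

        /** Runs the continuation only for the first caller; later callers are ignored. */
        boolean resume() {
            if (resumed.compareAndSet(false, true)) {
                continuation.run();
                return true;
            }
            return false;
        }
    }

    /** Fires two competing resumes with the same delay; returns how many times the continuation ran. */
    static int raceOnce() {
        AtomicInteger runs = new AtomicInteger();
        PausedTx tx = new PausedTx(runs::incrementAndGet);
        ScheduledExecutorService timers = Executors.newScheduledThreadPool(2);
        try {
            // Stand-in for the participant's own timeout (e.g. the QueryHost expiration).
            ScheduledFuture<?> hostTimeout =
                timers.schedule(() -> { tx.resume(); }, 10, TimeUnit.MILLISECONDS);
            // Stand-in for a pause-timeout configured with a conflicting value.
            ScheduledFuture<?> pauseTimeout =
                timers.schedule(() -> { tx.resume(); }, 10, TimeUnit.MILLISECONDS);
            hostTimeout.get(5, TimeUnit.SECONDS);
            pauseTimeout.get(5, TimeUnit.SECONDS);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            timers.shutdownNow();
        }
        return runs.get();
    }

    public static void main(String[] args) {
        System.out.println("continuation ran " + raceOnce() + " time(s)");
    }
}
```

Even with a guard like this, a long pause-timeout is the simpler option, because the two resumes then never compete in the first place.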

Final Comment

All this goes away in jPOS 3. Continuations make no sense in jPOS 3’s TransactionManager because sessions are handled by virtual threads, making it entirely feasible to have a large number of sessions in the 100k range. All the delicate asynchronous programming we required almost 20 years ago is now magically handled by Project Loom’s virtual threads.