Timeouts from condor_rm and condor_submit

The condor_schedd is an event-driven process, like all other Condor daemons. It spends its time waiting in select(2) for events to process. Events include condor_q queries, spawning and reaping condor_shadow processes, accepting condor_submit submissions, negotiating with the Negotiator, and removing jobs during condor_rm. The responsiveness of the Schedd to user interaction, e.g. condor_q, condor_rm, condor_submit, and to process interaction, e.g. messages with condor_shadow, condor_startd or condor_negotiator, is affected by how long it takes to process an event and how many events are waiting to be processed.

For instance, if a thousand condor_shadow processes start up at the same time there may be a thousand keep-alive messages for the Schedd to process after a single call to select. Once select returns, no new events will be considered until the Schedd calls select again. A condor_rm request would have to wait. Likewise, if any one event takes a long time to process, such as a negotiation cycle, it can also keep the Schedd from getting back to select and accepting new events.
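The effect can be illustrated with a toy model of a single-threaded select loop. This is only a sketch, not Schedd code: the function name `completion_times` and the per-event costs (20 ms per keep-alive, 1 ms for the removal request) are assumptions chosen for illustration. Times are integer milliseconds.

```python
def completion_times(batches, cost_ms):
    """Toy model: each inner list is what one call to select returns.

    Every ready event in a batch is handled before select is called again,
    so anything that arrives meanwhile waits for the next batch.
    Returns a dict mapping event name -> time (ms) its handling finished.
    """
    clock = 0
    done = {}
    for ready in batches:        # one iteration == one return from select
        for name in ready:       # no new events are considered in here
            clock += cost_ms(name)
            done[name] = clock
    return done


def cost_ms(name):
    # Hypothetical costs, not measurements.
    return 20 if name.startswith("keepalive") else 1


keepalives = [f"keepalive-{i}" for i in range(1000)]
# The condor_rm request arrives just after select returned the keep-alives.
times = completion_times([keepalives, ["condor_rm"]], cost_ms)
waited_ms = times["keepalive-999"]  # time before condor_rm is even looked at
print(waited_ms, times["condor_rm"])
```

Under these assumed costs the removal request sits unanswered for 20,000 ms before the loop returns to select, even though handling it takes only a millisecond.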

Basically, to function well, the Schedd needs to get back to select as fast as possible.

From a user perspective, when the Schedd does not get back to select quickly, a condor_rm or condor_submit attempt may appear to fail, e.g.

$ time condor_rm -a

Could not remove all jobs.

real	0m20.069s
user	0m0.020s
sys	0m0.020s

As of the Condor 7.4 series, this rarely happens because of internal events that the Schedd is processing. The Schedd uses structures that allow such events to be interleaved with calls to select. However, some events still take long periods of time, e.g. the removal of 300,000 jobs above. Another such event is a negotiation cycle initiated by the Negotiator. If a condor_rm, condor_q, condor_submit, etc. happens during a negotiation, there is a good chance it will time out.

Though a simple re-try of the tool will often succeed, this timeout may be annoying to users of the tools, be they people or processes. An alternative to a re-try is to extend the timeout used by the tool. The default timeout is 20 seconds, which is very often long enough, but may not be in large pools.
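For processes that drive the tools, the re-try can be automated. The following is a minimal sketch, assuming the tool exits with a non-zero status when it times out; `run_with_retry` and its parameters are hypothetical names, not part of Condor.

```python
import subprocess
import sys
import time


def run_with_retry(argv, attempts=3, delay=1.0):
    """Run argv, re-trying while it exits non-zero (hypothetical helper).

    Returns the CompletedProcess of the last attempt made.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    result = None
    for attempt in range(1, attempts + 1):
        result = subprocess.run(argv, capture_output=True, text=True)
        if result.returncode == 0:
            break                      # the tool succeeded, stop re-trying
        if attempt < attempts:
            time.sleep(delay)          # give the Schedd time to reach select
    return result


# Stand-in command so the sketch runs without a Condor installation.
demo = run_with_retry([sys.executable, "-c", "raise SystemExit(0)"], delay=0)
print(demo.returncode)
```

With Condor installed, the call would look like `run_with_retry(["condor_rm", "-a"])`. Re-trying is only reasonable for requests that are safe to repeat.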

To extend the timeout for condor_submit, put SUBMIT_TIMEOUT_MULTIPLIER=3 in the configuration file read by condor_submit. To extend the timeout for condor_q, condor_rm, etc., put TOOL_TIMEOUT_MULTIPLIER=3 in the configuration file read by the tool. These changes take the default timeout, 20 seconds, and multiply it by 3, giving the Schedd 60 seconds to respond. For instance, with hundreds of thousands of jobs in the queue:

$ _CONDOR_TOOL_TIMEOUT_MULTIPLIER=3 time condor_rm -a
All jobs marked for removal.
0.01user 0.02system 0:53.99elapsed 0%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+4374minor)pagefaults 0swaps
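A script that calls the tools can apply the same override through the environment, as the _CONDOR_ prefix does in the command above. This is a sketch only: `tool_environment` and `effective_timeout` are hypothetical helper names, and the 20 second default is the figure quoted in this post.

```python
import os

DEFAULT_TIMEOUT_SECONDS = 20  # default tool timeout as stated in this post


def tool_environment(multiplier, base_env=None):
    """Return a copy of the environment with the multiplier override set."""
    env = dict(os.environ if base_env is None else base_env)
    env["_CONDOR_TOOL_TIMEOUT_MULTIPLIER"] = str(multiplier)
    return env


def effective_timeout(multiplier):
    """Seconds the tool will wait for the Schedd with this multiplier."""
    return DEFAULT_TIMEOUT_SECONDS * multiplier


env = tool_environment(3, base_env={})
print(env["_CONDOR_TOOL_TIMEOUT_MULTIPLIER"], effective_timeout(3))
```

The returned mapping can be passed as the `env` argument of `subprocess.run` when invoking condor_rm or condor_q, which leaves the configuration file untouched.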
