condor_schedd is an event-driven process, like all other Condor daemons. It spends its time waiting in select(2) for events to process. Events include condor_q queries, spawning and reaping condor_shadow processes, accepting condor_submit submissions, negotiating with the Negotiator, and removing jobs during condor_rm.
The responsiveness of the Schedd to user interaction (e.g. condor_q, condor_rm, condor_submit) and to process interaction (e.g. messages with condor_shadow, condor_startd or condor_negotiator) is affected by how long it takes to process an event and how many events are waiting to be processed.
For instance, if a thousand condor_shadow processes start up at the same time, there may be a thousand keep-alive messages for the Schedd to process after a single call to select returns. No new events will be considered until the Schedd calls select again, so a condor_rm request would have to wait. Likewise, if any one event takes a long time to process, such as a negotiation cycle, it can also keep the Schedd from getting back to select and accepting new events.
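The effect of such a backlog can be estimated with simple arithmetic. The sketch below is a hypothetical model, not a measurement of the Schedd: it assumes events are handled strictly one at a time and that each one costs the same amount of time. The function name backlog_wait_ms and the 25 ms per-event cost are assumptions made for illustration only.

```shell
#!/bin/sh
# Hypothetical model: events are handled strictly one at a time,
# each at the same assumed cost.
# backlog_wait_ms EVENTS_AHEAD COST_MS
# Prints how many milliseconds a newly arrived request waits.
backlog_wait_ms() {
    events_ahead=$1
    cost_ms=$2
    echo $(( events_ahead * cost_ms ))
}

# A thousand keep-alives at an assumed 25 ms each
backlog_wait_ms 1000 25
```

Under these assumed numbers a condor_rm arriving behind the burst would wait 25 seconds before the Schedd even looks at it.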
Basically, to function well, the Schedd needs to get back to select as fast as possible.
From a user perspective, when the Schedd does not get back to select quickly, a condor_rm or condor_submit attempt may appear to fail, e.g.
$ time condor_rm -a
Could not remove all jobs.

real 0m20.069s
user 0m0.020s
sys 0m0.020s
As of the Condor 7.4 series, this is rarely caused by internal events that the Schedd is processing, because the Schedd uses structures that allow such events to be interleaved with calls to select. However, some events still take long periods of time, e.g. the removal of 300,000 jobs above. Another such event is a negotiation cycle initiated by the Negotiator. If a condor_rm, condor_q, condor_submit, etc. happens during a negotiation, there is a good chance it will time out.
Though a simple re-try of the tool will often succeed, this timeout may be annoying to users of the tools, be they people or processes. An alternative to a re-try is to extend the timeout used by the tool. The default timeout is 20 seconds, which is very often long enough, but may not be in large pools.
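For processes that drive the tools, the retry can be automated. The following is a minimal sketch, not part of Condor: retry is a hypothetical helper, and it assumes the wrapped tool exits with a non-zero status when it fails or times out. The attempt count and delay in the usage line are arbitrary.

```shell
#!/bin/sh
# retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND until it exits 0 or
# MAX_ATTEMPTS attempts have been made. Returns 0 on success,
# 1 if every attempt failed.
retry() {
    max_attempts=$1
    delay=$2
    shift 2
    attempt=1
    while :; do
        # Run the command exactly as given; stop on first success.
        "$@" && return 0
        if [ "$attempt" -ge "$max_attempts" ]; then
            return 1
        fi
        attempt=$(( attempt + 1 ))
        sleep "$delay"
    done
}

# Usage (assumes condor_rm exits non-zero on failure):
# retry 3 5 condor_rm -a
```

A fixed delay keeps the sketch short. If many clients retry at once, a longer or growing delay avoids adding to the backlog the Schedd is already working through.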
To extend the timeout for condor_submit, put SUBMIT_TIMEOUT_MULTIPLIER=3 in the configuration file read by condor_submit. To extend the timeout for condor_rm, etc., put TOOL_TIMEOUT_MULTIPLIER=3 in the configuration file read by the tool. These changes take the default timeout, 20 seconds, and multiply it by 3, giving the Schedd 60 seconds to respond. For instance, with hundreds of thousands of jobs in the queue:
$ _CONDOR_TOOL_TIMEOUT_MULTIPLIER=3 time condor_rm -a
All jobs marked for removal.
0.01user 0.02system 0:53.99elapsed 0%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+4374minor)pagefaults 0swaps
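The arithmetic behind choosing a multiplier can be sketched as follows. The helper names effective_timeout and min_multiplier are hypothetical, and the only input taken from the text is the 20-second default. The second function rounds up, so a removal that needs about 54 seconds, as in the run above, calls for a multiplier of 3.

```shell
#!/bin/sh
# Default tool timeout in seconds, as stated in the text.
DEFAULT_TIMEOUT=20

# effective_timeout MULTIPLIER
# Prints the timeout in seconds that results from the multiplier.
effective_timeout() {
    echo $(( DEFAULT_TIMEOUT * $1 ))
}

# min_multiplier ELAPSED_SECONDS
# Prints the smallest whole multiplier whose timeout covers
# ELAPSED_SECONDS (integer ceiling division).
min_multiplier() {
    echo $(( ($1 + DEFAULT_TIMEOUT - 1) / DEFAULT_TIMEOUT ))
}

effective_timeout 3
min_multiplier 54
```

In practice it may be worth leaving some headroom above the smallest multiplier, since the time the Schedd needs varies with the size of the queue.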