Archive for December, 2009

Timeouts from condor_rm and condor_submit

December 16, 2009

The condor_schedd is an event-driven process, like all other Condor daemons. It spends its time waiting in select(2) for events to process. Events include: condor_q queries, spawning and reaping condor_shadow processes, accepting condor_submit submissions, negotiating with the Negotiator, and removing jobs during condor_rm. The responsiveness of the Schedd to user interaction, e.g. condor_q, condor_rm, condor_submit, and process interaction, e.g. messages with condor_shadow, condor_startd or condor_negotiator, is affected by how long it takes to process an event and how many events are waiting to be processed.

For instance, if a thousand condor_shadow processes start up at the same time there may be a thousand keep-alive messages for the Schedd to process after a single call to select. Once select returns, no new events will be considered until the Schedd calls select again. A condor_rm request would have to wait. Likewise, if any one event takes a long time to process, such as a negotiation cycle, it can also keep the Schedd from getting back to select and accepting new events.

Basically, to function well, the Schedd needs to get back to select as fast as possible.

From a user perspective, when the Schedd does not get back to select quickly, a condor_rm or condor_submit attempt may appear to fail, e.g.

$ time condor_rm -a

Could not remove all jobs.

real	0m20.069s
user	0m0.020s
sys	0m0.020s

As of the Condor 7.4 series, this rarely happens because of internal events that the Schedd is processing. The Schedd uses structures that allow such events to be interleaved with calls to select. However, some events still take long periods of time, e.g. the removal of 300,000 jobs above. One such event is a negotiation cycle initiated by the Negotiator. If a condor_rm, condor_q, condor_submit, etc. happens during a negotiation, there is a good chance it will time out.

Though a simple re-try of the tool will often succeed, this timeout may be annoying to users of the tools, be they people or processes.
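As a minimal sketch of the re-try approach (assuming a bash shell and that condor_rm exits non-zero when it fails):

until condor_rm -a; do sleep 5; done

An alternative to a re-try is to extend the timeout used by the tool. The default timeout is 20 seconds, which is very often long enough, but may not be in large pools.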

To extend the timeout for condor_submit, put SUBMIT_TIMEOUT_MULTIPLIER=3 in the configuration file read by condor_submit. To extend the timeout for condor_q, condor_rm, etc., put TOOL_TIMEOUT_MULTIPLIER=3 in the configuration file read by the tool. These changes will take the default timeout, 20 seconds, and multiply it by 3, giving the Schedd 60 seconds to respond.
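For instance, both settings could go in a configuration file the tools read, e.g. the local configuration file (the choice of file is an assumption; any file in the tool’s configuration chain works):

SUBMIT_TIMEOUT_MULTIPLIER = 3
TOOL_TIMEOUT_MULTIPLIER = 3

The same multiplier can also be set from the environment for a single run, which is what the example below does. With hundreds of thousands of jobs in the queue: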

$ _CONDOR_TOOL_TIMEOUT_MULTIPLIER=3 time condor_rm -a
All jobs marked for removal.
0.01user 0.02system 0:53.99elapsed 0%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+4374minor)pagefaults 0swaps

Cap job runtime: Debugging periodic job policy in a Condor pool

December 5, 2009

Job policy includes a set of periodic expressions: PeriodicHold, PeriodicRelease, and PeriodicRemove. Periodic expressions on a job are evaluated in the context of the job’s ad. They should evaluate to a boolean; if PeriodicRemove is TRUE, then the job is removed. They are evaluated by the Schedd and a Shadow. The Schedd evaluates them as frequently as PERIODIC_EXPR_INTERVAL, which defaults to 60 seconds in Condor 7.4 and 300 seconds in earlier releases. The Shadow evaluates them periodically, based on PERIODIC_EXPR_INTERVAL, and every time it receives an update from the Starter running the job. Updates occur periodically, controlled by STARTER_UPDATE_INTERVAL on the Starter.
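While experimenting with policy, it can help to lower both intervals in the configuration so expressions are evaluated more often; the values below are illustrative, not recommendations:

PERIODIC_EXPR_INTERVAL = 30
STARTER_UPDATE_INTERVAL = 30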

Say you want to put a job on hold if it runs for more than 60 seconds. To do this you need job policy and two points of reference: the start time of the job and the current time.

The job policy you want to use is PeriodicHold.

For the two time references you need to look at a job’s ad to see what is available. You can see a job’s ad with condor_q -long. The start time is available as JobCurrentStartDate, in seconds since the Epoch, and it would appear that ServerTime is the current time, also in seconds since the Epoch.
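For instance, to inspect those two attributes on a job (the job id 1.0 here is hypothetical):

$ condor_q -long 1.0 | grep -e ^JobCurrentStartDate -e ^ServerTime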

Here’s a job you might write:

executable = /bin/sleep
arguments =  29m
notification = never
periodic_hold = (ServerTime - JobCurrentStartDate) >= 60
queue 1

If you submit it and wait a minute, you will see it is on hold.

$ condor_q

-- Submitter: robin.local :  : robin.local
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
   1.0   matt           11/5  19:03   0+00:00:19 H  0   2.0  sleep 29m         

1 jobs; 0 idle, 0 running, 1 held

But it only ran for 19 seconds. Check why:

$ condor_q -hold

-- Submitter: robin.local :  : robin.local
 ID      OWNER           HELD_SINCE HOLD_REASON                   
   1.0   matt           11/5  19:04 The UNKNOWN (never set) PeriodicHold expres

1 jobs; 0 idle, 0 running, 1 held

The hold reason is truncated, but you can use condor_q -long to see all of it.

$ condor_q -long | grep ^HoldReason
HoldReason = "The UNKNOWN (never set) PeriodicHold expression '' evaluated to UNDEFINED"
HoldReasonCode = 5
HoldReasonSubCode = 0

This is saying that something went wrong and your PeriodicHold expression evaluated to UNDEFINED. You wanted it to evaluate to a boolean. In ClassAd expression evaluation, if part of an expression evaluates to UNDEFINED, chances are the entire expression will too. For instance, the expression A + 1 evaluates to the value of the A attribute plus 1. When A is a number like 2, the expression evaluates to 2 + 1 => 3. When A is not defined, the expression evaluates to UNDEFINED + 1 => UNDEFINED.

There is a handy way to debug expressions in Condor: use the debug() ClassAd function.

Your job becomes:

executable = /bin/sleep
arguments =  29m
notification = never
periodic_hold = debug((ServerTime - JobCurrentStartDate) >= 60)
queue 1

When you submit the job and look in either the SchedLog or ShadowLog you will see “Classad debug” messages.

ShadowLog:

...: Classad debug: ServerTime --> UNDEFINED
...: Classad debug: JobCurrentStartDate --> ERROR
...: Classad debug: JobCurrentStartDate --> 1257476798
...: Classad debug: debug((ServerTime - JobCurrentStartDate) >= 60) --> UNDEFINED

Very clearly, ServerTime is evaluating to UNDEFINED. That may seem strange because it looked like it was part of the job ad. However, ServerTime is somewhat special. Likely buggy. It is only added to a job ad in response to a query, e.g. condor_q. It is not actually part of the job ad. Annoying.

There is a solution. Another special attribute is CurrentTime. It is available to all expressions, but it is not a visible attribute on a job ad. Also annoying. You have to know it is there. Using it, you can rewrite the job as follows.

executable = /bin/sleep
arguments =  29m
notification = never
periodic_hold = debug((CurrentTime - JobCurrentStartDate) >= 60)
queue 1

After submitting the job and waiting a minute you can see that it has been put on hold.

$ condor_q

-- Submitter: robin.local :  : robin.local
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
   3.0   matt           11/5  19:09   0+00:01:00 H  0   0.0  sleep 29m         

1 jobs; 0 idle, 0 running, 1 held

The details of the hold are also much more what you would expect.

$ condor_q -long | grep ^HoldReason
HoldReason = "The job attribute PeriodicHold expression 'debug((CurrentTime - JobCurrentStartDate) >= 60)' evaluated to TRUE"
HoldReasonCode = 3
HoldReasonSubCode = 0

In the ShadowLog you can see that CurrentTime is available as expected.

ShadowLog:

...: Classad debug: CurrentTime --> 1257477062
...: Classad debug: JobCurrentStartDate --> ERROR
...: Classad debug: JobCurrentStartDate --> 1257477002
...: Classad debug: debug((CurrentTime - JobCurrentStartDate) >= 60) --> 1

Remember to remove debug() from your expressions.
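The production version of the policy is then just the bare expression:

periodic_hold = (CurrentTime - JobCurrentStartDate) >= 60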

