Job policy includes a set of periodic expressions; PeriodicHold
, PeriodicRelease
, and PeriodicRemove
. Periodic expressions on a job are evaluated in the context of the job’s ad. They should evaluate to a boolean; if PeriodicRemove
is TRUE
, then remove the job. They are evaluated by the Schedd and a Shadow. The Schedd evaluates them as frequently as PERIODIC_EXPR_INTERVAL
, which defaults to 60 seconds in condor 7.4 and 300 before. The Shadow evaluates them periodically, based on PERIODIC_EXPR_INTERVAL
, and every time it receives an update from the Starter running the job. Updates occur periodically, controlled by STARTER_UPDATE_INTERVAL
on the Starter.
Say you want to put a job on hold if it runs for more than 60 seconds. To do this you need job policy and two points of reference; the start time of the job, and the current time.
The job policy you want to use is PeriodicHold
.
For the two time references you need to look at a job’s ad to see what is available. You can see a job’s ad with condor_q -long
. The start time is available as JobCurrentStartDate
, seconds since Epoch, and it would appear that ServerTime
is the current time, seconds since Epoch.
Here’s a job you might write:
executable = /bin/sleep arguments = 29m notification = never periodic_hold = (ServerTime - JobCurrentStartDate) >= 60 queue 1
If you submit it and wait a minute you will see it is on Hold.
$ condor_q -- Submitter: robin.local : : robin.local ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD 1.0 matt 11/5 19:03 0+00:00:19 H 0 2.0 sleep 29m 1 jobs; 0 idle, 0 running, 1 held
But, it only ran for 19 seconds. Check why.
$ condor_q -hold -- Submitter: robin.local : : robin.local ID OWNER HELD_SINCE HOLD_REASON 1.0 matt 11/5 19:04 The UNKNOWN (never set) PeriodicHold expres 1 jobs; 0 idle, 0 running, 1 held
The hold reason is truncated, but you can use condor_q -long
to see all of it.
$ condor_q -long | grep ^HoldReason HoldReason = "The UNKNOWN (never set) PeriodicHold expression '' evaluated to UNDEFINED" HoldReasonCode = 5 HoldReasonSubCode = 0
This is saying that something went wrong and your PeriodicHold
expression evaluated to UNDEFINED
. You wanted it to evaluate to a boolean. The way expression evaluation in ClassAds happens, if part of an expression evaluates to UNDEFINED
then chances are the entire expression will. For instance, the expression A + 1
will evaluate to the value of the A attribute plus 1. When A is a number like 2, the expression evaluates to 2 + 1 => 3
. When A is not defined, the expression evaluates to UNDEFINED + 1 => UNDEFINED
.
There is a handy way to debug expression in Condor, use the debug()
ClassAd function.
Your job becomes:
executable = /bin/sleep arguments = 29m notification = never periodic_hold = debug((ServerTime - JobCurrentStartDate) >= 60) queue 1
When you submit the job and look in either the SchedLog or ShadowLog you will see “Classad debug” messages.
ShadowLog:
...: Classad debug: ServerTime --> UNDEFINED ...: Classad debug: JobCurrentStartDate --> ERROR ...: Classad debug: JobCurrentStartDate --> 1257476798 ...: Classad debug: debug((ServerTime - JobCurrentStartDate) >= 60) --> UNDEFINED
Very clearly, ServerTime
is evaluating to UNDEFINED
. That may seem strange because it looked like it was part of the job ad. However, ServerTime
is somewhat special. Likely buggy. It is only added to a job ad in response to a query, e.g. condor_q
. It is not actually part of the job ad. Annoying.
There is a solution. Another special attribute is CurrentTime
. It is available to all expressions, but it is not a visible attribute on a job ad. Also annoying. You have to know it is there. Using it we can rewrite the job as follows.
executable = /bin/sleep arguments = 29m notification = never periodic_hold = debug((CurrentTime - JobCurrentStartDate) >= 60) queue 1
After submitting the job and waiting a minute you can see that it has been put on hold.
$ condor_q -- Submitter: robin.local : : robin.local ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD 3.0 matt 11/5 19:09 0+00:01:00 H 0 0.0 sleep 29m 1 jobs; 0 idle, 0 running, 1 held
The details of the hold are also much more what you would expect.
$ condor_q -long | grep ^HoldReason HoldReason = "The job attribute PeriodicHold expression 'debug((CurrentTime - JobCurrentStartDate) >= 60)' evaluated to TRUE" HoldReasonCode = 3 HoldReasonSubCode = 0
In the ShadowLog you can see that CurrentTime
is available as expected.
ShadowLog:
...: Classad debug: CurrentTime --> 1257477062 ...: Classad debug: JobCurrentStartDate --> ERROR ...: Classad debug: JobCurrentStartDate --> 1257477002 ...: Classad debug: debug((CurrentTime - JobCurrentStartDate) >= 60) --> 1
Remember to remove debug()
from your expressions.