Configuration and policy evaluation

December 10, 2012

Figuring out how evaluation happens in configuration and policy is a common problem. The confusion is justified.

Configuration provides substitution with $() syntax, while policy is full ClassAd language evaluation without $() syntax.

Configuration is all the parameters listed in files discoverable with condor_config_val -config.

$ condor_config_val -config
Configuration source:
Local configuration sources:

Policy is the ClassAd expression found on the right-hand side of specific configuration parameters. For instance,

$ condor_config_val -v START
START: ( (KeyboardIdle > 15 * 60) && ( ((LoadAvg - CondorLoadAvg) <= 0.3) || (State != "Unclaimed" && State != "Owner")) )
  Defined in '/etc/condor/condor_config', line 753.

Configuration evaluation allows for substitution of configuration parameters with $().

$ cat /etc/condor/condor_config | head -n753 | tail -n1

$ condor_config_val -v UWCS_START
UWCS_START: ( (KeyboardIdle > 15 * 60) && ( ((LoadAvg - CondorLoadAvg) <= 0.3) || (State != "Unclaimed" && State != "Owner")) )
  Defined in '/etc/condor/condor_config', line 808.

$ cat /etc/condor/condor_config | head -n808 | tail -n3
UWCS_START	= ( (KeyboardIdle > $(StartIdleTime)) \
                    && ( $(CPUIdle) || \
                         (State != "Unclaimed" && State != "Owner")) )

Here START takes its value from UWCS_START, through the $(UWCS_START) substitution.

The substitution is recursive. Explore /etc/condor/condor_config and the JustCPU parameter. It is actually a parameter that is never read by daemons or tools. It is only useful in other configuration parameters. It’s shorthand.
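To make the recursion concrete, here is a minimal sketch in Python. It is a toy model of $() substitution, not Condor's actual parser: the expand function, the MACRO pattern, and the four-entry parameter table are all made up for illustration, and the MINUTE and StartIdleTime values are assumptions chosen to reproduce the 15 * 60 seen above.

```python
import re

# Hypothetical, simplified model of $() substitution; not Condor's real parser.
MACRO = re.compile(r"\$\(([A-Za-z_][A-Za-z0-9_]*)\)")

def expand(name, params, _seen=()):
    """Return the value of `name` with every $(OTHER) reference expanded."""
    if name in _seen:
        # Guard against A = $(B), B = $(A) style loops.
        raise ValueError("circular reference: " + " -> ".join(_seen + (name,)))
    raw = params[name]
    # Each $(X) is replaced by the fully expanded value of X.
    return MACRO.sub(lambda m: expand(m.group(1), params, _seen + (name,)), raw)

# Illustrative parameter table (assumed values, trimmed to one clause).
params = {
    "MINUTE": "60",
    "StartIdleTime": "15 * $(MINUTE)",
    "UWCS_START": "(KeyboardIdle > $(StartIdleTime))",
    "START": "$(UWCS_START)",
}

expanded_start = expand("START", params)
print(expanded_start)  # (KeyboardIdle > 15 * 60)
```

Note that substitution is purely textual: KeyboardIdle is left alone, because it is not a configuration parameter and nothing is evaluated at this stage.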

Policy evaluation is full ClassAd expression evaluation. The evaluation happens at the appropriate times while daemons or tools are running.

Taking START as an example, the words KeyboardIdle, LoadAvg, CondorLoadAvg, and State are attributes found on machine ads. The expression is evaluated by the condor_startd and condor_negotiator to figure out if a job is allowed to start on a resource.

$ condor_status -l slot1@eeyore.local | grep -e ^KeyboardIdle -e ^LoadAvg -e ^CondorLoadAvg -e ^State
KeyboardIdle = 0
LoadAvg = 0.290000
CondorLoadAvg = 0.0
State = "Owner"

Evaluation happens by recursively evaluating those attributes. The expression ((KeyboardIdle > 15 * 60) && (((LoadAvg - CondorLoadAvg) <= 0.3) || (State != "Unclaimed" && State != "Owner"))) becomes ((0 > 15 * 60) && (((0.29 - 0.0) <= 0.3) || ("Owner" != "Unclaimed" && "Owner" != "Owner"))). And so forth, until a single value remains: here false, because KeyboardIdle is 0.
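The same evaluation can be walked through in Python. This is only a sketch: the machine_ad dictionary and the evaluate_start function are hypothetical stand-ins, the expression is hand-translated rather than parsed, and ClassAd-specific behavior such as UNDEFINED values is ignored.

```python
# Attribute values copied from the condor_status output above.
machine_ad = {
    "KeyboardIdle": 0,
    "LoadAvg": 0.29,
    "CondorLoadAvg": 0.0,
    "State": "Owner",
}

def evaluate_start(ad):
    """Hand-translated START expression; returns each sub-result and the total."""
    keyboard_idle = ad["KeyboardIdle"] > 15 * 60
    cpu_idle = (ad["LoadAvg"] - ad["CondorLoadAvg"]) <= 0.3
    already_busy = ad["State"] != "Unclaimed" and ad["State"] != "Owner"
    result = keyboard_idle and (cpu_idle or already_busy)
    return keyboard_idle, cpu_idle, already_busy, result

keyboard_idle, cpu_idle, already_busy, start = evaluate_start(machine_ad)
print(keyboard_idle, cpu_idle, already_busy, start)  # False True False False

# The same policy against an ad whose keyboard has been idle for 20 minutes.
idle_ad = dict(machine_ad, KeyboardIdle=20 * 60)
start_when_idle = evaluate_start(idle_ad)[3]
print(start_when_idle)  # True
```

The policy text never changes between the two calls. Only the attribute values in the ad do, which is why the same START expression can give different answers over time.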

That’s it.

FAQ: Job resubmission?

November 5, 2012

A question that often arises when approaching Condor from other batch systems is “How does Condor deal with resubmission of failed/preempted/killed jobs?”

The answer requires a slight shift in thinking.

Condor provides more functionality around the resubmission use case than most other schedulers. And the default policy is set up in such a way that most Condor folks don’t ever think about “resubmission.”

Condor will keep your job in the queue (condor_schedd managed) until the policy attached to the job says otherwise.

The default policy says a job will be run as many times as necessary for the job to terminate. So if the machine a job is running on crashes (or, more generally, becomes unavailable), the condor_schedd will automatically try to run the job on another machine.

When you start changing the default policy you can control things such as: whether a job should be removed after a period of time, even if it is running or only if it hasn’t started running; whether a job should run multiple times even if it terminated cleanly; whether a termination with an error should make the job run again, be held in the queue for inspection, or be removed from the queue; whether a job held for inspection should be held forever or for a specific amount of time; whether a job should only start running at a specific time in the future, or be run at repeated intervals.

The condor_submit manual page can provide specifics.

