Posts Tagged ‘Slots’

The Owner state

September 11, 2012

What is the Owner state? Why are my slots in the Owner state? Why do jobs not start immediately after I restart Condor? How do I keep slots from going into the Owner state?

$ condor_status
Name               OpSys      Arch   *State*  Activity LoadAv Mem   ActvtyTime
eeyore.local       LINUX      X86_64 *Owner*  Idle     0.480  7783  0+00:00:04
                     Machines Owner Claimed Unclaimed Matched Preempting
        X86_64/LINUX        1     1       0         0       0          0
               Total        1     1       0         0       0          0

All these questions require understanding a bit of Condor policy. Specifically, the START policy on execute nodes.

In Condor, the decision to run a job on a resource, a.k.a. a slot, is made by both the job and the resource. The job specifies the requirements it has for a resource, e.g. Memory > 1024. And, the resource specifies the requirements it has for jobs, e.g. MemoryUsage < Memory. If both do not agree, the job will not be run on the resource.

First, the Owner state is an indication that the resource is not available to run jobs. The resource is in use by its Owner. Historically, this has been someone at the resource’s console. Generally, think of Owner meaning Unavailable.

Second, the resource specifies its requirements using the START configuration parameter.

$ condor_config_val -v START
START: ( (KeyboardIdle > 15 * 60) && ( ((LoadAvg - CondorLoadAvg) <= 0.3) || (State != "Unclaimed" && State != "Owner")) )
  Defined in '/etc/condor/condor_config', line 754.

When the START policy evaluates to False, i.e. starting jobs is not allowed, the resource will appear in the Owner state.

Note that the default START policy does not care about aspects of the jobs that might run. It cares entirely about aspects of the resource itself. It is designed to protect a user at the console (KeyboardIdle) and other processes already running on the machine (LoadAvg – CondorLoadAvg).

Third, jobs may not start immediately on restart because the KeyboardIdle timer has been reset to 0. This means waiting 15 minutes for the START policy, specifically KeyboardIdle > 15 * 60, to evaluate to True and the resource to become available.

Finally, on dedicated resources, which are very common, the KeyboardIdle component of the START policy can be removed. In fact, it is quite common to simply set START = TRUE.

%d bloggers like this: