
Concurrency Limits: Group defaults

January 21, 2013

Concurrency limits protect resources by capping the number of jobs that require a specific resource and can run at one time.

For instance, consider limiting licenses and filer access at four regional data centers:

CONCURRENCY_LIMIT_DEFAULT = 15
license.north_LIMIT = 30
license.south_LIMIT = 30
license.east_LIMIT = 30
license.west_LIMIT = 45
filer.north_LIMIT = 75
filer.south_LIMIT = 150
filer.east_LIMIT = 75
filer.west_LIMIT = 75

Notice the repetition.

In addition to the repetition, every license.* and filer.* limit must be known and recorded in configuration. The set may be small in this example, but imagine imposing a limit on each user or each submission. The set of users is broad, dynamic and may differ by region. The set of submissions is a more extreme version of the users case, yet it is still realistic.

To simplify the configuration management for groups of limits, a new feature providing group defaults for limits was added in the Condor 7.8 series.

The feature requires that only the exceptions to a rule be called out explicitly in configuration. For instance, license.west and filer.south are the exceptions in the configuration above. The simplified configuration, available in 7.8:

CONCURRENCY_LIMIT_DEFAULT = 15
CONCURRENCY_LIMIT_DEFAULT_license = 30
CONCURRENCY_LIMIT_DEFAULT_filer = 75
license.west_LIMIT = 45
filer.south_LIMIT = 150
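How a limit name resolves to a maximum can be sketched in plain shell: the text before the first "." names the group; an explicit <name>_LIMIT wins, then the group default, then the global default. The values below mirror the configuration above, but the resolution logic itself is my reading of the feature, not Condor source.

```shell
#!/bin/sh
# Sketch of group-default resolution for the configuration above.
# Precedence: explicit <name>_LIMIT, then CONCURRENCY_LIMIT_DEFAULT_<group>,
# then CONCURRENCY_LIMIT_DEFAULT.
resolve_limit() {
    case $1 in
        license.west) echo 45 ;;   # explicit license.west_LIMIT
        filer.south)  echo 150 ;;  # explicit filer.south_LIMIT
        license.*)    echo 30 ;;   # CONCURRENCY_LIMIT_DEFAULT_license
        filer.*)      echo 75 ;;   # CONCURRENCY_LIMIT_DEFAULT_filer
        *)            echo 15 ;;   # CONCURRENCY_LIMIT_DEFAULT
    esac
}

resolve_limit license.north   # 30, from the license group default
resolve_limit license.west    # 45, the explicit exception
resolve_limit gpu.north       # 15, no gpu group default configured
```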

In action,

$ for limit in license.north license.south license.east license.west \
               filer.north filer.south filer.east filer.west; do
      echo queue 1000 | condor_submit -a cmd=/bin/sleep -a args=1d \
          -a concurrency_limits=$limit
  done

$ condor_q -format '%s\n' ConcurrencyLimits -const 'JobStatus == 2' | sort | uniq -c | sort -n
     30 license.east
     30 license.north
     30 license.south
     45 license.west
     75 filer.east
     75 filer.north
     75 filer.west
    150 filer.south

Concurrency Limits: Protecting shared resources

June 27, 2011

Concurrency Limits, sometimes called resource limits, are Condor's way of giving administrators and users a tool to protect limited resources.

A popular resource to protect is a software license. Take for example jobs that run Matlab. Matlab uses flexlm, and users often have a limited number of licenses available, effectively limiting how many jobs they can run concurrently. Condor does not, and does not need to, integrate with flexlm here. Instead, Condor lets a user specify concurrency_limits = matlab with their job and lets administrators add MATLAB_LIMIT = 64 to configuration.

Other uses include limiting the number of jobs connecting to a network filesystem filer, limiting the number of jobs a user can run, limiting the number of jobs running in a submission, and really anything else that can be managed at a global pool level. I have also heard of people using them to limit database connections and to implement a global pool load share.
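As a sketch of the per-user case, a cap might look like the fragment below; the user name jdoe and the value 10 are hypothetical, chosen only for illustration.

```
# Negotiator configuration: cap a (hypothetical) user jdoe at 10 running jobs
jdoe_LIMIT = 10

# In each of jdoe's submit files:
concurrency_limits = jdoe
```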

The global aspect of these resources is important. Concurrency limits are not local to nodes, e.g. for GPU management. Limits are managed by the Negotiator. They work because jobs contain a list of their limits and slot advertisements contain a list of active limits. During the negotiation cycle, the negotiator sums the active limits and compares the total with the configured maximum and with what a job is requesting.
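That accounting can be sketched outside Condor. The following is an illustration of the check just described, not the negotiator's actual code; awk is used so fractional requests work too.

```shell
#!/bin/sh
# admit <current_usage> <request> <maximum> -> prints yes or no.
# A job is admitted only if current usage plus its request stays
# within the configured maximum for that limit.
admit() {
    awk -v used="$1" -v req="$2" -v max="$3" \
        'BEGIN { if (used + req <= max) print "yes"; else print "no" }'
}

admit 0 1 1     # yes: the first sleep job fits under SLEEP_LIMIT = 1
admit 1 1 1     # no: the limit is exhausted
admit 0 2.0 1   # no: a sleep:2.0 request can never fit under a max of 1
```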

Also, limits are not considered in preemption decisions. Changes to limits on a running job, via qedit, will not impact the job until it stops. This means a job cannot give up a limit it no longer needs when it exits a certain phase of execution – consider DAGs here. And, lowering a limit via configuration will not result in job preemption.

By example,

First the configuration needs to be on the Negotiator, e.g.

$ condor_config_val -dump | grep LIMIT
CONCURRENCY_LIMIT_DEFAULT = 3
SLEEP_LIMIT = 1
AWAKE_LIMIT = 2

This says that there can be a maximum of 1 job using the SLEEP resource at a time. This is across all users and all accounting groups.

$ cat > limits.sub
cmd = /bin/sleep
args = 1d
concurrency_limits = sleep
queue
^D

$ condor_submit -a 'queue 4' limits.sub
Submitting job(s)....
4 job(s) submitted to cluster 41.

$ condor_q
-- Submitter: eeyore.local :  : eeyore.local
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
  41.0   matt            7/04 12:21   0+00:55:55 R  0   4.2  sleep 1d
  41.1   matt            7/04 12:21   0+00:00:00 I  0   0.0  sleep 1d
  41.2   matt            7/04 12:21   0+00:00:00 I  0   0.0  sleep 1d
  41.3   matt            7/04 12:21   0+00:00:00 I  0   0.0  sleep 1d
4 jobs; 3 idle, 1 running, 0 held

(A) $ condor_q -format "%s " GlobalJobId -format "%s " ConcurrencyLimits -format "%s" LastRejMatchReason -format "\n" None
eeyore.local#41.0 sleep
eeyore.local#41.1 sleep concurrency limit reached
eeyore.local#41.2 sleep
eeyore.local#41.3 sleep

(B) $ condor_status -format "%s " Name -format "%s " GlobalJobId -format "%s" ConcurrencyLimits -format "\n" None
slot1@eeyore.local eeyore.local#41.0 sleep
slot2@eeyore.local
slot3@eeyore.local
slot4@eeyore.local

(A) shows each job wants to use the sleep limit. It also shows that job 41.1 did not match because its concurrency limits were reached. (B) shows that only 41.0 got to run, on slot1. Notice, the limit is present on the slot’s ad.

The Negotiator can also be asked about active limits directly,

$ condor_userprio -l | grep ConcurrencyLimit
ConcurrencyLimit_sleep = 1.000000

That’s well and good, but there are three more things to know about: 0) the default maximum, 1) multiple limits, 2) duplicate limits.

First, the default maximum, CONCURRENCY_LIMIT_DEFAULT, applies to any limit that is not explicitly named in configuration, as SLEEP was.

$ condor_submit -a 'concurrency_limits = biff' -a 'queue 4' limits.sub
Submitting job(s)....
4 job(s) submitted to cluster 42.

$ condor_rm 41
Cluster 41 has been marked for removal.

$ condor_q
-- Submitter: eeyore.local :  : eeyore.local
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
  42.0   matt            7/04 12:34   0+00:00:22 R  0   0.0  sleep 1d
  42.1   matt            7/04 12:34   0+00:00:22 R  0   0.0  sleep 1d
  42.2   matt            7/04 12:34   0+00:00:22 R  0   0.0  sleep 1d
  42.3   matt            7/04 12:34   0+00:00:00 I  0   0.0  sleep 1d
4 jobs; 1 idle, 3 running, 0 held

$ condor_q -format "%s " GlobalJobId -format "%s " ConcurrencyLimits -format "%s" LastRejMatchReason -format "\n" None
eeyore.local#42.0 biff
eeyore.local#42.1 biff
eeyore.local#42.2 biff
eeyore.local#42.3 biff concurrency limit reached

Second, a job can require multiple limits at the same time. The job will need to consume each limit to run, and the most restrictive limit will dictate whether the job runs.
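The "most restrictive limit wins" rule can be sketched as taking, across all required limits, the minimum of configured maximum divided by per-job request. This is assumed logic mirroring the observed behavior, not Condor source.

```shell
#!/bin/sh
# capacity <maximum> <request> -> how many jobs this one limit allows
capacity() {
    awk -v max="$1" -v req="$2" 'BEGIN { print int(max / req) }'
}

sleep_cap=$(capacity 1 1)   # SLEEP_LIMIT = 1, each job requests 1
awake_cap=$(capacity 2 1)   # AWAKE_LIMIT = 2, each job requests 1

# The bottleneck limit dictates how many jobs run concurrently.
if [ "$sleep_cap" -lt "$awake_cap" ]; then
    echo "$sleep_cap"       # here: 1, the sleep limit is the bottleneck
else
    echo "$awake_cap"
fi
```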

$ condor_rm -a
All jobs marked for removal.

$ condor_submit -a 'concurrency_limits = sleep,awake' -a 'queue 4' limits.sub
Submitting job(s)....
4 job(s) submitted to cluster 43.

$ condor_q
-- Submitter: eeyore.local :  : eeyore.local
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
  43.0   matt            7/04 13:07   0+00:00:13 R  0   0.0  sleep 1d
  43.1   matt            7/04 13:07   0+00:00:00 I  0   0.0  sleep 1d
  43.2   matt            7/04 13:07   0+00:00:00 I  0   0.0  sleep 1d
  43.3   matt            7/04 13:07   0+00:00:00 I  0   0.0  sleep 1d
4 jobs; 3 idle, 1 running, 0 held

$ condor_q -format "%s " GlobalJobId -format "%s " ConcurrencyLimits -format "%s" LastRejMatchReason -format "\n" None
eeyore.local#43.0 awake,sleep
eeyore.local#43.1 awake,sleep concurrency limit reached
eeyore.local#43.2 awake,sleep
eeyore.local#43.3 awake,sleep

$ condor_status -format "%s " Name -format "%s " GlobalJobId -format "%s" ConcurrencyLimits -format "\n" None
slot1@eeyore.local eeyore.local#43.0 awake,sleep
slot2@eeyore.local
slot3@eeyore.local
slot4@eeyore.local

Only one job gets to run because even though there are two awake limits available, there is only one sleep available.

Finally, a job can require more than one of the same limit. In fact, the requirement can be fractional.

$ condor_rm -a
All jobs marked for removal.

$ condor_submit -a 'concurrency_limits = sleep:2.0' -a 'queue 4' limits.sub
Submitting job(s)....
4 job(s) submitted to cluster 44.

$ condor_submit -a 'concurrency_limits = awake:2.0' -a 'queue 4' limits.sub
Submitting job(s)....
4 job(s) submitted to cluster 45.

$ condor_q
-- Submitter: eeyore.local :  : eeyore.local
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
  44.0   matt            7/04 13:11   0+00:00:00 I  0   0.0  sleep 1d
  44.1   matt            7/04 13:11   0+00:00:00 I  0   0.0  sleep 1d
  44.2   matt            7/04 13:11   0+00:00:00 I  0   0.0  sleep 1d
  44.3   matt            7/04 13:11   0+00:00:00 I  0   0.0  sleep 1d
  45.0   matt            7/04 13:13   0+00:00:24 R  0   0.0  sleep 1d
  45.1   matt            7/04 13:13   0+00:00:00 I  0   0.0  sleep 1d
  45.2   matt            7/04 13:13   0+00:00:00 I  0   0.0  sleep 1d
  45.3   matt            7/04 13:13   0+00:00:00 I  0   0.0  sleep 1d
8 jobs; 7 idle, 1 running, 0 held


$ condor_q -format "%s " GlobalJobId -format "%s " ConcurrencyLimits -format "%s" LastRejMatchReason -format "\n" None
eeyore.local#44.0 sleep:2.0 concurrency limit reached
eeyore.local#44.1 sleep:2.0
eeyore.local#44.2 sleep:2.0
eeyore.local#44.3 sleep:2.0
eeyore.local#45.0 awake:2.0
eeyore.local#45.1 awake:2.0 concurrency limit reached
eeyore.local#45.2 awake:2.0
eeyore.local#45.3 awake:2.0

$ condor_userprio -l | grep Limit
ConcurrencyLimit_awake = 2.000000
ConcurrencyLimit_sleep = 0.000000

Here none of the jobs in cluster 44 will run; they each need more SLEEP than is available. Also, only one of the jobs in cluster 45 can run at a time, because each one uses up all of AWAKE when it runs.

