Node-local limiters, e.g. GPUs

Condor supports pool-wide resource limiters called Concurrency Limits. They allow administrators and users to manage pool-wide consumable resources, such as software licenses, database connections, or overall pool load.
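For context, a pool-wide limit takes only a line of configuration and a submit file attribute. A minimal sketch, assuming a hypothetical limit named SW_LICENSE capped at 10 concurrent uses, set in the negotiator's configuration:

SW_LICENSE_LIMIT = 10

And in a job's submit file:

concurrency_limits = sw_license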

A common request on condor-users is for an analogous node-wide limiter. The usual solution is node configuration. For instance, to limit a machine to one GPU job at a time,

# Publish the per-slot GPU counts and the node-wide total in every slot's ad
STARTD_ATTRS = SLOT1_GPU_COUNT, SLOT2_GPU_COUNT, SLOT3_GPU_COUNT, SLOT4_GPU_COUNT, GPU_COUNT
# Copy the GPU attribute from a running job's ad into its slot's ad
STARTD_JOB_EXPRS = GPU
# Spread each slot's GPU attribute to the other slots as slot<N>_GPU
STARTD_SLOT_ATTRS = GPU

# A slot counts as using a GPU when its slot<N>_GPU attribute is defined
SLOT1_GPU_COUNT = ifThenElse(slot1_GPU =?= UNDEFINED, 0, 1)
SLOT2_GPU_COUNT = ifThenElse(slot2_GPU =?= UNDEFINED, 0, 1)
SLOT3_GPU_COUNT = ifThenElse(slot3_GPU =?= UNDEFINED, 0, 1)
SLOT4_GPU_COUNT = ifThenElse(slot4_GPU =?= UNDEFINED, 0, 1)

# Number of GPU jobs currently running on the node
GPU_COUNT = (SLOT1_GPU_COUNT + SLOT2_GPU_COUNT + SLOT3_GPU_COUNT + SLOT4_GPU_COUNT)

# Refuse new jobs while a GPU job is running
START = GPU_COUNT < 1

Then in a job submit file,

+GPU = "This job consumes a GPU resource"
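For illustration, a complete submit file using the attribute might look like the sketch below; everything except the +GPU line is a placeholder:

universe = vanilla
executable = gpu_job.sh
output = gpu_job.out
error = gpu_job.err
log = gpu_job.log
+GPU = "This job consumes a GPU resource"
queue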

This configuration works fairly well, but has three issues:

1) Job execution requires two steps between a Schedd and Startd: a CLAIM followed by an ACTIVATE. STARTD_SLOT_ATTRS is spread across slots after CLAIM and before ACTIVATE, so two GPU CLAIMs can both succeed but then fail to ACTIVATE, because starting both would exceed the limit of one. If GPU jobs repeatedly get matched to an execute node and then rejected, the result is thrashing and no forward progress.

2) To help avoid (1), jobs can be trickled in. However, when GPU jobs trickle in, the slots that were not matched pick up the propagated attribute via STARTD_SLOT_ATTRS but are not re-advertised. The Negotiator is left with a lagged view of the slots and hands out matches that are rejected at CLAIM time. This hurts throughput.

3) When a job exits or is removed, only the slot that was running it is re-advertised. As in (2), the Negotiator ends up with lagged state, except instead of handing out matches that will be rejected, it fails to hand out matches at all. This also hurts throughput.

Solutions:

1) Using Dynamic Slots can address the thrashing problem in (1), at least as of 7.4.2, where a STARTD_SLOT_ATTRS issue was resolved (see the sketch after this list). Dynamic slots naturally control the rate at which jobs are matched to a node, preventing thrashing. Of course, with dynamic slots it takes more than a single negotiation cycle to fill a multi-core machine. Possibly (3) below could help as well, though it would have the same slow-fill problem on multi-core machines, with added overall load.

2) Publish slots when they gain an attribute via STARTD_SLOT_ATTRS.

3) Publish slots when an attribute gained by STARTD_SLOT_ATTRS is removed from its source slot.
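As a rough illustration of (1), a single partitionable (dynamic) slot configuration, assuming jobs request their resources at submit time, might look like:

NUM_SLOTS = 1
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1 = cpus=100%, memory=100%, disk=100%
SLOT_TYPE_1_PARTITIONABLE = TRUE

Each match then carves a dynamic slot out of the partitionable slot, so GPU jobs arrive at the node one claim at a time instead of being matched against every static slot in the same cycle.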
