Archive for February, 2010

Node local limiters, e.g. GPUs

February 21, 2010

Condor supports pool-wide resource limiters called Concurrency Limits. They allow administrators and users to manage pool-wide consumable resources, e.g. software licenses, db connections, pool load, etc.
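For example, a pool-wide limit on a hypothetical license could be configured and requested roughly like this sketch (the limit name and count are made up):

# negotiator configuration: at most 10 jobs may hold the matlab limit at once
MATLAB_LIMIT = 10

and in a job submit file,

concurrency_limits = matlab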

A common request on condor-users is for a similar feature that limits resources node-wide. The common solution is node configuration. For instance, to limit a machine to one GPU job at a time,

# advertise per-slot GPU counts and the machine-wide total
STARTD_ATTRS = SLOT1_GPU_COUNT, SLOT2_GPU_COUNT, SLOT3_GPU_COUNT, SLOT4_GPU_COUNT, GPU_COUNT
# copy the job GPU attribute into the slot ad, and share it across slots
STARTD_JOB_EXPRS = GPU
STARTD_SLOT_ATTRS = GPU

# a slot counts as one GPU consumer when its job defines GPU
SLOT1_GPU_COUNT = ifThenElse(slot1_GPU =?= UNDEFINED, 0, 1)
SLOT2_GPU_COUNT = ifThenElse(slot2_GPU =?= UNDEFINED, 0, 1)
SLOT3_GPU_COUNT = ifThenElse(slot3_GPU =?= UNDEFINED, 0, 1)
SLOT4_GPU_COUNT = ifThenElse(slot4_GPU =?= UNDEFINED, 0, 1)

GPU_COUNT = (SLOT1_GPU_COUNT + SLOT2_GPU_COUNT + SLOT3_GPU_COUNT + SLOT4_GPU_COUNT)

# only start a new job while no GPU job is running on the machine
START = GPU_COUNT < 1

Then in a job submit file,

+GPU = "This job consumes a GPU resource"
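For context, a complete submit file carrying that attribute might look like the following sketch (the executable and file names are hypothetical):

universe   = vanilla
executable = gpu_job.sh
output     = gpu_job.out
error      = gpu_job.err
log        = gpu_job.log
+GPU = "This job consumes a GPU resource"
queue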

This configuration works fairly well, but has three issues:

1) Job execution requires two steps between a Schedd and a Startd: a CLAIM followed by an ACTIVATE. The attributes named in STARTD_SLOT_ATTRS are spread across slots after CLAIM and before ACTIVATE, so two GPU CLAIMs can succeed but then fail to ACTIVATE, because starting them would exceed the limit of one. If GPU jobs repeatedly get matched to an execute node and then rejected, the result is thrashing with no forward progress.

2) To help avoid (1), jobs can be trickled in. However, when GPU jobs trickle in, slots that were not matched pick up the attributes spread by STARTD_SLOT_ATTRS but are not re-advertised. The Negotiator ends up with a lagged view of the slots and will hand out matches that are rejected at CLAIM time. This hurts throughput.

3) When a job exits or is removed, only the slot that was running it is re-advertised. As in (2), the Negotiator is left with lagged state, except instead of handing out matches that will be rejected, it fails to hand out matches at all. This also hurts throughput.

Solutions:

1) Using Dynamic Slots can address the thrashing problem in (1), at least as of 7.4.2, where a STARTD_SLOT_ATTRS issue was resolved. Dynamic slots naturally control the rate at which jobs are matched to a node, preventing thrashing; a configuration sketch follows this list. Of course, with dynamic slots it takes more than a single negotiation cycle to fill a multi-core machine. (3) below could also help, but it would have the same slow-fill problem on multi-core machines, with added overall load.

2) Publish slots when they gain an attribute via STARTD_SLOT_ATTRS.

3) Publish slots when an attribute gained by STARTD_SLOT_ATTRS is removed from its source slot.
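For reference, a minimal partitionable (dynamic) slot configuration might look like the sketch below; the resource percentages are illustrative and the exact knob names should be checked against the documentation for your Condor version.

# one partitionable slot that owns all of the machine's resources;
# dynamic slots are carved out of it one match at a time
NUM_SLOTS = 1
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1 = cpus=100%, memory=100%, disk=100%
SLOT_TYPE_1_PARTITIONABLE = TRUE

Jobs then declare their needs in the submit file, e.g. request_cpus = 1, and the Startd creates a right-sized dynamic slot for each match, which is what spaces out the matches and avoids the thrashing in (1).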

NFS and Job Initial Working Directory (Iwd)

February 14, 2010

Condor deployments tend to include a network file system, such as NFS, AFS or SMB, which allows users easy access to their files across many machines. The presence of such file systems also means that a user can skip using Condor’s file transfer mechanisms and have their jobs write output or read input directly from the networked locations, often the user’s home directory. Condor is more than happy to do this, as long as the user’s credentials are available to access the home directory, which is often the case. Condor will even go one step further.

Sometime in the past, a user was automating their job submission to Condor, similar to what DAGMan does, and ran into a problem when their files were written to NFS. Their meta-scheduler, as such tools are called, was reading job output files and getting stale cached data. A job might have completed, but the machine running the meta-scheduler saw only part of the output. To get around this, the condor_schedd, which in this case was managing jobs for the meta-scheduler, was changed to try to flush the NFS cache for the job’s Iwd. When a job completes, the Schedd checks whether the Iwd is on NFS, and if so creates a temporary file that is immediately deleted. The Schedd’s log reports “Forcing NFS sync of Iwd” and a .condor_nfs_sync_XXXXXX file briefly lives in the Iwd. This of course has pros and cons.

On the plus side, this is helpful to meta-schedulers because now they never have to bother making sure data sources aren’t stale. Arguably the meta-scheduler should be fixed in this situation. On the negative side, all jobs that have an Iwd in NFS now incur a penalty in the form of some NFS round trips when they complete. This penalty can actually be very dramatic, even halving the number of jobs a single Schedd can complete in a second.

To address the performance hit, Condor 7.4 introduced the IwdFlushNFSCache job attribute. It defaults to True and can be changed per job in a submit file with +IwdFlushNFSCache = False, or for all new jobs with IwdFlushNFSCache = False followed by SUBMIT_EXPRS = IwdFlushNFSCache in configuration. As expected, IwdFlushNFSCache works as a guard around the code in the condor_schedd that flushes the Iwd on job completion.
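Concretely, the two ways to disable the flush described above look like this:

# in a single job's submit file
+IwdFlushNFSCache = False

# or, in configuration, for all newly submitted jobs
IwdFlushNFSCache = False
SUBMIT_EXPRS = IwdFlushNFSCache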

Maybe in future versions of Condor (7.5+) the default will become False and those who need the cache flushing functionality will place +IwdFlushNFSCache = True in their submit files.

