Posts Tagged ‘Scheduling’

Social scheduling

November 26, 2012

As a thought experiment.

There are always multiple users and limited resources. Users have work, which takes time and resources to complete.

The top resource users are visible to all.

A user can relinquish resources she is using.

A relinquished resource, either by work completing or by user action, is reassigned randomly.

How would this not work?

How would you refine it?

Extensible machine resources

November 19, 2012

Physical machines are home to many types of resources these days. The traditional cores, memory, and disk now share space with GPUs, co-processors, and even protein sequence analysis accelerators.

To facilitate use and management of these resources, a new feature is available in HTCondor for extending machine resources. Analogous to concurrency limits, which operate at a pool / global level, machine resources operate at a machine / local level.
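For contrast, a concurrency limit is declared once in the pool's configuration and consumed through the concurrency_limits submit command. A minimal sketch, with the LICENSE name and cap invented for illustration:

# pool (negotiator) configuration: at most 10 license tokens in use pool-wide
LICENSE_LIMIT = 10

# submit description: each running job holds one license token
concurrency_limits = license

Machine resources follow the same declare-and-request pattern, but the accounting happens per machine instead of across the pool.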

The feature allows a machine to advertise that it has specific types of resources available. Jobs can then specify that they require those specific types of resources, and the matchmaker will take the new resource types into account.

For example, a machine may have some GPU resources, an RS232 port connected to your favorite telescope, and a number of physical spinning hard disk drives. The configuration for this would be,

MACHINE_RESOURCE_NAMES = GPU, RS232, SPINDLE
MACHINE_RESOURCE_GPU = 2
MACHINE_RESOURCE_RS232 = 1
MACHINE_RESOURCE_SPINDLE = 4

SLOT_TYPE_1 = cpus=100%,auto
SLOT_TYPE_1_PARTITIONABLE = TRUE
NUM_SLOTS_TYPE_1 = 1

Aside – cpus=100%,auto instead of just auto because of GT3327. Also, the configuration for SLOT_TYPE_1 will likely go away in the future when all slots are partitionable by default.

Once a machine with this configuration is running,

$ condor_status -long | grep -i MachineResources
MachineResources = "cpus memory disk swap gpu rs232 spindle"

$ condor_status -long | grep -i -e TotalCpus -e TotalMemory -e TotalGpu -e TotalRs232 -e TotalSpindle
TotalCpus = 24
TotalMemory = 49152
TotalGpu = 2
TotalRs232 = 1
TotalSpindle = 4

$ condor_status -long | grep -i -e ^Cpus -e ^Memory -e ^Gpu -e ^Rs232 -e ^Spindle
Cpus = 24
Memory = 49152
Gpu = 2
Rs232 = 1
Spindle = 4

As you can see, the machine is reporting the different types of resources, how many of each it has and how many are currently available.

A job can take advantage of these new types of resources using a syntax already familiar for requesting resources from partitionable slots.

To consume one of the GPUs,

cmd = luxmark.sh

request_gpu = 1

queue

Or for a disk intensive workload,

cmd = hadoop_datanode.sh

request_spindle = 1

queue

With these jobs submitted and running,

$ condor_status
Name            OpSys      Arch   State     Activity LoadAv Mem ActvtyTime

slot1@eeyore    LINUX      X86_64 Unclaimed Idle      0.400 48896 0+00:00:28
slot1_1@eeyore  LINUX      X86_64 Claimed   Busy      0.000  128 0+00:00:04
slot1_2@eeyore  LINUX      X86_64 Claimed   Busy      0.000  128 0+00:00:04
                     Machines Owner Claimed Unclaimed Matched Preempting
        X86_64/LINUX        3     0       2         1       0          0
               Total        3     0       2         1       0          0

$ condor_status -l slot1@eeyore | grep -i -e ^Cpus -e ^Memory -e ^Gpu -e ^Rs232 -e ^Spindle
Cpus = 22
Memory = 48896
Gpu = 1
Rs232 = 1
Spindle = 3

That’s 22 cores, 1 GPU and 3 spindles still available.

Submit four more of the spindle-consuming jobs and you’ll find the fourth does not run, because the available number of spindles is 0.
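One way to submit all four at once is to reuse the submit description above with a queue count:

cmd = hadoop_datanode.sh

request_spindle = 1

queue 4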

$ condor_status -l slot1@eeyore | grep -i -e ^Cpus -e ^Memory -e ^Gpu -e ^Rs232 -e ^Spindle
Cpus = 19
Memory = 48512
Gpu = 1
Rs232 = 1
Spindle = 0

Since these custom resources are available as attributes in various ClassAds, the same way Cpus, Memory and Disk are, all the policy, management and reporting capabilities you would expect are available.
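For example, a quick reporting sketch using the attribute names advertised above, listing each slot that still has a GPU available along with its remaining spindles:

$ condor_status -constraint 'Gpu > 0' -format "%s " Name -format "Gpu=%d " Gpu -format "Spindle=%d\n" Spindle

The same attributes can be referenced from startd policy expressions and job requirements, just like Cpus and Memory.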

No longer thinking in slots, thinking in aggregate resources and consumption policies

November 13, 2012

The slot model would have been natural when a machine housed a single core, though the slot model did not exist back then.

When machines were single core the model was a machine, represented as a MachineAd. A MachineAd had an associated CPU, some nominal amount of RAM and some chunk of disk space. Running a job meant consuming a machine.

When machines grew multiple cores, the machine model was split. A single machine became independent MachineAds, called virtual machines. However, the name didn’t stick, as “virtual machine” grew popular in hardware virtualization. So a machine became independent MachineAds, called slots. The unifying entity, the machine itself, was lost. Running a job still meant consuming a slot.

Most recently, slots split into two classes: static and partitionable. Static slots are the slots formerly known as virtual machines. Partitionable slots are a representation of the physical machine itself, and are carved up on demand to service jobs. Both types are still MachineAds, but the consumption of partitionable slots is dynamic.

The slot model has demonstrated great utility but has been stretched.

In that time, workloads have also changed. They have become more memory bound, disk I/O bound, and network bound. They have started relying on specialized hardware and even application-level services. They have started both spanning cores and packing into them. They have grown complex data dependencies, become very short running, and become infrastructure-level long running.

Machines have also grown to include scores of cores, hundreds of gigabytes of RAM, dozens of terabytes of disk, specialized hardware such as GPUs, co-processors, entropy keys, high speed interconnects and a bevy of other attached devices.

Machines are lumpy; heterogeneous now means more than operating system and CPU architecture.

Furthermore, if it still existed, the machine model itself would fail to cleanly describe available resources. Classes of resources now exist that house entire clusters, grids, or life-cycle manageable application services. Some resources share addressable memory across operating system instances, some are custom architectures spanning whole data centers, and some cannot even provide an outline of their capacity. Resources may grow and shrink while in use.

Consumption of these resources is not necessarily straightforward or uniform.

It’s time to stop thinking in slots. It’s time to start thinking in aggregate resources and their consumption policies.

Advanced scheduling: Execute periodically with cron jobs

October 15, 2012

If you want to run a job periodically you could repeatedly submit jobs, or condor_qedit existing jobs after they run, but both of those options are kludges. Instead, the condor_schedd supports cron-like jobs as first-class citizens.

The cron-like feature builds on the ability to defer job execution. However, instead of using deferral_time, commands analogous to crontab(5) fields are available. cron_month, cron_day_of_month, cron_day_of_week, cron_hour, and cron_minute all behave as you would expect, and default to * when not provided.

To run a job every two minutes,

executable = /bin/date
log = cron.log
output = cron.out
error = cron.err

cron_minute = 0-59/2
on_exit_remove = false

queue

Note – on_exit_remove = false is required, or the job will only run once. It is arguable that on_exit_remove should default to false for jobs using cron_* commands.

After submitting and waiting 10 minutes, results can be found in the cron.log file.

$ grep ^00 cron.log
000 (009.000.000) 09/09 09:22:46 Job submitted from host: <127.0.0.1:56639>
001 (009.000.000) 09/09 09:24:00 Job executing on host: <127.0.0.1:45887>
006 (009.000.000) 09/09 09:24:00 Image size of job updated: 75
004 (009.000.000) 09/09 09:24:00 Job was evicted.
001 (009.000.000) 09/09 09:26:00 Job executing on host: <127.0.0.1:45887>
004 (009.000.000) 09/09 09:26:00 Job was evicted.
001 (009.000.000) 09/09 09:28:00 Job executing on host: <127.0.0.1:45887>
004 (009.000.000) 09/09 09:28:00 Job was evicted.
001 (009.000.000) 09/09 09:30:00 Job executing on host: <127.0.0.1:45887>
004 (009.000.000) 09/09 09:30:00 Job was evicted.
001 (009.000.000) 09/09 09:32:00 Job executing on host: <127.0.0.1:45887>
004 (009.000.000) 09/09 09:32:01 Job was evicted.

Note – the job appears to be evicted instead of terminated. What really happens is that the job remains in the queue on termination. This is arguably a poor choice of wording in the log.

Just like for job deferral, there is no guarantee resources will be available at exactly the right time to run the job. cron_prep_time and cron_window provide a means to introduce tolerance.
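For example, a sketch that lets the job be matched up to a minute early and start up to five minutes late; the tolerance values are invented for illustration:

cron_minute = 0-59/2
on_exit_remove = false

# may be matched up to 60 seconds before each start time
cron_prep_time = 60
# may begin executing up to 300 seconds after each start time
cron_window = 300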

Common question: What happens when a job takes longer than the time between scheduled starts, e.g. a job takes 30 minutes to complete but is set to run every 15 minutes?

Answer: The job will run serially. It will not stack up. The job does not need to serialize itself.

Note – a common complication, arguably a bug, which occurs only in pools with little or no new job submission: matchmaking must happen in time to dispatch the job, but the Schedd does not publish a new Submitter Ad for the cron job’s owner when the job completes. This means the submitter ad the Negotiator sees may show zero idle jobs, resulting in no new match being handed out to dispatch the job the next time it is set to execute.

Enjoy.

Advanced scheduling: Execute in the future with job deferral

September 24, 2012

One advanced scheduling feature of Condor is the ability to set a time, in the future, when a job should be run. This is called a deferral time.

Using the deferral_time command, you simply specify a time, in seconds since the Unix epoch, when your job should run:

executable = /bin/date
log = deferral.log
output = deferral.out
error = deferral.err

deferral_time = 1357016400

queue

Use date(1) to generate the deferral_time.

$ date -d @1357016400
Tue Jan  1 00:00:00 EST 2013
$ date +%s -d "2013-01-01 00:00:00"
1357016400

After submitting the job and waiting until 1 Jan 2013, you can see the result by looking in deferral.log and deferral.out.

$ grep ^00 deferral.log
000 (001.000.000) 08/15 22:33:00 Job submitted from host: <127.0.0.1:56006>
001 (001.000.000) 01/01 00:00:00 Job executing on host: <127.0.0.2:57590>
006 (001.000.000) 01/01 00:00:00 Image size of job updated: 75
005 (001.000.000) 01/01 00:00:00 Job terminated.

$ cat deferral.out
Tue Jan  1 00:00:00 EST 2013

Of course there is no guarantee that a resource will be available at a precise time in the future. A job that does not run at its deferral_time will be put on Hold for manual intervention.

To reduce the likelihood of missing the deferral_time and needing manual intervention, the deferral_prep_time and deferral_window commands are available. Respectively, they specify the amount of time before the deferral_time that the job can be matched with a resource and how long after the deferral_time execution is acceptable.

executable = /bin/date
log = deferral.log
output = deferral.out
error = deferral.err

deferral_time = 1357016400

# 1 day = 24 hour * 60 min * 60 sec = 86,400 seconds
# 1/2 day = 86,400 sec / 2 = 43,200 seconds
deferral_prep_time = 86400
deferral_window = 43200

queue

In the example above, the job may be matched to a resource up to a day (deferral_prep_time) in advance of its actual run, and it will keep that resource Claimed/Busy while it waits. This makes it more likely that the job will run at precisely the deferral_time. It also means that, for accounting purposes, you will be charged for using the resource even though the job has not yet run.

Additionally, if the job is not matched or otherwise does not start at precisely deferral_time, it has half a day (deferral_window) to run before it is put on hold for manual intervention.

That’s it.

