Posts Tagged ‘Migration’

FAQ: Job resubmission?

November 5, 2012

A question that often arises when approaching Condor from other batch systems is “How does Condor deal with resubmission of failed/preempted/killed jobs?”

The answer requires a slight shift in thinking.

Condor provides more functionality around the resubmission use case than most other schedulers, and the default policy is set up in such a way that most Condor folks never think about “resubmission.”

Condor will keep your job in the queue (condor_schedd managed) until the policy attached to the job says otherwise.

The default policy says a job will be run as many times as necessary for the job to terminate. So if the machine a job is running on crashes (generally, becomes unavailable), the condor_schedd will automatically try to run the job on another machine.

When you start changing the default policy you can control things such as: whether a job should be removed after a period of time, either even if it is running or only if it has not started running; whether a job should run multiple times even if it terminated cleanly; whether a termination with an error should make the job run again, be held in the queue for inspection, or be removed from the queue; whether a job held for inspection should be held forever or for a specific amount of time; and whether a job should only start running at a specific time in the future, or be run at repeated intervals.

The condor_submit manual page can provide specifics.
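A few of these policies can be sketched as submit description file expressions. The expressions below are illustrative, not prescriptive: the attribute names (periodic_remove, on_exit_hold, periodic_release, on_exit_remove) come from the condor_submit manual, but the thresholds are made up.

```
# Remove the job if it has been running (JobStatus == 2) for over 24 hours
periodic_remove = (JobStatus == 2) && (time() - EnteredCurrentStatus > 24*60*60)

# Hold the job for inspection if it terminates with a non-zero exit code
on_exit_hold = ExitCode =!= 0

# Release a held job (JobStatus == 5) after one hour so it runs again
periodic_release = (JobStatus == 5) && (time() - EnteredCurrentStatus > 60*60)

# Keep the job in the queue after termination so it runs repeatedly,
# even if it terminated cleanly
on_exit_remove = false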


Migrating workflows between Schedds with DAGMan

March 15, 2010

In a distributed system, migration of resources can happen at numerous levels. For instance, in the context of Condor, migration could mean moving running job processes between execution machines, jobs between Schedds, Schedds between machines, or Negotiators between machines. Here the focus is on workflows between Schedds. A workflow is a directed acyclic graph (DAG) of jobs managed by DAGMan.

DAGMan at its simplest is a process, condor_dagman, that is submitted and run as a job. It typically runs under the Schedd in what’s known as the Scheduler Universe. Different universes in Condor provide different contracts with jobs. The Scheduler Universe lets a job run local to the Schedd and provides certain signaling guarantees in response to condor_hold/release/rm. DAGMan is written to have all its state persisted in either job execution log files (UserLogs) or its input file, a DAG. It is highly tolerant of faults in the execution of jobs it is managing, and in its own execution.
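At its simplest, the DAG input file is just a list of named nodes, each pointing at a submit file, with optional dependencies between them. A minimal, illustrative example (the file and node names here are hypothetical):

```
# migration.dag - two independent nodes, each with its own submit file
JOB M0 m0.sub
JOB M1 m1.sub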

Migration of a workflow between Schedds amounts to moving the condor_dagman job running a workflow between Schedds. The Schedd does not support migration of jobs between Schedds. However, since DAGMan keeps no state it cannot reconstruct, it can be logically migrated through removal (condor_rm) and re-submission (condor_submit_dag).

For instance,

$ _CONDOR_SCHEDD_NAME=ScheddA@ condor_submit_dag migration.dag
Submitting job(s).
File for submitting this DAG to Condor           : migration.dag.condor.sub
Log of DAGMan debugging messages                 : migration.dag.dagman.out
Log of Condor library output                     : migration.dag.lib.out
Log of Condor library error messages             : migration.dag.lib.err
Log of the life of condor_dagman itself          : migration.dag.dagman.log

Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 1.

$ condor_q -dag -global
-- Schedd: ScheddA@ : 
   1.0   matt            3/15 10:39   0+00:00:04 R  0   1.7  condor_dagman -f -
  11.0    |-M0           3/15 10:39   0+00:00:00 I  0   0.0  sleep 19
  21.0    |-M1           3/15 10:39   0+00:00:00 I  0   0.0  sleep 80
3 jobs; 2 idle, 1 running, 0 held

$ condor_rm -name ScheddA@ 1.0   
Job 1.0 marked for removal

$ _CONDOR_SCHEDD_NAME=ScheddB@ condor_submit_dag migration.dag
Running rescue DAG 1
File for submitting this DAG to Condor           : migration.dag.condor.sub
Log of DAGMan debugging messages                 : migration.dag.dagman.out
Log of Condor library output                     : migration.dag.lib.out
Log of Condor library error messages             : migration.dag.lib.err
Log of the life of condor_dagman itself          : migration.dag.dagman.log

Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 2.

$ condor_q -dag -global
-- Schedd: ScheddB@ : 
   2.0   matt            3/15 10:44   0+00:00:04 R  0   1.7  condor_dagman -f -
  12.0    |-M00          3/15 10:45   0+00:00:03 R  0   0.0  sleep 19
  22.0    |-M11          3/15 10:45   0+00:00:02 R  0   0.0  sleep 84
3 jobs; 0 idle, 3 running, 0 held

There are two important things going on for this to work. First, _CONDOR_SCHEDD_NAME, a way of setting the configuration parameter SCHEDD_NAME from a shell environment, specifies where condor_submit_dag will submit the workflow, and, because it is recorded in the condor_dagman job’s environment, where all the jobs that condor_dagman submits will go. This is important because DAGMan only tracks job ids, not job id + schedd name.

Second, note the job ids: 1.0, 11.0, 21.0 and 2.0, 12.0, 22.0. As just mentioned, DAGMan only keeps track of job ids, which means the Schedds cannot have overlapping job id spaces. To achieve this, use SCHEDD_CLUSTER_INITIAL_VALUE and SCHEDD_CLUSTER_INCREMENT_VALUE: give each Schedd a unique initial cluster value, and set the cluster increment value to at least the number of Schedds.
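For example, with two Schedds, per-Schedd configuration along these lines would produce the disjoint cluster id sequences seen in the transcripts above (a sketch; an increment of 10 leaves room for additional Schedds):

```
# ScheddA's configuration: cluster ids 1, 11, 21, ...
SCHEDD_CLUSTER_INITIAL_VALUE = 1
SCHEDD_CLUSTER_INCREMENT_VALUE = 10

# ScheddB's configuration: cluster ids 2, 12, 22, ...
SCHEDD_CLUSTER_INITIAL_VALUE = 2
SCHEDD_CLUSTER_INCREMENT_VALUE = 10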

Additionally, condor_hold on the DAGMan job will prevent it from submitting new nodes while allowing existing ones to complete, which is useful for draining off a submission during migration.
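For instance, to drain the workflow running under ScheddA before removing and resubmitting it (using the DAGMan job id from the transcript above):

```
$ condor_hold -name ScheddA@ 1.0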

Live Migration: How it could work

August 25, 2009

The Condor infrastructure historically prefers jobs that run on a fixed set of resources. A parallel universe job may run on multiple resources at once, but won’t add or remove resources dynamically. All other jobs run on a single resource.

Condor has long had functionality to periodically checkpoint a process and resume from a checkpoint. Historically the functionality was part of the standard universe. It is also available in the virtual machine universe.

The checkpoint and resume functionality can provide a basis for migration. For instance, a job may run, checkpoint, be evicted, and then run again. The second time, the job starts from the last checkpoint to minimize badput. This is a powerful mode of operation that handles many use cases. One use case it does not handle is live migration. At issue is the “be evicted” step in the process.

Migration is about moving a process from one resource to another. Whenever a process migrates there will be some interruption in its execution. Live migration is about minimizing that interruption.

An eviction in Condor is when an executing job is removed from the resource where it is running and placed back into the queue as an idle job, ready to be scheduled and run again. This process often takes tens of seconds or more. It involves terminating the job and transferring any state off the execution resource, waiting for a negotiation cycle where a new resource is found for the job, transferring the job’s state to the new resource, and starting the job.

This is clearly not live migration, and it is not intended to be.

Live migration could be something that Condor’s scheduler and execution infrastructure becomes deeply aware of, but the existing infrastructure may already be enough. Generally, it is preferable to write general, rather than specialized, features into the infrastructure. It is better to either show the infrastructure simply cannot handle a use case and needs to be extended, or show that it can handle the use case but an extension would simplify it and simplify or enable others. Consider how two cooperating jobs could implement live migration on the existing infrastructure:


0) job A is submitted
1) job A starts running on resource X
2) job B is submitted, containing knowledge of job A
3) job B starts running on resource Y

Now, job A and B are special in that A knows how to transfer its state to someone else when asked and B knows how to ask A to transfer its state.

4) job B asks job A to transfer state and execution to it
5) after the transfer, job B starts executing and job A terminates

In the end, the new job B has taken over for job A, which may no longer exist.
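The handoff between jobs A and B can be sketched abstractly. Everything in this sketch is illustrative rather than Condor API: the class names, the state dictionary, and the in-process queue standing in for what would really be a network connection between execution resources.

```python
# Illustrative sketch of the A -> B state handoff; not Condor API.
import queue

class JobA:
    """Knows how to transfer its state to a successor when asked."""
    def __init__(self, state):
        self.state = state
        self.terminated = False

    def transfer_state(self, channel):
        # Step 4: on request, send state over the channel ...
        channel.put(self.state)
        # Step 5: ... and terminate once the transfer is done.
        self.terminated = True

class JobB:
    """Knows how to ask a running JobA for its state."""
    def __init__(self):
        self.state = None

    def take_over(self, job_a, channel):
        job_a.transfer_state(channel)  # ask A to transfer
        self.state = channel.get()     # resume from A's state
        return self.state

# The channel stands in for a network link between resources X and Y.
channel = queue.Queue()
a = JobA(state={"progress": 42})
b = JobB()
b.take_over(a, channel)
print(b.state, a.terminated)  # B now holds A's state; A has terminated
```

The point of the sketch is only the ordering of steps 4 and 5: B initiates, A transfers then terminates, and B resumes from the transferred state.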

The semantics jobs A and B must provide are not very complex. The details of implementing them very well may be. Many research and production technologies have tried to address this problem generally. One such technology, currently popular, is virtualization. Any virtualization technology worth its salt can handle migration between resources, often with minimal execution interruption.

Condor’s virtual machine universe uses the libvirt API, which provides it the ability to direct different hypervisors in performing migrations. Jobs that run on Condor can do the same. To actually do virtual machine migration in Condor a number of details would need to be worked out: who is allowed to initiate the migration; who should be terminating the source job (job A); what to do about different network topologies; how to most transparently link the source and destination (job B) jobs; how to be tolerant to faults before, during and after transfer; how to evaluate and pick a particular technology for a particular workload; what general enhancements can or should go into the infrastructure; … . It can be done, and not everything must be done to have a useful migration solution.

When it comes to general features for the infrastructure, one noticeable downside to the procedure above is that anything tracking job A has to know that job A’s termination signals the need to track job B, rather than a normal termination. The transfer operation between jobs A and B has to be robust: it must handle the cases where the transfer fails, potentially due to either job A or B being evicted, and it is likely not the case that jobs A and B should both think they own their shared state. All this information could be discernible at the tracking level, or it could be completely invisible. Often it is desirable to be completely invisible. Making the information invisible means providing a mechanism to link jobs A and B as a single unit in the infrastructure. Also, being invisible cannot mean hiding or swallowing critical information.

Another potential infrastructure enhancement is a means to link two jobs, though they may not be presented as a single unit. For instance, job B could be linked with job A meaning it can find out where job A is running, and job A could contain policy to officially terminate once job B has taken over execution. The ability to link jobs is actually one that is often requested. For instance, being able to have a set of jobs that always run on separate resources is quite desirable. Simplistically, the possibility of a STARTD_SLOT_ATTRS for jobs could be quite powerful.

Beyond tracking and sharing, the ability to have a multi-homed or multi-execution job could attack the common long tail in large workloads. When only a few jobs are left to be run, they could be matched and executed on multiple resources, with all but the first to complete being discarded.

Yet another enhancement that could assist in improving migration support, is the ability for a job to be matched with a resource while it is already executing. For instance, a running job might have ended up on a low ranked execution resource. It is stuck running there until it completes or gets evicted. It cannot express policy that states: I’ll run on my current resource until I complete, unless a higher ranked resource shows up in the first 5 minutes of my execution, in which case I want to be restarted on the other resource.

All these potential Condor enhancements would help in implementing or optimizing a migration solution, but are not specific to migration and do not necessarily involve deep knowledge about migration to be introduced to the scheduler or execution resources.
