Migrating workflows between Schedds with DAGMan

In a distributed system, migration of resources can happen at numerous levels. In the context of Condor, for instance, one could migrate running job processes between execution machines, jobs between Schedds, Schedds between machines, or Negotiators between machines. Here the focus is on migrating workflows between Schedds. A workflow is a directed acyclic graph (DAG) of jobs managed by DAGMan.

DAGMan at its simplest is a process, condor_dagman, that is submitted and run as a job. It typically runs under the Schedd in what’s known as the Scheduler Universe. Different universes in Condor provide different contracts with jobs. The Scheduler Universe lets a job run local to the Schedd and provides certain signaling guarantees in response to condor_hold/release/rm. DAGMan is written to have all its state persisted in either job execution log files (UserLogs) or its input file, a DAG. It is highly tolerant of faults in the execution of jobs it is managing, and in its own execution.
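
For reference, a minimal DAG input file has one JOB line per node, naming the node and the Condor submit file that describes its job. The actual DAG behind the example below is not shown here, so the node and file names in this sketch are only hypothetical:

# migration.dag -- hypothetical two-node workflow of sleep jobs
JOB M0 node0.sub
JOB M1 node1.sub

Each node's submit file is an ordinary Condor submit description, say executable = /bin/sleep with arguments = 19, plus a log line so DAGMan has a UserLog from which to recover state.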

Migration of a workflow between Schedds amounts to moving the condor_dagman job running the workflow from one Schedd to another. The Schedd itself does not support migrating jobs between Schedds. However, since DAGMan keeps no state it cannot reconstruct, a workflow can be logically migrated through removal (condor_rm) and re-submission (condor_submit_dag): on removal, DAGMan records which nodes have already completed in a rescue DAG, and condor_submit_dag picks the rescue DAG up automatically on re-submission, as the “Running rescue DAG 1” line below shows.

For instance,

$ _CONDOR_SCHEDD_NAME=ScheddA@ condor_submit_dag migration.dag
Submitting jobs(s).
-----------------------------------------------------------------------
File for submitting this DAG to Condor           : migration.dag.condor.sub
Log of DAGMan debugging messages                 : migration.dag.dagman.out
Log of Condor library output                     : migration.dag.lib.out
Log of Condor library error messages             : migration.dag.lib.err
Log of the life of condor_dagman itself          : migration.dag.dagman.log

Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 1.
-----------------------------------------------------------------------

$ condor_q -dag -global
-- Schedd: ScheddA@ : 
 ID      OWNER/NODENAME   SUBMITTED     RUN_TIME ST PRI SIZE CMD               
   1.0   matt            3/15 10:39   0+00:00:04 R  0   1.7  condor_dagman -f -
  11.0    |-M0           3/15 10:39   0+00:00:00 I  0   0.0  sleep 19
  21.0    |-M1           3/15 10:39   0+00:00:00 I  0   0.0  sleep 80
3 jobs; 2 idle, 1 running, 0 held

$ condor_rm -name ScheddA@ 1.0   
Job 1.0 marked for removal

$ _CONDOR_SCHEDD_NAME=ScheddB@ condor_submit_dag migration.dag
Running rescue DAG 1
-----------------------------------------------------------------------
File for submitting this DAG to Condor           : migration.dag.condor.sub
Log of DAGMan debugging messages                 : migration.dag.dagman.out
Log of Condor library output                     : migration.dag.lib.out
Log of Condor library error messages             : migration.dag.lib.err
Log of the life of condor_dagman itself          : migration.dag.dagman.log

Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 2.
-----------------------------------------------------------------------

$ condor_q -dag -global
-- Schedd: ScheddB@ : 
 ID      OWNER/NODENAME   SUBMITTED     RUN_TIME ST PRI SIZE CMD               
   2.0   matt            3/15 10:44   0+00:00:04 R  0   1.7  condor_dagman -f -
  12.0    |-M00          3/15 10:45   0+00:00:03 R  0   0.0  sleep 19
  22.0    |-M11          3/15 10:45   0+00:00:02 R  0   0.0  sleep 84
3 jobs; 0 idle, 3 running, 0 held

There are two important things going on for this to work. First, _CONDOR_SCHEDD_NAME, a way of setting the configuration parameter SCHEDD_NAME from the shell environment, specifies where condor_submit_dag will submit the workflow and, because it is recorded in the condor_dagman job’s environment, where all the jobs condor_dagman submits will go. This matters because DAGMan tracks only job ids, not job id plus Schedd name.
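
Because _CONDOR_SCHEDD_NAME is an ordinary _CONDOR_<KNOB> environment override, it is easy to sanity-check before submitting; condor_config_val reads the same configuration the other tools do. A quick check along these lines should work (ScheddB@ is just the name used in the example):

$ _CONDOR_SCHEDD_NAME=ScheddB@ condor_config_val SCHEDD_NAME   # should print ScheddB@

Likewise, condor_q -long on the running condor_dagman job should show _CONDOR_SCHEDD_NAME in its environment.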

Second, note the job ids: 1.0, 11.0, 21.0 versus 2.0, 12.0, 22.0. As just mentioned, DAGMan keeps track of job ids alone, so the Schedds must not have overlapping job id spaces. To achieve this, use SCHEDD_CLUSTER_INITIAL_VALUE and SCHEDD_CLUSTER_INCREMENT_VALUE: give each Schedd a unique initial cluster value, and set the cluster increment value to one more than the number of Schedds.
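
The ids above (1, 11, 21 on ScheddA and 2, 12, 22 on ScheddB) imply initial values of 1 and 2 with an increment of 10, so a per-Schedd configuration consistent with the example would look like this; the exact increment is not important so long as the resulting id streams never collide:

# config on the machine running ScheddA
SCHEDD_CLUSTER_INITIAL_VALUE   = 1
SCHEDD_CLUSTER_INCREMENT_VALUE = 10

# config on the machine running ScheddB
SCHEDD_CLUSTER_INITIAL_VALUE   = 2
SCHEDD_CLUSTER_INCREMENT_VALUE = 10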

Additionally, condor_hold on the DAGMan job will prevent it from submitting new nodes while allowing those already submitted to complete, which is useful for draining off a submission during migration, as sketched below.
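
Putting that together, a drained migration might look roughly like the following; the job id and Schedd names are the ones from the session above, and the middle step is simply watching condor_q until only the held condor_dagman job remains:

$ condor_hold -name ScheddA@ 1.0        # stop DAGMan from submitting new nodes
$ condor_q -name ScheddA@ -dag          # repeat until only the held DAGMan job remains
$ condor_rm -name ScheddA@ 1.0          # remove the drained DAGMan job; a rescue DAG is written
$ _CONDOR_SCHEDD_NAME=ScheddB@ condor_submit_dag migration.dag   # resume on ScheddB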
