The Condor infrastructure historically prefers jobs that run on a fixed set of resources. A parallel universe job may run in multiple resources at once, but won’t add or remove resources dynamically. All other jobs just run on a single resource.
Condor has long had functionality to periodically checkpoint a process and resume from a checkpoint. Historically the functionality was part of the standard universe. It is also available in the virtual machine universe.
The checkpoint and resume functionality can provide a basis for migration. For instance, a job may run, checkpoint, be evicted and then be run again. The second time the job will start from the last checkpoint to minimize bad put. This is a powerful mode of operation that handles many use cases. One use case not handled is live migration. At issue is the be evicted step in the process.
Migration is about moving a process from one resource to another. Whenever a process migrates there will be some interruption in its execution. Live migration is about minimizing that interruption.
An eviction in Condor is when an executing job is removed from the resource where it is running and placed back into the queue as an idle job ready to be scheduled and run again. This process often does not take less than 10s of seconds. It involves terminating the job and transferring any state off the execution resource, waiting for a negotiation cycle where a new resource is found for the job, transferring the jobs state the to new resource and starting the job.
This is clearly not live migration, and it is not intended to be.
Live migration could be something that Condor’s scheduler and execution infrastructure becomes deeply aware of, but the existing infrastructure may already be enough. Generally, it is preferable to write general, instead of specialized, features into the infrastructure. It is better to either show the infrastructure simply cannot handle a use case and needs to be extended, or show that it can handle a use case but an extension will simplify the case and simplify or enable others.
0) job A is submitted
1) job A starts running on resource X
2) job B is submitted, containing knowledge of job A
3) job B starts running on resource Y
Now, job A and B are special in that A knows how to transfer its state to someone else when asked and B knows how to ask A to transfer its state.
4) job B asks job A to transfer state and execution to it
5) after the transfer, job B starts executing and job A terminates
In the end, the new job B has taken over for job A, which may no longer exist.
The semantics jobs A and B must provide are not very complex. The details in implementing them very well may be. Many research and production technologies have tried to generally address this problem. One such, currently popular, technology is virtualization. Any virtualization technology worth its salt can handle migration between resources and often with minimal execution interruption.
Condor’s virtual machine universe uses the libvirt API, which provides it the ability to direct different hypervisors in performing migrations. Jobs that run on Condor can do the same. To actually do virtual machine migration in Condor a number of details would need to be worked out: who is allowed to initiate the migration; who should be terminating the source job (job A); what to do about different network topologies; how to most transparently link the source and destination (job B) jobs; how to be tolerant to faults before, during and after transfer; how to evaluate and pick a particular technology for a particular workload; what general enhancements can or should go into the infrastructure; … . It can be done, and not everything must be done to have a useful migration solution.
When it comes to general features for the infrastructure, one noticeable downsides to the procedure above is anything tracking job A has to know that job A’s termination signals the need to track job B instead of a normal termination. The transfer operation between jobs A and B has to be robust. It must handle the cases where the transfer fails, potentially due to either job A or B being evicted. It is likely not the case that job A and B should both think they are the owner of their shared state. All this information could be discernible at the tracking level, or it could be completely invisible. Often it is desirable to be complete invisible. Making the information invisible means providing a mechanism to link job A and B as a single unit in the infrastructure. Also, being invisible cannot involve hiding or swallowing critical information.
Another potential infrastructure enhancement is a means to link two jobs, though they may not be presented as a single unit. For instance, job B could be linked with job A meaning it can find out where job A is running, and job A could contain policy to officially terminate once job B has taken over execution. The ability to link jobs is actually one that is often requested. For instance, being able to have a set of jobs that always run on separate resources is quite desirable. Simplistically, the possibility of a STARTD_SLOT_ATTRS for jobs could be quite powerful.
Beyond tracking and sharing, the ability to have a multi-homed or multi-execution job could attack the common long tail in large workloads. When only a few jobs are left to be run, they could be matched and executed on multiple resources, with all but the first to complete being discarded.
Yet another enhancement that could assist in improving migration support, is the ability for a job to be matched with a resource while it is already executing. For instance, a running job might have ended up on a low ranked execution resource. It is stuck running there until it completes or gets evicted. It cannot express policy that states: I’ll run on my current resource until I complete, unless a higher ranked resource shows up in the first 5 minutes of my execution, in which case I want to be restarted on the other resource.
All these potential Condor enhancements would help in implementing or optimizing a migration solution, but are not specific to migration and do not necessarily involve deep knowledge about migration to be introduced to the scheduler or execution resources.