I have been meaning to write a short getting-started tutorial about installing a multiple-node Condor pool, but it just hasn’t happened yet. Lucky for everyone, Jon Thomas at Red Hat has written one. Check out his How do I install a basic two node MRG grid?, which covers installation on Linux very nicely.
Setup a two node Condor pool (Red Hat KB)
January 14, 2010 by spinningmattgit rebase and merge gotcha (edge case)
January 8, 2010 by spinningmattImagine you work on a project that generally has two live branches at a time: master, which holds everything, it’s your development branch; and, stable, which holds your latest stable release, it only ever gets bug fixes. The project tries very hard to maintain stable releases separate from new development, which may break backwards compatibility.
Now imagine a normal workflow for such a project. When a bug is identified in the stable branch, the fix goes into the stable branch. The fix is of course needed on the development branch as well, so the stable branch is merged into the development branch.
All is well and good.
Now imagine a workflow for how the stable branch gets the bug fix. Something like:
$ git checkout stable $ emacs ... $ git commit -a $ git fetch ; git rebase origin stable $ git push origin stable
The rebase avoids extraneous merge commits that only provide information that two developers were working on the branch at the same time.
So now you’re guessing something went wrong, especially given the title of this post. You’re probably also guessing it has to do with the rebase. You’re right.
If you’re the developer fixing this bug you just made the stable branch == the development branch, and in a rather non-obvious way.
First off, git rebase origin stable has a typo, it should be git rebase origin/stable, subtle. The latter does what you would expect, and is essentially equivalent to git pull --rebase, which is therefore much safer. The former sets the local stable branch equal to the origin branch with the bug fix tacked onto the end. That’s bad because typically origin is going to mean origin/master, and you’ve just set your stable branch equal to your development branch.
Now I didn’t mention it before, but you’re working from a cloned shared repository that disallows pushes that aren’t fast-forward merges. That’s a very good thing, and the default.
The fast-forward only requirement should save you here. git push origin stable should be denied. You would not expect to branch that contains all of your development series commits to be a simple fast-forward merge on top of your stable series. But it is!
Your workflow for bug fixes says you fix the bug then merge to master. After the last bug fix your branches look something like this:
master = d2 m0 (s1 s0) d1 d0 b stable = s1 s0 b b is some common base that will always exist sX are commits to the stable branch mX are merges of the stable into the edvelopment dX are development commits
After the latest bug fix your polluted stable branch becomes:
stable = s2 d2 m0 (s1 s0) d1 d0 b
How on earth can this be a fast-forward merge for the push. Well git log may interleave s1 d1 d0 s0 is any number of ways, but in reality the branch is actually a graph:
s2 d2 m0 / \ s1 d0 s0 d0 \ / b
And it all becomes clear. To fast-foward this on stable = s1 s0 b just apply m0 then d2 s2.
What can you do? You could not rebase, you could always merge stable^ into master, you can also write an update hook to disallow development commits in the stable branch.
Timeouts from condor_rm and condor_submit
December 16, 2009 by spinningmattThe condor_schedd is an event driven process, like all other Condor daemons. It spends its time waiting in select(2) for events to process. Events include: condor_q queries, spawning and reaping condor_shadow processes, accepting condor_submit submissions, negotiating with the Negotiator, removing jobs during condor_rm. The responsiveness of the Schedd to user interaction, e.g. condor_q, condor_rm, condor_submit, and process interaction, e.g. messages with condor_shadow, condor_startd or condor_negotiator, is effected by how long it takes to process an event and how many events are waiting to be processed.
For instance, if a thousand condor_shadow processes start up at the same time there may be a thousand keep-alive messages for the Schedd to process after a single call to select. Once select returns, no new events will be considered until the Schedd calls select again. A condor_rm request would have to wait. Likewise, if any one event takes a long time to process, such as a negotiation cycle, it can also keep the Schedd from getting back to select and accepting new events.
Basically, to function well, the Schedd needs to get back to select as fast as possible.
From a user perspective, when the Schedd does not get back to select quickly, a condor_rm or condor_submit attempt may appear to fail, e.g.
$ time condor_rm -a Could not remove all jobs. real 0m20.069s user 0m0.020s sys 0m0.020s
As of the Condor 7.4 series, this rarely happens because of internal events that the Schedd is processing. The Schedd uses structures that allow such events to be interleaved with calls to select. However, some events still take long periods of time, e.g. the removal of 300,000 jobs above. One such event is a negotiation cycle initiated by the Negotiator. If a condor_rm, condor_q, condor_submit, etc happens during a negotiation, there is a good chance it may timeout.
Though a simple re-try of the tool will often succeed, this timeout may be annoying to users of the tools, be they people or processes. An alternative to a re-try is to extend the timeout used by the tool. The default timeout is 20 seconds, which is very often long enough, but may not be in large pools.
To extend the timeout for condor_submit, put SUBMIT_TIMEOUT_MULTIPLER=3 in the configuration file read by condor_submit. To extend the timeout for condor_q, condor_rm, etc, put TOOL_TIMEOUT_MULTILIER=3 in the configuration file read by the tool. These changes will take the default timeout, 20 seconds, and multiply it by 3, giving the Schedd 60 seconds to respond. For instance, with 100Ks of jobs in the queue:
$ _CONDOR_TOOL_TIMEOUT_MULTIPLIER=3 time condor_rm -a All jobs marked for removal. 0.01user 0.02system 0:53.99elapsed 0%CPU (0avgtext+0avgdata 0maxresident)k 0inputs+0outputs (0major+4374minor)pagefaults 0swaps
Cap job runtime: Debugging periodic job policy in a Condor pool
December 5, 2009 by spinningmattJob policy includes a set of periodic expressions; PeriodicHold, PeriodicRelease, and PeriodicRemove. Periodic expressions on a job are evaluated in the context of the job’s ad. They should evaluate to a boolean; if PeriodicRemove is TRUE, then remove the job. They are evaluated by the Schedd and a Shadow. The Schedd evaluates them as frequently as PERIODIC_EXPR_INTERVAL, which defaults to 60 seconds in condor 7.4 and 300 before. The Shadow evaluates them periodically, based on PERIODIC_EXPR_INTERVAL, and every time it receives an update from the Starter running the job. Updates occur periodically, controlled by STARTER_UPDATE_INTERVAL on the Starter.
Say you want to put a job on hold if it runs for more than 60 seconds. To do this you need job policy and two points of reference; the start time of the job, and the current time.
The job policy you want to use is PeriodicHold.
For the two time references you need to look at a job’s ad to see what is available. You can see a job’s ad with condor_q -long. The start time is available as JobCurrentStartDate, seconds since Epoch, and it would appear that ServerTime is the current time, seconds since Epoch.
Here’s a job you might write:
executable = /bin/sleep arguments = 29m notification = never periodic_hold = (ServerTime - JobCurrentStartDate) >= 60 queue 1
If you submit it and wait a minute you will see it is on Hold.
$ condor_q -- Submitter: robin.local : : robin.local ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD 1.0 matt 11/5 19:03 0+00:00:19 H 0 2.0 sleep 29m 1 jobs; 0 idle, 0 running, 1 held
But, it only ran for 19 seconds. Check why.
$ condor_q -hold -- Submitter: robin.local : : robin.local ID OWNER HELD_SINCE HOLD_REASON 1.0 matt 11/5 19:04 The UNKNOWN (never set) PeriodicHold expres 1 jobs; 0 idle, 0 running, 1 held
The hold reason is truncated, but you can use condor_q -long to see all of it.
$ condor_q -long | grep ^HoldReason HoldReason = "The UNKNOWN (never set) PeriodicHold expression '' evaluated to UNDEFINED" HoldReasonCode = 5 HoldReasonSubCode = 0
This is saying that something went wrong and your PeriodicHold expression evaluated to UNDEFINED. You wanted it to evaluate to a boolean. The way expression evaluation in ClassAds happens, if part of an expression evaluates to UNDEFINED then chances are the entire expression will. For instance, the expression A + 1 will evaluate to the value of the A attribute plus 1. When A is a number like 2, the expression evaluates to 2 + 1 => 3. When A is not defined, the expression evaluates to UNDEFINED + 1 => UNDEFINED.
There is a handy way to debug expression in Condor, use the debug() ClassAd function.
Your job becomes:
executable = /bin/sleep arguments = 29m notification = never periodic_hold = debug((ServerTime - JobCurrentStartDate) >= 60) queue 1
When you submit the job and look in either the SchedLog or ShadowLog you will see “Classad debug” messages.
ShadowLog:
...: Classad debug: ServerTime --> UNDEFINED ...: Classad debug: JobCurrentStartDate --> ERROR ...: Classad debug: JobCurrentStartDate --> 1257476798 ...: Classad debug: debug((ServerTime - JobCurrentStartDate) >= 60) --> UNDEFINED
Very clearly, ServerTime is evaluating to UNDEFINED. That may seem strange because it looked like it was part of the job ad. However, ServerTime is somewhat special. Likely buggy. It is only added to a job ad in response to a query, e.g. condor_q. It is not actually part of the job ad. Annoying.
There is a solution. Another special attribute is CurrentTime. It is available to all expressions, but it is not a visible attribute on a job ad. Also annoying. You have to know it is there. Using it we can rewrite the job as follows.
executable = /bin/sleep arguments = 29m notification = never periodic_hold = debug((CurrentTime - JobCurrentStartDate) >= 60) queue 1
After submitting the job and waiting a minute you can see that it has been put on hold.
$ condor_q -- Submitter: robin.local : : robin.local ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD 3.0 matt 11/5 19:09 0+00:01:00 H 0 0.0 sleep 29m 1 jobs; 0 idle, 0 running, 1 held
The details of the hold are also much more what you would expect.
$ condor_q -long | grep ^HoldReason HoldReason = "The job attribute PeriodicHold expression 'debug((CurrentTime - JobCurrentStartDate) >= 60)' evaluated to TRUE" HoldReasonCode = 3 HoldReasonSubCode = 0
In the ShadowLog you can see that CurrentTime is available as expected.
ShadowLog:
...: Classad debug: CurrentTime --> 1257477062 ...: Classad debug: JobCurrentStartDate --> ERROR ...: Classad debug: JobCurrentStartDate --> 1257477002 ...: Classad debug: debug((CurrentTime - JobCurrentStartDate) >= 60) --> 1
Remember to remove debug() from your expressions.
Custom ClassAd attributes in Condor: FreeMemoryMB via STARTD_CRON
November 17, 2009 by spinningmattClassAds enable a lot of power in Condor. ClassAds are the data representation used for most everything in Condor. Generally, a ClassAd is a set of name-value pairs. Values are typed, and include the usual suspects plus expressions. In Condor, jobs, users, slots, daemons, etc. all have a ClassAd representation with their own set of attributes, e.g. a job has a Requirements attribute whose value is an expression such as FreeMemoryMB > 1024, and a slot has a Memory attribute whose value is the amount of memory allocated to the slot.
ClassAds are schema-free, which means they can be arbitrarily extended. Any given ad can have any attribute a user or administrator wants to put in it. Attributes are given meaning by when and where they are referenced.
On a slot there are different classes of attributes. Resource statistics, such as Disk and LoadAvg; resource properties, such as NumCpus and HasVM; policy expressions, such as Requirements or Rank; etc.
One attribute not provided by default on a slot is a statistic representing the total amount of free memory on the system. There are multiple interpretations of what free memory might really mean. On Linux, it could be the number in the free column on the Mem: row reported by the free program, e.g.
$ free -m
total used free shared buffers cached
Mem: 2016 1790 225 0 93 845
-/+ buffers/cache: 851 1165
Swap: 1983 0 1983
or it might be the value on the buffers/cache line. It might be some combination of totalram, freeram, sharedram, bufferedram as reported by sysinfo(2). It might also include information about totalswap and freeswap, also from sysinfo(2).
The meaning is often a function of what kinds of policies are desired in a Condor deployment.
Once you have picked a meaning for your deployment, Condor provides you with the STARTD_CRON mechanism to include a FreeMemoryMB attribute in your slot ads. From there the attribute can be referenced by policy on jobs, during negotiation, and slot policy.
You need two things: first, a way to calculate the value for FreeMemoryMB, we’ll use the simple bash script below that pulls FreeMem: out of /proc/meminfo; second, configuration available to the condor_startd to run the program.
free_memory_mb.sh:
#!/bin/sh
FREE_MEMORY_KB=$(grep ^MemFree < /proc/meminfo | awk '{print $2}')
echo "FreeMemoryMB = $((FREE_MEMORY_KB / 1024))"
condor_config.local:
STARTD_CRON_JOBLIST = FREE_MEMORY_MB STARTD_CRON_FREE_MEMORY_MB_EXECUTABLE = $(LIBEXEC)/free_memory_mb.sh STARTD_CRON_FREE_MEMORY_MB_PERIOD = $(UPDATE_INTERVAL)s
Notes: First, the documentation is out of sync with 7.4.0 and the units on _PERIOD must be specified; second, UPDATE_INTERVAL needs to be defined, it specifies how often the condor_startd will periodically send updates to the Collector.
After reconfiguring the Startd, or just restarting condor, you can view the new attribute with condor_status:
$ condor_status -long | grep ^FreeMemoryMB | sort | uniq -c
10 FreeMemoryMB = 174
$ free -m
total used free shared buffers cached
Mem: 2016 1842 173 0 101 874
-/+ buffers/cache: 866 1149
Swap: 1983 0 1983
Yes, the values are different between FreeMemoryMB and free. The amount of free memory is changing constantly and we are just sampling it. You can increase the sampling rate, but beware that means you will generate more frequent updates to the Collector. Maybe not a problem when you have 32 machines and 256 slots, but definitely something to consider when you have 3000 machines and 24000 slots.
Final note: A better name for the attribute representing free memory on a system is TotalFreeMemoryMB to remain consistent with other attributes. For instance, Disk is a slot’s share of the TotalDisk free on the system.
Submitting a workflow to Condor via SOAP using Java
November 2, 2009 by spinningmatt
In Condor, a workflow is called a DAG. DAG stand for directed acyclic graph. DAGs in Condor are managed by processes called condor_dagman. condor_dagman is a program that takes a description of a DAG as input and walks it; submitting jobs and monitoring their progress. The condor_dagman process itself is often submitted to Condor and managed by the Schedd like any other job. The only thing special about condor_dagman jobs is that they are run on the same machine as the Schedd. Typically in Condor’s Scheduler Universe. Historical note: the first job submitted to Condor via SOAP was a DAG.
Here’s a recipe for submitting a DAG to Condor via SOAP:
0) Get your favorite SOAP library. Here we’ll use Apache Axis: http://www.apache.org/dyn/closer.cgi/ws/axis/1_4
You’ll want to untar it:
$ tar zxf axis-bin-1_4.tar.gz
1) Generate and compile your SOAP library code
# To avoid passing -classpath to java and javac, set CLASSPATH in your env $ export CLASSPATH=.:$(echo axis-1_4/lib/*.jar | tr ' ' ':') # Have Axis generate Java stubs $ java org.apache.axis.wsdl.WSDL2Java file:///usr/share/condor/webservice/condorSchedd.wsdl # Compile the stubs $ javac condor/*.java
2) Write a DAG. This one is pretty simple, it’s a diamond: D depends on B and C, B and C depend on A:
~/dagman/diamond.dag: Job A A.submit Job B B.submit Job C C.submit Job D D.submit PARENT A CHILD B C PARENT B C CHILD D ~/dagman/A.submit: executable = /bin/sleep arguments = 112 output = A.out log = diamond.log queue ~/dagman/B.submit: executable = /bin/sleep arguments = 358 output = B.out log = diamond.log queue ~/dagman/C.submit: executable = /bin/sleep arguments = 132 output = C.out log = diamond.log queue ~/dagman/D.submit: executable = /bin/sleep arguments = 134 output = D.out log = diamond.log queue
3) Write a program that will submit your DAG via SOAP.
The Schedd’s Submit() function takes a ClassAd representing a job. What condor_submit does is take a submit file and convert it into a ClassAd. To look at a submit file for a DAG run condor_submit_dag -no_submit diamond.dag and read diamond.dag.condor.sub. That will give you a hint at what kind of environment condor_dagman wants to run in. Note: remove_kill_sig, arguments, environment and on_exit_remove.
However, diamond.dag.condor.sub is not a ClassAd. To see the ClassAd you can run diamond.dag.condor.sub through condor_submit. Do so with condor_submit -dump diamond.dag.condor.sub.ad diamond.dag.condor.sub. Have a look at diamond.dag.condor.sub.ad and note RemoveKillSig, Arguments, Env and OnExitRemove.
Now you have the basis for what a DAG job needs to run. In the example code I used a little extra knowledge to generate the ClassAd, for brevity. You can use the Schedd’s CreateJobTemplate function to help generate the ClassAd for you instead. Extending arrays is kinda annoying in Java, but cake in python.
Start with this example: CondorSubmitDAG.java
4) Configure your condor_schedd to accept SOAP requests. The basic configuration you will need is:
ENABLE_SOAP = TRUE ALLOW_SOAP = * QUEUE_ALL_USERS_TRUSTED = TRUE SCHEDD_ARGS = -p 1984
This configuration lets anyone talk to your Schedd and submit jobs as any user. For deployment, you should restrict this access by narrowing the ALLOW_SOAP and by setting QUEUE_ALL_USERS_TRUSTED to FALSE. Note: changing QUEUE_ALL_USERS_TRUSTED requires that clients can authenticate themselves via SSL.
This configuration also gives you a fixed port, 1984, for the condor_schedd. Otherwise the port is ephemeral, and you’ll have to query the Collector to find it.
5) Compile your submission program and submit your DAG.
$ javac CondorSubmitDAG.java $ java CondorSubmitDAG http://localhost:1984 soapmonkey /some/shared/space/where/you/put/dagman/diamond.dag
You can watch the DAG run with condor_q, and condor_q -dag. Don’t be afraid of the —????— for the DAG itself, that’s just because I pruned the job ad to the bare minimum to run. condor_q expects a few extra attributes to be present. If you use CreateJobTemplate you’ll get all the attributes condor_q wants.
condor_master for managing processes
October 21, 2009 by spinningmatt
#!/usr/bin/env python
from socket import socket, htons, AF_INET, SOCK_DGRAM
from array import array
def send_alive(pid, timeout=300, master_host="127.0.0.1", master_port=1271):
"""
Send a UDP packet to the condor_master containing the
DC_CHILDALIVE command.
This will have the master register a trigger to fire in timeout
seconds. When the trigger fires the pid will be killed. Each time
the DC_CHILDALIVE is sent the trigger's timer is reset to fire in
timeout seconds.
DC_CHILDALIVE should be sent every timeout/3 seconds.
"""
sock = socket(AF_INET, SOCK_DGRAM)
sock.sendto(build_message(pid, timeout), (master_host, master_port))
def build_message(pid, timeout):
"""
Build a datagram packet to send to the condor_master.
The package format is (command, pid, timeout). The command is
always DC_CHILDALIVE (the integer 60008). The pid is the pid of
the process the master is monitoring, i.e. getpid if this
script. The timeout is the amount of time, in seconds, the master
will wait before killing the pid. Each field in the packet must be
8 bytes long, thus the padding.
"""
DC_CHILDALIVE = 60008
message=array('H')
message.append(0); message.append(0); message.append(0) # padding
message.append(htons(DC_CHILDALIVE))
message.append(0); message.append(0); message.append(0) # padding
message.append(htons(pid))
message.append(0); message.append(0); message.append(0) # padding
message.append(htons(timeout))
return message.tostring()
#
# The condor_master is a daemon that can run arbitrary executables,
# monitor them for failures, restart them, watch for executable
# updates, and send obituary emails.
#
# The condor_master can monitor any program it starts for abnormal
# termination, e.g. return code != 0 or caused by a signal. It can
# also detect hung processes if they periodically send DC_CHILDALIVE
# commands.
#
# The condor_master cleans up process trees, not just the processes it
# directly started.
#
# Example usage...
#
# Run this script from the condor_master, e.g.
# env CONDOR_CONFIG=ONLY_ENV \
# _CONDOR_WATCH_UDP_COMMAND_SOCKET=TRUE \
# _CONDOR_NETWORK_INTERFACE=127.0.0.1 \
# _CONDOR_MASTER_LOG=master.log \
# _CONDOR_MASTER=$(which condor_master) \
# _CONDOR_PROG=$(which <this script>) \
# _CONDOR_PROC_ARGS="-some -args 3" \
# _CONDOR_DAEMON_LIST=MASTER,PROG \
# condor_master -p 1271 -pidfile master.pid
#
# At some point kill -STOP or -TERM this script and watch the
# condor_master react.
#
# condor_master configuration/features:
#
# CONDOR_CONFIG=ONLY_ENV
# - useful, only look in the ENV for config, no config file needed
#
# _CONDOR_WATCH_UDP_COMMAND_SOCKET=TRUE
# - required, make master listen for UDP commands
#
# _CONDOR_NETWORK_INTERFACE=<ip>
# - useful, make master listen on <ip>
#
# _CONDOR_MASTER_LOG=<file>
# - required, the master's log file
#
# _CONDOR_MASTER_DEBUG=<level>
# - D_ALWAYS is default D_FULLDEBUG shows a lot more
#
# _CONDOR_MASTER=<path to condor_master>
# - required, master will restart itself if its executable is
# updated
#
# _CONDOR_PROG=<path to this script>
# - required, the path to an executable the master will start and
# monitor
#
# _CONDOR_PROG_ARGS=<arguments>
# - useful, if the executable needs the master to pass it arguments
#
# _CONDOR_DAEMON_LIST=MASTER,PROG
# - required, list of executables the master will monitor
#
# _CONDOR_MAIL=/bin/mail
# - useful, no default, if email notification of PROG hang/crash is
# desired. don't specify and get no emails.
#
# _CONDOR_CONDOR_ADMIN=admin@fqdn
# - useful, the address to send emails to
#
# _CONDOR_MASTER_BACKOFF_CONSTANT, C, default 9 seconds
# _CONDOR_MASTER_BACKOFF_FACTOR, K, default 2 seconds
# _CONDOR_MASTER_BACKOFF_CEILING, T, default 3600 seconds
# _CONDOR_MASTER_RECOVER_FACTOR, default 500 seconds
# - useful, parameters to control the exponential backoff
# start delay = min(T, C + K^n),
# n is the # of restarts since last recovery
# a recovery is a run of RECOVER_FACTOR without a crash
#
# _CONDOR_MASTER_CHECK_NEW_EXEC_INTERVAL=<seconds>
# - useful, default 300, how often to check executable timestamps
# and potentially restart processes
#
# _CONDOR_MASTER_NEW_BINARY_DELAY=<seconds>
# - useful, default 120, time waited after noticing a timestamp
# change before restarting the executable
#
# condor_master -p <port> -pidfile <file>
# - specify a port for the master to listen on, default is
# ephemerial, and specify a file where the master writes its pid,
# default is nowhere
# also: -t (log to stdout) -f (run in foreground)
#
# Advanced note: condor_squawk can send DC_CHILDALIVE as well, e.g.
# $ echo "command DC_CHILDALIVE 1234 15" | condor_squawk "<127.0.0.1:1271>"
#
import time, os
timeout = 30
while True:
send_alive(os.getpid(), timeout)
time.sleep(timeout/3)
Live Migration: How it could work
August 25, 2009 by spinningmattThe Condor infrastructure historically prefers jobs that run on a fixed set of resources. A parallel universe job may run in multiple resources at once, but won’t add or remove resources dynamically. All other jobs just run on a single resource.
Condor has long had functionality to periodically checkpoint a process and resume from a checkpoint. Historically the functionality was part of the standard universe. It is also available in the virtual machine universe.
The checkpoint and resume functionality can provide a basis for migration. For instance, a job may run, checkpoint, be evicted and then be run again. The second time the job will start from the last checkpoint to minimize bad put. This is a powerful mode of operation that handles many use cases. One use case not handled is live migration. At issue is the be evicted step in the process.
Migration is about moving a process from one resource to another. Whenever a process migrates there will be some interruption in its execution. Live migration is about minimizing that interruption.
An eviction in Condor is when an executing job is removed from the resource where it is running and placed back into the queue as an idle job ready to be scheduled and run again. This process often does not take less than 10s of seconds. It involves terminating the job and transferring any state off the execution resource, waiting for a negotiation cycle where a new resource is found for the job, transferring the jobs state the to new resource and starting the job.
This is clearly not live migration, and it is not intended to be.
Live migration could be something that Condor’s scheduler and execution infrastructure becomes deeply aware of, but the existing infrastructure may already be enough. Generally, it is preferable to write general, instead of specialized, features into the infrastructure. It is better to either show the infrastructure simply cannot handle a use case and needs to be extended, or show that it can handle a use case but an extension will simplify the case and simplify or enable others.
Imagine,
0) job A is submitted
1) job A starts running on resource X
2) job B is submitted, containing knowledge of job A
3) job B starts running on resource Y
Now, job A and B are special in that A knows how to transfer its state to someone else when asked and B knows how to ask A to transfer its state.
4) job B asks job A to transfer state and execution to it
5) after the transfer, job B starts executing and job A terminates
In the end, the new job B has taken over for job A, which may no longer exist.
The semantics jobs A and B must provide are not very complex. The details in implementing them very well may be. Many research and production technologies have tried to generally address this problem. One such, currently popular, technology is virtualization. Any virtualization technology worth its salt can handle migration between resources and often with minimal execution interruption.
Condor’s virtual machine universe uses the libvirt API, which provides it the ability to direct different hypervisors in performing migrations. Jobs that run on Condor can do the same. To actually do virtual machine migration in Condor a number of details would need to be worked out: who is allowed to initiate the migration; who should be terminating the source job (job A); what to do about different network topologies; how to most transparently link the source and destination (job B) jobs; how to be tolerant to faults before, during and after transfer; how to evaluate and pick a particular technology for a particular workload; what general enhancements can or should go into the infrastructure; … . It can be done, and not everything must be done to have a useful migration solution.
When it comes to general features for the infrastructure, one noticeable downsides to the procedure above is anything tracking job A has to know that job A’s termination signals the need to track job B instead of a normal termination. The transfer operation between jobs A and B has to be robust. It must handle the cases where the transfer fails, potentially due to either job A or B being evicted. It is likely not the case that job A and B should both think they are the owner of their shared state. All this information could be discernible at the tracking level, or it could be completely invisible. Often it is desirable to be complete invisible. Making the information invisible means providing a mechanism to link job A and B as a single unit in the infrastructure. Also, being invisible cannot involve hiding or swallowing critical information.
Another potential infrastructure enhancement is a means to link two jobs, though they may not be presented as a single unit. For instance, job B could be linked with job A meaning it can find out where job A is running, and job A could contain policy to officially terminate once job B has taken over execution. The ability to link jobs is actually one that is often requested. For instance, being able to have a set of jobs that always run on separate resources is quite desirable. Simplistically, the possibility of a STARTD_SLOT_ATTRS for jobs could be quite powerful.
Beyond tracking and sharing, the ability to have a multi-homed or multi-execution job could attack the common long tail in large workloads. When only a few jobs are left to be run, they could be matched and executed on multiple resources, with all but the first to complete being discarded.
Yet another enhancement that could assist in improving migration support, is the ability for a job to be matched with a resource while it is already executing. For instance, a running job might have ended up on a low ranked execution resource. It is stuck running there until it completes or gets evicted. It cannot express policy that states: I’ll run on my current resource until I complete, unless a higher ranked resource shows up in the first 5 minutes of my execution, in which case I want to be restarted on the other resource.
All these potential Condor enhancements would help in implementing or optimizing a migration solution, but are not specific to migration and do not necessarily involve deep knowledge about migration to be introduced to the scheduler or execution resources.
Condor and the Cloud @ Open Source Cloud Computing Forum, 22 July 2009
July 24, 2009 by spinningmattRed Hat hosted an Open Source Cloud Computing forum on 22 July 2009. A number of technologies for building clouds were presented, including my session on Condor. The proceedings of the forum will be available online for 12 months. The Condor and the Cloud presentation is also attached to this blog.
Submitting to Condor from QMF: A minimal Vanilla Universe job
April 30, 2009 by spinningmattPart of Red Hat’s MRG product is a management system that covers Condor. At the core of that system is the Qpid Management Framework (QMF), built on top of AMQP. Condor components modeled in QMF allow for information retrieval and control injection. I previously discussed How QMF Submission to Condor Could work, and now there’s a prototype of that interface.
Jobs in Condor are represented as ClassAds, a.k.a. a job ad, which are essentially schema-free. Any schema they have is defined by the code paths that handle jobs. This includes not only the name of attributes but also the types of their values.. For instance, a job ad does not have to have an attribute that describes what program to execute, the Cmd attribute whose value is a string, but if it does then the job can actually run.
Who cares what’s in a job ad?
Most all components in a Condor system handle jobs in one way or another, which means they all help define what a job ad looks like. To complicate things a bit, different components handle jobs in different ways depending on the job’s type or requested features. For simplicity I’m only going to discuss jobs that want to run in the Vanilla Universe with minimal extra features, e.g. no file transfer requests.
The Schedd is the first component that deals with jobs. Its purpose is to manage them. It insists all jobs it manages have: an Owner string, an identify of the job’s owner; a ClusterId string, an identifier assigned by the Schedd; a ProcId string, an identifier within the ClusterId, also assigned by the Schedd; a JobStatus integer, specifying the state the job is in, e.g. Idle, Held, Running; and, a JobUniverse integer, denoying part of a job’s type, e.g. Vanilla Universe or Grid Universe.
The Negotaitor is sent jobs by the Schedd to perform matching with machines. To do that matching it requires that a job have a Requirements expression.
The Shadow helps the Schedd manage a job while it is running. The Schedd gives it the job ad to the Shadow, and the Shadow insists on finding an Iwd string representing a path to the job’s initial working directory.
The Startd handles jobs similarly to the Negotiator. Since matching is bi-directional in Condor and execution resources get final say in running a job, the Startd evaluates the job’s Requirements expression before accepting it.
The Starter is responsible for actually running the job. It requires that a job have the Cmd string attribute. Without one the Starter does not know what program to run. An additional requirement the Starter imposes is the validity of the Owner string. If the Starter is to impersonate the Owner, then the string must specify an identify known on the Starter’s machine.
Now, those are not the only attributes on a job ad. Often components will fill in sane defaults for attributes they need but are not present. For instance, the Shadow will fill in the TransferFiles string, and the Schedd will assume a MaxHosts integer.
What about tools?
It’s not just the components that manage jobs that care about attributes on a job. The condor_q command-line tool also expects to find certain attributes to help it display information. In addition to those attributes already described, it requires: a QDate integer, the number of seconds since EPOCH when the job was submitted; a RemoteUserCpu float, an attribute set while a job runs; a JobPrio integer, specifying the job’s relative priority to other jobs; and, an ImageSize integer, specifying the amount of memory used by the job in KiB.
What does a submitter care about?
Anyone who wants to submit a job to Condor is going to have to specify enough attributes on the job ad to make all the Condor components happy. Those attributes will vary depending on the type of job and the features it desires. Luckily Condor can fill in many attributes with sane defaults. For instance, condor_q wants a QDate and a RemoteUserCpu. Those are two attributes that the submitter should not be able to or have to specify.
To perform a simple submission for running a pre-staged program, the job ad needs: a Cmd, a Requirements, a JobUniverse, an Iwd, and an Owner. Additionally, an Args string is possible if the Cmd takes arguments.
Given the knownledge of what attributes are required on a job ad, and using the QMF Python Console Tutorial, I was able to quickly write up an example program to submit a job via QMF.
#!/usr/bin/env python
from qmf.console import Session
from sys import exit
(EXPR_TYPE, INTEGER_TYPE, FLOAT_TYPE, STRING_TYPE) = (0, 1, 2, 3)
UNIVERSE = {"VANILLA": 5, "SCHEDULER": 7, "GRID": 9, "JAVA": 10, "PARALLEL": 11, "LOCAL": 12, "VM": 13}
JOB_STATUS = ("", "Idle", "Running", "Removed", "Completed", "Held", "")
ad = {"Cmd": {"TYPE": STRING_TYPE, "VALUE": "/bin/sleep"},
"Args": {"TYPE": STRING_TYPE, "VALUE": "120"},
"Requirements": {"TYPE": EXPR_TYPE, "VALUE": "TRUE"},
"JobUniverse": {"TYPE": INTEGER_TYPE, "VALUE": "%s" % (UNIVERSE["VANILLA"],)},
"Iwd": {"TYPE": STRING_TYPE, "VALUE": "/tmp"},
"Owner": {"TYPE": STRING_TYPE, "VALUE": "nobody"}}
session = Session(); session.addBroker("amqp://localhost:5672")
schedulers = session.getObjects(_class="scheduler", _package="mrg.grid")
result = schedulers[0].Submit(ad)
if result.status:
print "Error submitting job:", result.text
exit(1)
print "Submitted job:", result.Id
jobs = session.getObjects(_class="job", _package="mrg.grid")
job = reduce(lambda x,y: x or (y.CustomId == result.Id and y), jobs, False)
if not job:
print "Did not find job"
exit(2)
print "Job status:", JOB_STATUS[job.JobStatus]
print "Job properties:"
for prop in job.getProperties():
print " ",prop[0],"=",prop[1]
The program does more than just submit, it also looks up the job based on what has been published via QMF. The job that it does submit runs /bin/sleep 120 as the nobody user from /tmp on any, Requirements = TRUE, execution node.
The job’s ClassAd is presented as nested maps. The top level map holds attribute names mapped to values. Those values are themselves maps that specify the type of the actual value and a representation of the actual value. All representations of values are strings. The type specifies how the string should be handled, e.g. if it should be parsed into an int or float.
Two good sources of information about job ad attributes are the UW’s Condor Manual’s Appendix, and the output from condor_submit -dump job.ad when run against a job submission file.