Advanced scheduling: Execute in the future with job deferral

September 24, 2012

One advanced scheduling feature of Condor is the ability to set a time, in the future, when a job should be run. This is called a deferral time.

Using the deferral_time command, you simply specify a time, in seconds since EPOCH, when your job should run:

executable = /bin/date
log = deferral.log
output = deferral.out
error = deferral.err

deferral_time = 1357016400

queue

Use date(1) to generate the deferral_time.

$ date -d @1357016400
Tue Jan  1 00:00:00 EST 2013
$ date +%s -d "2013-01-01 00:00:00"
1357016400

After submitting the job and waiting until 1 Jan 2013, you can see the result by looking in deferral.log and deferral.out.

$ grep ^00 deferral.log
000 (001.000.000) 08/15 22:33:00 Job submitted from host: <127.0.0.1:56006>
001 (001.000.000) 01/01 00:00:00 Job executing on host: <127.0.0.2:57590>
006 (001.000.000) 01/01 00:00:00 Image size of job updated: 75
005 (001.000.000) 01/01 00:00:00 Job terminated.

$ cat deferral.out
Tue Jan  1 00:00:00 EST 2013

Of course there is no guarantee that a resource will be available at a precise time in the future. A job that does not run at its deferral_time will be put on Hold for manual intervention.
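If a job does end up held this way, the intervention is straightforward. As a sketch, assuming the held job is 1.0 and that you want to push its deferral out (DeferralTime is the job ad attribute behind deferral_time):

# See why the job was held (job id 1.0 is illustrative)
$ condor_q -hold 1.0

# Optionally move the deferral time out, then release the job
$ condor_qedit 1.0 DeferralTime 1357020000
$ condor_release 1.0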

To reduce the likelihood of missing the deferral_time and needing manual intervention, the deferral_prep_time and deferral_window commands are available. Respectively, they specify the amount of time before the deferral_time that the job can be matched with a resource and how long after the deferral_time execution is acceptable.

executable = /bin/date
log = deferral.log
output = deferral.out
error = deferral.err

deferral_time = 1357016400

# 1 day = 24 hour * 60 min * 60 sec = 86,400 seconds
# 1/2 day = 86,400 sec / 2 = 43,200 seconds
deferral_prep_time = 86400
deferral_window = 43200

queue

In the example above, the job may be matched to a resource, where it will keep the resource Claimed/Busy for up to a day (deferral_prep_time) in advance of its actual run. This will make it more likely that the job will run at precisely the deferral_time. It also means that for accounting purposes, you will be charged for using the resource, though the job has not yet run.

Additionally, if the job is not matched or otherwise does not start at precisely deferral_time, it has half a day (deferral_window) to run before it is put on hold for manual intervention.

That’s it.


MATLAB jobs and Condor

September 17, 2012

The question of how to run MATLAB jobs with Condor comes up from time to time and there is no central location for the knowledge.

If you want to use MATLAB with Condor,

Read http://sysadminhelp.cms.caltech.edu/AOK/HTC/, which provides an informative overview and a detailed example, including how to use mcc to create a standalone executable.

Or read http://www.liv.ac.uk/csd/escience/condor/matlab/

Or read http://www.lehigh.edu/computing/hpc/running/condor_matlab.html

Possibly read https://condor-wiki.cs.wisc.edu/index.cgi/wiki?p=HowToRunMatlab

Likely just enter “MATLAB condor” into your favorite search engine and you will find useful results.

The Owner state

September 11, 2012

What is the Owner state? Why are my slots in the Owner state? Why do jobs not start immediately after I restart Condor? How do I keep slots from going into the Owner state?

$ condor_status
Name               OpSys      Arch   State    Activity LoadAv Mem   ActvtyTime
eeyore.local       LINUX      X86_64 Owner    Idle     0.480  7783  0+00:00:04
                     Machines Owner Claimed Unclaimed Matched Preempting
        X86_64/LINUX        1     1       0         0       0          0
               Total        1     1       0         0       0          0

All these questions require understanding a bit of Condor policy. Specifically, the START policy on execute nodes.

In Condor, the decision to run a job on a resource, a.k.a. a slot, is made by both the job and the resource. The job specifies the requirements it has for a resource, e.g. Memory > 1024. And, the resource specifies the requirements it has for jobs, e.g. MemoryUsage < Memory. If both do not agree, the job will not be run on the resource.
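As a minimal sketch of where each side lives, with illustrative values: the job's half goes in its submit description, and the resource's half goes in the execute node's configuration.

# Submit description (job side): only match slots with more than 1024 MB of memory
requirements = Memory > 1024

# condor_config on the execute node (resource side): only run jobs whose
# memory usage stays within the slot's memory
START = MemoryUsage < Memory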

First, the Owner state is an indication that the resource is not available to run jobs. The resource is in use by its Owner. Historically, this has been someone at the resource’s console. Generally, think of Owner as meaning Unavailable.

Second, the resource specifies its requirements using the START configuration parameter.

$ condor_config_val -v START
START: ( (KeyboardIdle > 15 * 60) && ( ((LoadAvg - CondorLoadAvg) <= 0.3) || (State != "Unclaimed" && State != "Owner")) )
  Defined in '/etc/condor/condor_config', line 754.

When the START policy evaluates to False, i.e. starting jobs is not allowed, the resource will appear in the Owner state.

Note that the default START policy does not care about aspects of the jobs that might run. It cares entirely about aspects of the resource itself. It is designed to protect a user at the console (KeyboardIdle) and other processes already running on the machine (LoadAvg - CondorLoadAvg).

Third, jobs may not start immediately on restart because the KeyboardIdle timer has been reset to 0. This means waiting 15 minutes for the START policy, specifically KeyboardIdle > 15 * 60, to evaluate to True and the resource to become available.
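A quick way to check how much idle time has accrued, since KeyboardIdle is just an attribute in each slot's ad:

# Seconds of keyboard/console idle time, per slot
$ condor_status -format "%s " Name -format "%d\n" KeyboardIdle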

Finally, on dedicated resources, which are very common, the KeyboardIdle component of the START policy can be removed. In fact, it is quite common to simply set START = TRUE.
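A minimal always-run policy for such a node, in condor_config, looks something like the following; the SUSPEND/PREEMPT/KILL settings are the usual companions and are an assumption about what you want on dedicated hardware.

# condor_config on a dedicated execute node: always willing to run jobs
START = TRUE
SUSPEND = FALSE
PREEMPT = FALSE
KILL = FALSE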

Maintaining tight accounting group quota usage with preemption

September 4, 2012

Many metrics in a distributed system should be viewed over time. However, there is often a desire for instantaneous views. Take for example usage data.

Usage data in Condor is commonly viewed through accounting groups. Accounting groups can be given quotas that specify what percent of a pool each can use.
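As a sketch of what that looks like, with group names and fractions that are purely illustrative:

# condor_config read by the negotiator: two groups splitting the pool 75/25
GROUP_NAMES = group_physics, group_chem
GROUP_QUOTA_DYNAMIC_group_physics = 0.75
GROUP_QUOTA_DYNAMIC_group_chem = 0.25
# Let a group run beyond its quota when resources would otherwise sit idle
GROUP_ACCEPT_SURPLUS = TRUE

# Submit description: attribute a job's usage to a group
accounting_group = group_physics
accounting_group_user = matt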

Instead of viewing accounting group usage summed over a week or a day or even hours, it is common to want any snapshot of usage to match configured quotas. At the same time, it is common to want accounting groups to exceed their quota if resources are available. These two desires are at odds.

Erik Erlandson has written about how to use preemption to maintain tight accounting group quota usage.
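The heart of the approach is the negotiator's PREEMPTION_REQUIREMENTS expression. Very roughly, and hedged because the available attributes depend on your Condor version, it looks something like:

# condor_config (negotiator): preempt when the running job's group is over its
# quota and the candidate submitter's group is under its quota
PREEMPTION_REQUIREMENTS = (RemoteGroupResourcesInUse > RemoteGroupQuota) && (SubmitterGroupResourcesInUse < SubmitterGroupQuota)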

Instantaneous views should not be relied on, but with preemption it is possible to reduce the window you sum over.

Now, if preemption is not an option, the desire to see usage snapshots matching configured quotas must be abandoned.

Pool utilization and schedd statistic graphs

June 22, 2012

Assuming you are gathering pool utilization and schedd statistics, you might be able to see something like this,

Queue depth and job rates

This graph is for a single schedd and may show the impact of queue depth, a.k.a. the number of jobs waiting in the queue, on job submission, start, and completion rates. The submission rate is on top. The start and completion rates overlap, which is good. I say may show because other factors have not been ruled out, such as other processes on the system that started to run it out of memory. Note that the base rate is a function of job duration and the number of available slots. Despite having hundreds of slots, the max rate is quite low because the jobs were only minutes long.

Over this nine day period, as the queue grew to 1.8 million jobs, the utilization remained above 95%,

Pool utilization

Wallaby: Skeleton Group

June 19, 2012

Read about Wallaby’s Skeleton Group feature. Working similarly to /etc/skel for accounts on a single system, it provides base configuration to nodes as they join a pool. It is especially useful for pools with dynamic and opportunistic resources.

Schedd stats with OpenTSDB

June 14, 2012

Building on Pool utilization with OpenTSDB, the condor_schedd also advertises a plethora of useful statistics that can be harvested with condor_status.
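For instance, a few of the headline counters can be pulled straight out of the schedd ad; treat the attribute names as a sketch, since they vary between versions:

# Dump a few of the schedd's job counters
$ condor_status -schedd -long | egrep '^(Name|TotalRunningJobs|TotalIdleJobs|TotalHeldJobs) '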

Make the metrics,

$ tsdb mkmetric condor.schedd.jobs.idle condor.schedd.jobs.running condor.schedd.jobs.held condor.schedd.jobs.mean_runtime condor.schedd.jobs.mean_waittime condor.schedd.jobs.historical.mean_runtime condor.schedd.jobs.historical.mean_waittime condor.schedd.jobs.submission_rate condor.schedd.jobs.start_rate condor.schedd.jobs.completion_rate
...

Obtain schedd_stats_for_opentsdb.sh and run,

$ while true; do ./schedd_stats_for_opentsdb.sh; sleep 15; done | nc -w 30 tsdb-host 4242 &

View the results at,

http://tsdb-host:4242/#start=1h-ago&m=sum:condor.schedd.jobs.idle&o=&m=sum:condor.schedd.jobs.running&o=&m=sum:condor.schedd.jobs.held&o=&m=sum:10m-avg:condor.schedd.jobs.submission_rate&o=axis%20x1y2&m=sum:10m-avg:condor.schedd.jobs.start_rate&o=axis%20x1y2&m=sum:10m-avg:condor.schedd.jobs.completion_rate&o=axis%20x1y2&ylabel=jobs&y2label=rates&yrange=%5B0:%5D&y2range=%5B0:%5D&key=out%20center%20top%20horiz%20box

Pool utilization with OpenTSDB

June 12, 2012

Merging the pool utilization script with OpenTSDB.

Once you have followed the stellar OpenTSDB Getting Started guide, make the metrics with,

$ tsdb mkmetric condor.pool.slots.unavail condor.pool.slots.avail condor.pool.slots.total condor.pool.slots.used condor.pool.slots.used_of_avail condor.pool.slots.used_of_total condor.pool.cpus.unavail condor.pool.cpus.avail condor.pool.cpus.total condor.pool.cpus.used condor.pool.cpus.used_of_avail condor.pool.cpus.used_of_total condor.pool.memory.unavail condor.pool.memory.avail condor.pool.memory.total condor.pool.memory.used condor.pool.memory.used_of_avail condor.pool.memory.used_of_total
metrics condor.pool.slots.unavail: [0, 0, 1]
metrics condor.pool.slots.avail: [0, 0, 2]
metrics condor.pool.slots.total: [0, 0, 3]
...
metrics condor.pool.memory.used_of_total: [0, 0, 17]

Obtain utilization_for_opentsdb.sh before running,

$ while true; do ./utilization_for_opentsdb.sh; sleep 15; done | nc -w 30 tsdb-host 4242 &

View the results at,

http://tsdb-host:4242/#start=1h-ago&m=sum:condor.pool.cpus.total&o=&m=sum:condor.pool.cpus.used&o=&m=sum:condor.pool.cpus.used_of_avail&o=axis%20x1y2&ylabel=cpus&y2label=%2525+utilization&yrange=%5B0:%5D&y2range=%5B0:1%5D&key=out%20center%20top%20horiz%20box

The number of statistics about operating Condor pools has been growing over the years. All are easily retrieved via condor_status for feeding into OpenTSDB.
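The core of such a feed is just condor_status output turned into OpenTSDB put lines. A rough sketch of the idea, not the actual utilization_for_opentsdb.sh, with an illustrative metric and pool tag:

#!/bin/sh
# Emit an OpenTSDB "put" line for the total number of slots in the pool
NOW=$(date +%s)
TOTAL_SLOTS=$(condor_status -total | awk '$1 == "Total" {print $2}')
echo "put condor.pool.slots.total $NOW $TOTAL_SLOTS pool=mypool"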

Hadoop JobTracker and NameNode configuration error: FileNotFoundException jobToken

June 8, 2012

FYI for anyone running into java.io.FileNotFoundException: File file:/tmp/hadoop-USER/mapred/system/JOBID/jobToken does not exist. when running Hadoop MapReduce jobs.

Your JobTracker needs access to an HDFS NameNode to share information with TaskTrackers. If the JobTracker is misconfigured and cannot connect to a NameNode, it will write to local disk instead. Or at least it does in hadoop-1.0.1. The result is that TaskTrackers fail to find the jobToken in HDFS, or on their local disks, and throw a FileNotFoundException.

I ran into this because I did not have a fs.default.name property in my JobTracker’s mapred-site.xml. I was putting it in hdfs-site.xml instead.
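For reference, the missing property looks like this, with a placeholder NameNode address; conventionally fs.default.name lives in core-site.xml, the point being that the JobTracker must read it from somewhere.

<!-- fs.default.name, added to the JobTracker's configuration;
     the NameNode host and port are placeholders -->
<property>
  <name>fs.default.name</name>
  <value>hdfs://namenode.example.com:9000</value>
</property>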

I have not found messages in the JobTracker (or TaskTracker) logs that explicitly indicate this misconfiguration.

Oops.

Service as a Job: Hadoop MapReduce TaskTracker

June 6, 2012

Continuing to build on other examples of services run as jobs, such as HDFS NameNode and HDFS DataNode, here is an example for the Hadoop MapReduce framework’s TaskTracker.

mapred_tasktracker.sh

#!/bin/sh -x

# condor_chirp in /usr/libexec/condor
export PATH=$PATH:/usr/libexec/condor

HADOOP_TARBALL=$1
JOBTRACKER_ENDPOINT=$2

# Note: bin/hadoop uses JAVA_HOME to find the runtime and tools.jar,
#       except tools.jar does not seem necessary therefore /usr works
#       (there's no /usr/lib/tools.jar, but there is /usr/bin/java)
export JAVA_HOME=/usr

# When we get SIGTERM, which Condor will send when
# we are kicked, kill off the tasktracker and gather logs
function term {
   ./bin/hadoop-daemon.sh stop tasktracker
# Useful if we can transfer data back
#   tar czf logs.tgz logs
#   cp logs.tgz $_CONDOR_SCRATCH_DIR
}

# Unpack
tar xzfv $HADOOP_TARBALL

# Move into tarball, inefficiently
cd $(tar tzf $HADOOP_TARBALL | head -n1)

# Configure,
#  . http.address must be set to port 0 (ephemeral)
cat > conf/mapred-site.xml <<EOF
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>$JOBTRACKER_ENDPOINT</value>
  </property>
  <property>
    <name>mapred.task.tracker.http.address</name>
    <value>0.0.0.0:0</value>
  </property>
</configuration>
EOF

# Try to shutdown cleanly
trap term SIGTERM

export HADOOP_CONF_DIR=$PWD/conf
export HADOOP_PID_DIR=$PWD
export HADOOP_LOG_DIR=$_CONDOR_SCRATCH_DIR/logs
./bin/hadoop-daemon.sh start tasktracker

# Wait for pid file
PID_FILE=$(echo hadoop-*-tasktracker.pid)
while [ ! -s $PID_FILE ]; do sleep 1; done
PID=$(cat $PID_FILE)

# Report back some static data about the tasktracker
# e.g. condor_chirp set_job_attr SomeAttr SomeData
# None at the moment.

# While the tasktracker is running, collect and report back stats
while kill -0 $PID; do
   # Collect stats and chirp them back into the job ad
   # None at the moment.
   sleep 30
done

The job description below uses the same templating technique as the DataNode example. The description uses a variable JobTrackerAddress, provided on the command-line as an argument to condor_submit.

mapred_tasktracker.job

# Submit w/ condor_submit -a JobTrackerAddress=<address>
# e.g. <address> = $HOSTNAME:9001

cmd = mapred_tasktracker.sh
args = hadoop-1.0.1-bin.tar.gz $(JobTrackerAddress)

transfer_input_files = hadoop-1.0.1-bin.tar.gz

# RFE: Ability to get output files even when job is removed
#transfer_output_files = logs.tgz
#transfer_output_remaps = "logs.tgz=logs.$(cluster).tgz"
output = tasktracker.$(cluster).out
error = tasktracker.$(cluster).err

log = tasktracker.$(cluster).log

kill_sig = SIGTERM

# Want chirp functionality
+WantIOProxy = TRUE

should_transfer_files = yes
when_to_transfer_output = on_exit

requirements = HasJava =?= TRUE

queue

From here you can run condor_submit -a JobTrackerAddress=$HOSTNAME:9001 mapred_tasktracker.job a few times and build up a MapReduce cluster, assuming you are already running a JobTracker. You can also hit the JobTracker’s web interface to watch TaskTrackers check in.

Assuming you are also running an HDFS instance, you can run some jobs against your new cluster.

For fun, I ran time hadoop jar hadoop-test-1.0.1.jar mrbench -verbose -maps 100 -inputLines 100 three times: against 4, 8, and 12 TaskTrackers. The resulting run times were,

# nodes   runtime
      4     01:26
      8     00:50
     12     00:38
