Posts Tagged ‘Schedd’

Quick walk with Condor: Looking at Scheduler performance w/o notification

April 24, 2011

Recently I did a quick walk with the Schedd, looking at its submission and completion rates. Out of the box, submitting jobs with no special consideration for performance, the Schedd comfortably ran 55 jobs per second.

Without sending notifications, the Schedd can sustain a rate of 85 jobs per second.

I ran the test again, this time with notification=never and in two configurations: first, with 500,000 jobs submitted upfront; second, with submissions occurring alongside completions. The goal was to see how the Schedd performs when the Shadow is not burdened with sending email notifications of job completions, and how well the Schedd services condor_submit while it is also running jobs.
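For reference, the submit description for the upfront case presumably looked something like the sketch below; the executable, sleep duration, log file, and the single 500,000-proc queue statement are assumptions rather than the exact file used.

# sleep.sub -- hypothetical submit description for the upfront test
universe     = vanilla
executable   = /bin/sleep
arguments    = 5
notification = never
log          = sleep.log
queue 500000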

First, submitting 500,000 jobs and then letting them drain showed a sustained rate of about 86 jobs per second.

Upfront submission of 500K jobs, drain at rate of 85 jobs per second

Second, building up the submission rate to about 100 jobs per second showed a sustained rate of about 83 jobs per second (81 shown in graph below).

Submission and drain rates of 85 jobs per second

The results are quite satisfying, and show the Schedd can sustain a reasonably high job execution rate at the same time it services submissions.

Quick walk with Condor: Looking at Scheduler performance

April 15, 2011

This was a simple walk to get a feel for the job load a single, out-of-the-box 7.6 condor_schedd could handle. There were about 1,500 dynamic slots, and all jobs were 5 second sleeps.

It turns out that without errors in the system, a single Schedd can sustain at least a rate of 55 jobs per second.

Out of the box condor_schedd performance visualized with Cumin

Graphs courtesy Cumin and Ernie Allen’s recent visualization of Erik Erlandson’s new Schedd stats. This is a drill-down into a Scheduler.

Here’s what happened. I started by submitting 25 jobs (queue 25) every 5 seconds. Unfortunately you can’t see this; it is off the left side of the graphs. The submission, start and completion rates were all equal, at 5 per second. Every five/ten/fifteen minutes, when I felt like it, I ramped that up a bit by increasing the jobs per submit (queue 50, then 100, then 150, then 200) and the rate of submission (every 5 seconds, then 2 seconds, then 1). The scaling up matched my expectations. At 50 jobs every 5 seconds, I saw 10 jobs submitted/started/completed per second. At 100 jobs, the rates were 20 per second. I eventually worked it up to rates of about 50-55 per second.
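A minimal sketch of that kind of ramp-up harness, assuming a sleep.sub submit description whose jobs are 5 second sleeps and that ends in a queue 25 statement (the file name and the loop itself are hypothetical, not necessarily the exact mechanism used):

# Submit a batch of sleep jobs every 5 seconds; to ramp up, raise the
# count in sleep.sub's queue statement and/or shrink the sleep below.
while true; do
    condor_submit sleep.sub
    sleep 5
done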

Then we got to the 30 minute mark in the graphs. B shows that Shadows started failing. I let this run for about 10 minutes, during which the Schedd’s rates fluctuated down to between 45 and 50 jobs per second, and then I kicked up the submission rate. The spike in submissions is quite visible to the right of A in the Rates graph.

At this point the Schedd was sustaining about 45 jobs per second and the Shadow exception rate held fairly steady. I decided to kill off the submissions, which is also quite visible. The Schedd popped back up to 55 jobs per second and finished off its backlog.

A bit about the errors, from a simple investigation: condor_status -schedd -long | grep ExitCode, oh, a bunch of 108s; grep -B10 "STATUS 108" /var/log/condor/ShadowLog, oh, a bunch of network issues and some evictions; pull out the hosts the Shadows were connecting to, oh, mostly a couple of nodes that were reported as being flaky; done.

Throughout this entire walk, the Mean start times graph turned out to be interesting. It shows two things: 0) Mean time to start cumulative – the mean time from when a job is queued to when it first starts running, over the lifetime of the Schedd; and, 1) Mean time to start – the same metric, but over an adjustable window, defaulting to 5 minutes. Until the exceptions started and I blew out the submissions, around C, the mean queue time/wait time/time to start was consistently between 4 and 6 seconds. I did not look at the service rate on the back side of the job’s execution, e.g. how long it waited before termination was recognized, mail was sent, etc.

That’s about it. Though, remember this was done with about 1,500 slots. I didn’t vary the number of slots, say to see if the Schedd would have issues sustaining more than 1,500 concurrent jobs. I also didn’t do much in the way of calculations to compare the mean start time to the backlog (idle job count).

Also, if anyone tries this themselves, I would suggest notification=never in your submissions. It will prevent an email from being sent for each job, will save the Schedd and Shadow a good amount of work, and, in my case, would have resulted in a 389MB reduction in email.
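It is a single line in the submit description file:

notification = never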

Migrating workflows between Schedds with DAGMan

March 15, 2010

In a distributed system, migration of resources can happen at numerous levels. For instance, in the context of Condor, migration could mean moving running job processes between execution machines, jobs between Schedds, Schedds between machines, or Negotiators between machines. Here the focus is on workflows between Schedds. A workflow is a directed acyclic graph (DAG) of jobs managed by DAGMan.

DAGMan at its simplest is a process, condor_dagman, that is submitted and run as a job. It typically runs under the Schedd in what’s known as the Scheduler Universe. Different universes in Condor provide different contracts with jobs. The Scheduler Universe lets a job run local to the Schedd and provides certain signaling guarantees in response to condor_hold/release/rm. DAGMan is written to have all its state persisted in either job execution log files (UserLogs) or its input file, a DAG. It is highly tolerant of faults in the execution of jobs it is managing, and in its own execution.

Migration of a workflow between Schedds amounts to moving the condor_dagman job that runs the workflow from one Schedd to another. The Schedd does not support direct migration of jobs between Schedds. However, since DAGMan keeps no state it cannot reconstruct, it can be logically migrated through removal (condor_rm) and re-submission (condor_submit_dag).
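For concreteness, migration.dag in the example below can be taken to be a handful of independent sleep nodes. A minimal sketch, with node names matching the condor_q output and hypothetical per-node submit files:

# migration.dag -- a sketch; the actual DAG presumably had more nodes
JOB M0  M0.sub
JOB M1  M1.sub
JOB M00 M00.sub
JOB M11 M11.sub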

For instance,

$ _CONDOR_SCHEDD_NAME=ScheddA@ condor_submit_dag migration.dag
Submitting job(s).
-----------------------------------------------------------------------
File for submitting this DAG to Condor           : migration.dag.condor.sub
Log of DAGMan debugging messages                 : migration.dag.dagman.out
Log of Condor library output                     : migration.dag.lib.out
Log of Condor library error messages             : migration.dag.lib.err
Log of the life of condor_dagman itself          : migration.dag.dagman.log

Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 1.
-----------------------------------------------------------------------

$ condor_q -dag -global
-- Schedd: ScheddA@ : 
 ID      OWNER/NODENAME   SUBMITTED     RUN_TIME ST PRI SIZE CMD               
   1.0   matt            3/15 10:39   0+00:00:04 R  0   1.7  condor_dagman -f -
  11.0    |-M0           3/15 10:39   0+00:00:00 I  0   0.0  sleep 19
  21.0    |-M1           3/15 10:39   0+00:00:00 I  0   0.0  sleep 80
3 jobs; 2 idle, 1 running, 0 held

$ condor_rm -name ScheddA@ 1.0   
Job 1.0 marked for removal

$ _CONDOR_SCHEDD_NAME=ScheddB@ condor_submit_dag migration.dag
Running rescue DAG 1
-----------------------------------------------------------------------
File for submitting this DAG to Condor           : migration.dag.condor.sub
Log of DAGMan debugging messages                 : migration.dag.dagman.out
Log of Condor library output                     : migration.dag.lib.out
Log of Condor library error messages             : migration.dag.lib.err
Log of the life of condor_dagman itself          : migration.dag.dagman.log

Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 2.
-----------------------------------------------------------------------

$ condor_q -dag -global
-- Schedd: ScheddB@ : 
 ID      OWNER/NODENAME   SUBMITTED     RUN_TIME ST PRI SIZE CMD               
   2.0   matt            3/15 10:44   0+00:00:04 R  0   1.7  condor_dagman -f -
  12.0    |-M00          3/15 10:45   0+00:00:03 R  0   0.0  sleep 19
  22.0    |-M11          3/15 10:45   0+00:00:02 R  0   0.0  sleep 84
3 jobs; 0 idle, 3 running, 0 held

There are two important things going on for this to work. First, _CONDOR_SCHEDD_NAME, a way of setting the configuration parameter SCHEDD_NAME from the shell environment, specifies where condor_submit_dag will submit the workflow and, because it is recorded in the condor_dagman job’s environment, where all the jobs that condor_dagman submits will go. This is important because DAGMan only tracks job ids, not job id plus Schedd name.

Second, note the job ids: 1.0, 11.0, 21.0 and 2.0, 12.0, 22.0. As just mentioned, DAGMan only keeps track of job ids, which means the Schedds cannot have overlapping job id spaces. To achieve this, use SCHEDD_CLUSTER_INITIAL_VALUE and SCHEDD_CLUSTER_INCREMENT_VALUE: give each Schedd a unique initial cluster value, and set the cluster increment value to one more than the number of Schedds.
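For example, with the two Schedds above, configuration along these lines keeps the id spaces disjoint; the increment of 10 is simply an illustration consistent with the 1, 11, 21 and 2, 12, 22 ids seen in the output:

# In ScheddA's configuration:
SCHEDD_CLUSTER_INITIAL_VALUE   = 1
SCHEDD_CLUSTER_INCREMENT_VALUE = 10

# In ScheddB's configuration:
SCHEDD_CLUSTER_INITIAL_VALUE   = 2
SCHEDD_CLUSTER_INCREMENT_VALUE = 10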

Additionally, condor_hold on the DAGMan job will prevent it from submitting new nodes while allowing existing node jobs to complete, which is useful for draining off a submission during migration.
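Putting it together, a drain-then-migrate sequence might look like the following sketch, using the job id and Schedd names from the example above:

$ condor_hold -name ScheddA@ 1.0      # DAGMan stops submitting new nodes
# ... wait for the node jobs already in the queue to complete ...
$ condor_rm -name ScheddA@ 1.0        # remove the held DAGMan job
$ _CONDOR_SCHEDD_NAME=ScheddB@ condor_submit_dag migration.dag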

Timeouts from condor_rm and condor_submit

December 16, 2009

The condor_schedd is an event driven process, like all other Condor daemons. It spends its time waiting in select(2) for events to process. Events include: condor_q queries, spawning and reaping condor_shadow processes, accepting condor_submit submissions, negotiating with the Negotiator, and removing jobs during condor_rm. The responsiveness of the Schedd to user interaction, e.g. condor_q, condor_rm, condor_submit, and to process interaction, e.g. messages with condor_shadow, condor_startd or condor_negotiator, is affected by how long it takes to process an event and how many events are waiting to be processed.

For instance, if a thousand condor_shadow processes start up at the same time there may be a thousand keep-alive messages for the Schedd to process after a single call to select. Once select returns, no new events will be considered until the Schedd calls select again. A condor_rm request would have to wait. Likewise, if any one event takes a long time to process, such as a negotiation cycle, it can also keep the Schedd from getting back to select and accepting new events.

Basically, to function well, the Schedd needs to get back to select as fast as possible.

From a user perspective, when the Schedd does not get back to select quickly, a condor_rm or condor_submit attempt may appear to fail, e.g.

$ time condor_rm -a

Could not remove all jobs.

real	0m20.069s
user	0m0.020s
sys	0m0.020s

As of the Condor 7.4 series, this rarely happens because of internal events that the Schedd is processing. The Schedd uses structures that allow such events to be interleaved with calls to select. However, some events still take long periods of time, e.g. the removal of 300,000 jobs above. One such event is a negotiation cycle initiated by the Negotiator. If a condor_rm, condor_q, condor_submit, etc happens during a negotiation, there is a good chance it may timeout.

Though a simple re-try of the tool will often succeed, this timeout may be annoying to users of the tools, be they people or processes. An alternative to a re-try is to extend the timeout used by the tool. The default timeout is 20 seconds, which is very often long enough, but may not be in large pools.

To extend the timeout for condor_submit, put SUBMIT_TIMEOUT_MULTIPLIER=3 in the configuration file read by condor_submit. To extend the timeout for condor_q, condor_rm, etc, put TOOL_TIMEOUT_MULTIPLIER=3 in the configuration file read by the tool. These changes take the default timeout, 20 seconds, and multiply it by 3, giving the Schedd 60 seconds to respond. For instance, with 100Ks of jobs in the queue:

$ _CONDOR_TOOL_TIMEOUT_MULTIPLIER=3 time condor_rm -a
All jobs marked for removal.
0.01user 0.02system 0:53.99elapsed 0%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+4374minor)pagefaults 0swaps
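The environment override above is handy for a one-off; to make the change persistent, the equivalent settings can go in the configuration file the tools read, e.g.:

# e.g. in condor_config.local on the submit machine:
SUBMIT_TIMEOUT_MULTIPLIER = 3
TOOL_TIMEOUT_MULTIPLIER   = 3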
