Archive for April, 2011

Quick walk with Condor: Looking at Scheduler performance w/o notification

April 24, 2011

Recently I did a quick walk with the Schedd, looking at its submission and completion rates. Out of the box, submitting jobs with no special consideration for performance, the Schedd comfortably ran 55 jobs per second.

Without sending notifications, the Schedd can sustain a rate of 85 jobs per second.

I ran the test again, this time with notification=never and in two configurations: first, with 500,000 jobs submitted upfront; second, with submissions occurring alongside completions. The goal was to see how the Schedd performs when the Shadow is not burdened with sending email notifications of job completions, and how well the Schedd services condor_submit at the same time it is running jobs.
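The submit files themselves are not shown here; a minimal sketch along these lines would reproduce the setup (the sleep length, queue count and file names are illustrative, not the exact values used):

$ cat > drain.job << EOF
cmd = /bin/sleep
args = 5
log = drain.log
notification = never
queue 5000
EOF

$ condor_submit drain.job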

First, submitting 500,000 jobs and then letting them drain showed a sustained rate of about 86 jobs per second.

Upfront submission of 500K jobs, drain at rate of 85 jobs per second

Second, building up the submission rate to about 100 jobs per second showed a sustained rate of about 83 jobs per second (81 shown in graph below).

Submission and drain rates of 85 jobs per second

The results are quite satisfying, and show the Schedd can sustain a reasonably high job execution rate at the same time it services submissions.


Quick walk with Condor: Looking at Scheduler performance

April 15, 2011

This was a simple walk to get a feel for what a single, out of the box, 7.6 condor_schedd could handle for a job load. There were about 1,500 dynamic slots and all jobs were 5 second sleeps.

It turns out that without errors in the system, a single Schedd can sustain at least a rate of 55 jobs per second.

Out of the box condor_schedd performance visualized with Cumin

Graphs courtesy of Cumin and Ernie Allen’s recent visualization of Erik Erlandson’s new Schedd stats. This is a drill-down into a Scheduler.

Here’s what happened. I started by submitting 25 jobs (queue 25) every 5 seconds. You can’t see this, unfortunately; it is off the left side of the graphs. The submission, start and completion rates were all equal, at 5 per second. Every five/ten/fifteen minutes, when I felt like it, I ramped that up a bit by increasing the jobs per submit (queue 50, then 100, then 150, then 200) and the rate of submission (every 5 seconds, then 2 seconds, then 1). The scaling up matched my expectations. At 50 jobs every 5 seconds, I saw 10 jobs submitted/started/completed per second. At 100 jobs, the rates were 20 per second. I eventually worked it up to rates of about 50-55 per second.
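The submit harness was simple; something along these lines produces the same pattern (the file names and the wrapper loop are illustrative, not the exact script used):

$ cat > sleep.job << EOF
cmd = /bin/sleep
args = 5
log = sleep.log
queue 25
EOF

# one cluster of 25 jobs every 5 seconds; raise the queue count and
# shrink the sleep to ramp the rate up
$ while true; do condor_submit sleep.job; sleep 5; done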

Then we got to the 30 minute mark in the graphs. B shows that Shadows started failing. I let this run for about 10 minutes, during which the Schedd’s rates fluctuated down to between 45 and 50 jobs per second, and then I kicked up the submission rate. The spike in submissions is quite visible to the right of A in the Rates graph.

At this point the Schedd was sustaining about 45 jobs per second and the Shadow exception rate held fairly steady. I decided to kill off the submissions, which is also quite visible. The Schedd popped back up to 55 jobs per second and finished off its backlog.

A bit about the errors, from a simple investigation: condor_status -sched -long | grep ExitCode, oh, a bunch of 108s; grep -B10 "STATUS 108" /var/log/condor/ShadowLog, oh, a bunch of network issues and some evictions; pull out the hosts the Shadows were connecting to, oh, mostly a couple of nodes that were reported as being flaky; done.
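Laid out as commands, that investigation was roughly (paths as above, output omitted):

# which exit codes are the Shadows reporting?
$ condor_status -sched -long | grep ExitCode

# find the matching failures and their context in the ShadowLog
$ grep -B10 "STATUS 108" /var/log/condor/ShadowLog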

Throughout this entire walk, the Mean start times turned out to be an interesting graph. It shows two things: 0) Mean time to start cumulative – the mean time from when a job is queued to when it first starts running, over the lifetime of the Schedd; and, 1) Mean time to start – the same metric, but over an adjustable window, defaulting to 5 minutes. Until the exceptions started and I blew the submissions out, around C, the mean queue time/wait time/time to start was consistently between 4 and 6 seconds. I did not look at the service rate on the back side of the job’s execution, e.g. how long it waited before termination was recognized, mail was sent, etc.

That’s about it. Though, remember this was done with about 1,500 slots. I didn’t vary the number of slots, say to see if the Schedd would have issues sustaining more than 1,500 concurrent jobs. I also didn’t do much in the way of calculations to compare the mean start time to the backlog (idle job count).

Also, if anyone tries this themselves, I would suggest notification=never in your submissions. It will prevent an email from being sent for each job, will save the Schedd and Shadow a good amount of work, and in my case would have resulted in a 389MB reduction in email.
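If editing the submit file is not convenient, the same thing can be appended at submit time with -a (the submit file name here is illustrative):

$ condor_submit -a notification=never sleep.job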

Ordering Jobs: Job Priority and Submitters

April 13, 2011

Ordering jobs. The order in which jobs are lined up for execution is scoped within a Schedd and then within a Submitter. A Submitter is the job’s Owner or AccountingGroup. This means that no order is defined between jobs of different submitters on a single Schedd, and jobs of even the same submitter on different Schedds are not ordered with respect to one another. Adjustments to the order do not impact already running jobs.

The default order operates similarly to FCFS. Jobs that are submitted earlier than others get a chance to run before later jobs. Specifically, jobs with a lower QDate run before those with a higher one. If QDates are equal, jobs with a lower ClusterId run before those with a higher one. If ClusterIds are equal, jobs with a lower ProcId run before those with a higher one.

The order is similar to FCFS rather than exactly FCFS: a job that is not runnable, because it is on Hold or its Requirements are not satisfied, will be skipped until it becomes runnable.

A job’s priority, its JobPrio attribute, provides a way to alter the default ordering. It is considered first, i.e. before QDate. If JobPrios are equal, ordering continues based on QDate, etc. JobPrio works in the opposite direction from the other ordering attributes: jobs with a larger value run first, e.g. JobPrio = 1 runs before 0, which in turn runs before -1.

The JobPrio attribute of a job is controlled by the priority command in a submit file, and defaults to a value of 0. It can also be adjusted using the condor_prio command-line utility. If you are using a UserLog – the log command in a submit file, which you should be doing – you can see priority adjustments as they happen. Keep in mind that JobPrio only helps order jobs up to the point that they start running. Once jobs are running, adjusting them to a lower priority will not cause them to stop so that higher priority jobs can run.

Let’s get concrete. Observing execution order via a UserLog, using a single slot (NUM_CPUS=1), stopping the Negotiator, submitting jobs, then starting the Negotiator, let’s consider: 0) a single user submitting with priority, 1) a single user submitting with priority, then adjusting with condor_prio, 2) multiple users submitting with priority, 3) a single user submitting with multiple AccountingGroups, adjusting priority with condor_prio, and 4) multiple users submitting with a single AccountingGroup, adjusting priority with condor_prio.

The Negotiator stopping and starting is to make sure all jobs get considered at the same time. Without it, you may not get the same results.

NOTE: Endpoints, dates and hours anonymized to protect the guilty.


Zero: a single user submitting with priority

$ cat > zero.job << EOF
cmd = /bin/sleep
args = 1
log = zero.log
priority = -1
queue
priority = 0
queue
priority = 1
queue
EOF

$ condor_submit zero.job
Submitting job(s)...
3 job(s) submitted to cluster 1.

$ grep -e ^000 -e ^001 zero.log
000 (001.000.000) 03/17 01:40:55 Job submitted from host: 
000 (001.001.000) 03/17 01:40:55 Job submitted from host: 
000 (001.002.000) 03/17 01:40:55 Job submitted from host: 
001 (001.002.000) 03/17 01:41:08 Job executing on host: 
001 (001.001.000) 03/17 01:41:10 Job executing on host: 
001 (001.000.000) 03/17 01:41:13 Job executing on host: 

Success, execution of 1.2 then 1.1 then 1.0.


One: a single user submitting with priority, then adjusting with condor_prio

$ condor_off -negotiator

$ condor_submit -a log=one.log zero.job
Submitting job(s)...
3 job(s) submitted to cluster 1.

$ condor_prio 1.1 -p 3

$ condor_on -negotiator

$ grep -e ^000 -e ^001 -e ^033 one.log
000 (001.000.000) 03/17 01:46:40 Job submitted from host: 
000 (001.001.000) 03/17 01:46:40 Job submitted from host: 
000 (001.002.000) 03/17 01:46:40 Job submitted from host: 
033 (001.001.000) 03/17 01:47:02 Changing job attribute JobPrio from 0 to 3
001 (001.001.000) 03/17 01:47:50 Job executing on host: 
001 (001.002.000) 03/17 01:47:53 Job executing on host: 
001 (001.000.000) 03/17 01:47:55 Job executing on host: 

Success, the priority change for 1.1 is visible in the log, and execution follows 1.1 then 1.2 then 1.0.


Two: multiple users submitting with priority

$ cat > two.job << EOF
cmd = /bin/sleep
args = 1
log = two.log
priority = -1
queue
priority = 0
queue
priority = 1
queue
EOF

$ condor_off -negotiator

userA $ condor_submit two.job
Submitting job(s)...
3 job(s) submitted to cluster 1.

userB $ condor_submit two.job
Submitting job(s)...
3 job(s) submitted to cluster 2.

$ condor_on -negotiator

userA $ grep -e ^000 -e ^001 two.log 
000 (001.000.000) 03/17 01:34:20 Job submitted from host: 
000 (001.001.000) 03/17 01:34:20 Job submitted from host: 
000 (001.002.000) 03/17 01:34:20 Job submitted from host: 
001 (001.002.000) 03/17 01:35:15 Job executing on host: 
001 (001.001.000) 03/17 01:35:17 Job executing on host: 
001 (001.000.000) 03/17 01:35:19 Job executing on host: 

userB $ grep -e ^000 -e ^001 two.log 
000 (002.000.000) 03/17 01:34:21 Job submitted from host: 
000 (002.001.000) 03/17 01:34:21 Job submitted from host: 
000 (002.002.000) 03/17 01:34:21 Job submitted from host: 
001 (002.002.000) 03/17 01:35:35 Job executing on host: 
001 (002.001.000) 03/17 01:35:38 Job executing on host: 
001 (002.000.000) 03/17 01:35:40 Job executing on host: 

Success, jobs ran in priority order and the priority is scoped to the user – you did not see 1.2 -> 2.2 -> 1.1 -> 2.1 -> 2.0 -> 1.0. Why though did 1.x run before 2.x? Because of a different type of priority: user priority.

$ condor_userprio
Last Priority Update:  3/17 01:36
                                    Effective
User Name                           Priority 
------------------------------      ---------
userA@localhost                          0.50
userB@localhost                          0.50
------------------------------      ---------
Number of users shown: 2                           

That means they can run in any order decided by the Negotiator. How user priority works is a topic for another time, but here is a hint:

$ condor_userprio -setprio userA@localhost 100
The priority of userA@localhost was set to 100.000000

$ condor_userprio
Last Priority Update:  3/17 01:36
                                    Effective
User Name                           Priority 
------------------------------      ---------
userB@localhost                          0.50
userA@localhost                        100.00
------------------------------      ---------
Number of users shown: 2

$ condor_off -negotiator

userA $ condor_submit two.job
Submitting job(s)...
3 job(s) submitted to cluster 3.

userB $ condor_submit two.job
Submitting job(s)...
3 job(s) submitted to cluster 4.

$ condor_on -negotiator

userA $ grep -e ^000 -e ^001 two.log 
000 (003.000.000) 03/17 01:36:49 Job submitted from host: 
000 (003.001.000) 03/17 01:36:49 Job submitted from host: 
000 (003.002.000) 03/17 01:36:49 Job submitted from host: 
001 (003.002.000) 03/17 01:37:22 Job executing on host: 
001 (003.001.000) 03/17 01:37:24 Job executing on host: 
001 (003.000.000) 03/17 01:37:26 Job executing on host: 

userB $ grep -e ^000 -e ^001 two.log 
000 (004.000.000) 03/17 01:36:51 Job submitted from host: 
000 (004.001.000) 03/17 01:36:51 Job submitted from host: 
000 (004.002.000) 03/17 01:36:51 Job submitted from host: 
001 (004.002.000) 03/17 01:37:02 Job executing on host: 
001 (004.001.000) 03/17 01:37:04 Job executing on host: 
001 (004.000.000) 03/17 01:37:06 Job executing on host: 

Success, userB’s jobs ran in priority order before userA’s jobs.


Three: a single user submitting with multiple AccountingGroups adjusting priority with condor_prio

$ cat > three-a.job << EOF
cmd = /bin/sleep
args = 1
log = three.log
+AccountingGroup = "three.a"
priority = -3
queue
priority = 0
queue
priority = 3
queue
EOF

$ cat > three-b.job << EOF
cmd = /bin/sleep
args = 1
log = three.log
+AccountingGroup = "three.b"
priority = -3
queue
priority = 0
queue
priority = 3
queue
EOF

$ condor_off -negotiator

$ condor_submit three-a.job
Submitting job(s)...
3 job(s) submitted to cluster 1.

$ condor_submit three-b.job
Submitting job(s)...
3 job(s) submitted to cluster 2.

$ condor_prio 2.1 -p 1

$ condor_on -negotiator

$ grep -e ^000 -e ^001 -e ^033 three.log
000 (001.000.000) 03/17 01:56:15 Job submitted from host: 
000 (001.001.000) 03/17 01:56:15 Job submitted from host: 
000 (001.002.000) 03/17 01:56:15 Job submitted from host: 
000 (002.000.000) 03/17 01:56:17 Job submitted from host: 
000 (002.001.000) 03/17 01:56:17 Job submitted from host: 
000 (002.002.000) 03/17 01:56:17 Job submitted from host: 
033 (002.001.000) 03/17 01:56:57 Changing job attribute JobPrio from 0 to 1
001 (001.002.000) 03/17 01:57:08 Job executing on host: 
001 (001.001.000) 03/17 01:57:11 Job executing on host: 
001 (001.000.000) 03/17 01:57:13 Job executing on host: 
001 (002.002.000) 03/17 01:57:29 Job executing on host: 
001 (002.001.000) 03/17 01:57:31 Job executing on host: 
001 (002.000.000) 03/17 01:57:33 Job executing on host: 

Success, jobs ran in priority order and the priority is scoped to the AccountingGroup even though they are from the same user. Why, though, did 1.x run before 2.x? Because of group priority, which works the same way as user priority. Groups three.a and three.b have the same global priority, stored with the Negotiator.

$ condor_userprio 
Last Priority Update:  3/17  02:00
                                    Effective
User Name                           Priority 
------------------------------      ---------
three.a@localhost                     0.50
three.b@localhost                     0.50
------------------------------      ---------
Number of users shown: 2

That means they can run in any order decided by the Negotiator.

$ condor_userprio -setprio three.a@localhost 100
The priority of three.a@localhost was set to 100.000000

$ condor_userprio
Last Priority Update:  3/17  02:01
                                    Effective
User Name                           Priority
------------------------------      ---------
three.b@localhost                     0.50
three.a@localhost                   100.00
------------------------------      ---------
Number of users shown: 2

$ condor_off -negotiator

$ condor_submit three-a.job
Submitting job(s)...
3 job(s) submitted to cluster 3.

$ condor_submit three-b.job
Submitting job(s)...
3 job(s) submitted to cluster 4.

$ condor_on -negotiator

$ grep -e ^000 -e ^001 three.log
000 (003.000.000) 03/17 02:01:50 Job submitted from host: 
000 (003.001.000) 03/17 02:01:50 Job submitted from host: 
000 (003.002.000) 03/17 02:01:50 Job submitted from host: 
000 (004.000.000) 03/17 02:01:51 Job submitted from host: 
000 (004.001.000) 03/17 02:01:51 Job submitted from host: 
000 (004.002.000) 03/17 02:01:51 Job submitted from host: 
001 (004.002.000) 03/17 02:01:58 Job executing on host: 
001 (004.001.000) 03/17 02:02:00 Job executing on host: 
001 (004.000.000) 03/17 02:02:03 Job executing on host: 
001 (003.002.000) 03/17 02:02:18 Job executing on host: 
001 (003.001.000) 03/17 02:02:20 Job executing on host: 
001 (003.000.000) 03/17 02:02:23 Job executing on host: 

Success, three.b’s jobs ran in priority order before three.a’s jobs.


Four: multiple users submitting with a single AccountingGroup adjusting priority with condor_prio

$ cat > four.job << EOF
cmd = /bin/sleep
args = 1
log = four.log
+AccountingGroup = "four.user"
priority = -1
queue
priority = 0
queue
priority = 1
queue
EOF

$ condor_off -negotiator

userA $ condor_submit four.job
Submitting job(s)...
3 job(s) submitted to cluster 1.

userB $ condor_submit four.job
Submitting job(s)...
3 job(s) submitted to cluster 2.

$ condor_q
-- Submitter: localhost :  : localhost
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
   1.0   userA           3/17  02:05   0+00:00:00 I  -1  0.0  sleep 1
   1.1   userA           3/17  02:05   0+00:00:00 I  0   0.0  sleep 1
   1.2   userA           3/17  02:05   0+00:00:00 I  1   0.0  sleep 1
   2.0   userB           3/17  02:05   0+00:00:00 I  -1  0.0  sleep 1
   2.1   userB           3/17  02:05   0+00:00:00 I  0   0.0  sleep 1
   2.2   userB           3/17  02:05   0+00:00:00 I  1   0.0  sleep 1
6 jobs; 6 idle, 0 running, 0 held

userB $ condor_prio 2.1 -p 2
userB $ condor_prio 2.2 -p 3
userA $ condor_prio 1.0 -p -2

$ condor_q
-- Submitter: localhost :  : localhost
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
   1.0   userA           3/17  02:05   0+00:00:00 I  -2  0.0  sleep 1           
   1.1   userA           3/17  02:05   0+00:00:00 I  0   0.0  sleep 1           
   1.2   userA           3/17  02:05   0+00:00:00 I  1   0.0  sleep 1           
   2.0   userB           3/17  02:05   0+00:00:00 I  -1  0.0  sleep 1           
   2.1   userB           3/17  02:05   0+00:00:00 I  2   0.0  sleep 1           
   2.2   userB           3/17  02:05   0+00:00:00 I  3   0.0  sleep 1           
6 jobs; 6 idle, 0 running, 0 held

$ condor_on -negotiator

userA $ grep -e ^000 -e ^001 -e ^033 four.log
000 (001.000.000) 03/17 02:05:42 Job submitted from host: 
000 (001.001.000) 03/17 02:05:42 Job submitted from host: 
000 (001.002.000) 03/17 02:05:42 Job submitted from host: 
033 (001.000.000) 03/17 02:06:14 Changing job attribute JobPrio from -1 to -2
001 (001.002.000) 03/17 02:06:24 Job executing on host: 
001 (001.001.000) 03/17 02:06:26 Job executing on host: 
001 (001.000.000) 03/17 02:06:29 Job executing on host: 

userB $ grep -e ^000 -e ^001 -e ^033 four.log
000 (002.000.000) 03/17 02:05:44 Job submitted from host: 
000 (002.001.000) 03/17 02:05:44 Job submitted from host: 
000 (002.002.000) 03/17 02:05:44 Job submitted from host: 
033 (002.001.000) 03/17 02:05:50 Changing job attribute JobPrio from 0 to 2
033 (002.002.000) 03/17 02:05:57 Changing job attribute JobPrio from 1 to 3
001 (002.002.000) 03/17 02:06:21 Job executing on host: 
001 (002.001.000) 03/17 02:06:23 Job executing on host: 
001 (002.000.000) 03/17 02:06:28 Job executing on host: 

Success, execution went 2.2 (06:21) -> 2.1 (06:23) -> 1.2 (06:24) -> 1.1 (06:26) -> 2.0 (06:28) -> 1.0 (06:29).

Know your environment: Importance of infrastructure – DNS

April 6, 2011

Consider running nscd on your core nodes.

All programs run in some environment. A program’s environment will have a significant impact on how the program executes. All programs should be very aware of how they interact with their environment.

Applications rely on their host operating system for much of their executing environment. An application might simply rely on the operating system for memory allocation, maybe even without alloc/free. Or an application may use a complex mixture of accessing disk, memory, network, semaphores, etc. All of these resources have trade-offs.

Distributed systems execute in an even more complex mixture. They will often hit all of the operating-system-provided resources, but also services present on the network. In doing so, the set of people who need to understand the execution environment expands beyond just application developers to administrators.

During some scale testing of Condor, an unusual execution pattern appeared in condor_submit and condor_q: periodic 5 to 15 minute stalls, at about a 20% frequency, noticeable as gaps in log files and debug output. Try LogTimeGap.awk (listed at the end of this post).

Investigation with strace revealed the time gaps were during communication with DNS servers. Specifically, resolving an alias used in COLLECTOR_HOST. Try strace -e connect,sendto,recvfrom,poll condor_q and time _CONDOR_TOOL_DEBUG=D_ALL condor_q -debug 2>&1 | awk -f LogTimeGap.awk 1. Simply replacing the alias with the CNAME it referred to eliminated the gaps immediately. See bug 682442. A straightforward resolution, but it may not go far enough.

The issue was slow name resolution, and a single case was worked around. What about other cases, though? What if non-alias resolutions start taking a long time? Two more complete solutions appear: 0) perform a transform step on the configuration and pre-resolve all hostnames, or 1) add another service to the environment, one designed to mitigate these issues, called nscd.

The first option is fairly straightforward, and quite reasonable for many deployments, especially those that take advantage of Wallaby to simplify configuration management. Though it will take some care: it will have to be Condor aware, and may leave a gap if host-based authentication is being used. Host-based authentication is used out of the box and in many deployments, so that may be a high barrier to entry.
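As a rough sketch of the first option – the host names and config file location here are made up for illustration – resolve the alias once and put the canonical name directly into the configuration:

# look up what the alias currently points at
$ host collector.pool.example.com

# pin the configuration to the canonical name instead of the alias
$ echo 'COLLECTOR_HOST = collector01.pool.example.com' >> /etc/condor/condor_config.local
$ condor_reconfig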

The second option appears simpler. It amounts to service nscd start and possibly chkconfig --level 23 nscd on. It will handle host-based authentication configurations more simply, and does not require the transformation step. However, you will have added yet another service to Condor’s environment, and, for that matter, a service you know will have a significant impact on execution. What happens when someone redeploying forgets to enable nscd, nscd starts to misbehave, or nscd simply is not available? Are you sure nscd inter-operates with your round-robin domain aliases? nscd is going to interpose itself for other applications on the same system – were they written with proper layering in mind? A simple solution, but with possibly non-obvious implications.
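On a Red Hat style system the second option really is just a couple of commands; nscd -g prints the cache statistics and is a quick way to confirm the cache is actually being hit:

$ service nscd start
$ chkconfig --level 23 nscd on
$ nscd -g | head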


Some interesting data during a particularly bad time period –

Running time condor_reschedule 1,000 times with and without nscd.
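The measurement loop is not shown in the post; something along these lines, run with and without nscd, would produce the numbers below (bash, with the log file name matching the greps that follow):

$ for i in $(seq 1000); do { time condor_reschedule; } 2>> no-nscd.log; done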

Without,

# Distribution of delays at second resolution, samples x time,
# 0 is the target. 0 is not hit, some executions up to 25 seconds.
$ grep ^real no-nscd.log | sed 's/.*m\([^.]*\).*/\1/' | sort -n | uniq -c
    327 0
    403 5
    197 10
     61 15
     10 20
      2 25

# Breakdown of sub-second executions, looking for consistency.
# Execution is not consistent, or not as consistent as with nscd.
$ grep ^real no-nscd.log | grep 0m0 | sort -n | uniq -c
     10 real	0m0.024s
     19 real	0m0.025s
     52 real	0m0.026s
     71 real	0m0.027s
     55 real	0m0.028s
     41 real	0m0.029s
     27 real	0m0.030s
      7 real	0m0.031s
      6 real	0m0.032s
      7 real	0m0.033s
      6 real	0m0.034s
      4 real	0m0.035s
      6 real	0m0.037s
      1 real	0m0.038s
      4 real	0m0.039s
      2 real	0m0.040s
      1 real	0m0.041s
      1 real	0m0.042s
      1 real	0m0.043s
      1 real	0m0.045s
      1 real	0m0.053s
      1 real	0m0.057s
      1 real	0m0.071s
      1 real	0m0.126s
      1 real	0m0.230s

With,

# Distribution of delays at second resolution, samples x time,
# 0 is the target.
$ grep ^real nscd.log | sed 's/.*m\([^.]*\).*/\1/' | sort -n | uniq -c
   1000 0

# Breakdown of sub-second executions, looking for consistency.
$ grep ^real nscd.log | grep 0m0 | sort -n | uniq -c
     69 real	0m0.012s
    888 real	0m0.013s
     36 real	0m0.014s
      6 real	0m0.015s
      1 real	0m0.016s

LogTimeGap.awk

#!/bin/awk -f

# Parse a timestamp of the form MM/DD/YY HH:MM:SS at the start of a line.
# Only differences between timestamps matter, so a fixed year is substituted.
function parse_time(string) {
   return mktime(gensub(/([^/]*)\/([^ ]*)\/([^ ]*) ([^:]*):([^:]*):([^ ]*) .*/,
                        "1984 \\1 \\2 \\4 \\5 \\6", "g", string))
}

BEGIN {
   previous_time = 0; previous_line = ""; current_time = 0
   # The first argument is the maximum allowable gap in seconds; setting
   # ARGC to 1 keeps awk from treating it as an input file, so input is
   # read from stdin (or a pipe).
   ARGC = 1
   MAX_GAP = ARGV[1]
   if (MAX_GAP == "") MAX_GAP = 30
   print "Maximum allowable gap:", MAX_GAP, "seconds"
}

{
   current_time = parse_time($0)
   gap = current_time - previous_time
   if (previous_time > 0 && gap > MAX_GAP) {
	   print "Found gap of " gap " seconds:\n", previous_line "\n", $0
   }
   previous_line = $0
   previous_time = current_time
}

END { }
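Usage is the same as mentioned above – pass the maximum allowable gap in seconds as the first argument and pipe in timestamped output, for example:

$ _CONDOR_TOOL_DEBUG=D_ALL condor_q -debug 2>&1 | awk -f LogTimeGap.awk 1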
