Archive for the ‘Cumin’ Category

A protocol mismatch: Some unnecessary Starter->Shadow communication

May 1, 2011

During previous runs that push an out of the box Schedd to see what it can handle with respect to job rates, Shadows were repeatedly complaining about being delayed in reusing claims.

It turned out that a bug introduced in Sept 2010 was causing a protocol mismatch, which when removed, allows the Schedd to sustain about 95 job completions per second.

During the Condor 7.5 development series, the Schedd gained the ability to reuse existing Shadows to serially run jobs. This change has huge benefits reducing load on the Schedd and underlying OS. A topic for another time, Condor relies heavily on process separation in its architecture, and Shadows are effectively pooled workers – think workers in a thread pool.

Here are the relevant logs at the time of the delayed claim reuse.

ShadowLog –

04/21/11 15:12:49 (1.1) (23946): Reporting job exit reason 100 and attempting to fetch new job.
04/21/11 15:12:49 (1.1) (23946): Reading job ClassAd from STDIN
04/21/11 15:12:53 (1.1) (23946): Switching to new job 1.1
04/21/11 15:12:53 (?.?) (23946): Initializing a VANILLA shadow for job 1.1
04/21/11 15:12:53 (1.1) (23946): no UserLog found
04/21/11 15:12:53 (1.1) (23946): ClaimStartd is true, trying to claim startd 
04/21/11 15:12:53 (1.1) (23946): Requesting claim description
04/21/11 15:12:53 (1.1) (23946): in RemoteResource::initStartdInfo()
04/21/11 15:12:53 (1.1) (23946): Request was NOT accepted for claim description
04/21/11 15:12:53 (1.1) (23946): Completed REQUEST_CLAIM to startd description
04/21/11 15:12:53 (1.1) (23946): Entering DCStartd::activateClaim()
04/21/11 15:12:53 (1.1) (23946): DCStartd::activateClaim: successfully sent command, reply is: 2
04/21/11 15:12:53 (1.1) (23946): Request to run on none  was DELAYED (previous job still being vacated)
04/21/11 15:12:53 (1.1) (23946): activateClaim(): will try again in 1 seconds
04/21/11 15:12:54 (1.1) (23946): Entering DCStartd::activateClaim()
04/21/11 15:12:54 (1.1) (23946): DCStartd::activateClaim: successfully sent command, reply is: 1
04/21/11 15:12:54 (1.1) (23946): Request to run on none  was ACCEPTED

StartLog –

04/21/11 15:12:49 Changing activity: Idle -> Busy
04/21/11 15:12:49 Received TCP command 60008 (DC_CHILDALIVE) from unauthenticated@unmapped , access level DAEMON
04/21/11 15:12:49 Received job ClassAd update from starter.
04/21/11 15:12:49 Received job ClassAd update from starter.
04/21/11 15:12:49 Closing job ClassAd update socket from starter.
04/21/11 15:12:49 Called deactivate_claim_forcibly()
04/21/11 15:12:49 In Starter::kill() with pid 24278, sig 3 (SIGQUIT)
04/21/11 15:12:49 Send_Signal(): Doing kill(24278,3) [SIGQUIT]
04/21/11 15:12:49 in starter:killHard starting kill timer
04/21/11 15:12:52 9400356 kbytes available for ".../execute"
04/21/11 15:12:52 Publishing ClassAd for 'MIPS' to slot 1
04/21/11 15:12:52 Publishing ClassAd for 'KFLOPS' to slot 1
04/21/11 15:12:52 Trying to update collector 
04/21/11 15:12:52 Attempting to send update via TCP to collector eeyore.local 
04/21/11 15:12:52 Sent update to 1 collector(s)
04/21/11 15:12:53 Got REQUEST_CLAIM while in Claimed state, ignoring.
04/21/11 15:12:53 DaemonCore: No more children processes to reap.
04/21/11 15:12:53 Got activate claim while starter is still alive.
04/21/11 15:12:53 Telling shadow to try again later.
04/21/11 15:12:53 Starter pid 24278 exited with status 0
04/21/11 15:12:53 Canceled hardkill-starter timer (2535)
04/21/11 15:12:53 State change: starter exited
04/21/11 15:12:53 Changing activity: Busy -> Idle
04/21/11 15:12:54 Got activate_claim request from shadow ()

StarterLog –

04/21/11 15:12:49 Job 1.1 set to execute immediately
04/21/11 15:12:49 Starting a VANILLA universe job with ID: 1.1
04/21/11 15:12:49 IWD: /tmp
04/21/11 15:12:49 About to exec /bin/true 
04/21/11 15:12:49 Create_Process succeeded, pid=24279
04/21/11 15:12:49 Process exited, pid=24279, status=0
04/21/11 15:12:49 Got SIGQUIT.  Performing fast shutdown.
04/21/11 15:12:49 ShutdownFast all jobs.
04/21/11 15:12:53 condor_read() failed: recv() returned -1, errno = 104 Connection reset by peer, reading 5 bytes from .
04/21/11 15:12:53 IO: Failed to read packet header
04/21/11 15:12:53 **** condor_starter (condor_STARTER) pid 24278 EXITING WITH STATUS 0
04/21/11 15:12:54 Setting maximum accepts per cycle 4.

First, the ShadowLog shows the Shadow is being told, by the Startd, to delay claim reuse. The Shadow decides to sleep() for a second. Second, the StartLog shows the Startd rejecting the claim reuse because the Starter still exists. Third, the StarterLog shows the Starter still exists because it is waiting in condor_read().

Using D_FULLDEBUG and D_NETWORK, it turned out that the Starter was trying to talk to the Shadow, sending an update, when the Shadow was not expecting to process anything from the Starter. The Shadow was not responding, causing the Starter to wait and eventually timeout.

The issue was the update not being properly blocked from executing during the shutdown. A two line patch and re-running the submission-completion test, resulted in zero delays in claim reuse and a bump to about 95 jobs processed per second.

The dip in the rates is from SHADOW_WORKLIFE and a topic for the discussion on pooled Shadows.

Quick walk with Condor: Looking at Scheduler performance w/o notification

April 24, 2011

Recently I did a quick walk with the Schedd, looking at its submission and completion rates. Out of the box, submitting jobs with no special consideration for performance, the Schedd comfortably ran 55 jobs per second.

Without sending notifications, the Schedd can sustain a rate of 85 jobs per second.

I ran the test again, this time with notification=never and in two configurations: first, with 500,000 jobs submitted upfront; second, with submissions occurring during completions. The idea was to get an idea for performance when the Shadow is not burdened with sending email notifications of job completions, and to figure out how the Schedd performs with respect to servicing condor_submit at the same time it is running jobs.

First, submitting 500,000 jobs and then letting them drain showed a sustained rate of about 86 jobs per second.

Upfront submission of 500K jobs, drain at rate of 85 jobs per second

Second, building up the submission rate to about 100 jobs per second showed a sustained rate of about 83 jobs per second (81 shown in graph below).

Submission and drain rates of 85 jobs per second

The results are quite satisfying, and show the Schedd can sustain a reasonably high job execution rate at the same time it services submissions.

Quick walk with Condor: Looking at Scheduler performance

April 15, 2011

This was a simple walk to get a feel for what a single, out of the box, 7.6 condor_schedd could handle for a job load. There were about 1,500 dynamic slots and all jobs were 5 second sleeps.

It turns out that without errors in the system, a single Schedd can sustain at least a rate of 55 jobs per second.

Out of the box condor_schedd performance visualized with Cumin

Graphs courtesy Cumin and Ernie Allen‘s recent visualization of Erik Erlandson‘s new Schedd stats. This is a drill-down into a Scheduler.

Here’s what happened. I started submitting 25 jobs (queue 25) every 5 seconds. You can’t see this unfortunately, it is off the left side of the graphs. The submission, start and completion rates were all equal at 5 per second. Every five/ten/fifteen minutes, when I felt like it, I ramped that up a bit, by increasing the jobs per submit (queue 50 then 100 then 150 then 200) and the rate of submission (every 5 seconds then 2 seconds then 1). The scaling up matched my expectations. At 50 jobs every second, I saw 10 jobs submitted/started/completed per second. At 100 job, the rates were 20 per second. I eventually worked it up to rates about 50-55 per second.

Then we got to the 30 minute mark in the graphs. B shows that Shadows started failing. I let this run for about 10 minutes, the Schedd’s rates fluctuated down to between 45-50 jobs per second, and then kicked up the submission rate. The spike in submissions is quite visible to the right of A in the Rates graph.

At this point the Schedd was sustaining about 45 jobs per second and the Shadow exception rate was fairly sustained. I decided to kill off the submissions, also quite visible. The Schedd popped back up to 55 jobs per second and finished off its backlog.

A bit about the errors, simple investigation: condor_status -sched -long | grep ExitCode, oh a bunch of 108s; grep -B10 “STATUS 108” /var/log/condor/ShadowLog, oh a bunch of network issues and some evictions; pull out the hosts the Shadows were connecting to, oh mostly a couple nodes that were reported as being flakey; done.

Throughout this entire walk, the Mean start times turned out to be an interesting graph. It shows two things: 0) Mean time to start cumulative – the mean time between a job is queued to when it first starts running, over the lifetime of the Schedd; and, 1) Mean time to start – the same metric, but over an adjustable window, defaulting to 5 minutes. Until the exceptions started and when I blew the submissions out, around C, the mean queue time/wait time/time to start was consistently between 4 and 6 seconds. I did not look at the service rate on the back side of the job’s execution, e.g. how long it waited before termination was recognized, mail was sent, etc.

That’s about it. Though, remember this was done with about 1,500 slots. I didn’t vary the number of slots, say to see if the Schedd would have issues sustaining more than 1,500 concurrent jobs. I also didn’t do much in the way of calculations to compare the mean start time to the backlog (idle job count).

Also, if anyone tries this themselves, I would suggest notification=never in your submissions. It will prevent an email getting sent for each job, will save the Schedd and Shadow a good amount of work, and in my case would have resulted in a 389MB reduction in email.

%d bloggers like this: