Archive for May, 2011

A protocol mismatch: Some unnecessary Starter->Shadow communication

May 1, 2011

During previous runs that pushed an out-of-the-box Schedd to see what it could handle with respect to job rates, Shadows repeatedly complained about being delayed in reusing claims.

It turned out that a bug introduced in September 2010 was causing a protocol mismatch; with that bug removed, the Schedd can sustain about 95 job completions per second.

During the Condor 7.5 development series, the Schedd gained the ability to reuse existing Shadows to serially run jobs. This change has huge benefits, reducing load on the Schedd and the underlying OS. A topic for another time: Condor relies heavily on process separation in its architecture, and Shadows are effectively pooled workers – think workers in a thread pool.
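
As a loose analogy only (nothing below is Condor source), reusing a Shadow looks like handing a queue of jobs to a single pooled worker that runs them serially, instead of paying for a fresh process per job:

// Analogy for Shadow reuse: one worker drains a queue of jobs serially,
// so the pool owner (the Schedd, in this analogy) avoids per-job process
// creation costs.
#include <functional>
#include <iostream>
#include <queue>

struct Worker {
    void run(std::queue<std::function<void()>>& jobs) {
        while (!jobs.empty()) {
            jobs.front()();   // run the current job
            jobs.pop();       // then reuse this same worker for the next
        }
    }
};

int main() {
    std::queue<std::function<void()>> jobs;
    for (int i = 1; i <= 3; ++i)
        jobs.push([i] { std::cout << "job " << i << " done\n"; });

    Worker w;   // one worker, reused serially -- the Shadow analogy
    w.run(jobs);
}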

Here are the relevant logs at the time of the delayed claim reuse.

ShadowLog –

04/21/11 15:12:49 (1.1) (23946): Reporting job exit reason 100 and attempting to fetch new job.
04/21/11 15:12:49 (1.1) (23946): Reading job ClassAd from STDIN
04/21/11 15:12:53 (1.1) (23946): Switching to new job 1.1
04/21/11 15:12:53 (?.?) (23946): Initializing a VANILLA shadow for job 1.1
04/21/11 15:12:53 (1.1) (23946): no UserLog found
04/21/11 15:12:53 (1.1) (23946): ClaimStartd is true, trying to claim startd 
04/21/11 15:12:53 (1.1) (23946): Requesting claim description
04/21/11 15:12:53 (1.1) (23946): in RemoteResource::initStartdInfo()
04/21/11 15:12:53 (1.1) (23946): Request was NOT accepted for claim description
04/21/11 15:12:53 (1.1) (23946): Completed REQUEST_CLAIM to startd description
04/21/11 15:12:53 (1.1) (23946): Entering DCStartd::activateClaim()
04/21/11 15:12:53 (1.1) (23946): DCStartd::activateClaim: successfully sent command, reply is: 2
04/21/11 15:12:53 (1.1) (23946): Request to run on none  was DELAYED (previous job still being vacated)
04/21/11 15:12:53 (1.1) (23946): activateClaim(): will try again in 1 seconds
04/21/11 15:12:54 (1.1) (23946): Entering DCStartd::activateClaim()
04/21/11 15:12:54 (1.1) (23946): DCStartd::activateClaim: successfully sent command, reply is: 1
04/21/11 15:12:54 (1.1) (23946): Request to run on none  was ACCEPTED

StartLog –

04/21/11 15:12:49 Changing activity: Idle -> Busy
04/21/11 15:12:49 Received TCP command 60008 (DC_CHILDALIVE) from unauthenticated@unmapped , access level DAEMON
04/21/11 15:12:49 Received job ClassAd update from starter.
04/21/11 15:12:49 Received job ClassAd update from starter.
04/21/11 15:12:49 Closing job ClassAd update socket from starter.
04/21/11 15:12:49 Called deactivate_claim_forcibly()
04/21/11 15:12:49 In Starter::kill() with pid 24278, sig 3 (SIGQUIT)
04/21/11 15:12:49 Send_Signal(): Doing kill(24278,3) [SIGQUIT]
04/21/11 15:12:49 in starter:killHard starting kill timer
04/21/11 15:12:52 9400356 kbytes available for ".../execute"
04/21/11 15:12:52 Publishing ClassAd for 'MIPS' to slot 1
04/21/11 15:12:52 Publishing ClassAd for 'KFLOPS' to slot 1
04/21/11 15:12:52 Trying to update collector 
04/21/11 15:12:52 Attempting to send update via TCP to collector eeyore.local 
04/21/11 15:12:52 Sent update to 1 collector(s)
04/21/11 15:12:53 Got REQUEST_CLAIM while in Claimed state, ignoring.
04/21/11 15:12:53 DaemonCore: No more children processes to reap.
04/21/11 15:12:53 Got activate claim while starter is still alive.
04/21/11 15:12:53 Telling shadow to try again later.
04/21/11 15:12:53 Starter pid 24278 exited with status 0
04/21/11 15:12:53 Canceled hardkill-starter timer (2535)
04/21/11 15:12:53 State change: starter exited
04/21/11 15:12:53 Changing activity: Busy -> Idle
04/21/11 15:12:54 Got activate_claim request from shadow ()

StarterLog –

04/21/11 15:12:49 Job 1.1 set to execute immediately
04/21/11 15:12:49 Starting a VANILLA universe job with ID: 1.1
04/21/11 15:12:49 IWD: /tmp
04/21/11 15:12:49 About to exec /bin/true 
04/21/11 15:12:49 Create_Process succeeded, pid=24279
04/21/11 15:12:49 Process exited, pid=24279, status=0
04/21/11 15:12:49 Got SIGQUIT.  Performing fast shutdown.
04/21/11 15:12:49 ShutdownFast all jobs.
04/21/11 15:12:53 condor_read() failed: recv() returned -1, errno = 104 Connection reset by peer, reading 5 bytes from .
04/21/11 15:12:53 IO: Failed to read packet header
04/21/11 15:12:53 **** condor_starter (condor_STARTER) pid 24278 EXITING WITH STATUS 0
04/21/11 15:12:54 Setting maximum accepts per cycle 4.

First, the ShadowLog shows the Shadow being told by the Startd to delay claim reuse; the Shadow decides to sleep() for a second. Second, the StartLog shows the Startd rejecting the claim reuse because the Starter still exists. Third, the StarterLog shows the Starter still exists because it is waiting in condor_read().
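
The Shadow side of that exchange is a simple retry loop. As a minimal sketch, using hypothetical names rather than Condor's actual DCStartd interface, it behaves like this:

// Sketch of the retry visible in the ShadowLog: a "delayed" reply from the
// Startd makes the Shadow sleep one second and try the activation again.
#include <chrono>
#include <iostream>
#include <thread>

enum class Reply { Accepted, Delayed };

// Stand-in for DCStartd::activateClaim(); the first attempt is delayed.
Reply activate_claim() {
    static int attempts = 0;
    return ++attempts == 1 ? Reply::Delayed : Reply::Accepted;
}

int main() {
    while (activate_claim() == Reply::Delayed) {
        std::cout << "request DELAYED (previous job still being vacated), "
                     "will try again in 1 second\n";
        std::this_thread::sleep_for(std::chrono::seconds(1));
    }
    std::cout << "request ACCEPTED\n";
}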

Debugging with D_FULLDEBUG and D_NETWORK turned up the cause: the Starter was trying to talk to the Shadow, sending an update, at a point when the Shadow was not expecting to process anything from the Starter. The Shadow never responded, leaving the Starter to wait and eventually time out.
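
In other words, the Starter was stuck in a blocking read for a reply that was never coming. A rough sketch of that failure mode, assuming a hypothetical timeout and nothing from Condor's actual I/O layer:

// Sketch of the Starter-side hang: after sending its final update, the
// Starter waits for an acknowledgment the Shadow never sends, and only
// gives up when the wait times out (or the peer resets the connection,
// as the condor_read() error in the StarterLog shows).
#include <chrono>
#include <condition_variable>
#include <iostream>
#include <mutex>

int main() {
    std::mutex m;
    std::condition_variable cv;
    bool reply_arrived = false;   // the Shadow never sets this

    std::unique_lock<std::mutex> lock(m);
    bool got_reply = cv.wait_for(lock, std::chrono::seconds(3),
                                 [&] { return reply_arrived; });
    if (!got_reply)
        std::cout << "no reply from shadow, giving up after timeout\n";
}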

The issue was that the update was not properly blocked from executing during the shutdown. A two-line patch and a re-run of the submission-completion test resulted in zero delays in claim reuse and a bump to about 95 jobs processed per second.
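
I won't reproduce the actual patch here, but the shape of the fix is easy to sketch with made-up names: the update path checks a shutdown flag and simply returns once a fast shutdown has begun, so the Starter never blocks waiting on the Shadow.

// Not the actual Condor patch: a sketch of the guard, with hypothetical names.
#include <iostream>

struct Starter {
    bool shutting_down = false;   // set when SIGQUIT / ShutdownFast arrives

    void send_update_to_shadow() { std::cout << "update sent to shadow\n"; }

    void periodic_update() {
        if (shutting_down)        // the guard the patch adds
            return;               // the Shadow is no longer listening
        send_update_to_shadow();
    }
};

int main() {
    Starter s;
    s.periodic_update();          // normal operation: update goes out
    s.shutting_down = true;       // fast shutdown begins
    s.periodic_update();          // update skipped, no hang
}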

The dip in the rates comes from SHADOW_WORKLIFE, and is a topic for the discussion of pooled Shadows.
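
For reference, SHADOW_WORKLIFE is the condor_config knob that limits how long a Shadow keeps fetching new jobs on a claim before it exits; shown here only as an illustration:

# condor_config (illustrative): how long a Shadow serves jobs before exiting.
# 3600 seconds is the usual default; 0 makes a Shadow exit after a single job.
SHADOW_WORKLIFE = 3600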

