Archive for the ‘Networking’ Category

Detecting clock skew with Condor

June 8, 2011

Anyone who works with distributed systems knows the value of maintaining clock synchronization. Within reason, of course. If you do not work with such systems, imagine debugging something by reading a trace log. The messages will be in an order you can reason about: chronological. Now imagine if the order were jumbled, or, worse, you did not know it was jumbled. In a distributed system, you often have to correlate logs between machines. If their clocks are skewed, it is difficult to reconstruct a timeline.

This is not a new trick, but it is one I just found useful again and thought worth sharing:

$ condor_status -master -constraint '((MyCurrentTime-LastHeardFrom) > 60) || ((MyCurrentTime-LastHeardFrom) < -60)' \
   -format "%s\t" Name \
   -format "%d\n" '(MyCurrentTime - LastHeardFrom)'
eeyore.local	75

Turns out that system's clock was about 75 seconds in the future, and it did not have ntpd installed.
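If you turn up such a host, a quick check and remedy might look like this sketch. It assumes a Red Hat-style system and a reachable time server; pool.ntp.org is a stand-in for your local time source:

# Query the offset without touching the clock
$ ntpdate -q pool.ntp.org
# Install and enable ntpd so the clock stays synchronized
$ yum install -y ntp
$ service ntpd start && chkconfig ntpd on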

See also GT1671.


A protocol mismatch: Some unnecessary Starter->Shadow communication

May 1, 2011

During previous runs pushing an out-of-the-box Schedd to see what job rates it can handle, Shadows repeatedly complained about being delayed in reusing claims.

It turned out that a bug introduced in September 2010 was causing a protocol mismatch; with it removed, the Schedd can sustain about 95 job completions per second.

During the Condor 7.5 development series, the Schedd gained the ability to reuse existing Shadows to serially run jobs. This change has huge benefits, reducing load on the Schedd and the underlying OS. A topic for another time: Condor relies heavily on process separation in its architecture, and Shadows are effectively pooled workers – think workers in a thread pool.
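As an aside, one rough way to gauge a completion rate like that is to bucket CompletionDate values from the history file into one-second bins; a sketch, assuming completed jobs are present in the history:

$ condor_history -format "%d\n" CompletionDate | sort -n | uniq -c | sort -n | tail

The tail of the output shows the busiest seconds, i.e. the peak completions per second.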

Here are the relevant logs at the time of the delayed claim reuse.

ShadowLog –

04/21/11 15:12:49 (1.1) (23946): Reporting job exit reason 100 and attempting to fetch new job.
04/21/11 15:12:49 (1.1) (23946): Reading job ClassAd from STDIN
04/21/11 15:12:53 (1.1) (23946): Switching to new job 1.1
04/21/11 15:12:53 (?.?) (23946): Initializing a VANILLA shadow for job 1.1
04/21/11 15:12:53 (1.1) (23946): no UserLog found
04/21/11 15:12:53 (1.1) (23946): ClaimStartd is true, trying to claim startd 
04/21/11 15:12:53 (1.1) (23946): Requesting claim description
04/21/11 15:12:53 (1.1) (23946): in RemoteResource::initStartdInfo()
04/21/11 15:12:53 (1.1) (23946): Request was NOT accepted for claim description
04/21/11 15:12:53 (1.1) (23946): Completed REQUEST_CLAIM to startd description
04/21/11 15:12:53 (1.1) (23946): Entering DCStartd::activateClaim()
04/21/11 15:12:53 (1.1) (23946): DCStartd::activateClaim: successfully sent command, reply is: 2
04/21/11 15:12:53 (1.1) (23946): Request to run on none  was DELAYED (previous job still being vacated)
04/21/11 15:12:53 (1.1) (23946): activateClaim(): will try again in 1 seconds
04/21/11 15:12:54 (1.1) (23946): Entering DCStartd::activateClaim()
04/21/11 15:12:54 (1.1) (23946): DCStartd::activateClaim: successfully sent command, reply is: 1
04/21/11 15:12:54 (1.1) (23946): Request to run on none  was ACCEPTED

StartLog –

04/21/11 15:12:49 Changing activity: Idle -> Busy
04/21/11 15:12:49 Received TCP command 60008 (DC_CHILDALIVE) from unauthenticated@unmapped , access level DAEMON
04/21/11 15:12:49 Received job ClassAd update from starter.
04/21/11 15:12:49 Received job ClassAd update from starter.
04/21/11 15:12:49 Closing job ClassAd update socket from starter.
04/21/11 15:12:49 Called deactivate_claim_forcibly()
04/21/11 15:12:49 In Starter::kill() with pid 24278, sig 3 (SIGQUIT)
04/21/11 15:12:49 Send_Signal(): Doing kill(24278,3) [SIGQUIT]
04/21/11 15:12:49 in starter:killHard starting kill timer
04/21/11 15:12:52 9400356 kbytes available for ".../execute"
04/21/11 15:12:52 Publishing ClassAd for 'MIPS' to slot 1
04/21/11 15:12:52 Publishing ClassAd for 'KFLOPS' to slot 1
04/21/11 15:12:52 Trying to update collector 
04/21/11 15:12:52 Attempting to send update via TCP to collector eeyore.local 
04/21/11 15:12:52 Sent update to 1 collector(s)
04/21/11 15:12:53 Got REQUEST_CLAIM while in Claimed state, ignoring.
04/21/11 15:12:53 DaemonCore: No more children processes to reap.
04/21/11 15:12:53 Got activate claim while starter is still alive.
04/21/11 15:12:53 Telling shadow to try again later.
04/21/11 15:12:53 Starter pid 24278 exited with status 0
04/21/11 15:12:53 Canceled hardkill-starter timer (2535)
04/21/11 15:12:53 State change: starter exited
04/21/11 15:12:53 Changing activity: Busy -> Idle
04/21/11 15:12:54 Got activate_claim request from shadow ()

StarterLog –

04/21/11 15:12:49 Job 1.1 set to execute immediately
04/21/11 15:12:49 Starting a VANILLA universe job with ID: 1.1
04/21/11 15:12:49 IWD: /tmp
04/21/11 15:12:49 About to exec /bin/true 
04/21/11 15:12:49 Create_Process succeeded, pid=24279
04/21/11 15:12:49 Process exited, pid=24279, status=0
04/21/11 15:12:49 Got SIGQUIT.  Performing fast shutdown.
04/21/11 15:12:49 ShutdownFast all jobs.
04/21/11 15:12:53 condor_read() failed: recv() returned -1, errno = 104 Connection reset by peer, reading 5 bytes from .
04/21/11 15:12:53 IO: Failed to read packet header
04/21/11 15:12:53 **** condor_starter (condor_STARTER) pid 24278 EXITING WITH STATUS 0
04/21/11 15:12:54 Setting maximum accepts per cycle 4.

First, the ShadowLog shows the Shadow is being told, by the Startd, to delay claim reuse. The Shadow decides to sleep() for a second. Second, the StartLog shows the Startd rejecting the claim reuse because the Starter still exists. Third, the StarterLog shows the Starter still exists because it is waiting in condor_read().

Using D_FULLDEBUG and D_NETWORK, it turned out that the Starter was trying to talk to the Shadow, sending an update, when the Shadow was not expecting to process anything from the Starter. The Shadow was not responding, causing the Starter to wait and eventually time out.
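For reference, a minimal sketch of turning those levels on via the per-daemon debug knobs (ALL_DEBUG, used later in this post, would also work):

SHADOW_DEBUG = D_FULLDEBUG D_NETWORK
STARTER_DEBUG = D_FULLDEBUG D_NETWORK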

The issue was that the update was not properly blocked during shutdown. A two-line patch and a re-run of the submission-completion test resulted in zero delays in claim reuse and a bump to about 95 jobs processed per second.

The dip in the rates comes from SHADOW_WORKLIFE, a topic for the coming discussion of pooled Shadows.

Know your environment: Importance of infrastructure – DNS

April 6, 2011

Consider running nscd on your core nodes.

All programs run in some environment. A program’s environment will have a significant impact on how the program executes. All programs should be very aware of how they interact with their environment.

Applications rely on their host operating system for much of their execution environment. An application might rely on the operating system for little beyond memory allocation, possibly without even explicit alloc/free. Or an application may use a complex mixture of disk, memory, network, semaphores, etc. All of these resources have trade-offs.

Distributed systems execute in an even more complex mixture. They will often hit all of the operating-system-provided resources, but also network-present services. In doing so, the set of people who need to understand the execution environment expands beyond application developers to administrators.

During some scale testing of Condor, an unusual execution pattern appeared in condor_submit and condor_q: periodic stalls of 5 to 15 minutes, at about a 20% frequency, noticeable as gaps in log files and debug output. Try LogTimeGap.awk, listed at the end of this post.

Investigation with strace revealed the time gaps occurred during communication with DNS servers, specifically while resolving an alias used in COLLECTOR_HOST. Try:

$ strace -e connect,sendto,recvfrom,poll condor_q
$ time _CONDOR_TOOL_DEBUG=D_ALL condor_q -debug 2>&1 | awk -f LogTimeGap.awk 1

Simply replacing the alias with the canonical name it referred to eliminated the gaps immediately. See bug 682442. A straightforward resolution, but it may not go far enough.

The issue was slow name resolution, and a single case was worked around. What about other cases, though? What if non-alias resolutions start taking a long time? Two more complete solutions present themselves: 0) perform a transform step on configuration and pre-resolve all hostnames, 1) add another service to the environment, one designed to mitigate these issues: nscd, the name service cache daemon.

The first option is fairly straightforward, and quite reasonable for many deployments, especially deployments that take advantage of Wallaby to simplify configuration management. Though it will take some care, will have to be Condor aware, and may leave a gap if host-based authentication is in use. Host-based authentication is used out of the box and in many deployments, so this may be a high barrier to entry.
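For a flavor of option 0, the transform might amount to resolving the alias once at deployment time and writing the canonical name into a config file. A hypothetical sketch – the output file name is made up, and it assumes the host tool from bind-utils is available:

#!/bin/sh
# Resolve a COLLECTOR_HOST alias to its canonical name once, instead of
# paying for CNAME resolution on every tool invocation.
alias=$(condor_config_val COLLECTOR_HOST)
canonical=$(host -t CNAME "$alias" | awk '/is an alias for/ {print $NF}' | sed 's/\.$//')
[ -n "$canonical" ] && echo "COLLECTOR_HOST = $canonical" > /etc/condor/config.d/99-resolved.conf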

The second option appears simpler. It amounts to service nscd start and possibly chkconfig --levels 23 nscd on. It will handle host-based authentication configurations more simply, and does not require the transformation step. However, you will have added yet another service to Condor’s environment, and, for that matter, a service you know will have a significant impact on execution. What happens when someone redeploying forgets to enable nscd, when nscd starts to misbehave, or when nscd simply is not available? Are you sure nscd inter-operates with your round-robin domain aliases? nscd will also interpose itself for other applications on the same system; were they written with proper layering in mind? A simple solution, but with possibly non-obvious implications.


Some interesting data during a particularly bad time period –

Running time condor_reschedule 1,000 times with and without nscd.
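A loop along these lines can generate the samples; a sketch, where the log names match those used below:

$ for i in $(seq 1000); do { time condor_reschedule; } 2>&1; done > no-nscd.log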

Without,

# Distribution of delays at second resolution, samples x time,
# 0 is the target. 0 is not hit, some executions up to 25 seconds.
$ grep ^real no-nscd.log | sed 's/.*m\([^.]*\).*/\1/' | sort -n | uniq -c
    327 0
    403 5
    197 10
     61 15
     10 20
      2 25

# Breakdown of sub-second executions, looking for consistency.
# Execution is not consistent, or not as consistent as with nscd.
$ grep ^real no-nscd.log | grep 0m0 | sort -n | uniq -c
     10 real	0m0.024s
     19 real	0m0.025s
     52 real	0m0.026s
     71 real	0m0.027s
     55 real	0m0.028s
     41 real	0m0.029s
     27 real	0m0.030s
      7 real	0m0.031s
      6 real	0m0.032s
      7 real	0m0.033s
      6 real	0m0.034s
      4 real	0m0.035s
      6 real	0m0.037s
      1 real	0m0.038s
      4 real	0m0.039s
      2 real	0m0.040s
      1 real	0m0.041s
      1 real	0m0.042s
      1 real	0m0.043s
      1 real	0m0.045s
      1 real	0m0.053s
      1 real	0m0.057s
      1 real	0m0.071s
      1 real	0m0.126s
      1 real	0m0.230s

With,

# Distribution of delays at second resolution, samples x time,
# 0 is the target.
$ grep ^real nscd.log | sed 's/.*m\([^.]*\).*/\1/' | sort -n | uniq -c
   1000 0

# Breakdown of sub-second executions, looking for consistency.
$ grep ^real nscd.log | grep 0m0 | sort -n | uniq -c
     69 real	0m0.012s
    888 real	0m0.013s
     36 real	0m0.014s
      6 real	0m0.015s
      1 real	0m0.016s

LogTimeGap.awk

#!/bin/gawk -f
# Report gaps larger than MAX_GAP seconds between consecutive
# timestamped log lines. Requires gawk for mktime() and gensub().
# Usage: LogTimeGap.awk [max_gap_seconds] < SomeDaemonLog

function parse_time(string) {
   # Convert a leading "MM/DD/YY HH:MM:SS" stamp to seconds since the
   # epoch. The year is pinned to 1984; only differences matter here.
   return mktime(gensub(/([^/]*)\/([^ ]*)\/([^ ]*) ([^:]*):([^:]*):([^ ]*) .*/,
                        "1984 \\1 \\2 \\4 \\5 \\6", "g", string))
}

BEGIN {
   previous_time = 0; previous_line = ""; current_time = 0
   # The sole argument is the gap threshold, not an input file; consume
   # it before awk tries to open it as a file.
   MAX_GAP = ARGV[1]
   ARGC = 1
   if (MAX_GAP == "") MAX_GAP = 30
   print "Maximum allowable gap:", MAX_GAP, "seconds"
}

{
   current_time = parse_time($0)
   gap = current_time - previous_time
   if (previous_time > 0 && gap > MAX_GAP) {
      print "Found gap of " gap " seconds:\n", previous_line "\n", $0
   }
   previous_line = $0
   previous_time = current_time
}

Firewalling execute nodes: Avoid LOWPORT/HIGHPORT, use IN_LOWPORT/IN_HIGHPORT

August 8, 2010

We all know why firewalls are set up. Typical firewall configurations minimize inbound connections and allow unrestricted outbound connections.

Condor primarily uses ephemeral ports for inbound connections. To ease firewall configuration, it has long provided the LOWPORT and HIGHPORT configuration options to constrain the port range it uses. Going beyond port-range management, Condor has grown to include the Condor Connection Broker (CCB), to reverse connections when components are entirely hidden behind firewalls, and condor_shared_port, to reduce the inbound port footprint on a machine to one.
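Enabling either of those is a small configuration change. A sketch for an execute node, assuming a build recent enough to include the features:

# Funnel all inbound connections through one daemon and one port
USE_SHARED_PORT = TRUE
DAEMON_LIST = MASTER, STARTD, SHARED_PORT
# Or, for a node fully hidden by its firewall, reverse connections via CCB
#CCB_ADDRESS = $(COLLECTOR_HOST)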

Unfortunately, there is a disconnect between the typical firewall configuration and what LOWPORT/HIGHPORT expresses: LOWPORT/HIGHPORT constrains both inbound and outbound port usage.

On an execute node, Condor will run a condor_master, a condor_startd and a few condor_starter processes, one per job. All must be able to accept connections. For a node that can run 4 jobs, the minimum number of inbound ports open in the node’s firewall is 6, one for each of the 6 potential processes. However, those processes will use more than just one port during their lifetimes. In fact, a process may have 3 open connections at some point. Using LOWPORT/HIGHPORT, that means setting a range 3 times wider than necessary. It is tempting to shrink the range, because not all processes will use all 3 connections at once – until they do. Going low is fragile.

Luckily, Condor provides IN_LOWPORT/IN_HIGHPORT and OUT_LOWPORT/OUT_HIGHPORT. For a typical firewall configuration, ignore the OUT_’s and use the IN_’s, e.g. IN_LOWPORT = 10000, IN_HIGHPORT = 10005. You will be much happier.
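With the inbound range pinned down, the firewall rule stays narrow; a sketch for iptables, assuming the example range above:

$ iptables -A INPUT -p tcp --dport 10000:10005 -j ACCEPT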

Port usage running 4 jobs with configuration,

ALL_DEBUG = D_NETWORK
IN_LOWPORT = 10000
IN_HIGHPORT = 10015
OUT_LOWPORT = 20000
OUT_HIGHPORT = 20015

Looks like,

MasterLog:08/08/10 09:25:10 Sock::bindWithin - bound to 10012...
MasterLog:08/08/10 09:25:13 Sock::bindWithin - bound to 10000...
MasterLog:08/08/10 09:25:18 Sock::bindWithin - bound to 20009...
StartLog:08/08/10 09:25:24 Sock::bindWithin - bound to 20015...
StartLog:08/08/10 09:25:24 Sock::bindWithin - bound to 20007...
StartLog:08/08/10 09:25:28 Sock::bindWithin - bound to 20003...
StartLog:08/08/10 09:25:34 Sock::bindWithin - bound to 10007...
StartLog:08/08/10 09:25:34 Sock::bindWithin - bound to 10013...
StartLog:08/08/10 09:25:34 Sock::bindWithin - bound to 10006...
StartLog:08/08/10 09:25:34 Sock::bindWithin - bound to 10015...
StarterLog.slot1:08/08/10 09:25:34 Sock::bindWithin - bound to 20007...
StarterLog.slot1:08/08/10 09:25:34 Sock::bindWithin - bound to 20008...
StarterLog.slot2:08/08/10 09:25:34 Sock::bindWithin - bound to 20003...
StarterLog.slot2:08/08/10 09:25:34 Sock::bindWithin - bound to 20013...
StarterLog.slot3:08/08/10 09:25:34 Sock::bindWithin - bound to 20011...
StarterLog.slot3:08/08/10 09:25:34 Sock::bindWithin - bound to 20012...
StarterLog.slot4:08/08/10 09:25:34 Sock::bindWithin - bound to 20013...
StarterLog.slot4:08/08/10 09:25:34 Sock::bindWithin - bound to 20004...

That’s 6 inbound ports and 12 outbound binds, with a few ports reused.
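The tally itself can be pulled straight from the logs; a sketch using the D_NETWORK messages above:

$ grep -h 'bound to' MasterLog StartLog StarterLog.slot* | sed 's/.*bound to \([0-9]*\).*/\1/' | sort -n | uniq -c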

