Know your environment: Importance of infrastructure – DNS

Consider running nscd on your core nodes.

All programs run in some environment. A program’s environment will have a significant impact on how the program executes. All programs should be very aware of how they interact with their environment.

Applications rely on their host operating system for much of their executing environment. An application might simply rely on the operating system for memory allocation, maybe even without alloc/free. Or an application may use a complex mixture of accessing disk, memory, network, semaphores, etc. All of these resources have trade-offs.

Distributed systems execute in an even more complex mixture. They will often hit all of the operating system provided resources, but also network present services. As a result, the set of people who need to understand the execution environment expands beyond application developers to include administrators.

During some scale testing of Condor, an unusual execution pattern appeared in condor_submit and condor_q: periodic 5 to 15 minute stalls, at about a 20% frequency, noticeable as gaps in log files and debug output. Try LogTimeGap.awk.

Investigation with strace revealed the time gaps were during communication with DNS servers, specifically while resolving an alias used in COLLECTOR_HOST. Try,

$ strace -e connect,sendto,recvfrom,poll condor_q
$ time _CONDOR_TOOL_DEBUG=D_ALL condor_q -debug 2>&1 | awk -f LogTimeGap.awk 1

Simply replacing the alias with the canonical name it referred to eliminated the gaps immediately. See bug 682442. A straightforward resolution, but it may not go far enough.
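To check whether name resolution itself is the slow step, independent of Condor, a lookup can be timed directly. This is a minimal sketch, not part of the original investigation: the helper names lookup_seconds and lookup_histogram are made up for illustration, and it assumes getent and date +%s are available on the host.

```shell
# Hypothetical helper: print how many whole seconds one NSS lookup takes.
lookup_seconds() {
    start=$(date +%s)
    getent hosts "$1" >/dev/null 2>&1
    end=$(date +%s)
    echo $((end - start))
}

# Hypothetical helper: repeat the lookup and print a "count seconds"
# histogram, in the same shape as the uniq -c output used for the data.
lookup_histogram() {
    name=$1
    runs=${2:-20}
    i=0
    while [ "$i" -lt "$runs" ]; do
        lookup_seconds "$name"
        i=$((i + 1))
    done | sort -n | uniq -c
}
```

Running lookup_histogram once against the alias and once against the name it points to should show whether only the alias is affected.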

The issue was slow name resolution, and a single case was worked around. What about other cases though? What if non-alias resolutions start taking a long time? Two more complete solutions appear: 0) perform a transform step on configuration and pre-resolve all hostnames, 1) add another service to the environment, one designed to mitigate these issues, called nscd.

The first option is fairly straightforward, and quite reasonable for many deployments, especially deployments that take advantage of Wallaby to simplify configuration management. It will take some care though: the transform has to be Condor aware, and it may leave a gap if host based authentication is being used. Host based authentication is used out of the box and in many deployments, so it may be a high barrier to entry.
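A transform step of this kind might look like the sketch below. It is an illustration only, not how Wallaby or Condor do it: resolve_host and pre_resolve_collector are hypothetical names, only the COLLECTOR_HOST setting is handled, and an optional :port suffix is assumed to follow a hostname (an IPv6 literal would need more care).

```shell
# Hypothetical helper: resolve a hostname to its first address via NSS.
resolve_host() {
    getent hosts "$1" | awk '{ print $1; exit }'
}

# Hypothetical transform: read configuration on stdin and rewrite
# "COLLECTOR_HOST = name[:port]" with the resolved address.
# Every other line passes through unchanged.
pre_resolve_collector() {
    while IFS= read -r line; do
        case $line in
            "COLLECTOR_HOST="* | "COLLECTOR_HOST "*"="*)
                value=$(printf '%s' "${line#*=}" | tr -d ' \t')
                host=${value%%:*}
                port=
                case $value in *:*) port=:${value#*:} ;; esac
                addr=$(resolve_host "$host")
                if [ -n "$addr" ]; then
                    printf 'COLLECTOR_HOST = %s%s\n' "$addr" "$port"
                else
                    # Leave the setting untouched if resolution fails.
                    printf '%s\n' "$line"
                fi
                ;;
            *)
                printf '%s\n' "$line"
                ;;
        esac
    done
}
```

Usage would be along the lines of pre_resolve_collector < condor_config.local > condor_config.resolved, where both file names are placeholders. A real transform would have to cover every hostname-bearing setting, which is where the Condor awareness comes in.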

The second option appears simpler. It amounts to service nscd start and possibly chkconfig --levels 23 nscd on. It will handle host based authentication configurations more simply, and does not require the transformation step. However, you will have added yet another service to Condor’s environment, and, for that matter, a service you know will have a significant impact on execution. What happens when someone redeploying forgets to enable nscd, if nscd starts to misbehave itself, or nscd simply is not available? Are you sure nscd inter-operates with your round-robin domain aliases? nscd is going to interpose itself for other applications on the same system; were they written with proper layering in mind? A simple solution, but with possibly non-obvious implications.
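One way to guard against the forgotten-nscd case is a pre-flight check run before Condor starts. This is a minimal sketch under assumptions: nscd_running and check_name_cache are hypothetical names, and the check only confirms that a process named nscd exists, not that its hosts cache is enabled or healthy.

```shell
# Hypothetical check: succeeds when a process named exactly "nscd" exists.
nscd_running() {
    pgrep -x nscd >/dev/null 2>&1
}

# Hypothetical pre-flight: report the cache state and fail when the
# service Condor has come to depend on is absent.
check_name_cache() {
    if nscd_running; then
        echo "nscd is running"
    else
        echo "WARNING: nscd is not running; name lookups will not be cached" >&2
        return 1
    fi
}
```

Calling check_name_cache from an init script or a monitoring probe makes the dependency explicit instead of leaving it as tribal knowledge.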


Some interesting data during a particularly bad time period –

Running time condor_reschedule 1,000 times with and without nscd.
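The exact loop used to collect the logs is not recorded here. A sketch that would produce a file in the same format follows, assuming bash is installed; collect_timings is a hypothetical name.

```shell
# Hypothetical collection loop: run a command COUNT times and keep only
# the "real" line that bash's time keyword reports for each run.
collect_timings() {
    count=$1
    shift
    i=0
    while [ "$i" -lt "$count" ]; do
        # time is a bash keyword; calling bash explicitly keeps the
        # "real 0m0.013s" format whatever shell runs this function.
        # The command's own stdout is discarded.
        bash -c 'time "$@"' timed "$@" 2>&1 >/dev/null | grep '^real'
        i=$((i + 1))
    done
}
```

For example, collect_timings 1000 condor_reschedule > nscd.log with nscd running, and again into no-nscd.log with it stopped.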

Without,

# Distribution of delays at one-second resolution: sample count, then seconds.
# 0 is the target. It is hit in only 327 of 1,000 runs; some executions take up to 25 seconds.
$ grep ^real no-nscd.log | sed 's/.*m\([^.]*\).*/\1/' | sort -n | uniq -c
    327 0
    403 5
    197 10
     61 15
     10 20
      2 25

# Breakdown of sub-second executions, looking for consistency.
# Execution is not consistent, or not as consistent as with nscd.
$ grep ^real no-nscd.log | grep 0m0 | sort -n | uniq -c
     10 real	0m0.024s
     19 real	0m0.025s
     52 real	0m0.026s
     71 real	0m0.027s
     55 real	0m0.028s
     41 real	0m0.029s
     27 real	0m0.030s
      7 real	0m0.031s
      6 real	0m0.032s
      7 real	0m0.033s
      6 real	0m0.034s
      4 real	0m0.035s
      6 real	0m0.037s
      1 real	0m0.038s
      4 real	0m0.039s
      2 real	0m0.040s
      1 real	0m0.041s
      1 real	0m0.042s
      1 real	0m0.043s
      1 real	0m0.045s
      1 real	0m0.053s
      1 real	0m0.057s
      1 real	0m0.071s
      1 real	0m0.126s
      1 real	0m0.230s

With,

# Distribution of delays at one-second resolution: sample count, then seconds.
# 0 is the target, and all 1,000 runs hit it.
$ grep ^real nscd.log | sed 's/.*m\([^.]*\).*/\1/' | sort -n | uniq -c
   1000 0

# Breakdown of sub-second executions, looking for consistency.
$ grep ^real nscd.log | grep 0m0 | sort -n | uniq -c
     69 real	0m0.012s
    888 real	0m0.013s
     36 real	0m0.014s
      6 real	0m0.015s
      1 real	0m0.016s

LogTimeGap.awk

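#!/bin/awk -f
# Requires gawk: mktime() and gensub() are GNU extensions.

# Convert a leading "MM/DD/YY HH:MM:SS" timestamp into epoch seconds.
# The year field is ignored and pinned to 1984, so a gap that spans a
# new year is not handled. mktime() returns -1 when it cannot parse.
function parse_time(string) {
   return mktime(gensub(/([^/]*)\/([^ ]*)\/([^ ]*) ([^:]*):([^:]*):([^ ]*) .*/,
                        "1984 \\1 \\2 \\4 \\5 \\6", "g", string))
}

BEGIN {
   previous_time = 0; previous_line = ""; current_time = 0
   # The first argument is the gap threshold, not an input file.
   MAX_GAP = ARGV[1]
   ARGC = 1
   if (MAX_GAP == "") MAX_GAP = 30
   MAX_GAP += 0   # force a numeric comparison below
   print "Maximum allowable gap:", MAX_GAP, "seconds"
}

{
   current_time = parse_time($0)
   if (current_time < 0) next   # skip lines without a timestamp
   gap = current_time - previous_time
   if (previous_time > 0 && gap > MAX_GAP) {
      print "Found gap of " gap " seconds:\n", previous_line "\n", $0
   }
   previous_line = $0
   previous_time = current_time
}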
3 Responses to “Know your environment: Importance of infrastructure – DNS”

  1. Name Says:

    Having had too many experiences with nscd causing whacky problems, there’s no way that I would ever recommend “Consider running nscd on your core nodes.”

    • spinningmatt Says:

      Fair enough if you include info behind “wacky problems” – otherwise folks can’t make informed decisions.

  2. Lans Carstensen Says:

    I definitely recommend nscd for any grid deployments. We’ll also be looking at SSSD and measuring it in comparison to nscd, it is scoped very differently but appears to offer equivalent NSS level caching. NSS caching is critical in an LDAP environment also to avoid erratic behavior due to passwd and group id resolutions.