Posts Tagged ‘Performance’

Daylight savings, costs CPU

March 20, 2012

World daylight savings map

Erik ran into a surprising performance problem over the weekend.

If you live in an orange area of the map, mktime(3) may be killing your performance.

tm_isdst is important. Don’t guess!

Don’t construct struct tm manually. Or, even better, let mktime decide DST for you. Set tm_isdst = -1.

Are you impacted? Run your programs under callgrind, vary between Standard and Daylight, and look at mktime’s profile. Using Erik’s jig,

$ TZ=EDT valgrind --tool=callgrind ./test_mktime
...
$ TZ=EST valgrind --tool=callgrind ./test_mktime
...
$ callgrind_annotate --inclusive=yes --tree=calling ...
...

$ grep "mktime " EST.txt EDT.txt 
EST.txt:Profiled target:  ./test_mktime (PID 29595, part 1)
EST.txt:    1,765  >   ???:mktime (1x) [/lib64/libc-2.5.so]
EST.txt:  546,244  *  ???:mktime [/lib64/libc-2.5.so]
EST.txt:  542,350  >   ???:mktime (1x) [/lib64/libc-2.5.so]

EDT.txt:Profiled target:  ./test_mktime (PID 18878, part 1)
EDT.txt:    4,819  >   ???:mktime (1x) [/lib64/libc-2.5.so]
EDT.txt:   23,119  *  ???:mktime [/lib64/libc-2.5.so]

It is currently EDT.

Advertisements

Quick walk with Condor: Looking at Scheduler performance w/o notification

April 24, 2011

Recently I did a quick walk with the Schedd, looking at its submission and completion rates. Out of the box, submitting jobs with no special consideration for performance, the Schedd comfortably ran 55 jobs per second.

Without sending notifications, the Schedd can sustain a rate of 85 jobs per second.

I ran the test again, this time with notification=never and in two configurations: first, with 500,000 jobs submitted upfront; second, with submissions occurring during completions. The idea was to get an idea for performance when the Shadow is not burdened with sending email notifications of job completions, and to figure out how the Schedd performs with respect to servicing condor_submit at the same time it is running jobs.

First, submitting 500,000 jobs and then letting them drain showed a sustained rate of about 86 jobs per second.

Upfront submission of 500K jobs, drain at rate of 85 jobs per second

Second, building up the submission rate to about 100 jobs per second showed a sustained rate of about 83 jobs per second (81 shown in graph below).

Submission and drain rates of 85 jobs per second

The results are quite satisfying, and show the Schedd can sustain a reasonably high job execution rate at the same time it services submissions.

Quick walk with Condor: Looking at Scheduler performance

April 15, 2011

This was a simple walk to get a feel for what a single, out of the box, 7.6 condor_schedd could handle for a job load. There were about 1,500 dynamic slots and all jobs were 5 second sleeps.

It turns out that without errors in the system, a single Schedd can sustain at least a rate of 55 jobs per second.

Out of the box condor_schedd performance visualized with Cumin

Graphs courtesy Cumin and Ernie Allen‘s recent visualization of Erik Erlandson‘s new Schedd stats. This is a drill-down into a Scheduler.

Here’s what happened. I started submitting 25 jobs (queue 25) every 5 seconds. You can’t see this unfortunately, it is off the left side of the graphs. The submission, start and completion rates were all equal at 5 per second. Every five/ten/fifteen minutes, when I felt like it, I ramped that up a bit, by increasing the jobs per submit (queue 50 then 100 then 150 then 200) and the rate of submission (every 5 seconds then 2 seconds then 1). The scaling up matched my expectations. At 50 jobs every second, I saw 10 jobs submitted/started/completed per second. At 100 job, the rates were 20 per second. I eventually worked it up to rates about 50-55 per second.

Then we got to the 30 minute mark in the graphs. B shows that Shadows started failing. I let this run for about 10 minutes, the Schedd’s rates fluctuated down to between 45-50 jobs per second, and then kicked up the submission rate. The spike in submissions is quite visible to the right of A in the Rates graph.

At this point the Schedd was sustaining about 45 jobs per second and the Shadow exception rate was fairly sustained. I decided to kill off the submissions, also quite visible. The Schedd popped back up to 55 jobs per second and finished off its backlog.

A bit about the errors, simple investigation: condor_status -sched -long | grep ExitCode, oh a bunch of 108s; grep -B10 “STATUS 108” /var/log/condor/ShadowLog, oh a bunch of network issues and some evictions; pull out the hosts the Shadows were connecting to, oh mostly a couple nodes that were reported as being flakey; done.

Throughout this entire walk, the Mean start times turned out to be an interesting graph. It shows two things: 0) Mean time to start cumulative – the mean time between a job is queued to when it first starts running, over the lifetime of the Schedd; and, 1) Mean time to start – the same metric, but over an adjustable window, defaulting to 5 minutes. Until the exceptions started and when I blew the submissions out, around C, the mean queue time/wait time/time to start was consistently between 4 and 6 seconds. I did not look at the service rate on the back side of the job’s execution, e.g. how long it waited before termination was recognized, mail was sent, etc.

That’s about it. Though, remember this was done with about 1,500 slots. I didn’t vary the number of slots, say to see if the Schedd would have issues sustaining more than 1,500 concurrent jobs. I also didn’t do much in the way of calculations to compare the mean start time to the backlog (idle job count).

Also, if anyone tries this themselves, I would suggest notification=never in your submissions. It will prevent an email getting sent for each job, will save the Schedd and Shadow a good amount of work, and in my case would have resulted in a 389MB reduction in email.

Know your environment: Importance of infrastructure – DNS

April 6, 2011

Consider running nscd on your core nodes.

All programs run in some environment. A program’s environment will have a significant impact on how the program executes. All programs should be very aware of how they interact with their environment.

Applications rely on their host operating system for much of their executing environment. An application might simply rely on the operating system for memory allocation, maybe even without alloc/free. Or an application may use a complex mixture of accessing disk, memory, network, semaphores, etc. All of these resources have trade-offs.

Distributed systems execute in an even more complex mixture. They will often hit all of the operating system provided resources, but also network present services. In doing so, those needing to understand the execution environment expands beyond just the application to administrators.

During some scale testing of Condor, an unusual execution pattern appeared in condor_submit and condor_q. Periodic, 5 to 15 minute stalls at about a 20% frequency. Noticeable as gaps in log files and debug output. Try LogTimeGap.awk.

Investigation with strace revealed the time gaps were during communication with DNS servers. Specifically, resolving an alias used in COLLECTOR_HOST. Try strace -e connect,sendto,recvfrom,poll condor_q and time _CONDOR_TOOL_DEBUG=D_ALL condor_q -debug 2>&1 | awk -f LogTimeGap.awk 1. Simply replacing the alias with the CNAME it referred to eliminated the gaps immediately. See bug 682442. A straightforward resolution, but it may not go far enough.

The issue was slow name resolution, and a single case was worked around. What about other cases though? What if non-alias resolutions start taking a long time. Two more complete solutions appear: 0) perform a transform step on configuration and pre-resolve all hostnames, 1) add another service to the environment, one designed to mitigate these issues, called nscd.

The first option is fairly straightforward, and quite reasonable for many deployments. Especially deployments that take advantage of Wallaby to simplify configuration management. Though, it will take some care, will have to be Condor aware, and may have a gap if host based authentication is being used. Host based authentication is used out of the box and in many deployments, so it may be a high barrier to entry.

The second option appears simpler. It amounts to service nscd start and possibly chkconfig --levels 23 nscd on. It will handle host based authentication configurations more simply, and does not require the transformation step. However, you will have added yet another service to Condor’s environment, and, for that matter, a service you know will have a significant impact on execution. What happens when someone redeploying forgets to enable nscd, if nscd starts to misbehave itself, or nscd simply is not available. Are you sure nscd inter-operates with your round-robin domain aliases. nscd is going to interpose itself for other applications on the same system, were they written with proper layering in mind. A simple solution, but with possibly non-obvious implications.


Some interesting data during a particularly bad time period –

Running time condor_reschedule 1,000 times with and without nscd.

Without,

# Distribution of delays at second resolution, samples x time,
# 0 is the target. 0 is not hit, some executions up to 25 seconds.
$ grep ^real no-nscd.log | sed 's/.*m\([^.]*\).*/\1/' | sort -n | uniq -c
    327 0
    403 5
    197 10
     61 15
     10 20
      2 25

# Breakdown of sub-second executions, looking for consistency.
# Execution is not consistent, or not as consistent as with nscd.
$ grep ^real no-nscd.log | grep 0m0 | sort -n | uniq -c
     10 real	0m0.024s
     19 real	0m0.025s
     52 real	0m0.026s
     71 real	0m0.027s
     55 real	0m0.028s
     41 real	0m0.029s
     27 real	0m0.030s
      7 real	0m0.031s
      6 real	0m0.032s
      7 real	0m0.033s
      6 real	0m0.034s
      4 real	0m0.035s
      6 real	0m0.037s
      1 real	0m0.038s
      4 real	0m0.039s
      2 real	0m0.040s
      1 real	0m0.041s
      1 real	0m0.042s
      1 real	0m0.043s
      1 real	0m0.045s
      1 real	0m0.053s
      1 real	0m0.057s
      1 real	0m0.071s
      1 real	0m0.126s
      1 real	0m0.230s

With,

# Distribution of delays at second resolution, samples x time,
# 0 is the target.
$ grep ^real nscd.log | sed 's/.*m\([^.]*\).*/\1/' | sort -n | uniq -c
   1000 0

# Breakdown of sub-second executions, looking for consistency.
$ grep ^real nscd.log | grep 0m0 | sort -n | uniq -c
     69 real	0m0.012s
    888 real	0m0.013s
     36 real	0m0.014s
      6 real	0m0.015s
      1 real	0m0.016s

LogTimeGap.awk

#!/bin/awk -f

function parse_time(string) {
   return mktime(gensub(/([^/]*)\/([^ ]*)\/([^ ]*) ([^:]*):([^:]*):([^ ]*) .*/,
                        "1984 \\1 \\2 \\4 \\5 \\6", "g"))
}

BEGIN {
   previous_time = 0; previous_line = ""; current_time = 0
   ARGC = 1
   MAX_GAP = ARGV[1]
   if (MAX_GAP == "") MAX_GAP = 30
   print "Maximum allowable gap:", MAX_GAP, "seconds"
}

{
   current_time = parse_time($0)
   gap = current_time - previous_time
   if (previous_time > 0 && gap > MAX_GAP) {
	   print "Found gap of " gap " seconds:\n", previous_line "\n", $0
   }
   previous_line = $0
   previous_time = current_time
}

END { }

Where does the memory go? The condor_shadow.

May 24, 2010

The condor_schedd’s architecture currently represents each running job as a condor_shadow process it can directly manage. I’m going to skip the pros & cons of threading vs processes, or multiplexing vs not. Those are great topics, but not of interest right now. The condor_schedd uses processes and does not multiplex, so when it is managing 10,000 running jobs it will have 10,000 running condor_shadow processes. Other than the shock of 10,000 processes, what are some immediate concerns? 1) CPU usage – actually very small, condor_shadow processes spend all their time waiting on I/O, 2) I/O usage – condor_shadows may spend all their time waiting on I/O, but when that I/O opens it is typically small, maybe 10K network and a few K disk, 3) memory usage – now that’s an interesting one.

Quick ways to look at memory usage?

1) top, but it’s not going to be as helpful as you might like.

$ top -n1 -p $(pidof condor_shadow-base)                          
...
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
10487 matt      20   0 10092 3756 3176 S  0.0  0.2   0:00.01 condor_shadow-b

This gives a general idea about the memory usage of the condor_shadow. It has about 3756KB of resident memory and is sharing about 3176KB.

What we really want to see is the amount of memory that cannot be shared between condor_shadow processes. This will give us a rough lower bound on how much memory is needed for 10,000 condor_shadows. top might suggest that number is 3756-3176K=580K, but that’s a common mistake. A quick web search will tell you why.

2) pmap, which will give a better idea and where the memory is going.

$ pmap -d $(pidof condor_shadow-base)
Address   Kbytes Mode  Offset           Device    Mapping
00110000     884 r-x-- 0000000000000000 0fd:00000 libstdc++.so.6.0.13
001ed000      16 r---- 00000000000dc000 0fd:00000 libstdc++.so.6.0.13
001f1000       8 rw--- 00000000000e0000 0fd:00000 libstdc++.so.6.0.13
001f3000      24 rw--- 0000000000000000 000:00000   [ anon ]
001f9000      44 r-x-- 0000000000000000 0fd:00000 libnss_files-2.11.1.so
00204000       4 r---- 000000000000a000 0fd:00000 libnss_files-2.11.1.so
00205000       4 rw--- 000000000000b000 0fd:00000 libnss_files-2.11.1.so
0063a000     116 r-x-- 0000000000000000 0fd:00000 libgcc_s-4.4.3-20100127.so.1
00657000       4 rw--- 000000000001c000 0fd:00000 libgcc_s-4.4.3-20100127.so.1
00733000       4 r-x-- 0000000000000000 000:00000   [ anon ]
00745000       8 r-x-- 0000000000000000 0fd:00000 libcom_err.so.2.1
00747000       4 rw--- 0000000000002000 0fd:00000 libcom_err.so.2.1
0074a000       8 r-x-- 0000000000000000 0fd:00000 libkeyutils-1.2.so
0074c000       4 rw--- 0000000000001000 0fd:00000 libkeyutils-1.2.so
0074f000     716 r-x-- 0000000000000000 0fd:00000 libkrb5.so.3.3
00802000      24 rw--- 00000000000b3000 0fd:00000 libkrb5.so.3.3
0082d000     276 r-x-- 0000000000000000 0fd:00000 libfreebl3.so
00872000       4 rw--- 0000000000045000 0fd:00000 libfreebl3.so
00873000      16 rw--- 0000000000000000 000:00000   [ anon ]
00880000     180 r-x-- 0000000000000000 0fd:00000 libgssapi_krb5.so.2.2
008ad000       4 rw--- 000000000002d000 0fd:00000 libgssapi_krb5.so.2.2
008b0000     168 r-x-- 0000000000000000 0fd:00000 libk5crypto.so.3.1
008da000       4 rw--- 000000000002a000 0fd:00000 libk5crypto.so.3.1
008dd000      32 r-x-- 0000000000000000 0fd:00000 libkrb5support.so.0.1
008e5000       4 rw--- 0000000000007000 0fd:00000 libkrb5support.so.0.1
008e8000      28 r-x-- 0000000000000000 0fd:00000 libcrypt-2.11.1.so
008ef000       4 r---- 0000000000007000 0fd:00000 libcrypt-2.11.1.so
008f0000       4 rw--- 0000000000008000 0fd:00000 libcrypt-2.11.1.so
008f1000     156 rw--- 0000000000000000 000:00000   [ anon ]
0091a000     328 r-x-- 0000000000000000 0fd:00000 libssl.so.1.0.0
0096c000      16 rw--- 0000000000051000 0fd:00000 libssl.so.1.0.0
00974000     120 r-x-- 0000000000000000 0fd:00000 ld-2.11.1.so
00992000       4 r---- 000000000001d000 0fd:00000 ld-2.11.1.so
00993000       4 rw--- 000000000001e000 0fd:00000 ld-2.11.1.so
00996000    1464 r-x-- 0000000000000000 0fd:00000 libc-2.11.1.so
00b04000       8 r---- 000000000016e000 0fd:00000 libc-2.11.1.so
00b06000       4 rw--- 0000000000170000 0fd:00000 libc-2.11.1.so
00b07000      12 rw--- 0000000000000000 000:00000   [ anon ]
00b0c000      88 r-x-- 0000000000000000 0fd:00000 libpthread-2.11.1.so
00b22000       4 r---- 0000000000015000 0fd:00000 libpthread-2.11.1.so
00b23000       4 rw--- 0000000000016000 0fd:00000 libpthread-2.11.1.so
00b24000       8 rw--- 0000000000000000 000:00000   [ anon ]
00b28000      12 r-x-- 0000000000000000 0fd:00000 libdl-2.11.1.so
00b2b000       4 r---- 0000000000002000 0fd:00000 libdl-2.11.1.so
00b2c000       4 rw--- 0000000000003000 0fd:00000 libdl-2.11.1.so
00b2f000     160 r-x-- 0000000000000000 0fd:00000 libm-2.11.1.so
00b57000       4 r---- 0000000000027000 0fd:00000 libm-2.11.1.so
00b58000       4 rw--- 0000000000028000 0fd:00000 libm-2.11.1.so
00b5b000      72 r-x-- 0000000000000000 0fd:00000 libz.so.1.2.3
00b6d000       4 rw--- 0000000000011000 0fd:00000 libz.so.1.2.3
00b7b000     112 r-x-- 0000000000000000 0fd:00000 libselinux.so.1
00b97000       4 r---- 000000000001b000 0fd:00000 libselinux.so.1
00b98000       4 rw--- 000000000001c000 0fd:00000 libselinux.so.1
00b9b000      80 r-x-- 0000000000000000 0fd:00000 libresolv-2.11.1.so
00baf000       4 ----- 0000000000014000 0fd:00000 libresolv-2.11.1.so
00bb0000       4 r---- 0000000000014000 0fd:00000 libresolv-2.11.1.so
00bb1000       4 rw--- 0000000000015000 0fd:00000 libresolv-2.11.1.so
00bb2000       8 rw--- 0000000000000000 000:00000   [ anon ]
042a3000    1476 r-x-- 0000000000000000 0fd:00000 libcrypto.so.1.0.0
04414000      80 rw--- 0000000000170000 0fd:00000 libcrypto.so.1.0.0
04428000      12 rw--- 0000000000000000 000:00000   [ anon ]
049f0000     188 r-x-- 0000000000000000 0fd:00000 libpcre.so.0.0.1
04a1f000       4 rw--- 000000000002e000 0fd:00000 libpcre.so.0.0.1
08048000    2496 r-x-- 0000000000000000 0fd:00001 condor_shadow-base
082b8000       8 rw--- 0000000000270000 0fd:00001 condor_shadow-base
082ba000      12 rw--- 0000000000000000 000:00000   [ anon ]
09c64000     396 rw--- 0000000000000000 000:00000   [ anon ]
b7819000      24 rw--- 0000000000000000 000:00000   [ anon ]
b7839000      12 rw--- 0000000000000000 000:00000   [ anon ]
bf930000      84 rw--- 0000000000000000 000:00000   [ stack ]
mapped: 10092K    writeable/private: 972K    shared: 0K

“writable/private: 972K” is a better number to go by. It suggests that 10,000 shadows will require at least 972*10,000/2^20=9.269GB of memory.

So that’s where the memory is. Now why is it there?

I’ve been playing with Massif lately, so it seemed like a reasonable place to start. It can produce a usage graph,

    KB
335.8^                                                                      # 
     |                                                  ::::::::::@:::::::::#:
     |                                             @@:@::::: :::: @: : :::::#:
     |                                       ::::::@ :@::::: :::: @: : :::::#:
     |                                       ::::::@ :@::::: :::: @: : :::::#:
     |                                @::::::::::::@ :@::::: :::: @: : :::::#:
     |                         ::@::@:@:::: :::::::@ :@::::: :::: @: : :::::#:
     |                       @:: @::@:@:::: :::::::@ :@::::: :::: @: : :::::#:
     |                      :@:: @::@:@:::: :::::::@ :@::::: :::: @: : :::::#:
     |                     ::@:: @::@:@:::: :::::::@ :@::::: :::: @: : :::::#:
     |                    @::@:: @::@:@:::: :::::::@ :@::::: :::: @: : :::::#:
     |                    @::@:: @::@:@:::: :::::::@ :@::::: :::: @: : :::::#:
     |                  @@@::@:: @::@:@:::: :::::::@ :@::::: :::: @: : :::::#:
     |                  @ @::@:: @::@:@:::: :::::::@ :@::::: :::: @: : :::::#:
     |                 @@ @::@:: @::@:@:::: :::::::@ :@::::: :::: @: : :::::#:
     |                @@@ @::@:: @::@:@:::: :::::::@ :@::::: :::: @: : :::::#:
     |                @@@ @::@:: @::@:@:::: :::::::@ :@::::: :::: @: : :::::#:
     |              @@@@@ @::@:: @::@:@:::: :::::::@ :@::::: :::: @: : :::::#:
     |              @ @@@ @::@:: @::@:@:::: :::::::@ :@::::: :::: @: : :::::#:
     |             @@ @@@ @::@:: @::@:@:::: :::::::@ :@::::: :::: @: : :::::#:
   0 +----------------------------------------------------------------------->Mi
     0                                                                   17.22

Massif tells us about dynamic memory allocations, which in this case amount to 335KB. That is only a third of the 972K of private memory reported by pmap. This suggests that two-thirds of the private memory in the shadow may be from static allocations.

objdump is another tool I’ve been using lately to inspect ELF binaries.

$ objdump -t condor_shadow-base | grep -e "\.bss" -e "\.data" | sort -k5 | c++filt | tail
082bc900 l     O .bss	00000100              pe_logname.7591
082bca00 l     O .bss	00000100              pe_user.7592
082bc160 l     O .bss	00000100              priv_identifier::id
082b8d80 l     O .data	00000120              CondorEnvironList
082bbbc0 g     O .bss	000001c4              ConfigTab
082bbf20 l     O .bss	00000200              priv_history
082ba920 l     O .bss	00001000              answer.7409
082b9920 l     O .bss	00001000              answer.7437
082b89a0 g       .data	00000000              __data_start
082b89a0  w      .data	00000000              data_start

What’s this, two 1K static buffers named answer.

* * *

After some code inspection it turned out that two functions in ckpt_name.c contained static char answer[MAXPATHLEN], where MAXPATHLEN is of course 1024. One of those functions, gen_exec_name, was dead code and easy to remove. The other, gen_ckpt_name, was active, with all but one of its call sites immediately strdup()’ing its result. After necessary modifications,

$ objdump -t condor_shadow-sans-answers | grep -e "\.bss" -e "\.data" | sort -k5 | c++filt | tail
082b9720 g     O .bss	000000e8              MaxLog
082b8b00 l     O .data	000000f0              SigNameArray
082ba900 l     O .bss	00000100              pe_logname.7591
082baa00 l     O .bss	00000100              pe_user.7592
082ba160 l     O .bss	00000100              priv_identifier::id
082b8d80 l     O .data	00000120              CondorEnvironList
082b9bc0 g     O .bss	000001c4              ConfigTab
082b9f20 l     O .bss	00000200              priv_history
082b89a0 g       .data	00000000              __data_start
082b89a0  w      .data	00000000              data_start

And while running,

$ pmap -d $(pidof condor_shadow-sans-answers) | grep -e rw -e private
...
082b8000       8 rw--- 0000000000270000 0fd:00001 condor_shadow-sans-answers
082ba000       4 rw--- 0000000000000000 000:00000   [ anon ]
089f0000     396 rw--- 0000000000000000 000:00000   [ anon ]
b7877000      24 rw--- 0000000000000000 000:00000   [ anon ]
b7897000      12 rw--- 0000000000000000 000:00000   [ anon ]
bff58000      84 rw--- 0000000000000000 000:00000   [ stack ]
mapped: 10084K    writeable/private: 964K    shared: 0K

We see an expected decrease in the shadow’s private address space. But, even though those answer buffers were the largest around, we’re aiming at small game. This is a ~0.8% improvement, or ~78MB out of our ~9.269GB. At least we now have a procedure for further reducing memory, even if it is a fraction of a percent at a time. In fact, with a little more digging there are more possibilities here,

$ objdump -t condor_shadow-sans-answers | grep -e "\.bss" -e "\.data" | sort -k5 | c++filt | \
   grep NULL_XACTION | tail -n1                     
082babbc l     O .bss	00000004              classad::NULL_XACTION

While classad::NULL_XACTION may be small, there are quite a few of them,

$ objdump -t condor_shadow-sans-answers | grep -e "\.bss" -e "\.data" | sort -k5 | c++filt | \
   awk '{print $6}' | sort | uniq -c | sort -n | tail
      2 DEFAULT_INDENT
      2 EvalBool(compat_classad::ClassAd*,
      2 SharedPortEndpoint::SharedPortEndpoint(char
      2 SharedPortEndpoint::UseSharedPort(MyString*,
      2 std::__ioinit
      2 SubsystemInfo::setClass(SubsystemInfoLookup
      3 vtable
     11 guard
    107 classad::NULL_XACTION

* * *

And it turns out classad::NULL_XACTION is not only defined in a header file that is included in many places, but it is completely unused. Note: Please don’t put data like NULL_XACTION in a header.

$ objdump -t condor_shadow-sans-answers-null_xaction | grep -e "\.bss" -e "\.data" | sort -k5 | c++filt | \
   awk '{print $6}' | sort | uniq -c | sort -n | tail
      1 ZZZZZ
      2 DEFAULT_INDENT
      2 EvalBool(compat_classad::ClassAd*,
      2 SharedPortEndpoint::SharedPortEndpoint(char
      2 SharedPortEndpoint::UseSharedPort(MyString*,
      2 std::__ioinit
      2 SubsystemInfo::setClass(SubsystemInfoLookup
      3 vtable
     11 guard

Since NULL_XACTION is not actually used its removal should not change the private address space of the shadow, but instead the overall binary and memory size.

$ pmap -d $(pidof condor_shadow-sans-answers-null_xaction) | grep -e rw -e private
...
082b2000       8 rw--- 0000000000269000 0fd:00001 condor_shadow-sans-answers-null_xaction
082b4000       4 rw--- 0000000000000000 000:00000   [ anon ]
08a0a000     396 rw--- 0000000000000000 000:00000   [ anon ]
b77f5000      24 rw--- 0000000000000000 000:00000   [ anon ]
b7815000      12 rw--- 0000000000000000 000:00000   [ anon ]
bfd28000      84 rw--- 0000000000000000 000:00000   [ stack ]
mapped: 10060K    writeable/private: 964K    shared: 0K

Which appears to be the case,

$ size *
   text	   data	    bss	    dec	    hex	filename
2554154	   5032	  14360	2573546	 2744ea	condor_shadow-base
2554146	   5032	   6168	2565346	 2724e2	condor_shadow-sans-answers
2528350	   4648	   5724	2538722	 26bce2	condor_shadow-sans-answers-null_xaction

Ok, so that was just a diversion. Back to big game. From the original pmap dump, there are a number of libraries linked in with the shadow that are using memory of their own. One of those libraries, libcrypt is using far more than the others.

008e8000      28 r-x-- 0000000000000000 0fd:00000 libcrypt-2.11.1.so
008ef000       4 r---- 0000000000007000 0fd:00000 libcrypt-2.11.1.so
008f0000       4 rw--- 0000000000008000 0fd:00000 libcrypt-2.11.1.so
008f1000     156 rw--- 0000000000000000 000:00000   [ anon ]

* * *

The existing build system for Condor, which uses imake(!), has a nasty habit of linking all binaries with all libraries. And as it turns out, crypt is not used by Condor code anymore. It has been replaced by OpenSSL’s DES implementation, likely many years ago.

Running without libcrypt,

$ pmap -d $(pidof condor_shadow-sans-answers-null_xaction-crypt) | grep -e rw -e private 
005e6000       4 rw--- 000000000000b000 0fd:00000 libnss_files-2.11.1.so
00657000       4 rw--- 000000000001c000 0fd:00000 libgcc_s-4.4.3-20100127.so.1
0073b000       8 rw--- 00000000000e0000 0fd:00000 libstdc++.so.6.0.13
0073d000      24 rw--- 0000000000000000 000:00000   [ anon ]
00747000       4 rw--- 0000000000002000 0fd:00000 libcom_err.so.2.1
0074c000       4 rw--- 0000000000001000 0fd:00000 libkeyutils-1.2.so
00802000      24 rw--- 00000000000b3000 0fd:00000 libkrb5.so.3.3
008ad000       4 rw--- 000000000002d000 0fd:00000 libgssapi_krb5.so.2.2
008da000       4 rw--- 000000000002a000 0fd:00000 libk5crypto.so.3.1
008e5000       4 rw--- 0000000000007000 0fd:00000 libkrb5support.so.0.1
0096c000      16 rw--- 0000000000051000 0fd:00000 libssl.so.1.0.0
00993000       4 rw--- 000000000001e000 0fd:00000 ld-2.11.1.so
00b06000       4 rw--- 0000000000170000 0fd:00000 libc-2.11.1.so
00b07000      12 rw--- 0000000000000000 000:00000   [ anon ]
00b23000       4 rw--- 0000000000016000 0fd:00000 libpthread-2.11.1.so
00b24000       8 rw--- 0000000000000000 000:00000   [ anon ]
00b2c000       4 rw--- 0000000000003000 0fd:00000 libdl-2.11.1.so
00b58000       4 rw--- 0000000000028000 0fd:00000 libm-2.11.1.so
00b6d000       4 rw--- 0000000000011000 0fd:00000 libz.so.1.2.3
00b98000       4 rw--- 000000000001c000 0fd:00000 libselinux.so.1
00bb1000       4 rw--- 0000000000015000 0fd:00000 libresolv-2.11.1.so
00bb2000       8 rw--- 0000000000000000 000:00000   [ anon ]
04414000      80 rw--- 0000000000170000 0fd:00000 libcrypto.so.1.0.0
04428000      12 rw--- 0000000000000000 000:00000   [ anon ]
04a1f000       4 rw--- 000000000002e000 0fd:00000 libpcre.so.0.0.1
082ad000       8 rw--- 0000000000264000 0fd:00001 condor_shadow-sans-answers-null_xaction-crypt
082af000       4 rw--- 0000000000000000 000:00000   [ anon ]
0908e000     396 rw--- 0000000000000000 000:00000   [ anon ]
b77bd000      20 rw--- 0000000000000000 000:00000   [ anon ]
b77dc000      12 rw--- 0000000000000000 000:00000   [ anon ]
bf820000      84 rw--- 0000000000000000 000:00000   [ stack ]
mapped: 9548K    writeable/private: 780K    shared: 0K

We see a huge improvement in both mapped and private memory usage. 780K/972K shows a ~19% improvement, or a reduction in memory of ~1.8GB over 10,000 running jobs.

BIG NOTE: The condor_shadow used above is from the 7.5 development series, which contains a problem around inclusion of a static parameter table on its heap. Using a version without the config table and some config minimzation tricks yields similar results, private memory footprint going from 708K to 516K (516/708=0.728). All on my 32-bit laptop of course. Memory on a 64-bit architecture goes from ~1,400K to ~1,000K (1000/1400=0.714).

There is clearly still more to be done here. In addition to evaluating the other libraries linked into the shadow and eliminating many of the static buffers, the memory allocations reported by Massif can be explored. However, the config table factored out there is likely only 100KB to explore.

NFS and Job Initial Working Directory (Iwd)

February 14, 2010

Condor deployments tend to include a network file system, such as NFS, AFS or SMB, which allows users easy access to their files across many machines. The presence of such file systems also means that a user can skip using Condor’s file transfer mechanisms and have their jobs write output or read input directly from the networked locations, often the user’s home directory. Condor is more than happy to do this, as long as the user’s credentials are available to access the home directory, which is often the case. Condor will even go one step further.

Sometime in the past, a user was automating their job submission to Condor, similar to what DAGMan does, and ran into a problem when their files were written to NFS. Their meta-scheduler, as they’re called, was reading job output files and getting stale cached data. This meant the job may have completed but the machine on which the meta-scheduler was running only saw part of the output. To get around this issue the condor_schedd, which in this case was managing jobs for the meta-scheduler, was changed to try and flush the NFS cache for the job’s Iwd. When a job completes the Schedd checks to see if the Iwd is on NFS, and if so creates a temporary file that is immediately deleted. The Schedd’s log reports “Forcing NFS sync of Iwd” and a .condor_nfs_sync_XXXXXX file briefly lives in the Iwd. This of course has pros and cons.

On the plus side, this is helpful to meta-schedulers because now they never have to bother making sure data sources aren’t stale. Arguably the meta-scheduler should be fixed in this situation. On the negative side, all jobs that have an Iwd in NFS now incur a penalty in the form of some NFS round trips when they complete. This penalty can actually be very dramatic, even halving the number of jobs a single Schedd can complete in a second.

To address the performance hit, in Condor 7.4, the IwdFlushNFSCache job attribute was introduced. It defaults to True, and can be changed in a submit file with +IwdFlushNFSCache = False or for all new jobs with IwdFlushNFSCache = False followed by SUBMIT_EXPRS = IwdFlushNFSCache in configuration. As expected, IwdFlushNFSCache works as a guard to the code in the condor_schedd that flushes the Iwd on job completion.

Maybe in future versions of Condor (7.5+) the default will become False and those who need the cache flushing functionality will place +IwdFlushNFSCache = True in their submit files.


%d bloggers like this: