Archive for the ‘MRG’ Category

Submitting a DAG via Aviary using Python

September 16, 2011

Submitting individual jobs through Condor’s various interfaces is, unsurprisingly, the first thing people do. A quick second is submitting DAGs. I have previously discussed this in Java with BirdBath.

Aviary is a suite of APIs that expose Condor features via powerful, easy to use developer interfaces. It builds on experience from other implementations and takes an approach of exposing common use cases through clean abstractions, while maintaining the Condor philosophy of giving experts access to extended features.

The code is maintained in the contributions section of the Condor repository and is documented in the Grid Developer Guide.

The current implementation provides a SOAP interface for job submission, control and query. It is broken into two parts: a plugin to the condor_schedd that exposes submission and control, and a daemon, the aviary_query_server, that exposes the query capabilities.

Installation on Fedora 15 and beyond is a simple yum install condor-aviary. The condor-aviary package includes configuration placed in /etc/condor/config.d. A reconfig of the condor_master, to start the aviary_query_server, and a restart of the condor_schedd, to load the plugin, are then necessary.

Once installed, there are examples in the repository, including a python submit script.

Starting with the submit.py above, submit_dag.py is a straightforward extension following the CondorSubmitDAG.java example.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
#
# Copyright 2009-2011 Red Hat, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

# uses Suds - https://fedorahosted.org/suds/
from suds.client import Client
import sys, pwd, os, logging, argparse

def attr_builder(type_, format):
    def attr(name, value):
        attr = client.factory.create("ns0:Attribute")
        attr.name = name
        attr.type = type_
        attr.value = format % (value,)
        return attr
    return attr
string_attr=attr_builder('STRING', '%s')
int_attr=attr_builder('INTEGER', '%d')
expr_attr=attr_builder('EXPRESSION', '%s')


parser = argparse.ArgumentParser(description='Submit a job remotely via SOAP.')
parser.add_argument('-v', '--verbose', action='store_true',
                    default=False, help='enable SOAP logging')
parser.add_argument('-u', '--url', action='store', nargs='?', dest='url',
                    default='http://localhost:9090/services/job/submitJob',
                    help='http or https URL prefix to be added to cmd')
parser.add_argument('dag', action='store', help='full path to dag file')
args =  parser.parse_args()


uid = pwd.getpwuid(os.getuid())[0] or "nobody"

client = Client('file:/var/lib/condor/aviary/services/job/aviary-job.wsdl')

client.set_options(location=args.url)

if args.verbose:
    logging.basicConfig(level=logging.INFO)
    logging.getLogger('suds.client').setLevel(logging.DEBUG)
    print client

try:
    result = client.service.submitJob(
        '/usr/bin/condor_dagman',
        '-f -l . -Debug 3 -AutoRescue 1 -DoRescueFrom 0 -Allowversionmismatch -Lockfile %s.lock -Dag %s' % (args.dag, args.dag),
        uid,
        args.dag[:args.dag.rindex('/')],
        args.dag[args.dag.rindex('/')+1:],
        [],
        [string_attr('Env', '_CONDOR_MAX_DAGMAN_LOG=0;_CONDOR_DAGMAN_LOG=%s.dagman.out' % (args.dag,)),
         int_attr('JobUniverse', 7),
         string_attr('UserLog', args.dag + '.dagman.log'),
         string_attr('RemoveKillSig', 'SIGUSR1'),
         expr_attr('OnExitRemove', '(ExitSignal =?= 11 || (ExitCode =!= UNDEFINED && ExitCode >= 0 && ExitCode <= 2))')]
	)
except Exception, e:
    print 'invocation failed at: ', args.url
    print e
    sys.exit(1)	

if result.status.code != 'OK':
    print result.status.code,'; ', result.status.text
    sys.exit(1)

print args.verbose and result or result.id.job
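As an aside, the two slice expressions above pull the working directory and the DAG file name out of the full DAG path, and the long option string becomes condor_dagman’s arguments. A standalone sketch of that derivation with os.path; the path is only an example.

#!/usr/bin/env python
# Sketch: derive the pieces passed to submitJob above (the directory used as
# the Iwd, the DAG file's base name, and condor_dagman's argument string)
# from a full DAG path. The path below is only an example.
import os.path

def dagman_parts(dag):
    iwd, dag_name = os.path.split(dag)  # same split as the rindex('/') slicing above
    args = ('-f -l . -Debug 3 -AutoRescue 1 -DoRescueFrom 0 '
            '-Allowversionmismatch -Lockfile %s.lock -Dag %s' % (dag, dag))
    return iwd, dag_name, args

print(dagman_parts('/home/condor/diamond.dag'))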

Getting Started: Multiple node Condor pool with firewalls

June 21, 2011

Creating a Condor pool with no firewalls up is quite a simple task. Before the condor_shared_port daemon, doing the same with firewalls was a bit painful.

Condor uses dynamic ports for everything except the Collector. The Collector endpoint is the bootstrap. This means a Schedd might start up on a random ephemeral port, and each of its shadows might as well. This causes headaches for firewalls, as large ranges of ports need to be opened for communication. There are ways to control the ephemeral range used. Unfortunately, doing so just reduces the range somewhat, does not guarantee that only Condor is on those ports, and can limit scale.
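For reference, the ephemeral range can be narrowed with configuration along these lines; a sketch only, with arbitrary values:

# Sketch: confine Condor's dynamic ports to a range the firewall can open.
# This narrows the range, but still leaves a block of ports open and does
# not keep other software off of them.
LOWPORT = 9600
HIGHPORT = 9700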

The condor_shared_port daemon allows Condor to use a single inbound port on a machine.

Again, this is on Fedora 15. I had no luck with firewalld and firewall-cmd, so I fell back to using straight iptables.

The first thing to do is pick a port for Condor to use on your machines. The simplest thing to do is pick 9618, the port typically known as the Collector’s port.

On all machines where Condor is going to run, you want to –

# lokkit --enabled

# service iptables start
Starting iptables (via systemctl):  [  OK  ]

# service iptables status
Table: filter
Chain INPUT (policy ACCEPT)
num  target     prot opt source               destination
1    ACCEPT     all  --  0.0.0.0/0            0.0.0.0/0           state RELATED,ESTABLISHED
2    ACCEPT     icmp --  0.0.0.0/0            0.0.0.0/0
3    ACCEPT     all  --  0.0.0.0/0            0.0.0.0/0
4    REJECT     all  --  0.0.0.0/0            0.0.0.0/0 reject-with icmp-host-prohibited

Chain FORWARD (policy ACCEPT)
num  target     prot opt source               destination
1    REJECT     all  --  0.0.0.0/0            0.0.0.0/0 reject-with icmp-host-prohibited

Chain OUTPUT (policy ACCEPT)
num  target     prot opt source               destination

If you want to ssh to the machine again, be sure to insert rules above the final “REJECT all -- …” rule –

# iptables -I INPUT 4 -p tcp -m tcp --dport 22 -j ACCEPT

And open a port, both TCP and UDP, for the shared port daemon –

# iptables -I INPUT 5 -p tcp -m tcp --dport condor -j ACCEPT
# iptables -I INPUT 6 -p udp -m udp --dport condor -j ACCEPT

Next you want to configure Condor to use the shared port daemon, with port 9618 –

# cat > /etc/condor/config.d/41shared_port.config
SHARED_PORT_ARGS = -p 9618
DAEMON_LIST = $(DAEMON_LIST), SHARED_PORT
COLLECTOR_HOST = $(CONDOR_HOST)?sock=collector
USE_SHARED_PORT = TRUE
^D

In order, SHARED_PORT_ARGS tells the shared port daemon to listen on port 9618, DAEMON_LIST tells the master to start the shared port daemon, COLLECTOR_HOST specifies that the collector will be on the sock named “collector”, and finally USE_SHARED_PORT tells all daemons to register and use the shared port daemon.

After you put that configuration on all your systems, run service condor restart, and go.

You will have the shared port daemon listening on 9618 (condor), and all communication between machines will route through it.

# lsof -i | grep $(pidof condor_shared_port)
condor_sh 31040  condor    8u  IPv4  74105      0t0  TCP *:condor (LISTEN)
condor_sh 31040  condor    9u  IPv4  74106      0t0  UDP *:condor

That’s right, you have a condor pool with firewalls and a single port opened for communication on each node.
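If you want a quick sanity check from another machine that the shared port is the one reachable, something like the following works; a minimal sketch using only the standard library, where execute-node.example.com is a placeholder for one of your pool members:

#!/usr/bin/env python
# Sketch: check that the shared port (9618) is reachable and that an
# arbitrary other port is not. The hostname is a placeholder.
import socket

def port_open(host, port, timeout=2.0):
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(timeout)
    try:
        s.connect((host, port))
        return True
    except socket.error:
        return False
    finally:
        s.close()

print(port_open('execute-node.example.com', 9618))   # expect True
print(port_open('execute-node.example.com', 39618))  # expect False with the firewall up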

A protocol mismatch: Some unnecessary Starter->Shadow communication

May 1, 2011

During previous runs pushing an out-of-the-box Schedd to see what it can handle with respect to job rates, Shadows repeatedly complained about being delayed in reusing claims.

It turned out that a bug introduced in September 2010 was causing a protocol mismatch; with the bug removed, the Schedd can sustain about 95 job completions per second.

During the Condor 7.5 development series, the Schedd gained the ability to reuse existing Shadows to serially run jobs. This change has huge benefits in reducing load on the Schedd and the underlying OS. Though a topic for another time, Condor relies heavily on process separation in its architecture, and Shadows are effectively pooled workers – think workers in a thread pool.

Here are the relevant logs at the time of the delayed claim reuse.

ShadowLog –

04/21/11 15:12:49 (1.1) (23946): Reporting job exit reason 100 and attempting to fetch new job.
04/21/11 15:12:49 (1.1) (23946): Reading job ClassAd from STDIN
04/21/11 15:12:53 (1.1) (23946): Switching to new job 1.1
04/21/11 15:12:53 (?.?) (23946): Initializing a VANILLA shadow for job 1.1
04/21/11 15:12:53 (1.1) (23946): no UserLog found
04/21/11 15:12:53 (1.1) (23946): ClaimStartd is true, trying to claim startd 
04/21/11 15:12:53 (1.1) (23946): Requesting claim description
04/21/11 15:12:53 (1.1) (23946): in RemoteResource::initStartdInfo()
04/21/11 15:12:53 (1.1) (23946): Request was NOT accepted for claim description
04/21/11 15:12:53 (1.1) (23946): Completed REQUEST_CLAIM to startd description
04/21/11 15:12:53 (1.1) (23946): Entering DCStartd::activateClaim()
04/21/11 15:12:53 (1.1) (23946): DCStartd::activateClaim: successfully sent command, reply is: 2
04/21/11 15:12:53 (1.1) (23946): Request to run on none  was DELAYED (previous job still being vacated)
04/21/11 15:12:53 (1.1) (23946): activateClaim(): will try again in 1 seconds
04/21/11 15:12:54 (1.1) (23946): Entering DCStartd::activateClaim()
04/21/11 15:12:54 (1.1) (23946): DCStartd::activateClaim: successfully sent command, reply is: 1
04/21/11 15:12:54 (1.1) (23946): Request to run on none  was ACCEPTED

StartLog –

04/21/11 15:12:49 Changing activity: Idle -> Busy
04/21/11 15:12:49 Received TCP command 60008 (DC_CHILDALIVE) from unauthenticated@unmapped , access level DAEMON
04/21/11 15:12:49 Received job ClassAd update from starter.
04/21/11 15:12:49 Received job ClassAd update from starter.
04/21/11 15:12:49 Closing job ClassAd update socket from starter.
04/21/11 15:12:49 Called deactivate_claim_forcibly()
04/21/11 15:12:49 In Starter::kill() with pid 24278, sig 3 (SIGQUIT)
04/21/11 15:12:49 Send_Signal(): Doing kill(24278,3) [SIGQUIT]
04/21/11 15:12:49 in starter:killHard starting kill timer
04/21/11 15:12:52 9400356 kbytes available for ".../execute"
04/21/11 15:12:52 Publishing ClassAd for 'MIPS' to slot 1
04/21/11 15:12:52 Publishing ClassAd for 'KFLOPS' to slot 1
04/21/11 15:12:52 Trying to update collector 
04/21/11 15:12:52 Attempting to send update via TCP to collector eeyore.local 
04/21/11 15:12:52 Sent update to 1 collector(s)
04/21/11 15:12:53 Got REQUEST_CLAIM while in Claimed state, ignoring.
04/21/11 15:12:53 DaemonCore: No more children processes to reap.
04/21/11 15:12:53 Got activate claim while starter is still alive.
04/21/11 15:12:53 Telling shadow to try again later.
04/21/11 15:12:53 Starter pid 24278 exited with status 0
04/21/11 15:12:53 Canceled hardkill-starter timer (2535)
04/21/11 15:12:53 State change: starter exited
04/21/11 15:12:53 Changing activity: Busy -> Idle
04/21/11 15:12:54 Got activate_claim request from shadow ()

StarterLog –

04/21/11 15:12:49 Job 1.1 set to execute immediately
04/21/11 15:12:49 Starting a VANILLA universe job with ID: 1.1
04/21/11 15:12:49 IWD: /tmp
04/21/11 15:12:49 About to exec /bin/true 
04/21/11 15:12:49 Create_Process succeeded, pid=24279
04/21/11 15:12:49 Process exited, pid=24279, status=0
04/21/11 15:12:49 Got SIGQUIT.  Performing fast shutdown.
04/21/11 15:12:49 ShutdownFast all jobs.
04/21/11 15:12:53 condor_read() failed: recv() returned -1, errno = 104 Connection reset by peer, reading 5 bytes from .
04/21/11 15:12:53 IO: Failed to read packet header
04/21/11 15:12:53 **** condor_starter (condor_STARTER) pid 24278 EXITING WITH STATUS 0
04/21/11 15:12:54 Setting maximum accepts per cycle 4.

First, the ShadowLog shows the Shadow is being told, by the Startd, to delay claim reuse. The Shadow decides to sleep() for a second. Second, the StartLog shows the Startd rejecting the claim reuse because the Starter still exists. Third, the StarterLog shows the Starter still exists because it is waiting in condor_read().

With D_FULLDEBUG and D_NETWORK enabled, it turned out that the Starter was trying to talk to the Shadow, sending an update, at a point when the Shadow was not expecting to process anything from the Starter. The Shadow was not responding, causing the Starter to wait and eventually time out.
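For anyone trying to reproduce the hunt, the extra logging came from debug levels along these lines; a sketch of local configuration rather than the exact setup used:

# Sketch: raise Shadow and Starter logging to include network traffic.
SHADOW_DEBUG = D_FULLDEBUG D_NETWORK
STARTER_DEBUG = D_FULLDEBUG D_NETWORK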

The issue was that the update was not being properly blocked from executing during the shutdown. A two-line patch and a re-run of the submission-completion test resulted in zero delays in claim reuse and a bump to about 95 jobs processed per second.

The dip in the rates is from SHADOW_WORKLIFE, and is a topic for the discussion on pooled Shadows.

Quick walk with Condor: Looking at Scheduler performance w/o notification

April 24, 2011

Recently I did a quick walk with the Schedd, looking at its submission and completion rates. Out of the box, submitting jobs with no special consideration for performance, the Schedd comfortably ran 55 jobs per second.

Without sending notifications, the Schedd can sustain a rate of 85 jobs per second.

I ran the test again, this time with notification=never and in two configurations: first, with 500,000 jobs submitted upfront; second, with submissions occurring during completions. The goal was to get a sense of performance when the Shadow is not burdened with sending email notifications of job completions, and to see how the Schedd performs when servicing condor_submit at the same time it is running jobs.

First, submitting 500,000 jobs and then letting them drain showed a sustained rate of about 86 jobs per second.

Upfront submission of 500K jobs, drain at rate of 85 jobs per second

Second, building up the submission rate to about 100 jobs per second showed a sustained rate of about 83 jobs per second (81 shown in graph below).

Submission and drain rates of 85 jobs per second

The results are quite satisfying, and show the Schedd can sustain a reasonably high job execution rate at the same time it services submissions.

Quick walk with Condor: Looking at Scheduler performance

April 15, 2011

This was a simple walk to get a feel for what a single, out of the box, 7.6 condor_schedd could handle for a job load. There were about 1,500 dynamic slots and all jobs were 5 second sleeps.

It turns out that without errors in the system, a single Schedd can sustain at least a rate of 55 jobs per second.

Out of the box condor_schedd performance visualized with Cumin

Graphs courtesy of Cumin and Ernie Allen‘s recent visualization of Erik Erlandson‘s new Schedd stats. This is a drill-down into a Scheduler.

Here’s what happened. I started submitting 25 jobs (queue 25) every 5 seconds. You can’t see this, unfortunately; it is off the left side of the graphs. The submission, start and completion rates were all equal at 5 per second. Every five/ten/fifteen minutes, when I felt like it, I ramped that up a bit, by increasing the jobs per submit (queue 50 then 100 then 150 then 200) and the rate of submission (every 5 seconds then 2 seconds then 1). The scaling up matched my expectations. At 50 jobs every 5 seconds, I saw 10 jobs submitted/started/completed per second. At 100 jobs every 5 seconds, the rates were 20 per second. I eventually worked it up to rates of about 50-55 per second.

Then we got to the 30 minute mark in the graphs. B shows that Shadows started failing. I let this run for about 10 minutes, during which the Schedd’s rates fluctuated down to between 45 and 50 jobs per second, and then I kicked the submission rate up. The spike in submissions is quite visible to the right of A in the Rates graph.

At this point the Schedd was sustaining about 45 jobs per second and the Shadow exception rate held fairly steady. I decided to kill off the submissions, which is also quite visible. The Schedd popped back up to 55 jobs per second and finished off its backlog.

A bit about the errors, simple investigation: condor_status -sched -long | grep ExitCode, oh a bunch of 108s; grep -B10 “STATUS 108” /var/log/condor/ShadowLog, oh a bunch of network issues and some evictions; pull out the hosts the Shadows were connecting to, oh mostly a couple nodes that were reported as being flakey; done.

Throughout this entire walk, the Mean start times turned out to be an interesting graph. It shows two things: 0) Mean time to start cumulative – the mean time from when a job is queued to when it first starts running, over the lifetime of the Schedd; and, 1) Mean time to start – the same metric, but over an adjustable window, defaulting to 5 minutes. Until the exceptions started and I blew the submissions out, around C, the mean queue time/wait time/time to start was consistently between 4 and 6 seconds. I did not look at the service rate on the back side of the job’s execution, e.g. how long it waited before termination was recognized, mail was sent, etc.

That’s about it. Though, remember this was done with about 1,500 slots. I didn’t vary the number of slots, say to see if the Schedd would have issues sustaining more than 1,500 concurrent jobs. I also didn’t do much in the way of calculations to compare the mean start time to the backlog (idle job count).

Also, if anyone tries this themselves, I would suggest notification=never in your submissions. It will prevent an email getting sent for each job, will save the Schedd and Shadow a good amount of work, and in my case would have resulted in a 389MB reduction in email.
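For reference, here is a minimal sketch of the kind of sleep-job submission used in these walks, with notification=never set; it is wrapped in a little Python so the submit description is visible in one place, and it assumes condor_submit is on the PATH (the file name and queue count are arbitrary):

#!/usr/bin/env python
# Sketch: write a trivial sleep-job submit description with notification=never
# and hand it to condor_submit. File name and queue count are arbitrary.
import subprocess

SUBMIT = """\
universe     = vanilla
executable   = /bin/sleep
arguments    = 5
notification = never
queue 25
"""

with open('sleep.submit', 'w') as f:
    f.write(SUBMIT)

subprocess.check_call(['condor_submit', 'sleep.submit'])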

Where does the memory go? The condor_shadow.

May 24, 2010

The condor_schedd’s architecture currently represents each running job as a condor_shadow process it can directly manage. I’m going to skip the pros & cons of threading vs processes, or multiplexing vs not. Those are great topics, but not of interest right now. The condor_schedd uses processes and does not multiplex, so when it is managing 10,000 running jobs it will have 10,000 running condor_shadow processes. Other than the shock of 10,000 processes, what are some immediate concerns? 1) CPU usage – actually very small, condor_shadow processes spend all their time waiting on I/O; 2) I/O usage – condor_shadows may spend all their time waiting on I/O, but when that I/O happens it is typically small, maybe 10K of network and a few K of disk; 3) memory usage – now that’s an interesting one.

Quick ways to look at memory usage?

1) top, but it’s not going to be as helpful as you might like.

$ top -n1 -p $(pidof condor_shadow-base)                          
...
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
10487 matt      20   0 10092 3756 3176 S  0.0  0.2   0:00.01 condor_shadow-b

This gives a general idea about the memory usage of the condor_shadow. It has about 3756KB of resident memory and is sharing about 3176KB.

What we really want to see is the amount of memory that cannot be shared between condor_shadow processes. This will give us a rough lower bound on how much memory is needed for 10,000 condor_shadows. top might suggest that number is 3756-3176K=580K, but that’s a common mistake. A quick web search will tell you why.

2) pmap, which will give a better idea, and show where the memory is going.

$ pmap -d $(pidof condor_shadow-base)
Address   Kbytes Mode  Offset           Device    Mapping
00110000     884 r-x-- 0000000000000000 0fd:00000 libstdc++.so.6.0.13
001ed000      16 r---- 00000000000dc000 0fd:00000 libstdc++.so.6.0.13
001f1000       8 rw--- 00000000000e0000 0fd:00000 libstdc++.so.6.0.13
001f3000      24 rw--- 0000000000000000 000:00000   [ anon ]
001f9000      44 r-x-- 0000000000000000 0fd:00000 libnss_files-2.11.1.so
00204000       4 r---- 000000000000a000 0fd:00000 libnss_files-2.11.1.so
00205000       4 rw--- 000000000000b000 0fd:00000 libnss_files-2.11.1.so
0063a000     116 r-x-- 0000000000000000 0fd:00000 libgcc_s-4.4.3-20100127.so.1
00657000       4 rw--- 000000000001c000 0fd:00000 libgcc_s-4.4.3-20100127.so.1
00733000       4 r-x-- 0000000000000000 000:00000   [ anon ]
00745000       8 r-x-- 0000000000000000 0fd:00000 libcom_err.so.2.1
00747000       4 rw--- 0000000000002000 0fd:00000 libcom_err.so.2.1
0074a000       8 r-x-- 0000000000000000 0fd:00000 libkeyutils-1.2.so
0074c000       4 rw--- 0000000000001000 0fd:00000 libkeyutils-1.2.so
0074f000     716 r-x-- 0000000000000000 0fd:00000 libkrb5.so.3.3
00802000      24 rw--- 00000000000b3000 0fd:00000 libkrb5.so.3.3
0082d000     276 r-x-- 0000000000000000 0fd:00000 libfreebl3.so
00872000       4 rw--- 0000000000045000 0fd:00000 libfreebl3.so
00873000      16 rw--- 0000000000000000 000:00000   [ anon ]
00880000     180 r-x-- 0000000000000000 0fd:00000 libgssapi_krb5.so.2.2
008ad000       4 rw--- 000000000002d000 0fd:00000 libgssapi_krb5.so.2.2
008b0000     168 r-x-- 0000000000000000 0fd:00000 libk5crypto.so.3.1
008da000       4 rw--- 000000000002a000 0fd:00000 libk5crypto.so.3.1
008dd000      32 r-x-- 0000000000000000 0fd:00000 libkrb5support.so.0.1
008e5000       4 rw--- 0000000000007000 0fd:00000 libkrb5support.so.0.1
008e8000      28 r-x-- 0000000000000000 0fd:00000 libcrypt-2.11.1.so
008ef000       4 r---- 0000000000007000 0fd:00000 libcrypt-2.11.1.so
008f0000       4 rw--- 0000000000008000 0fd:00000 libcrypt-2.11.1.so
008f1000     156 rw--- 0000000000000000 000:00000   [ anon ]
0091a000     328 r-x-- 0000000000000000 0fd:00000 libssl.so.1.0.0
0096c000      16 rw--- 0000000000051000 0fd:00000 libssl.so.1.0.0
00974000     120 r-x-- 0000000000000000 0fd:00000 ld-2.11.1.so
00992000       4 r---- 000000000001d000 0fd:00000 ld-2.11.1.so
00993000       4 rw--- 000000000001e000 0fd:00000 ld-2.11.1.so
00996000    1464 r-x-- 0000000000000000 0fd:00000 libc-2.11.1.so
00b04000       8 r---- 000000000016e000 0fd:00000 libc-2.11.1.so
00b06000       4 rw--- 0000000000170000 0fd:00000 libc-2.11.1.so
00b07000      12 rw--- 0000000000000000 000:00000   [ anon ]
00b0c000      88 r-x-- 0000000000000000 0fd:00000 libpthread-2.11.1.so
00b22000       4 r---- 0000000000015000 0fd:00000 libpthread-2.11.1.so
00b23000       4 rw--- 0000000000016000 0fd:00000 libpthread-2.11.1.so
00b24000       8 rw--- 0000000000000000 000:00000   [ anon ]
00b28000      12 r-x-- 0000000000000000 0fd:00000 libdl-2.11.1.so
00b2b000       4 r---- 0000000000002000 0fd:00000 libdl-2.11.1.so
00b2c000       4 rw--- 0000000000003000 0fd:00000 libdl-2.11.1.so
00b2f000     160 r-x-- 0000000000000000 0fd:00000 libm-2.11.1.so
00b57000       4 r---- 0000000000027000 0fd:00000 libm-2.11.1.so
00b58000       4 rw--- 0000000000028000 0fd:00000 libm-2.11.1.so
00b5b000      72 r-x-- 0000000000000000 0fd:00000 libz.so.1.2.3
00b6d000       4 rw--- 0000000000011000 0fd:00000 libz.so.1.2.3
00b7b000     112 r-x-- 0000000000000000 0fd:00000 libselinux.so.1
00b97000       4 r---- 000000000001b000 0fd:00000 libselinux.so.1
00b98000       4 rw--- 000000000001c000 0fd:00000 libselinux.so.1
00b9b000      80 r-x-- 0000000000000000 0fd:00000 libresolv-2.11.1.so
00baf000       4 ----- 0000000000014000 0fd:00000 libresolv-2.11.1.so
00bb0000       4 r---- 0000000000014000 0fd:00000 libresolv-2.11.1.so
00bb1000       4 rw--- 0000000000015000 0fd:00000 libresolv-2.11.1.so
00bb2000       8 rw--- 0000000000000000 000:00000   [ anon ]
042a3000    1476 r-x-- 0000000000000000 0fd:00000 libcrypto.so.1.0.0
04414000      80 rw--- 0000000000170000 0fd:00000 libcrypto.so.1.0.0
04428000      12 rw--- 0000000000000000 000:00000   [ anon ]
049f0000     188 r-x-- 0000000000000000 0fd:00000 libpcre.so.0.0.1
04a1f000       4 rw--- 000000000002e000 0fd:00000 libpcre.so.0.0.1
08048000    2496 r-x-- 0000000000000000 0fd:00001 condor_shadow-base
082b8000       8 rw--- 0000000000270000 0fd:00001 condor_shadow-base
082ba000      12 rw--- 0000000000000000 000:00000   [ anon ]
09c64000     396 rw--- 0000000000000000 000:00000   [ anon ]
b7819000      24 rw--- 0000000000000000 000:00000   [ anon ]
b7839000      12 rw--- 0000000000000000 000:00000   [ anon ]
bf930000      84 rw--- 0000000000000000 000:00000   [ stack ]
mapped: 10092K    writeable/private: 972K    shared: 0K

“writeable/private: 972K” is a better number to go by. It suggests that 10,000 shadows will require at least 972KB * 10,000 / 2^20 = 9.269GB of memory.
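As a quick sanity check on that arithmetic:

# Back-of-the-envelope: private KB per shadow scaled up to 10,000 shadows.
private_kb = 972
shadows = 10000
print("%.2f GB" % (private_kb * shadows / 2.0**20))  # prints ~9.27 GB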

So that’s where the memory is. Now why is it there?

I’ve been playing with Massif lately, so it seemed like a reasonable place to start. It can produce a usage graph,

    KB
335.8^                                                                      # 
     |                                                  ::::::::::@:::::::::#:
     |                                             @@:@::::: :::: @: : :::::#:
     |                                       ::::::@ :@::::: :::: @: : :::::#:
     |                                       ::::::@ :@::::: :::: @: : :::::#:
     |                                @::::::::::::@ :@::::: :::: @: : :::::#:
     |                         ::@::@:@:::: :::::::@ :@::::: :::: @: : :::::#:
     |                       @:: @::@:@:::: :::::::@ :@::::: :::: @: : :::::#:
     |                      :@:: @::@:@:::: :::::::@ :@::::: :::: @: : :::::#:
     |                     ::@:: @::@:@:::: :::::::@ :@::::: :::: @: : :::::#:
     |                    @::@:: @::@:@:::: :::::::@ :@::::: :::: @: : :::::#:
     |                    @::@:: @::@:@:::: :::::::@ :@::::: :::: @: : :::::#:
     |                  @@@::@:: @::@:@:::: :::::::@ :@::::: :::: @: : :::::#:
     |                  @ @::@:: @::@:@:::: :::::::@ :@::::: :::: @: : :::::#:
     |                 @@ @::@:: @::@:@:::: :::::::@ :@::::: :::: @: : :::::#:
     |                @@@ @::@:: @::@:@:::: :::::::@ :@::::: :::: @: : :::::#:
     |                @@@ @::@:: @::@:@:::: :::::::@ :@::::: :::: @: : :::::#:
     |              @@@@@ @::@:: @::@:@:::: :::::::@ :@::::: :::: @: : :::::#:
     |              @ @@@ @::@:: @::@:@:::: :::::::@ :@::::: :::: @: : :::::#:
     |             @@ @@@ @::@:: @::@:@:::: :::::::@ :@::::: :::: @: : :::::#:
   0 +----------------------------------------------------------------------->Mi
     0                                                                   17.22

Massif tells us about dynamic memory allocations, which in this case amount to 335KB. That is only a third of the 972K of private memory reported by pmap. This suggests that two-thirds of the private memory in the shadow may be from static allocations.

objdump is another tool I’ve been using lately to inspect ELF binaries.

$ objdump -t condor_shadow-base | grep -e "\.bss" -e "\.data" | sort -k5 | c++filt | tail
082bc900 l     O .bss	00000100              pe_logname.7591
082bca00 l     O .bss	00000100              pe_user.7592
082bc160 l     O .bss	00000100              priv_identifier::id
082b8d80 l     O .data	00000120              CondorEnvironList
082bbbc0 g     O .bss	000001c4              ConfigTab
082bbf20 l     O .bss	00000200              priv_history
082ba920 l     O .bss	00001000              answer.7409
082b9920 l     O .bss	00001000              answer.7437
082b89a0 g       .data	00000000              __data_start
082b89a0  w      .data	00000000              data_start

What’s this? Two 1K static buffers named answer.

* * *

After some code inspection it turned out that two functions in ckpt_name.c contained static char answer[MAXPATHLEN], where MAXPATHLEN is of course 1024. One of those functions, gen_exec_name, was dead code and easy to remove. The other, gen_ckpt_name, was active, with all but one of its call sites immediately strdup()’ing its result. After necessary modifications,

$ objdump -t condor_shadow-sans-answers | grep -e "\.bss" -e "\.data" | sort -k5 | c++filt | tail
082b9720 g     O .bss	000000e8              MaxLog
082b8b00 l     O .data	000000f0              SigNameArray
082ba900 l     O .bss	00000100              pe_logname.7591
082baa00 l     O .bss	00000100              pe_user.7592
082ba160 l     O .bss	00000100              priv_identifier::id
082b8d80 l     O .data	00000120              CondorEnvironList
082b9bc0 g     O .bss	000001c4              ConfigTab
082b9f20 l     O .bss	00000200              priv_history
082b89a0 g       .data	00000000              __data_start
082b89a0  w      .data	00000000              data_start

And while running,

$ pmap -d $(pidof condor_shadow-sans-answers) | grep -e rw -e private
...
082b8000       8 rw--- 0000000000270000 0fd:00001 condor_shadow-sans-answers
082ba000       4 rw--- 0000000000000000 000:00000   [ anon ]
089f0000     396 rw--- 0000000000000000 000:00000   [ anon ]
b7877000      24 rw--- 0000000000000000 000:00000   [ anon ]
b7897000      12 rw--- 0000000000000000 000:00000   [ anon ]
bff58000      84 rw--- 0000000000000000 000:00000   [ stack ]
mapped: 10084K    writeable/private: 964K    shared: 0K

We see an expected decrease in the shadow’s private address space. But, even though those answer buffers were the largest around, we’re aiming at small game. This is a ~0.8% improvement, or ~78MB out of our ~9.269GB. At least we now have a procedure for further reducing memory, even if it is a fraction of a percent at a time. In fact, with a little more digging there are more possibilities here,

$ objdump -t condor_shadow-sans-answers | grep -e "\.bss" -e "\.data" | sort -k5 | c++filt | \
   grep NULL_XACTION | tail -n1                     
082babbc l     O .bss	00000004              classad::NULL_XACTION

While classad::NULL_XACTION may be small, there are quite a few of them,

$ objdump -t condor_shadow-sans-answers | grep -e "\.bss" -e "\.data" | sort -k5 | c++filt | \
   awk '{print $6}' | sort | uniq -c | sort -n | tail
      2 DEFAULT_INDENT
      2 EvalBool(compat_classad::ClassAd*,
      2 SharedPortEndpoint::SharedPortEndpoint(char
      2 SharedPortEndpoint::UseSharedPort(MyString*,
      2 std::__ioinit
      2 SubsystemInfo::setClass(SubsystemInfoLookup
      3 vtable
     11 guard
    107 classad::NULL_XACTION

* * *

And it turns out classad::NULL_XACTION is not only defined in a header file that is included in many places, but it is completely unused. Note: Please don’t put data like NULL_XACTION in a header.

$ objdump -t condor_shadow-sans-answers-null_xaction | grep -e "\.bss" -e "\.data" | sort -k5 | c++filt | \
   awk '{print $6}' | sort | uniq -c | sort -n | tail
      1 ZZZZZ
      2 DEFAULT_INDENT
      2 EvalBool(compat_classad::ClassAd*,
      2 SharedPortEndpoint::SharedPortEndpoint(char
      2 SharedPortEndpoint::UseSharedPort(MyString*,
      2 std::__ioinit
      2 SubsystemInfo::setClass(SubsystemInfoLookup
      3 vtable
     11 guard

Since NULL_XACTION is not actually used, its removal should not change the private address space of the shadow, but instead the overall binary and mapped memory size.

$ pmap -d $(pidof condor_shadow-sans-answers-null_xaction) | grep -e rw -e private
...
082b2000       8 rw--- 0000000000269000 0fd:00001 condor_shadow-sans-answers-null_xaction
082b4000       4 rw--- 0000000000000000 000:00000   [ anon ]
08a0a000     396 rw--- 0000000000000000 000:00000   [ anon ]
b77f5000      24 rw--- 0000000000000000 000:00000   [ anon ]
b7815000      12 rw--- 0000000000000000 000:00000   [ anon ]
bfd28000      84 rw--- 0000000000000000 000:00000   [ stack ]
mapped: 10060K    writeable/private: 964K    shared: 0K

Which appears to be the case,

$ size *
   text	   data	    bss	    dec	    hex	filename
2554154	   5032	  14360	2573546	 2744ea	condor_shadow-base
2554146	   5032	   6168	2565346	 2724e2	condor_shadow-sans-answers
2528350	   4648	   5724	2538722	 26bce2	condor_shadow-sans-answers-null_xaction

Ok, so that was just a diversion. Back to big game. From the original pmap dump, there are a number of libraries linked in with the shadow that are using memory of their own. One of those libraries, libcrypt, is using far more than the others.

008e8000      28 r-x-- 0000000000000000 0fd:00000 libcrypt-2.11.1.so
008ef000       4 r---- 0000000000007000 0fd:00000 libcrypt-2.11.1.so
008f0000       4 rw--- 0000000000008000 0fd:00000 libcrypt-2.11.1.so
008f1000     156 rw--- 0000000000000000 000:00000   [ anon ]

* * *

The existing build system for Condor, which uses imake(!), has a nasty habit of linking all binaries with all libraries. And as it turns out, crypt is not used by Condor code anymore. It has been replaced by OpenSSL’s DES implementation, likely many years ago.

Running without libcrypt,

$ pmap -d $(pidof condor_shadow-sans-answers-null_xaction-crypt) | grep -e rw -e private 
005e6000       4 rw--- 000000000000b000 0fd:00000 libnss_files-2.11.1.so
00657000       4 rw--- 000000000001c000 0fd:00000 libgcc_s-4.4.3-20100127.so.1
0073b000       8 rw--- 00000000000e0000 0fd:00000 libstdc++.so.6.0.13
0073d000      24 rw--- 0000000000000000 000:00000   [ anon ]
00747000       4 rw--- 0000000000002000 0fd:00000 libcom_err.so.2.1
0074c000       4 rw--- 0000000000001000 0fd:00000 libkeyutils-1.2.so
00802000      24 rw--- 00000000000b3000 0fd:00000 libkrb5.so.3.3
008ad000       4 rw--- 000000000002d000 0fd:00000 libgssapi_krb5.so.2.2
008da000       4 rw--- 000000000002a000 0fd:00000 libk5crypto.so.3.1
008e5000       4 rw--- 0000000000007000 0fd:00000 libkrb5support.so.0.1
0096c000      16 rw--- 0000000000051000 0fd:00000 libssl.so.1.0.0
00993000       4 rw--- 000000000001e000 0fd:00000 ld-2.11.1.so
00b06000       4 rw--- 0000000000170000 0fd:00000 libc-2.11.1.so
00b07000      12 rw--- 0000000000000000 000:00000   [ anon ]
00b23000       4 rw--- 0000000000016000 0fd:00000 libpthread-2.11.1.so
00b24000       8 rw--- 0000000000000000 000:00000   [ anon ]
00b2c000       4 rw--- 0000000000003000 0fd:00000 libdl-2.11.1.so
00b58000       4 rw--- 0000000000028000 0fd:00000 libm-2.11.1.so
00b6d000       4 rw--- 0000000000011000 0fd:00000 libz.so.1.2.3
00b98000       4 rw--- 000000000001c000 0fd:00000 libselinux.so.1
00bb1000       4 rw--- 0000000000015000 0fd:00000 libresolv-2.11.1.so
00bb2000       8 rw--- 0000000000000000 000:00000   [ anon ]
04414000      80 rw--- 0000000000170000 0fd:00000 libcrypto.so.1.0.0
04428000      12 rw--- 0000000000000000 000:00000   [ anon ]
04a1f000       4 rw--- 000000000002e000 0fd:00000 libpcre.so.0.0.1
082ad000       8 rw--- 0000000000264000 0fd:00001 condor_shadow-sans-answers-null_xaction-crypt
082af000       4 rw--- 0000000000000000 000:00000   [ anon ]
0908e000     396 rw--- 0000000000000000 000:00000   [ anon ]
b77bd000      20 rw--- 0000000000000000 000:00000   [ anon ]
b77dc000      12 rw--- 0000000000000000 000:00000   [ anon ]
bf820000      84 rw--- 0000000000000000 000:00000   [ stack ]
mapped: 9548K    writeable/private: 780K    shared: 0K

We see a huge improvement in both mapped and private memory usage. 780K/972K shows a ~19% improvement, or a reduction in memory of ~1.8GB over 10,000 running jobs.

BIG NOTE: The condor_shadow used above is from the 7.5 development series, which has a problem where a static parameter table is included on its heap. Using a version without the config table and some config minimization tricks yields similar results, the private memory footprint going from 708K to 516K (516/708=0.728). All on my 32-bit laptop of course. Memory on a 64-bit architecture goes from ~1,400K to ~1,000K (1000/1400=0.714).

There is clearly still more to be done here. In addition to evaluating the other libraries linked into the shadow and eliminating many of the static buffers, the memory allocations reported by Massif can be explored. However, with the config table factored out, there is likely only about 100KB left to explore.

Condor Week 2010

April 25, 2010

Condor Week 2010 concluded about a week ago. As always, it included interesting talks from academic and corporate users about the advances they are making, enabled by Condor. It also had updates on development over the past year, both at UW and Red Hat, and plans for the future.

I got to present an update on what Red Hat is doing with Condor, including configuration management, fostering transparency, performance enhancements, cloud integration, and new features.

Will Benton presented his and Rob Rati’s work on configuration management with Wallaby. Definitely something to keep an eye on.

With all the interesting talks there are many opportunities to see great advances. I found Greg Thain’s talk on High Throughput Parallel Computing (HTPC) especially interesting as it marks completion of a cycle in how resources are modeled in a batch scheduler.

Condor the Cloud Scheduler @ Open Source Cloud Computing Forum, 10 Feb 2010

April 23, 2010

Red Hat hosted another Open Source Cloud Computing forum on 10 February 2010. It included a number of new technologies under development at Red Hat. I presented a second session in a series on Condor, this time as a cloud scheduler. Like before, the proceedings of the forum will be available online for 12 months. The Condor: Cloud Scheduler presentation is also attached to this blog.

Setup a two node Condor pool (Red Hat KB)

January 14, 2010

I have been meaning to write a short getting-started tutorial about installing a multiple-node Condor pool, but it just hasn’t happened yet. Lucky for everyone, Jon Thomas at Red Hat has written one. Check out his How do I install a basic two node MRG grid?, which covers installation on Linux very nicely.

Submitting to Condor from QMF: A minimal Vanilla Universe job

April 30, 2009

Part of Red Hat’s MRG product is a management system that covers Condor. At the core of that system is the Qpid Management Framework (QMF), built on top of AMQP. Condor components modeled in QMF allow for information retrieval and control injection. I previously discussed How QMF Submission to Condor Could work, and now there’s a prototype of that interface.

Jobs in Condor are represented as ClassAds, a.k.a. job ads, which are essentially schema-free. Any schema they have is defined by the code paths that handle jobs. This includes not only the names of attributes but also the types of their values. For instance, a job ad does not have to have an attribute that describes what program to execute (the Cmd attribute, whose value is a string), but if it does then the job can actually run.

Who cares what’s in a job ad?

Almost all components in a Condor system handle jobs in one way or another, which means they all help define what a job ad looks like. To complicate things a bit, different components handle jobs in different ways depending on the job’s type or requested features. For simplicity I’m only going to discuss jobs that want to run in the Vanilla Universe with minimal extra features, e.g. no file transfer requests.

The Schedd is the first component that deals with jobs. Its purpose is to manage them. It insists all jobs it manages have: an Owner string, an identity of the job’s owner; a ClusterId string, an identifier assigned by the Schedd; a ProcId string, an identifier within the ClusterId, also assigned by the Schedd; a JobStatus integer, specifying the state the job is in, e.g. Idle, Held, Running; and, a JobUniverse integer, denoting part of a job’s type, e.g. Vanilla Universe or Grid Universe.

The Negotiator is sent jobs by the Schedd to perform matching with machines. To do that matching it requires that a job have a Requirements expression.

The Shadow helps the Schedd manage a job while it is running. The Schedd gives the job ad to the Shadow, and the Shadow insists on finding an Iwd string representing a path to the job’s initial working directory.

The Startd handles jobs similarly to the Negotiator. Since matching is bi-directional in Condor and execution resources get final say in running a job, the Startd evaluates the job’s Requirements expression before accepting it.

The Starter is responsible for actually running the job. It requires that a job have the Cmd string attribute. Without one the Starter does not know what program to run. An additional requirement the Starter imposes is the validity of the Owner string. If the Starter is to impersonate the Owner, then the string must specify an identity known on the Starter’s machine.

Now, those are not the only attributes on a job ad. Often components will fill in sane defaults for attributes they need that are not present. For instance, the Shadow will fill in the TransferFiles string, and the Schedd will assume a MaxHosts integer.

What about tools?

It’s not just the components that manage jobs that care about attributes on a job. The condor_q command-line tool also expects to find certain attributes to help it display information. In addition to those attributes already described, it requires: a QDate integer, the number of seconds since EPOCH when the job was submitted; a RemoteUserCpu float, an attribute set while a job runs; a JobPrio integer, specifying the job’s relative priority to other jobs; and, an ImageSize integer, specifying the amount of memory used by the job in KiB.

What does a submitter care about?

Anyone who wants to submit a job to Condor is going to have to specify enough attributes on the job ad to make all the Condor components happy. Those attributes will vary depending on the type of job and the features it desires. Luckily Condor can fill in many attributes with sane defaults. For instance, condor_q wants a QDate and a RemoteUserCpu. Those are two attributes that the submitter should not have to, or even be able to, specify.

To perform a simple submission for running a pre-staged program, the job ad needs: a Cmd, a Requirements, a JobUniverse, an Iwd, and an Owner. Additionally, an Args string is needed if the Cmd takes arguments.

Given the knowledge of what attributes are required on a job ad, and using the QMF Python Console Tutorial, I was able to quickly write up an example program to submit a job via QMF.

#!/usr/bin/env python

from qmf.console import Session
from sys import exit

(EXPR_TYPE, INTEGER_TYPE, FLOAT_TYPE, STRING_TYPE) = (0, 1, 2, 3)
UNIVERSE = {"VANILLA": 5, "SCHEDULER": 7, "GRID": 9, "JAVA": 10, "PARALLEL": 11, "LOCAL": 12, "VM": 13}
JOB_STATUS = ("", "Idle", "Running", "Removed", "Completed", "Held", "")

ad = {"Cmd":          {"TYPE": STRING_TYPE,  "VALUE": "/bin/sleep"},
      "Args":         {"TYPE": STRING_TYPE,  "VALUE": "120"},
      "Requirements": {"TYPE": EXPR_TYPE,    "VALUE": "TRUE"},
      "JobUniverse":  {"TYPE": INTEGER_TYPE, "VALUE": "%s" % (UNIVERSE["VANILLA"],)},
      "Iwd":          {"TYPE": STRING_TYPE,  "VALUE": "/tmp"},
      "Owner":        {"TYPE": STRING_TYPE,  "VALUE": "nobody"}}

session = Session(); session.addBroker("amqp://localhost:5672")
schedulers = session.getObjects(_class="scheduler", _package="mrg.grid")
result = schedulers[0].Submit(ad)

if result.status:
    print "Error submitting job:", result.text
    exit(1)

print "Submitted job:", result.Id

jobs = session.getObjects(_class="job", _package="mrg.grid")
job = reduce(lambda x,y: x or (y.CustomId == result.Id and y), jobs, False)
if not job:
    print "Did not find job"
    exit(2)

print "Job status:", JOB_STATUS[job.JobStatus]
print "Job properties:"
for prop in job.getProperties():
    print " ",prop[0],"=",prop[1]

The program does more than just submit; it also looks up the job based on what has been published via QMF. The job that it does submit runs /bin/sleep 120 as the nobody user, from /tmp, on any execution node (Requirements = TRUE).

The job’s ClassAd is presented as nested maps. The top level map holds attribute names mapped to values. Those values are themselves maps that specify the type of the actual value and a representation of the actual value. All representations of values are strings. The type specifies how the string should be handled, e.g. if it should be parsed into an int or float.
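To add another attribute under this convention you extend the map the same way. A purely illustrative fragment, reusing the type constant from the script above; JobPrio is just an example attribute:

# Purely illustrative: one more attribute in the same nested-map form.
INTEGER_TYPE = 1  # as defined in the script above

ad = {"JobPrio": {"TYPE": INTEGER_TYPE, "VALUE": "10"}}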

Two good sources of information about job ad attributes are the UW’s Condor Manual’s Appendix, and the output from condor_submit -dump job.ad when run against a job submission file.

