Posts Tagged ‘Submit’

Submitting a DAG via Aviary using Python

September 16, 2011

Submitting individual jobs through Condor’s various interfaces is, unsurprisingly, the first thing people do. A quick second is submitting DAGs. I have previously discussed this in Java with BirdBath.

Aviary is a suite of APIs that expose Condor features via powerful, easy to use developer interfaces. It builds on experience from other implementations and takes an approach of exposing common use cases through clean abstractions, while maintaining the Condor philosophy of giving experts access to extended features.

The code is maintained in the contributions section of the Condor repository and is documented in the Grid Developer Guide.

The current implementation provides a SOAP interface for job submission, control and query. It is broken in two parts: a plugin to the condor_schedd that exposes the submission and control; and, a daemon, aviary_query_server, exposing the data querying capabilities.

Installation on Fedora 15 and beyond is a simple yum install condor-aviary. The condor-aviary package includes configuration placed in /etc/condor/config.d. A reconfig of the condor_master, to start the aviary_query_server, and restart of the condor_schedd, to load the plugin, is necessary.

Once installed, there are examples in the repository, including a python submit script.

Starting with the submit.py above, submit_dag.py is a straightforward extension following the CondorSubmitDAG.java example.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
#
# Copyright 2009-2011 Red Hat, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

# uses Suds - https://fedorahosted.org/suds/
from suds.client import Client
import sys, pwd, os, logging, argparse

def attr_builder(type_, format):
    def attr(name, value):
        attr = client.factory.create("ns0:Attribute")
        attr.name = name
        attr.type = type_
        attr.value = format % (value,)
        return attr
    return attr
string_attr=attr_builder('STRING', '%s')
int_attr=attr_builder('INTEGER', '%d')
expr_attr=attr_builder('EXPRESSION', '%s')


parser = argparse.ArgumentParser(description='Submit a job remotely via SOAP.')
parser.add_argument('-v', '--verbose', action='store_true',
                    default=False, help='enable SOAP logging')
parser.add_argument('-u', '--url', action='store', nargs='?', dest='url',
                    default='http://localhost:9090/services/job/submitJob',
                    help='http or https URL prefix to be added to cmd')
parser.add_argument('dag', action='store', help='full path to dag file')
args =  parser.parse_args()


uid = pwd.getpwuid(os.getuid())[0] or "nobody"

client = Client('file:/var/lib/condor/aviary/services/job/aviary-job.wsdl')

client.set_options(location=args.url)

if args.verbose:
    logging.basicConfig(level=logging.INFO)
    logging.getLogger('suds.client').setLevel(logging.DEBUG)
    print client

try:
    result = client.service.submitJob(
        '/usr/bin/condor_dagman',
        '-f -l . -Debug 3 -AutoRescue 1 -DoRescueFrom 0 -Allowversionmismatch -Lockfile %s.lock -Dag %s' % (args.dag, args.dag),
        uid,
        args.dag[:args.dag.rindex('/')],
        args.dag[args.dag.rindex('/')+1:],
        [],
        [string_attr('Env', '_CONDOR_MAX_DAGMAN_LOG=0;_CONDOR_DAGMAN_LOG=%s.dagman.out' % (args.dag,)),
         int_attr('JobUniverse', 7),
         string_attr('UserLog', args.dag + '.dagman.log'),
         string_attr('RemoveKillSig', 'SIGUSR1'),
         expr_attr('OnExitRemove', '(ExitSignal =?= 11 || (ExitCode =!= UNDEFINED && ExitCode >= 0 && ExitCode <= 2))')]
	)
except Exception, e:
    print 'invocation failed at: ', args.url
    print e
    sys.exit(1)	

if result.status.code != 'OK':
    print result.status.code,'; ', result.status.text
    sys.exit(1)

print args.verbose and result or result.id.job

Timeouts from condor_rm and condor_submit

December 16, 2009

The condor_schedd is an event driven process, like all other Condor daemons. It spends its time waiting in select(2) for events to process. Events include: condor_q queries, spawning and reaping condor_shadow processes, accepting condor_submit submissions, negotiating with the Negotiator, removing jobs during condor_rm. The responsiveness of the Schedd to user interaction, e.g. condor_q, condor_rm, condor_submit, and process interaction, e.g. messages with condor_shadow, condor_startd or condor_negotiator, is effected by how long it takes to process an event and how many events are waiting to be processed.

For instance, if a thousand condor_shadow processes start up at the same time there may be a thousand keep-alive messages for the Schedd to process after a single call to select. Once select returns, no new events will be considered until the Schedd calls select again. A condor_rm request would have to wait. Likewise, if any one event takes a long time to process, such as a negotiation cycle, it can also keep the Schedd from getting back to select and accepting new events.

Basically, to function well, the Schedd needs to get back to select as fast as possible.

From a user perspective, when the Schedd does not get back to select quickly, a condor_rm or condor_submit attempt may appear to fail, e.g.

$ time condor_rm -a

Could not remove all jobs.

real	0m20.069s
user	0m0.020s
sys	0m0.020s

As of the Condor 7.4 series, this rarely happens because of internal events that the Schedd is processing. The Schedd uses structures that allow such events to be interleaved with calls to select. However, some events still take long periods of time, e.g. the removal of 300,000 jobs above. One such event is a negotiation cycle initiated by the Negotiator. If a condor_rm, condor_q, condor_submit, etc happens during a negotiation, there is a good chance it may timeout.

Though a simple re-try of the tool will often succeed, this timeout may be annoying to users of the tools, be they people or processes. An alternative to a re-try is to extend the timeout used by the tool. The default timeout is 20 seconds, which is very often long enough, but may not be in large pools.

To extend the timeout for condor_submit, put SUBMIT_TIMEOUT_MULTIPLER=3 in the configuration file read by condor_submit. To extend the timeout for condor_q, condor_rm, etc, put TOOL_TIMEOUT_MULTILIER=3 in the configuration file read by the tool. These changes will take the default timeout, 20 seconds, and multiply it by 3, giving the Schedd 60 seconds to respond. For instance, with 100Ks of jobs in the queue:

$ _CONDOR_TOOL_TIMEOUT_MULTIPLIER=3 time condor_rm -a
All jobs marked for removal.
0.01user 0.02system 0:53.99elapsed 0%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+4374minor)pagefaults 0swaps