How to run a second Startd on one machine

In case you did not know, you can run multiple copies of Condor on a single machine. You can even run multiple copies of individual Condor daemons on a single machine.

You may want to do this to help overprovision a system, to enforce strict workload isolation, or to run an administrative shadow pool.

The condor_master makes some effort to prevent duplicate copies from running against the same configuration, so the important thing is to figure out which configuration must be unique.

Here is the configuration for running a second Startd on a machine. Put it in /etc/condor/config.d/31shadow_startd.config.

# Define the daemon name for the shadow startd, to be put in
# DAEMON_LIST. The master will read DAEMON_LIST and look for
# configuration for each daemon, such as SHADOW_STARTD_ARGS and
# SHADOW_STARTD_ENVIRONMENT.
SHADOW_STARTD = $(STARTD)

# Arguments for the SHADOW_STARTD daemon:
#  -f means foreground (eliminates the need to add SHADOW_STARTD to DC_DAEMON_LIST)
#  -local-name gives the SHADOW_STARTD a namespace for its configuration
SHADOW_STARTD_ARGS = -f -local-name SHADOW_STARTD

# The startd spawns starters, each of which writes to a log file. That
# log file is found by looking up the STARTER_LOG configuration. There
# is no STARTER_ARGS, similar to SHADOW_STARTD_ARGS, to pass
# -local-name. Instead the STARTER_LOG is put in the environment of
# the startd, by the master, so it can be inherited by the
# starters. Similarly, the startd and starters share a procd. Multiple
# startd/starter sets sharing a procd will result in failures. So a
# unique PROCD_ADDRESS is also provided.
SHADOW_STARTD_ENVIRONMENT = "_CONDOR_STARTER_LOG=$(STARTER_LOG).shadow _CONDOR_PROCD_ADDRESS=$(PROCD_ADDRESS).shadow"

# Configuration in the SHADOW_STARTD namespace:
SHADOW_STARTD.STARTD_LOG = $(STARTD_LOG).shadow
SHADOW_STARTD.STARTD_ADDRESS_FILE = $(STARTD_ADDRESS_FILE).shadow
# A unique EXECUTE directory; create it owned by condor, with the same
# mode as the main EXECUTE directory (the mkdir below uses 0755).
SHADOW_STARTD.EXECUTE = $(EXECUTE).shadow
# A unique name to avoid collisions in the collector.
SHADOW_STARTD.STARTD_NAME = SHADOW_STARTD
# Put an extra attribute on the slots from the SHADOW_STARTD, useful
# in identifying them.
SHADOW_STARTD.IsShadowStartd = TRUE
SHADOW_STARTD.STARTD_ATTRS = $(STARTD_ATTRS), IsShadowStartd
# The shadow startd may have special requirements for jobs.
#SHADOW_STARTD.START = IsForShadowStartd =?= TRUE

# Add the SHADOW_STARTD to the set of daemons managed by the master.
DAEMON_LIST = $(DAEMON_LIST), SHADOW_STARTD
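Before starting anything, it can help to confirm that the values the two startds must not share really do differ. The following is a minimal sketch. It assumes `condor_config_val` is on the PATH and accepts `-local-name` to read the SHADOW_STARTD.* overrides. The name `check_shadow_unique` is a hypothetical helper, not part of Condor. STARTER_LOG and PROCD_ADDRESS reach the shadow startd through SHADOW_STARTD_ENVIRONMENT rather than the local namespace, so this check does not cover them.

```shell
# Hypothetical helper: compare each per-startd knob for the main startd
# and for SHADOW_STARTD. Assumes condor_config_val supports -local-name.
check_shadow_unique() {
    local knob main shadow rc=0
    for knob in STARTD_LOG STARTD_ADDRESS_FILE EXECUTE STARTD_NAME; do
        # Discard errors (e.g. for an undefined knob) so only values are captured.
        main=$(condor_config_val "$knob" 2>/dev/null)
        shadow=$(condor_config_val -local-name SHADOW_STARTD "$knob" 2>/dev/null)
        if [ "$main" = "$shadow" ]; then
            echo "COLLISION $knob: both use '$main'"
            rc=1
        else
            echo "ok $knob: '$main' vs '$shadow'"
        fi
    done
    # Non-zero if any knob resolved to the same value for both startds.
    return $rc
}
```

Source it and run `check_shadow_unique`. It prints one line per knob and returns non-zero if any pair collides.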

After installing the configuration, make sure to create the EXECUTE directory.

# sudo -u condor mkdir -m0755 $(condor_config_val SHADOW_STARTD.EXECUTE)
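To double-check the result, a small sketch like the one below reports a directory's owner and mode. It succeeds only when both match what you expect. The helper name `check_execute_dir` is hypothetical. The defaults (owner condor, mode 755) simply mirror the mkdir above, and GNU `stat -c` is assumed.

```shell
# Hypothetical helper: check_execute_dir DIR [OWNER] [MODE]
# Defaults mirror the mkdir command above; requires GNU coreutils stat.
check_execute_dir() {
    local dir=$1 want_owner=${2:-condor} want_mode=${3:-755}
    local owner mode
    if [ ! -d "$dir" ]; then
        echo "missing: $dir"
        return 1
    fi
    owner=$(stat -c '%U' "$dir")   # owning user name
    mode=$(stat -c '%a' "$dir")    # octal permissions, e.g. 755 or 1777
    echo "$dir owner=$owner mode=$mode"
    [ "$owner" = "$want_owner" ] && [ "$mode" = "$want_mode" ]
}
```

For example: `check_execute_dir "$(condor_config_val SHADOW_STARTD.EXECUTE)"`.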

Then test that it is working:

# service condor start

$ pstree -p | grep condor
        |-condor_master(9947)-+-condor_collecto(9949)
        |                     |-condor_negotiat(9950)
        |                     |-condor_schedd(9951)---condor_procd(9956)
        |                     |-condor_startd(9952)
        |                     `-condor_startd(9953)

Notice the two Startds. You can also check /var/log/condor for the shadow startd’s log files.
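If you would rather script this check than eyeball pstree, the sketch below counts running condor_startd processes. It assumes procps `pgrep` is available. The helper name `count_startds` is hypothetical, and the expected count defaults to 2 (the regular startd plus SHADOW_STARTD).

```shell
# Hypothetical helper: count_startds [EXPECTED]
# Succeeds only if exactly EXPECTED condor_startd processes are running.
count_startds() {
    local want=${1:-2} have
    # -x matches the process name exactly.
    have=$(pgrep -x condor_startd | wc -l)
    echo "condor_startd processes: $have"
    [ "$have" -eq "$want" ]
}
```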

$ condor_status
Name               OpSys      Arch   State     Activity LoadAv Mem ActvtyTime
slot1@SHADOW_START LINUX      X86_64 Unclaimed Idle     0.000   987 0+00:00:04
slot1@eeyore.local LINUX      X86_64 Unclaimed Idle     0.000   987 0+00:00:04
slot2@SHADOW_START LINUX      X86_64 Unclaimed Idle     0.000   987 0+00:00:20
slot2@eeyore.local LINUX      X86_64 Unclaimed Idle     0.000   987 0+00:00:20
slot3@SHADOW_START LINUX      X86_64 Unclaimed Idle     0.000   987 0+00:00:21
slot3@eeyore.local LINUX      X86_64 Unclaimed Idle     0.000   987 0+00:00:21
slot4@SHADOW_START LINUX      X86_64 Unclaimed Idle     0.000   987 0+00:00:22
slot4@eeyore.local LINUX      X86_64 Unclaimed Idle     0.000   987 0+00:00:22
                     Machines Owner Claimed Unclaimed Matched Preempting
        X86_64/LINUX        8     0       0         8       0          0
               Total        8     0       0         8       0          0
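Because the shadow startd advertises IsShadowStartd through STARTD_ATTRS, you can ask condor_status for one set of slots or the other with a constraint. This is a sketch. The helper name `list_slots` is hypothetical, while `-constraint` and `-format` are standard condor_status options.

```shell
# Hypothetical helper: list_slots shadow|regular
# Prints the Name of each matching slot, one per line.
list_slots() {
    local constraint
    case "$1" in
        shadow)  constraint='IsShadowStartd =?= TRUE' ;;
        regular) constraint='IsShadowStartd =!= TRUE' ;;
        *) echo "usage: list_slots shadow|regular" >&2; return 2 ;;
    esac
    condor_status -constraint "$constraint" -format '%s\n' Name
}
```

The `=?=` and `=!=` operators treat an undefined attribute as simply not equal. Slots from the regular startd, which never define IsShadowStartd, therefore land in the "regular" list.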

$ echo -e 'cmd=/bin/sleep\nargs=1d\nrequirements=IsShadowStartd=!=TRUE\nqueue 8\nrequirements=IsShadowStartd=?=TRUE\nqueue 8' | condor_submit
Submitting job(s)................
16 job(s) submitted to cluster 1.

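The first 8 jobs (1.0 to 1.7) will not run on the shadow startd, while the second 8 (1.8 to 1.15) will run only there. Each startd has four slots, so four jobs from each group run and the other eight stay idle.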
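The same shadow-only jobs can also be written as a submit description file, which is easier to read than the echo one-liner. This is a sketch. The helper name `write_shadow_submit` and the file name shadow.sub are placeholders, and it uses `executable`/`arguments` instead of `cmd`/`args`. It also sets the IsForShadowStartd job attribute, which only matters if you enable the commented-out SHADOW_STARTD.START line in the configuration.

```shell
# Hypothetical helper: write a submit file targeting only SHADOW_STARTD slots.
write_shadow_submit() {
    local file=${1:-shadow.sub}
    cat > "$file" <<'EOF'
executable = /bin/sleep
arguments = 1d
# Match only slots advertised by SHADOW_STARTD.
requirements = IsShadowStartd =?= TRUE
# Needed only if SHADOW_STARTD.START is enabled in the configuration.
+IsForShadowStartd = True
queue 8
EOF
}
```

Then submit it with `write_shadow_submit shadow.sub && condor_submit shadow.sub`.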

$ condor_q -run
-- Submitter: eeyore.local :  : eeyore.local
 ID      OWNER            SUBMITTED     RUN_TIME HOST(S)
   1.0   matt           10/18 13:50   0+00:00:16 slot1@eeyore.local
   1.1   matt           10/18 13:50   0+00:00:16 slot2@eeyore.local
   1.2   matt           10/18 13:50   0+00:00:16 slot3@eeyore.local
   1.3   matt           10/18 13:50   0+00:00:16 slot4@eeyore.local
   1.8   matt           10/18 13:50   0+00:00:16 slot1@SHADOW_STARTD@eeyore.local
   1.9   matt           10/18 13:50   0+00:00:16 slot2@SHADOW_STARTD@eeyore.local
   1.10  matt           10/18 13:50   0+00:00:16 slot3@SHADOW_STARTD@eeyore.local
   1.11  matt           10/18 13:50   0+00:00:16 slot4@SHADOW_STARTD@eeyore.local
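To summarize placement without reading the table, you can have condor_q print the RemoteHost of each running job (JobStatus == 2) and count jobs per startd. This is a sketch, and `jobs_per_startd` is a hypothetical helper. It strips the leading "slotN@" so that slot1@eeyore.local and slot1@SHADOW_STARTD@eeyore.local group under their own startd names.

```shell
# Hypothetical helper: read RemoteHost values on stdin, print "startd count".
jobs_per_startd() {
    # Drop the "slotN@" prefix, tally by the remaining startd name,
    # and sort in the C locale so the output order is stable.
    awk '{ sub(/^[^@]*@/, ""); n[$0]++ } END { for (h in n) print h, n[h] }' | LC_ALL=C sort
}
```

Usage: `condor_q -constraint 'JobStatus == 2' -format '%s\n' RemoteHost | jobs_per_startd`. For the run above, this should report four jobs on each startd.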
