Getting started: Creating a multiple node Condor pool

Assuming you have read how to set up a personal condor, the next step is to add more machines.

This time the example is from Fedora 15, with iptables -L indicating the firewall is off.

[root@node0 ~]# rpm -q condor
condor-7.7.0-0.4.fc15.x86_64

[root@node0 ~]# condor_version
$CondorVersion: 7.7.0 Jun 08 2011 PRE-RELEASE-UWCS $
$CondorPlatform: X86_64-Fedora_15 $

First off, what’s actually running?

[root@node0 ~]# service condor start
Starting condor (via systemctl):  [  OK  ]

[root@node0 ~]# pstree | grep condor
        |-condor_master-+-condor_collecto
        |               |-condor_negotiat
        |               |-condor_schedd---condor_procd
        |               `-condor_startd

You have a condor_master, which spawns all the other condor daemons and monitors them, restarting them if necessary.

Since this pool is entirely contained on one machine all the components needed for a functional pool are present. The condor_collector is there, it’s the rendezvous point. The condor_schedd is there, it holds and manages the jobs. The condor_startd is there, it represents resources in the pool. The condor_negotiator is there, it hands out matches between jobs and resources.
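
If you are curious which daemons the condor_master will start, that is controlled by the DAEMON_LIST parameter, and condor_config_val will show it to you. On a stock personal install the value looks something like the following, though the exact list and its ordering can vary between versions and packages:

[root@node0 ~]# condor_config_val DAEMON_LIST
COLLECTOR, MASTER, NEGOTIATOR, SCHEDD, STARTD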

Run condor_q to see all the jobs. It queries the condor_schedd for the list.

[root@node0 ~]# condor_q
-- Submitter: node0.local :  : node0.local
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
0 jobs; 0 idle, 0 running, 0 held

Nothing there yet.

Run condor_status to see all the resources. It queries the condor_collector for the list.

[root@node0 ~]# condor_status
Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime
slot1@node0.local  LINUX      X86_64 Unclaimed Idle     0.000   497  0+00:14:45
slot2@node0.local  LINUX      X86_64 Unclaimed Idle     0.000   497  0+00:15:06
                     Machines Owner Claimed Unclaimed Matched Preempting
        X86_64/LINUX        2     0       0         2       0          0
               Total        2     0       0         2       0          0

The machine has two cores.
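
One slot per core is just the condor_startd's default. If you ever want to advertise a different number of slots, the NUM_CPUS parameter overrides the detection; a sketch you could drop into a config.d file (purely illustrative, the rest of this walkthrough sticks with the detected two):

NUM_CPUS = 4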

Adding more nodes to the pool means running a condor_startd on more machines and telling them to report to the collector. The collector is the entry point into the pool. All components check in with it, and it runs on a well-known port: 9618 (try: grep condor /etc/services).
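
Before touching any configuration, it is worth confirming the second machine can actually reach that port on node0. A quick probe from node1, assuming nc is installed (any port checker will do):

[root@node1 ~]# nc -z node0.local 9618 && echo collector reachable
collector reachable

If that fails, sort out name resolution or the firewall before blaming condor.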

After installing condor on your second machine, you need to change the configuration. You want to run just the condor_startd, and you want the node to check in with the collector on node0.local. To do this, take advantage of condor’s config.d directory, found at /etc/condor/config.d.

The three parameters to change are CONDOR_HOST, DAEMON_LIST, and ALLOW_WRITE. Configuration files are concatenated together, with the last definition of a parameter being authoritative. ALLOW_WRITE is set to make sure the condor_schedd on node0 can run jobs on node1. So,

[root@node1 ~]# cat > /etc/condor/config.d/40root.config
CONDOR_HOST = node0.local
DAEMON_LIST = MASTER, STARTD
ALLOW_WRITE = $(ALLOW_WRITE), $(CONDOR_HOST)
^D
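
Before starting the service, you can confirm the new file is actually being read. condor_config_val -v reports not just the value but where it was defined; the exact output format varies a little between versions, but it should point at 40root.config:

[root@node1 ~]# condor_config_val -v CONDOR_HOST
CONDOR_HOST: node0.local
  Defined in '/etc/condor/config.d/40root.config', line 1.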

Once started, only the condor_master and condor_startd will be running.

[root@node1 ~]# service condor start
Starting condor (via systemctl):  [  OK  ]

[root@node1 ~]# pstree | grep condor
        |-condor_master---condor_startd

If everything worked out, condor_status will now show both machines.

[root@node1 ~]# condor_status
Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime
slot1@node0.local  LINUX      X86_64 Unclaimed Idle     0.000   497  0+00:34:45
slot2@node0.local  LINUX      X86_64 Unclaimed Idle     0.000   497  0+00:35:06
                     Machines Owner Claimed Unclaimed Matched Preempting
        X86_64/LINUX        2     0       0         2       0          0
               Total        2     0       0         2       0          0

No dice. We’ll do this the slow way.

Since we’re asking the collector for information, the best place to look for what’s going wrong is the collector’s log file. It is on node0.local in /var/log/condor/CollectorLog. Note: condor_config_val COLLECTOR_LOG will tell you where to look, too.

[root@node0 ~]# tail -n5 $(condor_config_val COLLECTOR_LOG)
06/11/11 23:00:45 PERMISSION DENIED to unauthenticated@unmapped from host 10.0.0.1 for command 0 (UPDATE_STARTD_AD), access level ADVERTISE_STARTD: reason: ADVERTISE_STARTD authorization policy contains no matching ALLOW entry for this request; identifiers used for this host: 10.0.0.1,node1.local
06/11/11 23:00:45 PERMISSION DENIED to unauthenticated@unmapped from host 10.0.0.1 for command 0 (UPDATE_STARTD_AD), access level ADVERTISE_STARTD: reason: cached result for ADVERTISE_STARTD; see first case for the full reason
06/11/11 23:00:46 PERMISSION DENIED to unauthenticated@unmapped from host 10.0.0.1 for command 2 (UPDATE_MASTER_AD), access level ADVERTISE_MASTER: reason: ADVERTISE_MASTER authorization policy contains no matching ALLOW entry for this request; identifiers used for this host: 10.0.0.1,node1.local
06/11/11 23:00:51 Got QUERY_STARTD_ADS
06/11/11 23:00:51 (Sending 2 ads in response to query)

This is telling us the security configuration on node0 is preventing node1 from checking in; specifically, the ALLOW_ADVERTISE_STARTD authorization. The security configuration mechanisms in condor are very flexible and powerful, and I suggest you read about them sometime. For now, we’ll grant node1 access to ADVERTISE_STARTD, as well as any other operation that requires write permission, by setting ALLOW_WRITE.

[root@node0 ~]# cat > /etc/condor/config.d/40root.config
ALLOW_WRITE = $(ALLOW_WRITE), 10.0.0.1,node1.local
^D

[root@node0 ~]# condor_reconfig
Sent "Reconfig" command to local master
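
To double check what the collector will now accept, ask for the expanded value; you should see 10.0.0.1 and node1.local appended to whatever default your packaging ships:

[root@node0 ~]# condor_config_val ALLOW_WRITE

As an aside, host-based authorization also accepts wildcards, so an entry such as *.local could cover every node in a naming scheme, at the cost of trusting every host that matches.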

[root@node0 ~]# condor_status
Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime
slot1@node0.local  LINUX      X86_64 Unclaimed Idle     0.040   497  0+00:50:46
slot2@node0.local  LINUX      X86_64 Unclaimed Idle     0.000   497  0+00:51:07
                     Machines Owner Claimed Unclaimed Matched Preempting
        X86_64/LINUX        2     0       0         2       0          0
               Total        2     0       0         2       0          0

You should note that host-based authentication is being used here. Any connection from 10.0.0.1 or node1.local is given write access. For simplicity, the ALLOW_WRITE parameter is set so that it appends to any existing value. And node1 is still not showing up in the status listing.

The reason node1 is not present is that it only attempts to check in every 5 minutes (condor_config_val UPDATE_INTERVAL, in seconds). A quick reconfig on node1 will speed the check-in along.
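
You can check the interval for yourself; the stock value is 300 seconds:

[root@node1 ~]# condor_config_val UPDATE_INTERVAL
300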

[root@node1 ~]# condor_reconfig
Sent "Reconfig" command to local master

[root@node1 ~]# condor_status
Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime
slot1@node0.local  LINUX      X86_64 Unclaimed Idle     0.040   497  0+00:50:46
slot2@node0.local  LINUX      X86_64 Unclaimed Idle     0.000   497  0+00:51:07
slot1@node1.local  LINUX      X86_64 Unclaimed Idle     0.000  1003  0+00:09:04
slot2@node1.local  LINUX      X86_64 Unclaimed Idle     0.000  1003  0+00:09:25
                     Machines Owner Claimed Unclaimed Matched Preempting
        X86_64/LINUX        4     0       0         4       0          0
               Total        4     0       0         4       0          0

Now that the resources are all visible, you can run a few jobs.

[matt@node0 ~]$ cat > job.sub
cmd = /bin/sleep
args = 1d
should_transfer_files = if_needed
when_to_transfer_output = on_exit
queue 8
^D
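
That submit file is as small as it gets. If you want to watch what happens to the jobs, condor_submit also accepts log, output, and error lines; a hypothetical addition (not used below):

log = sleep.$(Cluster).log
output = sleep.$(Cluster).$(Process).out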

[matt@node0 ~]$ condor_submit job.sub 
Submitting job(s)........
8 job(s) submitted to cluster 1.

[matt@node0 ~]$ condor_q
-- Submitter: node0.local :  : node0.local
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
   1.0   matt            6/11 23:23   0+00:00:07 R  0   0.0  sleep 1d          
   1.1   matt            6/11 23:23   0+00:00:06 R  0   0.0  sleep 1d          
   1.2   matt            6/11 23:23   0+00:00:06 R  0   0.0  sleep 1d          
   1.3   matt            6/11 23:23   0+00:00:06 R  0   0.0  sleep 1d          
   1.4   matt            6/11 23:23   0+00:00:00 I  0   0.0  sleep 1d          
   1.5   matt            6/11 23:23   0+00:00:00 I  0   0.0  sleep 1d          
   1.6   matt            6/11 23:23   0+00:00:00 I  0   0.0  sleep 1d          
   1.7   matt            6/11 23:23   0+00:00:00 I  0   0.0  sleep 1d          
8 jobs; 4 idle, 4 running, 0 held

[matt@node0 ~]$ condor_q -run
-- Submitter: node0.local :  : node0.local
 ID      OWNER            SUBMITTED     RUN_TIME HOST(S)         
   1.0   matt            6/11 23:23   0+00:00:09 slot1@node0.local
   1.1   matt            6/11 23:23   0+00:00:08 slot2@node0.local
   1.2   matt            6/11 23:23   0+00:00:08 slot1@node1.local
   1.3   matt            6/11 23:23   0+00:00:08 slot2@node1.local
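
One caution: these jobs sleep for a day, so they will happily occupy the pool until then. When you are done experimenting, condor_rm removes them, either by cluster id or with -all for everything you own:

[matt@node0 ~]$ condor_rm 1

A follow-up condor_q should show the queue emptying out.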

We’ve installed condor on two nodes, configured one (node0) to be what is sometimes called a head node, configured the other (node1) to report to node0, and run a few jobs.

Other topics: enabling the firewall and using the shared_port daemon, sharing UIDs and a filesystem between nodes, and finer grained security.

Notes:
0) By default, ALLOW_WRITE should include $(CONDOR_HOST)
1) ShouldTransferFiles should default to IF_NEEDED
2) WhenToTransferOutput should default to ON_EXIT

Edit – changed 50root.conf to 40root.config


15 Responses to “Getting started: Creating a multiple node Condor pool”

  1. Getting Started: Multiple node Condor pool with firewalls « Spinning Says:

    […] a Condor pool with no firewalls up is quite a simple task. Before the condor_shared_port daemon, doing the same with firewalls was a bit […]

  2. Mark T. Kennedy Says:

    interesting. planning on writing anything about how to manage resource limits?

  3. jlib Says:

    Your writing style is outstanding and this article was invaluable in getting me past a setup problem. I sure wish the Condor documentation was so well written. Thanks!

  4. Getting started: Condor and EC2 – EC2 execute node « Spinning Says:

    […] condor on the instance is very similar to creating a multiple node pool. You will need to set the CONDOR_HOST, ALLOW_WRITE, and […]

  5. Somnath Mazumdar Says:

    Hi,
    I am new to condor. I have installed condor on my two VMs (VM#1, VM#2) as personal condor setups. They are working fine and I can even execute jobs. But when I try to add VM#2 to VM#1 to increase the pool, I cannot. I followed your steps. The steps I followed:
    I am not the root user. I am working as the condor user.
    I made changes to /home/condor/condor-xx/etc/condor_config.

    The changes I made:
    1. COLLECTOR_HOST = IP of VM#1
    (as in my config there is no “CONDOR_HOST” entry)

    2. DAEMON_LIST = MASTER, STARTD
    Done

    3. ALLOW_WRITE = IP of VM#1

    After saving condor_config, I tried to execute:
    1. condor_reconfig
    Status: Ok
    2. condor_status
    Status: error

    I am getting this error:
    Error: communication error
    CEDAR:6001:Failed to connect to VM#1

    But I do not understand the reason. It would be great if you could advise me on how to fix the problem.

    • spinningmatt Says:

      First, you need to make sure the two VMs can talk to one another (ping, ssh, traceroute, look in VM#1’s CollectorLog for connection attempts), and the instructions above assume the firewall is off.

      Second, since you don’t have a CONDOR_HOST, I’m not sure what version of Condor you have or what your default configuration might be. You will have to make sure ALLOW_READ on VM#1 will allow VM#2. The CollectorLog on VM#1 should tell you if and why it is denying access from VM#2.

      I suggest starting with Fedora’s condor RPM and using LOCAL_CONFIG_DIR. Without root you can rpm2cpio | cpio -id and you will have to fix up the RELEASE_DIR & LOCAL_DIR & LOG & ETC & RUN params in etc/condor/condor_config.

      Customizing Condor configuration: LOCAL_CONFIG_FILE vs LOCAL_CONFIG_DIR

  6. Somnath Mazumdar Says:

    Thanks spinningmatt for your prompt reply. I am using condor 7.6.5.
    $CondorVersion: 7.6.5 Dec 27 2011 BuildID: 397396 $
    $CondorPlatform: x86_64_deb_5.0 $

    The issue is fixed.

    My problem was that port 9618 was closed. To check that, I used ping and lsof -i -n | egrep 'COMMAND|LISTEN'; both were working, but nc -z 137.43.92.184 9618 failed, so I asked the root user to open the port. It worked.

  7. bijurama Says:

    Hello SpinningMatt:
    Is it possible to specify a pre-render or post-render script in a submit script?
    Thanks
    /Biju

    • spinningmatt Says:

      Yes, via +PreCmd and +PostCmd, see man condor_submit. PostCmd appears to work in 7.8 and be broken in 7.6.

  8. Dave Hentchel Says:

    This tutorial was very helpful – well written and focused on a single, important goal.
    I’m thinking it’d be worthwhile having a wiki for Condor users…

  9. Dave Hentchel Says:

    One clarification. With Condor 7.9.4 I created the /etc/condor/config.d/40root.config file, but the StartLog indicated that this config file gets read before /etc/condor/condor_config.local. That latter file sets the DAEMON_LIST attribute, which thus overrides the setting in the 40root.config file.

    There are multiple ways to fix this – I chose to simply comment out the DAEMON_LIST entry in the local config file.

    Is it possible that the search order for config files, and/or the default DAEMON_LIST setting for the local config file changed in v7.9?

  10. Dave Hentchel Says:

    One interesting twist is that these jobs sleep for one day, so they don’t easily go away. I discovered the condor_drain and condor_vacate_job cmds, but you end up with the jobs remaining on the queue in “Idle” state.
    Good opportunity to learn more about condor admin :-)

    and BTW, thanks, spinningmatt, for this article; it got me running and helped me understand the best practices of Condor.

  11. Condor_job_guy Says:

    Hi,
    It seems I have set up my condor nodes nicely with this tutorial, though there seems to be one problem. When running jobs, no job will run on the other nodes, e.g. all the jobs run on the local machine. condor_status shows both machines, I have condor working on both of them, etc.
    What am I doing wrong?
