Getting started: Creating a multiple node Condor pool

Assuming you have read how to set up a personal condor, the next step is to add more machines.

This time the walkthrough is on Fedora 15, with iptables -L indicating the firewall is off.

[root@node0 ~]# rpm -q condor
condor-7.7.0-0.4.fc15.x86_64

[root@node0 ~]# condor_version
$CondorVersion: 7.7.0 Jun 08 2011 PRE-RELEASE-UWCS $
$CondorPlatform: X86_64-Fedora_15 $

First off, what’s actually running?

[root@node0 ~]# service condor start
Starting condor (via systemctl):  [  OK  ]

[root@node0 ~]# pstree | grep condor
        |-condor_master-+-condor_collecto
        |               |-condor_negotiat
        |               |-condor_schedd---condor_procd
        |               `-condor_startd

You have a condor_master, which spawns all the other condor daemons and monitors them, restarting them if necessary.

Since this pool is entirely contained on one machine, all the components needed for a functional pool are present. The condor_collector is there; it’s the rendezvous point. The condor_schedd is there; it holds and manages the jobs. The condor_startd is there; it represents resources in the pool. The condor_negotiator is there; it hands out matches between jobs and resources.

Run condor_q to see all the jobs. It queries the condor_schedd for the list.

[root@node0 ~]# condor_q
-- Submitter: node0.local :  : node0.local
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
0 jobs; 0 idle, 0 running, 0 held

Nothing there yet.

Run condor_status to see all the resources. It queries the condor_collector for the list.

[root@node0 ~]# condor_status
Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime
slot1@node0.local  LINUX      X86_64 Unclaimed Idle     0.000   497  0+00:14:45
slot2@node0.local  LINUX      X86_64 Unclaimed Idle     0.000   497  0+00:15:06
                     Machines Owner Claimed Unclaimed Matched Preempting
        X86_64/LINUX        2     0       0         2       0          0
               Total        2     0       0         2       0          0

The machine has two cores.

Adding more nodes to the pool means running a condor_startd on more machines and telling them to report to the collector. The collector is the entry point into the pool. All components check in with it, and it runs on a well-known port: 9618 (try: grep condor /etc/services).
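Before pointing another machine at the pool, it can help to confirm that the collector's port is reachable from that machine. Here is a minimal sketch in Python. It is not part of Condor: the function name collector_reachable is made up for this example, and the only facts it relies on are the port number 9618 and the head node name used in this post.

```python
import socket

COLLECTOR_PORT = 9618  # the collector's well-known port, per /etc/services


def collector_reachable(host, port=COLLECTOR_PORT, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened.

    This only shows that something accepts connections on the port;
    it does not prove that the listener is a condor_collector.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name lookup failures.
        return False
```

On node1 you would call collector_reachable("node0.local"), substituting your own head node. A False result points at the network or a firewall rather than at Condor's configuration.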

After installing condor on your second machine, you need to change the configuration. You want to run just the condor_startd, and you want the node to check in with the collector on node0.local. To do this, take advantage of condor’s config.d, found at /etc/condor/config.d.

The three parameters to change are CONDOR_HOST, DAEMON_LIST and ALLOW_WRITE. Configuration files are concatenated together, with the last definition of a parameter being authoritative. ALLOW_WRITE is to make sure the condor_schedd on node0 can run jobs on node1. So,

[root@node1 ~]# cat > /etc/condor/config.d/40root.config
CONDOR_HOST = node0.local
DAEMON_LIST = MASTER, STARTD
ALLOW_WRITE = $(ALLOW_WRITE), $(CONDOR_HOST)
^D
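The "last definition wins" rule, and the way $(ALLOW_WRITE) appends to whatever was already set, can be sketched in a few lines of Python. This is a simplified model and not Condor's actual parser: load_config and lookup are hypothetical helpers, the base fragment is an invented stand-in for the packaged defaults, and only plain NAME = value lines are handled.

```python
import re

MACRO = re.compile(r"\$\((\w+)\)")


def load_config(fragments):
    """Merge config fragments in order; later definitions win.

    Simplified model: handles only 'NAME = value' lines, blank lines
    and '#' comments.
    """
    table = {}
    for text in fragments:
        for line in text.splitlines():
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            name, value = (part.strip() for part in line.split("=", 1))
            previous = table.get(name, "")
            # A self-reference such as $(ALLOW_WRITE) is replaced by the
            # value defined so far, which is what makes appending work.
            table[name] = value.replace("$(%s)" % name, previous)
    return table


def lookup(table, name):
    """Return a parameter with remaining $(OTHER) references expanded."""
    value = table.get(name, "")
    # Unknown names expand to nothing; assumes no circular references.
    while MACRO.search(value):
        value = MACRO.sub(lambda m: table.get(m.group(1), ""), value)
    return value


base = """\
# Hypothetical stand-in for the packaged defaults; real defaults differ.
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD, STARTD
ALLOW_WRITE = node1.local
"""
override = """\
CONDOR_HOST = node0.local
DAEMON_LIST = MASTER, STARTD
ALLOW_WRITE = $(ALLOW_WRITE), $(CONDOR_HOST)
"""
config = load_config([base, override])
print(lookup(config, "DAEMON_LIST"))  # MASTER, STARTD
print(lookup(config, "ALLOW_WRITE"))  # node1.local, node0.local
```

On a real node, condor_config_val DAEMON_LIST and condor_config_val ALLOW_WRITE are the authoritative way to see what the merged configuration ended up as.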

Once started, only the condor_master and condor_startd will be running.

[root@node1 ~]# service condor start
Starting condor (via systemctl):  [  OK  ]

[root@node1 ~]# pstree | grep condor
        |-condor_master---condor_startd

If everything worked out, condor_status will now show both machines.

[root@node1 ~]# condor_status
Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime
slot1@node0.local  LINUX      X86_64 Unclaimed Idle     0.000   497  0+00:34:45
slot2@node0.local  LINUX      X86_64 Unclaimed Idle     0.000   497  0+00:35:06
                     Machines Owner Claimed Unclaimed Matched Preempting
        X86_64/LINUX        2     0       0         2       0          0
               Total        2     0       0         2       0          0

No dice. We’ll do this the slow way.

Since we’re asking the collector for information, the best place to look for what’s going wrong is in the collector’s log file. It is on node0.local in /var/log/condor/CollectorLog. Note: condor_config_val COLLECTOR_LOG will tell you where to look too.

[root@node0 ~]# tail -n5 $(condor_config_val COLLECTOR_LOG)
06/11/11 23:00:45 PERMISSION DENIED to unauthenticated@unmapped from host 10.0.0.1 for command 0 (UPDATE_STARTD_AD), access level ADVERTISE_STARTD: reason: ADVERTISE_STARTD authorization policy contains no matching ALLOW entry for this request; identifiers used for this host: 10.0.0.1,node1.local
06/11/11 23:00:45 PERMISSION DENIED to unauthenticated@unmapped from host 10.0.0.1 for command 0 (UPDATE_STARTD_AD), access level ADVERTISE_STARTD: reason: cached result for ADVERTISE_STARTD; see first case for the full reason
06/11/11 23:00:46 PERMISSION DENIED to unauthenticated@unmapped from host 10.0.0.1 for command 2 (UPDATE_MASTER_AD), access level ADVERTISE_MASTER: reason: ADVERTISE_MASTER authorization policy contains no matching ALLOW entry for this request; identifiers used for this host: 10.0.0.1,node1.local
06/11/11 23:00:51 Got QUERY_STARTD_ADS
06/11/11 23:00:51 (Sending 2 ads in response to query)

This is telling us the security configuration on node0 is preventing node1 from checking in. Specifically, the ADVERTISE_STARTD authorization policy (ALLOW_ADVERTISE_STARTD) has no entry that matches node1. The security configuration mechanisms in condor are very flexible and powerful. I suggest you read about them sometime. For now, we’ll grant node1 access to ADVERTISE_STARTD as well as any other operations that require write permissions by setting ALLOW_WRITE.

[root@node0 ~]# cat > /etc/condor/config.d/40root.config
ALLOW_WRITE = $(ALLOW_WRITE), 10.0.0.1,node1.local
^D
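The identifiers that go into ALLOW_WRITE come straight from the "identifiers used for this host" part of the denial messages. With more than one node being refused, pulling them out by hand gets tedious, so here is a rough Python sketch of doing it mechanically. The helper name denied_identifiers is made up, and the regular expressions assume the CollectorLog format shown in this post (condor 7.7.0); other versions may word the message differently.

```python
import re

DENIED = re.compile(
    r"PERMISSION DENIED to \S+ from host (?P<host>\S+) "
    r"for command \d+ \((?P<command>\w+)\), "
    r"access level (?P<level>\w+):"
)
IDENTIFIERS = re.compile(r"identifiers used for this host: (?P<ids>\S+)")


def denied_identifiers(log_text):
    """Map each denied access level to the host identifiers seen for it."""
    result = {}
    for line in log_text.splitlines():
        denied = DENIED.search(line)
        if not denied:
            continue  # not a denial, e.g. "Got QUERY_STARTD_ADS"
        ids = result.setdefault(denied.group("level"), set())
        ids.add(denied.group("host"))
        found = IDENTIFIERS.search(line)
        if found:  # "cached result" lines do not repeat the identifiers
            ids.update(found.group("ids").split(","))
    return result


# The three denial lines from the log excerpt in this post.
SAMPLE = """\
06/11/11 23:00:45 PERMISSION DENIED to unauthenticated@unmapped from host 10.0.0.1 for command 0 (UPDATE_STARTD_AD), access level ADVERTISE_STARTD: reason: ADVERTISE_STARTD authorization policy contains no matching ALLOW entry for this request; identifiers used for this host: 10.0.0.1,node1.local
06/11/11 23:00:45 PERMISSION DENIED to unauthenticated@unmapped from host 10.0.0.1 for command 0 (UPDATE_STARTD_AD), access level ADVERTISE_STARTD: reason: cached result for ADVERTISE_STARTD; see first case for the full reason
06/11/11 23:00:46 PERMISSION DENIED to unauthenticated@unmapped from host 10.0.0.1 for command 2 (UPDATE_MASTER_AD), access level ADVERTISE_MASTER: reason: ADVERTISE_MASTER authorization policy contains no matching ALLOW entry for this request; identifiers used for this host: 10.0.0.1,node1.local
"""

for level, ids in sorted(denied_identifiers(SAMPLE).items()):
    print(level, ",".join(sorted(ids)))
# ADVERTISE_MASTER 10.0.0.1,node1.local
# ADVERTISE_STARTD 10.0.0.1,node1.local
```

Both access levels list the same two identifiers, which are exactly the values appended to ALLOW_WRITE on node0.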

[root@node0 ~]# condor_reconfig
Sent "Reconfig" command to local master

[root@node0 ~]# condor_status
Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime
slot1@node0.local  LINUX      X86_64 Unclaimed Idle     0.040   497  0+00:50:46
slot2@node0.local  LINUX      X86_64 Unclaimed Idle     0.000   497  0+00:51:07
                     Machines Owner Claimed Unclaimed Matched Preempting
        X86_64/LINUX        2     0       0         2       0          0
               Total        2     0       0         2       0          0

You should note that host-based authorization is being used here. Any connection from 10.0.0.1 or node1.local is given write access. For simplicity, the ALLOW_WRITE parameter is set so that it appends to any existing value. Even so, node1 is still not showing up in the status listing.

node1 is not present because it only attempts to check in every 5 minutes (see condor_config_val UPDATE_INTERVAL, which is in seconds). A quick reconfig on node1 will speed the check-in along.

[root@node1 ~]# condor_reconfig
Sent "Reconfig" command to local master

[root@node1 ~]# condor_status
Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime
slot1@node0.local  LINUX      X86_64 Unclaimed Idle     0.040   497  0+00:50:46
slot2@node0.local  LINUX      X86_64 Unclaimed Idle     0.000   497  0+00:51:07
slot1@node1.local  LINUX      X86_64 Unclaimed Idle     0.000  1003  0+00:09:04
slot2@node1.local  LINUX      X86_64 Unclaimed Idle     0.000  1003  0+00:09:25
                     Machines Owner Claimed Unclaimed Matched Preempting
        X86_64/LINUX        4     0       0         4       0          0
               Total        4     0       0         4       0          0
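If you add more than a couple of nodes, eyeballing the listing for missing machines gets old. A small Python sketch can count slots per machine from the plain condor_status text. The function name slots_per_machine is hypothetical, and it assumes the human-readable layout shown above, where every slot row starts with a name of the form slotN@host.

```python
from collections import Counter


def slots_per_machine(status_text):
    """Count slot rows per machine in plain condor_status output."""
    counts = Counter()
    for line in status_text.splitlines():
        fields = line.split()
        # Slot rows start with a name such as slot1@node0.local;
        # header and summary rows have no "@" in the first column.
        if fields and "@" in fields[0]:
            machine = fields[0].split("@", 1)[1]
            counts[machine] += 1
    return counts


# The listing from this post, after node1 checked in.
SAMPLE = """\
Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime
slot1@node0.local  LINUX      X86_64 Unclaimed Idle     0.040   497  0+00:50:46
slot2@node0.local  LINUX      X86_64 Unclaimed Idle     0.000   497  0+00:51:07
slot1@node1.local  LINUX      X86_64 Unclaimed Idle     0.000  1003  0+00:09:04
slot2@node1.local  LINUX      X86_64 Unclaimed Idle     0.000  1003  0+00:09:25
                     Machines Owner Claimed Unclaimed Matched Preempting
        X86_64/LINUX        4     0       0         4       0          0
               Total        4     0       0         4       0          0
"""

print(dict(slots_per_machine(SAMPLE)))
# {'node0.local': 2, 'node1.local': 2}
```

A machine you expected but that is missing from the result has not checked in yet, which sends you back to the collector's log.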

Now that the resources are all visible, you can run a few jobs.

[matt@node0 ~]$ cat > job.sub
cmd = /bin/sleep
args = 1d
should_transfer_files = if_needed
when_to_transfer_output = on_exit
queue 8
^D

[matt@node0 ~]$ condor_submit job.sub
Submitting job(s)........
8 job(s) submitted to cluster 1.

[matt@node0 ~]$ condor_q
-- Submitter: node0.local :  : node0.local
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
   1.0   matt            6/11 23:23   0+00:00:07 R  0   0.0  sleep 1d
   1.1   matt            6/11 23:23   0+00:00:06 R  0   0.0  sleep 1d
   1.2   matt            6/11 23:23   0+00:00:06 R  0   0.0  sleep 1d
   1.3   matt            6/11 23:23   0+00:00:06 R  0   0.0  sleep 1d
   1.4   matt            6/11 23:23   0+00:00:00 I  0   0.0  sleep 1d
   1.5   matt            6/11 23:23   0+00:00:00 I  0   0.0  sleep 1d
   1.6   matt            6/11 23:23   0+00:00:00 I  0   0.0  sleep 1d
   1.7   matt            6/11 23:23   0+00:00:00 I  0   0.0  sleep 1d
8 jobs; 4 idle, 4 running, 0 held

[matt@node0 ~]$ condor_q -run
-- Submitter: node0.local :  : node0.local
 ID      OWNER            SUBMITTED     RUN_TIME HOST(S)
   1.0   matt            6/11 23:23   0+00:00:09 slot1@node0.local
   1.1   matt            6/11 23:23   0+00:00:08 slot2@node0.local
   1.2   matt            6/11 23:23   0+00:00:08 slot1@node1.local
   1.3   matt            6/11 23:23   0+00:00:08 slot2@node1.local
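To check that jobs really are spread across both machines, the condor_q -run listing can be grouped by host. This is only a sketch: jobs_by_host is a made-up helper, and it assumes the column layout shown above, with the job ID first and slot@host last.

```python
from collections import defaultdict


def jobs_by_host(run_text):
    """Group job IDs from 'condor_q -run' output by execute machine."""
    hosts = defaultdict(list)
    for line in run_text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        job_id, location = fields[0], fields[-1]
        # Job rows start with cluster.proc (e.g. 1.0) and end with slot@host.
        if "@" in location and job_id.replace(".", "", 1).isdigit():
            hosts[location.split("@", 1)[1]].append(job_id)
    return dict(hosts)


# The listing from this post.
SAMPLE = """\
-- Submitter: node0.local :  : node0.local
 ID      OWNER            SUBMITTED     RUN_TIME HOST(S)
   1.0   matt            6/11 23:23   0+00:00:09 slot1@node0.local
   1.1   matt            6/11 23:23   0+00:00:08 slot2@node0.local
   1.2   matt            6/11 23:23   0+00:00:08 slot1@node1.local
   1.3   matt            6/11 23:23   0+00:00:08 slot2@node1.local
"""

print(jobs_by_host(SAMPLE))
# {'node0.local': ['1.0', '1.1'], 'node1.local': ['1.2', '1.3']}
```

Two jobs per machine matches the two slots each node advertises; the other four jobs stay idle until a slot frees up.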

We’ve installed condor on two nodes, configured one (node0) to be what is sometimes called a head node, configured the other (node1) to report to node0, and run a few jobs.

Other topics: enabling the firewall and using the shared_port daemon, sharing UID and FS between nodes, and finer-grained security.

Notes:
0) By default, ALLOW_WRITE should include $(CONDOR_HOST)
1) ShouldTransferFiles should default to IF_NEEDED
2) WhenToTransferOutput should default to ON_EXIT

Edit – changed 50root.conf to 40root.config


4 Responses to “Getting started: Creating a multiple node Condor pool”

  1. Getting Started: Multiple node Condor pool with firewalls « Spinning Says:

    [...] a Condor pool with no firewalls up is quite a simple task. Before the condor_shared_port daemon, doing the same with firewalls was a bit [...]

  2. Mark T. Kennedy Says:

    interesting. planning on writing anything about how to manage resource limits?

  3. jlib Says:

    Your writing style is outstanding and this article was invaluable in getting me past a setup problem. I sure wish the Condor documentation was so well written. Thanks!

  4. Getting started: Condor and EC2 – EC2 execute node « Spinning Says:

    [...] condor on the instance is very similar to creating a multiple node pool. You will need to set the CONDOR_HOST, ALLOW_WRITE, and [...]
