Assuming you have read how to set up a personal condor, the next step is to add more machines.
This time from Fedora 15, with iptables -L indicating the firewall is off.
[root@node0 ~]# rpm -q condor
condor-7.7.0-0.4.fc15.x86_64
[root@node0 ~]# condor_version
$CondorVersion: 7.7.0 Jun 08 2011 PRE-RELEASE-UWCS $
$CondorPlatform: X86_64-Fedora_15 $
First off, what’s actually running?
[root@node0 ~]# service condor start
Starting condor (via systemctl): [ OK ]
[root@node0 ~]# pstree | grep condor
     |-condor_master-+-condor_collecto
     |               |-condor_negotiat
     |               |-condor_schedd---condor_procd
     |               `-condor_startd
You have a condor_master, which spawns all the other condor daemons and monitors them, restarting them if necessary.
Since this pool is entirely contained on one machine, all the components needed for a functional pool are present. The condor_collector is there, it’s the rendezvous point. The condor_schedd is there, it holds and manages the jobs. The condor_startd is there, it represents resources in the pool. The condor_negotiator is there, it hands out matches between jobs and resources.
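As a rough check that all of these are up, the sketch below compares a list of process names against the daemons you expect. The function name missing_daemons is made up for this example. It assumes process names arrive one per line on stdin and are cut at 15 characters, the way pstree shows condor_collecto and condor_negotiat above.

```shell
# Hypothetical helper, not part of condor.
# stdin: process names, one per line. Arguments: daemons you expect to see.
# Prints each expected daemon that is absent from the list.
missing_daemons() {
    procs=$(cat)
    for d in "$@"; do
        # The kernel reports at most 15 characters of a process name,
        # so compare against the expected name cut to that length.
        short=$(printf '%.15s' "$d")
        if ! printf '%s\n' "$procs" | grep -qxF -- "$short"; then
            printf '%s\n' "$d"
        fi
    done
}
```

On node0 it could be fed with: ps -e -o comm= | missing_daemons condor_master condor_collector condor_negotiator condor_schedd condor_startd. No output means nothing is missing.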
Run condor_q to see all the jobs. It queries the condor_schedd for the list.
[root@node0 ~]# condor_q
-- Submitter: node0.local : : node0.local
 ID   OWNER  SUBMITTED    RUN_TIME    ST  PRI  SIZE  CMD
0 jobs; 0 idle, 0 running, 0 held
Nothing there yet.
Run condor_status to see all the resources. It queries the condor_collector for the list.
[root@node0 ~]# condor_status
Name               OpSys  Arch    State      Activity  LoadAv   Mem  ActvtyTime
slot1@node0.local  LINUX  X86_64  Unclaimed  Idle      0.000    497  0+00:14:45
slot2@node0.local  LINUX  X86_64  Unclaimed  Idle      0.000    497  0+00:15:06

              Machines  Owner  Claimed  Unclaimed  Matched  Preempting
X86_64/LINUX         2      0        0          2        0           0
       Total         2      0        0          2        0           0
The machine has two cores.
Adding more nodes to the pool means running a condor_startd on more machines and telling them to report to the collector. The collector is the entry point into the pool. All components check in with it, and it runs on a well-known port: 9618 (try: grep condor /etc/services).
After installing condor on your second machine, you need to change the configuration. You want to run just the condor_startd, and you want the node to check in with the collector on node0.local. To do this take advantage of condor’s config.d, found at /etc/condor/config.d.
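Before touching any configuration on a new node, it may be worth confirming that it can reach the collector's port at all. Here is a minimal sketch; port_open is a hypothetical helper name, and it assumes bash (for its /dev/tcp pseudo-device) and the coreutils timeout command are available. It only tests TCP reachability and says nothing about whether the collector will authorize the node.

```shell
# Hypothetical helper, not part of condor.
# Succeeds if a TCP connection to host:port can be opened within 3 seconds.
port_open() {
    host=$1
    port=$2
    # bash treats /dev/tcp/HOST/PORT specially and opens a socket to it.
    # Inside the bash -c script, $0 is the host and $1 is the port.
    timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}
```

From the second machine, something like: port_open node0.local 9618 && echo reachable.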
The three parameters to change are CONDOR_HOST, DAEMON_LIST and ALLOW_WRITE. Configuration files are concatenated together, with the last definition of a parameter being authoritative. ALLOW_WRITE is to make sure the condor_schedd on node0 can run jobs on node1. So,
[root@node1 ~]# cat > /etc/condor/config.d/40root.config
CONDOR_HOST = node0.local
DAEMON_LIST = MASTER, STARTD
ALLOW_WRITE = $(ALLOW_WRITE), $(CONDOR_HOST)
^D
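Because the last definition wins, it can be handy to see which file in config.d has the final say for a parameter. The sketch below is only an approximation: last_definition is a made-up helper, it looks at a single directory, walks the files in the shell's glob order (assumed here to match the order condor reads them), and ignores the main condor_config file. condor_config_val remains the authoritative answer.

```shell
# Hypothetical helper, not part of condor.
# Usage: last_definition PARAM DIR
# Prints "file: line" for the last file in DIR that assigns PARAM.
last_definition() {
    param=$1
    dir=$2
    hit=
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        # Parameter names are matched case-insensitively.
        line=$(grep -i -E "^[[:space:]]*${param}[[:space:]]*=" "$f" | tail -n 1)
        if [ -n "$line" ]; then
            hit="$f: $line"
        fi
    done
    if [ -n "$hit" ]; then
        printf '%s\n' "$hit"
    else
        return 1
    fi
}
```

For example: last_definition ALLOW_WRITE /etc/condor/config.d. Since ALLOW_WRITE above is written as $(ALLOW_WRITE), ..., the final value still includes whatever earlier files set.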
Once started, only the condor_master and condor_startd will be running.
[root@node1 ~]# service condor start
Starting condor (via systemctl): [ OK ]
[root@node1 ~]# pstree | grep condor
|-condor_master---condor_startd
If everything worked out, condor_status will now show both machines.
[root@node1 ~]# condor_status
Name               OpSys  Arch    State      Activity  LoadAv   Mem  ActvtyTime
slot1@node0.local  LINUX  X86_64  Unclaimed  Idle      0.000    497  0+00:34:45
slot2@node0.local  LINUX  X86_64  Unclaimed  Idle      0.000    497  0+00:35:06

              Machines  Owner  Claimed  Unclaimed  Matched  Preempting
X86_64/LINUX         2      0        0          2        0           0
       Total         2      0        0          2        0           0
No dice. We’ll do this the slow way.
Since we’re asking the collector for information, the best place to look for what’s going wrong is in the collector’s log file. It is on node0.local in /var/log/condor/CollectorLog. Note: condor_config_val COLLECTOR_LOG will tell you where to look too.
[root@node0 ~]# tail -n5 $(condor_config_val COLLECTOR_LOG)
06/11/11 23:00:45 PERMISSION DENIED to unauthenticated@unmapped from host 10.0.0.1 for command 0 (UPDATE_STARTD_AD), access level ADVERTISE_STARTD: reason: ADVERTISE_STARTD authorization policy contains no matching ALLOW entry for this request; identifiers used for this host: 10.0.0.1,node1.local
06/11/11 23:00:45 PERMISSION DENIED to unauthenticated@unmapped from host 10.0.0.1 for command 0 (UPDATE_STARTD_AD), access level ADVERTISE_STARTD: reason: cached result for ADVERTISE_STARTD; see first case for the full reason
06/11/11 23:00:46 PERMISSION DENIED to unauthenticated@unmapped from host 10.0.0.1 for command 2 (UPDATE_MASTER_AD), access level ADVERTISE_MASTER: reason: ADVERTISE_MASTER authorization policy contains no matching ALLOW entry for this request; identifiers used for this host: 10.0.0.1,node1.local
06/11/11 23:00:51 Got QUERY_STARTD_ADS
06/11/11 23:00:51 (Sending 2 ads in response to query)
This is telling us the security configuration on node0 is preventing node1 from checking in. Specifically, the ALLOW_ADVERTISE_STARTD configuration. The security configuration mechanisms in condor are very flexible and powerful. I suggest you read about them sometime. For now, we’ll grant node1 access to ADVERTISE_STARTD as well as any other operations that require write permissions by setting ALLOW_WRITE.
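On a busier collector the denials can be buried among other messages. A small sketch for boiling them down to host and access level follows; denied_summary is a hypothetical helper, and the sed pattern assumes the PERMISSION DENIED line format shown in the log above.

```shell
# Hypothetical helper, not part of condor.
# Usage: denied_summary LOGFILE
# Prints a count of PERMISSION DENIED entries per host and access level.
denied_summary() {
    grep 'PERMISSION DENIED' "$1" |
        # Keep only "<host> <access level>" from each denial.
        sed -n 's/.* from host \([^ ]*\) for command .* access level \([A-Z_]*\):.*/\1 \2/p' |
        sort | uniq -c
}
```

On node0 that might be run as: denied_summary $(condor_config_val COLLECTOR_LOG). Each line of output names a host and an access level that the current ALLOW settings do not cover.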
[root@node0 ~]# cat > /etc/condor/config.d/40root.config
ALLOW_WRITE = $(ALLOW_WRITE), 10.0.0.1,node1.local
^D
[root@node0 ~]# condor_reconfig
Sent "Reconfig" command to local master
[root@node0 ~]# condor_status
Name               OpSys  Arch    State      Activity  LoadAv   Mem  ActvtyTime
slot1@node0.local  LINUX  X86_64  Unclaimed  Idle      0.040    497  0+00:50:46
slot2@node0.local  LINUX  X86_64  Unclaimed  Idle      0.000    497  0+00:51:07

              Machines  Owner  Claimed  Unclaimed  Matched  Preempting
X86_64/LINUX         2      0        0          2        0           0
       Total         2      0        0          2        0           0
You should note that host-based authentication is being used here. Any connection from 10.0.0.1 or node1.local is given write access. For simplicity, the ALLOW_WRITE parameter is set so that it appends to any existing value. And node1 still is not showing up in the status listing.
The reason node1 is not present is that it only attempts to check in every 5 minutes (condor_config_val UPDATE_INTERVAL, in seconds). A quick reconfig on node1 will speed the check-in along.
[root@node1 ~]# condor_reconfig
Sent "Reconfig" command to local master
[root@node1 ~]# condor_status
Name               OpSys  Arch    State      Activity  LoadAv   Mem  ActvtyTime
slot1@node0.local  LINUX  X86_64  Unclaimed  Idle      0.040    497  0+00:50:46
slot2@node0.local  LINUX  X86_64  Unclaimed  Idle      0.000    497  0+00:51:07
slot1@node1.local  LINUX  X86_64  Unclaimed  Idle      0.000   1003  0+00:09:04
slot2@node1.local  LINUX  X86_64  Unclaimed  Idle      0.000   1003  0+00:09:25

              Machines  Owner  Claimed  Unclaimed  Matched  Preempting
X86_64/LINUX         4      0        0          4        0           0
       Total         4      0        0          4        0           0
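With more than a couple of nodes, eyeballing the listing gets tedious. A rough sketch for counting slots per machine from condor_status output is below. slots_per_host is a made-up name, and it assumes every slot is reported as slotN@hostname in the first column, as in the listing above.

```shell
# Hypothetical helper, not part of condor.
# stdin: condor_status output. Prints "hostname slot-count", sorted by hostname.
slots_per_host() {
    awk '/^slot[0-9]+@/ {
             split($1, parts, "@")   # parts[2] is the hostname
             count[parts[2]]++
         }
         END { for (h in count) print h, count[h] }' | sort
}
```

Usage would be: condor_status | slots_per_host. A node that has not checked in yet simply does not appear in the output.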
Now that the resources are all visible, you can run a few jobs.
[matt@node0 ~]$ cat > job.sub
cmd = /bin/sleep
args = 1d
should_transfer_files = if_needed
when_to_transfer_output = on_exit
queue 8
^D
[matt@node0 ~]$ condor_submit job.sub
Submitting job(s)........
8 job(s) submitted to cluster 1.
[matt@node0 ~]$ condor_q
-- Submitter: node0.local : : node0.local
 ID   OWNER  SUBMITTED    RUN_TIME    ST  PRI  SIZE  CMD
 1.0  matt   6/11 23:23   0+00:00:07  R   0    0.0   sleep 1d
 1.1  matt   6/11 23:23   0+00:00:06  R   0    0.0   sleep 1d
 1.2  matt   6/11 23:23   0+00:00:06  R   0    0.0   sleep 1d
 1.3  matt   6/11 23:23   0+00:00:06  R   0    0.0   sleep 1d
 1.4  matt   6/11 23:23   0+00:00:00  I   0    0.0   sleep 1d
 1.5  matt   6/11 23:23   0+00:00:00  I   0    0.0   sleep 1d
 1.6  matt   6/11 23:23   0+00:00:00  I   0    0.0   sleep 1d
 1.7  matt   6/11 23:23   0+00:00:00  I   0    0.0   sleep 1d
8 jobs; 4 idle, 4 running, 0 held
[matt@node0 ~]$ condor_q -run
-- Submitter: node0.local : : node0.local
 ID   OWNER  SUBMITTED    RUN_TIME    HOST(S)
 1.0  matt   6/11 23:23   0+00:00:09  slot1@node0.local
 1.1  matt   6/11 23:23   0+00:00:08  slot2@node0.local
 1.2  matt   6/11 23:23   0+00:00:08  slot1@node1.local
 1.3  matt   6/11 23:23   0+00:00:08  slot2@node1.local
We’ve installed condor on two nodes, configured one (node0) to be what is sometimes called a head node, configured the other (node1) to report to node0, and run a few jobs.
Other topics: enabling the firewall and using the shared_port daemon, sharing UID and FS between nodes, and finer-grained security.
Notes:
0) By default, ALLOW_WRITE should include $(CONDOR_HOST)
1) ShouldTransferFiles should default to IF_NEEDED
2) WhenToTransferOutput should default to ON_EXIT
Edit – changed 50root.conf to 40root.config