Assuming you have read how to set up a personal condor, the next step is to add more machines.
This time we are working from Fedora 15, with iptables -L indicating the firewall is off.
[root@node0 ~]# rpm -q condor
condor-7.7.0-0.4.fc15.x86_64
[root@node0 ~]# condor_version
$CondorVersion: 7.7.0 Jun 08 2011 PRE-RELEASE-UWCS $
$CondorPlatform: X86_64-Fedora_15 $
First off, what’s actually running?
[root@node0 ~]# service condor start
Starting condor (via systemctl):  [  OK  ]
[root@node0 ~]# pstree | grep condor
 |-condor_master-+-condor_collecto
 |               |-condor_negotiat
 |               |-condor_schedd---condor_procd
 |               `-condor_startd
You have a condor_master, which spawns all the other condor daemons, monitors them, and restarts them if necessary. Since this pool is entirely contained on one machine, all the components needed for a functional pool are present. The condor_collector is there; it is the rendezvous point for the pool. The condor_schedd is there; it holds and manages the jobs. The condor_startd is there; it represents resources in the pool. The condor_negotiator is there; it hands out matches between jobs and resources.
Run condor_q to see all the jobs. It queries the condor_schedd for the list.
[root@node0 ~]# condor_q

-- Submitter: node0.local : : node0.local
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD

0 jobs; 0 idle, 0 running, 0 held
Nothing there yet.
Run condor_status to see all the resources. It queries the condor_collector for the list.
[root@node0 ~]# condor_status
Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime

slot1@node0.local  LINUX      X86_64 Unclaimed Idle     0.000   497  0+00:14:45
slot2@node0.local  LINUX      X86_64 Unclaimed Idle     0.000   497  0+00:15:06
                     Machines Owner Claimed Unclaimed Matched Preempting

        X86_64/LINUX        2     0       0         2       0          0

               Total        2     0       0         2       0          0
The machine has two cores, hence the two slots.
Adding more nodes to the pool means running a condor_startd on more machines and telling them to report to the collector. The collector is the entry point into the pool. All components check in with it, and it runs on a well-known port: 9618 (try: grep condor /etc/services).
After installing condor on your second machine, you need to change the configuration. You want to run just the condor_startd, and you want the node to check in with the collector on node0.local. To do this, take advantage of condor's config.d, found at /etc/condor/config.d.
The three parameters to change are CONDOR_HOST, DAEMON_LIST and ALLOW_WRITE. Configuration files are concatenated together, with the last definition of a parameter being authoritative. ALLOW_WRITE is there to make sure the condor_schedd on node0 can run jobs on node1. So,
[root@node1 ~]# cat > /etc/condor/config.d/40root.config
CONDOR_HOST = node0.local
DAEMON_LIST = MASTER, STARTD
ALLOW_WRITE = $(ALLOW_WRITE), $(CONDOR_HOST)
^D
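The last-definition-wins rule can be sketched with plain shell. Condor reads the files in config.d in lexical order; the file names below are hypothetical stand-ins for a default config and our override:

```shell
# Two hypothetical config files, concatenated in lexical order,
# mimicking how condor processes its config.d directory.
mkdir -p /tmp/demo-config.d
echo 'DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, SCHEDD, STARTD' \
  > /tmp/demo-config.d/00default.config
echo 'DAEMON_LIST = MASTER, STARTD' > /tmp/demo-config.d/40root.config

# The last definition of a parameter is the one that sticks:
cat /tmp/demo-config.d/*.config | awk -F' = ' '/^DAEMON_LIST/ {v=$2} END {print v}'
# prints: MASTER, STARTD
```

This is also why numbering the files (00, 40, ...) matters: it controls which definition comes last.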
Once condor is started, only the condor_master and condor_startd will be running.
[root@node1 ~]# service condor start
Starting condor (via systemctl):  [  OK  ]
[root@node1 ~]# pstree | grep condor
 |-condor_master---condor_startd
If everything worked out, condor_status will now show both machines.
[root@node1 ~]# condor_status
Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime

slot1@node0.local  LINUX      X86_64 Unclaimed Idle     0.000   497  0+00:34:45
slot2@node0.local  LINUX      X86_64 Unclaimed Idle     0.000   497  0+00:35:06
                     Machines Owner Claimed Unclaimed Matched Preempting

        X86_64/LINUX        2     0       0         2       0          0

               Total        2     0       0         2       0          0
No dice. We’ll do this the slow way.
Since we're asking the collector for information, the best place to look for what's going wrong is the collector's log file. It is on node0.local at /var/log/condor/CollectorLog. Note: condor_config_val COLLECTOR_LOG will tell you where to look, too.
[root@node0 ~]# tail -n5 $(condor_config_val COLLECTOR_LOG)
06/11/11 23:00:45 PERMISSION DENIED to unauthenticated@unmapped from host 10.0.0.1 for command 0 (UPDATE_STARTD_AD), access level ADVERTISE_STARTD: reason: ADVERTISE_STARTD authorization policy contains no matching ALLOW entry for this request; identifiers used for this host: 10.0.0.1,node1.local
06/11/11 23:00:45 PERMISSION DENIED to unauthenticated@unmapped from host 10.0.0.1 for command 0 (UPDATE_STARTD_AD), access level ADVERTISE_STARTD: reason: cached result for ADVERTISE_STARTD; see first case for the full reason
06/11/11 23:00:46 PERMISSION DENIED to unauthenticated@unmapped from host 10.0.0.1 for command 2 (UPDATE_MASTER_AD), access level ADVERTISE_MASTER: reason: ADVERTISE_MASTER authorization policy contains no matching ALLOW entry for this request; identifiers used for this host: 10.0.0.1,node1.local
06/11/11 23:00:51 Got QUERY_STARTD_ADS
06/11/11 23:00:51 (Sending 2 ads in response to query)
This is telling us the security configuration on node0 is preventing node1 from checking in. Specifically, the ALLOW_ADVERTISE_STARTD configuration. Condor's security configuration mechanisms are very flexible and powerful; I suggest you read about them sometime. For now, we'll grant node1 access to ADVERTISE_STARTD, as well as any other operations that require write permission, by setting ALLOW_WRITE.
[root@node0 ~]# cat > /etc/condor/config.d/40root.config
ALLOW_WRITE = $(ALLOW_WRITE), 10.0.0.1,node1.local
^D
[root@node0 ~]# condor_reconfig
Sent "Reconfig" command to local master
[root@node0 ~]# condor_status
Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime

slot1@node0.local  LINUX      X86_64 Unclaimed Idle     0.040   497  0+00:50:46
slot2@node0.local  LINUX      X86_64 Unclaimed Idle     0.000   497  0+00:51:07
                     Machines Owner Claimed Unclaimed Matched Preempting

        X86_64/LINUX        2     0       0         2       0          0

               Total        2     0       0         2       0          0
Note that host-based authentication is being used here: any connection from 10.0.0.1 or node1.local is given write access. For simplicity, ALLOW_WRITE is set so that it appends to any existing value. And node1 is still not showing up in the status listing.
The reason node1 is not present is that it only attempts to check in every 5 minutes (condor_config_val UPDATE_INTERVAL, in seconds). A quick reconfig on node1 will speed the check-in along.
[root@node1 ~]# condor_reconfig
Sent "Reconfig" command to local master
[root@node1 ~]# condor_status
Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime

slot1@node0.local  LINUX      X86_64 Unclaimed Idle     0.040   497  0+00:50:46
slot2@node0.local  LINUX      X86_64 Unclaimed Idle     0.000   497  0+00:51:07
slot1@node1.local  LINUX      X86_64 Unclaimed Idle     0.000  1003  0+00:09:04
slot2@node1.local  LINUX      X86_64 Unclaimed Idle     0.000  1003  0+00:09:25
                     Machines Owner Claimed Unclaimed Matched Preempting

        X86_64/LINUX        4     0       0         4       0          0

               Total        4     0       0         4       0          0
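If you expect to be adding nodes often, the check-in frequency itself is configurable instead of reconfiging by hand each time. A sketch (the file name 41interval.config is hypothetical, and 60 seconds is an arbitrary choice; shorter intervals mean more collector traffic):

```
# /etc/condor/config.d/41interval.config (hypothetical)
# Daemons advertise themselves every 60 seconds instead of the default 300.
UPDATE_INTERVAL = 60
```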
Now that the resources are all visible, you can run a few jobs.
[matt@node0 ~]$ cat > job.sub
cmd = /bin/sleep
args = 1d
should_transfer_files = if_needed
when_to_transfer_output = on_exit
queue 8
^D
[matt@node0 ~]$ condor_submit job.sub
Submitting job(s)........
8 job(s) submitted to cluster 1.
[matt@node0 ~]$ condor_q

-- Submitter: node0.local : : node0.local
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
   1.0   matt            6/11 23:23   0+00:00:07 R  0   0.0  sleep 1d
   1.1   matt            6/11 23:23   0+00:00:06 R  0   0.0  sleep 1d
   1.2   matt            6/11 23:23   0+00:00:06 R  0   0.0  sleep 1d
   1.3   matt            6/11 23:23   0+00:00:06 R  0   0.0  sleep 1d
   1.4   matt            6/11 23:23   0+00:00:00 I  0   0.0  sleep 1d
   1.5   matt            6/11 23:23   0+00:00:00 I  0   0.0  sleep 1d
   1.6   matt            6/11 23:23   0+00:00:00 I  0   0.0  sleep 1d
   1.7   matt            6/11 23:23   0+00:00:00 I  0   0.0  sleep 1d

8 jobs; 4 idle, 4 running, 0 held
[matt@node0 ~]$ condor_q -run

-- Submitter: node0.local : : node0.local
 ID      OWNER            SUBMITTED     RUN_TIME HOST(S)
   1.0   matt            6/11 23:23   0+00:00:09 slot1@node0.local
   1.1   matt            6/11 23:23   0+00:00:08 slot2@node0.local
   1.2   matt            6/11 23:23   0+00:00:08 slot1@node1.local
   1.3   matt            6/11 23:23   0+00:00:08 slot2@node1.local
We've installed condor on two nodes, configured one (node0) as what is sometimes called a head node, configured the other (node1) to report to node0, and run a few jobs.
Other topics: enabling the firewall and using the shared_port daemon, sharing UIDs and a filesystem between nodes, finer-grained security.
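As a taste of the finer-grained security topic: instead of the broad ALLOW_WRITE grant on node0, you could grant node1 only the advertise operations it actually performs, matching the two access levels seen in the CollectorLog. A sketch, assuming the per-level ALLOW_* knobs in this version:

```
# Hypothetical narrower alternative to the ALLOW_WRITE change on node0
ALLOW_ADVERTISE_STARTD = $(ALLOW_ADVERTISE_STARTD), node1.local
ALLOW_ADVERTISE_MASTER = $(ALLOW_ADVERTISE_MASTER), node1.local
```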
Notes:
0) By default, ALLOW_WRITE should include $(CONDOR_HOST)
1) ShouldTransferFiles should default to IF_NEEDED
2) WhenToTransferOutput should default to ON_EXIT
Edit – changed 50root.conf to 40root.config
Tags: Condor, Multiple nodes, Pool, Setup
June 21, 2011 at 1:36 pm
[…] a Condor pool with no firewalls up is quite a simple task. Before the condor_shared_port daemon, doing the same with firewalls was a bit […]
June 22, 2011 at 8:57 am
interesting. planning on writing anything about how to manage resource limits?
October 7, 2011 at 3:32 pm
Your writing style is outstanding and this article was invaluable in getting me past a setup problem. I sure wish the Condor documentation was so well written. Thanks!
November 10, 2011 at 7:53 am
[…] condor on the instance is very similar to creating a multiple node pool. You will need to set the CONDOR_HOST, ALLOW_WRITE, and […]
April 24, 2012 at 7:00 am
Hi,
I am new to condor. I have installed condor on my two VMs (VM#1, VM#2) as personal condor setups. They are working fine and I can even execute jobs. But when I try to add VM#2 to VM#1 to increase the pool, I cannot. The steps I followed:
I am not the root user. I am working as condor user.
I made changes to /home/condor/condor-xx/etc/condor_config.
The changes I made:
1. COLLECTOR_HOST = IP of VM#1 (as there is no "CONDOR_HOST" entry in the config)
2. DAEMON_LIST = MASTER, STARTD
Done
3. ALLOW_WRITE = IP of VM#1
After saving the condor_config, I tried to execute
1. condor_reconfig
Status: Ok
2. condor_status
Status: error
I am getting this error:
Error: communication error
CEDAR:6001:Failed to connect to VM#1
But I do not understand the reason. It would be great if you could advise me on how to fix the problem.
April 24, 2012 at 8:07 am
First, you need to make sure the two VMs can talk to one another (ping, ssh, traceroute, look in VM#1’s CollectorLog for connection attempts), and the instructions above assume the firewall is off.
Second, since you don’t have a CONDOR_HOST, I’m not sure what version of Condor you have or what your default configuration might be. You will have to make sure ALLOW_READ on VM#1 will allow VM#2. The CollectorLog on VM#1 should tell you if and why it is denying access from VM#2.
I suggest starting with Fedora’s condor RPM and using LOCAL_CONFIG_DIR. Without root you can rpm2cpio | cpio -id and you will have to fix up the RELEASE_DIR & LOCAL_DIR & LOG & ETC & RUN params in etc/condor/condor_config.
April 24, 2012 at 11:49 am
Thanks spinningmatt for your prompt reply. I am using condor 7.6.5.
$CondorVersion: 7.6.5 Dec 27 2011 BuildID: 397396 $
$CondorPlatform: x86_64_deb_5.0 $
The issue is fixed.
My problem was that port 9618 was closed. To check that, I used ping and lsof -i -n | egrep 'COMMAND|LISTEN'; both were working, but nc -z 137.43.92.184 9618 failed, so I asked the root user to open the port. It worked.
April 24, 2012 at 11:04 pm
If you have to deal with firewalls you should look at the condor_shared_port daemon described in,
September 12, 2012 at 3:23 pm
Hello SpinningMatt:
Is it possible to specify a pre-render or post-render script in a submit script?
Thanks
/Biju
October 3, 2012 at 7:30 am
Yes, via +PreCmd and +PostCmd; see man condor_submit. PostCmd appears to work in 7.8 and to be broken in 7.6.
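A minimal sketch of what that might look like in a submit file. The setup.sh and teardown.sh scripts are hypothetical, and the assumption here is that PreCmd/PostCmd run from the job's sandbox, so the scripts need to be transferred with the job:

```
cmd = /bin/sleep
args = 10
+PreCmd = "setup.sh"
+PostCmd = "teardown.sh"
transfer_input_files = setup.sh, teardown.sh
should_transfer_files = yes
when_to_transfer_output = on_exit
queue
```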
March 20, 2013 at 4:43 pm
This tutorial was very helpful – well written and focused on a single, important goal.
I’m thinking it’d be worthwhile having a wiki for Condor users…
April 2, 2013 at 1:45 pm
One clarification: with Condor 7.9.4 I created the /etc/condor/config.d/40root.config file, but the StartLog indicated that this config file gets read before /etc/condor/condor_config.local. That latter file sets the DAEMON_LIST attribute and thus overrides the setting in 40root.config.
There are multiple ways to fix this – I chose to simply comment out the DAEMON_LIST entry in the local config file.
Is it possible that the search order for config files, and/or the default DAEMON_LIST setting for the local config file changed in v7.9?
April 2, 2013 at 3:10 pm
One interesting twist is that these jobs sleep for one day, so they don't go away easily. I discovered the condor_drain and condor_vacate_job commands, but you end up with the jobs remaining in the queue in "Idle" state.
Good opportunity to learn more about condor admin :-)
and BTW, thanks, spinningmatt, for this article, it got me running and helped understand the best-practices of Condor.
December 11, 2013 at 5:35 am
Hi,
It seems I have set up my condor nodes nicely with this tutorial, though there seems to be one problem. When running jobs, no job will run on the other nodes, i.e. all the jobs run on the local machine. condor_status shows both machines, I have condor working on both of them, etc.
What am I doing wrong?
December 11, 2013 at 6:31 am
There are many possible issues, please ask on the htcondor-users mailing list – https://lists.cs.wisc.edu/mailman/listinfo/htcondor-users