Getting started: Condor and EC2 – EC2 execute node

We have been over starting and managing instances from Condor, using condor_ec2_q to help, and importing existing instances. Here we will cover extending an existing pool with execute nodes running on EC2 instances. We will start with an existing pool, create an EC2 instance, configure the instance to run condor, authorize the instance to join the existing pool, and run a job.

Let us pretend that the node running your existing pool’s condor_collector and condor_schedd is called condor.condorproject.org.

These instructions require bi-directional connectivity between condor.condorproject.org and your EC2 instance. condor.condorproject.org must be connected to the internet with a publicly routable address, and it cannot be behind a NAT or firewall. The EC2 execute nodes have to be able to connect to condor.condorproject.org to talk to the condor_collector and condor_schedd, so ports must be open in its firewall for the Collector and Schedd. Okay, let’s start.
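As a concrete sketch of the firewall side on condor.condorproject.org: the Collector listens on TCP port 9618 by default, but the Schedd normally picks an ephemeral port, so pinning it makes a firewall rule possible. The port 9615 and the config file name below are arbitrary, assumed choices, not something from the pool above.

```shell
# On condor.condorproject.org, as root. 9618 is the Collector's
# default port; 9615 for the Schedd is an assumed choice, pinned
# via SCHEDD_ARGS so a firewall rule can match it.
echo "SCHEDD_ARGS = -p 9615" >> /etc/condor/config.d/00central_manager.config
iptables -I INPUT -p tcp --dport 9618 -j ACCEPT
iptables -I INPUT -p tcp --dport 9615 -j ACCEPT
```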

I am going to use ami-60bd4609, a publicly available Fedora 15 AMI. You can either start the instance via the AWS console, or submit it by following previous instructions.

Once the instance is up and running, log in and sudo yum install condor. Note, until BZ656562 is resolved, you will have to sudo mkdir /var/run/condor; sudo chown condor:condor /var/run/condor before starting condor. Start condor with sudo service condor start to get a personal condor.
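Collected in one place, the setup steps above look like this, run as root on the instance:

```shell
# Install condor from the Fedora repositories
yum install -y condor
# Workaround for BZ656562: create the run directory condor expects
mkdir -p /var/run/condor
chown condor:condor /var/run/condor
# Start a personal condor
service condor start
```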

Configuring condor on the instance is very similar to creating a multi-node pool. You will need to set CONDOR_HOST, DAEMON_LIST, and ALLOW_WRITE,

# cat > /etc/condor/config.d/40execute_node.config
CONDOR_HOST = condor.condorproject.org
DAEMON_LIST = MASTER, STARTD
ALLOW_WRITE = $(ALLOW_WRITE), $(CONDOR_HOST)
^D

If you do not give condor.condorproject.org WRITE permission, the Schedd will fail to start jobs. The StartLog will report,

PERMISSION DENIED to unauthenticated@unmapped from host 128.105.291.82 for command 442 (REQUEST_CLAIM), access level DAEMON: reason: DAEMON authorization policy contains no matching ALLOW entry for this request; identifiers used for this host: 128.105.291.82,condor.condorproject.org, hostname size = 1, original ip address = 128.105.291.82

Now remember, we need bi-directional connectivity, so condor.condorproject.org must be able to connect to the EC2 instance’s Startd. The condor_startd will listen on an ephemeral port by default. You could restrict it to a port range or use condor_shared_port. For simplicity, just force a non-ephemeral port of 3131,

# echo "STARTD_ARGS = -p 3131" >> /etc/condor/config.d/40execute_node.config
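For reference, the port-range alternative mentioned above might look like the following, using condor’s LOWPORT/HIGHPORT knobs. The 9600–9700 range is an arbitrary choice, and the same range would need to be opened in the firewall and security group.

```
# Alternative to a fixed startd port: constrain all daemons
# to a port range (arbitrary range shown)
LOWPORT = 9600
HIGHPORT = 9700
```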

You may need to open TCP port 3131 in the instance’s iptables firewall; if you are using the Fedora 15 AMI, the firewall is off by default and needs no adjustment. Additionally, TCP port 3131 must be authorized in the instance’s security group. Use the AWS Console or ec2-authorize GROUP -p 3131.
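Sketched with the ec2-api-tools, assuming a security group named default:

```shell
# Authorize TCP port 3131 in the instance's security group
# ("default" is an assumed group name; use your instance's group)
ec2-authorize default -P tcp -p 3131
# And, only if the instance's iptables firewall is enabled:
iptables -I INPUT -p tcp --dport 3131 -j ACCEPT
```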

If you miss either of these steps, the Schedd will fail to start jobs on the instance, likely with a message similar to,

Failed to send REQUEST_CLAIM to startd ec2-174-129-47-20.compute-1.amazonaws.com <174.129.47.20:3131>#1220911452#1#... for matt: SECMAN:2003:TCP connection to startd ec2-174-129-47-20.compute-1.amazonaws.com <174.129.47.20:3131>#1220911452#1#... for matt failed.

A quick service condor restart on the instance and a condor_status on condor.condorproject.org would hopefully show that the instance has joined the pool. Except the instance has not been authorized yet. In fact, the CollectorLog will probably report,

PERMISSION DENIED to unauthenticated@unmapped from host 174.129.47.20 for command 0 (UPDATE_STARTD_AD), access level ADVERTISE_STARTD: reason: ADVERTISE_STARTD authorization policy contains no matching ALLOW entry for this request; identifiers used for this host: 174.129.47.20,ec2-174-129-47-20.compute-1.amazonaws.com
PERMISSION DENIED to unauthenticated@unmapped from host 174.129.47.20 for command 2 (UPDATE_MASTER_AD), access level ADVERTISE_MASTER: reason: ADVERTISE_MASTER authorization policy contains no matching ALLOW entry for this request; identifiers used for this host: 174.129.47.20,ec2-174-129-47-20.compute-1.amazonaws.com

The instance needs to be authorized to advertise itself into the Collector. A good way to do that is to add,

ALLOW_ADVERTISE_MASTER = $(ALLOW_WRITE), ec2-174-129-47-20.compute-1.amazonaws.com
ALLOW_ADVERTISE_STARTD = $(ALLOW_WRITE), ec2-174-129-47-20.compute-1.amazonaws.com

to condor.condorproject.org’s configuration and reconfig with condor_reconfig. A note here: ALLOW_WRITE is included because I am assuming you are following previous instructions. If you already have ALLOW_ADVERTISE_MASTER/STARTD configured, append to them instead. Also, appending for each new instance will get tedious. You could be very trusting and allow *.amazonaws.com, but it is better to use SSL or PASSWORD authentication. I will describe that some other time.
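For completeness, the trusting wildcard variant mentioned above would look like this on condor.condorproject.org; again, SSL or PASSWORD authentication is the better long-term answer.

```
ALLOW_ADVERTISE_MASTER = $(ALLOW_WRITE), *.amazonaws.com
ALLOW_ADVERTISE_STARTD = $(ALLOW_WRITE), *.amazonaws.com
```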

After the reconfig, the instance will eventually show up in a condor_status listing.

$ condor_status
Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime
localhost.localdom LINUX      INTEL  Unclaimed Benchmar 0.430  1666  0+00:00:04

The name is not very helpful, but also not a problem.

It is time to submit a job.

$ condor_submit
Submitting job(s)
cmd = /bin/sleep
args = 1d
should_transfer_files = IF_NEEDED
when_to_transfer_output = ON_EXIT
queue
^D
1 job(s) submitted to cluster 31.

$ condor_q
-- Submitter: condor.condorproject.org : <128.105.291.82:36900> : condor.condorproject.org
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
  31.0   matt           11/11 11:11   0+00:00:00 I  0   0.0  sleep 1d
1 jobs; 1 idle, 0 running, 0 held

The job will stay idle forever, which is no good. The problem can be found in the SchedLog,

Enqueued contactStartd startd=<10.72.55.105:3131>
In checkContactQueue(), args = 0x9705798, host=<10.72.55.105:3131>
Requesting claim localhost.localdomain <10.72.55.105:3131>#1220743035#1#... for matt 31.0
attempt to connect to <10.72.55.105:3131> failed: Connection timed out (connect errno = 110).  Will keep trying for 45 total seconds (24 to go).

The root cause is that the instance has two addresses. A private one, which it is advertising and which is not routable from condor.condorproject.org,

$ condor_status -format "%s, " Name -format "%s\n" MyAddress
localhost.localdomain, <10.72.55.105:3131>

And a public one, which can be found from within the instance,

$ curl -f http://169.254.169.254/latest/meta-data/public-ipv4
174.129.47.20

Condor has a way to handle this. The TCP_FORWARDING_HOST configuration parameter can be set to the public address for the instance.

# echo "TCP_FORWARDING_HOST = $(curl -f http://169.254.169.254/latest/meta-data/public-ipv4)" >> /etc/condor/config.d/40execute_node.config

A condor_reconfig will apply the change, but a restart is preferable because it also clears out the old, private-address entry. Note, you cannot set TCP_FORWARDING_HOST to the public-hostname of the instance, because the public hostname, when resolved within the instance, resolves to the instance’s internal, private address.

When setting TCP_FORWARDING_HOST, also set PRIVATE_NETWORK_INTERFACE to let the host talk to itself over its private address.

# echo "PRIVATE_NETWORK_INTERFACE = $(curl -f http://169.254.169.254/latest/meta-data/local-ipv4)" >> /etc/condor/config.d/40execute_node.config

Doing so will prevent the condor_startd from using its public address to send DC_CHILDALIVE messages to the condor_master, which might fail because of a firewall or security group setting,

attempt to connect to <174.129.47.20:34550> failed: Connection timed out (connect errno = 110).  Will keep trying for 390 total seconds (200 to go).
attempt to connect to <174.129.47.20:34550> failed: Connection timed out (connect errno = 110).
ChildAliveMsg: failed to send DC_CHILDALIVE to parent daemon at <174.129.47.20:34550> (try 1 of 3): CEDAR:6001:Failed to connect to <174.129.47.20:34550>

Or simply because the master does not trust the public address,

PERMISSION DENIED to unauthenticated@unmapped from host 174.129.47.20 for command 60008 (DC_CHILDALIVE), access level DAEMON: reason: DAEMON authorization policy contains no matching ALLOW entry for this request; identifiers used for this host: 174.129.47.20,ec2-174-129-47-20.compute-1.amazonaws.com, hostname size = 1, original ip address = 174.129.47.20

Now run that service condor restart and the public, routable address will be advertised,

$ condor_status -format "%s, " Name -format "%s\n" MyAddress
localhost.localdomain, <174.129.47.20:3131?noUDP>

The job will be started on the instance automatically,

$ condor_q -run
-- Submitter: condor.condorproject.org : <128.105.291.82:36900> : condor.condorproject.org
 ID      OWNER            SUBMITTED     RUN_TIME HOST(S)
  31.0   matt           11/11 11:11   0+00:00:11 localhost.localdomain

If you want to clean up the localhost.localdomain name, set the instance’s hostname and restart condor,

$ sudo hostname $(curl -f http://169.254.169.254/latest/meta-data/public-hostname)
$ sudo service condor restart
(wait for the startd to advertise)
$ condor_status -format "%s, " Name -format "%s\n" MyAddress
ec2-174-129-47-20.compute-1.amazonaws.com, <174.129.47.20:3131?noUDP>

In summary,

Configuration changes on condor.condorproject.org,

ALLOW_ADVERTISE_MASTER = $(ALLOW_WRITE), ec2-174-129-47-20.compute-1.amazonaws.com
ALLOW_ADVERTISE_STARTD = $(ALLOW_WRITE), ec2-174-129-47-20.compute-1.amazonaws.com

Setup on the instance,

# cat > /etc/condor/config.d/40execute_node.config
CONDOR_HOST = condor.condorproject.org
DAEMON_LIST = MASTER, STARTD
ALLOW_WRITE = $(ALLOW_WRITE), $(CONDOR_HOST)
STARTD_ARGS = -p 3131
^D
# echo "TCP_FORWARDING_HOST = $(curl -f http://169.254.169.254/latest/meta-data/public-ipv4)" >> /etc/condor/config.d/40execute_node.config
# echo "PRIVATE_NETWORK_INTERFACE = $(curl -f http://169.254.169.254/latest/meta-data/local-ipv4)" >> /etc/condor/config.d/40execute_node.config
# hostname $(curl -f http://169.254.169.254/latest/meta-data/public-hostname)
