Archive for November, 2011

Custom resource attributes: Facter

November 29, 2011

Condor provides a large set of attributes, or facts, about resources for scheduling and querying, but it does not provide everything possible. Instead, there is a mechanism to extend the set. Previously, we added FreeMemoryMB. The set can also be extended with information from Facter.

Facter provides an extensible set of facts about a system. To include Facter facts, we need a way to translate them into attributes and add them to the Startd configuration.

$ facter
...
architecture => x86_64
domain => local
facterversion => 1.5.9
hardwareisa => x86_64
hardwaremodel => x86_64
physicalprocessorcount => 1
processor0 => Intel(R) Core(TM) i7 CPU       M 620  @ 2.67GHz
selinux => true
selinux_config_mode => enforcing
swapfree => 3.98 GB
swapsize => 4.00 GB
...

The facts are of the form name => value, not very far off from ClassAd attributes. A simple script to convert all the facts into attributes with string values is,

/usr/libexec/condor/facter.sh

#!/bin/sh
# Bail out if facter is not installed.
type facter > /dev/null 2>&1 || exit 1
# Rewrite each "name => value" fact as a ClassAd attribute with a string value.
facter | sed 's/\([^ ]*\) => \(.*\)/facter_\1 = "\2"/'

$ /usr/libexec/condor/facter.sh
...
facter_architecture = "x86_64"
facter_domain = "local"
facter_facterversion = "1.5.9"
facter_hardwareisa = "x86_64"
facter_hardwaremodel = "x86_64"
facter_physicalprocessorcount = "1"
facter_processor0 = "Intel(R) Core(TM) i7 CPU       M 620  @ 2.67GHz"
facter_selinux = "true"
facter_selinux_config_mode = "enforcing"
facter_swapfree = "3.98 GB"
facter_swapsize = "4.00 GB"
...

And the configuration, simply dropped into /etc/condor/config.d,

/etc/condor/config.d/49facter.config

FACTER = /usr/libexec/condor/facter.sh
STARTD_CRON_JOBLIST = $(STARTD_CRON_JOBLIST) FACTER
STARTD_CRON_FACTER_EXECUTABLE = $(FACTER)
STARTD_CRON_FACTER_PERIOD = 300

A condor_reconfig and the facter facts will be available,

$ condor_status -long | grep ^facter
...
facter_architecture = "x86_64"
facter_facterversion = "1.5.9"
facter_domain = "local"
facter_swapfree = "3.98 GB"
facter_selinux = "true"
facter_hardwaremodel = "x86_64"
facter_selinux_config_mode = "enforcing"
facter_processor0 = "Intel(R) Core(TM) i7 CPU       M 620  @ 2.67GHz"
facter_selinux_mode = "targeted"
facter_hardwareisa = "x86_64"
facter_swapsize = "4.00 GB"
facter_physicalprocessorcount = "1"
...

For scheduling, just use the facter information in job requirements, e.g. requirements = facter_selinux == "true".
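
For instance, a minimal sketch of a submit file that only matches machines reporting SELinux enabled; the sleep command and arguments are just placeholders,

cmd = /bin/sleep
args = 1d
requirements = facter_selinux == "true"
queue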

Or, query your pool to see what resources are not running selinux,

$ condor_status -const 'facter_selinux == "false"'
Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime
eeyore.local       LINUX      X86_64 Unclaimed Idle     0.030  3760  0+00:12:31
                     Machines Owner Claimed Unclaimed Matched Preempting
        X86_64/LINUX        1     0       0         1       0          0
               Total        1     0       0         1       0          0

Oops.

Getting started: Condor and EC2 – EC2 execute node

November 10, 2011

We have been over starting and managing instances from Condor, using condor_ec2_q to help, and importing existing instances. Here we will cover extending an existing pool using execute nodes run from EC2 instances. We will start with an existing pool, create an EC2 instance, configure the instance to run condor, authorize the instance to join the existing pool, and run a job.

Let us pretend that the node running your existing pool’s condor_collector and condor_schedd is called condor.condorproject.org.

These instructions will require bi-directional connectivity between condor.condorproject.org and your EC2 instance. condor.condorproject.org must be connected to the internet with a publicly routable address, and ports must be open in its firewall for the Collector and Schedd. The EC2 execute nodes have to be able to connect to condor.condorproject.org to talk to the condor_collector and condor_schedd, so it cannot be behind a NAT or a blocking firewall. Okay, let’s start.
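
For concreteness, here is a sketch of what that might look like on condor.condorproject.org. None of this is from the original setup: 9618 is the Collector’s default port, 9615 is an arbitrary port chosen to pin the Schedd (which otherwise listens on an ephemeral port, using the same -p trick applied to the Startd below), and the config file name is made up,

# echo "SCHEDD_ARGS = -p 9615" >> /etc/condor/config.d/40schedd_port.config
# service condor restart
# iptables -I INPUT -p tcp --dport 9618 -j ACCEPT
# iptables -I INPUT -p tcp --dport 9615 -j ACCEPT
# service iptables save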

I am going to use ami-60bd4609, a publicly available Fedora 15 AMI. You can either start the instance via the AWS console, or submit it by following previous instructions.

Once the instance is up and running, login and sudo yum install condor. Note, until BZ656562 is resolved, you will have to sudo mkdir /var/run/condor; sudo chown condor.condor /var/run/condor before starting condor. Start condor with sudo service condor start to get a personal condor.
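
Spelled out, the install, the BZ656562 workaround, and the first start look like,

$ sudo yum install condor
$ sudo mkdir /var/run/condor
$ sudo chown condor.condor /var/run/condor
$ sudo service condor start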

Configuring condor on the instance is very similar to creating a multiple-node pool. You will need to set CONDOR_HOST, DAEMON_LIST, and ALLOW_WRITE,

# cat > /etc/condor/config.d/40execute_node.config
CONDOR_HOST = condor.condorproject.org
DAEMON_LIST = MASTER, STARTD
ALLOW_WRITE = $(ALLOW_WRITE), $(CONDOR_HOST)
^D

If you do not give condor.condorproject.org WRITE permissions, the Schedd will fail to start jobs. StartLog will report,

PERMISSION DENIED to unauthenticated@unmapped from host 128.105.291.82 for command 442 (REQUEST_CLAIM), access level DAEMON: reason: DAEMON authorization policy contains no matching ALLOW entry for this request; identifiers used for this host: 128.105.291.82,condor.condorproject.org, hostname size = 1, original ip address = 128.105.291.82

Now remember, we need bi-directional connectivity. So condor.condorproject.org must be able to connect to the EC2 instance’s Startd. The condor_startd will listen on an ephemeral port by default. You could restrict it to a port range (a sketch follows the command below) or use condor_shared_port. For simplicity, just force a non-ephemeral port of 3131,

# echo "STARTD_ARGS = -p 3131" >> /etc/condor/config.d/40execute_node.config

You can now open TCP port 3131 in the instance’s iptables firewall. If you are using the Fedora 15 AMI, the firewall is off by default and needs no adjustment. Additionally, the security group on the instance needs to have TCP port 3131 authorized. Use the AWS Console or ec2-authorize GROUP -p 3131.
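
Concretely, on an instance with the firewall enabled, and from a machine with the EC2 API tools configured, that might look like the following sketch, where default is a placeholder for whatever security group the instance was launched in,

# iptables -I INPUT -p tcp --dport 3131 -j ACCEPT
# service iptables save
$ ec2-authorize default -p 3131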

If you miss either of these steps, the Schedd will fail to start jobs on the instance, likely with a message similar to,

Failed to send REQUEST_CLAIM to startd ec2-174-129-47-20.compute-1.amazonaws.com <174.129.47.20:3131>#1220911452#1#... for matt: SECMAN:2003:TCP connection to startd ec2-174-129-47-20.compute-1.amazonaws.com <174.129.47.20:3131>#1220911452#1#... for matt failed.

A quick service condor restart on the instance, and a condor_status on condor.condorproject.org would hopefully show that the instance has joined the pool. Except the instance has not been authorized yet. In fact, the CollectorLog will probably report,

PERMISSION DENIED to unauthenticated@unmapped from host 174.129.47.20 for command 0 (UPDATE_STARTD_AD), access level ADVERTISE_STARTD: reason: ADVERTISE_STARTD authorization policy contains no matching ALLOW entry for this request; identifiers used for this host: 174.129.47.20,ec2-174-129-47-20.compute-1.amazonaws.com
PERMISSION DENIED to unauthenticated@unmapped from host 174.129.47.20 for command 2 (UPDATE_MASTER_AD), access level ADVERTISE_MASTER: reason: ADVERTISE_MASTER authorization policy contains no matching ALLOW entry for this request; identifiers used for this host: 174.129.47.20,ec2-174-129-47-20.compute-1.amazonaws.com

The instance needs to be authorized to advertise itself into the Collector. A good way to do that is to add,

ALLOW_ADVERTISE_MASTER = $(ALLOW_WRITE), ec2-174-129-47-20.compute-1.amazonaws.com
ALLOW_ADVERTISE_STARTD = $(ALLOW_WRITE), ec2-174-129-47-20.compute-1.amazonaws.com

to condor.condorproject.org’s configuration and reconfig with condor_reconfig. A note here: ALLOW_WRITE is included because I am assuming you are following previous instructions. If you already have ALLOW_ADVERTISE_MASTER/STARTD configured, you should append to them instead. Also, appending for each new instance will get tedious. You could be very trusting and allow *.amazonaws.com, but it is better to use SSL or PASSWORD authentication. I will describe that some other time.
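
For reference, the very trusting wildcard approach would simply be,

ALLOW_ADVERTISE_MASTER = $(ALLOW_WRITE), *.amazonaws.com
ALLOW_ADVERTISE_STARTD = $(ALLOW_WRITE), *.amazonaws.com

Again, this is only a sketch of the option mentioned above; SSL or PASSWORD authentication is the better route.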

After the reconfig, the instance will eventually show up in a condor_status listing.

$ condor_status
Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime
localhost.localdom LINUX      INTEL  Unclaimed Benchmar 0.430  1666  0+00:00:04

The name is not very helpful, but also not a problem.

It is time to submit a job.

$ condor_submit
Submitting job(s)
cmd = /bin/sleep
args = 1d
should_transfer_files = IF_NEEDED
when_to_transfer_output = ON_EXIT
queue
^D
1 job(s) submitted to cluster 31.

$ condor_q
-- Submitter: condor.condorproject.org : <128.105.291.82:36900> : condor.condorproject.org
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
  31.0   matt           11/11 11:11   0+00:00:00 I  0   0.0  sleep 1d
1 jobs; 1 idle, 0 running, 0 held

The job will stay idle forever, which is no good. The problem can be found in the SchedLog,

Enqueued contactStartd startd=<10.72.55.105:3131>
In checkContactQueue(), args = 0x9705798, host=<10.72.55.105:3131>
Requesting claim localhost.localdomain <10.72.55.105:3131>#1220743035#1#... for matt 31.0
attempt to connect to <10.72.55.105:3131> failed: Connection timed out (connect errno = 110).  Will keep trying for 45 total seconds (24 to go).

The root cause is that the instance has two addresses: a private one, which it is advertising but which is not routable from condor.condorproject.org,

$ condor_status -format "%s, " Name -format "%s\n" MyAddress
localhost.localdomain, <10.72.55.105:3131>

And a public one, which can be found from within the instance,

$ curl -f http://169.254.169.254/latest/meta-data/public-ipv4
174.129.47.20

Condor has a way to handle this. The TCP_FORWARDING_HOST configuration parameter can be set to the public address for the instance.

# echo "TCP_FORWARDING_HOST = $(curl -f http://169.254.169.254/latest/meta-data/public-ipv4)" >> /etc/condor/config.d/40execute_node.config

A condor_reconfig will apply the change, but a restart will clear out the old advertised entry first. Oops. Note, you cannot set TCP_FORWARDING_HOST to the public-hostname of the instance, because within the instance the public hostname resolves to the instance’s internal, private address.

When setting TCP_FORWARDING_HOST, also set PRIVATE_NETWORK_INTERFACE to let the host talk to itself over its private address.

# echo "PRIVATE_NETWORK_INTERFACE = $(curl -f http://169.254.169.254/latest/meta-data/local-ipv4)" >> /etc/condor/config.d/40execute_node.config

Doing so will prevent the condor_startd from using its public address to send DC_CHILDALIVE messages to the condor_master, which might fail because of a firewall or security group setting,

attempt to connect to <174.129.47.20:34550> failed: Connection timed out (connect errno = 110).  Will keep trying for 390 total seconds (200 to go).
attempt to connect to <174.129.47.20:34550> failed: Connection timed out (connect errno = 110).
ChildAliveMsg: failed to send DC_CHILDALIVE to parent daemon at <174.129.47.20:34550> (try 1 of 3): CEDAR:6001:Failed to connect to <174.129.47.20:34550>

Or fail simply because the master does not trust the public address,

PERMISSION DENIED to unauthenticated@unmapped from host 174.129.47.20 for command 60008 (DC_CHILDALIVE), access level DAEMON: reason: DAEMON authorization policy contains no matching ALLOW entry for this request; identifiers used for this host: 174.129.47.20,ec2-174-129-47-20.compute-1.amazonaws.com, hostname size = 1, original ip address = 174.129.47.20

Now run that service condor restart and the public, routable address will be advertised,

$ condor_status -format "%s, " Name -format "%s\n" MyAddress
localhost.localdomain, <174.129.47.20:3131?noUDP>

The job will be started on the instance automatically,

$ condor_q -run
-- Submitter: condor.condorproject.org : <128.105.291.82:36900> : condor.condorproject.org
 ID      OWNER            SUBMITTED     RUN_TIME HOST(S)
  31.0   matt           11/11 11:11   0+00:00:11 localhost.localdomain

If you want to clean up the localhost.localdomain name, set the instance’s hostname and restart condor,

$ sudo hostname $(curl -f http://169.254.169.254/latest/meta-data/public-hostname)
$ sudo service condor restart
(wait for the startd to advertise)
$ condor_status -format "%s, " Name -format "%s\n" MyAddress
ec2-174-129-47-20.compute-1.amazonaws.com, <174.129.47.20:3131?noUDP>

In summary,

Configuration changes on condor.condorproject.org,

ALLOW_ADVERTISE_MASTER = $(ALLOW_WRITE), ec2-174-129-47-20.compute-1.amazonaws.com
ALLOW_ADVERTISE_STARTD = $(ALLOW_WRITE), ec2-174-129-47-20.compute-1.amazonaws.com

Setup on the instance,

# cat > /etc/condor/config.d/40execute_node.config
CONDOR_HOST = condor.condorproject.org
DAEMON_LIST = MASTER, STARTD
ALLOW_WRITE = $(ALLOW_WRITE), $(CONDOR_HOST)
STARTD_ARGS = -p 3131
^D
# echo "TCP_FORWARDING_HOST = $(curl -f http://169.254.169.254/latest/meta-data/public-ipv4)" >> /etc/condor/config.d/40execute_node.config
# echo "PRIVATE_NETWORK_INTERFACE = $(curl -f http://169.254.169.254/latest/meta-data/local-ipv4)" >> /etc/condor/config.d/40execute_node.config
# hostname $(curl -f http://169.254.169.254/latest/meta-data/public-hostname)

Getting started: Condor and EC2 – Importing instances with condor_ec2_link

November 7, 2011

Starting and managing instances describes Condor’s powerful ability to start and manage EC2 instances, but what if you are already using something other than Condor to start your instances, such as the AWS Management Console?

Importing instances turns out to be straightforward, if you know how instances are started. In a nutshell, the condor_gridmanager executes a state machine and records its current state in an attribute named GridJobId. To import an instance, submit a job that is already in the state where an instance id has been assigned. You can take a submit file and add +GridJobId = "ec2 https://ec2.amazonaws.com/ BOGUS INSTANCE-ID". The INSTANCE-ID needs to be the actual identifier of the instance you want to import. For instance,

...
ec2_access_key_id = ...
ec2_secret_access_key = ...
...
+GridJobId = "ec2 https://ec2.amazonaws.com/ BOGUS i-319c3652"
queue
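
For a fuller picture, here is a sketch of what a complete import submit file might look like. The AMI id, instance type, and instance id are taken from the example later in this post, the executable is just a label, and the key file paths are placeholders; the condor_ec2_link script below generates exactly this kind of file,

universe = grid
grid_resource = ec2 https://ec2.amazonaws.com/
executable = ec2_import_i-319c3652
log = $(executable).$(cluster).log
ec2_access_key_id = /path/to/AccessKeyID
ec2_secret_access_key = /path/to/SecretAccessKey
ec2_ami_id = ami-e1f53a88
ec2_instance_type = t1.micro
+EC2InstanceName = "i-319c3652"
+GridJobId = "ec2 https://ec2.amazonaws.com/ BOGUS i-319c3652"
queue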

It is important to get the ec2_access_key_id and ec2_secret_access_key correct. Without them, Condor will not be able to communicate with EC2, and the EC2_GAHP_LOG will report,

$ tail -n2 $(condor_config_val EC2_GAHP_LOG)
11/11/11 11:11:11 Failure response text was '<Response><Errors><Error><Code>AuthFailure</Code><Message>AWS was not able to validate the provided access credentials</Message></Error></Errors><RequestID>ab50f005-6d77-4653-9cec-298b2d475f6e</RequestID></Response>'.

This error will not be reported back into the job by putting it on hold; instead, the gridmanager will think EC2 is down for the job. Oops.

$ grep down $(condor_config_val GRIDMANAGER_LOG)
11/11/11 11:11:11 [10697] resource https://ec2.amazonaws.com is now down
11/11/11 11:14:22 [10697] resource https://ec2.amazonaws.com is still down

To simplify the import, here is a script that will use ec2-describe-instances to get useful metadata about the instance and populate a submit file for you,

condor_ec2_link

#!/bin/sh

# Provide three arguments:
#  . instance id to link
#  . path to file with access key id
#  . path to file with secret access key

# TODO:
#  . Get EC2UserData (ec2-describe-instance-attribute --user-data)

ec2-describe-instances --show-empty-fields $1 | \
   awk '/^INSTANCE/ {id=$2; ami=$3; keypair=$7; type=$10; zone=$12; ip=$17; group=$29}
        /^TAG/ {name=$5}
        END {print "universe = grid\n",
                   "grid_resource = ec2 https://ec2.amazonaws.com\n",
                   "executable =", ami"-"name, "\n",
                   "log = $(executable).$(cluster).log\n",
                   "ec2_ami_id =", ami, "\n",
                   "ec2_instance_type =", type, "\n",
                   "ec2_keypair_file = name-"keypair, "\n",
                   "ec2_security_groups =", group, "\n",
                   "ec2_availability_zone =", zone, "\n",
                   "ec2_elastic_ip =", ip, "\n",
                   "+EC2InstanceName = \""id"\"\n",
                   "+GridJobId = \"$(grid_resource) BOGUS", id, "\"\n",
                   "queue\n"}' | \
      condor_submit -a "ec2_access_key_id = $2" \
                    -a "ec2_secret_access_key = $3"

In action,

$ ./condor_ec2_link i-319c3652 /home/matt/Documents/AWS/Cert/AccessKeyID /home/matt/Documents/AWS/Cert/SecretAccessKey
Submitting job(s).
1 job(s) submitted to cluster 1739.

$ ./condor_ec2_q 1739
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
1739.0   matt           11/11 11:11   0+00:00:00 I  0   0.0 ami-e1f53a88-TheNa
  Instance name: i-319c3652
  Groups: sg-4f706226
  Keypair file: /home/matt/Documents/AWS/name-TheKeyPair
  AMI id: ami-e1f53a88
  Instance type: t1.micro
1 jobs; 1 idle, 0 running, 0 held

(20 seconds later)

$ ./condor_ec2_q 1739
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
1739.0   matt           11/11 11:11   0+00:00:01 R  0   0.0 ami-e1f53a88-TheNa
  Instance name: i-319c3652
  Hostname: ec2-50-17-104-50.compute-1.amazonaws.com
  Groups: sg-4f706226
  Keypair file: /home/matt/Documents/AWS/name-TheKeyPair
  AMI id: ami-e1f53a88
  Instance type: t1.micro
1 jobs; 0 idle, 1 running, 0 held

There are a few things that could be improved here, the most notable being the RUN_TIME. The Gridmanager periodically gets status data from EC2, which is how EC2RemoteVirtualMachineName (Hostname) gets populated on the job. The instance’s launch time is also available, but it is not used for RUN_TIME. Oops.

Getting started: Condor and EC2 – condor_ec2_q tool

November 2, 2011

While Getting started with Condor and EC2, it is useful to display the EC2-specific attributes on jobs. Here is a script that mirrors condor_q output, using its formatting parameters, and adds details for EC2 jobs.

condor_ec2_q:

#!/bin/sh

# NOTE:
#  . Requires condor_q >= 7.5.2, old classads do not
#    have %
#  . When running, jobs show RUN_TIME of their current
#    run, not accumulated, which would require adding
#    in RemoteWallClockTime
#  . See condor_utils/condor_q.cpp:encode_status for
#    JobStatus map

# TODO:
#  . Remove extra ShadowBday==0 test,
#    condor_gridmanager < 7.7.5 (3a896d01) did not
#    delete ShadowBday when a job was not running.
#    RUN_TIME of held EC2 jobs would be wrong.

echo ' ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD'
condor_q \
   -format '%4d.' ClusterId \
   -format '%-3d ' ProcId \
   -format '%-14s ' Owner \
   -format '%-11s ' 'formatTime(QDate,"%m/%d %H:%M")' \
   -format '%3d+' 'ifThenElse(ShadowBday =!= UNDEFINED, ifThenElse(ShadowBday != 0, time() - ShadowBday, int(RemoteWallClockTime)), int(RemoteWallClockTime)) / (60*60*24)' \
   -format '%02d:' '(ifThenElse(ShadowBday =!= UNDEFINED, ifThenElse(ShadowBday != 0, time() - ShadowBday, int(RemoteWallClockTime)), int(RemoteWallClockTime)) % (60*60*24)) / (60*60)' \
   -format '%02d:' '(ifThenElse(ShadowBday =!= UNDEFINED, ifThenElse(ShadowBday != 0, time() - ShadowBday, int(RemoteWallClockTime)), int(RemoteWallClockTime)) % (60*60)) / 60' \
   -format '%02d ' 'ifThenElse(ShadowBday =!= UNDEFINED, ifThenElse(ShadowBday != 0, time() - ShadowBday, int(RemoteWallClockTime)), int(RemoteWallClockTime)) % 60' \
   -format '%-2s ' 'substr("?IRXCH>S", JobStatus, 1)' \
   -format '%-3d ' JobPrio \
   -format '%-4.1f ' ImageSize/1024.0 \
   -format '%-18.18s' 'strcat(Cmd," ",Arguments)' \
   -format '\n' Owner \
   -format '  Instance name: %s\n' EC2InstanceName \
   -format '  Hostname: %s\n' EC2RemoteVirtualMachineName \
   -format '  Keypair file: %s\n' EC2KeyPairFile \
   -format '  User data: %s\n' EC2UserData \
   -format '  User data file: %s\n' EC2UserDataFile \
   -format '  AMI id: %s\n' EC2AmiID \
   -format '  Instance type: %s\n' EC2InstanceType \
   "$@" | awk 'BEGIN {St["I"]=0;St["R"]=0;St["H"]=0} \
   	       {St[$6]++; print} \
   	       END {for (i=0;i<=7;i++) jobs+=St[substr("?IRXCH>S",i,1)]; \
	       	    print jobs, "jobs;", \
		          St["I"], "idle,", St["R"], "running,", St["H"], "held"}'

In action,

$ condor_q
-- Submitter: eeyore.local :  : eeyore.local
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
1728.0   matt           10/31 23:09   0+00:04:15 H  0   0.0  EC2_Instance-ami-6
1732.0   matt           11/1  01:43   0+05:16:46 R  0   0.0  EC2_Instance-ami-6
5 jobs; 0 idle, 4 running, 1 held

$ ./condor_ec2_q 
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
1728.0   matt           10/31 23:09   0+00:04:15 H  0   0.0  EC2_Instance-ami-6
  Instance name: i-31855752
  Hostname: ec2-50-19-175-62.compute-1.amazonaws.com
  Keypair file: /home/matt/Documents/AWS/EC2_Instance-ami-60bd4609.1728.pem
  User data: Hello EC2_Instance-ami-60bd4609!
  AMI id: ami-60bd4609
  Instance type: m1.small
1732.0   matt           11/01 01:43   0+05:16:48 R  0   0.0  EC2_Instance-ami-6
  Instance name: i-a90edcca
  Hostname: ec2-107-20-6-83.compute-1.amazonaws.com
  Keypair file: /home/matt/Documents/AWS/EC2_Instance-ami-60bd4609.1732.pem
  User data: Hello EC2_Instance-ami-60bd4609!
  AMI id: ami-60bd4609
  Instance type: m1.small
5 jobs; 0 idle, 4 running, 1 held
