Archive for the ‘EC2’ Category

EC2, VNC and Fedora

January 24, 2012

If you have ever wondered about running a desktop session in EC2, here is one way to set it up and some pointers.

First, start an instance; my preferred way is via Condor. I used ami-60bd4609 on an m1.small, providing a basic Fedora 15 server. Make sure the instance’s security group has port 22 (ssh) open.

Second, install a desktop environment, e.g. yum groupinstall 'GNOME Desktop Environment'. This is 467 packages and will take about 18 minutes.

Third, install and set up a VNC server: yum install vnc-server ; vncpasswd ; vncserver :1. This produces a running desktop that can be contacted by a vncviewer.
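
Spelled out, with a couple of optional flags that help keep the session small (the geometry and depth values are just suggestions):

yum install vnc-server
vncpasswd
vncserver :1 -geometry 1024x768 -depth 16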

Finally, connect via an SSH secured VNC session.

VNC_VIA_CMD='/usr/bin/ssh -i KEYPAIR.pem -l ec2-user -f -L "$L":"$H":"$R" "$G" sleep 20' vncviewer localhost:1 -via INSTANCE_ADDRESS

What’s going on here? vncviewer allows for a proxy host when connecting to the vncserver. That is the -via argument. The VNC_VIA_CMD is an environment variable that specifies the command used to connect to the proxy. Here it is modified to provide the keypair needed to access the instance, and the user ec2-user, which is the default user on Fedora AMIs. The INSTANCE_ADDRESS is the Hostname from condor_ec2_q.

Alternatively, ssh-add KEYPAIR.pem followed by vncviewer localhost:1 -via ec2-user@INSTANCE_ADDRESS. However, be careful if you have many keys stored in your ssh-agent. They will all be tried and the remote sshd may reject your connection before the proper keypair is found.
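
One way around that, assuming OpenSSH, is to make ssh offer only explicitly listed identities by adding -o IdentitiesOnly=yes to the proxy command:

VNC_VIA_CMD='/usr/bin/ssh -o IdentitiesOnly=yes -i KEYPAIR.pem -l ec2-user -f -L "$L":"$H":"$R" "$G" sleep 20' vncviewer localhost:1 -via INSTANCE_ADDRESS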

Tips:

  • It takes about 20 minutes from start to vncviewer. Once the instance is set up, consider creating your own AMI.
  • Set a password for ec2-user, otherwise the screensaver will lock you out. Use sudo passwd ec2-user.
  • Remember AWS charges for data transmitted out of the instance, as well as the uptime of the instance; see EC2 Pricing. You will want to figure out how much bandwidth your workflow takes on average to figure out total cost. For me, a half hour of browsing Planet Fedora, editing with emacs, and compiling some code transmitted about 60MB of data. That measurement is the difference in eth0’s “TX bytes” as reported by ifconfig, a measurement sketched just after these tips. This is not a perfect estimate because there may have been data transferred within EC2, which is not charged.
  • For transmit rates, consider running bwm-ng to see what actions use the most bandwidth.
  • Generally, make the screen update as little as possible. Constantly changing graphics on web pages can run 60-120KB/s. Compare that to a text console and emacs producing a TX rate closer to 5-25KB/s.
  • Cover consoles running compilations, or compile in a low-verbosity mode.
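
Here is a minimal sketch of the TX bytes measurement mentioned in the tips. The tx helper is hypothetical, and the awk field position assumes the net-tools ifconfig output found on Fedora 15:

# hypothetical helper, run on the instance
tx() { /sbin/ifconfig eth0 | awk '/TX bytes/ {split($6,a,":"); print a[2]}'; }
before=$(tx)
# ... browse, edit, compile ...
after=$(tx)
echo "$((after - before)) bytes transmitted"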

Amazon S3 – Object Expiration, what about Instance Expiration

December 28, 2011

AWS is providing APIs that take distributed computing concerns into account. One could call them cloud concerns these days. Unfortunately, not all cloud providers are doing the same.

Idempotent instance creation showed up in Sept 2010, simplifying interactions with EC2. Idempotent resource allocation is critical for distributed systems.
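
For example, with the EC2 API tools, a caller-chosen client token makes RunInstances idempotent; retrying the same call with the same token returns the existing instance instead of launching a duplicate (the token value here is made up):

$ ec2-run-instances ami-60bd4609 -t m1.small --client-token deploy-web-01
$ ec2-run-instances ami-60bd4609 -t m1.small --client-token deploy-web-01

The second call is a no-op, reporting the instance created by the first.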

S3 object expiration appeared in Dec 2011, allowing for service-side managed deallocation of S3 resources.

Next up? It would be great to have an EC2 instance expiration feature. One that could be (0) assigned per instance and (1) adjusted while the instance exists. Bonus if it can also be (2) adjusted from within the instance without credentials. Think leases.
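
Until such a feature exists, something lease-like can be approximated on the Condor side, as the posts below show, with a periodic policy in the submit file, e.g. terminating the instance an hour after it starts,

periodic_remove = (time() - ShadowBday) >= 60*60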

Getting started: Condor and EC2 – EC2 execute node

November 10, 2011

We have been over starting and managing instances from Condor, using condor_ec2_q to help, and importing existing instances. Here we will cover extending an existing pool using execute nodes run from EC2 instances. We will start with an existing pool, create an EC2 instance, configure the instance to run condor, authorize the instance to join the existing pool, and run a job.

Let us pretend that the node running your existing pool’s condor_collector and condor_schedd is called condor.condorproject.org.

These instructions will require bi-directional connectivity between condor.condorproject.org and your EC2 instance. condor.condorproject.org must be connected to the internet with a publicly routable address, and it cannot be behind a NAT or firewall. Also, ports must be open in its firewall for the Collector and Schedd, because the EC2 execute nodes have to be able to connect to condor.condorproject.org to talk to the condor_collector and condor_schedd. Okay, let’s start.
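
As a sketch: the Collector defaults to port 9618, while the Schedd normally picks an ephemeral port, so pin it to a known port (9615 here, an arbitrary choice, as is the config file name) and open both,

# on condor.condorproject.org
echo "SCHEDD_ARGS = -p 9615" >> /etc/condor/config.d/10ports.config
service condor restart
iptables -I INPUT -p tcp --dport 9618 -j ACCEPT
iptables -I INPUT -p tcp --dport 9615 -j ACCEPT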

I am going to use ami-60bd4609, a publicly available Fedora 15 AMI. You can either start the instance via the AWS console, or submit it by following previous instructions.

Once the instance is up and running, login and sudo yum install condor. Note, until BZ656562 is resolved, you will have to sudo mkdir /var/run/condor; sudo chown condor.condor /var/run/condor before starting condor. Start condor with sudo service condor start to get a personal condor.
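
That is,

$ sudo yum install condor
$ sudo mkdir /var/run/condor
$ sudo chown condor.condor /var/run/condor
$ sudo service condor start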

Configuring condor on the instance is very similar to creating a multiple node pool. You will need to set the CONDOR_HOST, ALLOW_WRITE, and DAEMON_LIST,

# cat > /etc/condor/config.d/40execute_node.config
CONDOR_HOST = condor.condorproject.org
DAEMON_LIST = MASTER, STARTD
ALLOW_WRITE = $(ALLOW_WRITE), $(CONDOR_HOST)
^D
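
You can sanity check the result with condor_config_val,

# condor_config_val CONDOR_HOST DAEMON_LIST
condor.condorproject.org
MASTER, STARTD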

If you do not give condor.condorproject.org WRITE permissions, the Schedd will fail to start jobs. StartLog will report,

PERMISSION DENIED to unauthenticated@unmapped from host 128.105.291.82 for command 442 (REQUEST_CLAIM), access level DAEMON: reason: DAEMON authorization policy contains no matching ALLOW entry for this request; identifiers used for this host: 128.105.291.82,condor.condorproject.org, hostname size = 1, original ip address = 128.105.291.82

Now remember, we need bi-directional connectivity. So condor.condorproject.org must be able to connect to the EC2 instance’s Startd. The condor_startd will listen on an ephemeral port by default. You could restrict it to a port range or use condor_shared_port. For simplicity, just force a non-ephemeral port of 3131,

# echo "STARTD_ARGS = -p 3131" >> /etc/condor/config.d/40execute_node.config

You can now open TCP port 3131 in the instance’s iptables firewall. If you are using the Fedora 15 AMI, the firewall is off by default and needs no adjustment. Additionally, the security group on the instance needs to have TCP port 3131 authorized. Use the AWS Console or ec2-authorize GROUP -p 3131.
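
If you use ec2-authorize, consider restricting the source to the central manager, e.g. with the group and address used in this example,

$ ec2-authorize default -P tcp -p 3131 -s 128.105.291.82/32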

If you miss either of these steps, the Schedd will fail to start jobs on the instance, likely with a message similar to,

Failed to send REQUEST_CLAIM to startd ec2-174-129-47-20.compute-1.amazonaws.com <174.129.47.20:3131>#1220911452#1#... for matt: SECMAN:2003:TCP connection to startd ec2-174-129-47-20.compute-1.amazonaws.com <174.129.47.20:3131>#1220911452#1#... for matt failed.

A quick service condor restart on the instance, and a condor_status on condor.condorproject.org would hopefully show the instance joined the pool. Except the instance has not been authorized yet. In fact, the CollectorLog will probably report,

PERMISSION DENIED to unauthenticated@unmapped from host 174.129.47.20 for command 0 (UPDATE_STARTD_AD), access level ADVERTISE_STARTD: reason: ADVERTISE_STARTD authorization policy contains no matching ALLOW entry for this request; identifiers used for this host: 174.129.47.20,ec2-174-129-47-20.compute-1.amazonaws.com
PERMISSION DENIED to unauthenticated@unmapped from host 174.129.47.20 for command 2 (UPDATE_MASTER_AD), access level ADVERTISE_MASTER: reason: ADVERTISE_MASTER authorization policy contains no matching ALLOW entry for this request; identifiers used for this host: 174.129.47.20,ec2-174-129-47-20.compute-1.amazonaws.com

The instance needs to be authorized to advertise itself into the Collector. A good way to do that is to add,

ALLOW_ADVERTISE_MASTER = $(ALLOW_WRITE), ec2-174-129-47-20.compute-1.amazonaws.com
ALLOW_ADVERTISE_STARTD = $(ALLOW_WRITE), ec2-174-129-47-20.compute-1.amazonaws.com

to condor.condorproject.org’s configuration and reconfig with condor_reconfig. A note here, ALLOW_WRITE is added in because I am assuming you are following previous instructions. If you have ALLOW_ADVERTISE_MASTER/STARTD already configured, you should append to them instead. Also, appending for each new instance will get tedious. You could be very trusting and allow *.amazonaws.com, but it is better to use SSL or PASSWORD authentication. I will describe that some other time.
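
For reference, the trusting wildcard version would be,

ALLOW_ADVERTISE_MASTER = $(ALLOW_WRITE), *.amazonaws.com
ALLOW_ADVERTISE_STARTD = $(ALLOW_WRITE), *.amazonaws.com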

After the reconfig, the instance will eventually show up in a condor_status listing.

$ condor_status
Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime
localhost.localdom LINUX      INTEL  Unclaimed Benchmar 0.430  1666  0+00:00:04

The name is not very helpful, but also not a problem.

It is time to submit a job.

$ condor_submit
Submitting job(s)
cmd = /bin/sleep
args = 1d
should_transfer_files = IF_NEEDED
when_to_transfer_output = ON_EXIT
queue
^D
1 job(s) submitted to cluster 31.

$ condor_q
-- Submitter: condor.condorproject.org : <128.105.291.82:36900> : condor.condorproject.org
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
  31.0   matt           11/11 11:11   0+00:00:00 I  0   0.0  sleep 1d
1 jobs; 1 idle, 0 running, 0 held

The job will stay idle forever, which is no good. The problem can be found in the SchedLog,

Enqueued contactStartd startd=<10.72.55.105:3131>
In checkContactQueue(), args = 0x9705798, host=<10.72.55.105:3131>
Requesting claim localhost.localdomain <10.72.55.105:3131>#1220743035#1#... for matt 31.0
attempt to connect to <10.72.55.105:3131> failed: Connection timed out (connect errno = 110).  Will keep trying for 45 total seconds (24 to go).

The root cause is that the instance has two internet addresses: a private one, which it is advertising but which is not routable from condor.condorproject.org,

$ condor_status -format "%s, " Name -format "%s\n" MyAddress
localhost.localdomain, <10.72.55.105:3131>

And a public one, which can be found from within the instance,

$ curl -f http://169.254.169.254/latest/meta-data/public-ipv4
174.129.47.20

Condor has a way to handle this. The TCP_FORWARDING_HOST configuration parameter can be set to the public address for the instance.

# echo "TCP_FORWARDING_HOST = $(curl -f http://169.254.169.254/latest/meta-data/public-ipv4)" >> /etc/condor/config.d/40execute_node.config

A condor_reconfig will apply the change, but a restart will clear out the old entry first. Oops. Note, you cannot set TCP_FORWARDING_HOST to the public-hostname of the instance, because within the instance the public hostname resolves to the instance’s internal, private address.

When setting TCP_FORWARDING_HOST, also set PRIVATE_NETWORK_INTERFACE to let the host talk to itself over its private address.

# echo "PRIVATE_NETWORK_INTERFACE = $(curl -f http://169.254.169.254/latest/meta-data/local-ipv4)" >> /etc/condor/config.d/40execute_node.config

Doing so will prevent the condor_startd from using its public address to send DC_CHILDALIVE messages to the condor_master, which might fail because of a firewall or security group setting,

attempt to connect to <174.129.47.20:34550> failed: Connection timed out (connect errno = 110).  Will keep trying for 390 total seconds (200 to go).
attempt to connect to <174.129.47.20:34550> failed: Connection timed out (connect errno = 110).
ChildAliveMsg: failed to send DC_CHILDALIVE to parent daemon at <174.129.47.20:34550> (try 1 of 3): CEDAR:6001:Failed to connect to <174.129.47.20:34550>

Or simply because the master does not trust the public address,

PERMISSION DENIED to unauthenticated@unmapped from host 174.129.47.20 for command 60008 (DC_CHILDALIVE), access level DAEMON: reason: DAEMON authorization policy contains no matching ALLOW entry for this request; identifiers used for this host: 174.129.47.20,ec2-174-129-47-20.compute-1.amazonaws.com, hostname size = 1, original ip address = 174.129.47.20

Now run that service condor restart and the public, routable address will be advertised,

$ condor_status -format "%s, " Name -format "%s\n" MyAddress
localhost.localdomain, <174.129.47.20:3131?noUDP>

The job will be started on the instance automatically,

$ condor_q -run
-- Submitter: condor.condorproject.org : <128.105.291.82:36900> : condor.condorproject.org
 ID      OWNER            SUBMITTED     RUN_TIME HOST(S)
  31.0   matt           11/11 11:11   0+00:00:11 localhost.localdomain

If you want to clean up the localhost.localdomain name, set the instance’s hostname and restart condor,

$ sudo hostname $(curl -f http://169.254.169.254/latest/meta-data/public-hostname)
$ sudo service condor restart
(wait for the startd to advertise)
$ condor_status -format "%s, " Name -format "%s\n" MyAddress
ec2-174-129-47-20.compute-1.amazonaws.com, <174.129.47.20:3131?noUDP>

In summary,

Configuration changes on condor.condorproject.org,

ALLOW_ADVERTISE_MASTER = $(ALLOW_WRITE), ec2-174-129-47-20.compute-1.amazonaws.com
ALLOW_ADVERTISE_STARTD = $(ALLOW_WRITE), ec2-174-129-47-20.compute-1.amazonaws.com

Setup on the instance,

# cat > /etc/condor/config.d/40execute_node.config
CONDOR_HOST = condor.condorproject.org
DAEMON_LIST = MASTER, STARTD
ALLOW_WRITE = $(ALLOW_WRITE), $(CONDOR_HOST)
STARTD_ARGS = -p 3131
^D
# echo "TCP_FORWARDING_HOST = $(curl -f http://169.254.169.254/latest/meta-data/public-ipv4)" >> /etc/condor/config.d/40execute_node.config
# echo "PRIVATE_NETWORK_INTERFACE = $(curl -f http://169.254.169.254/latest/meta-data/local-ipv4)" >> /etc/condor/config.d/40execute_node.config
# hostname $(curl -f http://169.254.169.254/latest/meta-data/public-hostname)

Getting started: Condor and EC2 – Importing instances with condor_ec2_link

November 7, 2011

Starting and managing instances describes the powerful feature of Condor to start and manage EC2 instances, but what if you are already using something other than Condor to start your instances, such as the AWS Management Console?

Importing instances turns out to be straightforward, if you know how instances are started. In a nutshell, the condor_gridmanager executes a state machine and records its current state in an attribute named GridJobId. To import an instance, submit a job that is already in the state where an instance id has been assigned. You can take a submit file and add +GridJobId = "ec2 https://ec2.amazonaws.com/ BOGUS INSTANCE-ID". The INSTANCE-ID needs to be the actual identifier of the instance you want to import. For instance,

...
ec2_access_key_id = ...
ec2_secret_access_key = ...
...
+GridJobId = "ec2 https://ec2.amazonaws.com/ BOGUS i-319c3652"
queue

It is important to get the ec2_access_key_id and ec2_secret_access_key correct. Without them Condor will not be able to communicate with EC2 and EC2_GAHP_LOG will report,

$ tail -n2 $(condor_config_val EC2_GAHP_LOG)
11/11/11 11:11:11 Failure response text was '
<Response><Errors><Error><Code>AuthFailure</Code><Message>AWS was not able to validate the provided access credentials</Message></Error></Errors><RequestID>ab50f005-6d77-4653-9cec-298b2d475f6e</RequestID>'.

This error will not be reported back into the job, which would have put it on hold; instead the gridmanager will think EC2 is down for the job. Oops.

$ grep down $(condor_config_val GRIDMANAGER_LOG)
11/11/11 11:11:11 [10697] resource https://ec2.amazonaws.com is now down
11/11/11 11:14:22 [10697] resource https://ec2.amazonaws.com is still down

To simplify the import, here is a script that will use ec2-describe-instances to get useful metadata about the instance and populate a submit file for you,

condor_ec2_link

#!/bin/sh

# Provide three arguments:
#  . instance id to link
#  . path to file with access key id
#  . path to file with secret access key

# TODO:
#  . Get EC2UserData (ec2-describe-instance-attribute --user-data)

ec2-describe-instances --show-empty-fields "$1" | \
   awk '/^INSTANCE/ {id=$2; ami=$3; keypair=$7; type=$10; zone=$12; ip=$17; group=$29}
        /^TAG/ {name=$5}
        END {print "universe = grid\n",
                   "grid_resource = ec2 https://ec2.amazonaws.com\n",
                   "executable =", ami"-"name, "\n",
                   "log = $(executable).$(cluster).log\n",
                   "ec2_ami_id =", ami, "\n",
                   "ec2_instance_type =", type, "\n",
                   "ec2_keypair_file = name-"keypair, "\n",
                   "ec2_security_groups =", group, "\n",
                   "ec2_availability_zone =", zone, "\n",
                   "ec2_elastic_ip =", ip, "\n",
                   "+EC2InstanceName = \""id"\"\n",
                   "+GridJobId = \"$(grid_resource) BOGUS", id, "\"\n",
                   "queue\n"}' | \
      condor_submit -a "ec2_access_key_id = $2" \
                    -a "ec2_secret_access_key = $3"

In action,

$ ./condor_ec2_link i-319c3652 /home/matt/Documents/AWS/Cert/AccessKeyID /home/matt/Documents/AWS/Cert/SecretAccessKey
Submitting job(s).
1 job(s) submitted to cluster 1739.

$ ./condor_ec2_q 1739
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
1739.0   matt           11/11 11:11   0+00:00:00 I  0   0.0 ami-e1f53a88-TheNa
  Instance name: i-319c3652
  Groups: sg-4f706226
  Keypair file: /home/matt/Documents/AWS/name-TheKeyPair
  AMI id: ami-e1f53a88
  Instance type: t1.micro
1 jobs; 1 idle, 0 running, 0 held

(20 seconds later)

$ ./condor_ec2_q 1739
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
1739.0   matt           11/11 11:11   0+00:00:01 R  0   0.0 ami-e1f53a88-TheNa
  Instance name: i-319c3652
  Hostname: ec2-50-17-104-50.compute-1.amazonaws.com
  Groups: sg-4f706226
  Keypair file: /home/matt/Documents/AWS/name-TheKeyPair
  AMI id: ami-e1f53a88
  Instance type: t1.micro
1 jobs; 0 idle, 1 running, 0 held

There are a few things that can be improved here, the most notable of which is the RUN_TIME. The Gridmanager gets status data from EC2 periodically. This is how the EC2RemoteVirtualMachineName (Hostname) gets populated on the job. The instance’s launch time is also available, and could be used to fill in the RUN_TIME. Oops.

Getting started: Condor and EC2 – condor_ec2_q tool

November 2, 2011

While Getting started with Condor and EC2, it is useful to display the EC2-specific attributes on jobs. This is a script that mirrors condor_q output, using its formatting parameters, and adds details for EC2 jobs.

condor_ec2_q:

#!/bin/sh

# NOTE:
#  . Requires condor_q >= 7.5.2, old ClassAds do not
#    have the % (modulus) operator
#  . When running, jobs show RUN_TIME of their current
#    run, not accumulated, which would require adding
#    in RemoteWallClockTime
#  . See condor_utils/condor_q.cpp:encode_status for
#    JobStatus map

# TODO:
#  . Remove extra ShadowBday==0 test,
#    condor_gridmanager < 7.7.5 (3a896d01) did not
#    delete ShadowBday when a job was not running.
#    RUN_TIME of held EC2 jobs would be wrong.

echo ' ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD'
condor_q \
   -format '%4d.' ClusterId \
   -format '%-3d ' ProcId \
   -format '%-14s ' Owner \
   -format '%-11s ' 'formatTime(QDate,"%m/%d %H:%M")' \
   -format '%3d+' 'ifThenElse(ShadowBday =!= UNDEFINED, ifThenElse(ShadowBday != 0, time() - ShadowBday, int(RemoteWallClockTime)), int(RemoteWallClockTime)) / (60*60*24)' \
   -format '%02d:' '(ifThenElse(ShadowBday =!= UNDEFINED, ifThenElse(ShadowBday != 0, time() - ShadowBday, int(RemoteWallClockTime)), int(RemoteWallClockTime)) % (60*60*24)) / (60*60)' \
   -format '%02d:' '(ifThenElse(ShadowBday =!= UNDEFINED, ifThenElse(ShadowBday != 0, time() - ShadowBday, int(RemoteWallClockTime)), int(RemoteWallClockTime)) % (60*60)) / 60' \
   -format '%02d ' 'ifThenElse(ShadowBday =!= UNDEFINED, ifThenElse(ShadowBday != 0, time() - ShadowBday, int(RemoteWallClockTime)), int(RemoteWallClockTime)) % 60' \
   -format '%-2s ' 'substr("?IRXCH>S", JobStatus, 1)' \
   -format '%-3d ' JobPrio \
   -format '%-4.1f ' ImageSize/1024.0 \
   -format '%-18.18s' 'strcat(Cmd," ",Arguments)' \
   -format '\n' Owner \
   -format '  Instance name: %s\n' EC2InstanceName \
   -format '  Hostname: %s\n' EC2RemoteVirtualMachineName \
   -format '  Keypair file: %s\n' EC2KeyPairFile \
   -format '  User data: %s\n' EC2UserData \
   -format '  User data file: %s\n' EC2UserDataFile \
   -format '  AMI id: %s\n' EC2AmiID \
   -format '  Instance type: %s\n' EC2InstanceType \
   "$@" | awk 'BEGIN {St["I"]=0;St["R"]=0;St["H"]=0} \
   	       {St[$6]++; print} \
   	       END {for (i=0;i<=7;i++) jobs+=St[substr("?IRXCH>S",i,1)]; \
	       	    print jobs, "jobs;", \
		          St["I"], "idle,", St["R"], "running,", St["H"], "held"}'

In action,

$ condor_q
-- Submitter: eeyore.local :  : eeyore.local
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
1728.0   matt           10/31 23:09   0+00:04:15 H  0   0.0  EC2_Instance-ami-6
1732.0   matt           11/1  01:43   0+05:16:46 R  0   0.0  EC2_Instance-ami-6
5 jobs; 0 idle, 4 running, 1 held

$ ./condor_ec2_q 
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
1728.0   matt           10/31 23:09   0+00:04:15 H  0   0.0  EC2_Instance-ami-6
  Instance name: i-31855752
  Hostname: ec2-50-19-175-62.compute-1.amazonaws.com
  Keypair file: /home/matt/Documents/AWS/EC2_Instance-ami-60bd4609.1728.pem
  User data: Hello EC2_Instance-ami-60bd4609!
  AMI id: ami-60bd4609
  Instance type: m1.small
1732.0   matt           11/01 01:43   0+05:16:48 R  0   0.0  EC2_Instance-ami-6
  Instance name: i-a90edcca
  Hostname: ec2-107-20-6-83.compute-1.amazonaws.com
  Keypair file: /home/matt/Documents/AWS/EC2_Instance-ami-60bd4609.1732.pem
  User data: Hello EC2_Instance-ami-60bd4609!
  AMI id: ami-60bd4609
  Instance type: m1.small
5 jobs; 0 idle, 4 running, 1 held

Getting started: Condor and EC2 – Starting and managing instances

October 31, 2011

Condor has the ability to start and manage the lifecycle of instances in EC2. The integration was released in early 2008 with version 7.1.

The integration started with users being able to upload AMIs to S3 and manage instances using the EC2 and S3 SOAP APIs. At the time, mid-2007, creating a useful AMI required so much user interaction that the complexity of supporting S3 AMI upload was not justified. The implementation settled on pure instance lifecycle management, a very powerful base and a core Condor strength.

A point of innovation during the integration was how to transactionally start instances. The instance’s security group (originally) and ssh keypair (finally) were used as a tracking key. This innovation turned into an RFE and eventually resulted in idempotent instance creation, a feature all Cloud APIs should support. In fact, all distributed resource management APIs should support it; more on this sometime.

Today, in Condor 7.7 and MRG 2, Condor uses the EC2 Query API via the ec2_gahp, and that’s our starting point. We’ll build a submit file, start an instance, get key metadata about the instance, and show how to control the instance’s lifecycle just like any other job’s.

First, the submit file,

universe = grid
grid_resource = ec2 https://ec2.amazonaws.com/

ec2_access_key_id = /home/matt/Documents/AWS/Cert/AccessKeyID
ec2_secret_access_key = /home/matt/Documents/AWS/Cert/SecretAccessKey

ec2_ami_id = ami-60bd4609
ec2_instance_type = m1.small

ec2_user_data = Hello $(executable)!

executable = EC2_Instance-$(ec2_ami_id)

log = $(executable).$(cluster).log

ec2_keypair_file = $(executable).$(cluster).pem

queue

The universe must be grid. The resource string is ec2 https://ec2.amazonaws.com/, and the URL may be changed if a proxy is needed, or possibly for debugging via a redirect.

The ec2_access_key_id and ec2_secret_access_key are full paths to files containing your credentials for accessing EC2. These are needed so Condor can act on your behalf when talking to EC2. They need not and should not be world readable. Take a look at EC2 User Guide: Amazon EC2 Credentials for information on obtaining your credentials.
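
A sketch of setting them up, with the key material pasted in by hand and permissions tightened,

$ mkdir -p ~/Documents/AWS/Cert
$ cat > ~/Documents/AWS/Cert/AccessKeyID
$ cat > ~/Documents/AWS/Cert/SecretAccessKey
$ chmod 600 ~/Documents/AWS/Cert/*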

The ec2_ami_id and ec2_instance_type are required. They specify the AMI off which to base the instance and the type of instance to create, respectively. ami-60bd4609 is an EBS backed Fedora 15 image supported by the Fedora Cloud SIG. A list of instance types can be found in EC2 User Guide: Instance Families and Types. I picked m1.small because the AMI is 32-bit.

ec2_user_data is optional, but when provided gives the instance some extra data to act on when starting up. It is described in EC2 User Guide: Using Instance Metadata. This is an incredibly powerful feature, allowing parameterization of AMIs.
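
For example, a script baked into the AMI could fetch the user data at boot and branch on it; this is just a sketch, and the role name is made up,

#!/bin/sh
# e.g. run from rc.local: fetch the user data and act on it
ROLE=$(curl -sf http://169.254.169.254/latest/user-data)
case "$ROLE" in
    webserver) service httpd start ;;
    *) echo "unrecognized user data: $ROLE" ;;
esac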

The executable field is simply a label here. It should really be called label or name and integrate with the AWS Console.

The log is our old friend the structured log of lifecycle events.

The ec2_keypair_file is the file where Condor will put the ssh keypair used for accessing the instance. This is a file instead of a keypair name because Condor generates a new keypair for each instance as part of tracking the instances. Eventually Condor should use EC2’s idempotent RunInstances.

Second, let’s submit the job,

$ condor_submit f15-ec2.sub            
Submitting job(s).
1 job(s) submitted to cluster 1710.

$ condor_q
-- Submitter: eeyore.local :  : eeyore.local
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
1710.0   matt           10/30 20:58   0+00:00:00 I  0   0.0  EC2_Instance-ami-6
1 jobs; 1 idle, 0 running, 0 held

Condor is starting up a condor_gridmanager, which is in turn starting up an ec2_gahp to communicate with EC2.

$ pstree | grep condor 
     |-condor_master-+-aviary_query_se
     |               |-condor_collecto---4*[{condor_collect}]
     |               |-condor_negotiat---4*[{condor_negotia}]
     |               |-condor_schedd-+-condor_gridmana---ec2_gahp---2*[{ec2_gahp}]
     |               |               |-condor_procd
     |               |               `-4*[{condor_schedd}]
     |               |-condor_startd-+-condor_procd
     |               |               `-4*[{condor_startd}]
     |               `-4*[{condor_master}]

Third, when the job is running the instance will also be started in EC2. Take a look at the log file, EC2_Instance-ami-60bd4609.1710.log, for some information. Also, the instance name and hostname will be available on the job ad,

$ condor_q -format "Instance name: %s\n" EC2InstanceName -format "Instance hostname: %s\n" EC2RemoteVirtualMachineName -format "Keypair: %s\n" EC2KeyPairFile
Instance name: i-7f37e31c
Instance hostname: ec2-184-72-158-77.compute-1.amazonaws.com
Keypair: /home/matt/Documents/AWS/EC2_Instance-ami-60bd4609.1710.pem

The instance name can be used with the AWS Console or ec2-describe-instances,

$ ec2-describe-instances i-7f37e31c
RESERVATION	r-f6592498	821108636519	default
INSTANCE	i-7f37e31c	ami-60bd4609	ec2-184-72-158-77.compute-1.amazonaws.com	ip-10-118-37-239.ec2.internal	running	SSH_eeyore.local_eeyore.local#1710.0#1320022728	0		m1.small	2011-10-31T00:59:01+0000	us-east-1c	aki-407d9529			monitoring-disabled	184.72.158.77	10.118.37.239			ebs					paravirtual	xen		sg-e5a18c8c	default
BLOCKDEVICE	/dev/sda1	vol-fe4aaf93	2011-10-31T00:59:24.000Z

The instance hostname along with the ec2_keypair_file will let us access the instance,

$ ssh -i /home/matt/Documents/AWS/EC2_Instance-ami-60bd4609.1710.pem ec2-user@ec2-184-72-158-77.compute-1.amazonaws.com
The authenticity of host 'ec2-184-72-158-77.compute-1.amazonaws.com (184.72.158.77)' can't be established.
RSA key fingerprint is f2:6e:da:bb:53:47:34:b6:2e:fe:63:62:a5:c8:a5:2e.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'ec2-184-72-158-77.compute-1.amazonaws.com,184.72.158.77' (RSA) to the list of known hosts.

Appliance:	Fedora-15 appliance 1.1
Hostname:	localhost.localdomain
IP Address:	10.118.37.239

[ec2-user@localhost ~]$ 

Notice that the Fedora instances use a default account of ec2-user, not root.

Also, the user data is available in the instance. Any program could read and act on it.

[ec2-user@localhost ~]$ curl http://169.254.169.254/latest/user-data
Hello EC2_Instance-ami-60bd4609!

Finally, to control the instance’s lifecycle, simply issue condor_hold or condor_rm and the instance will be terminated. You can also run shutdown -H now in the instance. Here I’ll run sudo shutdown -H now.

[ec2-user@localhost ~]$ sudo shutdown -H now
Broadcast message from ec2-user@localhost.localdomain on pts/0 (Mon, 31 Oct 2011 01:11:55 -0400):
The system is going down for system halt NOW!
[ec2-user@localhost ~]$
Connection to ec2-184-72-158-77.compute-1.amazonaws.com closed by remote host.
Connection to ec2-184-72-158-77.compute-1.amazonaws.com closed.

You will notice that condor_q does not immediately reflect that the instance is terminated, even though ec2-describe-instances will. This is because Condor only polls for status changes in EC2 every 5 minutes by default. The GRIDMANAGER_JOB_PROBE_INTERVAL configuration parameter controls the polling period.
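
If five minutes is too slow, the interval can be lowered on the submit machine; note that more frequent polling means more query traffic to EC2 (the config file name here is arbitrary),

# echo "GRIDMANAGER_JOB_PROBE_INTERVAL = 60" >> /etc/condor/config.d/50gridmanager.config
# condor_reconfig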

In this case, the instance was shutdown at Sun Oct 30 21:12:52 EDT 2011 and Condor noticed at 21:14:40,

$ tail -n11 EC2_Instance-ami-60bd4609.1710.log
005 (1710.000.000) 10/30 21:14:40 Job terminated.
	(1) Normal termination (return value 0)
		Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
		Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
		Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
		Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
	0  -  Run Bytes Sent By Job
	0  -  Run Bytes Received By Job
	0  -  Total Bytes Sent By Job
	0  -  Total Bytes Received By Job
...

Bonus, use periodic_hold or periodic_remove to cap how long an instance can run. Add periodic_hold = (time() - ShadowBday) >= 60 to the submit file and your instance will be terminated, by Condor, after 60 seconds.

$ tail -n6 EC2_Instance-ami-60bd4609.1713.log
001 (1713.000.000) 10/30 21:33:39 Job executing on host: ec2 https://ec2.amazonaws.com/
...
012 (1713.000.000) 10/30 21:37:54 Job was held.
	The job attribute PeriodicHold expression '( time() - ShadowBday ) >= 60' evaluated to TRUE
	Code 3 Subcode 0
...

The instance was not terminated at exactly 60 seconds because the PERIODIC_EXPR_INTERVAL configuration defaults to 300 seconds, just like the GRIDMANAGER_JOB_PROBE_INTERVAL.

Imagine keeping your EC2 instance inventory in Condor. Condor’s policy engine and extensible metadata for jobs automatically extend to instances running in EC2.
