Getting started: Condor and EC2 – Starting and managing instances

Condor has the ability to start and manage the lifecycle of instances in EC2. The integration was released in early 2008 with version 7.1.

The integration started with users being able to upload AMIs to S3 and manage instances using the EC2 and S3 SOAP APIs. At the time, mid-2007, creating a useful AMI required so much user interaction that the complexity of supporting S3 AMI upload was not justified. The implementation settled on pure instance lifecycle management, a very powerful base and a core Condor strength.

A point of innovation during the integration was how to transactionally start instances. The instance’s security group (originally) and ssh keypair (finally), were used as a tracking key. This innovation turned into an RFE and eventually resulted in idempotent instance creation, a feature all Cloud APIs should support. In fact, all distributed resource management APIs should support it, more on this sometime.

Today, in Condor 7.7 and MRG 2, Condor uses the EC2 Query API via the ec2_gahp, and that’s our starting point. We’ll build a submit file, start an instance, get key metadata about the instance, and show how to control the instance’s lifecycle just like any other job’s.

First, the submit file,

universe = grid
grid_resource = ec2 https://ec2.amazonaws.com/

ec2_access_key_id = /home/matt/Documents/AWS/Cert/AccessKeyID
ec2_secret_access_key = /home/matt/Documents/AWS/Cert/SecretAccessKey

ec2_ami_id = ami-60bd4609
ec2_instance_type = m1.small

ec2_user_data = Hello $(executable)!

executable = EC2_Instance-$(ec2_ami_id)

log = $(executable).$(cluster).log

ec2_keypair_file = $(executable).$(cluster).pem

queue

The universe must be grid. The resource string is ec2 https://ec2.amazonaws.com, and the URL may be changed if a proxy is needed or possibly debugging with a redirect.

The ec2_access_key_id and ec2_secret_access_key are full paths to files containing your credentials for accessing EC2. These are needed so Condor can act on your behalf when talking to EC2. They need not and should not be world readable. Take a look at EC2 User Guide: Amazon EC2 Credentials for information on obtaining your credentials.

The ec2_ami_id and ec2_instance_type are required. They specify the AMI off which to base the instance and the type of instance to create, respectively. ami-60bd4609 is an EBS backed Fedora 15 image supported by the Fedora Cloud SIG. A list of instance types can be found in EC2 User Guide: Instance Families and Types. I picked m1.small because the AMI is 32-bit.

ec2_user_data is optional, but when provided gives the instance some extra data to act on when starting up. It is described in EC2 User Guide: Using Instance Metadata. This is an incredibly powerful feature, allowing parameterization of AMIs.

The executable field is simply a label here. It should really be called label or name and integrate with the AWS Console.

The log is our old friend the structured log of lifecycle events.

The ec2_keypair_file is the file where Condor will put the ssh keypair used for accessing the instance. This is a file instead of a keypair name because Condor generates a new keypair for each instance as part of tracking the instances. Eventually Condor should use EC2’s idempotent RunInstances.

Second, let’s submit the job,

$ condor_submit f15-ec2.sub            
Submitting job(s).
1 job(s) submitted to cluster 1710.

$ condor_q
-- Submitter: eeyore.local :  : eeyore.local
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
1710.0   matt           10/30 20:58   0+00:00:00 I  0   0.0  EC2_Instance-ami-6
1 jobs; 1 idle, 0 running, 0 held

Condor is starting up a condor_gridmanager, which is in turn starting up an ec2_gahp to communicate with EC2.

$ pstree | grep condor 
     |-condor_master-+-aviary_query_se
     |               |-condor_collecto---4*[{condor_collect}]
     |               |-condor_negotiat---4*[{condor_negotia}]
     |               |-condor_schedd-+-condor_gridmana---ec2_gahp---2*[{ec2_gahp}]
     |               |               |-condor_procd
     |               |               `-4*[{condor_schedd}]
     |               |-condor_startd-+-condor_procd
     |               |               `-4*[{condor_startd}]
     |               `-4*[{condor_master}]

Third, when the job is running the instance will also be started in EC2. Take a look at the log file, EC2_Instance-ami-60bd4609.1710.log, for some information. Also, the instance name and hostname will be available on the job ad,

$ condor_q -format "Instance name: %s\n" EC2InstanceName -format "Instance hostname: %s\n" EC2RemoteVirtualMachineName -format "Keypair: %s\n" EC2KeyPairFile
Instance name: i-7f37e31c
Instance hostname: ec2-184-72-158-77.compute-1.amazonaws.com
Keypair: /home/matt/Documents/AWS/EC2_Instance-ami-60bd4609.1710.pem

The instance name can be used with the AWS Console or ec2-describe-instances,

$ ec2-describe-instances i-7f37e31c
RESERVATION	r-f6592498	821108636519	default
INSTANCE	i-7f37e31c	ami-60bd4609	ec2-184-72-158-77.compute-1.amazonaws.com	ip-10-118-37-239.ec2.internal	running	SSH_eeyore.local_eeyore.local#1710.0#1320022728	0		m1.small	2011-10-31T00:59:01+0000	us-east-1c	aki-407d9529			monitoring-disabled	184.72.158.77	10.118.37.239			ebs					paravirtual	xen		sg-e5a18c8c	default
BLOCKDEVICE	/dev/sda1	vol-fe4aaf93	2011-10-31T00:59:24.000Z

The instance hostname along with the ec2_keypair_file will let us access the instance,

$ ssh -i /home/matt/Documents/AWS/EC2_Instance-ami-60bd4609.1710.pem ec2-user@ec2-184-72-158-77.compute-1.amazonaws.com
The authenticity of host 'ec2-184-72-158-77.compute-1.amazonaws.com (184.72.158.77)' can't be established.
RSA key fingerprint is f2:6e:da:bb:53:47:34:b6:2e:fe:63:62:a5:c8:a5:2e.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'ec2-184-72-158-77.compute-1.amazonaws.com,184.72.158.77' (RSA) to the list of known hosts.

Appliance:	Fedora-15 appliance 1.1
Hostname:	localhost.localdomain
IP Address:	10.118.37.239

[ec2-user@localhost ~]$ 

Notice that the Fedora instances use a default account of ec2-user, not root.

Also, the user data is available in the instance. Any program could read and act on it.

[ec2-user@localhost ~]$ curl http://169.254.169.254/latest/user-data
Hello EC2_Instance-ami-60bd4609!

Finally, controlling the instance’s lifecycle, simply issue condor_hold or condor_rm and the instance will be terminated. You can also run shutdown -H now in the instance. Here I’ll run sudo shutdown -H now.

[ec2-user@localhost ~]$ sudo shutdown -H now
Broadcast message from ec2-user@localhost.localdomain on pts/0 (Mon, 31 Oct 2011 01:11:55 -0400):
The system is going down for system halt NOW!
[ec2-user@localhost ~]$
Connection to ec2-184-72-158-77.compute-1.amazonaws.com closed by remote host.
Connection to ec2-184-72-158-77.compute-1.amazonaws.com closed.

You will notice that condor_q does not immediately reflect that the instance is terminated, even though ec2-describe-instances will. This is because Condor only polls for status changes in EC2 every 5 minutes by default. The GRIDMANAGER_JOB_PROBE_INTERVAL configuration param is the control.

In this case, the instance was shutdown at Sun Oct 30 21:12:52 EDT 2011 and Condor noticed at 21:14:40,

$ tail -n11 EC2_Instance-ami-60bd4609.1710.log
005 (1710.000.000) 10/30 21:14:40 Job terminated.
	(1) Normal termination (return value 0)
		Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
		Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
		Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
		Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
	0  -  Run Bytes Sent By Job
	0  -  Run Bytes Received By Job
	0  -  Total Bytes Sent By Job
	0  -  Total Bytes Received By Job
...

Bonus, use periodic_hold or periodic_remove to cap how long an instance can run. Add periodic_hold = (time() – ShadowBday) >= 60 to the submit file and your instance will be terminated, by Condor, after 60 seconds.

$ tail -n6 EC2_Instance-ami-60bd4609.1713.log
001 (1713.000.000) 10/30 21:33:39 Job executing on host: ec2 https://ec2.amazonaws.com/
...
012 (1713.000.000) 10/30 21:37:54 Job was held.
	The job attribute PeriodicHold expression '( time() - ShadowBday ) >= 60' evaluated to TRUE
	Code 3 Subcode 0
...

The instance was not terminated at exactly 60 seconds because the PERIODIC_EXPR_INTERVAL configuration defaults to 300 seconds, just like the GRIDMANAGER_JOB_PROBE_INTERVAL.

Imagine keeping your EC2 instance inventory in Condor. Condor’s policy engine and extensible metadata for jobs automatically extend to instances running in EC2.

Advertisements

Tags: , , ,

4 Responses to “Getting started: Condor and EC2 – Starting and managing instances”

  1. Getting started: Condor and EC2 – condor_ec2_q tool « Spinning Says:

    […] Spinning « Getting started: Condor and EC2 – Starting and managing instances […]

  2. Getting started: Condor and EC2 – Importing instances with condor_ec2_link « Spinning Says:

    […] Starting and managing instances describes the powerful feature of Condor to start and manage EC2 instances, but what if you are already using something other than Condor to start your instance, such as the AWS Management Console. […]

  3. Getting started: Condor and EC2 – EC2 execute node « Spinning Says:

    […] have been over starting and managing instances from Condor, using condor_ec2_q to help, and importing existing instances. Here we will cover […]

  4. EC2, VNC and Fedora « Spinning Says:

    […] start an instance, my preferred way is via Condor. I used ami-60bd4609 on an m1.small, providing a basic Fedora 15 server. Make sure the […]

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s


%d bloggers like this: