Archive for the ‘Tutorial’ Category

Hadoop on OpenStack with a CLI: Creating a cluster

January 29, 2014

OpenStack Savanna can already help you create a Hadoop cluster or run a Hadoop workload all through the Horizon dashboard. What it could not do until now is let you do that through a command-line interface.

Part of the Savanna work for Icehouse is to create a savanna CLI. It extends Savanna’s functionality and gives us an opportunity to review the existing v1.0 and v1.1 REST APIs in preparation for a stable v2 API.

A first pass of the CLI is now done and functional for at least the v1.0 REST API. And here’s how you can use it.

Zeroth, get your hands on the Savanna client. Two places to get it are RDO and the OpenStack tarballs.

First, know that the Savanna architecture includes a plugin mechanism to allow for Hadoop vendors to plug in their own management tools. This is a key aspect of Savanna’s vendor appeal. So you need to pick a plugin to use.

$ savanna plugin-list
| name    | versions | title                     |
| vanilla | 1.2.1    | Vanilla Apache Hadoop     |
| hdp     | 1.3.2    | Hortonworks Data Platform |

I chose to try the Vanilla plugin, version 1.2.1. It’s the reference implementation,

export PLUGIN_NAME=vanilla
export PLUGIN_VERSION=1.2.1

Second, you need to make some decisions about the Hadoop cluster you want to start. I decided to have a master node using the m1.medium flavor and three worker nodes also using m1.medium.

export MASTER_FLAVOR=m1.medium
export WORKER_FLAVOR=m1.medium
export WORKER_COUNT=3

Third, I decided to use Neutron networking in my OpenStack deployment; it’s what everyone is doing these days. As a result, I need a network to start the cluster on.

$ neutron net-list
| id            | name | subnets       |
| 25783...f078b | net0 | 18d12...5f903 |

The cluster will be significantly more useful if I have a way to access it, so I need to pick a keypair for access.

$ nova keypair-list
| Name      | Fingerprint                                     |
| mykeypair | ac:ad:1d:f7:97:24:bd:6e:d7:98:50:a2:3d:7d:6c:45 |
export KEYPAIR=mykeypair

And I need an image to use for each of the nodes. I chose a Fedora image that was created using the Savanna DIB elements. You can pick one from the Savanna Quickstart guide,

$ glance image-list
| ID            | Name           | Disk Format | Container Format | Size       | Status |
| 1939b...f05c2 | fedora_savanna | qcow2       | bare             | 1093453824 | active |
export IMAGE_ID=1939bad7-11fe-4cab-b1b9-02b01d9f05c2

then register it with Savanna,

savanna image-register --id $IMAGE_ID --username fedora
savanna image-add-tag --id $IMAGE_ID --tag $PLUGIN_NAME
savanna image-add-tag --id $IMAGE_ID --tag $PLUGIN_VERSION
$ savanna image-list
| name           | id            | username | tags           | description |
| fedora_savanna | 1939b...f05c2 | fedora   | vanilla, 1.2.1 | None        |

FYI, --username fedora tells Savanna what account it can access on the instance that has sudo privileges. Adding the tags tells Savanna what plugin and version the image works with.

That’s all the input you need to provide. From here on the cluster creation is just a little more cutting and pasting of a few commands.

First, a few commands to find IDs for the named values chosen above,

export MASTER_FLAVOR_ID=$(nova flavor-show $MASTER_FLAVOR | grep ' id ' | awk '{print $4}')
export WORKER_FLAVOR_ID=$(nova flavor-show $WORKER_FLAVOR | grep ' id ' | awk '{print $4}')
export MANAGEMENT_NETWORK_ID=$(neutron net-show net0 | grep ' id ' | awk '{print $4}')
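
The grep ' id ' | awk '{print $4}' pipeline depends on the id row being the only line that contains ' id ' and on the value being the fourth whitespace-separated field. A slightly more defensive variant, sketched here, splits each row on the | separator and matches the property name exactly. The function name table_field is hypothetical, and the two-column "| Property | Value |" layout is an assumption about the client output, not something taken from the clients' documentation.

```shell
# table_field KEY
# Read a two-column "| Property | Value |" table on stdin and print the
# value whose property name is exactly KEY. (Hypothetical helper.)
table_field() {
    awk -F'|' -v key="$1" '
        {
            k = $2; v = $3
            # trim surrounding blanks from both columns
            gsub(/^[ \t]+|[ \t]+$/, "", k)
            gsub(/^[ \t]+|[ \t]+$/, "", v)
        }
        k == key { print v; exit }
    '
}
```

With that in place, a lookup such as the network ID could be written as export MANAGEMENT_NETWORK_ID=$(neutron net-show net0 | table_field id).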

Next, create some node group templates for the master and worker nodes. The CLI currently takes a JSON representation of the template. It also provides a JSON representation when showing template details to facilitate export & import.

export MASTER_TEMPLATE_ID=$(echo "{\"plugin_name\": \"$PLUGIN_NAME\", \"node_processes\": [\"namenode\", \"secondarynamenode\", \"oozie\", \"jobtracker\"], \"flavor_id\": \"$MASTER_FLAVOR_ID\", \"hadoop_version\": \"$PLUGIN_VERSION\", \"name\": \"master\"}" | savanna node-group-template-create | grep ' id ' | awk '{print $4}')

export WORKER_TEMPLATE_ID=$(echo "{\"plugin_name\": \"$PLUGIN_NAME\", \"node_processes\": [\"datanode\", \"tasktracker\"], \"flavor_id\": \"$WORKER_FLAVOR_ID\", \"hadoop_version\": \"$PLUGIN_VERSION\", \"name\": \"worker\"}" | savanna node-group-template-create | grep ' id ' | awk '{print $4}')

Now put those two node group templates together into a cluster template,

export CLUSTER_TEMPLATE_ID=$(echo "{\"plugin_name\": \"$PLUGIN_NAME\", \"node_groups\": [{\"count\": 1, \"name\": \"master\", \"node_group_template_id\": \"$MASTER_TEMPLATE_ID\"}, {\"count\": $WORKER_COUNT, \"name\": \"worker\", \"node_group_template_id\": \"$WORKER_TEMPLATE_ID\"}], \"hadoop_version\": \"$PLUGIN_VERSION\", \"name\": \"cluster\"}" | savanna cluster-template-create | grep ' id ' | awk '{print $4}')
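
The escaped quotes in the echo commands are easy to get wrong. As a minimal sketch, the same node group JSON can be produced by a small function that uses a here-document, so the quotes need no escaping. The function name node_group_json is hypothetical; the JSON field names and the PLUGIN_NAME and PLUGIN_VERSION variables are the ones used above.

```shell
# node_group_json NAME FLAVOR_ID PROCESS...
# Print a node group template as JSON. Relies on PLUGIN_NAME and
# PLUGIN_VERSION being set, as in the exports earlier in this post.
node_group_json() {
    name=$1
    flavor=$2
    shift 2
    procs=""
    for p in "$@"; do
        # build a comma-separated list of quoted process names
        procs="${procs:+$procs, }\"$p\""
    done
    cat <<EOF
{"plugin_name": "$PLUGIN_NAME", "hadoop_version": "$PLUGIN_VERSION", "name": "$name", "flavor_id": "$flavor", "node_processes": [$procs]}
EOF
}
```

The output can be piped to the CLI exactly as before, for example node_group_json worker "$WORKER_FLAVOR_ID" datanode tasktracker | savanna node-group-template-create.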

Creating the node group and cluster templates only has to happen once. The final step, starting up the cluster, can be done multiple times.

echo "{\"cluster_template_id\": \"$CLUSTER_TEMPLATE_ID\", \"default_image_id\": \"$IMAGE_ID\", \"hadoop_version\": \"$PLUGIN_VERSION\", \"name\": \"cluster-instance-$(date +%s)\", \"plugin_name\": \"$PLUGIN_NAME\", \"user_keypair_id\": \"$KEYPAIR\", \"neutron_management_network\": \"$MANAGEMENT_NETWORK_ID\"}" | savanna cluster-create 

That’s it. You can nova list and ssh into the master instance, assuming you’re on the Neutron node and use ip netns exec, or you can log in through the master node’s VNC console.

A recipe for starting cloud images with virt-install

January 8, 2014

I’m a fan of using the same OS image across multiple environments. So, I’m a fan of using cloud images, those with cloud-init installed, even outside of a cloud.

The trick to this is properly triggering the NoCloud datasource. It’s actually more of a pain than you would think, and not very well documented. Here’s my recipe (from Fedora 19),

xz -d Fedora-x86_64-20-Beta-20131106-sda.raw.xz

printf '#cloud-config\npassword: fedora\nchpasswd: {expire: False}\nssh_pwauth: True\n' > user-data

NAME=guest0  # any guest name you like
cp Fedora-x86_64-20-Beta-20131106-sda.raw $NAME.raw
echo "instance-id: $NAME; local-hostname: $NAME" > meta-data
genisoimage -output $NAME-cidata.iso -volid cidata -joliet -rock user-data meta-data
virt-install --import --name $NAME --ram 512 --vcpus 2 --disk $NAME.raw --disk $NAME-cidata.iso,device=cdrom --network bridge=virbr0

Log in with username fedora and password fedora.

You’ll also want to boost the amount of RAM if you plan on doing anything interesting in the guest.

You can repeat the steps from the NAME assignment through virt-install to start multiple guests; just make sure to change the value of NAME each time.

If you want to ssh into the guest, you can use virsh console, log in and use ifconfig / ip addr to find the address. Or, you can use arp -e and virsh dumpxml to match MAC addresses. Or just arp -e before and after starting the guest.
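
The before-and-after arp comparison can be sketched as a small helper that prints only the lines that are new in the second capture. The function name new_lines and the file names in the usage note are hypothetical.

```shell
# new_lines BEFORE AFTER
# Print the lines of file AFTER that do not appear verbatim in file BEFORE.
new_lines() {
    # -F fixed strings, -x whole-line match, -v invert, -f patterns from file
    grep -F -x -v -f "$1" "$2"
}
```

One possible use: run arp -e > arp.before, start the guest, wait for it to boot, run arp -e > arp.after, then new_lines arp.before arp.after. Any entry that shows up is a candidate for the guest address.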

Note, you need to follow the meta-data and user-data lines very closely. If you don’t, you may not trigger the NoCloud datasource properly. It took me a number of tries to get it right. Also, the volid needs to be “cidata” or it won’t be found, which turns out to be a configurable parameter for NoCloud. The chpasswd bit is to prevent being prompted to change your password the first time you log in.

Consider becoming a fan of consistent OS images across your environments too!

Hello Fedora with docker in 3 steps

December 10, 2013

It really is this simple,

1. sudo yum install -y docker-io

2. sudo systemctl start docker

3. sudo docker run mattdm/fedora cat /etc/system-release

Bonus, for when you want to go deeper –

If you don’t want to use sudo all the time, which you shouldn’t want to do, you add yourself to the docker group,

$ sudo usermod -a -G docker $USER

If you don’t want to log out and back in, make your new group effective immediately,

$ su - $USER
$ groups | grep -q docker && echo Good job || echo Try again

If you want to run a known image, search for it on the public Docker index or on the command line,

$ docker search fedora

Try out a shell with,

$ docker run -i -t mattdm/fedora /bin/bash

Concurrency Limits: Group defaults

January 21, 2013

Concurrency limits allow for protecting resources by providing a way to cap the number of jobs requiring a specific resource that can run at one time.

For instance, limit licenses and filer access at four regional data centers.

license.north_LIMIT = 30
license.south_LIMIT = 30
license.east_LIMIT = 30
license.west_LIMIT = 45
filer.north_LIMIT = 75
filer.south_LIMIT = 150
filer.east_LIMIT = 75
filer.west_LIMIT = 75

Notice the repetition.

In addition to the repetition, every license.* and filer.* must be known and recorded in configuration. The set may be small in this example, but imagine imposing a limit on each user or each submission. The set of users is broad, dynamic and may differ by region. The set of submissions is a more extreme version of the users case, yet it is still realistic.

To simplify the configuration management for groups of limits, a new feature that provides group defaults for limits was added for the Condor 7.8 series.

The feature requires that only the exceptions to a rule be called out explicitly in configuration. For instance, license.west and filer.south are the exceptions in the configuration above. Simplified configuration available in 7.8,

CONCURRENCY_LIMIT_DEFAULT_license = 30
CONCURRENCY_LIMIT_DEFAULT_filer = 75
license.west_LIMIT = 45
filer.south_LIMIT = 150

In action,

$ for limit in license.north license.south license.east license.west filer.north filer.south filer.east filer.west; do echo queue 1000 | condor_submit -a cmd=/bin/sleep -a args=1d -a concurrency_limits=$limit; done

$ condor_q -format '%s\n' ConcurrencyLimits -const 'JobStatus == 2' | sort | uniq -c | sort -n
     30 license.east
     30 license.north
     30 license.south
     45 license.west
     75 filer.east
     75 filer.north
     75 filer.west
    150 filer.south
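
To make the lookup order concrete, here is a rough sketch that resolves the effective limit for a name from a file of NAME = value lines: an explicit <name>_LIMIT entry wins, otherwise the group default CONCURRENCY_LIMIT_DEFAULT_<group> applies, where the group is the part of the name before the first dot. This only illustrates the rule and is not how Condor itself reads configuration. The function name effective_limit is hypothetical, and the sketch is case-sensitive and assumes no leading whitespace.

```shell
# effective_limit CONFIG_FILE LIMIT_NAME
# Print the limit that applies to LIMIT_NAME (e.g. license.north).
effective_limit() {
    conf=$1
    name=$2
    group=${name%%.*}   # text before the first dot, e.g. "license"
    awk -F'[ \t]*=[ \t]*' \
        -v exact="${name}_LIMIT" \
        -v dflt="CONCURRENCY_LIMIT_DEFAULT_${group}" '
        $1 == exact { e = $2 }
        $1 == dflt  { d = $2 }
        END {
            if (e != "")      print e      # explicit exception wins
            else if (d != "") print d      # otherwise the group default
            else              print "unset"
        }
    ' "$conf"
}
```

Against the four-line configuration with group defaults of 30 and 75, effective_limit would report 30 for license.north and 45 for license.west.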

Configuration and policy evaluation

December 10, 2012

Figuring out how evaluation happens in configuration and policy is a common problem. The confusion is justified.

Configuration provides substitution with $() syntax, while policy is full ClassAd language evaluation without $() syntax.

Configuration is all the parameters listed in files discoverable with condor_config_val -config.

$ condor_config_val -config
Configuration source:
Local configuration sources:

Policy is the ClassAd expression found on the right-hand side of specific configuration parameters. For instance,

$ condor_config_val -v START
START: ( (KeyboardIdle > 15 * 60) && ( ((LoadAvg - CondorLoadAvg) <= 0.3) || (State != "Unclaimed" && State != "Owner")) )
  Defined in '/etc/condor/condor_config', line 753.

Configuration evaluation allows for substitution of configuration parameters with $().

$ cat /etc/condor/condor_config | head -n753 | tail -n1
START = $(UWCS_START)

$ condor_config_val -v UWCS_START
UWCS_START: ( (KeyboardIdle > 15 * 60) && ( ((LoadAvg - CondorLoadAvg) <= 0.3) || (State != "Unclaimed" && State != "Owner")) )
  Defined in '/etc/condor/condor_config', line 808.

$ cat /etc/condor/condor_config | head -n808 | tail -n3
UWCS_START	= ( (KeyboardIdle > $(StartIdleTime)) \
                    && ( $(CPUIdle) || \
                         (State != "Unclaimed" && State != "Owner")) )

Here START is actually the value of UWCS_START, provided by $(UWCS_START).

The substitution is recursive. Explore /etc/condor/condor_config and the JustCPU parameter. It is actually a parameter that is never read by daemons or tools. It is only useful in other configuration parameters. It’s shorthand.
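
The recursive substitution can be imitated with a few lines of awk. This is only a sketch of the idea, assuming a simplified file of single-line NAME = value entries; it is not Condor's parser and handles neither line continuations nor case-insensitive names. The function name expand_param is hypothetical.

```shell
# expand_param CONFIG_FILE NAME
# Print NAME's value with every $(OTHER) reference replaced, repeatedly,
# until no reference remains (capped at 100 rounds to avoid cycles).
expand_param() {
    awk -v want="$2" '
        {
            eq = index($0, "=")
            if (eq == 0) next
            k = substr($0, 1, eq - 1)
            v = substr($0, eq + 1)
            gsub(/^[ \t]+|[ \t]+$/, "", k)
            gsub(/^[ \t]+|[ \t]+$/, "", v)
            val[k] = v
        }
        END {
            out = val[want]
            while (n++ < 100 && match(out, /\$\([A-Za-z_][A-Za-z0-9_]*\)/)) {
                # strip the leading "$(" and trailing ")" to get the name
                ref = substr(out, RSTART + 2, RLENGTH - 3)
                out = substr(out, 1, RSTART - 1) val[ref] substr(out, RSTART + RLENGTH)
            }
            print out
        }
    ' "$1"
}
```

Given a file where START refers to $(UWCS_START) and UWCS_START refers to $(StartIdleTime), expanding START yields the fully substituted expression, which is the same flattening condor_config_val -v shows.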

Policy evaluation is full ClassAd expression evaluation. The evaluation happens at the appropriate times while daemons or tools are running.

Taking START as an example, the words KeyboardIdle, LoadAvg, CondorLoadAvg and State are attributes found on machine ads. The expression is evaluated by the condor_startd and condor_negotiator to figure out if a job is allowed to start on a resource.

$ condor_status -l slot1@eeyore.local | grep -e ^KeyboardIdle -e ^LoadAvg -e ^CondorLoadAvg -e ^State
KeyboardIdle = 0
LoadAvg = 0.290000
CondorLoadAvg = 0.0
State = "Owner"

Evaluation happens by recursively evaluating those attributes. The expression ((KeyboardIdle > 15 * 60) && (((LoadAvg - CondorLoadAvg) <= 0.3) || (State != "Unclaimed" && State != "Owner"))) becomes ((0 > 15 * 60) && (((0.29 - 0.0) <= 0.3) || ("Owner" != "Unclaimed" && "Owner" != "Owner"))). And so forth.
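
To follow the substitution through to a final answer, the same expression can be evaluated with awk standing in for the ClassAd evaluator. This is a loose analogy only: awk has no notion of ClassAd UNDEFINED or ERROR values, and the function name eval_start is hypothetical.

```shell
# eval_start KEYBOARD_IDLE LOAD_AVG CONDOR_LOAD_AVG STATE
# Evaluate the START expression shown above for one set of attribute
# values and print "true" or "false".
eval_start() {
    awk -v KeyboardIdle="$1" -v LoadAvg="$2" -v CondorLoadAvg="$3" -v State="$4" 'BEGIN {
        ok = (KeyboardIdle > 15 * 60) &&
             (((LoadAvg - CondorLoadAvg) <= 0.3) ||
              (State != "Unclaimed" && State != "Owner"))
        print (ok ? "true" : "false")
    }'
}
```

With the values from the machine ad above, eval_start 0 0.29 0.0 Owner prints false, because the keyboard has not been idle for 15 minutes.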

That’s it.

Extensible machine resources

November 19, 2012

Physical machines are home to many types of resources these days. The traditional cores, memory and disk now share space with GPUs, co-processors or even protein sequence analysis accelerators.

To facilitate use and management of these resources, a new feature is available in HTCondor for extending machine resources. Analogous to concurrency limits, which operate on a pool / global level, machine resources operate on a machine / local level.

The feature allows a machine to advertise that it has specific types of resources available. Jobs can then specify that they require those specific types of resources. And the matchmaker will take into account the new resource types.

By example, a machine may have some GPU resources, an RS232 connected to your favorite telescope, and a number of physical spinning hard disk drives. The configuration for this would be,


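MACHINE_RESOURCE_NAMES = GPU, RS232, SPINDLE
MACHINE_RESOURCE_GPU = 2
MACHINE_RESOURCE_RS232 = 1
MACHINE_RESOURCE_SPINDLE = 4

SLOT_TYPE_1 = cpus=100%,auto
SLOT_TYPE_1_PARTITIONABLE = TRUE
NUM_SLOTS_TYPE_1 = 1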

Aside – cpus=100%,auto instead of just auto because of GT3327. Also, the configuration for SLOT_TYPE_1 will likely go away in the future when all slots are partitionable by default.

Once a machine with this configuration is running,

$ condor_status -long | grep -i MachineResources
MachineResources = "cpus memory disk swap gpu rs232 spindle"

$ condor_status -long | grep -i -e TotalCpus -e TotalMemory -e TotalGpu -e TotalRs232 -e TotalSpindle
TotalCpus = 24
TotalMemory = 49152
TotalGpu = 2
TotalRs232 = 1
TotalSpindle = 4

$ condor_status -long | grep -i -e ^Cpus -e ^Memory -e ^Gpu -e ^Rs232 -e ^Spindle
Cpus = 24
Memory = 49152
Gpu = 2
Rs232 = 1
Spindle = 4

As you can see, the machine is reporting the different types of resources, how many of each it has and how many are currently available.
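
The grep patterns above match by prefix or substring, which is why ^Gpu is anchored to keep it from also matching TotalGpu. When a script needs exactly one attribute, a helper that compares the whole attribute name is a little safer. A minimal sketch, assuming the "Name = value" line format shown in the output above; the function name ad_attr is hypothetical.

```shell
# ad_attr NAME
# Read "Name = value" lines (as printed by condor_status -long) on stdin
# and print the value of the first attribute named NAME, ignoring case.
ad_attr() {
    awk -v attr="$1" '
        { eq = index($0, " = ") }
        eq > 0 && tolower(substr($0, 1, eq - 1)) == tolower(attr) {
            print substr($0, eq + 3)
            exit
        }
    '
}
```

For example, condor_status -long slot1@eeyore | ad_attr Gpu would print only the available GPU count, not the total.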

A job can take advantage of these new types of resources using a syntax already familiar for requesting resources from partitionable slots.

To consume one of the GPUs,

cmd =

request_gpu = 1


Or for a disk intensive workload,

cmd =

request_spindle = 1


With these jobs submitted and running,

$ condor_status
Name            OpSys      Arch   State     Activity LoadAv Mem ActvtyTime

slot1@eeyore    LINUX      X86_64 Unclaimed Idle      0.400 48896 0+00:00:28
slot1_1@eeyore  LINUX      X86_64 Claimed   Busy      0.000  128 0+00:00:04
slot1_2@eeyore  LINUX      X86_64 Claimed   Busy      0.000  128 0+00:00:04
                     Machines Owner Claimed Unclaimed Matched Preempting
        X86_64/LINUX        3     0       2         1       0          0
               Total        3     0       2         1       0          0

$ condor_status -l slot1@eeyore | grep -i -e ^Cpus -e ^Memory -e ^Gpu -e ^Rs232 -e ^Spindle
Cpus = 22
Memory = 48896
Gpu = 1
Rs232 = 1
Spindle = 3

That’s 22 cores, 1 GPU and 3 spindles still available.

Submit four more of the spindle consuming jobs and you’ll find the fourth does not run, because the available number of spindles is 0.

$ condor_status -l slot1@eeyore | grep -i -e ^Cpus -e ^Memory -e ^Gpu -e ^Rs232 -e ^Spindle
Cpus = 19
Memory = 48512
Gpu = 1
Rs232 = 1
Spindle = 0

Since these custom resources are available as attributes in various ClassAds the same way Cpus, Memory and Disk are, all the policy, management and reporting capabilities you would expect are available.

Pre and Post job scripts

October 29, 2012

Condor has a few ways to run programs associated with a job, beyond the job itself. If you’re an administrator, you can use the USER_JOB_WRAPPER. If you’re a user who is friends with your administrator, you can use Job Hooks. If you are ambitious, you can wrap all your jobs in a script that runs programs before and after your actual job.

Or, you can use the PreCmd and PostCmd attributes on your job. They specify programs to run before and after your job executes. By example,

$ cat prepost.job
cmd = /bin/sleep
args = 1

log = prepost.log
output = prepost.out
error = prepost.err

+PreCmd = "pre_script"
+PostCmd = "post_script"

transfer_input_files = pre_script, post_script
should_transfer_files = always

$ cat pre_script
date > prepost.pre

$ cat post_script
date >


$ condor_submit prepost.job
Submitting job(s)
1 job(s) submitted to cluster 1.

...wait a few seconds, or 259...

$ cat prepost.pre
Sun Oct 14 18:06:00 UTC 2012

$ cat
Sun Oct 14 18:06:02 UTC 2012

That’s about it, except for some gotchas.

  • transfer_input_files is manual and required
  • The scripts are run from Iwd, so you can’t use +PreCmd="/bin/blah"; instead use +PreCmd="blah" and transfer_input_files=/bin/blah
  • should_transfer_files = always is needed because scripts are run from Iwd; if the job runs local to the Schedd, Iwd will be in the EXECUTE directory but the scripts won’t be
  • Script stdout/err and exit code are ignored
  • You must use +Attr="" syntax, +PreCmd=pre_script won’t work
  • There is no option for passing arguments to the scripts
  • There is no starter environment, thus no $_CONDOR_JOB_AD/$_CONDOR_MACHINE_AD, but you can find .job_ad and .machine_ad in $_CONDOR_SCRATCH_DIR
  • Make sure the scripts are executable, otherwise the job will be put on hold with a reason similar to: Error from 127-0-0-1.NO_DNS: Failed to execute ‘…/dir_30626/pre_script’: Permission denied
  • PostCmd is broken in condor 7.6, but works in 7.8
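
The permission gotcha is cheap to catch before submitting. A small pre-flight check, sketched here with a hypothetical function name, reports any script that is missing its execute bit.

```shell
# check_executable FILE...
# Return non-zero, and name the offenders on stderr, if any FILE is
# missing or not executable by the current user.
check_executable() {
    rc=0
    for f in "$@"; do
        if [ ! -x "$f" ]; then
            echo "not executable: $f" >&2
            rc=1
        fi
    done
    return $rc
}
```

It could be chained in front of the submit, as in check_executable pre_script post_script && condor_submit prepost.job.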

Advanced scheduling: Execute periodically with cron jobs

October 15, 2012

If you want to run a job periodically you could repeatedly submit jobs, or qedit existing jobs after they run, but both of those options are kludges. Instead, the condor_schedd provides support for cron-like jobs as a first-class citizen.

The cron-like feature builds on the ability to defer job execution. However, instead of using deferral_time, commands analogous to crontab(5) fields are available. cron_month, cron_day_of_month, cron_day_of_week, cron_hour, and cron_minute all behave as you would expect, and default to * when not provided.

To run a job every two minutes,

executable = /bin/date
log = cron.log
output = cron.out
error = cron.err

cron_minute = 0-59/2
on_exit_remove = false


Note – on_exit_remove = false is required or the job will only be run once. It is arguable that on_exit_remove should default to false for jobs using cron_* commands.
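
For anyone unsure what a range with a step such as 0-59/2 selects, the following sketch enumerates the matching values. It only handles the START-END/STEP form used in this example, not lists or *, and the function name expand_range_step is hypothetical.

```shell
# expand_range_step FIELD
# FIELD has the form START-END/STEP, e.g. 0-59/2. Print the selected
# values on one line, separated by spaces.
expand_range_step() {
    range=${1%/*}       # "0-59"
    step=${1#*/}        # "2"
    start=${range%-*}   # "0"
    end=${range#*-}     # "59"
    out=""
    i=$start
    while [ "$i" -le "$end" ]; do
        out="${out:+$out }$i"
        i=$((i + step))
    done
    echo "$out"
}
```

Running expand_range_step 0-59/2 lists the even minutes 0 through 58, which lines up with the 09:24, 09:26, 09:28 start times in the log further down.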

After submitting and waiting 10 minutes, results can be found in the cron.log file.

$ grep ^00 cron.log
000 (009.000.000) 09/09 09:22:46 Job submitted from host: <...>
001 (009.000.000) 09/09 09:24:00 Job executing on host: <...>
006 (009.000.000) 09/09 09:24:00 Image size of job updated: 75
004 (009.000.000) 09/09 09:24:00 Job was evicted.
001 (009.000.000) 09/09 09:26:00 Job executing on host: <...>
004 (009.000.000) 09/09 09:26:00 Job was evicted.
001 (009.000.000) 09/09 09:28:00 Job executing on host: <...>
004 (009.000.000) 09/09 09:28:00 Job was evicted.
001 (009.000.000) 09/09 09:30:00 Job executing on host: <...>
004 (009.000.000) 09/09 09:30:00 Job was evicted.
001 (009.000.000) 09/09 09:32:00 Job executing on host: <...>
004 (009.000.000) 09/09 09:32:01 Job was evicted.

Note – the job appears to be evicted instead of terminated. What really happens is the job remains in the queue on termination. This is arguably a poor choice of wording in the log.

Just like for job deferral, there is no guarantee resources will be available at exactly the right time to run the job. cron_prep_time and cron_window provide a means to introduce tolerance.

Common question: What happens when a job takes longer than the time between defined starts, e.g. a job takes 30 minutes to complete and is set to run every 15 minutes?

Answer: The job will run serially. It will not stack up. The job does not need to serialize itself.

Note – a common complication, arguably a bug, which occurs only in pools with few or no new jobs being submitted, is that matchmaking must happen in time for the job dispatch. The Schedd does not publish a new Submitter Ad for the cron job’s owner when the job completes. This means the submitter ad the Negotiator sees may have zero idle jobs, resulting in no new match being handed out to dispatch the job the next time it is set to execute.


Tip: notification = never

October 8, 2012

By default, the condor_schedd will notify you, via email, when your job completes. This is a handy feature when running a few jobs, but can become overwhelming if you are running many jobs.

It can even turn into a problem if you are being notified at a mailbox you do not monitor.

# df /var
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/...             233747128 215868920   5813032  98% /

# du -a /var | sort -n -r | head -n 4
150436072       /var
111752396       /var/spool
111706452       /var/spool/mail
108702404       /var/spool/mail/matt

Yes, that’s ~105GB of job completion notification emails. All ignored. Oops.

The email notification feature is controlled on a per job basis by the notification command in a job’s submit file. See man condor_submit. To not get email notification, set it to NEVER, e.g.

$ echo queue | condor_submit -a cmd=/bin/hostname -a notification=never

If you are a pool administrator and want to change the default from COMPLETE to NEVER use the SUBMIT_EXPRS configuration parameter.

Notification = NEVER

Users will still be able to override the configured default by putting notification = complete|always|error in their submit files.

Keep those disks clean.

Advanced scheduling: Execute in the future with job deferral

September 24, 2012

One advanced scheduling feature of Condor is the ability to set a time, in the future, when a job should be run. This is called a deferral time.

Using the deferral_time command, you simply specify a time, in seconds since the Unix epoch, when your job should run:

executable = /bin/date
log = deferral.log
output = deferral.out
error = deferral.err

deferral_time = 1357016400


Use date(1) to generate the deferral_time.

$ date -d @1357016400
Tue Jan  1 00:00:00 EST 2013
$ date +%s -d "2013-01-01 00:00:00"
1357016400
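
The epoch value depends on the time zone the date string is read in, which is easy to overlook when the submit machine and the intended schedule are in different zones. The sketch below makes the zone explicit. It assumes GNU date (for -d), and the function name deferral_epoch is hypothetical.

```shell
# deferral_epoch DATE_STRING TZ
# Print seconds since the Unix epoch for DATE_STRING interpreted in TZ.
# TZ is a POSIX time zone value such as EST5 or UTC0.
deferral_epoch() {
    TZ="$2" date -d "$1" +%s
}
```

For midnight on 1 Jan 2013, deferral_epoch "2013-01-01 00:00:00" EST5 gives the 1357016400 used in this post, while the same wall-clock time in UTC0 is five hours (18000 seconds) earlier.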

After submitting the job and waiting until 1 Jan 2013, you can see the result by looking in deferral.log and deferral.out.

$ grep ^00 deferral.log
000 (001.000.000) 08/15 22:33:00 Job submitted from host: <...>
001 (001.000.000) 01/01 00:00:00 Job executing on host: <...>
006 (001.000.000) 01/01 00:00:00 Image size of job updated: 75
005 (001.000.000) 01/01 00:00:00 Job terminated.

$ cat deferral.out
Tue Jan  1 00:00:00 EST 2013

Of course there is no guarantee that a resource will be available at a precise time in the future. A job that does not run at its deferral_time will be put on Hold for manual intervention.

To reduce the likelihood of missing the deferral_time and needing manual intervention, the deferral_prep_time and deferral_window commands are available. Respectively, they specify the amount of time before the deferral_time that the job can be matched with a resource and how long after the deferral_time execution is acceptable.

executable = /bin/date
log = deferral.log
output = deferral.out
error = deferral.err

deferral_time = 1357016400

# 1 day = 24 hour * 60 min * 60 sec = 86,400 seconds
# 1/2 day = 86,400 sec / 2 = 43,200 seconds
deferral_prep_time = 86400
deferral_window = 43200


In the example above, the job may be matched to a resource, where it will keep the resource Claimed/Busy for up to a day (deferral_prep_time) in advance of its actual run. This will make it more likely that the job will run at precisely the deferral_time. It also means that for accounting purposes, you will be charged for using the resource, though the job has not yet run.

Additionally, if the job is not matched or otherwise does not start at precisely deferral_time, it has half a day (deferral_window) to run before it is put on hold for manual intervention.
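
The two settings bracket the deferral_time, and the arithmetic is simple enough to check by hand or with a short helper. In this sketch the function name deferral_bounds is hypothetical; it prints the earliest moment the job may claim a resource and the latest moment it may still start.

```shell
# deferral_bounds DEFERRAL_TIME PREP_TIME WINDOW   (all in seconds)
# Print "<earliest claim time> <latest start time>" as epoch seconds.
deferral_bounds() {
    echo "$(( $1 - $2 )) $(( $1 + $3 ))"
}
```

For the example above, deferral_bounds 1357016400 86400 43200 prints 1356930000 1357059600, that is, one day before and half a day after the deferral_time.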

That’s it.
