Posts Tagged ‘Tools’

Some htcondor-wiki stats

January 29, 2013

A few years ago I discovered Web Numbr, a service that will monitor a web page for a number and graph that number over time.

I installed a handful of webnumbrs to track things at HTCondor’s gittrac instance.

http://webnumbr.com/search?query=condor

Things such as –

  • Tickets resolved with no destination: tickets that don’t indicate what version they were fixed in. Anyone wanting to know if a bug was fixed or a feature was added in their version of HTCondor and encounters one of these will have to go spelunking in the repository for their answer.
  • Tickets resolved but not assigned: tickets that were worked on and completed, but whoever worked on them never claimed ownership.
  • Action items with commits: tickets that are marked as Todo/Incident, yet have associated code changes. Once there is a code change the ticket is either a bug fix (ticket type: defect) or a feature addition (ticket type: enhancement). Extra work is imposed on whoever comes after the ticket owner and wants to understand what they are looking at. Additionally, these tickets skew information about bugs and features in releases.
  • Tickets with invalid version fields: tickets that do not follow the, somewhat strict, version field syntax – vXXYYZZ, e.g. v070901. All the extra 0s are necessary and the v must be lowercase.

I wanted to embed the numbers here, but they require JavaScript, and wordpress.com filters JavaScript from posts.

Partitionable slot utilization

October 1, 2012

There are already ways to get pool utilization information on a macro level. Until Condor 7.8 and the introduction of TotalSlot{Cpus,Memory,Disk}, there were no good ways to get utilization on a micro level. At least not with only the standard command-line tools.

Resource data is available per slot. Getting macro (pool) utilization always requires aggregating data across multiple slots. Getting micro (slot) utilization should not.

In a pool using partitionable slots, you can now get per slot utilization from the slot itself. There is no need for any extra tooling to perform aggregation or correlation. This means condor_status can directly provide utilization information on the micro, per slot level.

$ echo "           Name  Cpus Avail Util%  Memory Avail Util%"
$ condor_status -constraint "PartitionableSlot =?= TRUE" \
     -format "%15s" Name \
     -format "%6d" TotalSlotCpus \
     -format "%6d" Cpus \
     -format "%5d%%" "((TotalSlotCpus - Cpus) / (TotalSlotCpus * 1.0)) * 100" \
     -format "%8d" TotalSlotMemory \
     -format "%6d" Memory \
     -format "%5d%%" "((TotalSlotMemory - Memory) / (TotalSlotMemory * 1.0)) * 100" \
     -format "\n" TRUE

           Name  Cpus Avail Util%  Memory Avail Util%
  slot1@eeyore0    16    12   25%   65536 48128   26%
  slot2@eeyore0    16    14   12%   65536 58368   10%
  slot1@eeyore1    16    12   25%   65536 40960   37%
  slot2@eeyore1    16    15    6%   65536 62464    4%

This is especially useful when machines are configured into combinations of multiple partitionable slots or partitionable and static slots.

Someday the pool utilization script should be integrated with condor_status.

Schedd stats with OpenTSDB

June 14, 2012

Building on Pool utilization with OpenTSDB, the condor_schedd also advertises a plethora of useful statistics that can be harvested with condor_status.

Make the metrics,

$ tsdb mkmetric condor.schedd.jobs.idle condor.schedd.jobs.running condor.schedd.jobs.held condor.schedd.jobs.mean_runtime condor.schedd.jobs.mean_waittime condor.schedd.jobs.historical.mean_runtime condor.schedd.jobs.historical.mean_waittime condor.schedd.jobs.submission_rate condor.schedd.jobs.start_rate condor.schedd.jobs.completion_rate
...

Obtain schedd_stats_for_opentsdb.sh and run,

$ while true; do ./schedd_stats_for_opentsdb.sh; sleep 15; done | nc -w 30 tsdb-host 4242 &
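
The script itself is not reproduced here, but a minimal sketch of the idea might look like the following. It feeds only three of the metrics, from schedd attributes that are known to be advertised (TotalIdleJobs, TotalRunningJobs, TotalHeldJobs); everything else, including the schedd= tag, is illustrative.

#!/bin/sh

# Minimal sketch, not the original schedd_stats_for_opentsdb.sh: emit OpenTSDB
# "put" lines for a few counters every schedd advertises.

NOW=$(date +%s)

condor_status -schedd \
   -format "%s " Name \
   -format "%d " TotalIdleJobs \
   -format "%d " TotalRunningJobs \
   -format "%d\n" TotalHeldJobs |
while read name idle running held; do
   echo "put condor.schedd.jobs.idle $NOW $idle schedd=$name"
   echo "put condor.schedd.jobs.running $NOW $running schedd=$name"
   echo "put condor.schedd.jobs.held $NOW $held schedd=$name"
done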

View the results at,

http://tsdb-host:4242/#start=1h-ago&m=sum:condor.schedd.jobs.idle&o=&m=sum:condor.schedd.jobs.running&o=&m=sum:condor.schedd.jobs.held&o=&m=sum:10m-avg:condor.schedd.jobs.submission_rate&o=axis%20x1y2&m=sum:10m-avg:condor.schedd.jobs.start_rate&o=axis%20x1y2&m=sum:10m-avg:condor.schedd.jobs.completion_rate&o=axis%20x1y2&ylabel=jobs&y2label=rates&yrange=%5B0:%5D&y2range=%5B0:%5D&key=out%20center%20top%20horiz%20box

Pool utilization with OpenTSDB

June 12, 2012

Merging the pool utilization script with OpenTSDB.

Once you have followed the stellar OpenTSDB Getting Started guide, make the metrics with,

$ tsdb mkmetric condor.pool.slots.unavail condor.pool.slots.avail condor.pool.slots.total condor.pool.slots.used condor.pool.slots.used_of_avail condor.pool.slots.used_of_total condor.pool.cpus.unavail condor.pool.cpus.avail condor.pool.cpus.total condor.pool.cpus.used condor.pool.cpus.used_of_avail condor.pool.cpus.used_of_total condor.pool.memory.unavail condor.pool.memory.avail condor.pool.memory.total condor.pool.memory.used condor.pool.memory.used_of_avail condor.pool.memory.used_of_total
metrics condor.pool.slots.unavail: [0, 0, 1]
metrics condor.pool.slots.avail: [0, 0, 2]
metrics condor.pool.slots.total: [0, 0, 3]
...
metrics condor.pool.memory.used_of_total: [0, 0, 17]

Obtain utilization_for_opentsdb.sh before running,

$ while true; do ./utilization_for_opentsdb.sh; sleep 15; done | nc -w 30 tsdb-host 4242 &
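
The script is not reproduced here either; a minimal sketch, covering only the cpu totals and assuming a slot counts as used when it is Claimed (the pool=main tag is arbitrary, OpenTSDB just wants at least one tag), might look like,

#!/bin/sh

# Minimal sketch, not the original utilization_for_opentsdb.sh: sum Cpus over
# all slots and emit OpenTSDB "put" lines for two of the metrics made above.

NOW=$(date +%s)

condor_status -format "%d " Cpus -format "%s\n" State |
awk -v now="$NOW" '
   { total += $1; if ($2 == "Claimed") used += $1 }
   END {
      print "put condor.pool.cpus.total", now, total+0, "pool=main"
      print "put condor.pool.cpus.used", now, used+0, "pool=main"
   }'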

View the results at,

http://tsdb-host:4242/#start=1h-ago&m=sum:condor.pool.cpus.total&o=&m=sum:condor.pool.cpus.used&o=&m=sum:condor.pool.cpus.used_of_avail&o=axis%20x1y2&ylabel=cpus&y2label=%2525+utilization&yrange=%5B0:%5D&y2range=%5B0:1%5D&key=out%20center%20top%20horiz%20box

The number of statistics about operating Condor pools has been growing over the years. All are easily retrieved via condor_status for feeding into OpenTSDB.
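
For example, to browse what else could be graphed, dump the full ads and pick attributes from there,

$ condor_status -long            # startd/slot ads
$ condor_status -schedd -long    # schedd ads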

Pool utilization

January 31, 2012

Here is a utilization script for a Condor pool.

$ ./utilization.sh
       Unavailable Available    Total     Used:  Avail   Total
Slots         5968      5451    11419     4179  76.66%  36.59%
Cpus          6314      5903    12217     4631  78.45%  37.90%
Memory    14277325  11776800 26054125  9908190  84.13%  38.02%

And, if you know your workload will not run on slots with less than 1GB of memory, you can treat slots that are too small as unavailable,

$ ./utilization.sh 'Memory < 1024'
       Unavailable Available    Total     Used:  Avail   Total
Slots         6292      5127    11419     4177  81.47%  36.57%
Cpus          6638      5579    12217     4629  82.97%  37.88%
Memory    14592711  11461414 26054125  9904193  86.41%  38.01%

Remember, if an attribute is not on all slots you need to use the meta-comparison operators: =?= and =!=, e.g. 'MyCustomAttr =!= True'.
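
For reference, here is a minimal sketch of how such a script might work. It is not the original: it assumes a slot is unavailable when it is in the Owner state or matches the optional expression in $1, and used when it is Claimed, and it produces only the Cpus row; Slots and Memory follow the same pattern.

#!/bin/sh

# Minimal sketch, not the original utilization.sh. "Unavailable" means a slot
# in the Owner state or one matching the optional expression in $1
# (e.g. 'Memory < 1024'); "used" means a Claimed slot that is not unavailable.

UNAVAIL=${1:-FALSE}

condor_status \
   -format "%s " State \
   -format "%d " "ifThenElse(($UNAVAIL) =?= TRUE, 1, 0)" \
   -format "%d\n" Cpus |
awk '{ total += $3
       if ($1 == "Owner" || $2 == 1) unavail += $3
       else {
          avail += $3
          if ($1 == "Claimed") used += $3
       } }
     END { printf "       Unavailable Available    Total     Used:  Avail   Total\n"
           printf "Cpus   %11d %9d %8d %8d %6.2f%% %6.2f%%\n",
                  unavail, avail, total, used, 100*used/avail, 100*used/total }'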

New toy: newpgid

December 13, 2011

Useful with cpusoak and memsoak,

newpgid.c

#include <unistd.h>

int
main(int argc, char *argv[])
{
  if (argc < 2)
    return 1;

  /* Put this process in its own, new process group. */
  setpgid(0, 0);

  /* Replace ourselves with the requested command and its arguments. */
  execvp(argv[1], &(argv[1]));

  /* Only reached if execvp failed. */
  return 1;
}

Use it when you want to start a new process in its own process group for easy killing.
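
For example, from a script (where the shell does not give background jobs their own process groups): since newpgid makes its own pid the process group id, the soaker and all of its forked children can be signalled with a single negative pid.

$ ./newpgid ./cpusoak 4 &
$ kill -TERM -- -$!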

If you have coreutils 7.0+, you can take advantage of timeout, which happens to call setpgid.

Getting started: Condor and EC2 – Importing instances with condor_ec2_link

November 7, 2011

Starting and managing instances describes Condor's powerful ability to start and manage EC2 instances, but what if you are already using something other than Condor, such as the AWS Management Console, to start your instances?

Importing instances turns out to be straightforward, if you know how instances are started. In a nutshell, the condor_gridmanager executes a state machine and records its current state in an attribute named GridJobId. To import an instance, submit a job that is already in the state where an instance id has been assigned. You can take a submit file and add +GridJobId = "ec2 https://ec2.amazonaws.com/ BOGUS INSTANCE-ID", where INSTANCE-ID is the actual identifier of the instance you want to import. For instance,

...
ec2_access_key_id = ...
ec2_secret_access_key = ...
...
+GridJobId = "ec2 https://ec2.amazonaws.com/ BOGUS i-319c3652"
queue

It is important to get the ec2_access_key_id and ec2_secret_access_key correct. Without them Condor will not be able to communicate with EC2 and EC2_GAHP_LOG will report,

$ tail -n2 $(condor_config_val EC2_GAHP_LOG)
11/11/11 11:11:11 Failure response text was '
<Response><Errors><Error><Code>AuthFailure</Code><Message>AWS was not able to validate the provided access credentials</Message></Error></Errors><RequestID>ab50f005-6d77-4653-9cec-298b2d475f6e</RequestID></Response>'.

This error will not be reported back to the job by putting it on hold; instead, the gridmanager will think EC2 is down for the job. Oops.

$ grep down $(condor_config_val GRIDMANAGER_LOG)
11/11/11 11:11:11 [10697] resource https://ec2.amazonaws.com is now down
11/11/11 11:14:22 [10697] resource https://ec2.amazonaws.com is still down

To simplify the import, here is a script that will use ec2-describe-instances to get useful metadata about the instance and populate a submit file for you,

condor_ec2_link

#!/bin/sh

# Provide three arguments:
#  . instance id to link
#  . path to file with access key id
#  . path to file with secret access key

# TODO:
#  . Get EC2UserData (ec2-describe-instance-attribute --user-data)

ec2-describe-instances --show-empty-fields $1 | \
   awk '/^INSTANCE/ {id=$2; ami=$3; keypair=$7; type=$10; zone=$12; ip=$17; group=$29}
        /^TAG/ {name=$5}
        END {print "universe = grid\n",
                   "grid_resource = ec2 https://ec2.amazonaws.com\n",
                   "executable =", ami"-"name, "\n",
                   "log = $(executable).$(cluster).log\n",
                   "ec2_ami_id =", ami, "\n",
                   "ec2_instance_type =", type, "\n",
                   "ec2_keypair_file = name-"keypair, "\n",
                   "ec2_security_groups =", group, "\n",
                   "ec2_availability_zone =", zone, "\n",
                   "ec2_elastic_ip =", ip, "\n",
                   "+EC2InstanceName = \""id"\"\n",
                   "+GridJobId = \"$(grid_resource) BOGUS", id, "\"\n",
                   "queue\n"}' | \
      condor_submit -a "ec2_access_key_id = $2" \
                    -a "ec2_secret_access_key = $3"

In action,

$ ./condor_ec2_link i-319c3652 /home/matt/Documents/AWS/Cert/AccessKeyID /home/matt/Documents/AWS/Cert/SecretAccessKey
Submitting job(s).
1 job(s) submitted to cluster 1739.

$ ./condor_ec2_q 1739
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
1739.0   matt           11/11 11:11   0+00:00:00 I  0   0.0 ami-e1f53a88-TheNa
  Instance name: i-319c3652
  Groups: sg-4f706226
  Keypair file: /home/matt/Documents/AWS/name-TheKeyPair
  AMI id: ami-e1f53a88
  Instance type: t1.micro
1 jobs; 1 idle, 0 running, 0 held

(20 seconds later)

$ ./condor_ec2_q 1739
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
1739.0   matt           11/11 11:11   0+00:00:01 R  0   0.0 ami-e1f53a88-TheNa
  Instance name: i-319c3652
  Hostname: ec2-50-17-104-50.compute-1.amazonaws.com
  Groups: sg-4f706226
  Keypair file: /home/matt/Documents/AWS/name-TheKeyPair
  AMI id: ami-e1f53a88
  Instance type: t1.micro
1 jobs; 0 idle, 1 running, 0 held

There are a few things that can be improved here, the most notable of which is the RUN_TIME. The gridmanager gets status data from EC2 periodically; this is how the EC2RemoteVirtualMachineName (Hostname) gets populated on the job. The instance’s launch time is also available, but it is not used to populate the job’s run time. Oops.

Getting started: Condor and EC2 – condor_ec2_q tool

November 2, 2011

While Getting started with Condor and EC2, it is useful to display the EC2 specific attributes on jobs. This is a script that mirrors condor_q output, using its formatting parameters, and adds details for EC2 jobs.

condor_ec2_q:

#!/bin/sh

# NOTE:
#  . Requires condor_q >= 7.5.2, old classads do not
#    have %
#  . When running, jobs show RUN_TIME of their current
#    run, not accumulated, which would require adding
#    in RemoteWallClockTime
#  . See condor_utils/condor_q.cpp:encode_status for
#    JobStatus map

# TODO:
#  . Remove extra ShadowBday==0 test,
#    condor_gridmanager < 7.7.5 (3a896d01) did not
#    delete ShadowBday when a job was not running.
#    RUN_TIME of held EC2 jobs would be wrong.

echo ' ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD'
condor_q \
   -format '%4d.' ClusterId \
   -format '%-3d ' ProcId \
   -format '%-14s ' Owner \
   -format '%-11s ' 'formatTime(QDate,"%m/%d %H:%M")' \
   -format '%3d+' 'ifThenElse(ShadowBday =!= UNDEFINED, ifThenElse(ShadowBday != 0, time() - ShadowBday, int(RemoteWallClockTime)), int(RemoteWallClockTime)) / (60*60*24)' \
   -format '%02d:' '(ifThenElse(ShadowBday =!= UNDEFINED, ifThenElse(ShadowBday != 0, time() - ShadowBday, int(RemoteWallClockTime)), int(RemoteWallClockTime)) % (60*60*24)) / (60*60)' \
   -format '%02d:' '(ifThenElse(ShadowBday =!= UNDEFINED, ifThenElse(ShadowBday != 0, time() - ShadowBday, int(RemoteWallClockTime)), int(RemoteWallClockTime)) % (60*60)) / 60' \
   -format '%02d ' 'ifThenElse(ShadowBday =!= UNDEFINED, ifThenElse(ShadowBday != 0, time() - ShadowBday, int(RemoteWallClockTime)), int(RemoteWallClockTime)) % 60' \
   -format '%-2s ' 'substr("?IRXCH>S", JobStatus, 1)' \
   -format '%-3d ' JobPrio \
   -format '%-4.1f ' ImageSize/1024.0 \
   -format '%-18.18s' 'strcat(Cmd," ",Arguments)' \
   -format '\n' Owner \
   -format '  Instance name: %s\n' EC2InstanceName \
   -format '  Hostname: %s\n' EC2RemoteVirtualMachineName \
   -format '  Keypair file: %s\n' EC2KeyPairFile \
   -format '  User data: %s\n' EC2UserData \
   -format '  User data file: %s\n' EC2UserDataFile \
   -format '  AMI id: %s\n' EC2AmiID \
   -format '  Instance type: %s\n' EC2InstanceType \
   "$@" | awk 'BEGIN {St["I"]=0;St["R"]=0;St["H"]=0} \
   	       {St[$6]++; print} \
   	       END {for (i=0;i<=7;i++) jobs+=St[substr("?IRXCH>S",i,1)]; \
	       	    print jobs, "jobs;", \
		          St["I"], "idle,", St["R"], "running,", St["H"], "held"}'

In action,

$ condor_q
-- Submitter: eeyore.local :  : eeyore.local
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
1728.0   matt           10/31 23:09   0+00:04:15 H  0   0.0  EC2_Instance-ami-6
1732.0   matt           11/1  01:43   0+05:16:46 R  0   0.0  EC2_Instance-ami-6
5 jobs; 0 idle, 4 running, 1 held

$ ./condor_ec2_q 
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
1728.0   matt           10/31 23:09   0+00:04:15 H  0   0.0  EC2_Instance-ami-6
  Instance name: i-31855752
  Hostname: ec2-50-19-175-62.compute-1.amazonaws.com
  Keypair file: /home/matt/Documents/AWS/EC2_Instance-ami-60bd4609.1728.pem
  User data: Hello EC2_Instance-ami-60bd4609!
  AMI id: ami-60bd4609
  Instance type: m1.small
1732.0   matt           11/01 01:43   0+05:16:48 R  0   0.0  EC2_Instance-ami-6
  Instance name: i-a90edcca
  Hostname: ec2-107-20-6-83.compute-1.amazonaws.com
  Keypair file: /home/matt/Documents/AWS/EC2_Instance-ami-60bd4609.1732.pem
  User data: Hello EC2_Instance-ami-60bd4609!
  AMI id: ami-60bd4609
  Instance type: m1.small
5 jobs; 0 idle, 4 running, 1 held

Toys: cpusoak and memsoak

June 5, 2010

Useful for using up CPU time and memory,

cpusoak.c

#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <errno.h>
#include <stdlib.h>

int
main(int argc, char **argv)
{
	/* Number of CPU-bound children to fork, default 1. */
	int children = argc > 1 ? atoi(argv[1]) : 1;

	while (children--) {
		pid_t pid = fork();
		if (!pid) {
			/* Child: spin forever on a Collatz-style loop. */
			long n, i = 0;
			while (1) {
				n = ++i;
				while (n != 1) n = (n % 2) ? n * 3 + 1 : n / 2;
			}
		}
	}

	/* Parent: wait until all children are gone. */
	while (-1 != wait(NULL) && ECHILD != errno);

	return 0;
}

memsoak.c

#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <errno.h>
#include <stdlib.h>

int
main(int argc, char **argv)
{
	/* Children to fork, bytes to leak per interval, seconds between leaks. */
	int children = argc > 1 ? atoi(argv[1]) : 1;
	int bytes = argc > 2 ? atoi(argv[2]) : 1024;
	int nap = argc > 3 ? atoi(argv[3]) : 3;

	while (children--) {
		pid_t pid = fork();
		if (!pid) {
			/* Child: leak bytes every nap seconds, forever. The
			 * allocations are never touched, so they grow the
			 * virtual image size; memset them if resident memory
			 * is what you want to soak. */
			while (1) {
				malloc(bytes);
				sleep(nap);
			}
		}
	}

	/* Parent: wait until all children are gone. */
	while (-1 != wait(NULL) && ECHILD != errno);

	return 0;
}
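
For reference, a hypothetical build-and-run session; the arguments are the ones read above: number of children for both, plus bytes per allocation and seconds between allocations for memsoak.

$ cc -o cpusoak cpusoak.c
$ cc -o memsoak memsoak.c
$ ./cpusoak 4 &
$ ./memsoak 2 1048576 3 &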