Statistic changes in HTCondor 7.7

February 12, 2013

Notice to HTCondor 7.8 users -

Statistics implemented during the 7.5 series that landed in 7.7.0 were rewritten by the time 7.8 was released. If you were using the original statistics for monitoring and/or reporting, here is a table to help you map old (left column) to new (right column).

See – 7.6 -> 7.8 schedd stats
(embedding content requires javascript, which is not available on wordpress.com)

Note: The *Rate and Mean* attributes require math, and UpdateTime requires memory

How accounting group configuration could work with Wallaby

February 5, 2013

Configuration of accounting groups in HTCondor is too often an expert task that requires coordination between administrators and their tools.

Wallaby provides a coordination point, so long as a little convention is employed, and can provide a task specific interface to simplify configuration.

Quick background: Wallaby provides semantic configuration for HTCondor. It models a pool as parameters aggregated into features and nodes aggregated into groups, with features and individual parameters associated with nodes and groups. It provides semantic validation of configuration before it is distributed, and has expert knowledge for minimal-impact configuration changes.

And, accounting group configuration in HTCondor is spread across a set of fixed parameters (GROUP_NAMES, GROUP_ACCEPT_SURPLUS, GROUP_SORT_EXPR, GROUP_QUOTA_ROUND_ROBIN_RATE, GROUP_AUTOREGROUP, GROUP_QUOTA_MAX_ALLOCATION_ROUNDS), and another five dynamic parameters (GROUP_ACCEPT_SURPLUS_groupname, GROUP_AUTOREGROUP_groupname, GROUP_PRIO_FACTOR_groupname, GROUP_QUOTA_groupname, GROUP_QUOTA_DYNAMIC_groupname). These are dynamic because the “groupname” in the parameter is any name listed in the GROUP_NAMES parameter.
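
To make the "dynamic" part concrete, here is a minimal Python sketch that expands a GROUP_NAMES value into the dynamic parameter names it implies. The group names are hypothetical, and splitting on commas and whitespace is an assumption about the list syntax, not a description of how HTCondor or Wallaby parse it.

```python
# Templates for the five dynamic parameters named in the text above.
DYNAMIC_TEMPLATES = [
    "GROUP_ACCEPT_SURPLUS_{name}",
    "GROUP_AUTOREGROUP_{name}",
    "GROUP_PRIO_FACTOR_{name}",
    "GROUP_QUOTA_{name}",
    "GROUP_QUOTA_DYNAMIC_{name}",
]


def parse_group_names(value):
    """Split a GROUP_NAMES value on commas and whitespace (assumed syntax)."""
    return value.replace(",", " ").split()


def dynamic_parameters(group_names_value):
    """Return every dynamic parameter name implied by a GROUP_NAMES value."""
    return [
        template.format(name=name)
        for name in parse_group_names(group_names_value)
        for template in DYNAMIC_TEMPLATES
    ]


if __name__ == "__main__":
    # "group_physics" and "group_chemistry" are made-up group names.
    for parameter in dynamic_parameters("group_physics, group_chemistry"):
        print(parameter)
```

Two groups already mean ten possible parameters, which is the bookkeeping a task specific interface can hide.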

In addition to its other features, Wallaby has an extensible shell mechanism, which can be used to create task specific porcelain.

For instance, agree that tools and administrators will store accounting group configuration on a feature called AccountingGroups. Tools can then use Wallaby’s API to manipulate the configuration, while the following porcelain simplifies managing that configuration for administrators.

See – wallaby_accounting_group_porcelain.txt
(embedding content requires javascript, which is not available on wordpress.com)

Some htcondor-wiki stats

January 29, 2013

A few years ago I discovered Web Numbr, a service that will monitor a web page for a number and graph that number over time.

I installed a handful of webnumbrs to track things at HTCondor’s gittrac instance.

http://webnumbr.com/search?query=condor

Things such as -

  • Tickets resolved with no destination: tickets that don’t indicate what version they were fixed in. Anyone who wants to know if a bug is fixed or a feature was added in their version of HTCondor and encounters one of these will have to go spelunking in the repository for the answer.
  • Tickets resolved but not assigned: tickets that were worked on, completed, but whoever worked on them never claimed ownership.
  • Action items with commits: tickets that are marked as Todo/Incident, yet have associated code changes. Once there is a code change the ticket is either a bug fix (ticket type: defect) or feature addition (ticket type: enhancement). Extra work is imposed on whoever comes after the ticket owner and wants to understand what they are looking at. Additionally, these tickets skew information about bugs and features in releases.
  • Tickets with invalid version fields: tickets that do not follow the, somewhat strict, version field syntax – vXXYYZZ, e.g. v070901. All the extra 0s are necessary and the v must be lowercase.
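
The version field syntax in the last item is easy to get wrong by hand, so here is a small Python sketch of a checker and a formatter for it. The function names are hypothetical, and the pattern encodes only what is stated above: a lowercase v followed by three zero-padded two-digit parts.

```python
import re

# vXXYYZZ: a lowercase "v" followed by exactly six ASCII digits.
VERSION_FIELD = re.compile(r"v[0-9]{6}")


def is_valid_version_field(value):
    """Return True if value follows the vXXYYZZ syntax."""
    return VERSION_FIELD.fullmatch(value) is not None


def to_version_field(major, minor, patch):
    """Format a version such as 7.9.1 as v070901."""
    for part in (major, minor, patch):
        if not 0 <= part <= 99:
            raise ValueError("each version part must fit in two digits")
    return f"v{major:02d}{minor:02d}{patch:02d}"


if __name__ == "__main__":
    print(to_version_field(7, 9, 1))
    print(is_valid_version_field("V070901"))  # uppercase V is rejected
    print(is_valid_version_field("v791"))     # missing zero padding is rejected
```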

I wanted to embed the numbers here, but javascript is needed and wordpress.com filters javascript from posts.

Concurrency Limits: Group defaults

January 21, 2013

Concurrency limits allow for protecting resources by providing a way to cap the number of jobs requiring a specific resource that can run at one time.

For instance, limit licenses and filer access at four regional data centers.

CONCURRENCY_LIMIT_DEFAULT = 15
license.north_LIMIT = 30
license.south_LIMIT = 30
license.east_LIMIT = 30
license.west_LIMIT = 45
filer.north_LIMIT = 75
filer.south_LIMIT = 150
filer.east_LIMIT = 75
filer.west_LIMIT = 75

Notice the repetition.

In addition to the repetition, every license.* and filer.* must be known and recorded in configuration. The set may be small in this example, but imagine imposing a limit on each user or each submission. The set of users is broad, dynamic and may differ by region. The set of submissions is a more extreme version of the users case, yet it is still realistic.

To simplify the configuration management for groups of limits, a new feature that provides group defaults for limits was added for the Condor 7.8 series.

The feature requires that only the exception to a rule be called out explicitly in configuration. For instance, license.west and filer.south are the exceptions in the configuration above. Simplified configuration available in 7.8,

CONCURRENCY_LIMIT_DEFAULT = 15
CONCURRENCY_LIMIT_DEFAULT_license = 30
CONCURRENCY_LIMIT_DEFAULT_filer = 75
license.west_LIMIT = 45
filer.south_LIMIT = 150

In action,

$ for limit in license.north license.south license.east license.west filer.north filer.south filer.east filer.west; do echo queue 1000 | condor_submit -a cmd=/bin/sleep -a args=1d -a concurrency_limits=$limit; done

$ condor_q -format '%s\n' ConcurrencyLimits -const 'JobStatus == 2' | sort | uniq -c | sort -n
     30 license.east
     30 license.north
     30 license.south
     45 license.west
     75 filer.east
     75 filer.north
     75 filer.west
    150 filer.south
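
The lookup order implied by the simplified configuration can be sketched in Python as: explicit limit first, then the group default, then the global default. This is an illustration of the behavior shown above, not HTCondor's implementation; the function name is hypothetical, and it assumes the group is everything before the first dot and that CONCURRENCY_LIMIT_DEFAULT is always set.

```python
# The simplified 7.8 configuration from the text, as a plain dictionary.
CONFIG = {
    "CONCURRENCY_LIMIT_DEFAULT": 15,
    "CONCURRENCY_LIMIT_DEFAULT_license": 30,
    "CONCURRENCY_LIMIT_DEFAULT_filer": 75,
    "license.west_LIMIT": 45,
    "filer.south_LIMIT": 150,
}


def resolve_limit(name, config):
    """Return the cap for a limit name such as 'license.west'."""
    # 1. An explicitly configured exception wins.
    explicit = config.get(f"{name}_LIMIT")
    if explicit is not None:
        return explicit
    # 2. Otherwise use the default for the group (text before the first dot).
    if "." in name:
        group = name.split(".", 1)[0]
        group_default = config.get(f"CONCURRENCY_LIMIT_DEFAULT_{group}")
        if group_default is not None:
            return group_default
    # 3. Finally fall back to the global default (assumed to be present).
    return config["CONCURRENCY_LIMIT_DEFAULT"]


if __name__ == "__main__":
    for limit in ("license.north", "license.west", "filer.east", "filer.south"):
        print(limit, resolve_limit(limit, CONFIG))
```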

Your API is a feature, give it real resource management

January 14, 2013

So much these days is about distributed resource management. That’s anything that can be created and destroyed in the cloud[0]. Proper management is especially important when the resource’s existence is tied to a real economy, e.g. your user’s credit card[1].

EC2 instance creation without idempotent RunInstance

Above is a state machine required to ensure that resources created in AWS EC2 are not lost, i.e. do not have to be manually cleaned up. The green arrows represent the error-free flow. The rest is about error handling or external state changes, e.g. a user-terminated operation. This is from before EC2 supported idempotent instance creation.

The state machine rewritten to use idempotent instance creation,

EC2 instance creation with idempotent RunInstance

What’s going on here? Handling failure during resource creation.

The important failure to consider as a client is what happens if you ask your resource provider to create something and you never hear back. This is a distributed system; there are numerous reasons why you may not hear back. For simplicity, consider the case where the client code crashes between sending the request and receiving the response.

The solution is to construct a transaction for resource creation[2]. To construct a transaction, you need to atomically associate a piece of information with the resource at creation time. We’ll call that piece of information an anima.

In the old EC2 API, the only way to construct an anima was through controlling a security group or keypair. Since neither is tied to a real economy, both are reasonable options. The non-idempotent state machine above uses the keypair as it is less resource intensive for EC2.

On creation failure and with the anima in hand[3], the client must search the remote system for the anima before reattempting creation. This is handled by the GM_CHECK_VM state above.

Unfortunately, without explicit support in the API, i.e. lookup by anima, the search can be unnatural and expensive. For example, EC2 instances are not indexed on keypair. Searching requires a client side scan of all instances.

With the introduction of idempotent RunInstances, the portion of the state machine for constructing and locating the anima is reduced to the GM_SAVE_CLIENT_TOKEN state, an entirely local operation. The reduction in complexity is clear.
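
A minimal Python sketch of the idempotent pattern, under loud assumptions: FakeProvider is an in-memory stand-in rather than the EC2 API, the dictionary plays the role of reliable local storage, and all names are hypothetical. The point is the ordering: save the token locally, then send the request, so a retry after a lost response reuses the same token.

```python
import uuid


class FakeProvider:
    """In-memory stand-in for a provider with idempotent creation."""

    def __init__(self):
        self._by_token = {}

    def run_instance(self, client_token):
        # The same token always maps to the same instance.
        if client_token not in self._by_token:
            self._by_token[client_token] = f"i-{len(self._by_token) + 1:08d}"
        return self._by_token[client_token]

    def instance_count(self):
        return len(self._by_token)


class Client:
    def __init__(self, provider, storage):
        self.provider = provider
        self.storage = storage  # stands in for reliable local storage

    def create(self, job_id, lose_response=False):
        token = self.storage.get(job_id)
        if token is None:
            token = str(uuid.uuid4())
            self.storage[job_id] = token  # save BEFORE sending the request
        instance_id = self.provider.run_instance(token)
        if lose_response:
            # Simulate never hearing back: the instance exists remotely.
            raise ConnectionError("response lost")
        return instance_id


def demo():
    provider = FakeProvider()
    client = Client(provider, storage={})
    try:
        client.create("job-1", lose_response=True)
    except ConnectionError:
        pass  # nothing to search for; just retry with the saved token
    return client.create("job-1"), provider.instance_count()


if __name__ == "__main__":
    print(demo())
```

The retry needs no remote search because the saved token is the anima, and only one instance exists after the failure.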

After two years, EC2 appears to be the only API providing idempotent instance creation[4], though other APIs are starting to provide atomic anima association, often through metadata or instance attributes, and some even provide lookup.

You should provide an idempotent resource creation operation in your API too!

[0] “in the cloud” – really anywhere in any distributed system!
[1] Making money from forgotten or misplaced resources is a short term play.
[2] Alternatively, you can choose an architecture with a janitor process, which will bring its own complexities.
[3] “in hand” – so long as your hand is reliable storage.
[4] After a quick survey, I’m looking at you Rackspace, RimuHosting, vCloud, OpenNebula, OpenStack, Eucalyptus, GoGrid, Deltacloud, Google Compute Engine and Gandi.

Web design complexity

January 7, 2013

One thing that has always impressed me is the ability of web designers to deal with browser idiosyncrasies.

For instance, knowing why this happens in firefox-17.0.1-1.fc17.x86_64 -

A bootstrap btn-primary viewed from 0.0.0.0

A bootstrap btn-primary viewed from localhost

A bootstrap btn-primary firebug computed color from 0.0.0.0 and localhost

Needless to say, the web is littered with questions about why btn-primary background color is not always white. Most have answers, some with varying degrees of complex CSS. Others involve changing versions of software. All the while it might just be the URL used to view the page.

Test in a production environment.

Configuration and policy evaluation

December 10, 2012

Figuring out how evaluation happens in configuration and policy is a common problem. The confusion is justified.

Configuration provides substitution with $() syntax, while policy is full ClassAd language evaluation without $() syntax.

Configuration is all the parameters listed in files discoverable with condor_config_val -config.

$ condor_config_val -config
Configuration source:
	/etc/condor/condor_config
Local configuration sources:
	/etc/condor/config.d/00personal_condor.config

Policy is the ClassAd expression found on the right-hand side of specific configuration parameters. For instance,

$ condor_config_val -v START
START: ( (KeyboardIdle > 15 * 60) && ( ((LoadAvg - CondorLoadAvg) <= 0.3) || (State != "Unclaimed" && State != "Owner")) )
  Defined in '/etc/condor/condor_config', line 753.

Configuration evaluation allows for substitution of configuration parameters with $().

$ cat /etc/condor/condor_config | head -n753 | tail -n1
START			= $(UWCS_START)

$ condor_config_val -v UWCS_START
UWCS_START: ( (KeyboardIdle > 15 * 60) && ( ((LoadAvg - CondorLoadAvg) <= 0.3) || (State != "Unclaimed" && State != "Owner")) )
  Defined in '/etc/condor/condor_config', line 808.

$ cat /etc/condor/condor_config | head -n808 | tail -n3
UWCS_START	= ( (KeyboardIdle > $(StartIdleTime)) \
                    && ( $(CPUIdle) || \
                         (State != "Unclaimed" && State != "Owner")) )

Here START is actually the value of UWCS_START, provided by $(UWCS_START).

The substitution is recursive. Explore /etc/condor/condor_config and the JustCPU parameter. It is actually a parameter that is never read by daemons or tools. It is only useful in other configuration parameters. It’s shorthand.
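
The recursive substitution can be sketched in a few lines of Python. This is an illustration, not HTCondor's parser: it ignores case-insensitive parameter names, treats undefined macros as empty strings, and the definitions of StartIdleTime, MINUTE, CPUIdle, NonCondorLoadAvg and BackgroundLoad are assumed values chosen so the result matches the condor_config_val output above.

```python
import re

MACRO = re.compile(r"\$\(([A-Za-z_][A-Za-z0-9_]*)\)")

PARAMS = {
    "START": "$(UWCS_START)",
    "UWCS_START": '( (KeyboardIdle > $(StartIdleTime)) && ( $(CPUIdle) || '
                  '(State != "Unclaimed" && State != "Owner")) )',
    # Assumed shorthand definitions, for illustration only.
    "StartIdleTime": "15 * $(MINUTE)",
    "MINUTE": "60",
    "CPUIdle": "($(NonCondorLoadAvg) <= $(BackgroundLoad))",
    "NonCondorLoadAvg": "(LoadAvg - CondorLoadAvg)",
    "BackgroundLoad": "0.3",
}


def expand(value, params, seen=()):
    """Recursively replace every $(NAME) in value with its expanded definition."""

    def replace(match):
        name = match.group(1)
        if name in seen:
            raise ValueError(f"circular reference to {name}")
        # Assumption: an undefined macro expands to the empty string.
        return expand(params.get(name, ""), params, seen + (name,))

    return MACRO.sub(replace, value)


if __name__ == "__main__":
    print(expand("$(START)", PARAMS))
```

Attribute names such as KeyboardIdle and State pass through untouched, because they are not wrapped in $(); they are left for policy evaluation.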

Policy evaluation is full ClassAd expression evaluation. The evaluation happens at the appropriate times while daemons or tools are running.

Taking START as an example, the words KeyboardIdle, LoadAvg, CondorLoadAvg and State are attributes found on machine ads, and the expression is evaluated by the condor_startd and condor_negotiator to figure out if a job is allowed to start on a resource.

$ condor_status -l slot1@eeyore.local | grep -e ^KeyboardIdle -e ^LoadAvg -e ^CondorLoadAvg -e ^State
KeyboardIdle = 0
LoadAvg = 0.290000
CondorLoadAvg = 0.0
State = "Owner"

Evaluation happens by recursively evaluating those attributes. The expression ((KeyboardIdle > 15 * 60) && (((LoadAvg - CondorLoadAvg) <= 0.3) || (State != "Unclaimed" && State != "Owner"))) becomes ((0 > 15 * 60) && (((0.29 - 0.0) <= 0.3) || ("Owner" != "Unclaimed" && "Owner" != "Owner"))). And so forth.
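
To follow the "and so forth", here is the same START expression written as a plain Python function over the attribute values shown above. It is a sketch only: Python booleans stand in for ClassAd evaluation, so ClassAd details such as undefined values are not modeled.

```python
# Attribute values copied from the condor_status output above.
MACHINE_AD = {
    "KeyboardIdle": 0,
    "LoadAvg": 0.29,
    "CondorLoadAvg": 0.0,
    "State": "Owner",
}


def start(ad):
    """Evaluate the START policy against a machine ad given as a dictionary."""
    keyboard_idle = ad["KeyboardIdle"] > 15 * 60
    cpu_idle = (ad["LoadAvg"] - ad["CondorLoadAvg"]) <= 0.3
    already_in_use = ad["State"] != "Unclaimed" and ad["State"] != "Owner"
    return keyboard_idle and (cpu_idle or already_in_use)


if __name__ == "__main__":
    # The keyboard was touched 0 seconds ago, so the first clause is False
    # and the whole expression is False: no job may start on this slot.
    print(start(MACHINE_AD))
```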

That’s it.

Tail your logs, for fun and profit

December 3, 2012

If you don’t run tail -F on your logs periodically, you should. It’s illuminating. Try,

tail -F /var/log/condor/*Log | grep -i -e error -e fail -e warn

I ran that over the weekend and learned a few things -

0) ERROR WriteUserLog Failed to grab global event log lock means that the EVENT_LOG is lossy in unexpected ways. We know the EVENT_LOG rotates and if you’re watching it but miss a rotation you’ll miss events. However, when the above warning (not ERROR imho) is printed the event that was going to be written is dropped. So the EVENT_LOG could be lossy on the edges and in the middle.

1) GroupTracker (pid = 13252): fopen error: Failed to open file '/proc/13252/cgroup'. Error No such file or directory (2), coming from the ProcLog, means that a tracked process has disappeared. The exact implications are not clear, but the author, Brian Bockelman, suggests the message could be quieted as it doesn’t represent a functional problem. Maybe D_ALWAYS -> D_FULLDEBUG.

2) tail: `/var/log/condor/JobServerLog' has become inaccessible: No such file or directory many times in a row. When the job_queue.log is compressed, effectively recreated, the condor_job_server enters a phase where it reconstructs its internal state, in an apparently noisy fashion and can rotate its log file multiple times per second.

3) (1157197.152) (12639): attempt to connect to <10.10.10.10:52143> failed: Connection refused (connect errno = 111). and (1157197.152) (12639): Attempt to reconnect failed: Failed to connect to starter <10.10.10.10:52143> turned out to be an issue on 10.10.10.10, where all jobs from a user were failing to start because of passwd_cache::cache_uid(): getpwnam("matt") failed: user not found with ERROR: Uid for "matt" not found in passwd file and SOFT_UID_DOMAIN is False and ERROR: Failed to determine what user to run this job as, aborting. The host was effectively a black hole because of a misconfigured UID_DOMAIN.

4) (1157079.244) (1199): ERROR "Can no longer talk to condor_starter <10.10.10.11:52725> turned out to be an issue on 10.10.10.11, where all jobs were failing to start because of Create_Process: Cannot access specified executable "/tmp/mycondor/release_dir/sbin/condor_starter": errno = 2 (No such file or directory) with slot5: ERROR: exec_starter failed! and slot5: ERROR: exec_starter returned 0, which was more bad configuration.

5) FileLock::obtain(1) failed - errno 0 (Success) looks wrong.
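
When a tally is more useful than a live stream, the same case-insensitive filter as the grep above can be sketched in Python. The first two sample lines are messages quoted in this post; the third is a made-up benign line, and the function names are hypothetical.

```python
import re
from collections import Counter

KEYWORDS = ("error", "fail", "warn")
PATTERN = re.compile("|".join(KEYWORDS), re.IGNORECASE)

SAMPLE = [
    "ERROR WriteUserLog Failed to grab global event log lock",
    "FileLock::obtain(1) failed - errno 0 (Success)",
    "Number of Active Workers 0",  # hypothetical line that should be skipped
]


def matching_lines(lines):
    """Keep lines mentioning error, fail or warn, ignoring case."""
    return [line for line in lines if PATTERN.search(line)]


def keyword_counts(lines):
    """Count how many lines mention each keyword."""
    counts = Counter()
    for line in lines:
        lowered = line.lower()
        for keyword in KEYWORDS:
            if keyword in lowered:
                counts[keyword] += 1
    return counts


if __name__ == "__main__":
    for line in matching_lines(SAMPLE):
        print(line)
    print(dict(keyword_counts(SAMPLE)))
```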

Social scheduling

November 26, 2012

As a thought experiment.

There are always multiple users and limited resources. Users have work, which takes time and resources to complete.

The top resource users are visible to all.

A user can relinquish resources she is using.

A relinquished resource, either by work completing or by user action, is reassigned randomly.

How would this not work?

How would you refine it?

Extensible machine resources

November 19, 2012

Physical machines are home to many types of resources these days. The traditional cores, memory and disk now share space with GPUs, co-processors or even protein sequence analysis accelerators.

To facilitate use and management of these resources, a new feature is available in HTCondor for extending machine resources. Analogous to concurrency limits, which operate on a pool / global level, machine resources operate on a machine / local level.

The feature allows a machine to advertise that it has specific types of resources available. Jobs can then specify that they require those specific types of resources. And the matchmaker will take into account the new resource types.

By example, a machine may have some GPU resources, an RS232 connected to your favorite telescope, and a number of physical spinning hard disk drives. The configuration for this would be,

MACHINE_RESOURCE_NAMES = GPU, RS232, SPINDLE
MACHINE_RESOURCE_GPU = 2
MACHINE_RESOURCE_RS232 = 1
MACHINE_RESOURCE_SPINDLE = 4

SLOT_TYPE_1 = cpus=100%,auto
SLOT_TYPE_1_PARTITIONABLE = TRUE
NUM_SLOTS_TYPE_1 = 1

Aside – cpus=100%,auto instead of just auto because of GT3327. Also, the configuration for SLOT_TYPE_1 will likely go away in the future when all slots are partitionable by default.

Once a machine with this configuration is running,

$ condor_status -long | grep -i MachineResources
MachineResources = "cpus memory disk swap gpu rs232 spindle"

$ condor_status -long | grep -i -e TotalCpus -e TotalMemory -e TotalGpu -e TotalRs232 -e TotalSpindle
TotalCpus = 24
TotalMemory = 49152
TotalGpu = 2
TotalRs232 = 1
TotalSpindle = 4

$ condor_status -long | grep -i -e ^Cpus -e ^Memory -e ^Gpu -e ^Rs232 -e ^Spindle
Cpus = 24
Memory = 49152
Gpu = 2
Rs232 = 1
Spindle = 4

As you can see, the machine is reporting the different types of resources, how many of each it has and how many are currently available.

A job can take advantage of these new types of resources using a syntax already familiar for requesting resources from partitionable slots.

To consume one of the GPUs,

cmd = luxmark.sh

request_gpu = 1

queue

Or for a disk intensive workload,

cmd = hadoop_datanode.sh

request_spindle = 1

queue

With these jobs submitted and running,

$ condor_status
Name            OpSys      Arch   State     Activity LoadAv Mem ActvtyTime

slot1@eeyore    LINUX      X86_64 Unclaimed Idle      0.400 48896 0+00:00:28
slot1_1@eeyore  LINUX      X86_64 Claimed   Busy      0.000  128 0+00:00:04
slot1_2@eeyore  LINUX      X86_64 Claimed   Busy      0.000  128 0+00:00:04
                     Machines Owner Claimed Unclaimed Matched Preempting
        X86_64/LINUX        3     0       2         1       0          0
               Total        3     0       2         1       0          0

$ condor_status -l slot1@eeyore | grep -i -e ^Cpus -e ^Memory -e ^Gpu -e ^Rs232 -e ^Spindle
Cpus = 22
Memory = 48896
Gpu = 1
Rs232 = 1
Spindle = 3

That’s 22 cores, 1 gpu and 3 spindles still available.

Submit four more of the spindle consuming jobs and you’ll find the fourth does not run, because the available number of spindles is 0.

$ condor_status -l slot1@eeyore | grep -i -e ^Cpus -e ^Memory -e ^Gpu -e ^Rs232 -e ^Spindle
Cpus = 19
Memory = 48512
Gpu = 1
Rs232 = 1
Spindle = 0
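
The bookkeeping behind those numbers can be sketched in Python. This is an illustration of the accounting only, not the condor_startd: the class is hypothetical, and the 1 core and 128 MB carved off per job are assumptions inferred from the condor_status output above.

```python
class PartitionableSlot:
    """Tracks what is still available on a partitionable slot."""

    def __init__(self, resources):
        self.available = dict(resources)

    def claim(self, request):
        """Carve the request out of the slot, or refuse if anything is short."""
        if any(self.available.get(name, 0) < amount
               for name, amount in request.items()):
            return False
        for name, amount in request.items():
            self.available[name] -= amount
        return True


# Assumed per-job defaults, inferred from the output above.
GPU_JOB = {"cpus": 1, "memory": 128, "gpu": 1}
SPINDLE_JOB = {"cpus": 1, "memory": 128, "spindle": 1}


def run_example():
    slot = PartitionableSlot(
        {"cpus": 24, "memory": 49152, "gpu": 2, "rs232": 1, "spindle": 4}
    )
    slot.claim(GPU_JOB)
    slot.claim(SPINDLE_JOB)
    # Four more spindle jobs: three fit, the fourth finds no spindle left.
    results = [slot.claim(SPINDLE_JOB) for _ in range(4)]
    return results, slot.available


if __name__ == "__main__":
    results, available = run_example()
    print(results)
    print(available)
```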

Since these custom resources are available as attributes in various ClassAds the same way Cpus, Memory and Disk are, all the policy, management and reporting capabilities you would expect are available.

