Archive for the ‘Condor’ Category

Statistic changes in HTCondor 7.7

February 12, 2013

Notice to HTCondor 7.8 users –

Statistics implemented during the 7.5 series that landed in 7.7.0 were rewritten by the time 7.8 was released. If you were using the original statistics for monitoring and/or reporting, here is a table to help you map old (left column) to new (right column).

See – 7.6 -> 7.8 schedd stats
(embedding content requires javascript, which is not available on wordpress.com)

Note: The *Rate and Mean* attributes require math, and UpdateTime requires memory

Some htcondor-wiki stats

January 29, 2013

A few years ago I discovered Web Numbr, a service that will monitor a web page for a number and graph that number over time.

I installed a handful of webnumbrs to track things at HTCondor’s gittrac instance.

http://webnumbr.com/search?query=condor

Things such as –

  • Tickets resolved with no destination: tickets that don’t indicate what version they were fixed in. Anyone wanting to know if a bug is fixed or a feature was added to their version of HTCondor and encounters one of these will have to go spelunking in the repository for their answer.
  • Tickets resolved but not assigned: tickets that were worked on, completed, but whoever worked on them never claimed ownership.
  • Action items with commits: tickets that are marked as Todo/Incident, yet have associated code changes. Once there is a code change the ticket is either a bug fix (ticket type: defect) or feature addition (ticket type: enhancement). Extra work is imposed on whoever comes after the ticket owner and wants to understand what they are looking at. Additionally, these tickets skew information about bugs and features in releases.
  • Tickets with invalid version fields: tickets that do not follow the, somewhat strict, version field syntax – vXXYYZZ, e.g. v070901. All the extra 0s are necessary and the v must be lowercase.
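
The version field rule in the last bullet is easy to check mechanically. Here is a minimal sketch in Python, assuming the field is exactly a lowercase v followed by three zero-padded, two-digit components. The function name is hypothetical and not part of gittrac.

```python
import re

# Assumed pattern: lowercase "v" followed by exactly six digits (XXYYZZ).
VERSION_FIELD = re.compile(r"v([0-9]{2})([0-9]{2})([0-9]{2})")

def parse_version_field(field):
    """Return (major, minor, patch) for a valid field, or None otherwise."""
    match = VERSION_FIELD.fullmatch(field)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())

print(parse_version_field("v070901"))  # valid: all the extra 0s are present
print(parse_version_field("V070901"))  # invalid: the v must be lowercase
print(parse_version_field("v791"))     # invalid: components are not zero-padded
```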

I wanted to embed the numbers here, but javascript is needed and wordpress.com filters javascript from posts.

Concurrency Limits: Group defaults

January 21, 2013

Concurrency limits allow for protecting resources by providing a way to cap the number of jobs requiring a specific resource that can run at one time.

For instance, limit licenses and filer access at four regional data centers.

CONCURRENCY_LIMIT_DEFAULT = 15
license.north_LIMIT = 30
license.south_LIMIT = 30
license.east_LIMIT = 30
license.west_LIMIT = 45
filer.north_LIMIT = 75
filer.south_LIMIT = 150
filer.east_LIMIT = 75
filer.west_LIMIT = 75

Notice the repetition.

In addition to the repetition, every license.* and filer.* must be known and recorded in configuration. The set may be small in this example, but imagine imposing a limit on each user or each submission. The set of users is broad, dynamic and may differ by region. The set of submissions is a more extreme version of the users case, yet it is still realistic.

To simplify the configuration management for groups of limits, a new feature that provides group defaults for limits was added for the Condor 7.8 series.

The feature requires that only the exception to a rule be called out explicitly in configuration. For instance, license.west and filer.south are the exceptions in the configuration above. Simplified configuration available in 7.8,

CONCURRENCY_LIMIT_DEFAULT = 15
CONCURRENCY_LIMIT_DEFAULT_license = 30
CONCURRENCY_LIMIT_DEFAULT_filer = 75
license.west_LIMIT = 45
filer.south_LIMIT = 150

In action,

$ for limit in license.north license.south license.east license.west filer.north filer.south filer.east filer.west; do echo queue 1000 | condor_submit -a cmd=/bin/sleep -a args=1d -a concurrency_limits=$limit; done

$ condor_q -format '%s\n' ConcurrencyLimits -const 'JobStatus == 2' | sort | uniq -c | sort -n
     30 license.east
     30 license.north
     30 license.south
     45 license.west
     75 filer.east
     75 filer.north
     75 filer.west
    150 filer.south
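
The fallback order that the simplified configuration implies can be sketched in a few lines of Python. This is an illustration of the rule only, not HTCondor's implementation. The function name, the dict standing in for configuration, and the assumption that the group is whatever precedes the first "." are all made up for the example.

```python
def resolve_limit(name, config):
    """Resolve a limit: explicit value, then group default, then global default."""
    explicit = config.get(name + "_LIMIT")
    if explicit is not None:
        return explicit
    # Assumption: the group is the part of the limit name before the first ".".
    group = name.split(".", 1)[0]
    group_default = config.get("CONCURRENCY_LIMIT_DEFAULT_" + group)
    if group_default is not None:
        return group_default
    return config.get("CONCURRENCY_LIMIT_DEFAULT")

# The simplified 7.8 configuration from above, as a plain dict.
config = {
    "CONCURRENCY_LIMIT_DEFAULT": 15,
    "CONCURRENCY_LIMIT_DEFAULT_license": 30,
    "CONCURRENCY_LIMIT_DEFAULT_filer": 75,
    "license.west_LIMIT": 45,
    "filer.south_LIMIT": 150,
}

for name in ("license.north", "license.west", "filer.east", "filer.south"):
    print(name, resolve_limit(name, config))
```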

Tail your logs, for fun and profit

December 3, 2012

If you don’t run tail -F on your logs periodically, you should. It’s illuminating. Try,

tail -F /var/log/condor/*Log | grep -i -e error -e fail -e warn

I ran that over the weekend and learned a few things –

0) ERROR WriteUserLog Failed to grab global event log lock means that the EVENT_LOG is lossy in unexpected ways. We know the EVENT_LOG rotates and if you’re watching it but miss a rotation you’ll miss events. However, when the above warning (not ERROR imho) is printed the event that was going to be written is dropped. So the EVENT_LOG could be lossy on the edges and in the middle.

1) GroupTracker (pid = 13252): fopen error: Failed to open file '/proc/13252/cgroup'. Error No such file or directory (2), coming from the ProcLog, means that a tracked process has disappeared. The exact implications are not clear, but the author, Brian Bockelman, suggests the message could be quieted as it doesn’t represent a functional problem. Maybe D_ALWAYS -> D_FULLDEBUG.

2) tail: `/var/log/condor/JobServerLog' has become inaccessible: No such file or directory many times in a row. When the job_queue.log is compressed, effectively recreated, the condor_job_server enters a phase where it reconstructs its internal state, in an apparently noisy fashion and can rotate its log file multiple times per second.

3) (1157197.152) (12639): attempt to connect to <10.10.10.10:52143> failed: Connection refused (connect errno = 111). and (1157197.152) (12639): Attempt to reconnect failed: Failed to connect to starter <10.10.10.10:52143> turned out to be an issue on 10.10.10.10, where all jobs from a user were failing to start because of passwd_cache::cache_uid(): getpwnam("matt") failed: user not found with ERROR: Uid for "matt" not found in passwd file and SOFT_UID_DOMAIN is False and ERROR: Failed to determine what user to run this job as, aborting. The host was effectively a black hole because of a misconfigured UID_DOMAIN.

4) (1157079.244) (1199): ERROR "Can no longer talk to condor_starter <10.10.10.11:52725> turned out to be an issue on 10.10.10.11, where all jobs were failing to start because of Create_Process: Cannot access specified executable "/tmp/mycondor/release_dir/sbin/condor_starter": errno = 2 (No such file or directory) with slot5: ERROR: exec_starter failed! and slot5: ERROR: exec_starter returned 0, which was more bad configuration.

5) FileLock::obtain(1) failed - errno 0 (Success) looks wrong.

Extensible machine resources

November 19, 2012

Physical machines are home to many types of resources these days. The traditional cores, memory, disk, now share space with gpus, co-processors or even protein sequence analysis accelerators.

To facilitate use and management of these resources, a new feature is available in HTCondor for extending machine resources. Analogous to concurrency limits, which operate on a pool / global level, machine resources operate on a machine / local level.

The feature allows a machine to advertise that it has specific types of resources available. Jobs can then specify that they require those specific types of resources. And the matchmaker will take into account the new resource types.

By example, a machine may have some GPU resources, an RS232 connected to your favorite telescope, and a number of physical spinning hard disk drives. The configuration for this would be,

MACHINE_RESOURCE_NAMES = GPU, RS232, SPINDLE
MACHINE_RESOURCE_GPU = 2
MACHINE_RESOURCE_RS232 = 1
MACHINE_RESOURCE_SPINDLE = 4

SLOT_TYPE_1 = cpus=100%,auto
SLOT_TYPE_1_PARTITIONABLE = TRUE
NUM_SLOTS_TYPE_1 = 1

Aside – cpus=100%,auto instead of just auto because of GT3327. Also, the configuration for SLOT_TYPE_1 will likely go away in the future when all slots are partitionable by default.

Once a machine with this configuration is running,

$ condor_status -long | grep -i MachineResources
MachineResources = "cpus memory disk swap gpu rs232 spindle"

$ condor_status -long | grep -i -e TotalCpus -e TotalMemory -e TotalGpu -e TotalRs232 -e TotalSpindle
TotalCpus = 24
TotalMemory = 49152
TotalGpu = 2
TotalRs232 = 1
TotalSpindle = 4

$ condor_status -long | grep -i -e ^Cpus -e ^Memory -e ^Gpu -e ^Rs232 -e ^Spindle
Cpus = 24
Memory = 49152
Gpu = 2
Rs232 = 1
Spindle = 4

As you can see, the machine is reporting the different types of resources, how many of each it has and how many are currently available.

A job can take advantage of these new types of resources using a syntax already familiar for requesting resources from partitionable slots.

To consume one of the GPUs,

cmd = luxmark.sh

request_gpu = 1

queue

Or for a disk intensive workload,

cmd = hadoop_datanode.sh

request_spindle = 1

queue

With these jobs submitted and running,

$ condor_status
Name            OpSys      Arch   State     Activity LoadAv Mem ActvtyTime

slot1@eeyore    LINUX      X86_64 Unclaimed Idle      0.400 48896 0+00:00:28
slot1_1@eeyore  LINUX      X86_64 Claimed   Busy      0.000  128 0+00:00:04
slot1_2@eeyore  LINUX      X86_64 Claimed   Busy      0.000  128 0+00:00:04
                     Machines Owner Claimed Unclaimed Matched Preempting
        X86_64/LINUX        3     0       2         1       0          0
               Total        3     0       2         1       0          0

$ condor_status -l slot1@eeyore | grep -i -e ^Cpus -e ^Memory -e ^Gpu -e ^Rs232 -e ^Spindle
Cpus = 22
Memory = 48896
Gpu = 1
Rs232 = 1
Spindle = 3

That’s 22 cores, 1 gpu and 3 spindles still available.

Submit four more of the spindle consuming jobs and you’ll find the fourth does not run, because the available number of spindles is 0.

$ condor_status -l slot1@eeyore | grep -i -e ^Cpus -e ^Memory -e ^Gpu -e ^Rs232 -e ^Spindle
Cpus = 19
Memory = 48512
Gpu = 1
Rs232 = 1
Spindle = 0

Since these custom resources are available as attributes in various ClassAds the same way Cpus, Memory and Disk are, all the policy, management and reporting capabilities you would expect are available.
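
The accounting seen in the condor_status output above can be mimicked with a small Python sketch. It is a toy model, not how the startd is implemented. The function name is hypothetical, and the per-job request of 1 cpu and 128 MB of memory is inferred from the numbers in the output rather than stated anywhere in configuration.

```python
def try_consume(available, request):
    """Subtract request from available in place if it fits; return whether it fit."""
    if any(available.get(res, 0) < amount for res, amount in request.items()):
        return False
    for res, amount in request.items():
        available[res] -= amount
    return True

# Resources advertised by the partitionable slot in the example.
slot = {"cpus": 24, "memory": 49152, "gpu": 2, "rs232": 1, "spindle": 4}

# Assumed per-job requests: 1 cpu and 128 MB, plus the custom resource.
gpu_job = {"cpus": 1, "memory": 128, "gpu": 1}
spindle_job = {"cpus": 1, "memory": 128, "spindle": 1}

try_consume(slot, gpu_job)
try_consume(slot, spindle_job)

# Four more spindle jobs: three fit, the fourth finds no spindle left.
results = [try_consume(slot, spindle_job) for _ in range(4)]
print(results)
print(slot)
```

The final state of the dict lines up with the last condor_status output: 19 cpus, 48512 MB of memory, 1 gpu, 1 rs232 and 0 spindles.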

No longer thinking in slots, thinking in aggregate resources and consumption policies

November 13, 2012

The slot model was natural when a machine housed a single core. Though, the slot model did not exist when a machine housed a single core.

When machines were single core the model was a machine, represented as a MachineAd. A MachineAd had an associated CPU, some nominal amount of RAM and some chunk of disk space. Running a job meant consuming a machine.

When machines grew multiple cores the machine model was split. A single machine became independent MachineAds, called virtual machines. However, the name didn’t stick as the term virtual machine became a popular term in hardware virtualization. So a machine became independent MachineAds, called slots. The unifying entity, the machine itself, was lost. Running a job still meant consuming a slot.

Most recently, slots split into two classes: static and partitionable. Static slots are the slots formerly known as virtual machines. Partitionable slots are a representation of the physical machine itself, and are carved up, on-demand to service jobs. Both types are still MachineAds, but the consumption of partitionable slots is dynamic.

The slot model has demonstrated great utility but has been stretched.

In this time workloads have also changed. They have become more memory bound, disk IO bound, and network bound. They have started relying on specialized hardware and even application level services. They have started both spanning and packing into cores. They have grown complex data dependencies, become very short running, and become infrastructure level long running.

Machines have also grown to include scores of cores, hundreds of gigabytes of RAM, dozens of terabytes of disk, specialized hardware such as GPUs, co-processors, entropy keys, high speed interconnects and a bevy of other attached devices.

Machines are lumpy; heterogeneous means more than operating system and CPU architecture.

Furthermore, if it still existed, the machine model itself would fail to cleanly describe available resources. Classes of resources exist that house entire clusters, grids, or life-cycle manageable application services. Some resources share addressable memory across operating system instances, some are custom architectures spanning whole data centers, and some don’t even provide an outline of their capacity. Resources may grow and shrink while in use.

Consumption of these resources is not necessarily straightforward or uniform.

It’s time to stop thinking in slots. It’s time to start thinking in aggregate resources and their consumption policies.

FAQ: Job resubmission?

November 5, 2012

A question that often arises when approaching Condor from other batch systems is “How does Condor deal with resubmission of failed/preempted/killed jobs?”

The answer requires a slight shift in thinking.

Condor provides more functionality around the resubmission use case than most other schedulers. And the default policy is set up in such a way that most Condor folks don’t ever think about “resubmission.”

Condor will keep your job in the queue (condor_schedd managed) until the policy attached to the job says otherwise.

The default policy says a job will be run as many times as necessary for the job to terminate. So if the machine a job is running on crashes (generally, becomes unavailable), the condor_schedd will automatically try to run the job on another machine.

When you start changing the default policy you can control things such as: if a job should be removed after a period of time, even if it is running or only if it hasn’t started running; if a job should run multiple times even if it terminated cleanly; if a termination w/ an error should make the job run again, be held in the queue for inspection, be removed from the queue; if a job held for inspection should be held forever or a specific amount of time; if a job should only start running at a specific time in the future, or be run at repeated intervals.

The condor_submit manual page can provide specifics.

Pre and Post job scripts

October 29, 2012

Condor has a few ways to run programs associated with a job, beyond the job itself. If you’re an administrator, you can use the USER_JOB_WRAPPER. If you’re a user who is friends with your administrator, you can use Job Hooks. If you are ambitious, you can wrap all your jobs in a script that runs programs before and after your actual job.

Or, you can use the PreCmd and PostCmd attributes on your job. They specify programs to run before and after your job executes. By example,

$ cat prepost.job
cmd = /bin/sleep
args = 1

log = prepost.log
output = prepost.out
error = prepost.err

+PreCmd = "pre_script"
+PostCmd = "post_script"

transfer_input_files = pre_script, post_script
should_transfer_files = always

queue
$ cat pre_script
#!/bin/sh
date > prepost.pre

$ cat post_script
#!/bin/sh
date > prepost.post

Running,

$ condor_submit prepost.job
Submitting job(s)
.
1 job(s) submitted to cluster 1.

...wait a few seconds, or 259...

$ cat prepost.pre
Sun Oct 14 18:06:00 UTC 2012

$ cat prepost.post
Sun Oct 14 18:06:02 UTC 2012

That’s about it, except for some gotchas.

  • transfer_input_files is manual and required
  • The scripts are run from Iwd, so you can’t use +PreCmd="/bin/blah"; instead use +PreCmd="blah" and transfer_input_files=/bin/blah
  • should_transfer_files = always is needed because the scripts are run from Iwd; if the job runs local to the Schedd, Iwd will be in the EXECUTE directory but the scripts won’t be
  • Script stdout/err and exit code are ignored
  • You must use +Attr="" syntax, +PreCmd=pre_script won’t work
  • There is no option of arguments for the scripts
  • There is no starter environment, thus no $_CONDOR_JOB_AD/$_CONDOR_MACHINE_AD, but you can find .job_ad and .machine_ad in $_CONDOR_SCRATCH_DIR
  • Make sure the scripts are executable, otherwise the job will be put on hold with a reason similar to: Error from 127-0-0-1.NO_DNS: Failed to execute ‘…/dir_30626/pre_script’: Permission denied
  • PostCmd is broken in condor 7.6, but works in 7.8

Tip: ISO8601 dates in your logs

October 22, 2012

Condor produces internal data in both structured and unstructured forms.

The structured forms are just that and designed to be processed by external programs. These are the event logs (UserLog or EVENT_LOG), the HISTORY file and PER_JOB_HISTORY_DIR and POOL_HISTORY_DIR, and the job_queue.log and Accountantnew.log transaction logs.

The unstructured forms are for debugging and designed to be read by a person, often an experienced person. They are often called trace, or debug, logs and are the files in the LOG directory, or the extra output seen when passing -debug to command-line tools, i.e. condor_q -debug.

Consuming and processing the unstructured forms with external programs is increasingly important. Consider tracing incidents through a deployment of 50,000 geographically distributed, physical and virtual systems. Or, even 100 local systems.

More and more tools that provide the ability to aggregate unstructured logs are emerging and they all need to do some basic parsing of the logs. Help make their integration simpler by using a well-defined format for timestamps.

For instance, ISO8601 –

DEBUG_TIME_FORMAT = "%Y-%m-%dT%H:%M:%S%z "
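
To see what that format produces and why it is friendly to parsers, here is a small Python sketch. It assumes the format string is interpreted with strftime conversions, which Python's datetime also understands; the sample timestamp is just an example value.

```python
from datetime import datetime, timezone

# The same conversions as in DEBUG_TIME_FORMAT, including the trailing space.
DEBUG_TIME_FORMAT = "%Y-%m-%dT%H:%M:%S%z "

# Example value only: a fixed, timezone-aware timestamp.
stamp = datetime(2012, 10, 14, 18, 6, 0, tzinfo=timezone.utc)

prefix = stamp.strftime(DEBUG_TIME_FORMAT)
print(repr(prefix))

# A log aggregator can turn the prefix back into a timestamp without guessing.
parsed = datetime.strptime(prefix.strip(), DEBUG_TIME_FORMAT.strip())
print(parsed == stamp)
```

Because the offset is part of every line, timestamps from hosts in different time zones can be compared directly.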

Advanced scheduling: Execute periodically with cron jobs

October 15, 2012

If you want to run a job periodically you could repeatedly submit jobs, or qedit existing jobs after they run, but both of those options are a kludge. Instead, the condor_schedd provides support for cron-like jobs as a first-class citizen.

The cron-like feature builds on the ability to defer job execution. However, instead of using deferral_time, commands analogous to crontab(5) fields are available. cron_month, cron_day_of_month, cron_day_of_week, cron_hour, and cron_minute all behave as you would expect, and default to * when not provided.

To run a job every two minutes,

executable = /bin/date
log = cron.log
output = cron.out
error = cron.err

cron_minute = 0-59/2
on_exit_remove = false

queue

Note – on_exit_remove = false is required or the job will only be run once. It is arguable that on_exit_remove should default to false for jobs using cron_* commands.
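
For anyone unsure how a field like 0-59/2 reads, here is a rough Python sketch that expands a crontab-style field into the values it matches. It handles only a simplified subset of crontab(5) syntax (lists, ranges, steps and *) and is not HTCondor's parser; the function name is hypothetical.

```python
def expand_cron_field(field, low, high):
    """Expand a crontab-style field into the sorted list of matching values."""
    values = set()
    for part in field.split(","):
        spec, _, step = part.partition("/")
        step = int(step) if step else 1
        if spec == "*":
            start, end = low, high
        elif "-" in spec:
            start, end = (int(x) for x in spec.split("-"))
        else:
            start = end = int(spec)
        values.update(range(start, end + 1, step))
    return sorted(values)

# cron_minute = 0-59/2 matches every even minute of the hour.
minutes = expand_cron_field("0-59/2", 0, 59)
print(minutes[:5], "...", minutes[-1])
```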

After submitting and waiting 10 minutes, results can be found in the cron.log file.

$ grep ^00 cron.log
000 (009.000.000) 09/09 09:22:46 Job submitted from host: <127.0.0.1:56639>
001 (009.000.000) 09/09 09:24:00 Job executing on host: <127.0.0.1:45887>
006 (009.000.000) 09/09 09:24:00 Image size of job updated: 75
004 (009.000.000) 09/09 09:24:00 Job was evicted.
001 (009.000.000) 09/09 09:26:00 Job executing on host: <127.0.0.1:45887>
004 (009.000.000) 09/09 09:26:00 Job was evicted.
001 (009.000.000) 09/09 09:28:00 Job executing on host: <127.0.0.1:45887>
004 (009.000.000) 09/09 09:28:00 Job was evicted.
001 (009.000.000) 09/09 09:30:00 Job executing on host: <127.0.0.1:45887>
004 (009.000.000) 09/09 09:30:00 Job was evicted.
001 (009.000.000) 09/09 09:32:00 Job executing on host: <127.0.0.1:45887>
004 (009.000.000) 09/09 09:32:01 Job was evicted.

Note – the job appears to be evicted instead of terminated. What really happens is the job remains in the queue on termination. This is arguably a poor choice of wording in the log.

Just like for job deferral, there is no guarantee resources will be available at exactly the right time to run the job. cron_prep_time and cron_window provide a means to introduce tolerance.

Common question: What happens when a job takes longer than the time between defined starts, i.e. job takes 30 minutes to complete and is set to be run every 15 minutes?

Answer: The job will run serially. It will not stack up. The job does not need to serialize itself.

Note – a common complication, arguably a bug, which occurs only in pools with few or no new jobs being submitted, is that matchmaking must happen in time for the job dispatch. The Schedd does not publish a new Submitter Ad for the cron job’s owner when the job completes. This means that the Submitter Ad the Negotiator sees may have zero idle jobs, resulting in no new match being handed out to dispatch the job the next time it is set to execute.

Enjoy.

