Archive for June, 2011

Concurrency Limits: Protecting shared resources

June 27, 2011

Concurrency Limits, sometimes called resource limits, are Condor's way of giving administrators and users a tool to protect limited resources.

A popular resource to protect is a software license. Take, for example, jobs that run Matlab. Matlab uses flexlm, and users often have a limited number of licenses available, effectively limiting how many jobs they can run concurrently. Condor does not, and does not need to, integrate with flexlm here. Condor lets a user specify concurrency_limits = matlab with their job, and lets administrators add MATLAB_LIMIT = 64 to the configuration.
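
As a sketch of that Matlab case, the two pieces might look like this (the config file name, the cap of 64, and the wrapper script path are just illustrations):

# negotiator configuration, e.g. in /etc/condor/config.d/40licenses.config
MATLAB_LIMIT = 64

# job submit description
cmd = /usr/local/bin/run_matlab.sh
concurrency_limits = matlab
queue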

Other uses include limiting the number of jobs connecting to a network filesystem filer, limiting the number of jobs a user can run at once, limiting the number of jobs running from a single submission, and really anything else that can be managed at the global pool level. I have also heard of people using them to limit database connections and to implement a global pool load share.

The global aspect of these resources is important. Concurrency limits are not local to nodes, so they are not a fit for per-node resources such as GPUs. Limits are managed by the Negotiator. They work because jobs carry a list of the limits they need and slot advertisements carry a list of the limits currently in use. During the negotiation cycle, the negotiator can sum up the active limits and compare the total against the configured maximum and what a job is requesting.

Also, limits are not considered in preemption decisions. Changes to the limits on a running job, via condor_qedit, will not take effect until the job stops. This means a job cannot give up a limit it no longer needs when it exits a certain phase of execution – consider DAGs here. And lowering a limit via configuration will not result in job preemption.
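
For instance, something like this would change the job ad, but a running job keeps its original limits until it stops (the job id and the new value are placeholders):

$ condor_qedit 1234.0 ConcurrencyLimits '"awake"'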

By example,

First, the configuration needs to be in place on the Negotiator, e.g.

$ condor_config_val -dump | grep LIMIT
CONCURRENCY_LIMIT_DEFAULT = 3
SLEEP_LIMIT = 1
AWAKE_LIMIT = 2

This says there can be a maximum of 1 job using the SLEEP resource and 2 jobs using the AWAKE resource at a time, with any limit not named explicitly capped at 3. These maximums apply across all users and all accounting groups.

$ cat > limits.sub
cmd = /bin/sleep
args = 1d
concurrency_limits = sleep
queue
^D

$ condor_submit -a 'queue 4' limits.sub
Submitting job(s)....
4 job(s) submitted to cluster 41.

$ condor_q
-- Submitter: eeyore.local :  : eeyore.local
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
  41.0   matt            7/04 12:21   0+00:55:55 R  0   4.2  sleep 1d
  41.1   matt            7/04 12:21   0+00:00:00 I  0   0.0  sleep 1d
  41.2   matt            7/04 12:21   0+00:00:00 I  0   0.0  sleep 1d
  41.3   matt            7/04 12:21   0+00:00:00 I  0   0.0  sleep 1d
4 jobs; 3 idle, 1 running, 0 held

(A) $ condor_q -format "%s " GlobalJobId -format "%s " ConcurrencyLimits -format "%s" LastRejMatchReason -format "\n" None
eeyore.local#41.0 sleep
eeyore.local#41.1 sleep concurrency limit reached
eeyore.local#41.2 sleep
eeyore.local#41.3 sleep

(B) $ condor_status -format "%s " Name -format "%s " GlobalJobId -format "%s" ConcurrencyLimits -format "\n" None
slot1@eeyore.local eeyore.local#41.0 sleep
slot2@eeyore.local
slot3@eeyore.local
slot4@eeyore.local

(A) shows that each job wants to use the sleep limit. It also shows that job 41.1 did not match because the concurrency limit was reached. (B) shows that only 41.0 got to run, on slot1. Notice that the limit is present in the slot's ad.

The Negotiator can also be asked about active limits directly,

$ condor_userprio -l | grep ConcurrencyLimit
ConcurrencyLimit_sleep = 1.000000

That’s well and good, but there are three more things to know about: 0) the default maximum, 1) multiple limits, 2) duplicate limits.

First, the default maximum, CONCURRENCY_LIMIT_DEFAULT, applies to any limit that is not explicitly named in the configuration, as SLEEP was.

$ condor_submit -a 'concurrency_limits = biff' -a 'queue 4' limits.sub
Submitting job(s)....
4 job(s) submitted to cluster 42.

$ condor_rm 41
Cluster 41 has been marked for removal.

$ condor_q
-- Submitter: eeyore.local :  : eeyore.local
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
  42.0   matt            7/04 12:34   0+00:00:22 R  0   0.0  sleep 1d
  42.1   matt            7/04 12:34   0+00:00:22 R  0   0.0  sleep 1d
  42.2   matt            7/04 12:34   0+00:00:22 R  0   0.0  sleep 1d
  42.3   matt            7/04 12:34   0+00:00:00 I  0   0.0  sleep 1d
4 jobs; 1 idle, 3 running, 0 held

$ condor_q -format "%s " GlobalJobId -format "%s " ConcurrencyLimits -format "%s" LastRejMatchReason -format "\n" None
eeyore.local#42.0 biff
eeyore.local#42.1 biff
eeyore.local#42.2 biff
eeyore.local#42.3 biff concurrency limit reached

Second, a job can require multiple limits at the same time. The job needs to consume each limit to run, and the most restrictive limit dictates whether the job runs.

$ condor_rm -a
All jobs marked for removal.

$ condor_submit -a 'concurrency_limits = sleep,awake' -a 'queue 4' limits.sub
Submitting job(s)....
4 job(s) submitted to cluster 43.

$ condor_q
-- Submitter: eeyore.local :  : eeyore.local
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
  43.0   matt            7/04 13:07   0+00:00:13 R  0   0.0  sleep 1d
  43.1   matt            7/04 13:07   0+00:00:00 I  0   0.0  sleep 1d
  43.2   matt            7/04 13:07   0+00:00:00 I  0   0.0  sleep 1d
  43.3   matt            7/04 13:07   0+00:00:00 I  0   0.0  sleep 1d
4 jobs; 3 idle, 1 running, 0 held

$ condor_q -format "%s " GlobalJobId -format "%s " ConcurrencyLimits -format "%s" LastRejMatchReason -format "\n" None
eeyore.local#43.0 awake,sleep
eeyore.local#43.1 awake,sleep concurrency limit reached
eeyore.local#43.2 awake,sleep
eeyore.local#43.3 awake,sleep

$ condor_status -format "%s " Name -format "%s " GlobalJobId -format "%s" ConcurrencyLimits -format "\n" None
slot1@eeyore.local eeyore.local#43.0 awake,sleep
slot2@eeyore.local
slot3@eeyore.local
slot4@eeyore.local

Only one job gets to run because, even though two units of AWAKE are available, there is only one unit of SLEEP.

Finally, a job can require more than one of the same limit. In fact, the requirement can be fractional.

$ condor_rm -a
All jobs marked for removal.

$ condor_submit -a 'concurrency_limits = sleep:2.0' -a 'queue 4' limits.sub
Submitting job(s)....
4 job(s) submitted to cluster 44.

$ condor_submit -a 'concurrency_limits = awake:2.0' -a 'queue 4' limits.sub
Submitting job(s)....
4 job(s) submitted to cluster 45.

$ condor_q
-- Submitter: eeyore.local :  : eeyore.local
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
  44.0   matt            7/04 13:11   0+00:00:00 I  0   0.0  sleep 1d
  44.1   matt            7/04 13:11   0+00:00:00 I  0   0.0  sleep 1d
  44.2   matt            7/04 13:11   0+00:00:00 I  0   0.0  sleep 1d
  44.3   matt            7/04 13:11   0+00:00:00 I  0   0.0  sleep 1d
  45.0   matt            7/04 13:13   0+00:00:24 R  0   0.0  sleep 1d
  45.1   matt            7/04 13:13   0+00:00:00 I  0   0.0  sleep 1d
  45.2   matt            7/04 13:13   0+00:00:00 I  0   0.0  sleep 1d
  45.3   matt            7/04 13:13   0+00:00:00 I  0   0.0  sleep 1d
8 jobs; 7 idle, 1 running, 0 held


$ condor_q -format "%s " GlobalJobId -format "%s " ConcurrencyLimits -format "%s" LastRejMatchReason -format "\n" None
eeyore.local#44.0 sleep:2.0 concurrency limit reached
eeyore.local#44.1 sleep:2.0
eeyore.local#44.2 sleep:2.0
eeyore.local#44.3 sleep:2.0
eeyore.local#45.0 awake:2.0
eeyore.local#45.1 awake:2.0 concurrency limit reached
eeyore.local#45.2 awake:2.0
eeyore.local#45.3 awake:2.0

$ condor_userprio -l | grep Limit
ConcurrencyLimit_awake = 2.000000
ConcurrencyLimit_sleep = 0.0

Here none of the jobs in cluster 44 will run; they each need more SLEEP than is available. Also, only one of the jobs in cluster 45 can run at a time, because each one uses up all of the AWAKE limit while it runs.
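
Fractional requests below one work the same way. As an untested sketch, submitting with a half-unit request should let two such jobs share the single SLEEP unit at once, since each consumes only half of it:

$ condor_submit -a 'concurrency_limits = sleep:0.5' -a 'queue 4' limits.sub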

Getting Started: Multiple node Condor pool with firewalls

June 21, 2011

Creating a Condor pool with no firewalls up is quite a simple task. Before the condor_shared_port daemon, doing the same with firewalls was a bit painful.

Condor uses dynamic ports for everything except the Collector. The Collector endpoint is the bootstrap. This means a Schedd might start up on a random ephemeral port, and each of its shadows might as well. This causes headaches for firewalls, as large ranges of ports need to be opened for communication. There are ways to control the ephemeral range used. Unfortunately, doing so just reduces the port range somewhat, does not guarantee that only Condor is on those ports, and can limit scale.
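
For reference, the old approach of pinning the ephemeral range looked something like this (the range shown is only an illustration):

# restrict the port range Condor's daemons may bind to
LOWPORT = 9600
HIGHPORT = 9700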

The condor_shared_port daemon allows Condor to use a single inbound port on a machine.

Again, using Fedora 15. I had no luck with firewalld and firewall-cmd. Instead I fell back to using straight iptables.

The first thing to do is pick a port for Condor to use on your machines. The simplest thing to do is pick 9618, the port typically known as the Collector’s port.

On all machines where Condor is going to run, you want to –

# lokkit --enabled

# service iptables start
Starting iptables (via systemctl):  [  OK  ]

# service iptables status
Table: filter
Chain INPUT (policy ACCEPT)
num  target     prot opt source               destination
1    ACCEPT     all  --  0.0.0.0/0            0.0.0.0/0           state RELATED,ESTABLISHED
2    ACCEPT     icmp --  0.0.0.0/0            0.0.0.0/0
3    ACCEPT     all  --  0.0.0.0/0            0.0.0.0/0
4    REJECT     all  --  0.0.0.0/0            0.0.0.0/0 reject-with icmp-host-prohibited

Chain FORWARD (policy ACCEPT)
num  target     prot opt source               destination
1    REJECT     all  --  0.0.0.0/0            0.0.0.0/0 reject-with icmp-host-prohibited

Chain OUTPUT (policy ACCEPT)
num  target     prot opt source               destination

If you want to ssh to the machine again, be sure to insert rules above the final "REJECT all" rule –

# iptables -I INPUT 4 -p tcp -m tcp --dport 22 -j ACCEPT

And open a port, both TCP and UDP, for the shared port daemon –

# iptables -I INPUT 5 -p tcp -m tcp --dport condor -j ACCEPT
# iptables -I INPUT 6 -p udp -m udp --dport condor -j ACCEPT
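
To keep those rules across a reboot, you can dump them into the file the iptables service reads at startup (a sketch; note that lokkit may rewrite this file if you run it again):

# iptables-save > /etc/sysconfig/iptables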

Next you want to configure Condor to use the shared port daemon, with port 9618 –

# cat > /etc/condor/config.d/41shared_port.config
SHARED_PORT_ARGS = -p 9618
DAEMON_LIST = $(DAEMON_LIST), SHARED_PORT
COLLECTOR_HOST = $(CONDOR_HOST)?sock=collector
USE_SHARED_PORT = TRUE
^D

In order, SHARED_PORT_ARGS tells the shared port daemon to listen on port 9618, DAEMON_LIST tells the master to start the shared port daemon, COLLECTOR_HOST specifies that the collector will be on the sock named “collector”, and finally USE_SHARED_PORT tells all daemons to register and use the shared port daemon.

After you put that configuration on all your systems, run service condor restart, and go.

You will have the shared port daemon listening on 9618 (condor), and all communication between machines will be routed through it.

# lsof -i | grep $(pidof condor_shared_port)
condor_sh 31040  condor    8u  IPv4  74105      0t0  TCP *:condor (LISTEN)
condor_sh 31040  condor    9u  IPv4  74106      0t0  UDP *:condor

That’s right, you have a condor pool with firewalls and a single port opened for communication on each node.

Customizing Condor configuration: LOCAL_CONFIG_FILE vs LOCAL_CONFIG_DIR

June 16, 2011

Condor has a powerful configuration system. The language itself is expressive, and so are the mechanisms for extending the default configuration.

All Condor processes, daemons/auxiliary programs/command-line tools, read configuration files in the same way. They start with what is commonly called the “global configuration file.” It is not so much global as it is a place for Condor distributors to put configuration that should be common to all installations, not to be directly edited by users. It is a place where distributors can safely change configuration between versions without having to worry about merge conflicts, and users do not have to worry about reapplying their changes.

The global configuration file is one of the following, in order:

0) Filename specified in a CONDOR_CONFIG environment variable
1) /etc/condor/condor_config
2) /usr/local/etc/condor_config
3) ~condor/condor_config

For those who care, src/condor_utils/condor_config.cpp defines the order. Fedora uses /etc/condor/condor_config, allowing CONDOR_CONFIG to override.

The most important aspect of the global config file is how it enables users to extend configuration.

Historically, extension was done via LOCAL_CONFIG_FILE. It provided a single location where a user or administrator could add configuration. It has been around for almost 15 years (since ~1997) and still works well for some use cases, such as host config files managed in a shared filesystem. However, it has a huge drawback: it requires coordinated editing. The coordination extends to features that are packaged and installed on top of Condor. Using it as a StringList does not alleviate the coordination; it just pushes it into the global config file.
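
For the shared filesystem use case, that might look something like this (the paths are only illustrations):

# per-host configuration kept on a shared filesystem
LOCAL_CONFIG_FILE = /nfs/condor/etc/common.config, /nfs/condor/etc/hosts/$(HOSTNAME).config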

Enter LOCAL_CONFIG_DIR, in March 2006. It provides the common configuration-directory mechanism found in other systems software, such as /etc/ld.so.conf.d and /etc/yum.repos.d. It allows administrators and packages to play in a single sandbox while staying properly isolated from one another.

The way to extend Condor configuration in Fedora or Red Hat Enterprise Linux is via /etc/condor/config.d, set from LOCAL_CONFIG_DIR=/etc/condor/config.d in /etc/condor/condor_config.

But wait, you're right: some coordination is still necessary when parameters overlap between files. That is much less coordination, though.

The way Condor’s configuration language works means that files read later during configuration can override parameters set in earlier files. For instance,

$ ls -l /etc/condor/config.d
total 16
-rw-r--r--. 1 root root  720 May 31 11:42 00personal_condor.config
-rw-r--r--. 1 root root 1434 May 31 11:39 61aviary.config

$ grep DAEMON_LIST /etc/condor/config.d/00personal_condor.config
DAEMON_LIST = COLLECTOR, MASTER, NEGOTIATOR, SCHEDD, STARTD

$ condor_config_val -v DAEMON_LIST
DAEMON_LIST: COLLECTOR, MASTER, NEGOTIATOR, SCHEDD, STARTD, QUERY_SERVER
  Defined in '/etc/condor/config.d/61aviary.config', line 20.

This can be handled in a few simple ways,

0) Append to parameters whenever possible, for instance above –

$ grep DAEMON_LIST /etc/condor/config.d/61aviary.config
DAEMON_LIST = $(DAEMON_LIST), QUERY_SERVER

1) Separate user-managed files from package-managed files –

Prefix all files with two-digit numbers with the following ranges:

. 00 – reserved for a default config, e.g. 00personal_condor.config
. 10-40 – user/admin configuration files
. 50-80 – packaged configuration files
. 99 – reserved for features requiring control of configuration

Finally, if you still need to use LOCAL_CONFIG_FILE, you can always set it within a configuration file under /etc/condor/config.d.
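
That can be as small as a one line file, for example (the file name and path here are only illustrations):

# /etc/condor/config.d/99local_config_file.config
LOCAL_CONFIG_FILE = /etc/condor/condor_config.local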

Getting started: Creating a multiple node Condor pool

June 12, 2011

Assuming you have read how to set up a personal condor, the next step is to add more machines.

This time from Fedora 15, with iptables -L indicating the firewall is off.

[root@node0 ~]# rpm -q condor
condor-7.7.0-0.4.fc15.x86_64

[root@node0 ~]# condor_version
$CondorVersion: 7.7.0 Jun 08 2011 PRE-RELEASE-UWCS $
$CondorPlatform: X86_64-Fedora_15 $

First off, what’s actually running?

[root@node0 ~]# service condor start
Starting condor (via systemctl):  [  OK  ]

[root@node0 ~]# pstree | grep condor
        |-condor_master-+-condor_collecto
        |               |-condor_negotiat
        |               |-condor_schedd---condor_procd
        |               `-condor_startd

You have a condor_master, which spawns all the other condor daemons and monitors them, restarting them if necessary.

Since this pool is entirely contained on one machine, all the components needed for a functional pool are present. The condor_collector is there; it's the rendezvous point. The condor_schedd is there; it holds and manages the jobs. The condor_startd is there; it represents resources in the pool. The condor_negotiator is there; it hands out matches between jobs and resources.

Run condor_q to see all the jobs. It queries the condor_schedd for the list.

[root@node0 ~]# condor_q
-- Submitter: node0.local :  : node0.local
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
0 jobs; 0 idle, 0 running, 0 held

Nothing there yet.

Run condor_status to see all the resources. It queries the condor_collector for the list.

[root@node0 ~]# condor_status
Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime
slot1@node0.local  LINUX      X86_64 Unclaimed Idle     0.000   497  0+00:14:45
slot2@node0.local  LINUX      X86_64 Unclaimed Idle     0.000   497  0+00:15:06
                     Machines Owner Claimed Unclaimed Matched Preempting
        X86_64/LINUX        2     0       0         2       0          0
               Total        2     0       0         2       0          0

The machine has two cores.

Adding more nodes to the pool means running a condor_startd on more machines and telling them to report to the collector. The collector is the entry point into the pool. All components check in with it, and it runs on a well-known port: 9618 (try: grep condor /etc/services).

After installing condor on your second machine, you need to change the configuration. You want to run just the condor_startd, and you want the node to check in with the collector on node0.local. To do this, take advantage of condor's config.d, found at /etc/condor/config.d.

The three parameters to change are CONDOR_HOST, DAEMON_LIST and ALLOW_WRITE. Configuration files are concatenated together, with the last definition of a parameter being authoritative. ALLOW_WRITE is to make sure the condor_schedd on node0 can run jobs on node1. So,

[root@node1 ~]# cat > /etc/condor/config.d/40root.config
CONDOR_HOST = node0.local
DAEMON_LIST = MASTER, STARTD
ALLOW_WRITE = $(ALLOW_WRITE), $(CONDOR_HOST)
^D

Once started, only the condor_master and condor_startd will be running.

[root@node1 ~]# service condor start
Starting condor (via systemctl):  [  OK  ]

[root@node1 ~]# pstree | grep condor
        |-condor_master---condor_startd

If everything worked out, condor_status will now show both machines.

[root@node1 ~]# condor_status
Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime
slot1@node0.local  LINUX      X86_64 Unclaimed Idle     0.000   497  0+00:34:45
slot2@node0.local  LINUX      X86_64 Unclaimed Idle     0.000   497  0+00:35:06
                     Machines Owner Claimed Unclaimed Matched Preempting
        X86_64/LINUX        2     0       0         2       0          0
               Total        2     0       0         2       0          0

No dice. We’ll do this the slow way.

Since we're asking the collector for information, the best place to look for what's going wrong is the collector's log file. It is on node0.local in /var/log/condor/CollectorLog. Note: condor_config_val COLLECTOR_LOG will tell you where to look, too.

[root@node0 ~]# tail -n5 $(condor_config_val COLLECTOR_LOG)
06/11/11 23:00:45 PERMISSION DENIED to unauthenticated@unmapped from host 10.0.0.1 for command 0 (UPDATE_STARTD_AD), access level ADVERTISE_STARTD: reason: ADVERTISE_STARTD authorization policy contains no matching ALLOW entry for this request; identifiers used for this host: 10.0.0.1,node1.local
06/11/11 23:00:45 PERMISSION DENIED to unauthenticated@unmapped from host 10.0.0.1 for command 0 (UPDATE_STARTD_AD), access level ADVERTISE_STARTD: reason: cached result for ADVERTISE_STARTD; see first case for the full reason
06/11/11 23:00:46 PERMISSION DENIED to unauthenticated@unmapped from host 10.0.0.1 for command 2 (UPDATE_MASTER_AD), access level ADVERTISE_MASTER: reason: ADVERTISE_MASTER authorization policy contains no matching ALLOW entry for this request; identifiers used for this host: 10.0.0.1,node1.local
06/11/11 23:00:51 Got QUERY_STARTD_ADS
06/11/11 23:00:51 (Sending 2 ads in response to query)

This is telling us the security configuration on node0 is preventing node1 from checking in. Specifically, the ALLOW_ADVERTISE_STARTD configuration. The security configuration mechanisms in condor are very flexible and powerful. I suggest you read about them sometime. For now, we’ll grant node1 access to ADVERTISE_STARTD as well as any other operations that require write permissions by setting ALLOW_WRITE.
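
A more surgical alternative would be to grant only the advertise levels node1 actually needs, along these lines (a sketch, not what we do below):

# grant node1 just the advertise permissions from the log above
ALLOW_ADVERTISE_STARTD = $(ALLOW_WRITE), node1.local
ALLOW_ADVERTISE_MASTER = $(ALLOW_WRITE), node1.local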

[root@node0 ~]# cat > /etc/condor/config.d/40root.config
ALLOW_WRITE = $(ALLOW_WRITE), 10.0.0.1,node1.local
^D

[root@node0 ~]# condor_reconfig
Sent "Reconfig" command to local master

[root@node0 ~]# condor_status
Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime
slot1@node0.local  LINUX      X86_64 Unclaimed Idle     0.040   497  0+00:50:46
slot2@node0.local  LINUX      X86_64 Unclaimed Idle     0.000   497  0+00:51:07
                     Machines Owner Claimed Unclaimed Matched Preempting
        X86_64/LINUX        2     0       0         2       0          0
               Total        2     0       0         2       0          0

You should note that host-based authentication is being used here. Any connection from 10.0.0.1 or node1.local is given write access. For simplicity, the ALLOW_WRITE parameter is set so that it appends to any existing value. And node1 still is not showing up in the status listing.

The reason node1 is not present is that it only attempts to check in every 5 minutes (condor_config_val UPDATE_INTERVAL, in seconds). A quick reconfig on node1 will speed the check-in along.
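
If you would rather not poke nodes by hand, the update interval itself can be lowered in configuration (the default is 300 seconds; 60 here is just an illustration):

# on node1, e.g. appended to /etc/condor/config.d/40root.config
UPDATE_INTERVAL = 60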

[root@node1 ~]# condor_reconfig
Sent "Reconfig" command to local master

[root@node1 ~]# condor_status
Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime
slot1@node0.local  LINUX      X86_64 Unclaimed Idle     0.040   497  0+00:50:46
slot2@node0.local  LINUX      X86_64 Unclaimed Idle     0.000   497  0+00:51:07
slot1@node1.local  LINUX      X86_64 Unclaimed Idle     0.000  1003  0+00:09:04
slot2@node1.local  LINUX      X86_64 Unclaimed Idle     0.000  1003  0+00:09:25
                     Machines Owner Claimed Unclaimed Matched Preempting
        X86_64/LINUX        4     0       0         4       0          0
               Total        4     0       0         4       0          0

Now that the resources are all visible, you can run a few jobs.

[matt@node0 ~]$ cat > job.sub
cmd = /bin/sleep
args = 1d
should_transfer_files = if_needed
when_to_transfer_output = on_exit
queue 8
^D

[matt@node0 ~]$ condor_submit job.sub 
Submitting job(s)........
8 job(s) submitted to cluster 1.

[matt@node0 ~]$ condor_q
-- Submitter: node0.local :  : node0.local
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
   1.0   matt            6/11 23:23   0+00:00:07 R  0   0.0  sleep 1d          
   1.1   matt            6/11 23:23   0+00:00:06 R  0   0.0  sleep 1d          
   1.2   matt            6/11 23:23   0+00:00:06 R  0   0.0  sleep 1d          
   1.3   matt            6/11 23:23   0+00:00:06 R  0   0.0  sleep 1d          
   1.4   matt            6/11 23:23   0+00:00:00 I  0   0.0  sleep 1d          
   1.5   matt            6/11 23:23   0+00:00:00 I  0   0.0  sleep 1d          
   1.6   matt            6/11 23:23   0+00:00:00 I  0   0.0  sleep 1d          
   1.7   matt            6/11 23:23   0+00:00:00 I  0   0.0  sleep 1d          
8 jobs; 4 idle, 4 running, 0 held

[matt@node0 ~]$ condor_q -run
-- Submitter: node0.local :  : node0.local
 ID      OWNER            SUBMITTED     RUN_TIME HOST(S)         
   1.0   matt            6/11 23:23   0+00:00:09 slot1@node0.local
   1.1   matt            6/11 23:23   0+00:00:08 slot2@node0.local
   1.2   matt            6/11 23:23   0+00:00:08 slot1@node1.local
   1.3   matt            6/11 23:23   0+00:00:08 slot2@node1.local

We've installed condor on two nodes, configured one (node0) to be what is sometimes called a head node, configured the other (node1) to report to node0, and run a few jobs.

Other topics to explore: enabling the firewall and using the shared_port daemon, sharing UIDs and a filesystem between nodes, and finer-grained security.

Notes:
0) By default, ALLOW_WRITE should include $(CONDOR_HOST)
1) ShouldTransferFiles should default to IF_NEEDED
2) WhenToTransferOutput should default to ON_EXIT

Edit – changed 50root.conf to 40root.config

Detecting clock skew with Condor

June 8, 2011

Anyone who works with distributed systems knows the value of maintaining clock synchronization. Within reason, of course. If you do not work with such systems, imagine debugging something by reading a trace log. The messages will be in an order you can reason about: chronological order. Now imagine if the order were jumbled, or worse, you did not know the order was jumbled. In a distributed system, you often have to correlate logs between systems. If the clocks are skewed, it is difficult to reconstruct a timeline.

This is not a new trick, but it is one I just found useful again and thought worth sharing:

$ condor_status -master -constraint '((MyCurrentTime-LastHeardFrom) > 60) || ((MyCurrentTime-LastHeardFrom) < -60)' \
   -format "%s\t" Name \
   -format "%d\n" '(MyCurrentTime - LastHeardFrom)'
eeyore.local	75

It turns out that system's clock is about 75 seconds in the future, and it did not have ntpd installed.
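
On Fedora, fixing that is only a couple of commands (from memory, so treat this as a sketch):

# yum install -y ntp
# chkconfig ntpd on
# service ntpd start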

See also GT1671.

