Archive for December, 2012

Configuration and policy evaluation

December 10, 2012

Figuring out how evaluation happens in configuration and policy is a common problem. The confusion is justified.

Configuration provides substitution with $() syntax, while policy is full ClassAd language evaluation without $() syntax.

Configuration is all the parameters listed in files discoverable with condor_config_val -config.

$ condor_config_val -config
Configuration source:
	/etc/condor/condor_config
Local configuration sources:
	/etc/condor/config.d/00personal_condor.config

Policy is the ClassAd expression found on the right-hand side of specific configuration parameters. For instance,

$ condor_config_val -v START
START: ( (KeyboardIdle > 15 * 60) && ( ((LoadAvg - CondorLoadAvg) <= 0.3) || (State != "Unclaimed" && State != "Owner")) )
  Defined in '/etc/condor/condor_config', line 753.

Configuration evaluation allows for substitution of configuration parameters with $().

$ cat /etc/condor/condor_config | head -n753 | tail -n1
START			= $(UWCS_START)

$ condor_config_val -v UWCS_START
UWCS_START: ( (KeyboardIdle > 15 * 60) && ( ((LoadAvg - CondorLoadAvg) <= 0.3) || (State != "Unclaimed" && State != "Owner")) )
  Defined in '/etc/condor/condor_config', line 808.

$ cat /etc/condor/condor_config | head -n808 | tail -n3
UWCS_START	= ( (KeyboardIdle > $(StartIdleTime)) \
                    && ( $(CPUIdle) || \
                         (State != "Unclaimed" && State != "Owner")) )

Here START is actually the value of UWCS_START, provided by $(UWCS_START).

The substitution is recursive. Explore /etc/condor/condor_config and the JustCPU parameter. It is actually a parameter that is never read by daemons or tools. It is only useful in other configuration parameters. It’s shorthand.
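To make the recursion concrete, here is a minimal Python sketch of $() substitution. It is not how HTCondor implements it. The CONFIG table and the expand helper are hypothetical, and the values of StartIdleTime and CPUIdle are inferred from the expanded START shown above; in the real file they may be built from further macros.

```python
import re

# Toy configuration table. START and UWCS_START mirror the post; the values of
# StartIdleTime and CPUIdle are assumptions inferred from the expanded output.
CONFIG = {
    "START": "$(UWCS_START)",
    "UWCS_START": '( (KeyboardIdle > $(StartIdleTime)) && ( $(CPUIdle) || '
                  '(State != "Unclaimed" && State != "Owner")) )',
    "StartIdleTime": "15 * 60",
    "CPUIdle": "((LoadAvg - CondorLoadAvg) <= 0.3)",
}

# Matches $(NAME) where NAME is letters, digits, or underscores.
MACRO = re.compile(r"\$\((\w+)\)")


def expand(name, config, seen=()):
    """Return the value of `name` with every $(OTHER) replaced recursively."""
    if name in seen:
        # Guard for this sketch only: refuse self-referencing definitions.
        raise ValueError("circular reference: " + " -> ".join(seen + (name,)))
    value = config[name]  # an undefined name raises KeyError in this sketch
    return MACRO.sub(lambda m: expand(m.group(1), config, seen + (name,)), value)


print(expand("START", CONFIG))
```

Attribute names such as KeyboardIdle are left alone: they are not wrapped in $(), so configuration substitution never touches them. They are resolved later, during policy evaluation.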

Policy evaluation is full ClassAd expression evaluation. The evaluation happens at the appropriate times while daemons or tools are running.

Taking START as an example, the words KeyboardIdle, LoadAvg, CondorLoadAvg, and State are attributes found on machine ads. The condor_startd and condor_negotiator evaluate START to figure out if a job is allowed to start on a resource.

$ condor_status -l slot1@eeyore.local | grep -e ^KeyboardIdle -e ^LoadAvg -e ^CondorLoadAvg -e ^State
KeyboardIdle = 0
LoadAvg = 0.290000
CondorLoadAvg = 0.0
State = "Owner"

Evaluation happens by recursively evaluating those attributes. The expression ((KeyboardIdle > 15 * 60) && (((LoadAvg - CondorLoadAvg) <= 0.3) || (State != "Unclaimed" && State != "Owner"))) becomes ((0 > 15 * 60) && (((0.29 - 0.0) <= 0.3) || ("Owner" != "Unclaimed" && "Owner" != "Owner"))). And so forth. Since 0 > 15 * 60 is false, the && makes the whole expression false for this slot.
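The same walk-through can be sketched in Python, using the attribute values from the condor_status output above. This is only an illustration under simplifying assumptions: it uses plain Python booleans, so it ignores ClassAd details such as UNDEFINED and ERROR values, and the names machine_ad and start are hypothetical.

```python
# Attribute values copied from the condor_status output in the post.
machine_ad = {
    "KeyboardIdle": 0,
    "LoadAvg": 0.29,
    "CondorLoadAvg": 0.0,
    "State": "Owner",
}


def start(ad):
    """Evaluate the expanded START expression against one machine ad."""
    keyboard_idle = ad["KeyboardIdle"] > 15 * 60
    cpu_idle = (ad["LoadAvg"] - ad["CondorLoadAvg"]) <= 0.3
    already_running = ad["State"] != "Unclaimed" and ad["State"] != "Owner"
    return keyboard_idle and (cpu_idle or already_running)


print(start(machine_ad))                              # keyboard touched recently
print(start(dict(machine_ad, KeyboardIdle=16 * 60)))  # idle for 16 minutes
```

With KeyboardIdle at 0 the first clause fails, so the result is False regardless of load. Raising KeyboardIdle past 15 * 60 while the load stays under 0.3 flips it to True.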

That’s it.

Tail your logs, for fun and profit

December 3, 2012

If you don’t run tail -F on your logs periodically, you should. It’s illuminating. Try,

tail -F /var/log/condor/*Log | grep -i -e error -e fail -e warn

I ran that over the weekend and learned a few things –

0) ERROR WriteUserLog Failed to grab global event log lock means that the EVENT_LOG is lossy in unexpected ways. We know the EVENT_LOG rotates and if you’re watching it but miss a rotation you’ll miss events. However, when the above warning (not ERROR imho) is printed the event that was going to be written is dropped. So the EVENT_LOG could be lossy on the edges and in the middle.

1) GroupTracker (pid = 13252): fopen error: Failed to open file '/proc/13252/cgroup'. Error No such file or directory (2), coming from the ProcLog, means that a tracked process has disappeared. The exact implications are not clear, but the author, Brian Bockelman, suggests the message could be quieted as it doesn’t represent a functional problem. Maybe D_ALWAYS -> D_FULLDEBUG.

2) tail: `/var/log/condor/JobServerLog' has become inaccessible: No such file or directory showed up many times in a row. When the job_queue.log is compressed, effectively recreated, the condor_job_server enters a phase where it reconstructs its internal state in an apparently noisy fashion, and it can rotate its log file multiple times per second.

3) (1157197.152) (12639): attempt to connect to <10.10.10.10:52143> failed: Connection refused (connect errno = 111). and (1157197.152) (12639): Attempt to reconnect failed: Failed to connect to starter <10.10.10.10:52143> turned out to be an issue on 10.10.10.10, where all jobs from a user were failing to start because of passwd_cache::cache_uid(): getpwnam("matt") failed: user not found with ERROR: Uid for "matt" not found in passwd file and SOFT_UID_DOMAIN is False and ERROR: Failed to determine what user to run this job as, aborting. The host was effectively a black hole because of a misconfigured UID_DOMAIN.

4) (1157079.244) (1199): ERROR "Can no longer talk to condor_starter <10.10.10.11:52725> turned out to be an issue on 10.10.10.11, where all jobs were failing to start because of Create_Process: Cannot access specified executable "/tmp/mycondor/release_dir/sbin/condor_starter": errno = 2 (No such file or directory) with slot5: ERROR: exec_starter failed! and slot5: ERROR: exec_starter returned 0, which was more bad configuration.

5) FileLock::obtain(1) failed - errno 0 (Success) looks wrong.
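If you would rather count matches than watch them scroll by, the grep filter from the top of this post can be approximated in Python. This is a rough sketch: interesting and tally are hypothetical helper names, and SAMPLE holds messages quoted in the list above plus one non-matching line for contrast.

```python
import re
from collections import Counter

# Same three case-insensitive patterns as: grep -i -e error -e fail -e warn
PATTERN = re.compile(r"error|fail|warn", re.IGNORECASE)


def interesting(lines):
    """Yield only the lines the grep filter would keep."""
    for line in lines:
        if PATTERN.search(line):
            yield line


def tally(lines):
    """Count how often each keyword appears across the matching lines."""
    counts = Counter()
    for line in interesting(lines):
        for word in PATTERN.findall(line):
            counts[word.lower()] += 1
    return counts


# Messages quoted in this post, plus one line that should not match.
SAMPLE = [
    "ERROR WriteUserLog Failed to grab global event log lock",
    "FileLock::obtain(1) failed - errno 0 (Success)",
    "slot5: ERROR: exec_starter failed!",
    "KeyboardIdle = 0",
]

for line in interesting(SAMPLE):
    print(line)
print(tally(SAMPLE))
```

To run it against real logs, pass an open file or sys.stdin instead of SAMPLE. Following rotated files the way tail -F does is left out of this sketch.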

