Posts Tagged ‘Tip’

Tail your logs, for fun and profit

December 3, 2012

If you don’t run tail -F on your logs periodically, you should. It’s illuminating. Try,

tail -F /var/log/condor/*Log | grep -i -e error -e fail -e warn

I ran that over the weekend and learned a few things –

0) ERROR WriteUserLog Failed to grab global event log lock means that the EVENT_LOG is lossy in unexpected ways. We know the EVENT_LOG rotates and if you’re watching it but miss a rotation you’ll miss events. However, when the above warning (not ERROR imho) is printed the event that was going to be written is dropped. So the EVENT_LOG could be lossy on the edges and in the middle.

1) GroupTracker (pid = 13252): fopen error: Failed to open file '/proc/13252/cgroup'. Error No such file or directory (2), coming from the ProcLog, means that a tracked process has disappeared. The exact implications are not clear, but the author, Brian Bockelman, suggest the message could be quieted as it doesn’t represent a functional problem. Maybe D_ALWAYS -> D_FULLDEBUG.

2) tail: `/var/log/condor/JobServerLog' has become inaccessible: No such file or directory many times in a row. When the job_queue.log is compressed, effectively recreated, the condor_job_server enters a phase where it reconstructs its internal state, in an apparently noisy fashion and can rotate its log file multiple times per second.

3) (1157197.152) (12639): attempt to connect to <> failed: Connection refused (connect errno = 111). and (1157197.152) (12639): Attempt to reconnect failed: Failed to connect to starter &tl;> turned out to be an issue on, where all jobs from a user were failing to start because of passwd_cache::cache_uid(): getpwnam("matt") failed: user not found with ERROR: Uid for "matt" not found in passwd file and SOFT_UID_DOMAIN is False and ERROR: Failed to determine what user to run this job as, aborting. The host was effectively a black hole because of a misconfigured UID_DOMAIN.

4) (1157079.244) (1199): ERROR "Can no longer talk to condor_starter <> turned out to be an issue on, where all jobs were failing to start because of Create_Process: Cannot access specified executable "/tmp/mycondor/release_dir/sbin/condor_starter": errno = 2 (No such file or directory) with slot5: ERROR: exec_starter failed! and slot5: ERROR: exec_starter returned 0, which was more bad configuration.

5) FileLock::obtain(1) failed - errno 0 (Success) looks wrong.


Tip: ISO8601 dates in your logs

October 22, 2012

Condor produces internal data in both structured and unstructured forms.

The structured forms are just that and designed to be processed by external programs. These are the event logs (UserLog or EVENT_LOG), the HISTORY file and PER_JOB_HISTORY_DIR and POOL_HISTORY_DIR, and the job_queue.log and Accountantnew.log transaction logs.

The unstructured forms are for debugging and designed to be read by a person, often an experienced person. They are often called trace, or debug, logs and are the files in the LOG directory, or the extra output seen when passing -debug to command-line tools, i.e. condor_q -debug.

Consuming and processing the unstructured forms with external programs is increasingly important. Consider tracing incidents through a deployment of 50,000 geographically distributed, physical and virtual systems. Or, even 100 local systems.

More and more tools that provide the ability to aggregate unstructured logs are emerging and they all need to do some basic parsing of the logs. Help make their integration simpler and use a well defined format for timestamps.

For instance, ISO8601 –

DEBUG_TIME_FORMAT = "%Y-%m-%dT%H:%M:%S%z "

Tip: notification = never

October 8, 2012

By default, the condor_schedd will notify you, via email, when your job completes. This is a handy feature when running a few jobs, but can become overwhelming if you are running many jobs.

It can even turn into a problem if you are being notified at a mailbox you do not monitor.

# df /var
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/...             233747128 215868920   5813032  98% /

# du -a /var | sort -n -r | head -n 4
150436072       /var
111752396       /var/spool
111706452       /var/spool/mail
108702404       /var/spool/mail/matt

Yes, that’s ~105GB of job completion notification emails. All ignored. Oops.

The email notification feature is controlled on a per job basis by the notification command in a job’s submit file. See man condor_submit. To not get email notification, set it to NEVER, e.g.

$ echo queue | condor_submit -a cmd=/bin/hostname -a notification=never

If you are a pool administrator and want to change the default from COMPLETE to NEVER use the SUBMIT_EXPRS configuration parameter.

Notification = NEVER

Users will still be able to override the configured default by putting notification = complete|always|error in their submit files.

Keep those disks clean.

%d bloggers like this: