If you don’t run tail -F on your logs periodically, you should. It’s illuminating. Try:
tail -F /var/log/condor/*Log | grep -i -e error -e fail -e warn
I ran that over the weekend and learned a few things:
ERROR WriteUserLog Failed to grab global event log lock means that the EVENT_LOG is lossy in unexpected ways. We know the EVENT_LOG rotates, and if you’re watching it but miss a rotation you’ll miss events. However, when the above message (a warning, not an ERROR, imho) is printed, the event that was going to be written is dropped. So the EVENT_LOG can be lossy on the edges and in the middle.
GroupTracker (pid = 13252): fopen error: Failed to open file '/proc/13252/cgroup'. Error No such file or directory (2), coming from the ProcLog, means that a tracked process has disappeared. The exact implications are not clear, but the author, Brian Bockelman, suggests the message could be quieted as it doesn’t represent a functional problem. Maybe D_ALWAYS -> D_FULLDEBUG.
tail: `/var/log/condor/JobServerLog' has become inaccessible: No such file or directory appeared many times in a row. When the job_queue.log is compressed, effectively recreated, the condor_job_server enters a phase where it reconstructs its internal state, apparently noisily, and can rotate its log file multiple times per second.
(1157197.152) (12639): attempt to connect to <10.10.10.10:52143> failed: Connection refused (connect errno = 111). and
(1157197.152) (12639): Attempt to reconnect failed: Failed to connect to starter <10.10.10.10:52143> turned out to be an issue on 10.10.10.10, where all jobs from a user were failing to start because of
passwd_cache::cache_uid(): getpwnam("matt") failed: user not found with
ERROR: Uid for "matt" not found in passwd file and SOFT_UID_DOMAIN is False and
ERROR: Failed to determine what user to run this job as, aborting. The host was effectively a black hole because of a misconfigured UID_DOMAIN.
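A quick way to confirm this kind of black hole on a suspect host is to ask the same question the starter asks: does the job owner resolve to a local account? The sketch below is only an illustration, not something shipped with HTCondor. `user_known` is a hypothetical helper name, and it assumes `id` consults the same passwd sources that getpwnam() does. The commented-out `condor_config_val` lines show how the two settings named in the errors could be inspected.

```shell
# Hypothetical helper (not part of HTCondor): succeed only if the
# named user resolves to an account on this host.
user_known() {
    id -u "$1" > /dev/null 2>&1
}

# Example check for the job owner named in the log messages above.
if user_known matt; then
    echo "matt resolves locally"
else
    echo "matt does not resolve locally; jobs submitted as matt may fail to start here"
fi

# On a host with HTCondor installed, the relevant settings can be inspected with:
# condor_config_val UID_DOMAIN
# condor_config_val SOFT_UID_DOMAIN
```

Running a check like this across the pool before trusting a new host could catch the problem before jobs start disappearing into it.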
(1157079.244) (1199): ERROR "Can no longer talk to condor_starter <10.10.10.11:52725> turned out to be an issue on 10.10.10.11, where all jobs were failing to start because of
Create_Process: Cannot access specified executable "/tmp/mycondor/release_dir/sbin/condor_starter": errno = 2 (No such file or directory) with
slot5: ERROR: exec_starter failed! and
slot5: ERROR: exec_starter returned 0, which was more bad configuration.
FileLock::obtain(1) failed - errno 0 (Success) looks wrong.
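The filter from the top of the post can be wrapped in a small function so that its output can be piped onward, for example into a file. This is a minimal sketch: `condor_log_filter` is a hypothetical name, the `/tmp/condor-problems.log` path is made up for illustration, and `--line-buffered` is assumed to be available (it is in GNU grep). That option keeps matches flowing promptly when grep is not writing straight to a terminal.

```shell
# Hypothetical helper: pass through only lines that mention errors,
# failures, or warnings, case-insensitively, flushing after each line.
condor_log_filter() {
    grep -i --line-buffered -e error -e fail -e warn
}

# Usage, assuming the log directory used earlier in the post
# (the output path is an illustrative placeholder):
# tail -F /var/log/condor/*Log | condor_log_filter | tee -a /tmp/condor-problems.log
```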