Archive for July, 2010

How Condor determines a node’s IP and hostname

July 28, 2010

The most direct way to find out what is happening is to use,

$ env _CONDOR_TOOL_DEBUG=D_HOSTNAME condor_config_val -debug FULL_HOSTNAME

(If you are running a version before 7.4.4 or 7.5.4, the releases where -debug was fixed for condor_config_val (#1541), use condor_status -debug -total instead.)

You will get some output similar to,

$ _CONDOR_ALL_DEBUG=D_HOSTNAME condor_config_val -debug -dump | grep -e ^HOSTNAME -e ^FULL_HOSTNAME -e ^IP_ADDRESS
07/28 13:48:44 Finding local host information, calling gethostname()
07/28 13:48:44 gethostname() returned a host with no domain "eeyore"
07/28 13:48:44 Trying to find full hostname and IP addr for "eeyore"
07/28 13:48:44 Calling gethostbyname(eeyore)
07/28 13:48:44 Found IP addr in hostent: 127.0.0.1
07/28 13:48:44 Trying to find full hostname from hostent
07/28 13:48:44 Main name in hostent "eeyore" contains no '.', checking aliases
07/28 13:48:44 Checking alias "localhost.localdomain"
07/28 13:48:44 Alias "localhost.localdomain" is fully qualified
07/28 13:48:44 Trying to initialize local IP address (config file not read)
07/28 13:48:44 Already found IP with gethostbyname()
07/28 13:48:44 Trying to initialize local IP address (after reading config)
07/28 13:48:44 NETWORK_INTERFACE not in config file, using existing value
FULL_HOSTNAME = localhost.localdomain
HOSTNAME = eeyore
IP_ADDRESS = 127.0.0.1

That is from my laptop, whose IP changes with the time of day and the weather, and whose name in /etc/hosts is not fully qualified.

Condor is doing its best to find an FQDN and an associated IP. The heuristic used to identify an FQDN is the presence of a period (.). Condor starts by calling gethostname() (you can run hostname to see the same thing). If that returns an FQDN, gethostbyname() is called to find the IP (compare hostname -i), and it is done. If a non-FQDN is returned, all the IPs associated with the name are scanned, looking for the one most likely to be a public IP. The heuristic prefers non-private over private over 127.0.0.1, where private means 10/8, 172.16/12, or 192.168/16. Once an IP is selected, the primary name and aliases for that IP are scanned for an FQDN (compare getent hosts $(hostname)), and it is done.
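The address-selection step boils down to a few lines. Here is a minimal sketch, not Condor source, of the preference order just described; the rank() helper and the sample address list are made up for illustration.

/* Sketch of the address-preference heuristic: prefer a public address over
 * a private one (10/8, 172.16/12, 192.168/16), and a private one over
 * 127.0.0.1. */
#include <stdio.h>

/* Rank a dotted-quad IPv4 address: higher is more preferred. */
static int rank(const char *ip)
{
    unsigned a, b, c, d;
    if (sscanf(ip, "%u.%u.%u.%u", &a, &b, &c, &d) != 4) return -1;
    if (a == 127) return 0;                         /* loopback */
    if (a == 10) return 1;                          /* 10/8 */
    if (a == 172 && b >= 16 && b <= 31) return 1;   /* 172.16/12 */
    if (a == 192 && b == 168) return 1;             /* 192.168/16 */
    return 2;                                       /* likely public */
}

int main(void)
{
    /* Addresses as gethostbyname() might return them for a laptop. */
    const char *addrs[] = { "127.0.0.1", "192.168.3.153" };
    int n = sizeof(addrs) / sizeof(addrs[0]);
    const char *best = addrs[0];
    int i;

    for (i = 1; i < n; i++)
        if (rank(addrs[i]) > rank(best))
            best = addrs[i];

    printf("selected %s\n", best);   /* prints: selected 192.168.3.153 */
    return 0;
}

In the laptop output above, 127.0.0.1 was the only address found for the name, so there was nothing better to prefer and IP_ADDRESS ended up as 127.0.0.1.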

After experimenting with various configurations, the best advice is to set up the system so that hostname returns a fully qualified name, and not to bother changing /etc/hosts. The table below summarizes the results.

hostname      /etc/hosts    HOSTNAME  FULL_HOSTNAME          IP_ADDRESS
eeyore        (none)        eeyore    eeyore                 0.0.0.0 (OOPS!)
eeyore        eeyore        eeyore    localhost.localdomain  127.0.0.1
eeyore.local  (none)        eeyore    eeyore.local           192.168.3.153
eeyore.local  eeyore        eeyore    eeyore.local           192.168.3.153
eeyore.local  eeyore.local  eeyore    eeyore.local           127.0.0.1

Getting started: Installing a single node Condor pool

July 26, 2010

Install the condor package.

[root@eeyore ~]# yum -y install condor

You’ll get classads and gsoap as well.

Start Condor.

[root@eeyore ~]# service condor start
Starting Condor daemons:                                   [  OK  ]

Take a look at your new Personal Condor setup.

[root@eeyore ~]# condor_q

-- Submitter: localhost.localdomain :  : localhost.localdomain
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               

0 jobs; 0 idle, 0 running, 0 held
[root@eeyore ~]# condor_status

Name               OpSys      Arch   State     Activity LoadAv Mem   ActvtyTime

slot1@localhost.lo LINUX      X86_64 Unclaimed Idle     0.510   940  0+00:00:04
slot2@localhost.lo LINUX      X86_64 Unclaimed Idle     0.000   940  0+00:00:05
slot3@localhost.lo LINUX      X86_64 Unclaimed Idle     0.000   940  0+00:00:06
slot4@localhost.lo LINUX      X86_64 Unclaimed Idle     0.000   940  0+00:00:07
                     Total Owner Claimed Unclaimed Matched Preempting Backfill

        X86_64/LINUX     4     0       0         4       0          0        0

               Total     4     0       0         4       0          0        0

Test it out with a job, as yourself.

Write a job submit file.

09:22:46am> eeyore:~ $ cat > first.job
cmd = /bin/cat
args = /proc/self/status
output = first.job.$(cluster).$(process).out
error = first.job.$(cluster).$(process).err
log = first.job.log
queue 3
^D

Submit your job.

09:24:23am> eeyore:~ $ condor_submit first.job 
Submitting job(s)...
Logging submit event(s)...
3 job(s) submitted to cluster 1.

Check the queue for your jobs.

09:24:39am> eeyore:~ $ condor_q

-- Submitter: localhost.localdomain :  : localhost.localdomain
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
   1.0   matt            7/26 09:24   0+00:00:00 I  0   0.1  cat /proc/self/sta
   1.1   matt            7/26 09:24   0+00:00:00 I  0   0.1  cat /proc/self/sta
   1.2   matt            7/26 09:24   0+00:00:00 I  0   0.1  cat /proc/self/sta

3 jobs; 3 idle, 0 running, 0 held

(Write a little in your blog and miss the jobs running.)

09:24:47am> eeyore:~ $ condor_q
-- Submitter: localhost.localdomain :  : localhost.localdomain
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               

0 jobs; 0 idle, 0 running, 0 held

Check the output.

09:25:01am> eeyore:~ $ l first.job*
4.0K -rw-------. 1 matt matt  157 Jul 26 09:24 first.job
   0 -rw-------. 1 matt matt    0 Jul 26 09:24 first.job.1.0.err
   0 -rw-------. 1 matt matt    0 Jul 26 09:24 first.job.1.2.err
   0 -rw-------. 1 matt matt    0 Jul 26 09:24 first.job.1.1.err
4.0K -rw-------. 1 matt matt  871 Jul 26 09:24 first.job.1.0.out
4.0K -rw-------. 1 matt matt  871 Jul 26 09:24 first.job.1.2.out
4.0K -rw-------. 1 matt matt  871 Jul 26 09:24 first.job.1.1.out
4.0K -rw-------. 1 matt matt 1.8K Jul 26 09:24 first.job.log

The jobs can be found in the queue’s history.

09:28:59am> eeyore:~ $ condor_history
 ID      OWNER            SUBMITTED     RUN_TIME ST   COMPLETED CMD            
   1.2   matt            7/26 09:24   0+00:00:00 C   7/26 09:24 /bin/cat /proc/
   1.0   matt            7/26 09:24   0+00:00:00 C   7/26 09:24 /bin/cat /proc/
   1.1   matt            7/26 09:24   0+00:00:00 C   7/26 09:24 /bin/cat /proc/

It is just that simple.

MALLOC_PERTURB_: Finding real bugs in condor_chirp

July 17, 2010

If you are not familiar with MALLOC_PERTURB_, you should read the fedora-devel post by Jim Meyering.

After condor_ssh_to_job, which gives you a shell in the environment of your running job,

$ /usr/libexec/condor/condor_chirp get_job_attr Owner                 
"matt"uuuuuuuuuuuuuuuuuu
$ MALLOC_PERTURB_=97 /usr/libexec/condor/condor_chirp get_job_attr Owner
"matt"aaaaaaaaaaaaaaaaaa
$ MALLOC_PERTURB_=98 /usr/libexec/condor/condor_chirp get_job_attr Owner
"matt"bbbbbbbbbbbbbbbbbb
$ MALLOC_PERTURB_=0 /usr/libexec/condor/condor_chirp get_job_attr Owner
"matt"

From io_proxy_handler.cpp:

		result = REMOTE_CONDOR_get_job_attr(name,recv_expr);
		if(result==0) {
			sprintf(line,"%u",(unsigned int)strlen(recv_expr));
			r->put_line_raw(line);
			r->put_bytes_raw(recv_expr,strlen(recv_expr));
		} else {

From chirp_client_get_job_attr in chirp_client.c:

		*expr = (char*)malloc(result);
		if(*expr) {
			actual = fread(*expr,1,result,c->rstream);
			if(actual!=result) chirp_fatal_request("get_job_attr");
		} else {

From condor_chirp.cpp:

	char *p = 0;
	chirp_client_get_job_attr(client, argv[2], &p);
	printf("%s\n", p);
