Archive for August, 2010

Subsystem and Daemon confusion

August 23, 2010

Condor has a notion of a Subsystem for customizing configuration between daemons. This is conflated with the notion of a Daemon.

Condor’s Master runs programs, which we’ll call Daemons, specified by DAEMON_LIST, e.g. DAEMON_LIST = MASTER, STARTD. Condor’s tools let you manipulate Daemons, e.g. condor_on -subsystem STARTD, condor_restart -subsystem STARTD. Wait, manipulate daemons with a -subsystem argument?

Take the SHADOW_STARTD example, DAEMON_LIST = MASTER, STARTD, SHADOW_STARTD. We know both are part of the same subsystem, STARTD, because they are both the condor_startd executable. It is perfectly reasonable to think that running “condor_restart -subsystem STARTD” will restart both the STARTD and the SHADOW_STARTD. After all, they are both part of the STARTD subsystem.

That’s not what will happen. The SHADOW_STARTD will not be restarted.

Historically, the name of a daemon in the DAEMON_LIST has mapped one to one with subsystems. The code does not, and should not, enforce this. The -subsystem argument is just misleading. It should be -daemon.

condor_restart -daemon STARTD,SHADOW_STARTD

could restart both daemons.

Right now (7.4),

condor_restart -subsystem STARTD
condor_restart -subsystem SHADOW_STARTD

will restart both.
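The gap between what -subsystem suggests and what it does can be sketched as a toy model. This is only an illustration of the behavior described above, not condor_master code; the EXECUTABLE table and both function names are made up for the example.

```python
# Toy model only: names and logic are illustrative, not condor_master source.
DAEMON_LIST = ["MASTER", "STARTD", "SHADOW_STARTD"]

# Which executable each DAEMON_LIST entry runs (SHADOW_STARTD = $(STARTD)).
EXECUTABLE = {
    "MASTER": "condor_master",
    "STARTD": "condor_startd",
    "SHADOW_STARTD": "condor_startd",
}

def targets_by_name(arg):
    """Observed behavior: the argument is matched against DAEMON_LIST names."""
    return [d for d in DAEMON_LIST if d == arg]

def targets_by_executable(arg):
    """What '-subsystem' suggests: every daemon sharing the same executable."""
    exe = EXECUTABLE[arg]
    return [d for d in DAEMON_LIST if EXECUTABLE[d] == exe]

print(targets_by_name("STARTD"))        # ['STARTD']
print(targets_by_executable("STARTD"))  # ['STARTD', 'SHADOW_STARTD']
```

Matching by name is why two separate condor_restart calls are needed to reach both Startds.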

Now, the expected behavior of -subsystem may actually be a desirable feature. But probably not with the name -subsystem. The feature that is really desirable is a grouping of daemons. The information to do it properly is not easily accessible to the condor_master though. Providing daemon-centric configuration could make it possible, e.g.

STARTD_GROUP = STARTDS
SHADOW_STARTD_GROUP = STARTDS

And then,

condor_restart -group STARTDS
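A rough sketch of how such a group could be resolved, assuming the condor_master had each daemon's configuration in hand. Both the <DAEMON>_GROUP parameter and the -group option are proposals from this post, not existing Condor features, and daemons_in_group is a hypothetical helper.

```python
# Hypothetical: <DAEMON>_GROUP is a proposed parameter, not an existing
# Condor setting. The dict stands in for the parsed configuration.
config = {
    "DAEMON_LIST": "MASTER, STARTD, SHADOW_STARTD",
    "STARTD_GROUP": "STARTDS",
    "SHADOW_STARTD_GROUP": "STARTDS",
}

def daemons_in_group(config, group):
    """Return the DAEMON_LIST entries whose <DAEMON>_GROUP equals group."""
    daemons = [d.strip() for d in config["DAEMON_LIST"].split(",")]
    return [d for d in daemons if config.get(d + "_GROUP") == group]

print(daemons_in_group(config, "STARTDS"))  # ['STARTD', 'SHADOW_STARTD']
```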

Side note, why does MASTER have to be in the DAEMON_LIST? The condor_master will bail if it is missing. Probably to avoid special case code paths, since condor_on/off/restart -subsystem MASTER works too.


Condor Configuration: Subsystem and Local-name

August 20, 2010

Subsystem

Every Condor daemon has a burned in notion of a subsystem, its subsystem. These are fairly logical, e.g. condor_startd’s subsystem is STARTD while condor_collector’s subsystem is COLLECTOR. See the pattern? As of 7.4 there are about 30, including MASTER, SCHEDD, SHADOW, STARTER, TOOL, GRIDMANAGER, VM_GAHP, …

All Condor daemons read the same configuration files. Subsystem is a useful mechanism to vary configuration parameters between daemons. For instance, the configuration parameter NOT_RESPONDING_TIMEOUT controls how long a daemon can go without sending a keep-alive to its parent. It defaults to one hour, but maybe you do not want to wait for an hour if your condor_collector hangs. To achieve this you can set COLLECTOR.NOT_RESPONDING_TIMEOUT = 1800, in seconds of course, which means the condor_collector only gets to go off the reservation for at most 30 minutes.

Local-name

As you surely know, the condor_master reads the DAEMON_LIST parameter to figure out what daemons it should run, e.g. DAEMON_LIST = MASTER, STARTD runs a condor_startd. It is often popular to run multiple copies of a daemon. As a way to do deployment testing, an installation may want to have a shadow pool that only runs no-op-like jobs on a newer version of Condor than is in production, while sharing the production hardware. I want to meet the folks who buy an extra 5,000 node cluster just for production testing. In such a configuration the DAEMON_LIST may be MASTER, STARTD, SHADOW_STARTD. Pretend the SHADOW_STARTD is defined to be some different condor_startd version.

SHADOW_STARTD = $(STARTD)
DAEMON_LIST = MASTER, STARTD, SHADOW_STARTD

This means the condor_master tries to run two condor_startd daemons. This is not enough configuration to make it work though. Each Startd will read the same parameters, e.g. STARTD_LOG, EXECUTE or policy like START. That is probably not what was intended. In fact having two Startds share an EXECUTE is a recipe for disaster.

Both the STARTD and SHADOW_STARTD are the condor_startd executable, even if they are different versions, so they both have the same subsystem. Local-name to the rescue here. Each daemon can be given a -local-name parameter,

SHADOW_STARTD_ARGS = -local-name SHADOW

Local-name provides the needed differentiator. You can now set specific configuration for the SHADOW_STARTD,

STARTD.SHADOW.EXECUTE = $(LOCAL_DIR)/shadow_execute

Keep in mind, this is not enough config to run two Startds on a single system. You will probably also need to set STARTD.SHADOW.ADDRESS_FILE, STARTD.SHADOW.STARTD_NAME, STARTD.SHADOW.STARTD_LOG and disable USE_PROCD.
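To make the prefixes concrete, here is a simplified model of parameter lookup. It only covers the three forms used in this post (SUBSYS.LOCALNAME.PARAM, SUBSYS.PARAM, PARAM) and assumes the most specific one wins; Condor's real lookup may consider other forms. The lookup function and the plain EXECUTE value are assumptions for the example.

```python
def lookup(config, param, subsystem, local_name=None):
    """Return the most specific value for param, or None if unset.

    Simplified model, not Condor's implementation: tries
    SUBSYS.LOCALNAME.PARAM, then SUBSYS.PARAM, then PARAM.
    """
    candidates = []
    if local_name:
        candidates.append(f"{subsystem}.{local_name}.{param}")
    candidates.append(f"{subsystem}.{param}")
    candidates.append(param)
    for key in candidates:
        if key in config:
            return config[key]
    return None

config = {
    "EXECUTE": "$(LOCAL_DIR)/execute",  # assumed placeholder value
    "STARTD.SHADOW.EXECUTE": "$(LOCAL_DIR)/shadow_execute",
    "NOT_RESPONDING_TIMEOUT": "3600",   # the one hour default, in seconds
    "COLLECTOR.NOT_RESPONDING_TIMEOUT": "1800",
}

print(lookup(config, "EXECUTE", "STARTD"))            # production Startd
print(lookup(config, "EXECUTE", "STARTD", "SHADOW"))  # shadow Startd
print(lookup(config, "NOT_RESPONDING_TIMEOUT", "COLLECTOR"))
print(lookup(config, "NOT_RESPONDING_TIMEOUT", "SCHEDD"))
```

In this model the production Startd falls through to the plain EXECUTE, while the daemon started with -local-name SHADOW picks up its own directory.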

Firewalling execute nodes: Avoid LOWPORT/HIGHPORT, use IN_LOWPORT/IN_HIGHPORT

August 8, 2010

We all know why firewalls are set up. Typical firewall configurations minimize inbound connections and allow unrestricted outbound connections.

Condor primarily uses ephemeral ports for inbound connections. To assist configuration with firewalls, it has long provided LOWPORT and HIGHPORT configuration options to constrain the port range it uses. Going beyond port range management, Condor has grown to include the Condor Connection Broker (CCB), to reverse connections when components are entirely hidden by firewalls, and condor_shared_port, to reduce the inbound port footprint on a machine to one.

Unfortunately, there is a disconnect between the typical firewall configuration and what the LOWPORT/HIGHPORT configuration expresses. LOWPORT/HIGHPORT constrains both inbound and outbound port usage.

On an execute node, Condor will run a condor_master, a condor_startd and a few condor_starter processes, one per job. All must be able to accept connections. For a node that can run 4 jobs, the minimum number of inbound ports open in the node’s firewall is 6, one for each of the 6 potential processes. However, each of those processes will use more than one port during its lifetime. In fact, a process may have 3 open connections at some point. Using LOWPORT/HIGHPORT, that means setting a range that is 3 times wider than is necessary. It is possible to reduce that because not all processes will use all 3 connections at once, until they do. Going low is fragile.

Luckily, Condor provides IN_LOWPORT/IN_HIGHPORT and OUT_LOWPORT/OUT_HIGHPORT. For a typical firewall configuration, ignore the OUT_’s and use the IN_’s, e.g. IN_LOWPORT = 10000, IN_HIGHPORT = 10005. You will be much happier.
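The arithmetic behind those numbers, as a small sketch. The function names are made up for the example, and the figure of 3 connections per process is the estimate from the paragraph above, not a guarantee.

```python
def execute_node_processes(job_slots):
    """condor_master + condor_startd + one condor_starter per job."""
    return 2 + job_slots

def in_port_range(low, job_slots):
    """Smallest IN_LOWPORT/IN_HIGHPORT pair: one inbound port per process."""
    return low, low + execute_node_processes(job_slots) - 1

def combined_range_width(job_slots, connections_per_process=3):
    """Ports needed if LOWPORT/HIGHPORT must also cover outbound use."""
    return execute_node_processes(job_slots) * connections_per_process

print(in_port_range(10000, 4))   # (10000, 10005)
print(combined_range_width(4))   # 18
```

For 4 jobs that is 6 firewall holes with the IN_ settings versus a range of 18 with LOWPORT/HIGHPORT.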

Port usage running 4 jobs with configuration,

ALL_DEBUG = D_NETWORK
IN_LOWPORT = 10000
IN_HIGHPORT = 10015
OUT_LOWPORT = 20000
OUT_HIGHPORT = 20015

Looks like,

MasterLog:08/08/10 09:25:10 Sock::bindWithin - bound to 10012...
MasterLog:08/08/10 09:25:13 Sock::bindWithin - bound to 10000...
MasterLog:08/08/10 09:25:18 Sock::bindWithin - bound to 20009...
StartLog:08/08/10 09:25:24 Sock::bindWithin - bound to 20015...
StartLog:08/08/10 09:25:24 Sock::bindWithin - bound to 20007...
StartLog:08/08/10 09:25:28 Sock::bindWithin - bound to 20003...
StartLog:08/08/10 09:25:34 Sock::bindWithin - bound to 10007...
StartLog:08/08/10 09:25:34 Sock::bindWithin - bound to 10013...
StartLog:08/08/10 09:25:34 Sock::bindWithin - bound to 10006...
StartLog:08/08/10 09:25:34 Sock::bindWithin - bound to 10015...
StarterLog.slot1:08/08/10 09:25:34 Sock::bindWithin - bound to 20007...
StarterLog.slot1:08/08/10 09:25:34 Sock::bindWithin - bound to 20008...
StarterLog.slot2:08/08/10 09:25:34 Sock::bindWithin - bound to 20003...
StarterLog.slot2:08/08/10 09:25:34 Sock::bindWithin - bound to 20013...
StarterLog.slot3:08/08/10 09:25:34 Sock::bindWithin - bound to 20011...
StarterLog.slot3:08/08/10 09:25:34 Sock::bindWithin - bound to 20012...
StarterLog.slot4:08/08/10 09:25:34 Sock::bindWithin - bound to 20013...
StarterLog.slot4:08/08/10 09:25:34 Sock::bindWithin - bound to 20004...

That’s 6 inbound ports and 12 outbound binds, the latter spread over 9 distinct ports since 3 of them were used more than once.
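The tally can be checked with a short script. It is only a sketch: it embeds the D_NETWORK lines shown above and classifies each bind by the configured IN_ and OUT_ ranges; the summarize helper is not part of Condor.

```python
import re

LOG = """\
MasterLog:08/08/10 09:25:10 Sock::bindWithin - bound to 10012...
MasterLog:08/08/10 09:25:13 Sock::bindWithin - bound to 10000...
MasterLog:08/08/10 09:25:18 Sock::bindWithin - bound to 20009...
StartLog:08/08/10 09:25:24 Sock::bindWithin - bound to 20015...
StartLog:08/08/10 09:25:24 Sock::bindWithin - bound to 20007...
StartLog:08/08/10 09:25:28 Sock::bindWithin - bound to 20003...
StartLog:08/08/10 09:25:34 Sock::bindWithin - bound to 10007...
StartLog:08/08/10 09:25:34 Sock::bindWithin - bound to 10013...
StartLog:08/08/10 09:25:34 Sock::bindWithin - bound to 10006...
StartLog:08/08/10 09:25:34 Sock::bindWithin - bound to 10015...
StarterLog.slot1:08/08/10 09:25:34 Sock::bindWithin - bound to 20007...
StarterLog.slot1:08/08/10 09:25:34 Sock::bindWithin - bound to 20008...
StarterLog.slot2:08/08/10 09:25:34 Sock::bindWithin - bound to 20003...
StarterLog.slot2:08/08/10 09:25:34 Sock::bindWithin - bound to 20013...
StarterLog.slot3:08/08/10 09:25:34 Sock::bindWithin - bound to 20011...
StarterLog.slot3:08/08/10 09:25:34 Sock::bindWithin - bound to 20012...
StarterLog.slot4:08/08/10 09:25:34 Sock::bindWithin - bound to 20013...
StarterLog.slot4:08/08/10 09:25:34 Sock::bindWithin - bound to 20004...
"""

IN_RANGE = range(10000, 10016)   # IN_LOWPORT..IN_HIGHPORT, inclusive
OUT_RANGE = range(20000, 20016)  # OUT_LOWPORT..OUT_HIGHPORT, inclusive

def summarize(log):
    """Return (inbound binds, outbound binds) as lists of port numbers."""
    inbound, outbound = [], []
    for line in log.splitlines():
        match = re.search(r"bound to (\d+)", line)
        if not match:
            continue
        port = int(match.group(1))
        if port in IN_RANGE:
            inbound.append(port)
        elif port in OUT_RANGE:
            outbound.append(port)
    return inbound, outbound

inbound, outbound = summarize(LOG)
print(len(inbound), len(set(inbound)))    # 6 6
print(len(outbound), len(set(outbound)))  # 12 9
```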

