Posts Tagged ‘Model’

No longer thinking in slots, thinking in aggregate resources and consumption policies

November 13, 2012

The slot model would have been natural when a machine housed a single core. Yet the slot model did not exist when machines housed single cores.

When machines were single core the model was a machine, represented as a MachineAd. A MachineAd had an associated CPU, some nominal amount of RAM and some chunk of disk space. Running a job meant consuming a machine.

When machines grew multiple cores the machine model was split. A single machine became independent MachineAds, called virtual machines. However, the name didn’t stick, as “virtual machine” became the popular term in hardware virtualization. So a machine became independent MachineAds, called slots. The unifying entity, the machine itself, was lost. Running a job still meant consuming a slot.

Most recently, slots split into two classes: static and partitionable. Static slots are the slots formerly known as virtual machines. Partitionable slots are a representation of the physical machine itself, and are carved up on demand to service jobs. Both types are still MachineAds, but the consumption of partitionable slots is dynamic.

The slot model has demonstrated great utility but has been stretched.

In this time workloads have also changed. They have become more memory bound, disk IO bound, and network bound. They have started relying on specialized hardware and even application level services. They have started both spanning and packing into cores. They have grown complex data dependencies, become very short running, and become infrastructure level long running.

Machines have also grown to include scores of cores, hundreds of gigabytes of RAM, dozens of terabytes of disk, and specialized hardware such as GPUs, co-processors, entropy keys, high-speed interconnects, and a bevy of other attached devices.

Machines are lumpy; heterogeneous means more than operating system and CPU architecture.

Furthermore, even if it still existed, the machine model itself would fail to cleanly describe available resources. Classes of resources exist that house entire clusters, grids, or life-cycle manageable application services. Some resources share addressable memory across operating system instances, some are custom architectures spanning whole data centers, and some do not even provide an outline of their capacity. Resources may grow and shrink while in use.

Consumption of these resources is not necessarily straightforward or uniform.

It’s time to stop thinking in slots. It’s time to start thinking in aggregate resources and their consumption policies.

Network use by advertisements in a Condor based Grid

April 11, 2009

It is certainly a good idea to have an understanding of how information flows through a distributed system when you consider deploying it, or when you want to build an application on top of it.

Condor has three core protocols: advertisement, negotiation and execution. The advertisement protocol keeps components apprised of what other components are in the pool and their state. This works by all components sending information about themselves, in the form of a ClassAd called an ad, to the pool’s Collector. The Schedulers, condor_schedd, send Schedd Ads, the execute nodes, running the condor_startd, send Startd Ads, and so on.

The Collector, the condor_collector daemon, aggregates all the ads and provides an interface to query them; it bootstraps discovery for the rest of the pool. An important part of the advertisement protocol is that the Collector does not hold onto ads forever. Doing so would make the Collector essentially a huge memory leak, introduce start-up ordering issues (e.g. the Collector would have to start first), and complicate the case where the Collector fails and all components need to re-advertise. Since the Collector does not hold onto ads forever, all components must periodically advertise themselves.
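The expiry behavior can be modeled as a time-to-live map: each ad carries a deadline, queries skip anything that has lapsed, and a component that stops advertising simply disappears. A toy sketch (illustrative only, not Condor's implementation; the names and TTL here are made up):

```python
import time

class Collector:
    """Toy collector: stores ads with a time-to-live, so components
    must re-advertise before their previous ad expires."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.ads = {}  # name -> (ad, expiry time)

    def advertise(self, name, ad, now=None):
        now = time.time() if now is None else now
        self.ads[name] = (ad, now + self.ttl)

    def query(self, now=None):
        now = time.time() if now is None else now
        # Expired ads are simply not returned; a stale component vanishes.
        return {n: ad for n, (ad, exp) in self.ads.items() if exp > now}

collector = Collector(ttl_seconds=900)  # e.g. 3x a 5-minute update interval
collector.advertise("slot1@node1", {"MyType": "Machine"}, now=0)
collector.advertise("schedd@head", {"MyType": "Scheduler"}, now=0)
print(sorted(collector.query(now=60)))    # ['schedd@head', 'slot1@node1']
print(sorted(collector.query(now=1000)))  # [] -- both ads have lapsed
```

A real Collector is more involved, but this captures why a re-advertise loop, rather than one-shot registration, keeps the failure and restart cases simple.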

The Rates

There are two useful sets of rates in the system: the baseline rates and the activity driven rates.

Baseline

The baseline rates are defined for each component in the pool, and specify how often it publishes information about itself.

 Ad Type        Publisher    Frequency  Config
  Collector      Collector    15 min     COLLECTOR_UPDATE_INTERVAL
  Negotiator     Negotiator    5 min     NEGOTIATOR_UPDATE_INTERVAL
  DaemonMaster   Master        5 min     MASTER_UPDATE_INTERVAL
  Scheduler      Schedd        5 min     SCHEDD_INTERVAL
  Submitter      Schedd        5 min     SCHEDD_INTERVAL
  HAD            HAD           5 min     HAD_UPDATE_INTERVAL
  Grid           Gridmanager   5 min     GRIDMANAGER_COLLECTOR_UPDATE_INTERVAL
  Machine        Startd        5 min     UPDATE_INTERVAL
 (Note: Missing Quill and Standard Universe related components)
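The Config column names ordinary condor_config settings, whose values are given in seconds. A sketch of pinning the stock rates explicitly (assuming the defaults match the table above; check your version's defaults before relying on these):

```
# condor_config fragment: advertisement intervals, in seconds
COLLECTOR_UPDATE_INTERVAL  = 900   # the Collector's own ad, every 15 minutes
NEGOTIATOR_UPDATE_INTERVAL = 300   # Negotiator ads, every 5 minutes
MASTER_UPDATE_INTERVAL     = 300   # DaemonMaster ads
SCHEDD_INTERVAL            = 300   # Scheduler and Submitter ads
UPDATE_INTERVAL            = 300   # Machine (startd) ads
```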

Of these rates the most interesting are the Machine, Master, and Submitter. There are going to be more of those in the pool than anything else. Every core on every execution node has a Machine ad, every physical node has a Master ad, and each submitter, a user with jobs, has a Submitter ad.

In a pool with 10K physical 4-core execute nodes, the baseline advertisement rate is (10K Master ads + 10K*4 Machine ads) / 5 minutes = 50K ads / 5 minutes = 10K ads / minute ~= 167 ads / second. This is optimistic. There will be spikes.
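That arithmetic is easy to check mechanically; a minimal sketch using the hypothetical pool size from the example:

```python
nodes = 10_000         # physical execute nodes in the example pool
cores_per_node = 4     # one Machine ad per core (slot)
interval_s = 5 * 60    # Master and Machine ads are sent every 5 minutes

master_ads = nodes                    # one Master ad per physical node
machine_ads = nodes * cores_per_node  # one Machine ad per core
ads_per_interval = master_ads + machine_ads

per_minute = ads_per_interval / (interval_s / 60)
per_second = ads_per_interval / interval_s
print(ads_per_interval)   # 50000 ads per 5-minute interval
print(per_minute)         # 10000.0 ads / minute
print(round(per_second))  # 167 ads / second
```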

Activity Driven

Activity driven advertisements happen when some component changes state in a meaningful way that should be shared with other components. The two primary sources of state change in a pool come from submitted jobs, and activity on execution nodes.

When a job is submitted it initiates a state change in the Scheduler, which initiates the negotiation protocol, which results in state changes in execution nodes and initiates the execution protocol, which results in changes to Scheduler and execute node state.

Independent of a submitted job, an execution node can change state based on policy, e.g. a user accesses the node, administrative activities cause load, or time simply passes.

All these state changes can be complex to model. However, one aspect of them is directly controllable, and useful to know when monitoring a pool. While a job is running there are periodic updates from the execution node to the Scheduler.

 Publisher    Consumer    Frequency  Config
  Starter      Shadow       5 min     STARTER_UPDATE_INTERVAL
  Shadow       Schedd      15 min     SHADOW_QUEUE_UPDATE_INTERVAL

This means that while a job is running, the Scheduler gets an update on the job’s activity every 15 minutes. Of course, state changes for the job, such as completion or eviction, are propagated immediately.

In a pool with 40K running jobs, the rate of updates to the Scheduler is 40K / 15 minutes ~= 45 updates / second, with updates to the Shadows on the Scheduler’s machine coming in at 40K / 5 minutes ~= 134 updates / second.
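The same back-of-the-envelope check applies here; a small sketch using the 40K running jobs from the example (the exact quotients are 44.4 and 133.3, which the text rounds up):

```python
running_jobs = 40_000

# Intervals from the table above, in seconds
shadow_to_schedd_s = 15 * 60   # SHADOW_QUEUE_UPDATE_INTERVAL
starter_to_shadow_s = 5 * 60   # STARTER_UPDATE_INTERVAL

schedd_updates_per_s = running_jobs / shadow_to_schedd_s
shadow_updates_per_s = running_jobs / starter_to_shadow_s
print(round(schedd_updates_per_s, 1))  # 44.4 updates / second to the Schedd
print(round(shadow_updates_per_s, 1))  # 133.3 updates / second to the Shadows
```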


What’s the size of an ad or update on the wire and in memory?
