Archive for January, 2013

Some htcondor-wiki stats

January 29, 2013

A few years ago I discovered Web Numbr, a service that will monitor a web page for a number and graph that number over time.

I installed a handful of webnumbrs to track things at HTCondor’s gittrac instance.

http://webnumbr.com/search?query=condor

Things such as:

  • Tickets resolved with no destination: tickets that don’t indicate what version they were fixed in. Anyone who wants to know if a bug is fixed or a feature was added in their version of HTCondor and encounters one of these will have to go spelunking in the repository for their answer.
  • Tickets resolved but not assigned: tickets that were worked on and completed, but whoever worked on them never claimed ownership.
  • Action items with commits: tickets that are marked as Todo/Incident, yet have associated code changes. Once there is a code change the ticket is either a bug fix (ticket type: defect) or feature addition (ticket type: enhancement). Extra work is imposed on whoever comes after the ticket owner and wants to understand what they are looking at. Additionally, these tickets skew information about bugs and features in releases.
  • Tickets with invalid version fields: tickets that do not follow the somewhat strict version field syntax: vXXYYZZ, e.g. v070901. All the extra 0s are necessary and the v must be lowercase.
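The version field rule is easy to check mechanically. Here is a minimal Python sketch, assuming the syntax is exactly a lowercase v followed by three zero-padded two-digit fields. The function names are mine for illustration, not anything from gittrac.

```python
import re

# Assumed from the description above: lowercase "v" followed by exactly
# three zero-padded two-digit fields (XX, YY, ZZ).
VERSION_FIELD = re.compile(r"v[0-9]{6}")


def is_valid_version_field(value):
    """Return True if value follows the vXXYYZZ syntax, e.g. v070901."""
    return VERSION_FIELD.fullmatch(value) is not None


def format_version_field(major, minor, patch):
    """Build a version field from its three numeric parts, e.g. 7, 9, 1."""
    return f"v{major:02d}{minor:02d}{patch:02d}"
```

For example, `is_valid_version_field("v070901")` is true, while `"V070901"` and `"v791"` are both rejected.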

I wanted to embed the numbers here, but JavaScript is needed and WordPress.com filters JavaScript from posts.

Concurrency Limits: Group defaults

January 21, 2013

Concurrency limits protect resources by capping the number of jobs requiring a specific resource that can run at one time.

For instance, limit licenses and filer access at four regional data centers.

CONCURRENCY_LIMIT_DEFAULT = 15
license.north_LIMIT = 30
license.south_LIMIT = 30
license.east_LIMIT = 30
license.west_LIMIT = 45
filer.north_LIMIT = 75
filer.south_LIMIT = 150
filer.east_LIMIT = 75
filer.west_LIMIT = 75

Notice the repetition.

In addition to the repetition, every license.* and filer.* must be known and recorded in configuration. The set may be small in this example, but imagine imposing a limit on each user or each submission. The set of users is broad, dynamic and may differ by region. The set of submissions is a more extreme version of the users case, yet it is still realistic.

To simplify the configuration management for groups of limits, a new feature providing group defaults for limits was added for the Condor 7.8 series.

The feature requires that only the exceptions to a rule be called out explicitly in configuration. For instance, license.west and filer.south are the exceptions in the configuration above. The simplified configuration available in 7.8:

CONCURRENCY_LIMIT_DEFAULT = 15
CONCURRENCY_LIMIT_DEFAULT_license = 30
CONCURRENCY_LIMIT_DEFAULT_filer = 75
license.west_LIMIT = 45
filer.south_LIMIT = 150
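The lookup order implied by that configuration can be sketched in a few lines of Python. This is only an illustration of the fallback as I describe it (explicit limit, then group default, then global default). It is not HTCondor's implementation, it ignores details such as case-insensitive configuration names, and `resolve_limit` and `SIMPLIFIED_CONFIG` are hypothetical names.

```python
# The simplified configuration from above, as a plain dictionary.
SIMPLIFIED_CONFIG = {
    "CONCURRENCY_LIMIT_DEFAULT": 15,
    "CONCURRENCY_LIMIT_DEFAULT_license": 30,
    "CONCURRENCY_LIMIT_DEFAULT_filer": 75,
    "license.west_LIMIT": 45,
    "filer.south_LIMIT": 150,
}


def resolve_limit(name, config):
    """Return the cap for a limit name such as 'license.north'."""
    # 1. An explicit <name>_LIMIT always wins.
    explicit = f"{name}_LIMIT"
    if explicit in config:
        return config[explicit]
    # 2. Otherwise fall back to the default for the group, i.e. the part
    #    of the name before the first dot.
    if "." in name:
        group = name.split(".", 1)[0]
        group_default = f"CONCURRENCY_LIMIT_DEFAULT_{group}"
        if group_default in config:
            return config[group_default]
    # 3. Finally, use the global default.
    return config["CONCURRENCY_LIMIT_DEFAULT"]
```

With this sketch, `resolve_limit("license.north", SIMPLIFIED_CONFIG)` gives 30, `license.west` gives 45, and a name with no group default falls through to 15.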

In action,

$ for limit in license.north license.south license.east license.west filer.north filer.south filer.east filer.west; do echo queue 1000 | condor_submit -a cmd=/bin/sleep -a args=1d -a concurrency_limits=$limit; done

$ condor_q -format '%s\n' ConcurrencyLimits -const 'JobStatus == 2' | sort | uniq -c | sort -n
     30 license.east
     30 license.north
     30 license.south
     45 license.west
     75 filer.east
     75 filer.north
     75 filer.west
    150 filer.south

Your API is a feature, give it real resource management

January 14, 2013

So much these days is about distributed resource management. That’s anything that can be created and destroyed in the cloud[0]. Proper management is especially important when the resource’s existence is tied to a real economy, e.g. your user’s credit card[1].

EC2 instance creation without idempotent RunInstance

Above is a state machine required to ensure that resources created in AWS EC2 are not lost, i.e. do not have to be manually cleaned up. The green arrows represent error-free flow. The rest is about error handling or external state changes, e.g. a user-terminated operation. This is from before EC2 supported idempotent instance creation.

The state machine rewritten to use idempotent instance creation,

EC2 instance creation with idempotent RunInstance

What’s going on here? Handling failure during resource creation.

The important failure to consider as a client is what happens if you ask your resource provider to create something and you never hear back. This is a distributed system; there are numerous reasons why you may not hear back. For simplicity, consider the case where the client code crashed between sending the request and receiving the response.

The solution is to construct a transaction for resource creation[2]. To construct a transaction, you need to atomically associate a piece of information with the resource at creation time. We’ll call that piece of information an anima.

In the old EC2 API, the only way to construct an anima was through controlling a security group or keypair. Since neither is tied to a real economy, both are reasonable options. The non-idempotent state machine above uses the keypair as it is less resource intensive for EC2.

On creation failure and with the anima in hand[3], the client must search the remote system for the anima before reattempting creation. This is handled by the GM_CHECK_VM state above.

Unfortunately, without explicit support in the API, i.e. lookup by anima, the search can be unnatural and expensive. For example, EC2 instances are not indexed on keypair. Searching requires a client side scan of all instances.
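To make the cost concrete, here is a rough Python sketch of recovery by scanning. Everything in it is hypothetical: `FakeProvider` is an in-memory stand-in for a non-idempotent API, not EC2 or any real SDK, the anima is carried as a keypair name, and `store` is a dictionary standing in for reliable local storage. It also skips creating the keypair itself, which the real state machine has to handle.

```python
import uuid


class FakeProvider:
    """Hypothetical in-memory stand-in for a non-idempotent cloud API."""

    def __init__(self):
        self._instances = []

    def run_instance(self, keypair):
        # Every call creates a new instance, even if it is a retry.
        instance = {"id": f"i-{len(self._instances):04d}", "keypair": keypair}
        self._instances.append(instance)
        return instance["id"]

    def describe_instances(self):
        return list(self._instances)


def find_by_anima(provider, anima):
    """Client-side scan: there is no server-side index on the keypair."""
    for instance in provider.describe_instances():
        if instance["keypair"] == anima:
            return instance["id"]
    return None


def create_with_scan(provider, store):
    """Create an instance, or recover one left behind by an earlier attempt."""
    # Persist the anima before any remote call so it survives a crash.
    if "anima" not in store:
        store["anima"] = f"anima-{uuid.uuid4()}"
    # Search for the anima before (re)attempting creation.
    existing = find_by_anima(provider, store["anima"])
    if existing is not None:
        return existing
    return provider.run_instance(store["anima"])
```

If the client crashed after the provider created the instance, calling `create_with_scan` again with the same `store` finds the existing instance instead of creating a second one, at the price of listing every instance.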

With the introduction of idempotent RunInstances, the portion of the state machine for constructing and locating the anima is reduced to the GM_SAVE_CLIENT_TOKEN state, an entirely local operation. The reduction in complexity is clear.
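A sketch of the idempotent version, under the same caveats: `IdempotentProvider` is a hypothetical in-memory provider whose create call is keyed on a client-supplied token, and `store` again stands in for reliable local storage. None of these names come from a real SDK.

```python
import uuid


class IdempotentProvider:
    """Hypothetical provider whose create call is idempotent on a token."""

    def __init__(self):
        self._by_token = {}

    def run_instances(self, client_token):
        # A repeated token returns the instance created the first time.
        if client_token not in self._by_token:
            self._by_token[client_token] = f"i-{len(self._by_token):04d}"
        return self._by_token[client_token]

    def instance_count(self):
        return len(self._by_token)


def create_with_token(provider, store):
    """Create an instance; safe to call again after a crash or timeout."""
    # The only bookkeeping is local: save the token before the request.
    if "client_token" not in store:
        store["client_token"] = str(uuid.uuid4())
    return provider.run_instances(store["client_token"])
```

Retrying is now just calling `create_with_token` again with the same `store`. There is no search step, because the provider does the matching.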

After two years, EC2 appears to be the only API providing idempotent instance creation[4]. Other APIs are, however, starting to provide atomic anima association, often through metadata or instance attributes, and some even provide lookup.

You should provide an idempotent resource creation operation in your API too!

[0] “in the cloud” – really anywhere in any distributed system!
[1] Making money from forgotten or misplaced resources is a short term play.
[2] Alternatively, you can choose an architecture with a janitor process, which will bring its own complexities.
[3] “in hand” – so long as your hand is reliable storage.
[4] After a quick survey, I’m looking at you Rackspace, RimuHosting, vCloud, OpenNebula, OpenStack, Eucalyptus, GoGrid, Deltacloud, Google Compute Engine and Gandi.

Web design complexity

January 7, 2013

One thing that has always impressed me is the ability of web designers to deal with browser idiosyncrasies.

For instance, knowing why this happens in firefox-17.0.1-1.fc17.x86_64 –

A bootstrap btn-primary viewed from 0.0.0.0

A bootstrap btn-primary viewed from localhost

A bootstrap btn-primary firebug computed color from 0.0.0.0 and localhost

Needless to say, the web is littered with questions about why btn-primary background color is not always white. Most have answers, some with varying degrees of complex css. Others involve changing versions of software. All the while it might just be the URL used to view the page.

Test in a production environment.

