Best practices: batch application’s input

For application developers and their users.

An application that is going to be run in batch mode (non-interactive, maybe scheduled to remote resources) is going to need input[0]. That input might be a few numbers or numerous data files. For an application developer, the case of data files can be difficult to get right. There are many ways to handle data files and one big pitfall: hard-coded paths.

Below are a few options for application developers and the resulting work for a Condor user of the application.

The best approach is to read input from stdin or have paths to data files passed via arguments. Doing so shows the developer has batch processing in mind, and provides the application’s user with clear options.

A submit file for an application that reads from stdin –

executable = batch_app
input = input.dat

A submit file for an application that takes data files as arguments –

executable = batch_app
arguments = --input=input.dat
transfer_input_files = input.dat
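
On the application side, supporting both submit files takes little code. The following is a minimal sketch, assuming batch_app is written in Python and that its input is one number per line; the function names and the use of "-" to mean stdin are illustrative choices, not something the submit files require.

```python
import argparse
import sys


def parse_args(argv):
    """Parse the command line; --input defaults to "-", meaning stdin."""
    parser = argparse.ArgumentParser(prog="batch_app")
    parser.add_argument("--input", default="-",
                        help="path to the input file, or - to read stdin")
    return parser.parse_args(argv)


def read_numbers(lines):
    """Assumed input format: one number per line, blank lines ignored."""
    return [float(line) for line in lines if line.strip()]


def run(argv, stdin=sys.stdin):
    """Stand-in for the real work: sum the numbers found in the input."""
    args = parse_args(argv)
    if args.input == "-":
        # Condor connects the file named by "input = ..." to stdin.
        return sum(read_numbers(stdin))
    # Otherwise the path arrived via "arguments = --input=...".
    with open(args.input) as handle:
        return sum(read_numbers(handle))
```

A script built on this sketch would end with something like print(run(sys.argv[1:])). No path appears anywhere in the code itself.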

A middle ground approach may be necessary if the set of input files is large or their relationships are complex. In such a case, a meta-data file can be used, or the input files can be laid out in a well-defined pattern in the filesystem. Note: “well-defined pattern in the filesystem” is often a myth.

Of these approaches, the meta-data file is preferred. It makes the input files and their relationships explicit. However, it can be more difficult for the application’s user from a Condor perspective. When the files are instead laid out in the filesystem, the tendency is for the layout to be poorly defined, or for its definition to be maintained independently of the application.

A submit file for an application that takes a meta-data file –

executable = batch_app
arguments = --input=input.meta
transfer_input_files = input.meta[,all the files listed in input.meta]

The difficulty comes in listing all the files from input.meta. This is often mitigated by providing URIs, or paths, in input.meta that point into a shared filesystem. The files in a shared filesystem need not be transferred by Condor and need not be listed in transfer_input_files.
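
The listing can be generated rather than typed by hand. The following is a rough sketch under stated assumptions: input.meta is assumed to hold one entry per line with # comments, entries with a URI scheme are assumed to be fetched by the application itself, and absolute paths are assumed to live on a shared filesystem. The real format of a meta-data file is up to the application, so files_to_transfer and transfer_line are hypothetical helpers.

```python
import os
from urllib.parse import urlparse


def files_to_transfer(meta_lines):
    """Return the entries that Condor would need to transfer."""
    needed = []
    for line in meta_lines:
        entry = line.strip()
        if not entry or entry.startswith("#"):
            continue  # blank line or comment (assumed syntax)
        if urlparse(entry).scheme:
            continue  # a URI: assumed to be fetched by the application
        if os.path.isabs(entry):
            continue  # absolute path: assumed to be on a shared filesystem
        needed.append(entry)
    return needed


def transfer_line(meta_name, meta_lines):
    """Build the transfer_input_files line for a submit file."""
    names = [meta_name] + files_to_transfer(meta_lines)
    return "transfer_input_files = " + ",".join(names)
```

Calling transfer_line("input.meta", open("input.meta")) yields a line that can be pasted into the submit file.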

A submit file for an application that takes a hopefully-well-defined filesystem layout –

executable = batch_app
arguments = --input=data_dir
transfer_input_files = data_dir

This is simpler because Condor will transfer everything under data_dir into the job’s scratch space and keep it under a directory called data_dir. Often, data_dir will even exist on a shared filesystem and will not need to be transferred (remove transfer_input_files = data_dir and provide the full path with --input).

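Note: transfer_input_files = data_dir/ (with a trailing slash) will not replicate the directory tree in the job’s scratch space. The contents of data_dir are placed directly in the scratch directory rather than under a directory called data_dir.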

These two approaches, a meta-data file and a directory layout, can be combined to get the best of both.
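
One possible reading of the combination, offered as an assumption rather than a prescription: the meta-data file names its files relative to a data directory, and both the meta-data file and the directory are passed as arguments. The sketch below uses hypothetical helpers resolve_inputs and missing_inputs and assumes one relative path per line.

```python
import os


def resolve_inputs(meta_lines, data_dir):
    """Join each non-blank meta-data entry onto the data directory."""
    paths = []
    for line in meta_lines:
        entry = line.strip()
        if entry:
            paths.append(os.path.join(data_dir, entry))
    return paths


def missing_inputs(paths):
    """Return the paths that do not exist, so the job can fail early."""
    return [path for path in paths if not os.path.exists(path)]
```

The meta-data file keeps the relationships explicit, while the submit file only needs to list input.meta and data_dir. Checking missing_inputs before any real work gives the user a clear error instead of a late failure.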

The worst approach is really a non-approach and involves hard-coding paths into the application. Arguably, such an application does not have a batch mode. It will fail when not run in its expected environment, which may simply mean by a user different from the developer, or on a new shared filesystem, or on an old shared filesystem with new mounts. Such applications should be avoided or modified to provide a batch mode.

Developers beware: you can turn near success with a meta-data file into a failure by hard-coding its path.

Takeaway –

For developers, an application that has a batch mode will parametrize all its inputs[1].

For users, beware of applications that operate on data that you have not provided.
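
For the developer takeaway, the footnoted inputs can be parametrized the same way as data files. The following is a small sketch, assuming a Python application; the --seed and --source-uri option names are hypothetical and chosen only for illustration.

```python
import argparse
import random


def parse_args(argv):
    """Every input, including the seed and any data source, is an argument."""
    parser = argparse.ArgumentParser(prog="batch_app")
    parser.add_argument("--input", required=True,
                        help="path to the data or meta-data file")
    parser.add_argument("--seed", type=int, required=True,
                        help="random seed, so runs can be reproduced")
    parser.add_argument("--source-uri", default=None,
                        help="optional database or URI to read data from")
    return parser.parse_args(argv)


def sample(seed, count):
    """Draw numbers from a generator seeded only by the given argument."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(count)]
```

With the seed on the command line, the submit file records everything needed to rerun the job.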

[0] Even if it is just a random seed.
[1] Database or URI connections to get data also matter.

