Posts Tagged ‘MRG’

Where does the memory go? The condor_shadow.

May 24, 2010

The condor_schedd’s architecture currently represents each running job as a condor_shadow process it can directly manage. I’m going to skip the pros & cons of threading vs processes, or multiplexing vs not. Those are great topics, but not of interest right now. The condor_schedd uses processes and does not multiplex, so when it is managing 10,000 running jobs it will have 10,000 running condor_shadow processes. Other than the shock of 10,000 processes, what are some immediate concerns? 1) CPU usage – actually very small; condor_shadow processes spend all their time waiting on I/O. 2) I/O usage – condor_shadows may spend all their time waiting on I/O, but when that I/O does happen it is typically small, maybe 10K of network and a few K of disk. 3) Memory usage – now that’s an interesting one.

Quick ways to look at memory usage?

1) top, but it’s not going to be as helpful as you might like.

$ top -n1 -p $(pidof condor_shadow-base)                          
...
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
10487 matt      20   0 10092 3756 3176 S  0.0  0.2   0:00.01 condor_shadow-b

This gives a general idea about the memory usage of the condor_shadow. It has about 3756KB of resident memory and is sharing about 3176KB.

What we really want to see is the amount of memory that cannot be shared between condor_shadow processes. This will give us a rough lower bound on how much memory is needed for 10,000 condor_shadows. top might suggest that number is 3756-3176K=580K, but that’s a common mistake. A quick web search will tell you why.

2) pmap, which will give a better idea of where the memory is going.

$ pmap -d $(pidof condor_shadow-base)
Address   Kbytes Mode  Offset           Device    Mapping
00110000     884 r-x-- 0000000000000000 0fd:00000 libstdc++.so.6.0.13
001ed000      16 r---- 00000000000dc000 0fd:00000 libstdc++.so.6.0.13
001f1000       8 rw--- 00000000000e0000 0fd:00000 libstdc++.so.6.0.13
001f3000      24 rw--- 0000000000000000 000:00000   [ anon ]
001f9000      44 r-x-- 0000000000000000 0fd:00000 libnss_files-2.11.1.so
00204000       4 r---- 000000000000a000 0fd:00000 libnss_files-2.11.1.so
00205000       4 rw--- 000000000000b000 0fd:00000 libnss_files-2.11.1.so
0063a000     116 r-x-- 0000000000000000 0fd:00000 libgcc_s-4.4.3-20100127.so.1
00657000       4 rw--- 000000000001c000 0fd:00000 libgcc_s-4.4.3-20100127.so.1
00733000       4 r-x-- 0000000000000000 000:00000   [ anon ]
00745000       8 r-x-- 0000000000000000 0fd:00000 libcom_err.so.2.1
00747000       4 rw--- 0000000000002000 0fd:00000 libcom_err.so.2.1
0074a000       8 r-x-- 0000000000000000 0fd:00000 libkeyutils-1.2.so
0074c000       4 rw--- 0000000000001000 0fd:00000 libkeyutils-1.2.so
0074f000     716 r-x-- 0000000000000000 0fd:00000 libkrb5.so.3.3
00802000      24 rw--- 00000000000b3000 0fd:00000 libkrb5.so.3.3
0082d000     276 r-x-- 0000000000000000 0fd:00000 libfreebl3.so
00872000       4 rw--- 0000000000045000 0fd:00000 libfreebl3.so
00873000      16 rw--- 0000000000000000 000:00000   [ anon ]
00880000     180 r-x-- 0000000000000000 0fd:00000 libgssapi_krb5.so.2.2
008ad000       4 rw--- 000000000002d000 0fd:00000 libgssapi_krb5.so.2.2
008b0000     168 r-x-- 0000000000000000 0fd:00000 libk5crypto.so.3.1
008da000       4 rw--- 000000000002a000 0fd:00000 libk5crypto.so.3.1
008dd000      32 r-x-- 0000000000000000 0fd:00000 libkrb5support.so.0.1
008e5000       4 rw--- 0000000000007000 0fd:00000 libkrb5support.so.0.1
008e8000      28 r-x-- 0000000000000000 0fd:00000 libcrypt-2.11.1.so
008ef000       4 r---- 0000000000007000 0fd:00000 libcrypt-2.11.1.so
008f0000       4 rw--- 0000000000008000 0fd:00000 libcrypt-2.11.1.so
008f1000     156 rw--- 0000000000000000 000:00000   [ anon ]
0091a000     328 r-x-- 0000000000000000 0fd:00000 libssl.so.1.0.0
0096c000      16 rw--- 0000000000051000 0fd:00000 libssl.so.1.0.0
00974000     120 r-x-- 0000000000000000 0fd:00000 ld-2.11.1.so
00992000       4 r---- 000000000001d000 0fd:00000 ld-2.11.1.so
00993000       4 rw--- 000000000001e000 0fd:00000 ld-2.11.1.so
00996000    1464 r-x-- 0000000000000000 0fd:00000 libc-2.11.1.so
00b04000       8 r---- 000000000016e000 0fd:00000 libc-2.11.1.so
00b06000       4 rw--- 0000000000170000 0fd:00000 libc-2.11.1.so
00b07000      12 rw--- 0000000000000000 000:00000   [ anon ]
00b0c000      88 r-x-- 0000000000000000 0fd:00000 libpthread-2.11.1.so
00b22000       4 r---- 0000000000015000 0fd:00000 libpthread-2.11.1.so
00b23000       4 rw--- 0000000000016000 0fd:00000 libpthread-2.11.1.so
00b24000       8 rw--- 0000000000000000 000:00000   [ anon ]
00b28000      12 r-x-- 0000000000000000 0fd:00000 libdl-2.11.1.so
00b2b000       4 r---- 0000000000002000 0fd:00000 libdl-2.11.1.so
00b2c000       4 rw--- 0000000000003000 0fd:00000 libdl-2.11.1.so
00b2f000     160 r-x-- 0000000000000000 0fd:00000 libm-2.11.1.so
00b57000       4 r---- 0000000000027000 0fd:00000 libm-2.11.1.so
00b58000       4 rw--- 0000000000028000 0fd:00000 libm-2.11.1.so
00b5b000      72 r-x-- 0000000000000000 0fd:00000 libz.so.1.2.3
00b6d000       4 rw--- 0000000000011000 0fd:00000 libz.so.1.2.3
00b7b000     112 r-x-- 0000000000000000 0fd:00000 libselinux.so.1
00b97000       4 r---- 000000000001b000 0fd:00000 libselinux.so.1
00b98000       4 rw--- 000000000001c000 0fd:00000 libselinux.so.1
00b9b000      80 r-x-- 0000000000000000 0fd:00000 libresolv-2.11.1.so
00baf000       4 ----- 0000000000014000 0fd:00000 libresolv-2.11.1.so
00bb0000       4 r---- 0000000000014000 0fd:00000 libresolv-2.11.1.so
00bb1000       4 rw--- 0000000000015000 0fd:00000 libresolv-2.11.1.so
00bb2000       8 rw--- 0000000000000000 000:00000   [ anon ]
042a3000    1476 r-x-- 0000000000000000 0fd:00000 libcrypto.so.1.0.0
04414000      80 rw--- 0000000000170000 0fd:00000 libcrypto.so.1.0.0
04428000      12 rw--- 0000000000000000 000:00000   [ anon ]
049f0000     188 r-x-- 0000000000000000 0fd:00000 libpcre.so.0.0.1
04a1f000       4 rw--- 000000000002e000 0fd:00000 libpcre.so.0.0.1
08048000    2496 r-x-- 0000000000000000 0fd:00001 condor_shadow-base
082b8000       8 rw--- 0000000000270000 0fd:00001 condor_shadow-base
082ba000      12 rw--- 0000000000000000 000:00000   [ anon ]
09c64000     396 rw--- 0000000000000000 000:00000   [ anon ]
b7819000      24 rw--- 0000000000000000 000:00000   [ anon ]
b7839000      12 rw--- 0000000000000000 000:00000   [ anon ]
bf930000      84 rw--- 0000000000000000 000:00000   [ stack ]
mapped: 10092K    writeable/private: 972K    shared: 0K

“writeable/private: 972K” is a better number to go by. It suggests that 10,000 shadows will require at least 972*10,000/2^20=9.269GB of memory.

So that’s where the memory is. Now why is it there?

I’ve been playing with Massif (Valgrind’s heap profiler) lately, so it seemed like a reasonable place to start. It can produce a usage graph,

    KB
335.8^                                                                      # 
     |                                                  ::::::::::@:::::::::#:
     |                                             @@:@::::: :::: @: : :::::#:
     |                                       ::::::@ :@::::: :::: @: : :::::#:
     |                                       ::::::@ :@::::: :::: @: : :::::#:
     |                                @::::::::::::@ :@::::: :::: @: : :::::#:
     |                         ::@::@:@:::: :::::::@ :@::::: :::: @: : :::::#:
     |                       @:: @::@:@:::: :::::::@ :@::::: :::: @: : :::::#:
     |                      :@:: @::@:@:::: :::::::@ :@::::: :::: @: : :::::#:
     |                     ::@:: @::@:@:::: :::::::@ :@::::: :::: @: : :::::#:
     |                    @::@:: @::@:@:::: :::::::@ :@::::: :::: @: : :::::#:
     |                    @::@:: @::@:@:::: :::::::@ :@::::: :::: @: : :::::#:
     |                  @@@::@:: @::@:@:::: :::::::@ :@::::: :::: @: : :::::#:
     |                  @ @::@:: @::@:@:::: :::::::@ :@::::: :::: @: : :::::#:
     |                 @@ @::@:: @::@:@:::: :::::::@ :@::::: :::: @: : :::::#:
     |                @@@ @::@:: @::@:@:::: :::::::@ :@::::: :::: @: : :::::#:
     |                @@@ @::@:: @::@:@:::: :::::::@ :@::::: :::: @: : :::::#:
     |              @@@@@ @::@:: @::@:@:::: :::::::@ :@::::: :::: @: : :::::#:
     |              @ @@@ @::@:: @::@:@:::: :::::::@ :@::::: :::: @: : :::::#:
     |             @@ @@@ @::@:: @::@:@:::: :::::::@ :@::::: :::: @: : :::::#:
   0 +----------------------------------------------------------------------->Mi
     0                                                                   17.22

Massif tells us about dynamic memory allocations, which in this case amount to 335KB. That is only a third of the 972K of private memory reported by pmap. This suggests that two-thirds of the private memory in the shadow may be from static allocations.

objdump is another tool I’ve been using lately to inspect ELF binaries.

$ objdump -t condor_shadow-base | grep -e "\.bss" -e "\.data" | sort -k5 | c++filt | tail
082bc900 l     O .bss	00000100              pe_logname.7591
082bca00 l     O .bss	00000100              pe_user.7592
082bc160 l     O .bss	00000100              priv_identifier::id
082b8d80 l     O .data	00000120              CondorEnvironList
082bbbc0 g     O .bss	000001c4              ConfigTab
082bbf20 l     O .bss	00000200              priv_history
082ba920 l     O .bss	00001000              answer.7409
082b9920 l     O .bss	00001000              answer.7437
082b89a0 g       .data	00000000              __data_start
082b89a0  w      .data	00000000              data_start

What’s this? Two 1K static buffers named answer.

* * *

After some code inspection it turned out that two functions in ckpt_name.c contained static char answer[MAXPATHLEN], where MAXPATHLEN is of course 1024. One of those functions, gen_exec_name, was dead code and easy to remove. The other, gen_ckpt_name, was active, with all but one of its call sites immediately strdup()’ing its result. After necessary modifications,

$ objdump -t condor_shadow-sans-answers | grep -e "\.bss" -e "\.data" | sort -k5 | c++filt | tail
082b9720 g     O .bss	000000e8              MaxLog
082b8b00 l     O .data	000000f0              SigNameArray
082ba900 l     O .bss	00000100              pe_logname.7591
082baa00 l     O .bss	00000100              pe_user.7592
082ba160 l     O .bss	00000100              priv_identifier::id
082b8d80 l     O .data	00000120              CondorEnvironList
082b9bc0 g     O .bss	000001c4              ConfigTab
082b9f20 l     O .bss	00000200              priv_history
082b89a0 g       .data	00000000              __data_start
082b89a0  w      .data	00000000              data_start
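
For the curious, the shape of the fix was to stop handing callers a pointer into a function-lifetime static buffer and return a heap copy instead, which most call sites were strdup()’ing anyway. A sketch, not the actual ckpt_name.c code (the real gen_ckpt_name signature and body differ):

   #include <cstdio>
   #include <cstring>

   #ifndef MAXPATHLEN
   #define MAXPATHLEN 1024
   #endif

   // Hypothetical signature; the real gen_ckpt_name takes different arguments.
   char *gen_ckpt_name(const char *dir, int cluster, int proc)
   {
       char answer[MAXPATHLEN];            // was: static char answer[MAXPATHLEN];
       snprintf(answer, sizeof(answer), "%s/cluster%d.proc%d", dir, cluster, proc);
       return strdup(answer);              // the caller now owns (and frees) the string
   }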

And while running,

$ pmap -d $(pidof condor_shadow-sans-answers) | grep -e rw -e private
...
082b8000       8 rw--- 0000000000270000 0fd:00001 condor_shadow-sans-answers
082ba000       4 rw--- 0000000000000000 000:00000   [ anon ]
089f0000     396 rw--- 0000000000000000 000:00000   [ anon ]
b7877000      24 rw--- 0000000000000000 000:00000   [ anon ]
b7897000      12 rw--- 0000000000000000 000:00000   [ anon ]
bff58000      84 rw--- 0000000000000000 000:00000   [ stack ]
mapped: 10084K    writeable/private: 964K    shared: 0K

We see an expected decrease in the shadow’s private address space. But, even though those answer buffers were the largest around, we’re aiming at small game. This is a ~0.8% improvement, or ~78MB out of our ~9.269GB. At least we now have a procedure for further reducing memory, even if it is a fraction of a percent at a time. In fact, with a little more digging there are more possibilities here,

$ objdump -t condor_shadow-sans-answers | grep -e "\.bss" -e "\.data" | sort -k5 | c++filt | \
   grep NULL_XACTION | tail -n1                     
082babbc l     O .bss	00000004              classad::NULL_XACTION

While classad::NULL_XACTION may be small, there are quite a few of them,

$ objdump -t condor_shadow-sans-answers | grep -e "\.bss" -e "\.data" | sort -k5 | c++filt | \
   awk '{print $6}' | sort | uniq -c | sort -n | tail
      2 DEFAULT_INDENT
      2 EvalBool(compat_classad::ClassAd*,
      2 SharedPortEndpoint::SharedPortEndpoint(char
      2 SharedPortEndpoint::UseSharedPort(MyString*,
      2 std::__ioinit
      2 SubsystemInfo::setClass(SubsystemInfoLookup
      3 vtable
     11 guard
    107 classad::NULL_XACTION

* * *

And it turns out classad::NULL_XACTION is not only defined in a header file that is included in many places, but it is completely unused. Note: Please don’t put data like NULL_XACTION in a header.
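
As an aside, when an object like this really is needed, the usual pattern is to declare it in the header and define it in exactly one source file, so only one copy ends up in .bss. A sketch, with a placeholder type since the real type of NULL_XACTION isn’t shown here:

   // xaction.h -- declaration only, no storage in each file that includes it
   namespace classad { extern int NULL_XACTION; }

   // xaction.cpp -- the single definition, one copy program-wide
   namespace classad { int NULL_XACTION = 0; }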

$ objdump -t condor_shadow-sans-answers-null_xaction | grep -e "\.bss" -e "\.data" | sort -k5 | c++filt | \
   awk '{print $6}' | sort | uniq -c | sort -n | tail
      1 ZZZZZ
      2 DEFAULT_INDENT
      2 EvalBool(compat_classad::ClassAd*,
      2 SharedPortEndpoint::SharedPortEndpoint(char
      2 SharedPortEndpoint::UseSharedPort(MyString*,
      2 std::__ioinit
      2 SubsystemInfo::setClass(SubsystemInfoLookup
      3 vtable
     11 guard

Since NULL_XACTION is not actually used, its removal should not change the private address space of the shadow, but it should shrink the overall binary and mapped memory size.

$ pmap -d $(pidof condor_shadow-sans-answers-null_xaction) | grep -e rw -e private
...
082b2000       8 rw--- 0000000000269000 0fd:00001 condor_shadow-sans-answers-null_xaction
082b4000       4 rw--- 0000000000000000 000:00000   [ anon ]
08a0a000     396 rw--- 0000000000000000 000:00000   [ anon ]
b77f5000      24 rw--- 0000000000000000 000:00000   [ anon ]
b7815000      12 rw--- 0000000000000000 000:00000   [ anon ]
bfd28000      84 rw--- 0000000000000000 000:00000   [ stack ]
mapped: 10060K    writeable/private: 964K    shared: 0K

Which appears to be the case,

$ size *
   text	   data	    bss	    dec	    hex	filename
2554154	   5032	  14360	2573546	 2744ea	condor_shadow-base
2554146	   5032	   6168	2565346	 2724e2	condor_shadow-sans-answers
2528350	   4648	   5724	2538722	 26bce2	condor_shadow-sans-answers-null_xaction

Ok, so that was just a diversion. Back to big game. From the original pmap dump, there are a number of libraries linked in with the shadow that are using memory of their own. One of those libraries, libcrypt, is using far more than the others.

008e8000      28 r-x-- 0000000000000000 0fd:00000 libcrypt-2.11.1.so
008ef000       4 r---- 0000000000007000 0fd:00000 libcrypt-2.11.1.so
008f0000       4 rw--- 0000000000008000 0fd:00000 libcrypt-2.11.1.so
008f1000     156 rw--- 0000000000000000 000:00000   [ anon ]

* * *

The existing build system for Condor, which uses imake(!), has a nasty habit of linking all binaries with all libraries. And as it turns out, crypt is not used by Condor code anymore. It has been replaced by OpenSSL’s DES implementation, likely many years ago.

Running without libcrypt,

$ pmap -d $(pidof condor_shadow-sans-answers-null_xaction-crypt) | grep -e rw -e private 
005e6000       4 rw--- 000000000000b000 0fd:00000 libnss_files-2.11.1.so
00657000       4 rw--- 000000000001c000 0fd:00000 libgcc_s-4.4.3-20100127.so.1
0073b000       8 rw--- 00000000000e0000 0fd:00000 libstdc++.so.6.0.13
0073d000      24 rw--- 0000000000000000 000:00000   [ anon ]
00747000       4 rw--- 0000000000002000 0fd:00000 libcom_err.so.2.1
0074c000       4 rw--- 0000000000001000 0fd:00000 libkeyutils-1.2.so
00802000      24 rw--- 00000000000b3000 0fd:00000 libkrb5.so.3.3
008ad000       4 rw--- 000000000002d000 0fd:00000 libgssapi_krb5.so.2.2
008da000       4 rw--- 000000000002a000 0fd:00000 libk5crypto.so.3.1
008e5000       4 rw--- 0000000000007000 0fd:00000 libkrb5support.so.0.1
0096c000      16 rw--- 0000000000051000 0fd:00000 libssl.so.1.0.0
00993000       4 rw--- 000000000001e000 0fd:00000 ld-2.11.1.so
00b06000       4 rw--- 0000000000170000 0fd:00000 libc-2.11.1.so
00b07000      12 rw--- 0000000000000000 000:00000   [ anon ]
00b23000       4 rw--- 0000000000016000 0fd:00000 libpthread-2.11.1.so
00b24000       8 rw--- 0000000000000000 000:00000   [ anon ]
00b2c000       4 rw--- 0000000000003000 0fd:00000 libdl-2.11.1.so
00b58000       4 rw--- 0000000000028000 0fd:00000 libm-2.11.1.so
00b6d000       4 rw--- 0000000000011000 0fd:00000 libz.so.1.2.3
00b98000       4 rw--- 000000000001c000 0fd:00000 libselinux.so.1
00bb1000       4 rw--- 0000000000015000 0fd:00000 libresolv-2.11.1.so
00bb2000       8 rw--- 0000000000000000 000:00000   [ anon ]
04414000      80 rw--- 0000000000170000 0fd:00000 libcrypto.so.1.0.0
04428000      12 rw--- 0000000000000000 000:00000   [ anon ]
04a1f000       4 rw--- 000000000002e000 0fd:00000 libpcre.so.0.0.1
082ad000       8 rw--- 0000000000264000 0fd:00001 condor_shadow-sans-answers-null_xaction-crypt
082af000       4 rw--- 0000000000000000 000:00000   [ anon ]
0908e000     396 rw--- 0000000000000000 000:00000   [ anon ]
b77bd000      20 rw--- 0000000000000000 000:00000   [ anon ]
b77dc000      12 rw--- 0000000000000000 000:00000   [ anon ]
bf820000      84 rw--- 0000000000000000 000:00000   [ stack ]
mapped: 9548K    writeable/private: 780K    shared: 0K

We see a huge improvement in both mapped and private memory usage. 780K/972K shows a ~19% improvement, or a reduction in memory of ~1.8GB over 10,000 running jobs.

BIG NOTE: The condor_shadow used above is from the 7.5 development series, which contains a problem around inclusion of a static parameter table on its heap. Using a version without the config table and some config minimization tricks yields similar results, with the private memory footprint going from 708K to 516K (516/708=0.728). All on my 32-bit laptop of course. Memory on a 64-bit architecture goes from ~1,400K to ~1,000K (1000/1400=0.714).

There is clearly still more to be done here. In addition to evaluating the other libraries linked into the shadow and eliminating many of the static buffers, the memory allocations reported by Massif can be explored. However, with the config table factored out, there is likely only about 100KB left to explore there.

Setup a two node Condor pool (Red Hat KB)

January 14, 2010

I have been meaning to write a short getting-started tutorial about installing a multiple-node Condor pool, but it just hasn’t happened yet. Lucky for everyone, Jon Thomas at Red Hat has written one. Check out his How do I install a basic two node MRG grid?, which covers installation on Linux very nicely.

How QMF Submission to Condor Could Work

March 23, 2009

I recently goofed and told someone that they could use the Qpid Management Framework (QMF) to submit jobs to Condor. What I meant to say is they can use AMQP. This is maybe understandable because QMF is a management framework built on top of AMQP, and MRG Grid already has many parts of Condor modeled in QMF, but submission via QMF could be very different from submission via AMQP.

QMF is a framework that allows for the modeling of objects that can publish information about themselves as well as respond to actions. All information and control is sent via AMQP messages.

Along with a quick correction to my comment, s/QMF/AMQP/, I went ahead and mocked up a QMF submission interface to make my comment almost true.

Existing Submission Interfaces

Condor already has a number of submission interfaces: the command-line tools, e.g. condor_submit; a GAHP interface, the condor_c-gahp; a SOAP interface, once termed Birdbath; the previously mentioned AMQP interface; and a few others. So, what’s one more? Or, why one more!?

Command-line Interface

The command-line interface is the default means for submitting jobs to Condor’s Scheduler, the condor_schedd. The condor_submit tool takes a job description file, performs some processing on it, and generates one or many ClassAds representing jobs, a.k.a. job ads. The condor_schedd only cares about job ads, and is never exposed to the job description file. condor_submit’s processing is sometimes shallow, e.g. executable = /bin/true becomes Cmd = "/bin/true", and sometimes not, e.g. getenv = TRUE becomes Environment = "<contents of env for condor_submit>". Sometimes the processing is even iterative in nature, e.g. queue 1000000 generates one million copies of the job constructed since the last queue command. The job description file is really a script in the condor_submit language that generates jobs. This makes the condor_submit tool thick, and the optimizations it performs require it to be tightly integrated with the condor_schedd.

SOAP Interface

The SOAP interface (starts slide 15) is very different from condor_submit. It is implemented within the condor_schedd, and exposes a transactional interface that accepts job ads as input. This means no high level job description file processing. It also means the thick condor_submit tool could be implemented on top of the SOAP interface. A job ad that might be submitted via SOAP would look like:

   [Owner="matt";
    Cmd="/bin/echo";
    Arguments="Hello there!";
    JobUniverse=5;
    ...;
    Requirements=TRUE]

This is a job ad that may have been created from a job description file like:

   executable = /bin/echo
   arguments = Hello there!
   requirements = TRUE
   queue 1

Pass that to condor_submit -dump job.ad to have a look.

A QMF Interface

So, what about a QMF submission interface? A nice aspect of the condor_submit interface is the script nature of the input. Unfortunately, there are some things that cannot be cleanly captured on the remote side of a submission, e.g. the getenv command, transfer_input_files, platform specific requirements bits, or working directory information. To some extent these reasons, and the desire to keep script processing out of the condor_schedd, are why the SOAP interface only deals in job ads. They are also a reason why a QMF interface should only handle job ads.

A benefit of the SOAP interface is, quite obviously, that it makes for a more natural programmatic interface. Unfortunately, it also exposes concepts and optimizations that are used by condor_submit and may not be needed by other submission programs, e.g. transactions and clusters.

A Submission

One thing that is an afterthought when using both interfaces is the notion of a submission, something that binds together jobs based on their overall purpose. Often a cluster is thought of as the means to group jobs. However, a single job description file can generate multiple clusters. Likewise, the SOAP interface can allow for grouping all jobs into a single cluster, but if one of the jobs is a DAGMan workflow then the point of the single cluster is violated. The use of clusters to associate jobs is broken.

Two things the QMF interface can do are: 1) simplify the operations required to perform a submission; and, 2) motivate its users to materialize the notion of a submission.

A QMF submission API

   submit, ClassAd -> Id : Submit a new job described by ClassAd

and,

   create, void -> Id : Create a transaction to submit data and a job ad
   send, Id x Data -> void : Spool data for a forthcoming job ad

This interface would be a great simplification over the SOAP API. It eliminates the necessity of a transaction and chunked data transfers, and it does not expose the notion of a cluster. Without a cluster, job association must be done in some other way. The natural way is via an attribute on job ads, including DAGMan jobs. All jobs in a submission could have an attribute Submission = "Monday Parameter Sweep Run, features: A, B, D", which would be +Submission = "Monday..." in a job description file.

This interface does not have some of the high level niceties of a condor_submit submission. However, the nicety that matters is not necessarily the ability to do many things with one line, e.g. queue 100, but having a well defined description of a job. Understanding that executable becomes the Cmd attribute is one thing; knowing that universe = vanilla becomes JobUniverse = 5 is significantly different. Shortcomings in the high level interface can be addressed with an improved specification for a job ad.

Submitting jobs with AMQP to a Condor based Grid

March 8, 2009

I recently took MRG Grid’s low-latency feature for a spin. It’s basically a way to use AMQP to submit jobs onto Condor managed nodes, but without going through normal scheduling, called matchmaking. The jobs are AMQP messages, and they don’t go to a Condor scheduler (a Schedd) to be matched with an execute node (a Startd). They go directly to the execute nodes, and they can go there really quickly.

This all works because Condor nodes can operate in a very autonomous fashion. Even when traditional jobs get matched with a node by the global scheduler (the Negotiator), the node always has the final say about whether it will run the matched job or not. This has all sorts of benefits in a distributed system, but for low-latency it means that an execute node can take jobs from the Condor scheduler or from some other source. That other source here is an AMQP message queue.

To get going I set up the low-latency feature as described in MRG Grid User Guide: Low Latency Scheduling with the help of condor_configure_node. Since I was doing this using MRG I also used the Qpid implementation of AMQP. Qpid comes in the qpidc package, and in qpidc-devel I found some really helpful examples – the direct example and request-response client were all I needed.

The code’s flow is pretty natural…

Set up the AMQP queues

From the low-latency config (carod.conf) I know that jobs are read from a queue named grid. So the first thing I need to do is set up a way to send messages to the grid queue.

   session.queueDeclare(arg::queue="grid");
   session.exchangeBind(arg::exchange="amq.direct",
                        arg::queue="grid",
                        arg::bindingKey="grid_key");
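
These snippets assume an already-open Qpid connection and session, in the style of the qpidc examples. A minimal sketch of that boilerplate (the broker host and port here are assumptions):

   #include <qpid/client/Connection.h>
   #include <qpid/client/Session.h>

   using namespace qpid::client;

   int main() {
       Connection connection;
       connection.open("127.0.0.1", 5672);        // broker host/port are assumptions
       Session session = connection.newSession();

       // ... declare queues, build and send the job message, run the listener ...

       connection.close();
       return 0;
   }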

Next, I want to make sure there’s a way for results to get back to me, so I set up my own unique queue where I can receive results.

   session.queueDeclare(arg::queue="my_grid_results",
                        // Automatically delete the queue when finished
                        arg::exclusive=true, arg::autoDelete=true);
   session.exchangeBind(arg::exchange="amq.direct",
                        arg::queue="my_grid_results",
                        arg::bindingKey="my_grid_results_key");

Construct the Condor job

With the queues all set up, I need to construct a job. In Condor all jobs are represented as ClassAds, which are basically bunches of (attribute, value) pairs. The low-latency code needs the job in a similar form, and gets it via the message’s application headers.

   message.getHeaders().setString("Cmd", "\"/bin/echo\"");
   message.getHeaders().setString("Arguments", "\"Hello Grid!\"");
   message.getHeaders().setString("Out", "\"job.out\""); // stdout
   message.getHeaders().setString("Err", "\"job.err\""); // stderr
   message.getHeaders().setString("Requirements", "TRUE");

Immediately, it looks like there is some crazy quoting going on here, and there is. In ClassAds, the values of attributes are typed in expected ways, e.g. ints, floats, strings, but there is also an expression type that can be evaluated to a boolean. An expression is something like FALSE, ClusterId > 15, ((KeyboardIdle > 2 * 60) && ((CurrentTime - JobStart) > 90)) or regexp(".*mf.*", Name). So, to distinguish between a string and an expression, strings are quoted and expressions are not, e.g. "/bin/echo" is a string and TRUE is an expression.
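
To make the quoting rule concrete, here is a tiny, purely hypothetical helper for wrapping string values; expressions would be passed through untouched:

   #include <string>

   // Hypothetical helper: ClassAd strings travel quoted, expressions do not.
   static std::string classadString(const std::string &value) {
       return "\"" + value + "\"";
   }

   // e.g. message.getHeaders().setString("Cmd", classadString("/bin/echo"));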

By the way, at first I thought I’d need to specify the MyType, TargetType, Owner, User, In, JobUniverse, … attributes found on traditional jobs, but I discovered MyType, TargetType, and In were superfluous, and JobUniverse has a sane default that can run /bin/echo, or essentially anything. I also didn’t care about the identity of the job owner for accessing resources, so I skipped the Owner and User attributes. The job just ran as the nobody user on the execute node.

One thing that I did on purpose was set the Requirements to TRUE, which means that job is happy to run anywhere. I can do that because I know all the nodes I setup to run jobs have /bin/echo.

Send and receive messages

There are a few final steps to set up the job message before it can be sent. It needs a unique message id, required by the low-latency feature.

   message.getMessageProperties().setMessageId(Uuid(true));

And, it needs to pass along information about where to send results.

   message.getMessageProperties().setReplyTo(ReplyTo("amq.direct",
                                                     "my_grid_results_key"));

Finally, the message can be sent.

   message.getDeliveryProperties().setRoutingKey("grid_key");
   session.messageTransfer(arg::content=message,
                           arg::destination="amq.direct");

This got the job out onto the network, but didn’t actually give me a way to get the results back. For that I setup a function to receive messages on the my_grid_results queue.

void
Listener::received(Message &message) {
   const MessageProperties properties = message.getMessageProperties();
   const FieldTable headers = properties.getApplicationHeaders();

   const string state = headers.getAsString("JobState");
   const int status = headers.getAsInt("JobStatus");
   const int exitCode = headers.getAsInt("ExitCode");
   const string exitBySignal = headers.getAsString("ExitBySignal");

   std::cout
      << "Response: " << properties.getMessageId() << std::endl
      << "JobState: " << state << std::endl
      << "JobStatus: " << status << std::endl
      << "ExitCode: " << exitCode << std::endl
      << "ExitBySignal: " << exitBySignal << std::endl
      << "Is Body Empty? " << (message.getData().empty() ? "Yes" : "No") << std::endl;
//    << headers << std::endl;

   if ("\"Exited\"" == state && 4 == status) {
         // There were some results returned, they're stored in the
         // message body as a zip archive
      if (!message.getData().empty()) {
         std::ofstream out;
         out.open("job.results");
         out << message.getData();
         out.close();
      }
      subscriptions.cancel(message.getDestination());
   }
}

The function is pretty simple. It prints out information about the messages on the my_grid_results queue, and when it sees a message that represents the completion of my job it writes out the results and stops listening.
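
For reference, the received method above belongs to a small listener class in the style of the qpidc examples. A minimal sketch of the declaration it assumes, with only the members the snippet uses:

   #include <qpid/client/SubscriptionManager.h>
   #include <qpid/client/MessageListener.h>
   #include <qpid/client/Message.h>

   using namespace qpid::client;

   class Listener : public MessageListener {
      SubscriptionManager &subscriptions;    // used to cancel the subscription when done
     public:
      Listener(SubscriptionManager &subs) : subscriptions(subs) {}
      virtual void received(Message &message);   // body shown above
   };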

To get the receive function called when messages come in, it needs to be set up and run. I started it running after I sent the job message.

   SubscriptionManager subscriptions(session);
   Listener listener(subscriptions);
   subscriptions.subscribe(listener, "my_grid_results");
   subscriptions.run();

See it all work

That was basically it. I compiled the program, ran it, and in short order found the job.results file created.

$ ./a.out
Response: 00000000-0000-0000-0000-000000000000
JobState: 
JobStatus: 2
ExitCode: 0
ExitBySignal: 
Is Body Empty? Yes
Response: 00000000-0000-0000-0000-000000000000
JobState: "Exited"
JobStatus: 4
ExitCode: 0
ExitBySignal: 
Is Body Empty? No
$ unzip -l job.results   
Archive:  job.results
  Length     Date   Time    Name
 --------    ----   ----    ----
        0  03-06-09 13:33   job.err
       12  03-06-09 13:33   job.out
 --------                   -------
       12                   2 files

What happens when my job completes is that all the files in its working directory are wrapped up in a zip archive and sent back in the body of a message that has a JobState header with a value of "Exited" and a JobStatus of 4. JobStatus 4 means Completed in the Condor world; the earlier, empty-bodied message with JobStatus 2 (Running) arrived while the job was still executing.

That’s pretty much it. The full example is in LL.cpp.