This was a simple walk to get a feel for what a single, out of the box, 7.6 condor_schedd could handle for a job load. There were about 1,500 dynamic slots and all jobs were 5 second sleeps.
It turns out that without errors in the system, a single Schedd can sustain at least a rate of 55 jobs per second.
Graphs courtesy Cumin and Ernie Allen‘s recent visualization of Erik Erlandson‘s new Schedd stats. This is a drill-down into a Scheduler.
Here’s what happened. I started submitting 25 jobs (queue 25) every 5 seconds. You can’t see this unfortunately, it is off the left side of the graphs. The submission, start and completion rates were all equal at 5 per second. Every five/ten/fifteen minutes, when I felt like it, I ramped that up a bit, by increasing the jobs per submit (queue 50 then 100 then 150 then 200) and the rate of submission (every 5 seconds then 2 seconds then 1). The scaling up matched my expectations. At 50 jobs every second, I saw 10 jobs submitted/started/completed per second. At 100 job, the rates were 20 per second. I eventually worked it up to rates about 50-55 per second.
Then we got to the 30 minute mark in the graphs. B shows that Shadows started failing. I let this run for about 10 minutes, the Schedd’s rates fluctuated down to between 45-50 jobs per second, and then kicked up the submission rate. The spike in submissions is quite visible to the right of A in the Rates graph.
At this point the Schedd was sustaining about 45 jobs per second and the Shadow exception rate was fairly sustained. I decided to kill off the submissions, also quite visible. The Schedd popped back up to 55 jobs per second and finished off its backlog.
A bit about the errors, simple investigation: condor_status -sched -long | grep ExitCode, oh a bunch of 108s; grep -B10 “STATUS 108” /var/log/condor/ShadowLog, oh a bunch of network issues and some evictions; pull out the hosts the Shadows were connecting to, oh mostly a couple nodes that were reported as being flakey; done.
Throughout this entire walk, the Mean start times turned out to be an interesting graph. It shows two things: 0) Mean time to start cumulative – the mean time between a job is queued to when it first starts running, over the lifetime of the Schedd; and, 1) Mean time to start – the same metric, but over an adjustable window, defaulting to 5 minutes. Until the exceptions started and when I blew the submissions out, around C, the mean queue time/wait time/time to start was consistently between 4 and 6 seconds. I did not look at the service rate on the back side of the job’s execution, e.g. how long it waited before termination was recognized, mail was sent, etc.
That’s about it. Though, remember this was done with about 1,500 slots. I didn’t vary the number of slots, say to see if the Schedd would have issues sustaining more than 1,500 concurrent jobs. I also didn’t do much in the way of calculations to compare the mean start time to the backlog (idle job count).
Also, if anyone tries this themselves, I would suggest
notification=never in your submissions. It will prevent an email getting sent for each job, will save the Schedd and Shadow a good amount of work, and in my case would have resulted in a 389MB reduction in email.