Max Walltimes

Note: The reduction to 7 days was made January 14, 2013. Users were notified half a year in advance via email. Group and individual meetings were held. Notifications were posted for months on our website, Google+, Twitter, and MOTD for ssh logins.

FSL would like to reduce the maximum walltimes for jobs on its clusters. Doing so benefits users in several ways, which are described below.

To clarify, when we say "reduce the maximum walltime" we mean that we want to reduce the maximum time that each job is allowed to run.

Industry Standard

Our counterparts at many universities and national labs specify a maximum walltime of 1 or 3 days. It is very rare to hear of sites with walltimes longer than three days, while some only allow one day. The only sites we are aware of with longer walltimes (like us) are usually desperate to figure out how to get their walltimes significantly reduced.

Current settings

FSL currently allows jobs to run for up to 16 days (over half a month) on normal nodes and 5 days on bigmem nodes.

Goals

Current goal: seven days
Longer term goal: three days

FSL will not just change this setting on you. We will work with you individually as needed.

We realize that three days seems unrealistically low to some of our users. We assure you that users at many other universities and national labs can and do work within this constraint or even a 24 hour limit.

Limits on remaining cputime encourage shorter walltimes

Limits are imposed based on remaining cputime of running jobs. This is calculated by summing (time_remaining * cpus_allocated) for each job running in the account. As time goes on, time_remaining per job drops. Eventually more jobs will be able to start.

Requesting 16 days hurts you because time_remaining stays larger than it would with a shorter request. After 1 day of running, a job that requested 16 days still has 15 days remaining. If that job had requested 2 days, the remaining time would be 1 day. A shorter request therefore lets you use more CPUs at the same time, subject to system availability.
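The calculation above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not FSL's or Slurm's actual code: the function names and the limit of 1,000 CPU-days are hypothetical and chosen purely to make the numbers easy to follow.

```python
MINUTES_PER_DAY = 24 * 60


def remaining_cputime_minutes(jobs):
    """Sum time_remaining * cpus_allocated over an account's running jobs.

    Each job is a (minutes_remaining, cpus_allocated) pair.
    """
    return sum(minutes_remaining * cpus for minutes_remaining, cpus in jobs)


def max_concurrent_cpus(limit_cpu_minutes, walltime_days):
    """CPUs that fit under the limit if every job has only just started
    with the given requested walltime (the worst case for the account)."""
    return limit_cpu_minutes // (walltime_days * MINUTES_PER_DAY)


# Hypothetical limit, for illustration only: 1,000 CPU-days.
LIMIT = 1000 * MINUTES_PER_DAY

# One 8-CPU job, one day into its run.
requested_16_days = remaining_cputime_minutes([(15 * MINUTES_PER_DAY, 8)])
requested_2_days = remaining_cputime_minutes([(1 * MINUTES_PER_DAY, 8)])

# CPUs the account could hold at once under the hypothetical limit.
cpus_at_16_days = max_concurrent_cpus(LIMIT, 16)
cpus_at_2_days = max_concurrent_cpus(LIMIT, 2)
```

Under these assumed numbers, the job that requested 16 days counts 15 times as much against the limit as the one that requested 2 days, and the same limit admits 62 CPUs of 16-day jobs versus 500 CPUs of 2-day jobs.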

The remaining cputime limit (aka GrpCPURunMins in Slurm):

  1. helps create more job turnover (explained in the next section)
  2. prevents a single account from swamping the system
  3. encourages shorter walltimes
  4. allows users with shorter walltimes to use more of the system
  5. allows users with longer walltimes to run, but on fewer CPUs at once, since longer walltimes have a negative effect on others

Unfortunately we cannot set this limit on a per-account, per-resource basis. This means that the 128 64-GB nodes of m7, for example, may be almost completely used by a single user for almost a week. We hope to address this in the future.

Higher job turnover rates benefit everyone

Frequent job turnover is a big benefit for every user of the system. Your jobs will start much more quickly because other jobs are ending more frequently. That opens up more resources for your pending jobs to use.

Consider the following situation where 900 nodes are filled with jobs of a uniform job length (specified below). Each job uses a whole node and was started in a staggered fashion. The jobs of each length below would result in the following number of nodes being freed up per hour.

Walltime    Nodes freed up per hour
16 days     2.3
7 days      5.4
3 days      12.5
1 day       37.5

If you're at the top of the queue or hoping to get to the top of the queue, you want the walltime (first column) to be low and the number of nodes freed up per hour (second column) to be high. The sooner that nodes free up, the sooner your job can start. Whether you're waiting for one node or one hundred nodes, the quicker the turnover the better.

Obviously there will be variation in job runtimes no matter what the maximum walltime is set to. These numbers are both worst case in one way (all jobs use the maximum time allowed) and best case in another (even distribution of job starts/ends across time). If a 100-node job finishes and 100 single-node 16-day jobs are the only jobs queued up, those jobs will start immediately and not in a staggered fashion.
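The turnover figures for the 900-node scenario follow from dividing the node count by the job length in hours. The short Python sketch below reproduces them under the same idealized assumptions (every job uses one whole node, runs for the full walltime, and job starts are evenly staggered); the function name is ours and is used only for illustration.

```python
def nodes_freed_per_hour(total_nodes, walltime_days):
    """Average nodes freed per hour when every node runs back-to-back,
    evenly staggered, whole-node jobs of the given length."""
    return total_nodes / (walltime_days * 24)


# Walltime in days -> nodes freed up per hour, rounded to one decimal place.
turnover = {days: round(nodes_freed_per_hour(900, days), 1)
            for days in (16, 7, 3, 1)}
```

Going from 16 days to 3 days raises turnover from about 2.3 to 12.5 nodes per hour, which is more than a fivefold improvement.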

Long walltimes result in "unfairness"

FSL's scheduling system uses a "fairshare" mechanism to decide what jobs to run next when resources are available. Fairshare does a good job of making sure everyone gets their "fair" share of resources. However, fairshare only matters if there are jobs from multiple users in the queue.

What happens when there are 300 nodes available and a user submits 300 single-node jobs that specify a 16 day walltime? The jobs start! It doesn't matter if the user has used 99% of the system in the past since there were free resources with no competition. The problem is that these 300 nodes are now occupied for half a month. If 50 other users all try to submit jobs later that day, about 1/3 of FSL's resources are unavailable to them for the next 16 days, even if they would have had much higher priority.

In order for this problem to occur, all that has to happen is for a very active user to queue up hundreds of 16 day jobs. They may be assigned the lowest job priority possible, but the jobs will start if there is no other competition in the queue. This means that a slight lull in job submissions by other users guarantees that these 16 day jobs will start and be able to occupy much of the system for the next half a month!

This is a frequent occurrence. It is partially mitigated by capping usage so that no user or group can overwhelm the system. Capping is not an efficient use of resources, however, because our clusters will sit partially idle while some users have hundreds of jobs ready that aren't allowed to start.

Slurm fairshare

Slurm doesn't charge for an entire job all at once; usage is accounted minute by minute throughout the life of the job. This means that a new user or one who hasn't submitted jobs in months can log in, submit hundreds of 16 day jobs, and have all those jobs be right at the top of the queue priority-wise. Until some of those 16 day jobs have run for a long time, that user's priority will stay sky high. That means they can potentially have many days of unimpeded job submission before their priority drops significantly. This can be heavily mitigated by enforcing shorter walltimes.
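To see why minute-by-minute accounting keeps priority high for so long, consider how little of a long job's eventual usage has been charged early in its run. The Python sketch below is a deliberately simplified model, not Slurm's implementation: it ignores the decay of past usage and every other priority factor, and the function names and the 16-CPU job size are hypothetical.

```python
MINUTES_PER_DAY = 24 * 60


def accrued_cpu_minutes(cpus, elapsed_minutes, walltime_minutes):
    """Usage charged so far when usage accrues one minute at a time."""
    return cpus * min(elapsed_minutes, walltime_minutes)


def fraction_charged(elapsed_minutes, walltime_minutes):
    """Share of a job's eventual usage charged so far, assuming the job
    runs for its full requested walltime."""
    return min(elapsed_minutes, walltime_minutes) / walltime_minutes


# Hypothetical 16-CPU job, one day into its run.
usage_after_one_day = accrued_cpu_minutes(16, MINUTES_PER_DAY, 16 * MINUTES_PER_DAY)
share_16_day_job = fraction_charged(MINUTES_PER_DAY, 16 * MINUTES_PER_DAY)
share_3_day_job = fraction_charged(MINUTES_PER_DAY, 3 * MINUTES_PER_DAY)
```

In this simplified model, a 16 day job has been charged only 1/16 of its eventual usage after a full day, while a 3 day job has been charged 1/3, so recorded usage catches up with actual commitments much sooner when walltimes are short.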

Node failures and power outages

If a node fails for whatever reason, your job will die. If your job uses multiple nodes, the chances of a node failure are much higher. If your job dies and you aren't using checkpointing, all work that occurred until that point is wasted. With a shorter maximum walltime, the most you can lose is almost seven days of work rather than almost 16 days.

Long walltimes cause maintenance headaches

Without belaboring the point, let's just say that it's difficult to schedule emergency maintenance and have to wait 16 days for it, especially if it's externally imposed by Physical Facilities needing to replace a problematic breaker panel or something similar.

A common concern for users with larger jobs

"I have such a hard time getting scheduled that once I'm scheduled I want to run for as long as possible." This is a common, valid concern for those who run larger jobs. Your situation will improve as everyone has shorter jobs since there will be higher turnover. Nodes will free up sooner than they currently do, thus allowing your jobs to start sooner.

A happy medium

It is also possible for your jobs to be too short. Launching large numbers of jobs that run for only a few minutes is not ideal. Consider grouping your work so that each job runs for an hour or more.
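One way to group short tasks is to pack them greedily into batches until each batch's estimated runtime reaches a target. The Python sketch below assumes you already have a runtime estimate in minutes for each task; the function name, the 60 minute target, and the sample task list are illustrative only and are not part of any FSL tool.

```python
def group_tasks(task_minutes, target_minutes=60):
    """Greedily pack tasks, in order, into batches whose estimated runtime
    reaches at least target_minutes. The final batch may fall short."""
    batches, current, total = [], [], 0
    for minutes in task_minutes:
        current.append(minutes)
        total += minutes
        if total >= target_minutes:
            # This batch is long enough; close it and start a new one.
            batches.append(current)
            current, total = [], 0
    if current:
        # Leftover tasks form a final, possibly shorter, batch.
        batches.append(current)
    return batches


# Thirty hypothetical 5 minute tasks become three jobs instead of thirty.
batches = group_tasks([5] * 30)
batch_sizes = [len(batch) for batch in batches]
```

Each batch would then be submitted as a single job that runs its tasks one after another, with the requested walltime set a little above the batch's estimated total.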

We're in this together

FSL asks that you consider how you can do your part to reduce your jobs' runtimes. Everyone will benefit from such an effort.

Not everyone will have an easy time accomplishing this goal. We have staff who are happy to help you reduce your walltimes. Please also see the ideas below.

How to reduce your jobs' runtimes

Please see the article How to Reduce Job Runtimes for ideas.