Manual Reference Pages  -  NONSTOP.CONF (5)

NAME

nonstop.conf - Slurm configuration file for fault-tolerant computing.

DESCRIPTION

nonstop.conf is an ASCII file which describes the configuration used for fault-tolerant computing with Slurm using the optional slurmctld/nonstop plugin. This plugin provides a means for users to notify Slurm of nodes they believe are suspect, replace a job’s failing or failed nodes, and extend a job’s time limit in response to failures. The file location can be modified at system build time using the DEFAULT_SLURM_CONF parameter or at execution time by setting the SLURM_CONF environment variable. The file will always be located in the same directory as the slurm.conf file.

Parameter names are case insensitive. Any text following a "#" in the configuration file is treated as a comment through the end of that line. Changes to the configuration file take effect upon restart of Slurm daemons, daemon receipt of the SIGHUP signal, or execution of the command "scontrol reconfigure" unless otherwise noted. The configuration parameters available include:

BackupAddr
  Communications address used for the backup slurmctld daemon. This can either be a hostname or IP address. This value would typically be identical to the value of BackupAddr in the slurm.conf file.

ControlAddr
  Communications address used for the slurmctld daemon. This can either be a hostname or IP address. This value would typically be identical to the value of ControlAddr in the slurm.conf file.

Debug
  A number indicating the level of additional logging desired for the plugin. The default value is zero, which generates no additional logging.

HotSpareCount
  This identifies how many nodes in each partition should be maintained as spare resources. When a job fails, this pool of resources will be depleted and then replenished when possible using idle resources. The value should be a comma-delimited list of partition:count pairs, with each partition name separated from its node count by a colon (for example, batch:6,interactive:0).

MaxSpareNodeCount
  This identifies the maximum number of nodes any single job may replace through the job’s entire lifetime. This could prevent a single job from causing all of the nodes in a cluster to fail. By default, there is no maximum node count.

Port
  Port used for communications. The default value is 6820.

TimeLimitDelay
  If a job requires replacement resources and none are immediately available, then permit a job to extend its time limit by the length of time required to secure replacement resources up to the number of minutes specified by TimeLimitDelay. This option will only take effect if no hot spare resources are available at the time replacement resources are requested. This time limit extension is in addition to the value calculated using TimeLimitExtend. The default value is zero (no time limit extension). The value may not exceed 65533 seconds.

TimeLimitDrop
  Specifies the number of minutes that a job can extend its time limit for each failed or failing node removed from the job’s allocation. The default value is zero (no time limit extension). The value may not exceed 65533 seconds.

TimeLimitExtend
  Specifies the number of minutes that a job can extend its time limit for each replaced node. The default value is zero (no time limit extension). The value may not exceed 65533 seconds.
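
As an illustration of how the three TimeLimit parameters might combine, here is a minimal Python sketch. It is not taken from the plugin source. The helper name time_limit_extension and its arguments are hypothetical, the additive combination is an assumption based on the descriptions above, and all times are assumed to share a single unit.

```python
MAX_TIME_VALUE = 65533  # upper bound the descriptions give for each TimeLimit* value


def time_limit_extension(wait, replaced, dropped, delay=0, extend=0, drop=0):
    """Return the total extension a job could request (hypothetical helper).

    wait     -- time spent waiting for replacement nodes
    replaced -- number of nodes that were replaced
    dropped  -- number of failed or failing nodes removed from the allocation
    delay, extend, drop -- TimeLimitDelay, TimeLimitExtend, TimeLimitDrop
    All times are assumed to be expressed in the same unit.
    """
    for name, value in (("TimeLimitDelay", delay),
                        ("TimeLimitExtend", extend),
                        ("TimeLimitDrop", drop)):
        if not 0 <= value <= MAX_TIME_VALUE:
            raise ValueError(f"{name}={value} is outside 0..{MAX_TIME_VALUE}")
    # Waiting time counts only up to TimeLimitDelay.
    waited = min(wait, delay)
    # Assumption: the per-node allowances add up linearly.
    return waited + extend * replaced + drop * dropped


print(time_limit_extension(wait=45, replaced=2, dropped=1,
                           delay=30, extend=20, drop=10))
```

With every parameter left at its default of zero, the function returns zero regardless of the other arguments, matching the "no time limit extension" default.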

UserDrainAllow
  This identifies a comma delimited list of user names or user IDs of users who are authorized to drain nodes they believe are failing. Specify a value of "ALL" to permit any user to drain nodes. By default, no users may drain nodes using this interface.

UserDrainDeny
  This identifies a comma delimited list of user names or user IDs of users who are NOT authorized to drain nodes they believe are failing. Specifying a value for UserDrainDeny implicitly allows all other users to drain nodes (sets the value of UserDrainAllow to "ALL").
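
The interaction between UserDrainAllow and UserDrainDeny can be sketched in Python as follows. This is an illustrative reading of the two descriptions above, not the plugin's code. The helper names are hypothetical, "ALL" is assumed to be matched exactly as written, and when both parameters are set the deny list is assumed to govern.

```python
def _split(value):
    """Split a comma-delimited value into a set of trimmed, non-empty entries."""
    return {item.strip() for item in (value or "").split(",") if item.strip()}


def may_drain(name, uid, allow=None, deny=None):
    """Decide whether a user may drain nodes (hypothetical helper).

    name, uid   -- the user's name and numeric ID; either form may be listed
    allow, deny -- raw UserDrainAllow / UserDrainDeny values, or None if unset
    """
    identities = {name, str(uid)}
    denied = _split(deny)
    if denied:
        # Setting UserDrainDeny implicitly allows every user not listed.
        return not (identities & denied)
    allowed = _split(allow)
    # With neither parameter set, `allowed` is empty and nobody may drain.
    return "ALL" in allowed or bool(identities & allowed)


print(may_drain("adam", 1001, allow="adam,brenda"))
print(may_drain("carl", 1002, allow="adam,brenda"))
```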

EXAMPLE

#
# Sample nonstop.conf file
# Date: 12 Feb 2013
#
ControlAddr=12.34.56.78
BackupAddr=12.34.56.79
Port=1234
#
HotSpareCount=batch:6,interactive:0
MaxSpareNodeCount=4
TimeLimitDelay=30
TimeLimitExtend=20
TimeLimitDrop=10
UserDrainAllow=adam,brenda
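
For readers who want to inspect such a file from a script, the following Python sketch applies the syntax rules from the DESCRIPTION section (case-insensitive parameter names, "#" comments) and splits a HotSpareCount value. It is not Slurm's own parser. The function names are hypothetical, and one Name=Value pair per line is assumed.

```python
def parse_nonstop_conf(text):
    """Parse nonstop.conf-style text into a dict keyed by lower-cased name."""
    params = {}
    for line in text.splitlines():
        # Everything after "#" is a comment through the end of the line.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        name, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"expected Name=Value, got {line!r}")
        # Parameter names are case insensitive, so normalize them.
        params[name.strip().lower()] = value.strip()
    return params


def parse_hot_spare_count(value):
    """Turn 'batch:6,interactive:0' into {'batch': 6, 'interactive': 0}."""
    counts = {}
    for pair in value.split(","):
        partition, _, count = pair.partition(":")
        counts[partition.strip()] = int(count)
    return counts


SAMPLE = """\
# Sample nonstop.conf file
controladdr=12.34.56.78   # lower-case name, trailing comment
Port=1234
HotSpareCount=batch:6,interactive:0
"""

conf = parse_nonstop_conf(SAMPLE)
print(conf["port"])
print(parse_hot_spare_count(conf["hotsparecount"]))
```

Values are kept as strings so that each parameter can be interpreted separately, as parse_hot_spare_count does for HotSpareCount.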

COPYING

Copyright (C) 2013-2014 SchedMD LLC. All rights reserved.

Slurm is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

SEE ALSO

slurm.conf(5)


April 2015 NONSTOP.CONF (5) Slurm Configuration File
