scrun - an OCI runtime proxy for Slurm.
scrun [GLOBAL OPTIONS...] create [CREATE OPTIONS] <container-id>
- Prepares a new container with container-id in the current working directory.
scrun [GLOBAL OPTIONS...] start <container-id>
- Request to start and run the container in a job.
scrun [GLOBAL OPTIONS...] state <container-id>
- Output the OCI-defined JSON state of the container.
scrun [GLOBAL OPTIONS...] kill <container-id> [signal]
- Send signal (default: SIGTERM) to the container.
scrun [GLOBAL OPTIONS...] delete [DELETE OPTIONS] <container-id>
- Release any resources held by the container locally and remotely.
Perform OCI runtime operations against container-id per:
https://github.com/opencontainers/runtime-spec/blob/main/runtime.md
scrun attempts to mimic the command-line behavior of crun(1) and
runc(1) as closely as possible in order to maintain drop-in
replacement compatibility with docker(1) and podman(1).
All command-line arguments of crun(1) and runc(1) will be
accepted for compatibility but may be ignored depending on their
applicability.
scrun is an OCI runtime proxy for Slurm. It acts as a
common interface to docker(1) or podman(1) to allow container
operations to be executed under Slurm as jobs. scrun will accept all
commands as an OCI-compliant runtime but will proxy the container and all
STDIO to Slurm for scheduling and execution. The containers will be executed
remotely on Slurm compute nodes according to the settings in
oci.conf(5).
scrun requires all containers to be OCI image compliant
per:
https://github.com/opencontainers/image-spec/blob/main/spec.md
On successful operation, scrun will return 0. For any other
condition, scrun will return a non-zero value to denote an
error.
- --cgroup-manager
- Ignored.
-
- --debug
- Activate debug level logging.
-
- -f
<slurm_conf_path>
- Use specified slurm.conf for configuration.
Default: sysconfdir from configure during compilation
-
- --usage
- Show quick help on how to call scrun
-
- --log-format=<json|text>
- Optionally select the format for logging. May be "json" or
"text".
Default: text
-
- --root=<root_path>
- Path to the spool directory for communication sockets and temporary directories
and files. This should be a tmpfs and should be cleared on reboot.
Default: /run/user/{user_id}/scrun/
-
- --rootless
- Ignored. All scrun commands are always rootless.
-
- --systemd-cgroup
- Ignored.
-
- -v
- Increase logging verbosity. Multiple -v's increase verbosity.
-
- -V, --version
- Print version information and exit.
-
- --force
- Ignored. All delete requests are forced and will kill any running
jobs.
-
- SLURM_*_HET_GROUP_#
- For a heterogeneous job allocation, the environment variables are set
separately for each component.
-
- SLURM_CLUSTER_NAME
- Name of the cluster on which the job is executing.
-
- SLURM_CONTAINER
- OCI Bundle for job.
-
- SLURM_CONTAINER_ID
- OCI id for job.
-
- SLURM_CPUS_PER_GPU
- Number of CPUs requested per allocated GPU.
-
- SLURM_CPUS_PER_TASK
- Number of CPUs requested per task.
-
- SLURM_DIST_PLANESIZE
- Plane distribution size. Only set for plane distributions.
-
- SLURM_DISTRIBUTION
- Distribution type for the allocated jobs.
-
- SLURM_GPU_BIND
- Requested binding of tasks to GPU.
-
- SLURM_GPU_FREQ
- Requested GPU frequency.
-
- SLURM_GPUS
- Number of GPUs requested.
-
- SLURM_GPUS_PER_NODE
- Requested GPU count per allocated node.
-
- SLURM_GPUS_PER_SOCKET
- Requested GPU count per allocated socket.
-
- SLURM_GPUS_PER_TASK
- Requested GPU count per allocated task.
-
- SLURM_HET_SIZE
- Set to count of components in heterogeneous job.
-
- SLURM_JOB_ACCOUNT
- Account name associated with the job allocation.
-
- SLURM_JOB_CPUS_PER_NODE
- Count of CPUs available to the job on the nodes in the allocation, using
the format CPU_count[(xnumber_of_nodes)][,CPU_count
[(xnumber_of_nodes)] ...]. For example:
SLURM_JOB_CPUS_PER_NODE='72(x2),36' indicates that on the first and second
nodes (as listed by SLURM_JOB_NODELIST) the allocation has 72 CPUs, while
the third node has 36 CPUs. NOTE: The select/linear plugin
allocates entire nodes to jobs, so the value indicates the total count of
CPUs on allocated nodes. The select/cons_tres plugin allocates
individual CPUs to jobs, so this number indicates the number of CPUs
allocated to the job.
-
- SLURM_JOB_END_TIME
- The UNIX timestamp for a job's projected end time.
-
- SLURM_JOB_GPUS
- The global GPU IDs of the GPUs allocated to this job. The GPU IDs are not
relative to any device cgroup, even if devices are constrained with
task/cgroup. Only set in batch and interactive jobs.
-
- SLURM_JOB_ID
- The ID of the job allocation.
-
- SLURM_JOB_NODELIST
- List of nodes allocated to the job.
-
- SLURM_JOB_NUM_NODES
- Total number of nodes in the job allocation.
-
- SLURM_JOB_PARTITION
- Name of the partition in which the job is running.
-
- SLURM_JOB_QOS
- Quality Of Service (QOS) of the job allocation.
-
- SLURM_JOB_RESERVATION
- Advanced reservation containing the job allocation, if any.
-
- SLURM_JOB_START_TIME
- UNIX timestamp for a job's start time.
-
- SLURM_MEM_BIND
- Bind tasks to memory.
-
- SLURM_MEM_BIND_LIST
- Set to bit mask used for memory binding.
-
- SLURM_MEM_BIND_PREFER
- Set to "prefer" if the SLURM_MEM_BIND option includes the
prefer option.
-
- SLURM_MEM_BIND_SORT
- Sort free cache pages (run zonesort on Intel KNL nodes)
-
- SLURM_MEM_BIND_TYPE
- Set to the memory binding type specified with the SLURM_MEM_BIND
option. Possible values are "none", "rank",
"map_map", "mask_mem" and "local".
-
- SLURM_MEM_BIND_VERBOSE
- Set to "verbose" if the SLURM_MEM_BIND option includes
the verbose option. Set to "quiet" otherwise.
-
- SLURM_MEM_PER_CPU
- Minimum memory required per usable allocated CPU.
-
- SLURM_MEM_PER_GPU
- Requested memory per allocated GPU.
-
- SLURM_MEM_PER_NODE
- Specify the real memory required per node.
-
- SLURM_NTASKS
- Specify the number of tasks to run.
-
- SLURM_NTASKS_PER_CORE
- Request the maximum ntasks be invoked on each core.
-
- SLURM_NTASKS_PER_GPU
- Request that there are ntasks tasks invoked for every GPU.
-
- SLURM_NTASKS_PER_NODE
- Request that ntasks be invoked on each node.
-
- SLURM_NTASKS_PER_SOCKET
- Request the maximum ntasks be invoked on each socket.
-
- SLURM_OVERCOMMIT
- Overcommit resources.
-
- SLURM_PROFILE
- Enables detailed data collection by the acct_gather_profile plugin.
-
- SLURM_SHARDS_ON_NODE
- Number of GPU Shards available to the step on this node.
-
- SLURM_SUBMIT_HOST
- The hostname of the computer from which scrun was invoked.
-
- SLURM_TASKS_PER_NODE
- Number of tasks to be initiated on each node. Values are comma separated
and in the same order as SLURM_JOB_NODELIST. If two or more consecutive
nodes are to have the same task count, that count is followed by
"(x#)" where "#" is the repetition count. For example,
"SLURM_TASKS_PER_NODE=2(x3),1" indicates that the first three
nodes will each execute two tasks and the fourth node will execute one
task.
-
- SLURM_THREADS_PER_CORE
- This is only set if --threads-per-core or
SCRUN_THREADS_PER_CORE were specified. The value will be set to the
value specified by --threads-per-core or
SCRUN_THREADS_PER_CORE. This is used by subsequent srun calls
within the job allocation.
-
/etc/slurm/scrun.lua must be present on any node where
scrun will be invoked. scrun.lua must be a compliant
lua(1) script.
The following functions must be defined. A minimal skeleton satisfying this
interface is sketched after their descriptions below.
- • function slurm_scrun_stage_in(id, bundle,
spool_dir, config_file, job_id, user_id,
group_id, job_env)
- Called right after job allocation to stage the container into the job node(s).
Must return slurm.SUCCESS or the job will be cancelled. The function is
required to prepare the container for execution on the job node(s) so that it
can run as configured in oci.conf(5). The function may block as long as
required until the container has been fully prepared (up to the job's maximum
wall time).
- id
- Container ID
- bundle
- OCI bundle path
- spool_dir
- Temporary working directory for container
- config_file
- Path to config.json for container
- job_id
- jobid of job allocation
- user_id
- Resolved numeric user id of job allocation. It is generally expected that
the lua script will be executed inside of a user namespace running under
the root(0) user.
- group_id
- Resolved numeric group id of job allocation. It is generally expected that
the lua script will be executed inside of a user namespace running under
the root(0) group.
- job_env
- Table with each entry of Key=Value or Value of each environment variable
of the job.
- • function slurm_scrun_stage_out(id, bundle,
orig_bundle, root_path, orig_root_path,
spool_dir, config_file, jobid, user_id,
group_id)
- Called right after the container step completes to stage out files from the
job node(s). Must return slurm.SUCCESS or the job will be cancelled. The
function is required to pull back any changes and clean up the container on
the job node(s). The function may block as long as required until the
container has been fully staged out (up to the job's maximum wall time).
- id
- Container ID
- bundle
- OCI bundle path
- orig_bundle
- Originally submitted OCI bundle path before modification by
set_bundle_path().
- root_path
- Path to directory root of container contents.
- orig_root_path
- Original path to directory root of container contents before modification
by set_root_path().
- spool_dir
- Temporary working directory for container
- config_file
- Path to config.json for container
- jobid
- jobid of job allocation
- user_id
- Resolved numeric user id of job allocation. It is generally expected that
the lua script will be executed inside of a user namespace running under
the root(0) user.
- group_id
- Resolved numeric group id of job allocation. It is generally expected that
the lua script will be executed inside of a user namespace running under
the root(0) group.
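The following minimal scrun.lua is an illustrative sketch only (it is not part
of the Slurm distribution) and assumes the OCI bundle already resides on a
filesystem shared with the job node(s), so no staging work is needed. It simply
satisfies the required interface described above.
-- Minimal illustrative scrun.lua: assumes the bundle is already visible on the job node(s).
function slurm_scrun_stage_in(id, bundle, spool_dir, config_file, job_id, user_id, group_id, job_env)
    slurm.log_info(string.format("stage_in: container %s (job %d) using bundle %s", id, job_id, bundle))
    -- Nothing to stage; the bundle is assumed to be reachable from the job node(s).
    return slurm.SUCCESS
end
function slurm_scrun_stage_out(id, bundle, orig_bundle, root_path, orig_root_path, spool_dir, config_file, jobid, user_id, group_id)
    slurm.log_info(string.format("stage_out: container %s (job %d) complete", id, jobid))
    -- Nothing was staged in, so there is nothing to pull back or clean up.
    return slurm.SUCCESS
end
slurm.log_info("initialized minimal scrun.lua")
return slurm.SUCCESS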
The following functions are provided for any Lua function to call
as needed. A short usage sketch follows this list.
- • slurm.set_bundle_path(PATH)
- Called to notify scrun to use PATH as new OCI container
bundle path. Depending on the filesystem layout, cloning the container
bundle may be required to allow execution on job nodes.
- • slurm.set_root_path(PATH)
- Called to notify scrun to use PATH as new container root
filesystem path. Depending on the filesystem layout, cloning the container
bundle may be required to allow execution on job nodes. Script must also
update #/root/path in config.json when changing root path.
- • STATUS,OUTPUT =
slurm.remote_command(SCRIPT)
- Run SCRIPT in new job step on all job nodes. Returns numeric job
status as STATUS and job stdio as OUTPUT. Blocks until
SCRIPT exits.
- • STATUS,OUTPUT =
slurm.allocator_command(SCRIPT)
- Run SCRIPT as forked child process of scrun. Returns numeric
job status as STATUS and job stdio as OUTPUT. Blocks until
SCRIPT exits.
- • slurm.log(MSG, LEVEL)
- Log MSG at log LEVEL. Valid range of values for LEVEL
is [0, 4].
- • slurm.error(MSG)
- Log error MSG.
- • slurm.log_error(MSG)
- Log error MSG.
- • slurm.log_info(MSG)
- Log MSG at log level INFO.
- • slurm.log_verbose(MSG)
- Log MSG at log level VERBOSE.
- • slurm.log_debug(MSG)
- Log MSG at log level DEBUG.
- • slurm.log_debug2(MSG)
- Log MSG at log level DEBUG2.
- • slurm.log_debug3(MSG)
- Log MSG at log level DEBUG3.
- • slurm.log_debug4(MSG)
- Log MSG at log level DEBUG4.
- • MINUTES =
slurm.time_str2mins(TIME_STRING)
- Parse TIME_STRING into number of minutes as MINUTES. Valid
formats:
- • days-hours[:minutes[:seconds]]
- • hours:minutes:seconds
- • minutes[:seconds]
- • -1
- • INFINITE
- • UNLIMITED
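The sketch below is illustrative only and shows how several of these helpers
might be combined from inside a staging function; the directory name, commands
and the example_prepare name are hypothetical placeholders, not part of any
distributed script.
-- Illustrative helper usage (hypothetical); expected to be called from a staging function.
local function example_prepare(spool_dir)
    local status, output
    -- Run a command as a forked child of scrun.
    status, output = slurm.allocator_command("mkdir -p "..spool_dir.."/example")
    if (status ~= 0)
    then
        slurm.log_error("mkdir failed: "..output)
        return slurm.ERROR
    end
    -- Run a command in a new job step on all job node(s); blocks until it exits.
    status, output = slurm.remote_command("hostname")
    slurm.log_info("job node(s) reported: "..output)
    -- Convert a time string into minutes.
    local minutes = slurm.time_str2mins("1-12:00:00")
    slurm.log_debug(string.format("1-12:00:00 is %d minutes", math.floor(minutes)))
    return slurm.SUCCESS
end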
- Full Container staging example
using rsync:
- This full example will stage a container as given by docker(1) or
podman(1). The container's config.json is modified to remove
unwanted functionality that may cause the container to fail to run under
crun(1) or runc(1). The script uses rsync(1) to move
the container to a shared filesystem under the scratch_path
variable.
local json = require 'json'
local open = io.open
local scratch_path = "/run/user/"
local function read_file(path)
    local file = open(path, "rb")
    if not file then return nil end
    local content = file:read "*all"
    file:close()
    return content
end
local function write_file(path, contents)
    local file = open(path, "wb")
    if not file then return nil end
    file:write(contents)
    file:close()
    return
end
function slurm_scrun_stage_in(id, bundle, spool_dir, config_file, job_id, user_id, group_id, job_env)
    slurm.log_debug(string.format("stage_in(%s, %s, %s, %s, %d, %d, %d)",
        id, bundle, spool_dir, config_file, job_id, user_id, group_id))
    local status, output, user, rc
    local config = json.decode(read_file(config_file))
    local src_rootfs = config["root"]["path"]
    rc, user = slurm.allocator_command(string.format("id -un %d", user_id))
    user = string.gsub(user, "%s+", "")
    local root = scratch_path..math.floor(user_id).."/slurm/scrun/"
    local dst_bundle = root.."/"..id.."/"
    local dst_config = root.."/"..id.."/config.json"
    local dst_rootfs = root.."/"..id.."/rootfs/"
    if string.sub(src_rootfs, 1, 1) ~= "/"
    then
        -- always use absolute path
        src_rootfs = string.format("%s/%s", bundle, src_rootfs)
    end
    status, output = slurm.allocator_command("mkdir -p "..dst_rootfs)
    if (status ~= 0)
    then
        slurm.log_info(string.format("mkdir(%s) failed %u: %s",
            dst_rootfs, status, output))
        return slurm.ERROR
    end
    status, output = slurm.allocator_command(string.format("/usr/bin/env rsync --exclude sys --exclude proc --numeric-ids --delete-after --ignore-errors --stats -a -- %s/ %s/", src_rootfs, dst_rootfs))
    if (status ~= 0)
    then
        -- rsync can fail due to permissions which may not matter
        slurm.log_info(string.format("WARNING: rsync failed: %s", output))
    end
    slurm.set_bundle_path(dst_bundle)
    slurm.set_root_path(dst_rootfs)
    config["root"]["path"] = dst_rootfs
    -- Always force user namespace support in container or runc will reject
    local process_user_id = 0
    local process_group_id = 0
    if ((config["process"] ~= nil) and (config["process"]["user"] ~= nil))
    then
        -- resolve out user in the container
        if (config["process"]["user"]["uid"] ~= nil)
        then
            process_user_id=config["process"]["user"]["uid"]
        else
            process_user_id=0
        end
        -- resolve out group in the container
        if (config["process"]["user"]["gid"] ~= nil)
        then
            process_group_id=config["process"]["user"]["gid"]
        else
            process_group_id=0
        end
        -- purge additionalGids as they are not supported in rootless
        if (config["process"]["user"]["additionalGids"] ~= nil)
        then
            config["process"]["user"]["additionalGids"] = nil
        end
    end
    if (config["linux"] ~= nil)
    then
        -- force user namespace to always be defined for rootless mode
        local found = false
        if (config["linux"]["namespaces"] == nil)
        then
            config["linux"]["namespaces"] = {}
        else
            for _, namespace in ipairs(config["linux"]["namespaces"]) do
                if (namespace["type"] == "user")
                then
                    found=true
                    break
                end
            end
        end
        if (found == false)
        then
            table.insert(config["linux"]["namespaces"], {type= "user"})
        end
        -- Provide default user map as root if one not provided
        if (true or config["linux"]["uidMappings"] == nil)
        then
            config["linux"]["uidMappings"] =
                {{containerID=process_user_id, hostID=math.floor(user_id), size=1}}
        end
        -- Provide default group map as root if one not provided
        -- mappings fail with build???
        if (true or config["linux"]["gidMappings"] == nil)
        then
            config["linux"]["gidMappings"] =
                {{containerID=process_group_id, hostID=math.floor(group_id), size=1}}
        end
        -- disable trying to use a specific cgroup
        config["linux"]["cgroupsPath"] = nil
    end
    if (config["mounts"] ~= nil)
    then
        -- Find and remove any user/group settings in mounts
        for _, mount in ipairs(config["mounts"]) do
            local opts = {}
            if (mount["options"] ~= nil)
            then
                for _, opt in ipairs(mount["options"]) do
                    if ((string.sub(opt, 1, 4) ~= "gid=") and (string.sub(opt, 1, 4) ~= "uid="))
                    then
                        table.insert(opts, opt)
                    end
                end
            end
            if (opts ~= nil and #opts > 0)
            then
                mount["options"] = opts
            else
                mount["options"] = nil
            end
        end
        -- Remove all bind mounts by copying files into rootfs
        local mounts = {}
        for i, mount in ipairs(config["mounts"]) do
            if ((mount["type"] ~= nil) and (mount["type"] == "bind") and (string.sub(mount["source"], 1, 4) ~= "/sys") and (string.sub(mount["source"], 1, 5) ~= "/proc"))
            then
                status, output = slurm.allocator_command(string.format("/usr/bin/env rsync --numeric-ids --ignore-errors --stats -a -- %s %s", mount["source"], dst_rootfs..mount["destination"]))
                if (status ~= 0)
                then
                    -- rsync can fail due to permissions which may not matter
                    slurm.log_info("rsync failed")
                end
            else
                table.insert(mounts, mount)
            end
        end
        config["mounts"] = mounts
    end
    -- Merge in Job environment into container -- this is optional!
    if (config["process"]["env"] == nil)
    then
        config["process"]["env"] = {}
    end
    for _, env in ipairs(job_env) do
        table.insert(config["process"]["env"], env)
    end
    -- Remove all prestart hooks to squash any networking attempts
    if ((config["hooks"] ~= nil) and (config["hooks"]["prestart"] ~= nil))
    then
        config["hooks"]["prestart"] = nil
    end
    -- Remove all rlimits
    if ((config["process"] ~= nil) and (config["process"]["rlimits"] ~= nil))
    then
        config["process"]["rlimits"] = nil
    end
    write_file(dst_config, json.encode(config))
    slurm.log_info("created: "..dst_config)
    return slurm.SUCCESS
end
function slurm_scrun_stage_out(id, bundle, orig_bundle, root_path, orig_root_path, spool_dir, config_file, jobid, user_id, group_id)
    if (root_path == nil)
    then
        root_path = ""
    end
    slurm.log_debug(string.format("stage_out(%s, %s, %s, %s, %s, %s, %s, %d, %d, %d)",
        id, bundle, orig_bundle, root_path, orig_root_path, spool_dir, config_file, jobid, user_id, group_id))
    if (bundle == orig_bundle)
    then
        slurm.log_info(string.format("skipping stage_out as bundle=orig_bundle=%s", bundle))
        return slurm.SUCCESS
    end
    status, output = slurm.allocator_command(string.format("/usr/bin/env rsync --numeric-ids --delete-after --ignore-errors --stats -a -- %s/ %s/", root_path, orig_root_path))
    if (status ~= 0)
    then
        -- rsync can fail due to permissions which may not matter
        slurm.log_info("rsync failed")
    else
        -- cleanup temporary files after they have been synced back to the source
        slurm.allocator_command(string.format("/usr/bin/rm --preserve-root=all --one-file-system -dr -- %s", bundle))
    end
    return slurm.SUCCESS
end
slurm.log_info("initialized scrun.lua")
return slurm.SUCCESS
When scrun receives SIGINT, it will attempt to gracefully
cancel any related jobs (if any) and clean up.
Copyright (C) 2023 SchedMD LLC.
This file is part of Slurm, a resource management program. For
details, see <https://slurm.schedmd.com/>.
Slurm is free software; you can redistribute it and/or modify it
under the terms of the GNU General Public License as published by the Free
Software Foundation; either version 2 of the License, or (at your option)
any later version.
Slurm is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
more details.
slurm(1), oci.conf(5), srun(1),
crun(1), runc(1), docker(1) and podman(1)