- Updated to 23.02.4 with the following changes:
* Bug Fixes:
+ Fix main scheduler loop not starting after a failover to backup
controller. Avoid slurmctld segfault when specifying
`AccountingStorageExternalHost` (bsc#1214983).
+ Fix sbatch return code when `--wait` is requested on a job array.
+ Fix collected `GPUUtilization` values for `acct_gather_profile` plugins.
+ Fix `slurmrestd` handling of job hold/release operations.
+ Fix step running indefinitely when slurmctld takes more than
`MessageTimeout` to respond. Now, `slurmctld` will cancel the step when
detected, preventing following steps from getting stuck waiting for
resources to be released.
+ Fix regression to make `job_desc.min_cpus` accurate again in `job_submit`
when requesting a job with `--ntasks-per-node`.
+ Fix handling of `ArrayTaskThrottle` in backfill.
+ Fix regression in 23.02.2 when checking gres state on `slurmctld`
startup or reconfigure. Gres changes in the configuration were not
updated on slurmctld startup. On startup or reconfigure, these messages
were present in the log: `"error: Attempt to change gres/gpu Count"`.
+ Fix potential double count of gres when dealing with limits.
+ Fix `slurmstepd` segfault when `ContainerPath` is not set in `oci.conf`.
+ Fixed an issue where jobs requesting licenses were incorrectly rejected.
+ `scrontab` - Fix cutting off the final character of quoted variables.
+ `smail` - Fix issues where e-mails at job completion were not being sent.
+ `scontrol/slurmctld` - fix comma parsing when updating a reservation's
nodes.
+ Fix `--gpu-bind=single` binding tasks to wrong gpus, leading to some gpus
having more tasks than they should and other gpus being unused.
+ Fix regression in 23.02 that causes slurmstepd to crash when `srun`
requests more than `TreeWidth` nodes in a step and uses the pmi2 or
OBS-URL: https://build.opensuse.org/request/show/1110259
OBS-URL: https://build.opensuse.org/package/show/openSUSE:Factory/slurm?expand=0&rev=93
OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=260
- Fixes since 23.02.03:
Highlights:
* Fix main scheduler loop not starting after a failover to backup controller.
* Avoid slurmctld segfault when specifying `AccountingStorageExternalHost`
(bsc#1214983).
Other:
* Fix sbatch return code when `--wait` is requested on a job array.
* Fix collected `GPUUtilization` values for `acct_gather_profile` plugins.
* Fix `slurmrestd` handling of job hold/release operations.
* Make spank `S_JOB_ARGV` item value hold the requested command `argv`
instead of the `srun --bcast` value when `--bcast` requested (only in local
context).
* Fix step running indefinitely when slurmctld takes more than
`MessageTimeout` to respond. Now, slurmctld will cancel the step when
detected, preventing following steps from getting stuck waiting for
resources to be released.
* Fix regression to make `job_desc.min_cpus` accurate again in job_submit when
requesting a job with `--ntasks-per-node`.
* Fix handling of `ArrayTaskThrottle` in backfill.
* Fix regression in 23.02.2 when checking gres state on `slurmctld` startup or
reconfigure. Gres changes in the configuration were not updated on slurmctld
startup. On startup or reconfigure, these messages were present in the log:
`"error: Attempt to change gres/gpu Count"`.
* Fix potential double count of gres when dealing with limits.
* Fix `slurmstepd` segfault when `ContainerPath` is not set in `oci.conf`.
* Fixed an issue where jobs requesting licenses were incorrectly rejected.
* `scrontab` - Fix cutting off the final character of quoted variables.
* `smail` - Fix issues where e-mails at job completion were not being sent.
* `scontrol/slurmctld` - fix comma parsing when updating a reservation's
nodes.
OBS-URL: https://build.opensuse.org/request/show/1109308
OBS-URL: https://build.opensuse.org/package/show/openSUSE:Factory/slurm?expand=0&rev=92
OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=258
- updated to 23.02.04 which includes the following changes:
* Fixes the main scheduler loop not starting on the backup controller after
a failover event, a segfault when attempting to use
AccountingStorageExternalHost, and an issue where steps could continue
running indefinitely if slurmctld takes too long to respond (bsc#1214983).
* Includes a fix for potential slurmctld crashes when the backup slurmctld
takes over.
* This also fixes some issues when using older versions of the command line
tools with a 23.02 controller.
* srun/sbatch/salloc - In order to support user namespaces, process user and
group ids are no longer used unless explicitly requested as an argument and
are left as nobody(99) by default. Any cli_filters or SPANK plugins need to
ignore any uid or gid that equals SLURM_AUTH_NOBODY (99). User and group ids
are now resolved by the active auth plugin. To determine the actual job uid
or gid you should use the RESPONSE_RESOURCE_ALLOCATION RPC.
- removed Fix-test-3.13.patch as fixed upstream
- removed Fix-test-38.11.patch as test changed upstream (forwarded request 1109009 from mslacken)
OBS-URL: https://build.opensuse.org/request/show/1109029
OBS-URL: https://build.opensuse.org/package/show/openSUSE:Factory/slurm?expand=0&rev=91
OBS-URL: https://build.opensuse.org/request/show/1109009
OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=256
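The uid/gid rule from the srun/sbatch/salloc note in the 23.02.04 entry above can be illustrated with a minimal sketch. Real cli_filter and SPANK plugins are not written in Python, so this only shows the check itself. The helper name `effective_id` is hypothetical; only the value 99 for SLURM_AUTH_NOBODY is taken from the entry.

```python
from typing import Optional

# Placeholder id named in the changelog entry: SLURM_AUTH_NOBODY (99).
SLURM_AUTH_NOBODY = 99


def effective_id(raw_id: int) -> Optional[int]:
    """Return raw_id, or None when it is only the 'nobody' placeholder.

    Hypothetical helper: a filter should not act on a uid or gid that
    equals SLURM_AUTH_NOBODY, because the real id has not been resolved.
    """
    if raw_id == SLURM_AUTH_NOBODY:
        return None
    return raw_id
```

A filter built this way would skip its uid- or gid-based logic whenever `effective_id` returns `None`, rather than treating 99 as a real user or group.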
- updated to 23.02.02 which includes a number of fixes to Slurm stability
* Includes a fix for a regression in 23.02 that caused openmpi mpirun to fail
to launch tasks.
* It also includes two functional changes:
+ Don't update the cron job tasks if the whole crontab file is left
untouched after opening it with scrontab -e.
+ Sort dynamic nodes and include them in topology after scontrol
reconfigure or a slurmctld restart.
OBS-URL: https://build.opensuse.org/request/show/1085668
OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=254
- Web-configurator: changed presets to SUSE defaults.
- If %_restart_on_update is no longer defined, replace it with our own
macro.
- Marked slurm-openlava, slurm-seff and slurm-sjstat noarch.
- rpmlint:
* dropped some rpmlint filters which are no longer relevant.
* added/refreshed filters. For details, see rpmlintrc.
- Remove workaround to fix the restart issue in a Slurm package
described in bsc#1088693.
The Slurm version in this package was 16.05. Any attempt to
directly migrate to the current version is bound to fail
anyway.
- Now require slurm-munge if munge authentication is installed.
OBS-URL: https://build.opensuse.org/request/show/1083466
OBS-URL: https://build.opensuse.org/package/show/openSUSE:Factory/slurm?expand=0&rev=89
OBS-URL: https://build.opensuse.org/request/show/1082770
OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=251
- updated to 23.02.1 with the following changes:
* job_container/tmpfs - cleanup job container even if namespace mount is
already unmounted.
* openapi/dbv0.0.38 - Fix not displaying an error when updating QOS or
associations fails.
* Fix nodes remaining as PLANNED after slurmctld save state recovery.
* Add cgroup.conf EnableControllers option for cgroup/v2.
* Get correct cgroup root to allow slurmd to run in containers like Docker.
* slurmctld - add missing PrivateData=jobs check to step ContainerID lookup
requests originated from 'scontrol show step container-id=<id>' or certain
scrun operations when container state can't be directly queried.
* Fix nodes un-draining after being drained due to unkillable step.
* Fix remote licenses allowed percentages reset to 0 during upgrade.
* sacct - Avoid truncating time strings when using SLURM_TIME_FORMAT with
the --parsable option.
* Fix regression in 22.05.0rc1 that broke Nodes=ALL in a NodeSet.
* openapi/v0.0.39 - fix jobs submitted via slurmrestd being allocated fewer
CPUs than tasks when requesting multiple tasks.
* Fix job not being scheduled on valid nodes and potentially being rejected
when using parentheses at the beginning of square brackets in a feature
request, for example: "feat1&[(feat2|feat3)]".
* Fix regression in 23.02.0rc1 which made --gres-flags=enforce-binding no
longer enforce optimal core-gpu job placement.
* mpi/pmix - Fix v5 to load correctly when libpmix.so isn't in the normal
lib path.
* data_parser/v0.0.39 - fix regression where "memory_per_node" would be
rejected for job submission.
* data_parser/v0.0.39 - fix regression where "memory_per_cpu" would be
rejected for job submission.
* slurmctld - add an assert to check for magic number presence before deleting
OBS-URL: https://build.opensuse.org/request/show/1076522
OBS-URL: https://build.opensuse.org/package/show/openSUSE:Factory/slurm?expand=0&rev=88
OBS-URL: https://build.opensuse.org/request/show/1076461
OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=248
- updated to 23.02.0
* Highlights
+ slurmctld - Add new RPC rate limiting feature. This is enabled through
SlurmctldParameters=rl_enable, otherwise disabled by default.
+ Make scontrol reconfigure and sending a SIGHUP to the slurmctld behave
the same. If you were using SIGHUP as a 'lighter' scontrol reconfigure
to rotate logs please update your scripts to use SIGUSR2 instead.
+ Change cloud nodes to show by default. PrivateData=cloud is no longer
needed.
+ sreport - Count planned (FKA reserved) time for jobs running in
IGNORE_JOBS reservations. Previously was lumped into IDLE time.
+ job_container/tmpfs - Support running with an arbitrary list of private
mount points (/tmp and /dev/shm are the default, but not required).
+ job_container/tmpfs - Set more environment variables in InitScript.
+ Make all cgroup directories created by Slurm owned by root. This was the
behavior in cgroup/v2 but not in cgroup/v1 where by default the step
directories ownership were set to the user and group of the job.
+ accounting_storage/mysql - change purge/archive to calculate record ages
based on end time, rather than start or submission times.
+ job_submit/lua - add support for log_user() from slurm_job_modify().
+ Run the following scripts in slurmscriptd instead of slurmctld:
ResumeProgram, ResumeFailProgram, SuspendProgram, ResvProlog, ResvEpilog,
and RebootProgram (only with SlurmctldParameters=reboot_from_controller).
+ Only permit changing log levels with 'srun --slurmd-debug' by root
or SlurmUser.
+ slurmctld will fatal() when reconfiguring the job_submit plugin fails.
+ Add PowerDownOnIdle partition option to power down nodes after nodes
become idle.
+ Add "[jobid.stepid]" prefix from slurmstepd and "slurmscriptd" prefix
from slurmscriptd to Syslog logging. Previously was only happening when
OBS-URL: https://build.opensuse.org/request/show/1068320
OBS-URL: https://build.opensuse.org/package/show/openSUSE:Factory/slurm?expand=0&rev=83
+ Fixed GpuFreqDef option. When set in slurm.conf, it will be used if
--gpu-freq was not explicitly set by the job step.
+ topology/tree - Add new TopologyParam=SwitchAsNodeRank option to reorder
nodes based on switch layout. This can be useful if the naming convention
for the nodes does not naturally map to the network topology.
+ Removed the default setting for GpuFreqDef. If unset, no attempt to change
the GPU frequency will be made if --gpu-freq is not set for the step.
OBS-URL: https://build.opensuse.org/request/show/1068316
OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=232
- updated to 23.02.0-0rc1
* Highlights
+ slurmctld - Add new RPC rate limiting feature. This is enabled through
SlurmctldParameters=rl_enable, otherwise disabled by default.
+ Make scontrol reconfigure and sending a SIGHUP to the slurmctld behave
the same. If you were using SIGHUP as a 'lighter' scontrol reconfigure
to rotate logs please update your scripts to use SIGUSR2 instead.
+ Change cloud nodes to show by default. PrivateData=cloud is no longer
needed.
+ sreport - Count planned (FKA reserved) time for jobs running in
IGNORE_JOBS reservations. Previously was lumped into IDLE time.
+ job_container/tmpfs - Support running with an arbitrary list of private
mount points (/tmp and /dev/shm are the default, but not required).
+ job_container/tmpfs - Set more environment variables in InitScript.
+ Make all cgroup directories created by Slurm owned by root. This was the
behavior in cgroup/v2 but not in cgroup/v1 where by default the step
directories ownership were set to the user and group of the job.
+ accounting_storage/mysql - change purge/archive to calculate record ages
based on end time, rather than start or submission times.
+ job_submit/lua - add support for log_user() from slurm_job_modify().
+ Run the following scripts in slurmscriptd instead of slurmctld:
ResumeProgram, ResumeFailProgram, SuspendProgram, ResvProlog, ResvEpilog,
and RebootProgram (only with SlurmctldParameters=reboot_from_controller).
+ Only permit changing log levels with 'srun --slurmd-debug' by root
or SlurmUser.
+ slurmctld will fatal() when reconfiguring the job_submit plugin fails.
+ Add PowerDownOnIdle partition option to power down nodes after nodes
become idle.
+ Add "[jobid.stepid]" prefix from slurmstepd and "slurmscriptd" prefix
from slurmscriptd to Syslog logging. Previously was only happening when
OBS-URL: https://build.opensuse.org/request/show/1067475
OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=231
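The 23.02.0 highlights above ask that log-rotation scripts send SIGUSR2 to slurmctld instead of SIGHUP. A minimal sketch of such a script follows. It assumes the daemon's PID can be read from a pid file; the helper names `read_pid` and `request_log_reopen` and the pid file path in the usage note are hypothetical and depend on the local configuration.

```python
import os
import signal
from pathlib import Path


def read_pid(pid_file: str) -> int:
    """Read a process id from a pid file (the path is site specific)."""
    return int(Path(pid_file).read_text().strip())


def request_log_reopen(pid_file: str) -> int:
    """Send SIGUSR2 to the process named in pid_file and return its pid.

    SIGUSR2 is the signal the changelog entry names for log rotation;
    SIGHUP now behaves like a full 'scontrol reconfigure'.
    """
    pid = read_pid(pid_file)
    os.kill(pid, signal.SIGUSR2)
    return pid
```

A rotation hook could call `request_log_reopen("/run/slurmctld.pid")` after moving the old log file aside, where the path is only an example and should match the configured SlurmctldPidFile.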
- Test Suite fixes:
* Update README_Testsuite.md.
* Clean up left over files when de-installing test suite.
* Adjustment to test suite package: for SLE mark the openmpi4
devel package and slurm-hdf5 optional.
* Add -ffat-lto-objects to the build flags when LTO is set to
make sure the object files we ship with the test suite still
work correctly.
* Improve setup-testsuite.sh: copy ssh fingerprints from all nodes.
- set environment variable SUSE_ZNOW to 0 in %build to avoid module load
failures due to unresolved symbols as module take advantage of lazy
bindings (bsc#1200030).
OBS-URL: https://build.opensuse.org/request/show/1031255
OBS-URL: https://build.opensuse.org/package/show/openSUSE:Factory/slurm?expand=0&rev=80
OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=224
- updated to 22.05.5
- NOTE: Slurm validates that libraries are of the same version. Unfortunately,
due to an oversight, we failed to notice that the slurmstepd loads the
hash_k12 library only after a job has completed. This means that if the
hash_k12 library is upgraded before a job finishes, the slurmstepd will load
the new library when the job finishes, and will fail due to a mismatch of
versions. This results in nodes with slurmstepd processes stuck
indefinitely. These processes require manual intervention to clean up. There
is no clean way to resolve these hung slurmstepd processes.
The only recommended way to upgrade between minor versions of 22.05 with
RPMs or upgrades that replace current binaries and libraries is to drain the
nodes of running jobs first.
- Fixes a number of moderate severity issues, the notable ones are:
* Load hash plugin at slurmstepd launch time to prevent issues loading the
plugin at step completion if the Slurm installation is upgraded.
* Update nvml plugin to match the unique id format for MIG devices in new
Nvidia drivers.
* Fix multi-node step launch failure when nodes in the controller aren't in
natural order. This can happen with inconsistent node naming (such as
node15 and node052) or with dynamic nodes which can register in any order.
* job_container/tmpfs - cleanup containers even when the .ns file isn't
mounted anymore.
* Wait up to PrologEpilogTimeout before shutting down slurmd to allow prolog
and epilog scripts to complete or timeout. Previously, slurmd waited 120
seconds before timing out and killing prolog and epilog scripts. (forwarded request 1010642 from mslacken)
OBS-URL: https://build.opensuse.org/request/show/1030432
OBS-URL: https://build.opensuse.org/package/show/openSUSE:Factory/slurm?expand=0&rev=79
OBS-URL: https://build.opensuse.org/request/show/1010642
OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=220
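The 22.05.5 note above recommends draining nodes of running jobs before upgrading between minor versions of 22.05. One way to script the first step is sketched below: it only builds the `scontrol update` command line that marks nodes as draining. The helper name `drain_command`, the node names, and the reason text are hypothetical.

```python
from typing import List


def drain_command(nodes: List[str], reason: str) -> List[str]:
    """Build an argv list that asks scontrol to drain the given nodes."""
    if not nodes:
        raise ValueError("at least one node name is required")
    return [
        "scontrol",
        "update",
        f"NodeName={','.join(nodes)}",
        "State=DRAIN",
        f"Reason={reason}",
    ]
```

The result can be passed to `subprocess.run(..., check=True)` on a host with administrative access to the cluster. Draining only stops new jobs from starting on the nodes, so the upgrade should wait until the jobs already running there have finished.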