Accepting request 1030432 from network:cluster
- updated to 22.05.5
- NOTE: Slurm validates that libraries are of the same version. Unfortunately, due to an oversight, we failed to notice that the slurmstepd loads the hash_k12 library only after a job has completed. This means that if the hash_k12 library is upgraded before a job finishes, the slurmstepd will load the new library when the job finishes, and will fail due to a mismatch of versions. This results in nodes with slurmstepd processes stuck indefinitely. These processes require manual intervention to clean up. There is no clean way to resolve these hung slurmstepd processes.
  The only recommended way to upgrade between minor versions of 22.05 with RPMs, or with any upgrade that replaces current binaries and libraries, is to drain the nodes of running jobs first.
- Fixes a number of moderate severity issues, notable are:
  * Load hash plugin at slurmstepd launch time to prevent issues loading the plugin at step completion if the Slurm installation is upgraded.
  * Update nvml plugin to match the unique id format for MIG devices in new Nvidia drivers.
  * Fix multi-node step launch failure when nodes in the controller aren't in natural order. This can happen with inconsistent node naming (such as node15 and node052) or with dynamic nodes which can register in any order.
  * job_container/tmpfs - clean up containers even when the .ns file isn't mounted anymore.
  * Wait up to PrologEpilogTimeout before shutting down slurmd to allow prolog and epilog scripts to complete or time out. Previously, slurmd waited 120 seconds before timing out and killing prolog and epilog scripts.

(forwarded request 1010642 from mslacken)

OBS-URL: https://build.opensuse.org/request/show/1030432
OBS-URL: https://build.opensuse.org/package/show/openSUSE:Factory/slurm?expand=0&rev=79
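The drain-first advice in the NOTE can be sketched as a small pre-upgrade check. This is only an illustration, not part of the package: the helper names `drain_command` and `fully_drained` are hypothetical, and the parser assumes node states were collected as two-column text (node name, state), for example from `sinfo -h -N -o "%N %T"`. It is deliberately conservative: only a state of exactly `drained` counts as safe, so `draining` (jobs still running) and any state carrying a suffix flag are treated as not ready.

```python
def drain_command(node, reason="upgrade to 22.05.5"):
    """Build (but do not run) an scontrol call that drains one node."""
    return ["scontrol", "update", f"NodeName={node}",
            "State=DRAIN", f"Reason={reason}"]


def fully_drained(sinfo_output):
    """Return the nodes whose every listed state is exactly 'drained'.

    Assumes one "<node> <state>" pair per line. A node can appear more
    than once (one line per partition), so every line must agree.
    """
    drained, other = set(), set()
    for line in sinfo_output.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or unexpected lines
        node, state = fields
        (drained if state == "drained" else other).add(node)
    return sorted(drained - other)


if __name__ == "__main__":
    sample = "node01 drained\nnode02 draining\nnode03 idle\n"
    print(" ".join(drain_command("node02")))
    print(fully_drained(sample))
```

Packages would then be upgraded only on the nodes this check returns, leaving the rest until their jobs finish.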
commit 220eec76a4
@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:8ff2d1f1cc9b0cbdd344cfcbbe4f14b08d4260b7012619f6cc9c38263f276c41
-size 7094002
3
slurm-22.05.5.tar.bz2
Normal file
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:f687c98c4f7c0b7409f865771bbb05986daa3e207616667a9aa7390ba5a50fce
+size 7098772
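The three-line Git LFS pointer above records the tarball's SHA-256 digest and byte size, so a downloaded `slurm-22.05.5.tar.bz2` can be checked against it. A minimal sketch, assuming the pointer text and the file contents are already in memory; `parse_pointer` and `matches_pointer` are hypothetical helper names, not part of Git LFS or OBS:

```python
import hashlib


def parse_pointer(pointer_text):
    """Split a Git LFS pointer into (hash algorithm, hex digest, size)."""
    fields = dict(
        line.split(" ", 1)
        for line in pointer_text.splitlines()
        if line.strip()
    )
    algo, _, digest = fields["oid"].partition(":")
    return algo, digest, int(fields["size"])


def matches_pointer(data, pointer_text):
    """True if the bytes in `data` have the size and digest the pointer names."""
    algo, digest, size = parse_pointer(pointer_text)
    if algo != "sha256":
        raise ValueError(f"unsupported oid algorithm: {algo}")
    return len(data) == size and hashlib.sha256(data).hexdigest() == digest


POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:f687c98c4f7c0b7409f865771bbb05986daa3e207616667a9aa7390ba5a50fce
size 7098772
"""

if __name__ == "__main__":
    print(parse_pointer(POINTER))
```

For a file of this size, reading it in one go with `open(path, "rb").read()` is fine; larger files would be better hashed in chunks.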
@@ -1,3 +1,32 @@
+-------------------------------------------------------------------
+Fri Oct 14 08:49:24 UTC 2022 - Christian Goll <cgoll@suse.com>
+
+- updated to 22.05.5
+- NOTE: Slurm validates that libraries are of the same version. Unfortunately,
+  due to an oversight, we failed to notice that the slurmstepd loads the
+  hash_k12 library only after a job has completed. This means that if the
+  hash_k12 library is upgraded before a job finishes, the slurmstepd will load
+  the new library when the job finishes, and will fail due to a mismatch of
+  versions. This results in nodes with slurmstepd processes stuck
+  indefinitely. These processes require manual intervention to clean up. There
+  is no clean way to resolve these hung slurmstepd processes.
+  The only recommended way to upgrade between minor versions of 22.05 with
+  RPM’s or upgrades that replace current binaries and libraries is to drain the
+  nodes of running jobs first.
+- Fixes a number of moderate severity issues, noteable are:
+  * Load hash plugin at slurmstepd launch time to prevent issues loading the
+    plugin at step completion if the Slurm installation is upgraded.
+  * Update nvml plugin to match the unique id format for MIG devices in new
+    Nvidia drivers.
+  * Fix multi-node step launch failure when nodes in the controller aren't in
+    natural order. This can happen with inconsistent node naming (such as
+    node15 and node052) or with dynamic nodes which can register in any order.
+  * job_container/tmpfs - cleanup containers even when the .ns file isn't
+    mounted anymore.
+  * Wait up to PrologEpilogTimeout before shutting down slurmd to allow prolog
+    and epilog scripts to complete or timeout. Previously, slurmd waited 120
+    seconds before timing out and killing prolog and epilog scripts.
+
 -------------------------------------------------------------------
 Sat Sep 24 07:34:31 UTC 2022 - Egbert Eich <eich@suse.com>
 
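The changelog entry above mentions nodes that "aren't in natural order" and gives node15 and node052 as an example of inconsistent naming. The sketch below only illustrates why such names are ambiguous: a plain string sort and a numeric-aware ("natural") sort disagree about which node comes first. It is not Slurm's own ordering code, and `natural_key` is a hypothetical helper.

```python
import re


def natural_key(name):
    """Sort key that compares digit runs as integers, other text as strings.

    re.split with a capturing group alternates text and digit parts, so
    strings and integers always line up position by position.
    """
    return [int(part) if part.isdigit() else part
            for part in re.split(r"(\d+)", name)]


if __name__ == "__main__":
    nodes = ["node15", "node052"]
    print(sorted(nodes))                   # character by character: '0' < '1'
    print(sorted(nodes, key=natural_key))  # numerically: 15 < 52
```

With consistent zero padding (node015, node052) the two orderings agree, which is one reason to keep node names uniform.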
@@ -18,7 +18,7 @@
 
 # Check file META in sources: update so_version to (API_CURRENT - API_AGE)
 %define so_version 38
-%define ver 22.05.2
+%define ver 22.05.5
 %define _ver _22_05
 %define dl_ver %{ver}
 # so-version is 0 and seems to be stable
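The spec comment above says `so_version` should be API_CURRENT minus API_AGE, taken from the META file in the sources. A rough sketch of that check follows. The assumption that META holds `KEY: value` lines is mine, and the sample values are illustrative, chosen only to be consistent with `so_version 38`; `so_version_from_meta` is a hypothetical helper name.

```python
def so_version_from_meta(meta_text):
    """Compute API_CURRENT - API_AGE from META-style 'KEY: value' lines."""
    values = {}
    for line in meta_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue  # skip blanks, comments, and lines without a key
        key, _, value = line.partition(":")
        values[key.strip().upper()] = value.strip()
    return int(values["API_CURRENT"]) - int(values["API_AGE"])


# Illustrative input only, not copied from the 22.05.5 tarball.
SAMPLE_META = """
# Metadata (assumed layout)
  API_CURRENT:  38
  API_AGE:      0
"""

if __name__ == "__main__":
    print(so_version_from_meta(SAMPLE_META))
```

Running this against the real META after each version bump would show whether `%define so_version` also needs to change.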