forked from pool/slurm
- Updated to version 23.02.5 with the following changes:
  * Bug Fixes:
    + Revert a change in 23.02 where `SLURM_NTASKS` was no longer set in the
      job's environment when `--ntasks-per-node` was requested.
      The method by which it is being set, however, is different and should
      be more accurate in more situations.
    + Change pmi2 plugin to honor the `SrunPortRange` option. This matches the
      new behavior of the pmix plugin in 23.02.0. Note that neither of these
      plugins makes use of the "`MpiParams=ports=`" option, and previously
      were only limited by the system's ephemeral port range.
    + Fix regression in 23.02.2 that caused slurmctld -R to crash on startup if
      a node features plugin is configured.
    + Fix and prevent reoccurring reservations from overlapping.
    + `job_container/tmpfs` - Avoid attempts to share BasePath between nodes.
    + With `CR_Cpu_Memory`, fix node selection for jobs that request gres and
      `--mem-per-cpu`.
    + Fix a regression from 22.05.7 in which some jobs were allocated too few
      nodes, thus overcommitting cpus to some tasks.
    + Fix a job being stuck in the completing state if the job ends while the
      primary controller is down or unresponsive and the backup controller has
      not yet taken over.
    + Fix `slurmctld` segfault when a node registers with a configured
      `CpuSpecList` while `slurmctld` configuration has the node without
      `CpuSpecList`.
    + Fix cloud nodes getting stuck in `POWERED_DOWN+NO_RESPOND` state after
      not registering by `ResumeTimeout`.
    + `slurmstepd` - Avoid cleanup of `config.json`-less containers spooldir
      getting skipped.
    + Fix scontrol segfault when 'completing' command requested repeatedly in
      interactive mode.

OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=264
parent a323feff42
commit 74529b6cc2
@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:6634f57991c6a1a7d140c4de2f50a3e66dd06abef6ef83a8571f6eaa2fe048c7
-size 7259848
3
slurm-23.02.5.tar.bz2
Normal file
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:7620f1dd1134d14dff402a9127d5a36c340d7a2b69b55f67d8a44b3b8681a59d
+size 7274119
137
slurm.changes
@@ -1,3 +1,140 @@
+-------------------------------------------------------------------
+Mon Sep 18 05:23:19 UTC 2023 - Egbert Eich <eich@suse.com>
+
+- Updated to version 23.02.5 with the following changes:
+  * Bug Fixes:
+    + Revert a change in 23.02 where `SLURM_NTASKS` was no longer set in the
+      job's environment when `--ntasks-per-node` was requested.
+      The method by which it is being set, however, is different and should be more
+      accurate in more situations.
+    + Change pmi2 plugin to honor the `SrunPortRange` option. This matches the
+      new behavior of the pmix plugin in 23.02.0. Note that neither of these
+      plugins makes use of the "`MpiParams=ports=`" option, and previously
+      were only limited by the system's ephemeral port range.
+    + Fix regression in 23.02.2 that caused slurmctld -R to crash on startup if
+      a node features plugin is configured.
+    + Fix and prevent reoccurring reservations from overlapping.
+    + `job_container/tmpfs` - Avoid attempts to share BasePath between nodes.
+    + With `CR_Cpu_Memory`, fix node selection for jobs that request gres and
+      `--mem-per-cpu`.
+    + Fix a regression from 22.05.7 in which some jobs were allocated too few
+      nodes, thus overcommitting cpus to some tasks.
+    + Fix a job being stuck in the completing state if the job ends while the
+      primary controller is down or unresponsive and the backup controller has
+      not yet taken over.
+    + Fix `slurmctld` segfault when a node registers with a configured
+      `CpuSpecList` while `slurmctld` configuration has the node without
+      `CpuSpecList`.
+    + Fix cloud nodes getting stuck in `POWERED_DOWN+NO_RESPOND` state after
+      not registering by `ResumeTimeout`.
+    + `slurmstepd` - Avoid cleanup of `config.json`-less containers spooldir
+      getting skipped.
+    + Fix scontrol segfault when 'completing' command requested repeatedly in
+      interactive mode.
+    + Properly handle a race condition between `bind()` and `listen()` calls
+      in the network stack when running with `SrunPortRange` set.
+    + Federation - Fix revoked jobs being returned regardless of the
+      `-a`/`--all` option for privileged users.
+    + Federation - Fix canceling pending federated jobs from non-origin
+      clusters which could leave federated jobs orphaned from the origin
+      cluster.
+    + Fix sinfo segfault when printing multiple clusters with `--noheader`
+      option.
+    + Federation - Fix clusters not syncing if clusters are added to a
+      federation before they have registered with the dbd.
+    + `node_features/helpers` - Fix node selection for jobs requesting
+      changeable
+      features with the '`|`' operator, which could prevent jobs from
+      running on some valid nodes.
+    + `node_features/helpers` - Fix inconsistent handling of '`&`' and '`|`',
+      where an AND'd feature was sometimes AND'd to all sets of features
+      instead of just the current set. E.g. "`foo|bar&baz`" was interpreted
+      as `{foo,baz}` or `{bar,baz}` instead of how it is documented:
+      "`{foo} or {bar,baz}`".
+    + Fix job accounting so that when a job is requeued its allocated node
+      count is cleared. After the requeue, sacct will correctly show that
+      the job has 0 `AllocNodes` while it is pending or if it is canceled
+      before restarting.
+    + `sacct` - `AllocCPUS` now correctly shows 0 if a job has not yet
+      received an allocation or if the job was canceled before getting one.
+    + Fix Intel OneAPI autodetect: detect the `/dev/dri/renderD[0-9]+` GPUs,
+      and do not detect `/dev/dri/card[0-9]+`.
+    + Fix node selection for jobs that request `--gpus` and a number of
+      tasks fewer than GPUs, which resulted in incorrectly rejecting these jobs.
+    + Remove `MYSQL_OPT_RECONNECT` completely.
+    + Fix cloud nodes in `POWERING_UP` state disappearing (getting set
+      to `FUTURE`)
+      when an `scontrol reconfigure` happens.
+    + `openapi/dbv0.0.39` - Avoid assert / segfault on missing coordinators
+      list.
+    + `slurmrestd` - Correct memory leak while parsing OpenAPI specification
+      templates with server overrides.
+    + Fix overwriting user node reason with system message.
+    + Prevent deadlock when `rpc_queue` is enabled.
+    + `slurmrestd` - Correct OpenAPI specification generation bug where
+      fields with overlapping parent paths would not get generated.
+    + Fix memory leak as a result of a partition info query.
+    + Fix memory leak as a result of a job info query.
+    + For step allocations, fix `--gres=none` sometimes not ignoring gres
+      from the job.
+    + Fix `--exclusive` jobs incorrectly gang-scheduling where they shouldn't.
+    + Fix allocations with `CR_SOCKET`, gres not assigned to a specific
+      socket, and block core distribution potentially allocating more sockets
+      than required.
+    + Revert a change in 23.02.3 where Slurm would kill a script's process
+      group as soon as the script ended instead of waiting as long as any
+      process in that process group held the stdout/stderr file descriptors
+      open. That change broke some scripts that relied on the previous
+      behavior. Setting time limits for scripts (such as
+      `PrologEpilogTimeout`) is strongly encouraged to avoid Slurm waiting
+      indefinitely for scripts to finish.
+    + Fix `slurmdbd -R` not returning an error under certain conditions.
+    + `slurmdbd` - Avoid potential NULL pointer dereference in the mysql
+      plugin.
+    + Fix regression in 23.02.3 which broke X11 forwarding for hosts when
+      MUNGE sends a localhost address in the encode host field. This is caused
+      when the node hostname is mapped to 127.0.0.1 (or similar) in
+      `/etc/hosts`.
+    + `openapi/[db]v0.0.39` - Fix memory leak on parsing error.
+    + `data_parser/v0.0.39` - Fix updating qos for associations.
+    + `openapi/dbv0.0.39` - Fix updating values for associations with null
+      users.
+    + Fix minor memory leak with `--tres-per-task` and licenses.
+    + Fix cyclic socket cpu distribution for tasks in a step where
+      `--cpus-per-task` < usable threads per core.
+    + `slurmrestd` - For '`GET /slurm/v0.0.39/node[s]`', change format of
+      node's energy field "`current_watts`" to a dictionary to account for
+      unset value instead of dumping 4294967294.
+    + `slurmrestd` - For '`GET /slurm/v0.0.39/qos`', change format of QOS's
+      field "priority" to a dictionary to account for unset value instead of
+      dumping 4294967294.
+    + `slurmrestd` - For '`GET /slurm/v0.0.39/job[s]`', the 'return code'
+      field in `v0.0.39_job_exit_code` will be set to -127 instead of
+      being left unset where the job does not have a relevant return code.
+  * Other Changes:
+    + Remove --uid / --gid options from salloc and srun commands. These options
+      did not work correctly since the CVE-2022-29500 fix in combination with
+      some changes made in 23.02.0.
+    + Add the `JobId` to `debug()` messages indicating when
+      `cpus_per_task/mem_per_cpu` or `pn_min_cpus` are being automatically
+      adjusted.
+    + Change the log message warning for rate limited users from verbose to
+      info.
+    + `slurmstepd` - Cleanup per task generated environment for containers in
+      spooldir.
+    + Format batch, extern, interactive, and pending step ids into strings that
+      are human readable.
+    + `slurmrestd` - Reduce memory usage when printing out job CPU frequency.
+    + `data_parser/v0.0.39` - Add `required/memory_per_cpu` and
+      `required/memory_per_node` to `sacct --json` and `sacct --yaml` and
+      '`GET /slurmdb/v0.0.39/jobs`' from slurmrestd.
+    + `gpu/oneapi` - Store cores correctly so CPU affinity is tracked.
+    + Allow `slurmdbd -R` to work if the root assoc id is not 1.
+    + Limit periodic node registrations to 50 instead of the full `TreeWidth`.
+      Since unresolvable `cloud/dynamic` nodes must disable fanout by setting
+      `TreeWidth` to a large number, this would cause all nodes to register at
+      once.
+
 -------------------------------------------------------------------
 Mon Aug 21 09:43:08 UTC 2023 - Christian Goll <cgoll@suse.com>
 
@@ -18,7 +18,7 @@
 
 # Check file META in sources: update so_version to (API_CURRENT - API_AGE)
 %define so_version 39
-%define ver 23.02.4
+%define ver 23.02.5
 %define _ver _23_02
 #%%define rc_v 0rc1
 %define dl_ver %{ver}
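The `node_features/helpers` changelog entry above says that "`foo|bar&baz`" is documented to mean "`{foo} or {bar,baz}`", i.e. '`&`' binds more tightly than '`|`'. The following is a minimal Python sketch of that documented reading only. It is not Slurm's parser: the function names `parse_feature_sets` and `node_matches` are hypothetical, and the sketch assumes a plain expression without parentheses, brackets, or counts.

```python
from typing import Iterable, List, Set


def parse_feature_sets(expr: str) -> List[Set[str]]:
    """Split on '|' first and on '&' second, so '&' binds tighter than '|'.

    Hypothetical helper for illustration; handles only plain names
    joined by '&' and '|'.
    """
    return [set(term.split("&")) for term in expr.split("|")]


def node_matches(expr: str, node_features: Iterable[str]) -> bool:
    """Return True if the node satisfies at least one alternative set."""
    available = set(node_features)
    return any(required <= available for required in parse_feature_sets(expr))


if __name__ == "__main__":
    # Documented reading: {foo} or {bar, baz}
    print(parse_feature_sets("foo|bar&baz"))
    # A node with only "foo" qualifies; a node with only "bar" does not.
    print(node_matches("foo|bar&baz", ["foo"]))
    print(node_matches("foo|bar&baz", ["bar"]))
```

Under the incorrect interpretation described in the changelog entry, `baz` would be required in every alternative, so a node offering only `foo` would be rejected. Under the documented reading sketched here it is accepted.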