Accepting request 1111943 from network:cluster

- Updated to version 23.02.5 with the following changes:
  * Bug Fixes:
    + Revert a change in 23.02 where `SLURM_NTASKS` was no longer set in the
      job's environment when `--ntasks-per-node` was requested.
      The method by which it is set, however, is different and should be more
      accurate in more situations.
    + Change pmi2 plugin to honor the `SrunPortRange` option. This matches the
      new behavior of the pmix plugin in 23.02.0. Note that neither of these
      plugins makes use of the `MpiParams=ports=` option, and previously
      were only limited by the system's ephemeral port range.
    + Fix regression in 23.02.2 that caused slurmctld -R to crash on startup if
      a node features plugin is configured.
    + Fix and prevent reoccurring reservations from overlapping.
    + `job_container/tmpfs` - Avoid attempts to share BasePath between nodes.
    + With `CR_Cpu_Memory`, fix node selection for jobs that request gres and
      `--mem-per-cpu`.
    + Fix a regression from 22.05.7 in which some jobs were allocated too few
      nodes, thus overcommitting cpus to some tasks.
    + Fix a job being stuck in the completing state if the job ends while the
      primary controller is down or unresponsive and the backup controller has
      not yet taken over.
    + Fix `slurmctld` segfault when a node registers with a configured
      `CpuSpecList` while `slurmctld` configuration has the node without
      `CpuSpecList`.
    + Fix cloud nodes getting stuck in `POWERED_DOWN+NO_RESPOND` state after
      not registering by `ResumeTimeout`.
    + `slurmstepd` - Avoid skipping cleanup of the spooldir for containers
      without a `config.json`.
    + Fix scontrol segfault when 'completing' command requested repeatedly in
      interactive mode.

OBS-URL: https://build.opensuse.org/request/show/1111943
OBS-URL: https://build.opensuse.org/package/show/openSUSE:Factory/slurm?expand=0&rev=95
Dominique Leuenberger 2023-09-20 11:26:46 +00:00 committed by Git OBS Bridge
commit 12bf38b1d0
4 changed files with 145 additions and 7 deletions


@ -1,3 +0,0 @@
version https://git-lfs.github.com/spec/v1
oid sha256:6634f57991c6a1a7d140c4de2f50a3e66dd06abef6ef83a8571f6eaa2fe048c7
size 7259848

slurm-23.02.5.tar.bz2 Normal file

@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:7620f1dd1134d14dff402a9127d5a36c340d7a2b69b55f67d8a44b3b8681a59d
size 7274119
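
The pointer above records only the SHA-256 digest and byte size of the real tarball. As a minimal sketch, assuming the tarball has been fetched separately, a downloaded file could be checked against the pointer roughly as follows. The helper names `parse_lfs_pointer` and `matches_pointer` are hypothetical and not part of Git LFS or OBS tooling.

```python
import hashlib


def parse_lfs_pointer(text):
    """Parse a Git LFS pointer file into a dict with 'version', 'oid', 'size'."""
    fields = {}
    for line in text.strip().splitlines():
        # Each pointer line has the form "<key> <value>".
        key, _, value = line.partition(" ")
        fields[key] = value
    algo, _, digest = fields["oid"].partition(":")
    if algo != "sha256":
        raise ValueError(f"unsupported oid algorithm: {algo}")
    return {"version": fields["version"], "oid": digest, "size": int(fields["size"])}


def matches_pointer(pointer, data):
    """Return True if `data` has the size and SHA-256 digest the pointer records."""
    return (
        len(data) == pointer["size"]
        and hashlib.sha256(data).hexdigest() == pointer["oid"]
    )


# Pointer text copied from the diff above.
POINTER_TEXT = """\
version https://git-lfs.github.com/spec/v1
oid sha256:7620f1dd1134d14dff402a9127d5a36c340d7a2b69b55f67d8a44b3b8681a59d
size 7274119
"""
pointer = parse_lfs_pointer(POINTER_TEXT)
```

In practice the bytes passed to `matches_pointer` would come from reading `slurm-23.02.5.tar.bz2` in binary mode; for a large file, hashing in chunks would be the more memory-friendly choice.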


@ -1,3 +1,141 @@
-------------------------------------------------------------------
Mon Sep 18 05:23:19 UTC 2023 - Egbert Eich <eich@suse.com>
- Updated to version 23.02.5 with the following changes:
* Bug Fixes:
+ Revert a change in 23.02 where `SLURM_NTASKS` was no longer set in the
job's environment when `--ntasks-per-node` was requested.
The method by which it is set, however, is different and should be more
accurate in more situations.
+ Change pmi2 plugin to honor the `SrunPortRange` option. This matches the
new behavior of the pmix plugin in 23.02.0. Note that neither of these
plugins makes use of the `MpiParams=ports=` option, and previously
were only limited by the system's ephemeral port range.
+ Fix regression in 23.02.2 that caused slurmctld -R to crash on startup if
a node features plugin is configured.
+ Fix and prevent reoccurring reservations from overlapping.
+ `job_container/tmpfs` - Avoid attempts to share BasePath between nodes.
+ With `CR_Cpu_Memory`, fix node selection for jobs that request gres and
`--mem-per-cpu`.
+ Fix a regression from 22.05.7 in which some jobs were allocated too few
nodes, thus overcommitting cpus to some tasks.
+ Fix a job being stuck in the completing state if the job ends while the
primary controller is down or unresponsive and the backup controller has
not yet taken over.
+ Fix `slurmctld` segfault when a node registers with a configured
`CpuSpecList` while `slurmctld` configuration has the node without
`CpuSpecList`.
+ Fix cloud nodes getting stuck in `POWERED_DOWN+NO_RESPOND` state after
not registering by `ResumeTimeout`.
+ `slurmstepd` - Avoid skipping cleanup of the spooldir for containers
without a `config.json`.
+ Fix scontrol segfault when 'completing' command requested repeatedly in
interactive mode.
+ Properly handle a race condition between `bind()` and `listen()` calls
in the network stack when running with SrunPortRange set.
+ Federation - Fix revoked jobs being returned regardless of the
`-a`/`--all` option for privileged users.
+ Federation - Fix canceling pending federated jobs from non-origin
clusters which could leave federated jobs orphaned from the origin
cluster.
+ Fix sinfo segfault when printing multiple clusters with `--noheader`
option.
+ Federation - fix clusters not syncing if clusters are added to a
federation before they have registered with the dbd.
+ `node_features/helpers` - Fix node selection for jobs requesting
changeable features with the `|` operator, which could prevent jobs from
running on some valid nodes.
+ `node_features/helpers` - Fix inconsistent handling of `&` and `|`,
where an AND'd feature was sometimes AND'd to all sets of features
instead of just the current set. E.g. `foo|bar&baz` was interpreted
as `{foo,baz}` or `{bar,baz}` instead of how it is documented:
`{foo} or {bar,baz}`.
+ Fix job accounting so that when a job is requeued its allocated node
count is cleared. After the requeue, sacct will correctly show that
the job has 0 `AllocNodes` while it is pending or if it is canceled
before restarting.
+ `sacct` - `AllocCPUS` now correctly shows 0 if a job has not yet
received an allocation or if the job was canceled before getting one.
+ Fix intel OneAPI autodetect: detect the `/dev/dri/renderD[0-9]+` GPUs,
and do not detect `/dev/dri/card[0-9]+`.
+ Fix node selection for jobs that request `--gpus` and a number of
tasks fewer than GPUs, which resulted in incorrectly rejecting these
jobs.
+ Remove `MYSQL_OPT_RECONNECT` completely.
+ Fix cloud nodes in `POWERING_UP` state disappearing (getting set to
`FUTURE`) when an `scontrol reconfigure` happens.
+ `openapi/dbv0.0.39` - Avoid assert / segfault on missing coordinators
list.
+ `slurmrestd` - Correct memory leak while parsing OpenAPI specification
templates with server overrides.
+ Fix overwriting user node reason with system message.
+ Prevent deadlock when `rpc_queue` is enabled.
+ `slurmrestd` - Correct OpenAPI specification generation bug where
fields with overlapping parent paths would not get generated.
+ Fix memory leak as a result of a partition info query.
+ Fix memory leak as a result of a job info query.
+ For step allocations, fix `--gres=none` sometimes not ignoring gres
from the job.
+ Fix `--exclusive` jobs incorrectly gang-scheduling where they shouldn't.
+ Fix allocations with `CR_SOCKET`, gres not assigned to a specific
socket, and block core distribution potentially allocating more sockets
than required.
+ Revert a change in 23.02.3 where Slurm would kill a script's process
group as soon as the script ended instead of waiting as long as any
process in that process group held the stdout/stderr file descriptors
open. That change broke some scripts that relied on the previous
behavior. Setting time limits for scripts (such as
`PrologEpilogTimeout`) is strongly encouraged to avoid Slurm waiting
indefinitely for scripts to finish.
+ Fix `slurmdbd -R` not returning an error under certain conditions.
+ `slurmdbd` - Avoid potential NULL pointer dereference in the mysql
plugin.
+ Fix regression in 23.02.3 which broke X11 forwarding for hosts when
MUNGE sends a localhost address in the encode host field. This is caused
when the node hostname is mapped to 127.0.0.1 (or similar) in
`/etc/hosts`.
+ `openapi/[db]v0.0.39` - fix memory leak on parsing error.
+ `data_parser/v0.0.39` - fix updating qos for associations.
+ `openapi/dbv0.0.39` - fix updating values for associations with null
users.
+ Fix minor memory leak with `--tres-per-task` and licenses.
+ Fix cyclic socket cpu distribution for tasks in a step where
`--cpus-per-task` < usable threads per core.
+ `slurmrestd` - For `GET /slurm/v0.0.39/node[s]`, change format of
node's energy field `current_watts` to a dictionary to account for
unset value instead of dumping 4294967294.
+ `slurmrestd` - For `GET /slurm/v0.0.39/qos`, change format of QOS's
field "priority" to a dictionary to account for unset value instead of
dumping 4294967294.
+ `slurmrestd` - For `GET /slurm/v0.0.39/job[s]`, the 'return code'
field in `v0.0.39_job_exit_code` will be set to -127 instead of
being left unset where the job does not have a relevant return code.
* Other Changes:
+ Remove --uid / --gid options from salloc and srun commands. These options
did not work correctly since the CVE-2022-29500 fix in combination with
some changes made in 23.02.0.
+ Add the `JobId` to `debug()` messages indicating when
`cpus_per_task/mem_per_cpu` or `pn_min_cpus` are being automatically
adjusted.
+ Change the log message warning for rate limited users from verbose to
info.
+ `slurmstepd` - Cleanup per task generated environment for containers in
spooldir.
+ Format batch, extern, interactive, and pending step ids into strings that
are human readable.
+ `slurmrestd` - Reduce memory usage when printing out job CPU frequency.
+ `data_parser/v0.0.39` - Add `required/memory_per_cpu` and
`required/memory_per_node` to `sacct --json` and `sacct --yaml` and
`GET /slurmdb/v0.0.39/jobs` from slurmrestd.
+ `gpu/oneapi` - Store cores correctly so CPU affinity is tracked.
+ Allow `slurmdbd -R` to work if the root assoc id is not 1.
+ Limit periodic node registrations to 50 instead of the full `TreeWidth`.
Since unresolvable `cloud/dynamic` nodes must disable fanout by setting
`TreeWidth` to a large number, this would cause all nodes to register at
once.
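
The `node_features/helpers` entry above documents `foo|bar&baz` as `{foo} or {bar,baz}`, meaning `&` binds more tightly than `|`. A minimal sketch of that documented reading follows. It is not Slurm's parser: the function names are hypothetical, and parentheses, counts, and any other constraint syntax are deliberately ignored.

```python
def parse_feature_sets(expr):
    """Split on '|' first, then on '&', so that '&' binds tighter than '|'.

    Returns a list of alternative feature sets; a node is acceptable if it
    provides every feature in at least one of them.
    """
    return [set(alternative.split("&")) for alternative in expr.split("|")]


def node_satisfies(expr, node_features):
    """Return True if the node's features cover at least one alternative set."""
    available = set(node_features)
    return any(required <= available for required in parse_feature_sets(expr))
```

Under this reading a node that offers only `foo` satisfies `foo|bar&baz`, while a node that offers only `bar` does not. The behavior described as buggy in the entry would instead have required `baz` in both alternatives.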
 -------------------------------------------------------------------
 Mon Aug 21 09:43:08 UTC 2023 - Christian Goll <cgoll@suse.com>
@ -19,7 +157,7 @@ Mon Aug 21 09:43:08 UTC 2023 - Christian Goll <cgoll@suse.com>
 + Fix regression in 23.02.2 when checking gres state on `slurmctld`
 startup or reconfigure. Gres changes in the configuration were not
 updated on slurmctld startup. On startup or reconfigure, these messages
-were present in the log: `"error: Attempt to change gres/gpu Count`".
+were present in the log: `error: Attempt to change gres/gpu Count`.
 + Fix potential double count of gres when dealing with limits.
 + Fix `slurmstepd` segfault when `ContainerPath` is not set in `oci.conf`
 + Fixed an issue where jobs requesting licenses were incorrectly rejected.
@ -163,7 +301,7 @@ Mon Aug 21 09:43:08 UTC 2023 - Christian Goll <cgoll@suse.com>
 lookups.
 + `sacct` - when printing `PLANNED` time, use end time instead of start
 time for jobs cancelled before they started.
-+ Hold the job with "`(Reservation ... invalid)`" state reason if the
++ Hold the job with `(Reservation ... invalid)` state reason if the
 reservation is not usable by the job.
 + `sbatch` - Added new `--export=NIL` option.
- Removed: - Removed:
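
Two `slurmrestd` entries in the 23.02.5 changelog above replace a dumped 4294967294 with a dictionary that can express an unset value. That number equals the 32-bit value `0xFFFFFFFE`. The sketch below illustrates the general idea of swapping a sentinel for an explicit flag. The key names `set` and `number` and both helper functions are assumptions made for illustration, not the confirmed `v0.0.39` schema.

```python
NO_VAL = 0xFFFFFFFE  # 4294967294, the raw value the old format dumped


def encode_optional(raw):
    """Wrap a raw 32-bit field so that 'unset' is explicit (hypothetical keys)."""
    if raw == NO_VAL:
        return {"set": False, "number": 0}
    return {"set": True, "number": raw}


def decode_optional(field):
    """Return the number if the field is marked as set, otherwise None."""
    return field["number"] if field.get("set") else None
```

A client written this way no longer has to know the sentinel: it checks the flag and treats `None` as "no value reported".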


@ -18,7 +18,7 @@
 # Check file META in sources: update so_version to (API_CURRENT - API_AGE)
 %define so_version 39
-%define ver 23.02.4
+%define ver 23.02.5
 %define _ver _23_02
 #%%define rc_v 0rc1
 %define dl_ver %{ver}
@ -1321,7 +1321,7 @@ rm -rf /srv/slurm-testsuite/src /srv/slurm-testsuite/testsuite \
 %{_mandir}/man5/cgroup.*
 %{_mandir}/man5/gres.*
 %{_mandir}/man5/helpers.*
-%{_mandir}/man5/nonstop.conf.5.*
+#%%{_mandir}/man5/nonstop.conf.5.*
 %{_mandir}/man5/oci.conf.5.gz
 %{_mandir}/man5/topology.*
 %{_mandir}/man5/knl.conf.5.*
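
The spec file comment in the first hunk says `so_version` should be `API_CURRENT - API_AGE` from the `META` file in the sources. A small sketch of that check follows, assuming `META` contains lines of the form `API_CURRENT: <n>` and `API_AGE: <n>`. Both the line format and the helper name `so_version_from_meta` are assumptions; the real file should be inspected before relying on this.

```python
import re


def so_version_from_meta(meta_text):
    """Compute API_CURRENT - API_AGE from META-style 'KEY: value' lines."""
    values = {}
    for line in meta_text.splitlines():
        match = re.match(r"\s*(API_CURRENT|API_AGE)\s*:\s*(\d+)\s*$", line)
        if match:
            values[match.group(1)] = int(match.group(2))
    # A KeyError here means one of the two fields was not found.
    return values["API_CURRENT"] - values["API_AGE"]
```

Running this against the `META` file of a new upstream tarball and comparing the result with `%define so_version` would catch a forgotten bump during a version update like this one.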