forked from pool/slurm

Accepting request 1111943 from network:cluster

- Updated to version 23.02.5 with the following changes:
  * Bug Fixes:
    + Revert a change in 23.02 where `SLURM_NTASKS` was no longer set in the
      job's environment when `--ntasks-per-node` was requested.
      The method by which it is set, however, is different and should be more
      accurate in more situations.
    + Change pmi2 plugin to honor the `SrunPortRange` option. This matches the
      new behavior of the pmix plugin in 23.02.0. Note that neither of these
      plugins makes use of the `MpiParams=ports=` option, and previously
      were only limited by the system's ephemeral port range.
    + Fix regression in 23.02.2 that caused slurmctld -R to crash on startup if
      a node features plugin is configured.
    + Fix and prevent reoccurring reservations from overlapping.
    + `job_container/tmpfs` - Avoid attempts to share BasePath between nodes.
    + With `CR_Cpu_Memory`, fix node selection for jobs that request gres and
      `--mem-per-cpu`.
    + Fix a regression from 22.05.7 in which some jobs were allocated too few
      nodes, thus overcommitting cpus to some tasks.
    + Fix a job being stuck in the completing state if the job ends while the
      primary controller is down or unresponsive and the backup controller has
      not yet taken over.
    + Fix `slurmctld` segfault when a node registers with a configured
      `CpuSpecList` while `slurmctld` configuration has the node without
      `CpuSpecList`.
    + Fix cloud nodes getting stuck in `POWERED_DOWN+NO_RESPOND` state after
      not registering by `ResumeTimeout`.
    + `slurmstepd` - Avoid cleanup of `config.json-less` containers spooldir
      getting skipped.
    + Fix scontrol segfault when 'completing' command requested repeatedly in
      interactive mode.
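
The `SrunPortRange` item above is about restricting listening sockets to a configured range instead of letting the kernel pick any ephemeral port. As a rough illustration only (this is not Slurm's code; the function name `listen_in_range` and the loopback default are hypothetical), a Python sketch of that idea might look like this:

```python
import socket


def listen_in_range(first_port, last_port, host="127.0.0.1"):
    """Return a listening TCP socket bound to the first free port in
    [first_port, last_port]; raise OSError if none is available."""
    for port in range(first_port, last_port + 1):
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.bind((host, port))
            sock.listen()
        except OSError:
            # Port is taken (or was grabbed before listen()): try the next one.
            sock.close()
            continue
        return sock
    raise OSError(f"no free port in {first_port}-{last_port}")
```

Treating a failure in either `bind()` or `listen()` as "try the next port" keeps the sketch simple; a real implementation would likely distinguish error codes.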

OBS-URL: https://build.opensuse.org/request/show/1111943
OBS-URL: https://build.opensuse.org/package/show/openSUSE:Factory/slurm?expand=0&rev=95
Dominique Leuenberger 2023-09-20 11:26:46 +00:00 committed by Git OBS Bridge
commit 12bf38b1d0
4 changed files with 145 additions and 7 deletions

@@ -1,3 +0,0 @@
version https://git-lfs.github.com/spec/v1
oid sha256:6634f57991c6a1a7d140c4de2f50a3e66dd06abef6ef83a8571f6eaa2fe048c7
size 7259848

slurm-23.02.5.tar.bz2 Normal file

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:7620f1dd1134d14dff402a9127d5a36c340d7a2b69b55f67d8a44b3b8681a59d
size 7274119
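
The tarball is stored as a Git LFS pointer: three `key value` lines giving the spec version, a SHA-256 object id, and the size in bytes. A minimal Python sketch of checking a downloaded file against such a pointer follows. The helper names `parse_lfs_pointer` and `matches_pointer` are made up for illustration and are not part of Git LFS or the OBS tooling.

```python
import hashlib
import os


def parse_lfs_pointer(text):
    """Parse the 'key value' lines of a Git LFS pointer into a dict."""
    fields = {}
    for line in text.splitlines():
        if line.strip():
            key, _, value = line.partition(" ")
            fields[key] = value.strip()
    return fields


def matches_pointer(path, pointer_text):
    """Return True if the file at path has the size and SHA-256 digest
    recorded in the pointer."""
    fields = parse_lfs_pointer(pointer_text)
    algo, _, expected = fields["oid"].partition(":")
    if algo != "sha256":
        raise ValueError(f"unsupported oid algorithm: {algo}")
    # Cheap size check first, then hash the file in 1 MiB chunks.
    if os.path.getsize(path) != int(fields["size"]):
        return False
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected
```

With the pointer text shown above saved in a string, `matches_pointer("slurm-23.02.5.tar.bz2", pointer_text)` would report whether a local copy of the tarball corresponds to this commit.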

@@ -1,3 +1,141 @@
-------------------------------------------------------------------
Mon Sep 18 05:23:19 UTC 2023 - Egbert Eich <eich@suse.com>
- Updated to version 23.02.5 with the following changes:
* Bug Fixes:
+ Revert a change in 23.02 where `SLURM_NTASKS` was no longer set in the
job's environment when `--ntasks-per-node` was requested.
The method by which it is set, however, is different and should be more
accurate in more situations.
+ Change pmi2 plugin to honor the `SrunPortRange` option. This matches the
new behavior of the pmix plugin in 23.02.0. Note that neither of these
plugins makes use of the `MpiParams=ports=` option, and previously
were only limited by the system's ephemeral port range.
+ Fix regression in 23.02.2 that caused slurmctld -R to crash on startup if
a node features plugin is configured.
+ Fix and prevent reoccurring reservations from overlapping.
+ `job_container/tmpfs` - Avoid attempts to share BasePath between nodes.
+ With `CR_Cpu_Memory`, fix node selection for jobs that request gres and
`--mem-per-cpu`.
+ Fix a regression from 22.05.7 in which some jobs were allocated too few
nodes, thus overcommitting cpus to some tasks.
+ Fix a job being stuck in the completing state if the job ends while the
primary controller is down or unresponsive and the backup controller has
not yet taken over.
+ Fix `slurmctld` segfault when a node registers with a configured
`CpuSpecList` while `slurmctld` configuration has the node without
`CpuSpecList`.
+ Fix cloud nodes getting stuck in `POWERED_DOWN+NO_RESPOND` state after
not registering by `ResumeTimeout`.
+ `slurmstepd` - Avoid cleanup of `config.json-less` containers spooldir
getting skipped.
+ Fix scontrol segfault when 'completing' command requested repeatedly in
interactive mode.
+ Properly handle a race condition between `bind()` and `listen()` calls
in the network stack when running with SrunPortRange set.
+ Federation - Fix revoked jobs being returned regardless of the
`-a`/`--all` option for privileged users.
+ Federation - Fix canceling pending federated jobs from non-origin
clusters which could leave federated jobs orphaned from the origin
cluster.
+ Fix sinfo segfault when printing multiple clusters with `--noheader`
option.
+ Federation - fix clusters not syncing if clusters are added to a
federation before they have registered with the dbd.
+ `node_features/helpers` - Fix node selection for jobs requesting
changeable features with the `|` operator, which could prevent jobs from
running on some valid nodes.
+ `node_features/helpers` - Fix inconsistent handling of `&` and `|`,
where an AND'd feature was sometimes AND'd to all sets of features
instead of just the current set. E.g. `foo|bar&baz` was interpreted
as `{foo,baz}` or `{bar,baz}` instead of how it is documented:
`{foo} or {bar,baz}`.
+ Fix job accounting so that when a job is requeued its allocated node
count is cleared. After the requeue, sacct will correctly show that
the job has 0 `AllocNodes` while it is pending or if it is canceled
before restarting.
+ `sacct` - `AllocCPUS` now correctly shows 0 if a job has not yet
received an allocation or if the job was canceled before getting one.
+ Fix intel OneAPI autodetect: detect the `/dev/dri/renderD[0-9]+` GPUs,
and do not detect `/dev/dri/card[0-9]+`.
+ Fix node selection for jobs that request `--gpus` and a number of
tasks fewer than GPUs, which resulted in incorrectly rejecting these
jobs.
+ Remove `MYSQL_OPT_RECONNECT` completely.
+ Fix cloud nodes in `POWERING_UP` state disappearing (getting set
to `FUTURE`) when an `scontrol reconfigure` happens.
+ `openapi/dbv0.0.39` - Avoid assert / segfault on missing coordinators
list.
+ `slurmrestd` - Correct memory leak while parsing OpenAPI specification
templates with server overrides.
+ Fix overwriting user node reason with system message.
+ Prevent deadlock when `rpc_queue` is enabled.
+ `slurmrestd` - Correct OpenAPI specification generation bug where
fields with overlapping parent paths would not get generated.
+ Fix memory leak as a result of a partition info query.
+ Fix memory leak as a result of a job info query.
+ For step allocations, fix `--gres=none` sometimes not ignoring gres
from the job.
+ Fix `--exclusive` jobs incorrectly gang-scheduling where they shouldn't.
+ Fix allocations with `CR_SOCKET`, gres not assigned to a specific
socket, and block core distribution potentially allocating more sockets
than required.
+ Revert a change in 23.02.3 where Slurm would kill a script's process
group as soon as the script ended instead of waiting as long as any
process in that process group held the stdout/stderr file descriptors
open. That change broke some scripts that relied on the previous
behavior. Setting time limits for scripts (such as
`PrologEpilogTimeout`) is strongly encouraged to avoid Slurm waiting
indefinitely for scripts to finish.
+ Fix `slurmdbd -R` not returning an error under certain conditions.
+ `slurmdbd` - Avoid potential NULL pointer dereference in the mysql
plugin.
+ Fix regression in 23.02.3 which broke X11 forwarding for hosts when
MUNGE sends a localhost address in the encode host field. This is caused
when the node hostname is mapped to 127.0.0.1 (or similar) in
`/etc/hosts`.
+ `openapi/[db]v0.0.39` - fix memory leak on parsing error.
+ `data_parser/v0.0.39` - fix updating qos for associations.
+ `openapi/dbv0.0.39` - fix updating values for associations with null
users.
+ Fix minor memory leak with `--tres-per-task` and licenses.
+ Fix cyclic socket cpu distribution for tasks in a step where
`--cpus-per-task` < usable threads per core.
+ `slurmrestd` - For `GET /slurm/v0.0.39/node[s]`, change format of
node's energy field `current_watts` to a dictionary to account for
unset value instead of dumping 4294967294.
+ `slurmrestd` - For `GET /slurm/v0.0.39/qos`, change format of QOS's
field "priority" to a dictionary to account for unset value instead of
dumping 4294967294.
+ `slurmrestd` - For `GET /slurm/v0.0.39/job[s]`, the 'return code'
field in `v0.0.39_job_exit_code` will be set to -127 instead of
being left unset where the job does not have a relevant return code.
* Other Changes:
+ Remove --uid / --gid options from salloc and srun commands. These options
did not work correctly since the CVE-2022-29500 fix in combination with
some changes made in 23.02.0.
+ Add the `JobId` to `debug()` messages indicating when
`cpus_per_task/mem_per_cpu` or `pn_min_cpus` are being automatically
adjusted.
+ Change the log message warning for rate limited users from verbose to
info.
+ `slurmstepd` - Cleanup per task generated environment for containers in
spooldir.
+ Format batch, extern, interactive, and pending step ids into strings that
are human readable.
+ `slurmrestd` - Reduce memory usage when printing out job CPU frequency.
+ `data_parser/v0.0.39` - Add `required/memory_per_cpu` and
`required/memory_per_node` to `sacct --json` and `sacct --yaml` and
`GET /slurmdb/v0.0.39/jobs` from slurmrestd.
+ `gpu/oneapi` - Store cores correctly so CPU affinity is tracked.
+ Allow `slurmdbd -R` to work if the root assoc id is not 1.
+ Limit periodic node registrations to 50 instead of the full `TreeWidth`.
Since unresolvable `cloud/dynamic` nodes must disable fanout by setting
`TreeWidth` to a large number, this would cause all nodes to register at
once.
-------------------------------------------------------------------
Mon Aug 21 09:43:08 UTC 2023 - Christian Goll <cgoll@suse.com>
@@ -19,7 +157,7 @@ Mon Aug 21 09:43:08 UTC 2023 - Christian Goll <cgoll@suse.com>
+ Fix regression in 23.02.2 when checking gres state on `slurmctld`
startup or reconfigure. Gres changes in the configuration were not
updated on slurmctld startup. On startup or reconfigure, these messages
-      were present in the log: `"error: Attempt to change gres/gpu Count`".
+      were present in the log: `error: Attempt to change gres/gpu Count`.
+ Fix potential double count of gres when dealing with limits.
+ Fix `slurmstepd` segfault when `ContainerPath` is not set in `oci.conf`
+ Fixed an issue where jobs requesting licenses were incorrectly rejected.
@@ -163,7 +301,7 @@ Mon Aug 21 09:43:08 UTC 2023 - Christian Goll <cgoll@suse.com>
lookups.
+ `sacct` - when printing `PLANNED` time, use end time instead of start
time for jobs cancelled before they started.
-    + Hold the job with "`(Reservation ... invalid)`" state reason if the
+    + Hold the job with `(Reservation ... invalid)` state reason if the
reservation is not usable by the job.
+ `sbatch` - Added new `--export=NIL` option.
- Removed:
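
The `node_features/helpers` entry in the 23.02.5 changelog states that `foo|bar&baz` is documented to mean `{foo} or {bar,baz}`, that is, `&` binds tighter than `|`. A small Python sketch of that grouping follows, assuming only these two operators and no parentheses or counts. The function names `feature_sets` and `node_matches` are hypothetical, and this is not Slurm's parser.

```python
def feature_sets(expr):
    """Split a feature expression into alternative sets: '|' separates
    alternatives, '&' joins features within one alternative."""
    sets = []
    for alternative in expr.split("|"):
        names = {n.strip() for n in alternative.split("&") if n.strip()}
        if names:
            sets.append(names)
    return sets


def node_matches(expr, node_features):
    """True if the node offers every feature of at least one alternative."""
    available = set(node_features)
    return any(alt <= available for alt in feature_sets(expr))
```

Under this reading a node that only offers `foo` satisfies `foo|bar&baz`, whereas the inconsistent handling described in the changelog could also require `baz` for the `foo` alternative.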

@@ -18,7 +18,7 @@
# Check file META in sources: update so_version to (API_CURRENT - API_AGE)
%define so_version 39
-%define ver 23.02.4
+%define ver 23.02.5
%define _ver _23_02
#%%define rc_v 0rc1
%define dl_ver %{ver}
@@ -1321,7 +1321,7 @@ rm -rf /srv/slurm-testsuite/src /srv/slurm-testsuite/testsuite \
%{_mandir}/man5/cgroup.*
%{_mandir}/man5/gres.*
%{_mandir}/man5/helpers.*
-%{_mandir}/man5/nonstop.conf.5.*
+#%%{_mandir}/man5/nonstop.conf.5.*
%{_mandir}/man5/oci.conf.5.gz
%{_mandir}/man5/topology.*
%{_mandir}/man5/knl.conf.5.*