Accepting request 1111943 from network:cluster

- Updated to version 23.02.5 with the following changes:
  * Bug Fixes:
    + Revert a change in 23.02 where `SLURM_NTASKS` was no longer set in the
      job's environment when `--ntasks-per-node` was requested.
      The method by which it is set, however, is different and should be more
      accurate in more situations.
    + Change pmi2 plugin to honor the `SrunPortRange` option. This matches the
      new behavior of the pmix plugin in 23.02.0. Note that neither of these
      plugins makes use of the `MpiParams=ports=` option, and previously
      were only limited by the system's ephemeral port range.
    + Fix regression in 23.02.2 that caused slurmctld -R to crash on startup if
      a node features plugin is configured.
    + Fix and prevent reoccurring reservations from overlapping.
    + `job_container/tmpfs` - Avoid attempts to share BasePath between nodes.
    + With `CR_Cpu_Memory`, fix node selection for jobs that request gres and
      `--mem-per-cpu`.
    + Fix a regression from 22.05.7 in which some jobs were allocated too few
      nodes, thus overcommitting cpus to some tasks.
    + Fix a job being stuck in the completing state if the job ends while the
      primary controller is down or unresponsive and the backup controller has
      not yet taken over.
    + Fix `slurmctld` segfault when a node registers with a configured
      `CpuSpecList` while `slurmctld` configuration has the node without
      `CpuSpecList`.
    + Fix cloud nodes getting stuck in `POWERED_DOWN+NO_RESPOND` state after
      not registering by `ResumeTimeout`.
    + `slurmstepd` - Avoid cleanup of `config.json`-less containers' spooldir
      getting skipped.
    + Fix scontrol segfault when the 'completing' command is requested
      repeatedly in interactive mode.

OBS-URL: https://build.opensuse.org/request/show/1111943
OBS-URL: https://build.opensuse.org/package/show/openSUSE:Factory/slurm?expand=0&rev=95
2023-09-20 11:26:46 +00:00 · 2023-09-20 11:26:46 +00:00 · 12bf38b1d0
commit 12bf38b1d0
parent 3825e9fab0 f0b994e220
4 changed files with 145 additions and 7 deletions
--- a/slurm-23.02.4.tar.bz2
+++ b/slurm-23.02.4.tar.bz2
@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:6634f57991c6a1a7d140c4de2f50a3e66dd06abef6ef83a8571f6eaa2fe048c7
-size 7259848
--- a/slurm-23.02.5.tar.bz2
+++ b/slurm-23.02.5.tar.bz2
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:7620f1dd1134d14dff402a9127d5a36c340d7a2b69b55f67d8a44b3b8681a59d
+size 7274119
--- a/slurm.changes
+++ b/slurm.changes
@@ -1,3 +1,141 @@
+-------------------------------------------------------------------
+Mon Sep 18 05:23:19 UTC 2023 - Egbert Eich <eich@suse.com>
+
+- Updated to version 23.02.5 with the following changes:
+  * Bug Fixes:
+    + Revert a change in 23.02 where `SLURM_NTASKS` was no longer set in the
+      job's environment when `--ntasks-per-node` was requested.
+      The method by which it is set, however, is different and should be more
+      accurate in more situations.
+    + Change pmi2 plugin to honor the `SrunPortRange` option. This matches the
+      new behavior of the pmix plugin in 23.02.0. Note that neither of these
+      plugins makes use of the `MpiParams=ports=` option, and previously
+      were only limited by the system's ephemeral port range.
+    + Fix regression in 23.02.2 that caused slurmctld -R to crash on startup if
+      a node features plugin is configured.
+    + Fix and prevent reoccurring reservations from overlapping.
+    + `job_container/tmpfs` - Avoid attempts to share BasePath between nodes.
+    + With `CR_Cpu_Memory`, fix node selection for jobs that request gres and
+      `--mem-per-cpu`.
+    + Fix a regression from 22.05.7 in which some jobs were allocated too few
+      nodes, thus overcommitting cpus to some tasks.
+    + Fix a job being stuck in the completing state if the job ends while the
+      primary controller is down or unresponsive and the backup controller has
+      not yet taken over.
+    + Fix `slurmctld` segfault when a node registers with a configured
+      `CpuSpecList` while `slurmctld` configuration has the node without
+      `CpuSpecList`.
+    + Fix cloud nodes getting stuck in `POWERED_DOWN+NO_RESPOND` state after
+      not registering by `ResumeTimeout`.
+    + `slurmstepd` - Avoid cleanup of `config.json`-less containers' spooldir
+      getting skipped.
+    + Fix scontrol segfault when the 'completing' command is requested
+      repeatedly in interactive mode.
+    + Properly handle a race condition between `bind()` and `listen()` calls
+      in the network stack when running with SrunPortRange set.
+    + Federation - Fix revoked jobs being returned regardless of the
+      `-a`/`--all` option for privileged users.
+    + Federation - Fix canceling pending federated jobs from non-origin
+      clusters which could leave federated jobs orphaned from the origin
+      cluster.
+    + Fix sinfo segfault when printing multiple clusters with `--noheader`
+      option.
+    + Federation - fix clusters not syncing if clusters are added to a
+      federation before they have registered with the dbd.
+    + `node_features/helpers` - Fix node selection for jobs requesting
+      changeable
+      features with the `|` operator, which could prevent jobs from
+      running on some valid nodes.
+    + `node_features/helpers` - Fix inconsistent handling of `&` and `|`,
+      where an AND'd feature was sometimes AND'd to all sets of features
+      instead of just the current set. E.g. `foo|bar&baz` was interpreted
+      as `{foo,baz}` or `{bar,baz}` instead of how it is documented:
+      `{foo} or {bar,baz}`.
+    + Fix job accounting so that when a job is requeued its allocated node
+      count is cleared. After the requeue, sacct will correctly show that
+      the job has 0 `AllocNodes` while it is pending or if it is canceled
+      before restarting.
+    + `sacct` - `AllocCPUS` now correctly shows 0 if a job has not yet
+      received an allocation or if the job was canceled before getting one.
+    + Fix Intel oneAPI autodetect: detect the `/dev/dri/renderD[0-9]+` GPUs,
+      and do not detect `/dev/dri/card[0-9]+`.
+    + Fix node selection for jobs that request `--gpus` and a number of
+      tasks fewer than GPUs, which resulted in incorrectly rejecting these
+      jobs.
+    + Remove `MYSQL_OPT_RECONNECT` completely.
+    + Fix cloud nodes in `POWERING_UP` state disappearing (getting set
+      to `FUTURE`)
+      when an `scontrol reconfigure` happens.
+    + `openapi/dbv0.0.39` - Avoid assert / segfault on missing coordinators
+      list.
+    + `slurmrestd` - Correct memory leak while parsing OpenAPI specification
+      templates with server overrides.
+    + Fix overwriting user node reason with system message.
+    + Prevent deadlock when `rpc_queue` is enabled.
+    + `slurmrestd` - Correct OpenAPI specification generation bug where
+      fields with overlapping parent paths would not get generated.
+    + Fix memory leak as a result of a partition info query.
+    + Fix memory leak as a result of a job info query.
+    + For step allocations, fix `--gres=none` sometimes not ignoring gres
+      from the job.
+    + Fix `--exclusive` jobs incorrectly gang-scheduling where they shouldn't.
+    + Fix allocations with `CR_SOCKET`, gres not assigned to a specific
+      socket, and block core distribution potentially allocating more sockets
+      than required.
+    + Revert a change in 23.02.3 where Slurm would kill a script's process
+      group as soon as the script ended instead of waiting as long as any
+      process in that process group held the stdout/stderr file descriptors
+      open. That change broke some scripts that relied on the previous
+      behavior. Setting time limits for scripts (such as
+      `PrologEpilogTimeout`) is strongly encouraged to avoid Slurm waiting
+      indefinitely for scripts to finish.
+    + Fix `slurmdbd -R` not returning an error under certain conditions.
+    + `slurmdbd` - Avoid potential NULL pointer dereference in the mysql
+      plugin.
+    + Fix regression in 23.02.3 which broke X11 forwarding for hosts when
+      MUNGE sends a localhost address in the encode host field. This is caused
+      when the node hostname is mapped to 127.0.0.1 (or similar) in
+      `/etc/hosts`.
+    + `openapi/[db]v0.0.39` - fix memory leak on parsing error.
+    + `data_parser/v0.0.39` - fix updating qos for associations.
+    + `openapi/dbv0.0.39` - fix updating values for associations with null
+      users.
+    + Fix minor memory leak with `--tres-per-task` and licenses.
+    + Fix cyclic socket cpu distribution for tasks in a step where
+      `--cpus-per-task` < usable threads per core.
+    + `slurmrestd` - For `GET /slurm/v0.0.39/node[s]`, change format of
+      node's energy field `current_watts` to a dictionary to account for
+      unset value instead of dumping 4294967294.
+    + `slurmrestd` - For `GET /slurm/v0.0.39/qos`, change format of QOS's
+      field "priority" to a dictionary to account for unset value instead of
+      dumping 4294967294.
+    + `slurmrestd` - For `GET /slurm/v0.0.39/job[s]`, the 'return code'
+      field in `v0.0.39_job_exit_code` will be set to -127 instead of
+      being left unset where the job does not have a relevant return code.
+  * Other Changes:
+    + Remove --uid / --gid options from salloc and srun commands. These options
+      did not work correctly since the CVE-2022-29500 fix in combination with
+      some changes made in 23.02.0.
+    + Add the `JobId` to `debug()` messages indicating when
+      `cpus_per_task/mem_per_cpu` or `pn_min_cpus` are being automatically
+      adjusted.
+    + Change the log message warning for rate limited users from verbose to
+      info.
+    + `slurmstepd` - Cleanup per task generated environment for containers in
+      spooldir.
+    + Format batch, extern, interactive, and pending step ids into strings that
+      are human readable.
+    + `slurmrestd` - Reduce memory usage when printing out job CPU frequency.
+    + `data_parser/v0.0.39` - Add `required/memory_per_cpu` and
+      `required/memory_per_node` to `sacct --json` and `sacct --yaml` and
+      `GET /slurmdb/v0.0.39/jobs` from slurmrestd.
+    + `gpu/oneapi` - Store cores correctly so CPU affinity is tracked.
+    + Allow `slurmdbd -R` to work if the root assoc id is not 1.
+    + Limit periodic node registrations to 50 instead of the full `TreeWidth`.
+      Since unresolvable `cloud/dynamic` nodes must disable fanout by setting
+      `TreeWidth` to a large number, this would cause all nodes to register at
+      once.
+
 -------------------------------------------------------------------
 Mon Aug 21 09:43:08 UTC 2023 - Christian Goll <cgoll@suse.com>

@@ -19,7 +157,7 @@ Mon Aug 21 09:43:08 UTC 2023 - Christian Goll <cgoll@suse.com>
    + Fix regression in 23.02.2 when checking gres state on `slurmctld`
      startup  or reconfigure. Gres changes in the configuration were not
      updated on slurmctld startup. On startup or reconfigure, these messages
-      were present in the log: `"error: Attempt to change gres/gpu Count`".
+      were present in the log: `error: Attempt to change gres/gpu Count`.
    + Fix potential double count of gres when dealing with limits.
    + Fix `slurmstepd` segfault when `ContainerPath` is not set in `oci.conf`
    + Fixed an issue where jobs requesting licenses were incorrectly rejected.
@@ -163,7 +301,7 @@ Mon Aug 21 09:43:08 UTC 2023 - Christian Goll <cgoll@suse.com>
      lookups.
    + `sacct` - when printing `PLANNED` time, use end time instead of start
      time for jobs cancelled before they started.
-    + Hold the job with "`(Reservation ... invalid)`" state reason if the
+    + Hold the job with `(Reservation ... invalid)` state reason if the
      reservation is not usable by the job.
    + `sbatch` - Added new `--export=NIL` option.
 - Removed:
--- a/slurm.spec
+++ b/slurm.spec
@@ -18,7 +18,7 @@

 # Check file META in sources: update so_version to (API_CURRENT - API_AGE)
 %define so_version 39
-%define ver 23.02.4
+%define ver 23.02.5
 %define _ver _23_02
 #%%define rc_v 0rc1
 %define dl_ver %{ver}
@@ -1321,7 +1321,7 @@ rm -rf /srv/slurm-testsuite/src /srv/slurm-testsuite/testsuite \
 %{_mandir}/man5/cgroup.*
 %{_mandir}/man5/gres.*
 %{_mandir}/man5/helpers.*
-%{_mandir}/man5/nonstop.conf.5.*
+#%%{_mandir}/man5/nonstop.conf.5.*
 %{_mandir}/man5/oci.conf.5.gz
 %{_mandir}/man5/topology.*
 %{_mandir}/man5/knl.conf.5.*
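
Note on the `node_features/helpers` changelog entry about `&` and `|`: the documented reading of `foo|bar&baz` is `{foo} or {bar,baz}`, meaning `&` binds more tightly than `|`. The Python sketch below illustrates only that grouping. `parse_feature_sets` and `node_matches` are hypothetical helpers written for this note, not Slurm code, and they deliberately ignore the rest of Slurm's constraint syntax (parentheses, counts, brackets).

```python
from typing import FrozenSet, Iterable, List


def parse_feature_sets(expr: str) -> List[FrozenSet[str]]:
    """Split a feature expression into alternative feature sets.

    Assumption: only '&' and '|' appear, and '&' binds tighter than '|',
    so 'foo|bar&baz' yields [{foo}, {bar, baz}].
    """
    alternatives = []
    for alt in expr.split("|"):          # each '|' starts a new alternative
        names = frozenset(
            name.strip() for name in alt.split("&") if name.strip()
        )                                # '&' joins features within one set
        if names:
            alternatives.append(names)
    return alternatives


def node_matches(expr: str, node_features: Iterable[str]) -> bool:
    """Return True if the node satisfies at least one alternative set."""
    available = set(node_features)
    return any(required <= available for required in parse_feature_sets(expr))


if __name__ == "__main__":
    print(parse_feature_sets("foo|bar&baz"))
    print(node_matches("foo|bar&baz", ["foo"]))
```

Under the pre-fix interpretation described in the changelog (`{foo,baz}` or `{bar,baz}`), a node offering only `foo` would not have matched. With the documented grouping sketched here, it does.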