From 74529b6cc222c46d82acda71adb61bcc202fa8dc191b315c9b2fd05f7e7c0bd5 Mon Sep 17 00:00:00 2001 From: Egbert Eich Date: Mon, 18 Sep 2023 05:24:51 +0000 Subject: [PATCH] - Updated to version 23.02.5 with the following changes: * Bug Fixes: + Revert a change in 23.02 where `SLURM_NTASKS` was no longer set in the job's environment when `--ntasks-per-node` was requested. The method by which it is set, however, is different and should be more accurate in more situations. + Change pmi2 plugin to honor the `SrunPortRange` option. This matches the new behavior of the pmix plugin in 23.02.0. Note that neither of these plugins makes use of the "`MpiParams=ports=`" option, and previously were only limited by the system's ephemeral port range. + Fix regression in 23.02.2 that caused slurmctld -R to crash on startup if a node features plugin is configured. + Fix and prevent reoccurring reservations from overlapping. + `job_container/tmpfs` - Avoid attempts to share BasePath between nodes. + With `CR_Cpu_Memory`, fix node selection for jobs that request gres and `--mem-per-cpu`. + Fix a regression from 22.05.7 in which some jobs were allocated too few nodes, thus overcommitting cpus to some tasks. + Fix a job being stuck in the completing state if the job ends while the primary controller is down or unresponsive and the backup controller has not yet taken over. + Fix `slurmctld` segfault when a node registers with a configured `CpuSpecList` while `slurmctld` configuration has the node without `CpuSpecList`. + Fix cloud nodes getting stuck in `POWERED_DOWN+NO_RESPOND` state after not registering by `ResumeTimeout`. + `slurmstepd` - Avoid cleanup of `config.json`-less containers spooldir getting skipped. + Fix scontrol segfault when 'completing' command requested repeatedly in interactive mode. 
OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=264 --- slurm-23.02.4.tar.bz2 | 3 - slurm-23.02.5.tar.bz2 | 3 + slurm.changes | 137 ++++++++++++++++++++++++++++++++++++++++++ slurm.spec | 2 +- 4 files changed, 141 insertions(+), 4 deletions(-) delete mode 100644 slurm-23.02.4.tar.bz2 create mode 100644 slurm-23.02.5.tar.bz2 diff --git a/slurm-23.02.4.tar.bz2 b/slurm-23.02.4.tar.bz2 deleted file mode 100644 index bf1048e..0000000 --- a/slurm-23.02.4.tar.bz2 +++ /dev/null @@ -1,3 +0,0 @@ -version https://git-lfs.github.com/spec/v1 -oid sha256:6634f57991c6a1a7d140c4de2f50a3e66dd06abef6ef83a8571f6eaa2fe048c7 -size 7259848 diff --git a/slurm-23.02.5.tar.bz2 b/slurm-23.02.5.tar.bz2 new file mode 100644 index 0000000..1918068 --- /dev/null +++ b/slurm-23.02.5.tar.bz2 @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:7620f1dd1134d14dff402a9127d5a36c340d7a2b69b55f67d8a44b3b8681a59d +size 7274119 diff --git a/slurm.changes b/slurm.changes index 5a30cc3..6c8006d 100644 --- a/slurm.changes +++ b/slurm.changes @@ -1,3 +1,140 @@ +------------------------------------------------------------------- +Mon Sep 18 05:23:19 UTC 2023 - Egbert Eich + +- Updated to version 23.02.5 with the following changes: + * Bug Fixes: + + Revert a change in 23.02 where `SLURM_NTASKS` was no longer set in the + job's environment when `--ntasks-per-node` was requested. + The method by which it is set, however, is different and should be more + accurate in more situations. + + Change pmi2 plugin to honor the `SrunPortRange` option. This matches the + new behavior of the pmix plugin in 23.02.0. Note that neither of these + plugins makes use of the "`MpiParams=ports=`" option, and previously + were only limited by the system's ephemeral port range. + + Fix regression in 23.02.2 that caused slurmctld -R to crash on startup if + a node features plugin is configured. + + Fix and prevent reoccurring reservations from overlapping. 
+ + `job_container/tmpfs` - Avoid attempts to share BasePath between nodes. + + With `CR_Cpu_Memory`, fix node selection for jobs that request gres and + `--mem-per-cpu`. + + Fix a regression from 22.05.7 in which some jobs were allocated too few + nodes, thus overcommitting cpus to some tasks. + + Fix a job being stuck in the completing state if the job ends while the + primary controller is down or unresponsive and the backup controller has + not yet taken over. + + Fix `slurmctld` segfault when a node registers with a configured + `CpuSpecList` while `slurmctld` configuration has the node without + `CpuSpecList`. + + Fix cloud nodes getting stuck in `POWERED_DOWN+NO_RESPOND` state after + not registering by `ResumeTimeout`. + + `slurmstepd` - Avoid cleanup of `config.json`-less containers spooldir + getting skipped. + + Fix scontrol segfault when 'completing' command requested repeatedly in + interactive mode. + + Properly handle a race condition between `bind()` and `listen()` calls + in the network stack when running with SrunPortRange set. + + Federation - Fix revoked jobs being returned regardless of the + `-a`/`--all` option for privileged users. + + Federation - Fix canceling pending federated jobs from non-origin + clusters which could leave federated jobs orphaned from the origin + cluster. + + Fix sinfo segfault when printing multiple clusters with `--noheader` + option. + + Federation - fix clusters not syncing if clusters are added to a + federation before they have registered with the dbd. + + `node_features/helpers` - Fix node selection for jobs requesting + changeable + features with the '`|`' operator, which could prevent jobs from + running on some valid nodes. + + `node_features/helpers` - Fix inconsistent handling of '`&`' and '`|`', + where an AND'd feature was sometimes AND'd to all sets of features + instead of just the current set. E.g. 
"`foo|bar&baz`" was interpreted + as `{foo,baz}` or `{bar,baz}` instead of how it is documented: + "`{foo} or {bar,baz}`". + + Fix job accounting so that when a job is requeued its allocated node + count is cleared. After the requeue, sacct will correctly show that + the job has 0 `AllocNodes` while it is pending or if it is canceled + before restarting. + + `sacct` - `AllocCPUS` now correctly shows 0 if a job has not yet + received an allocation or if the job was canceled before getting one. + + Fix intel OneAPI autodetect: detect the `/dev/dri/renderD[0-9]+` GPUs, + and do not detect `/dev/dri/card[0-9]+`. + + Fix node selection for jobs that request `--gpus` and a number of + tasks fewer than GPUs, which resulted in incorrectly rejecting these jobs. + + Remove `MYSQL_OPT_RECONNECT` completely. + + Fix cloud nodes in `POWERING_UP` state disappearing (getting set + to `FUTURE`) + when an `scontrol reconfigure` happens. + + `openapi/dbv0.0.39` - Avoid assert / segfault on missing coordinators + list. + + `slurmrestd` - Correct memory leak while parsing OpenAPI specification + templates with server overrides. + + Fix overwriting user node reason with system message. + + Prevent deadlock when `rpc_queue` is enabled. + + `slurmrestd` - Correct OpenAPI specification generation bug where + fields with overlapping parent paths would not get generated. + + Fix memory leak as a result of a partition info query. + + Fix memory leak as a result of a job info query. + + For step allocations, fix `--gres=none` sometimes not ignoring gres + from the job. + + Fix `--exclusive` jobs incorrectly gang-scheduling where they shouldn't. + + Fix allocations with `CR_SOCKET`, gres not assigned to a specific + socket, and block core distribion potentially allocating more sockets + than required. 
+ + Revert a change in 23.02.3 where Slurm would kill a script's process + group as soon as the script ended instead of waiting as long as any + process in that process group held the stdout/stderr file descriptors + open. That change broke some scripts that relied on the previous + behavior. Setting time limits for scripts (such as + `PrologEpilogTimeout`) is strongly encouraged to avoid Slurm waiting + indefinitely for scripts to finish. + + Fix `slurmdbd -R` not returning an error under certain conditions. + + `slurmdbd` - Avoid potential NULL pointer dereference in the mysql + plugin. + + Fix regression in 23.02.3 which broke X11 forwarding for hosts when + MUNGE sends a localhost address in the encode host field. This is caused + when the node hostname is mapped to 127.0.0.1 (or similar) in + `/etc/hosts`. + + `openapi/[db]v0.0.39` - fix memory leak on parsing error. + + `data_parser/v0.0.39` - fix updating qos for associations. + + `openapi/dbv0.0.39` - fix updating values for associations with null + users. + + Fix minor memory leak with `--tres-per-task` and licenses. + + Fix cyclic socket cpu distribution for tasks in a step where + `--cpus-per-task` < usable threads per core. + + `slurmrestd` - For '`GET /slurm/v0.0.39/node[s]`', change format of + node's energy field "`current_watts`" to a dictionary to account for + unset value instead of dumping 4294967294. + + `slurmrestd` - For '`GET /slurm/v0.0.39/qos`', change format of QOS's + field "priority" to a dictionary to account for unset value instead of + dumping 4294967294. + + `slurmrestd` - For '`GET /slurm/v0.0.39/job[s]`', the 'return code' + field in `v0.0.39_job_exit_code` will be set to -127 instead of + being left unset where the job does not have a relevant return code. + * Other Changes: + + Remove --uid / --gid options from salloc and srun commands. These options + did not work correctly since the CVE-2022-29500 fix in combination with + some changes made in 23.02.0. 
+ + Add the `JobId` to `debug()` messages indicating when + `cpus_per_task/mem_per_cpu` or `pn_min_cpus` are being automatically + adjusted. + + Change the log message warning for rate limited users from verbose to + info. + + `slurmstepd` - Cleanup per task generated environment for containers in + spooldir. + + Format batch, extern, interactive, and pending step ids into strings that + are human readable. + + `slurmrestd` - Reduce memory usage when printing out job CPU frequency. + + `data_parser/v0.0.39` - Add `required/memory_per_cpu` and + `required/memory_per_node` to `sacct --json` and `sacct --yaml` and + '`GET /slurmdb/v0.0.39/jobs`' from slurmrestd. + + `gpu/oneapi` - Store cores correctly so CPU affinity is tracked. + + Allow `slurmdbd -R` to work if the root assoc id is not 1. + + Limit periodic node registrations to 50 instead of the full `TreeWidth`. + Since unresolvable `cloud/dynamic` nodes must disable fanout by setting + `TreeWidth` to a large number, this would cause all nodes to register at + once. + ------------------------------------------------------------------- Mon Aug 21 09:43:08 UTC 2023 - Christian Goll diff --git a/slurm.spec b/slurm.spec index cc20151..ae2115d 100644 --- a/slurm.spec +++ b/slurm.spec @@ -18,7 +18,7 @@ # Check file META in sources: update so_version to (API_CURRENT - API_AGE) %define so_version 39 -%define ver 23.02.4 +%define ver 23.02.5 %define _ver _23_02 #%%define rc_v 0rc1 %define dl_ver %{ver}