From f0b994e220ea6a25d598879e27273bd416af03c505852fd3c36def3ac20c7d9d Mon Sep 17 00:00:00 2001
From: Egbert Eich
Date: Mon, 18 Sep 2023 05:43:58 +0000
Subject: [PATCH] plugins makes use of the `MpiParams=ports=` option, and previously

features with the `|` operator, which could prevent jobs from
+ `node_features/helpers` - Fix inconsistent handling of `&` and `|`,
instead of just the current set. E.g. `foo|bar&baz` was interpreted
`{foo} or {bar,baz}`.
tasks fewer than GPUs, which resulted in incorrectly rejecting these
jobs.
+ `slurmrestd` - For `GET /slurm/v0.0.39/node[s]`, change format of
node's energy field `current_watts` to a dictionary to account for
+ `slurmrestd` - For `GET /slurm/v0.0.39/qos`, change format of QOS's
+ slurmrestd - For `GET /slurm/v0.0.39/job[s]`, the 'return code'
`GET /slurmdb/v0.0.39/jobs` from slurmrestd.
were present in the log: `error: Attempt to change gres/gpu Count`.
+ Hold the job with `(Reservation ... invalid)` state reason if the

OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=265
---
 slurm.changes | 27 ++++++++++++++-------------
 slurm.spec    |  2 +-
 2 files changed, 15 insertions(+), 14 deletions(-)

diff --git a/slurm.changes b/slurm.changes
index 6c8006d..4676bba 100644
--- a/slurm.changes
+++ b/slurm.changes
@@ -9,7 +9,7 @@ Mon Sep 18 05:23:19 UTC 2023 - Egbert Eich
 accurate in more situations.
 + Change pmi2 plugin to honor the `SrunPortRange` option. This matches the
 new behavior of the pmix plugin in 23.02.0. Note that neither of these
- plugins makes use of the "`MpiParams=ports=`" option, and previously
+ plugins makes use of the `MpiParams=ports=` option, and previously
 were only limited by the systems ephemeral port range.
 + Fix regression in 23.02.2 that caused slurmctld -R to crash on startup if
 a node features plugin is configured.
@@ -44,13 +44,13 @@ Mon Sep 18 05:23:19 UTC 2023 - Egbert Eich
 federation before they have registered with the dbd.
 + `node_features/helpers` - Fix node selection for jobs requesting
 changeable.
- features with the '`|`' operator, which could prevent jobs from
+ features with the `|` operator, which could prevent jobs from
 running on some valid nodes.
- + `node_features/helpers` - Fix inconsistent handling of '`&`' and '`|`',
+ + `node_features/helpers` - Fix inconsistent handling of `&` and `|`,
 where an AND'd feature was sometimes AND'd to all sets of features
- instead of just the current set. E.g. "`foo|bar&baz`" was interpreted
+ instead of just the current set. E.g. `foo|bar&baz` was interpreted
 as `{foo,baz}` or `{bar,baz}` instead of how it is documented:
- "`{foo} or {bar,baz}`".
+ `{foo} or {bar,baz}`.
 + Fix job accounting so that when a job is requeued its allocated node
 count is cleared. After the requeue, sacct will correctly show that the
 job has 0 `AllocNodes` while it is pending or if it is canceled
@@ -60,7 +60,8 @@ Mon Sep 18 05:23:19 UTC 2023 - Egbert Eich
 + Fix intel OneAPI autodetect: detect the `/dev/dri/renderD[0-9]+` GPUs, and
 do not detect `/dev/dri/card[0-9]+`.
 + Fix node selection for jobs that request `--gpus` and a number of
- tasks fewer than GPUs, which resulted in incorrectly rejecting these jobs.
+ tasks fewer than GPUs, which resulted in incorrectly rejecting these
+ jobs.
 + Remove `MYSQL_OPT_RECONNECT` completely.
 + Fix cloud nodes in `POWERING_UP` state disappearing (getting set to
 `FUTURE`)
@@ -102,13 +103,13 @@ Mon Sep 18 05:23:19 UTC 2023 - Egbert Eich
 + Fix minor memory leak with `--tres-per-task` and licenses.
 + Fix cyclic socket cpu distribution for tasks in a step where
 `--cpus-per-task` < usable threads per core.
- + `slurmrestd` - For '`GET /slurm/v0.0.39/node[s]`', change format of
- node's energy field "`current_watts`" to a dictionary to account for
+ + `slurmrestd` - For `GET /slurm/v0.0.39/node[s]`, change format of
+ node's energy field `current_watts` to a dictionary to account for
 unset value instead of dumping 4294967294.
- + `slurmrestd` - For '`GET /slurm/v0.0.39/qos`', change format of QOS's
+ + `slurmrestd` - For `GET /slurm/v0.0.39/qos`, change format of QOS's
 field "priority" to a dictionary to account for unset value instead of
 dumping 4294967294.
- + slurmrestd - For '`GET /slurm/v0.0.39/job[s]`', the 'return code'
+ + slurmrestd - For `GET /slurm/v0.0.39/job[s]`, the 'return code'
 code field in `v0.0.39_job_exit`_code will be set to -127 instead of being
 left unset where job does not have a relevant return code.
 * Other Changes:
@@ -127,7 +128,7 @@ Mon Sep 18 05:23:19 UTC 2023 - Egbert Eich
 + `slurmrestd` - Reduce memory usage when printing out job CPU frequency.
 + `data_parser/v0.0.39` - Add `required/memory_per_cpu` and
 `required/memory_per_node` to `sacct --json` and `sacct --yaml` and
- '`GET /slurmdb/v0.0.39/jobs`' from slurmrestd.
+ `GET /slurmdb/v0.0.39/jobs` from slurmrestd.
 + `gpu/oneapi` - Store cores correctly so CPU affinity is tracked.
 + Allow `slurmdbd -R` to work if the root assoc id is not 1.
 + Limit periodic node registrations to 50 instead of the full `TreeWidth`.
@@ -156,7 +157,7 @@ Mon Aug 21 09:43:08 UTC 2023 - Christian Goll
 + Fix regression in 23.02.2 when checking gres state on `slurmctld`
 startup or reconfigure. Gres changes in the configuration were not updated
 on slurmctld startup. On startup or reconfigure, these messages
- were present in the log: `"error: Attempt to change gres/gpu Count`".
+ were present in the log: `error: Attempt to change gres/gpu Count`.
 + Fix potential double count of gres when dealing with limits.
 + Fix `slurmstepd` segfault when `ContainerPath` is not set in `oci.conf`
 + Fixed an issue where jobs requesting licenses were incorrectly rejected.
@@ -300,7 +301,7 @@ Mon Aug 21 09:43:08 UTC 2023 - Christian Goll
 lookups.
 + `sacct` - when printing `PLANNED` time, use end time instead of start
 time for jobs cancelled before they started.
- + Hold the job with "`(Reservation ... invalid)`" state reason if the
+ + Hold the job with `(Reservation ... invalid)` state reason if the
 reservation is not usable by the job.
 + `sbatch` - Added new `--export=NIL` option.
 - Removed:
diff --git a/slurm.spec b/slurm.spec
index ae2115d..15be0c0 100644
--- a/slurm.spec
+++ b/slurm.spec
@@ -1321,7 +1321,7 @@ rm -rf /srv/slurm-testsuite/src /srv/slurm-testsuite/testsuite \
 %{_mandir}/man5/cgroup.*
 %{_mandir}/man5/gres.*
 %{_mandir}/man5/helpers.*
-%{_mandir}/man5/nonstop.conf.5.*
+#%%{_mandir}/man5/nonstop.conf.5.*
 %{_mandir}/man5/oci.conf.5.gz
 %{_mandir}/man5/topology.*
 %{_mandir}/man5/knl.conf.5.*