forked from pool/slurm
Accepting request 1111943 from network:cluster
- Updated to version 23.02.5 with the following changes:
  * Bug Fixes:
    + Revert a change in 23.02 where `SLURM_NTASKS` was no longer set in the
      job's environment when `--ntasks-per-node` was requested.
      The method by which it is set, however, is different and should be more
      accurate in more situations.
    + Change pmi2 plugin to honor the `SrunPortRange` option. This matches the
      new behavior of the pmix plugin in 23.02.0. Note that neither of these
      plugins makes use of the `MpiParams=ports=` option; previously both
      were limited only by the system's ephemeral port range.
    + Fix regression in 23.02.2 that caused `slurmctld -R` to crash on startup if
      a node features plugin is configured.
    + Fix and prevent reoccurring reservations from overlapping.
    + `job_container/tmpfs` - Avoid attempts to share BasePath between nodes.
    + With `CR_Cpu_Memory`, fix node selection for jobs that request gres and
      `--mem-per-cpu`.
    + Fix a regression from 22.05.7 in which some jobs were allocated too few
      nodes, thus overcommitting cpus to some tasks.
    + Fix a job being stuck in the completing state if the job ends while the
      primary controller is down or unresponsive and the backup controller has
      not yet taken over.
    + Fix `slurmctld` segfault when a node registers with a configured
      `CpuSpecList` while `slurmctld` configuration has the node without
      `CpuSpecList`.
    + Fix cloud nodes getting stuck in `POWERED_DOWN+NO_RESPOND` state after
      not registering by `ResumeTimeout`.
    + `slurmstepd` - Avoid cleanup of `config.json`-less containers spooldir
      getting skipped.
    + Fix scontrol segfault when 'completing' command requested repeatedly in
      interactive mode.

OBS-URL: https://build.opensuse.org/request/show/1111943
OBS-URL: https://build.opensuse.org/package/show/openSUSE:Factory/slurm?expand=0&rev=95
commit
12bf38b1d0
@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:6634f57991c6a1a7d140c4de2f50a3e66dd06abef6ef83a8571f6eaa2fe048c7
-size 7259848
3
slurm-23.02.5.tar.bz2
Normal file
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:7620f1dd1134d14dff402a9127d5a36c340d7a2b69b55f67d8a44b3b8681a59d
+size 7274119
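The tarball is tracked through Git LFS, so the diff only shows a three-line pointer file (`version`, `oid`, `size`) rather than the archive itself. As a minimal sketch of how such a pointer could be read and checked against a downloaded blob, the following uses only the Python standard library. The function names `parse_lfs_pointer` and `blob_matches_pointer` are hypothetical helpers made up for this illustration and are not part of any Git LFS tooling.

```python
import hashlib


def parse_lfs_pointer(text: str) -> dict:
    """Parse the 'key value' lines of a Git LFS pointer file (hypothetical helper)."""
    fields = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # Each pointer line is a key, a single space, then the value.
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid has the form '<hash algorithm>:<hex digest>'.
    algorithm, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "algorithm": algorithm,
        "digest": digest,
        "size": int(fields["size"]),
    }


def blob_matches_pointer(pointer: dict, blob: bytes) -> bool:
    """Return True if the blob has the size and SHA-256 digest named by the pointer."""
    if pointer["algorithm"] != "sha256":
        raise ValueError("this sketch only handles sha256 oids")
    return (
        len(blob) == pointer["size"]
        and hashlib.sha256(blob).hexdigest() == pointer["digest"]
    )


# Pointer content copied from the hunk above.
POINTER_TEXT = """\
version https://git-lfs.github.com/spec/v1
oid sha256:7620f1dd1134d14dff402a9127d5a36c340d7a2b69b55f67d8a44b3b8681a59d
size 7274119
"""

pointer = parse_lfs_pointer(POINTER_TEXT)
print(pointer["algorithm"], pointer["size"])
```

To check a real download, one would read `slurm-23.02.5.tar.bz2` in binary mode and pass the bytes to `blob_matches_pointer(pointer, data)`.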
142
slurm.changes
@@ -1,3 +1,141 @@
+-------------------------------------------------------------------
+Mon Sep 18 05:23:19 UTC 2023 - Egbert Eich <eich@suse.com>
+
+- Updated to version 23.02.5 with the following changes:
+  * Bug Fixes:
+    + Revert a change in 23.02 where `SLURM_NTASKS` was no longer set in the
+      job's environment when `--ntasks-per-node` was requested.
+      The method by which it is set, however, is different and should be more
+      accurate in more situations.
+    + Change pmi2 plugin to honor the `SrunPortRange` option. This matches the
+      new behavior of the pmix plugin in 23.02.0. Note that neither of these
+      plugins makes use of the `MpiParams=ports=` option; previously both
+      were limited only by the system's ephemeral port range.
+    + Fix regression in 23.02.2 that caused `slurmctld -R` to crash on startup if
+      a node features plugin is configured.
+    + Fix and prevent reoccurring reservations from overlapping.
+    + `job_container/tmpfs` - Avoid attempts to share BasePath between nodes.
+    + With `CR_Cpu_Memory`, fix node selection for jobs that request gres and
+      `--mem-per-cpu`.
+    + Fix a regression from 22.05.7 in which some jobs were allocated too few
+      nodes, thus overcommitting cpus to some tasks.
+    + Fix a job being stuck in the completing state if the job ends while the
+      primary controller is down or unresponsive and the backup controller has
+      not yet taken over.
+    + Fix `slurmctld` segfault when a node registers with a configured
+      `CpuSpecList` while `slurmctld` configuration has the node without
+      `CpuSpecList`.
+    + Fix cloud nodes getting stuck in `POWERED_DOWN+NO_RESPOND` state after
+      not registering by `ResumeTimeout`.
+    + `slurmstepd` - Avoid cleanup of `config.json`-less containers spooldir
+      getting skipped.
+    + Fix scontrol segfault when 'completing' command requested repeatedly in
+      interactive mode.
+    + Properly handle a race condition between `bind()` and `listen()` calls
+      in the network stack when running with `SrunPortRange` set.
+    + Federation - Fix revoked jobs being returned regardless of the
+      `-a`/`--all` option for privileged users.
+    + Federation - Fix canceling pending federated jobs from non-origin
+      clusters which could leave federated jobs orphaned from the origin
+      cluster.
+    + Fix sinfo segfault when printing multiple clusters with `--noheader`
+      option.
+    + Federation - Fix clusters not syncing if clusters are added to a
+      federation before they have registered with the dbd.
+    + `node_features/helpers` - Fix node selection for jobs requesting
+      changeable
+      features with the `|` operator, which could prevent jobs from
+      running on some valid nodes.
+    + `node_features/helpers` - Fix inconsistent handling of `&` and `|`,
+      where an AND'd feature was sometimes AND'd to all sets of features
+      instead of just the current set. E.g. `foo|bar&baz` was interpreted
+      as `{foo,baz}` or `{bar,baz}` instead of how it is documented:
+      `{foo} or {bar,baz}`.
+    + Fix job accounting so that when a job is requeued its allocated node
+      count is cleared. After the requeue, sacct will correctly show that
+      the job has 0 `AllocNodes` while it is pending or if it is canceled
+      before restarting.
+    + `sacct` - `AllocCPUS` now correctly shows 0 if a job has not yet
+      received an allocation or if the job was canceled before getting one.
+    + Fix intel OneAPI autodetect: detect the `/dev/dri/renderD[0-9]+` GPUs,
+      and do not detect `/dev/dri/card[0-9]+`.
+    + Fix node selection for jobs that request `--gpus` and a number of
+      tasks fewer than GPUs, which resulted in incorrectly rejecting these
+      jobs.
+    + Remove `MYSQL_OPT_RECONNECT` completely.
+    + Fix cloud nodes in `POWERING_UP` state disappearing (getting set
+      to `FUTURE`)
+      when an `scontrol reconfigure` happens.
+    + `openapi/dbv0.0.39` - Avoid assert / segfault on missing coordinators
+      list.
+    + `slurmrestd` - Correct memory leak while parsing OpenAPI specification
+      templates with server overrides.
+    + Fix overwriting user node reason with system message.
+    + Prevent deadlock when `rpc_queue` is enabled.
+    + `slurmrestd` - Correct OpenAPI specification generation bug where
+      fields with overlapping parent paths would not get generated.
+    + Fix memory leak as a result of a partition info query.
+    + Fix memory leak as a result of a job info query.
+    + For step allocations, fix `--gres=none` sometimes not ignoring gres
+      from the job.
+    + Fix `--exclusive` jobs incorrectly gang-scheduling where they shouldn't.
+    + Fix allocations with `CR_SOCKET`, gres not assigned to a specific
+      socket, and block core distribution potentially allocating more sockets
+      than required.
+    + Revert a change in 23.02.3 where Slurm would kill a script's process
+      group as soon as the script ended instead of waiting as long as any
+      process in that process group held the stdout/stderr file descriptors
+      open. That change broke some scripts that relied on the previous
+      behavior. Setting time limits for scripts (such as
+      `PrologEpilogTimeout`) is strongly encouraged to avoid Slurm waiting
+      indefinitely for scripts to finish.
+    + Fix `slurmdbd -R` not returning an error under certain conditions.
+    + `slurmdbd` - Avoid potential NULL pointer dereference in the mysql
+      plugin.
+    + Fix regression in 23.02.3 which broke X11 forwarding for hosts when
+      MUNGE sends a localhost address in the encode host field. This happens
+      when the node hostname is mapped to 127.0.0.1 (or similar) in
+      `/etc/hosts`.
+    + `openapi/[db]v0.0.39` - Fix memory leak on parsing error.
+    + `data_parser/v0.0.39` - Fix updating qos for associations.
+    + `openapi/dbv0.0.39` - Fix updating values for associations with null
+      users.
+    + Fix minor memory leak with `--tres-per-task` and licenses.
+    + Fix cyclic socket cpu distribution for tasks in a step where
+      `--cpus-per-task` < usable threads per core.
+    + `slurmrestd` - For `GET /slurm/v0.0.39/node[s]`, change format of
+      node's energy field `current_watts` to a dictionary to account for
+      unset value instead of dumping 4294967294.
+    + `slurmrestd` - For `GET /slurm/v0.0.39/qos`, change format of QOS's
+      field "priority" to a dictionary to account for unset value instead of
+      dumping 4294967294.
+    + `slurmrestd` - For `GET /slurm/v0.0.39/job[s]`, the 'return code'
+      code field in `v0.0.39_job_exit_code` will be set to -127 instead of
+      being left unset where the job does not have a relevant return code.
+  * Other Changes:
+    + Remove --uid / --gid options from salloc and srun commands. These options
+      did not work correctly since the CVE-2022-29500 fix in combination with
+      some changes made in 23.02.0.
+    + Add the `JobId` to `debug()` messages indicating when
+      `cpus_per_task/mem_per_cpu` or `pn_min_cpus` are being automatically
+      adjusted.
+    + Change the log message warning for rate limited users from verbose to
+      info.
+    + `slurmstepd` - Cleanup per task generated environment for containers in
+      spooldir.
+    + Format batch, extern, interactive, and pending step ids into strings that
+      are human readable.
+    + `slurmrestd` - Reduce memory usage when printing out job CPU frequency.
+    + `data_parser/v0.0.39` - Add `required/memory_per_cpu` and
+      `required/memory_per_node` to `sacct --json` and `sacct --yaml` and
+      `GET /slurmdb/v0.0.39/jobs` from slurmrestd.
+    + `gpu/oneapi` - Store cores correctly so CPU affinity is tracked.
+    + Allow `slurmdbd -R` to work if the root assoc id is not 1.
+    + Limit periodic node registrations to 50 instead of the full `TreeWidth`.
+      Since unresolvable `cloud/dynamic` nodes must disable fanout by setting
+      `TreeWidth` to a large number, this would cause all nodes to register at
+      once.
+
 -------------------------------------------------------------------
 Mon Aug 21 09:43:08 UTC 2023 - Christian Goll <cgoll@suse.com>
 
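The `node_features/helpers` entry in the 23.02.5 changelog above says that `foo|bar&baz` is documented to mean `{foo} or {bar,baz}`, so `&` binds tighter than `|`. The following is a minimal sketch of that precedence rule only. It is not Slurm's implementation, the function names are hypothetical, and it assumes a flat expression with no parentheses, brackets, or counts.

```python
def parse_feature_expression(expr: str) -> list[set[str]]:
    """Split a flat feature expression into alternative feature sets.

    Assumption for this sketch: '|' separates alternatives and '&' joins
    features inside a single alternative, so '&' binds tighter than '|'.
    """
    alternatives = []
    for alternative in expr.split("|"):
        features = {name.strip() for name in alternative.split("&") if name.strip()}
        if features:
            alternatives.append(features)
    return alternatives


def node_satisfies(node_features: set[str], expr: str) -> bool:
    """True if the node carries every feature of at least one alternative."""
    return any(alt <= node_features for alt in parse_feature_expression(expr))


if __name__ == "__main__":
    # Two alternatives: {foo} or {bar, baz}.
    print(parse_feature_expression("foo|bar&baz"))
    # A node with only 'bar' matches neither alternative.
    print(node_satisfies({"bar"}, "foo|bar&baz"))
```

Under the buggy reading described in the changelog, `baz` would have been required in both alternatives, so a node offering only `foo` would have been rejected. In this sketch it is accepted.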
@@ -19,7 +157,7 @@ Mon Aug 21 09:43:08 UTC 2023 - Christian Goll <cgoll@suse.com>
     + Fix regression in 23.02.2 when checking gres state on `slurmctld`
       startup or reconfigure. Gres changes in the configuration were not
       updated on slurmctld startup. On startup or reconfigure, these messages
-      were present in the log: `"error: Attempt to change gres/gpu Count`".
+      were present in the log: `error: Attempt to change gres/gpu Count`.
     + Fix potential double count of gres when dealing with limits.
     + Fix `slurmstepd` segfault when `ContainerPath` is not set in `oci.conf`
     + Fixed an issue where jobs requesting licenses were incorrectly rejected.
@@ -163,7 +301,7 @@ Mon Aug 21 09:43:08 UTC 2023 - Christian Goll <cgoll@suse.com>
       lookups.
     + `sacct` - when printing `PLANNED` time, use end time instead of start
       time for jobs cancelled before they started.
-    + Hold the job with "`(Reservation ... invalid)`" state reason if the
+    + Hold the job with `(Reservation ... invalid)` state reason if the
       reservation is not usable by the job.
     + `sbatch` - Added new `--export=NIL` option.
 - Removed:
@@ -18,7 +18,7 @@
 
 # Check file META in sources: update so_version to (API_CURRENT - API_AGE)
 %define so_version 39
-%define ver 23.02.4
+%define ver 23.02.5
 %define _ver _23_02
 #%%define rc_v 0rc1
 %define dl_ver %{ver}
@@ -1321,7 +1321,7 @@ rm -rf /srv/slurm-testsuite/src /srv/slurm-testsuite/testsuite \
 %{_mandir}/man5/cgroup.*
 %{_mandir}/man5/gres.*
 %{_mandir}/man5/helpers.*
-%{_mandir}/man5/nonstop.conf.5.*
+#%%{_mandir}/man5/nonstop.conf.5.*
 %{_mandir}/man5/oci.conf.5.gz
 %{_mandir}/man5/topology.*
 %{_mandir}/man5/knl.conf.5.*