Accepting request 1110259 from network:cluster

- Updated to 23.02.4 with the following changes:
  * Bug Fixes:
    + Fix main scheduler loop not starting after a failover to backup
      controller. Avoid slurmctld segfault when specifying
      `AccountingStorageExternalHost` (bsc#1214983).
    + Fix sbatch return code when `--wait` is requested on a job array.
    + Fix collected `GPUUtilization` values for `acct_gather_profile` plugins.
    + Fix `slurmrestd` handling of job hold/release operations.
    + Fix step running indefinitely when slurmctld takes more than
      `MessageTimeout` to respond. Now, `slurmctld` will cancel the step when
      detected, preventing following steps from getting stuck waiting for
      resources to be released.
    + Fix regression to make `job_desc.min_cpus` accurate again in `job_submit`
      when requesting a job with `--ntasks-per-node`.
    + Fix handling of `ArrayTaskThrottle` in backfill.
    + Fix regression in 23.02.2 when checking gres state on `slurmctld`
      startup or reconfigure. Gres changes in the configuration were not
      updated on slurmctld startup. On startup or reconfigure, these messages
      were present in the log: `"error: Attempt to change gres/gpu Count"`.
    + Fix potential double count of gres when dealing with limits.
    + Fix `slurmstepd` segfault when `ContainerPath` is not set in `oci.conf`.
    + Fixed an issue where jobs requesting licenses were incorrectly rejected.
    + `scrontab` - Fix cutting off the final character of quoted variables.
    + `smail` - Fix issues where e-mails at job completion were not being sent.
    + `scontrol/slurmctld` - fix comma parsing when updating a reservation's
      nodes.
    + Fix `--gpu-bind=single` binding tasks to wrong gpus, leading to some gpus
      having more tasks than they should and other gpus being unused.
    + Fix regression in 23.02 that causes slurmstepd to crash when `srun`
      requests more than `TreeWidth` nodes in a step and uses the pmi2 or

OBS-URL: https://build.opensuse.org/request/show/1110259
OBS-URL: https://build.opensuse.org/package/show/openSUSE:Factory/slurm?expand=0&rev=93
Ana Guerrero 2023-09-11 19:22:19 +00:00 committed by Git OBS Bridge
commit 3bcde4bfd9


@@ -1,164 +1,171 @@
 -------------------------------------------------------------------
 Mon Aug 21 09:43:08 UTC 2023 - Christian Goll <cgoll@suse.com>

-- Fixes since 23.02.03:
-  Highlights:
-  * Fix main scheduler loop not starting after a failover to backup controller.
-  * Avoid slurmctld segfault when specifying `AccountingStorageExternalHost`
-    (bsc#1214983).
-  Other:
-  * Fix sbatch return code when `--wait` is requested on a job array.
-  * Fix collected `GPUUtilization` values for `acct_gather_profile` plugins.
-  * Fix `slurmrestd` handling of job hold/release operations.
-  * Make spank `S_JOB_ARGV` item value hold the requested command `argv`
-    instead of the `srun --bcast` value when `--bcast` requested (only in local
-    context).
-  * Fix step running indefinitely when slurmctld takes more than
-    `MessageTimeout` to respond. Now, slurmctld will cancel the step when
-    detected, preventing following steps from getting stuck waiting for
-    resources to be released.
-  * Fix regression to make `job_desc.min_cpus` accurate again in job_submit when
-    requesting a job with `--ntasks-per-node`.
-  * Fix handling of `ArrayTaskThrottle` in backfill.
-  * Fix regression in 23.02.2 when checking gres state on `slurmctld` startup or
-    reconfigure. Gres changes in the configuration were not updated on slurmctld
-    startup. On startup or reconfigure, these messages were present in the log:
-    `"error: Attempt to change gres/gpu Count`".
-  * Fix potential double count of gres when dealing with limits.
-  * Fix slurmstepd segfault when ContainerPath is not set in `oci.conf`
-  * Fixed an issue where jobs requesting licenses were incorrectly rejected.
-  * `scrontab` - Fix cutting off the final character of quoted variables.
-  * `smail` - Fix issues where e-mails at job completion were not being sent.
-  * `scontrol/slurmctld` - fix comma parsing when updating a reservation's
-    nodes.
-  * Fix `--gpu-bind=single binding` tasks to wrong gpus, leading to some gpus
-    having more tasks than they should and other gpus being unused.
-  * Fix regression in 23.02 that causes slurmstepd to crash when srun requests
-    more than `TreeWidth` nodes in a step and uses the pmi2 or pmix plugin.
-  * `job_container/tmpfs` - Fix `%h` and `%n` substitution in `BasePath` where
-    `%h` was substituted as the NodeName instead of the hostname, and %n was
-    substituted as an empty string.
-  * Fix regression where `--cpu-bind=verbose` would override `TaskPluginParam`.
-  * `scancel` - Fix `--clusters/-M` for federations. Only filtered jobs (e.g.
-    `-A`, `-u`, `-p`, etc.) from the specified clusters will be canceled,
-    rather than all jobs in the federation. Specific jobids will still be
-    routed to the origin cluster for cancellation.
-- Fixes since 23.02.02
-  Highlight:
-  * `slurmctld` - Fix backup slurmctld crash when it takes control multiple
-    times.
-  Other:
-  * Fix regression in 23.02.2 that ignored the partition `DefCpuPerGPU` setting
-    on the first pass of scheduling a job requesting `--gpus --ntasks`.
-  * `srun` - fix issue creating regular and interactive steps because
-    *_PACK_GROUP* environment variables were incorrectly set on non-HetSteps.
-  * Fix dynamic nodes getting stuck in allocated states when reconfiguring.
-  * Fix regression in 23.02.2 that set the `SLURM_NTASKS` environment variable
-    in sbatch jobs from `--ntasks-per-node` when `--ntasks` was not requested.
-  * Fix regression in 23.02 that caused sbatch jobs to set the wrong number
-    of tasks when requesting `--ntasks-per-node` without `--ntasks`, and also
-    requesting one of the following options: `--sockets-per-node`,
-    --cores-per-socket, --threads-per-core (or `--hint=nomultithread`), or
-    `-B,--extra-node-info`.
-  * Fix double counting suspended job counts on nodes when reconfiguring, which
-    prevented nodes with suspended jobs from being powered down or rebooted
-    once the jobs completed.
-  * Fix backfill not scheduling jobs submitted with `--prefer` and
-    `--constraint` properly.
-  * mpi/pmix - fix regression introduced in 23.02.2 which caused PMIx shmem
-    backed files permissions to be incorrect.
-  * api/submit - fix memory leaks when submission of batch regular jobs or batch
-    HetJobs fails (response data is a return code).
-  * Fix regression in 23.02 leading to error() messages being sent at `INFO`
-    instead of `ERR` in syslog.
-  * Fix `TresUsageIn[Tot|Ave]` calculation for `gres/gpumem` and `gres/gpuutil`.
-  * Fix issue in the gpu plugins where gpu frequencies would only be set if both
-    gpu memory and gpu frequencies were set, while one or the other suffices.
-  * Fix reservations group ACL's not working with the root group.
-  * Fix updating a job with a ReqNodeList greater than the job's node count.
-  * Fix inadvertent permission denied error for `--task-prolog` and
-    `--task-epilog` with filesystems mounted with `root_squash`.
-  * Fix missing detailed cpu and gres information in json/yaml output from
-    `scontrol`, `squeue` and `sinfo`.
-  * Fix regression in 23.02 that causes a failure to allocate job steps that
-    request `--cpus-per-gpu` and gpus with types.
-  * Fix potentially waiting indefinitely for a defunct process to finish,
-    which affects various scripts including `Prolog` and `Epilog`. This could
-    have various symptoms, such as jobs getting stuck in a completing state.
-  * Fix losing list of reservations on job when updating job with list of
-    reservations and restarting the controller.
-  * Fix nodes resuming after down and drain state update requests from
-    clients older than 23.02.
-  * Fix advanced reservation creation/update when an association that should
-    have access to it is composed with partition(s).
-  * Fix job layout calculations with `--ntasks-per-gpu`, especially when
-    `--nodes` has not been explicitly provided.
-  * Fix X11 forwarding for jobs submitted from the slurmctld host.
-  * When a job requests `--no-kill` and one or more nodes fail during the job,
-    fix subsequent job steps unable to use some of the remaining resources
-    allocated to the job.
-  * Fix shared gres allocation when using `--tres-per-task` with tasks that span
-    multiple sockets.
-- Other changes
-  (since 23.02.3):
-  * `scontrol` - Permit changes to StdErr and StdIn for pending jobs.
-  * `scontrol` - Reset std{err,in,out} when set to empty string.
-  * `slurmrestd` - mark environment as a required field for job submission
-    descriptions.
-  * `slurmrestd` - avoid dumping null in OpenAPI schema required fields.
-  * `data_parser/v0.0.39` - avoid rejecting valid memory_per_node formatted as
-    dictionary provided with a job description.
-  * `data_parser/v0.0.39` - avoid rejecting valid memory_per_cpu formatted as
-    dictionary provided with a job description.
-  * `slurmrestd` - Return HTTP error code 404 when job query fails.
-  * `slurmrestd` - Add return schema to error response to job and license query.
-  * Change the log message warning for rate limited users from debug to verbose.
-  * `cgroup/v2` - Avoid capturing log output for ebpf when constraining devices,
-    as this can lead to inadvertent failure if the log buffer is too small.
-  * Added error message when attempting to use sattach on batch or extern steps.
-  * Reject job ArrayTaskThrottle update requests from unprivileged users.
-  * `data_parser/v0.0.39` - populate description fields of property objects in
-    generated OpenAPI specifications where defined.
-  * `slurmstepd` - Avoid segfault caused by ContainerPath not being terminated
-    by '/' in oci.conf.
-  * `data_parser/v0.0.39` - Change `v0.0.39_job_info` response to tag `exit_code`
-    field as being complex instead of only an unsigned integer.
-  (since 23.02.2):
-  * `openapi/dbv0.0.39/users` - If a default account update failed, resulting
-    in a no-op, the query returned success without any warning. Now a warning
-    is sent back to the client that the default account wasn't modified.
-  * Avoid job write lock when nodes are dynamically added/removed.
-  * burst_buffer/lua - allow jobs to get scheduled sooner after
-    `slurm_bb_data_in` completes.
-  * `openapi/v0.0.39` - fix memory leak in `_job_post_het_submit()`.
-  * Avoid possible `slurmctld` segfault caused by race condition with already
-    completed `slurmdbd_conn` connections.
-  * `Slurmdbd.conf` checks included conf files for 0600 permissions
-  * `slurmrestd` - fix regression "oversubscribe" fields were removed from job
-    descriptions and submissions from v0.0.39 end points.
-  * `accounting_storage/mysql` - Query for indiviual QOS correctly when you have
-    more than 10.
-  * Add warning message about ignoring `--tres-per-tasks=license` when used
-    on a step.
-  * `sshare` - Fix command to work when using priority/basic.
-  * Avoid loading `cli_filter` plugins outside of `salloc`/`sbatch`/`scron`/
-    `srun`. This fixes a number of missing symbol problems that can manifest
-    for executables linked against libslurm (and not `libslurmfull`).
-  * Allow cloud_reg_addrs to update dynamically registered node's addrs on
-    subsequent registrations.
-  * Revert a change in 22.05.5 that prevented tasks from sharing a core if
-    `--cpus-per-task` > threads per core, but caused incorrect accounting and
-    cpu.
-    binding. Instead, `--ntasks-per-core=1` may be requested to prevent tasks
-    from sharing a core.
-  * Correctly send `assoc_mgr` lock to mcs plugin.
-  * Avoid unnecessary gres/gpumem and gres/gpuutil TRES position lookups.
-  * `sacct` - when printing PLANNED time, use end time instead of start time for
-    jobs cancelled before they started.
-  * Hold the job with "(Reservation ... invalid)" state reason if the
-    reservation is not usable by the job.
-  * `auth/jwt` - Fix memory leak.
-  * `sbatch` - Added new `--export=NIL` option.
+- Updated to 23.02.4 with the following changes:
+  * Bug Fixes:
+    + Fix main scheduler loop not starting after a failover to backup
+      controller. Avoid slurmctld segfault when specifying
+      `AccountingStorageExternalHost` (bsc#1214983).
+    + Fix sbatch return code when `--wait` is requested on a job array.
+    + Fix collected `GPUUtilization` values for `acct_gather_profile` plugins.
+    + Fix `slurmrestd` handling of job hold/release operations.
+    + Fix step running indefinitely when slurmctld takes more than
+      `MessageTimeout` to respond. Now, `slurmctld` will cancel the step when
+      detected, preventing following steps from getting stuck waiting for
+      resources to be released.
+    + Fix regression to make `job_desc.min_cpus` accurate again in `job_submit`
+      when requesting a job with `--ntasks-per-node`.
+    + Fix handling of `ArrayTaskThrottle` in backfill.
+    + Fix regression in 23.02.2 when checking gres state on `slurmctld`
+      startup or reconfigure. Gres changes in the configuration were not
+      updated on slurmctld startup. On startup or reconfigure, these messages
+      were present in the log: `"error: Attempt to change gres/gpu Count`".
+    + Fix potential double count of gres when dealing with limits.
+    + Fix `slurmstepd` segfault when `ContainerPath` is not set in `oci.conf`
+    + Fixed an issue where jobs requesting licenses were incorrectly rejected.
+    + `scrontab` - Fix cutting off the final character of quoted variables.
+    + `smail` - Fix issues where e-mails at job completion were not being sent.
+    + `scontrol/slurmctld` - fix comma parsing when updating a reservation's
+      nodes.
+    + Fix `--gpu-bind=single binding` tasks to wrong gpus, leading to some gpus
+      having more tasks than they should and other gpus being unused.
+    + Fix regression in 23.02 that causes slurmstepd to crash when `srun`
+      requests more than `TreeWidth` nodes in a step and uses the pmi2 or
+      pmix plugin.
+    + `job_container/tmpfs` - Fix `%h` and `%n` substitution in `BasePath`
+      where `%h` was substituted as the NodeName instead of the hostname,
+      and %n was substituted as an empty string.
+    + Fix regression where `--cpu-bind=verbose` would override
+      `TaskPluginParam`.
+    + `scancel` - Fix `--clusters/-M` for federations. Only filtered jobs
+      (e.g. `-A`, `-u`, `-p`, etc.) from the specified clusters will be
+      canceled, rather than all jobs in the federation. Specific jobids
+      will still be routed to the origin cluster for cancellation.
+  * Other changes:
+    + Make spank `S_JOB_ARGV` item value hold the requested command `argv`
+      instead of the `srun --bcast` value when `--bcast` requested (only in
+      local context).
+    + `scontrol` - Permit changes to StdErr and StdIn for pending jobs.
+    + `scontrol` - Reset `std`{`err`,`in`,`out`} when set to empty string.
+    + `slurmrestd` - mark environment as a required field for job submission
+      descriptions.
+    + `slurmrestd` - avoid dumping null in OpenAPI schema required fields.
+    + `data_parser/v0.0.39` - avoid rejecting valid `memory_per_node` formatted
+      as dictionary provided with a job description.
+    + `data_parser/v0.0.39` - avoid rejecting valid `memory_per_cpu` formatted
+      as dictionary provided with a job description.
+    + `slurmrestd` - Return HTTP error code 404 when job query fails.
+    + `slurmrestd` - Add return schema to error response to job and license
+      query.
+    + Change the log message warning for rate limited users from debug to
+      verbose.
+    + `cgroup/v2` - Avoid capturing log output for ebpf when constraining
+      devices,
+      as this can lead to inadvertent failure if the log buffer is too small.
+    + Added error message when attempting to use sattach on batch or extern
+      steps.
+    + Reject job `ArrayTaskThrottle` update requests from unprivileged users.
+    + `data_parser/v0.0.39` - populate description fields of property objects
+      in generated OpenAPI specifications where defined.
+    + `slurmstepd` - Avoid segfault caused by `ContainerPath` not being
+      terminated by `/` in `oci.conf`.
+    + `data_parser/v0.0.39` - Change `v0.0.39_job_info` response to tag
+      `exit_code` field as being complex instead of only an unsigned integer.
+- Updated to 23.02.3 with the following changes:
+  * Bug Fixes:
+    + `slurmctld` - Fix backup slurmctld crash when it takes control
+      multiple times.
+    + Fix regression in 23.02.2 that ignored the partition `DefCpuPerGPU`
+      setting on the first pass of scheduling a job requesting
+      `--gpus --ntasks`.
+    + `srun` - fix issue creating regular and interactive steps because
+      environment variables were incorrectly set on non-HetSteps.
+    + Fix dynamic nodes getting stuck in allocated states when reconfiguring.
+    + Fix regression in 23.02.2 that set the `SLURM_NTASKS` environment
+      variable in sbatch jobs from `--ntasks-per-node` when `--ntasks` was not
+      requested.
+    + Fix regression in 23.02 that caused sbatch jobs to set the wrong number
+      of tasks when requesting `--ntasks-per-node` without `--ntasks`, and also
+      requesting one of the following options: `--sockets-per-node`,
+      `--cores-per-socket`, `--threads-per-core` (or `--hint=nomultithread`),
+      or `-B,--extra-node-info`.
+    + Fix double counting suspended job counts on nodes when reconfiguring,
+      which prevented nodes with suspended jobs from being powered down or
+      rebooted once the jobs completed.
+    + Fix backfill not scheduling jobs submitted with `--prefer` and
+      `--constraint` properly.
+    + mpi/pmix - fix regression introduced in 23.02.2 which caused PMIx shmem
+      backed files permissions to be incorrect.
+    + api/submit - fix memory leaks when submission of batch regular jobs
+      or batch HetJobs fails (response data is a return code).
+    + Fix regression in 23.02 leading to error() messages being sent at `INFO`
+      instead of `ERR` in syslog.
+    + Fix `TresUsageIn[Tot|Ave]` calculation for `gres/gpumem` and
+      `gres/gpuutil`.
+    + Fix issue in the gpu plugins where gpu frequencies would only be set if
+      both gpu memory and gpu frequencies were set, while one or the other
+      suffices.
+    + Fix reservations group ACL's not working with the root group.
+    + Fix updating a job with a ReqNodeList greater than the job's node count.
+    + Fix inadvertent permission denied error for `--task-prolog` and
+      `--task-epilog` with filesystems mounted with `root_squash`.
+    + Fix missing detailed cpu and gres information in json/yaml output from
+      `scontrol`, `squeue` and `sinfo`.
+    + Fix regression in 23.02 that causes a failure to allocate job steps that
+      request `--cpus-per-gpu` and gpus with types.
+    + Fix potentially waiting indefinitely for a defunct process to finish,
+      which affects various scripts including `Prolog` and `Epilog`. This could
+      have various symptoms, such as jobs getting stuck in a completing state.
+    + Fix losing list of reservations on job when updating job with list of
+      reservations and restarting the controller.
+    + Fix nodes resuming after down and drain state update requests from
+      clients older than 23.02.
+    + Fix advanced reservation creation/update when an association that should
+      have access to it is composed with partition(s).
+    + Fix job layout calculations with `--ntasks-per-gpu`, especially when
+      `--nodes` has not been explicitly provided.
+    + Fix X11 forwarding for jobs submitted from the slurmctld host.
+    + When a job requests `--no-kill` and one or more nodes fail during the
+      job, fix subsequent job steps unable to use some of the remaining
+      resources allocated to the job.
+    + Fix shared gres allocation when using `--tres-per-task` with tasks that
+      span multiple sockets.
+    + `auth/jwt` - Fix memory leak.
+  * Other changes:
+    + `openapi/dbv0.0.39/users` - If a default account update failed, resulting
+      in a no-op, the query returned success without any warning. Now a warning
+      is sent back to the client that the default account wasn't modified.
+    + Avoid job write lock when nodes are dynamically added/removed.
+    + `burst_buffer/lua` - allow jobs to get scheduled sooner after
+      `slurm_bb_data_in` completes.
+    + `openapi/v0.0.39` - fix memory leak in `_job_post_het_submit()`.
+    + Avoid possible `slurmctld` segfault caused by race condition with already
+      completed `slurmdbd_conn` connections.
+    + `Slurmdbd.conf` checks included conf files for 0600 permissions
+    + `slurmrestd` - fix regression "oversubscribe" fields were removed from
+      job descriptions and submissions from v0.0.39 end points.
+    + `accounting_storage/mysql` - Query for indiviual QOS correctly when you
+      have more than 10.
+    + Add warning message about ignoring `--tres-per-tasks=license` when used
+      on a step.
+    + `sshare` - Fix command to work when using `priority/basic`.
+    + Avoid loading `cli_filter` plugins outside of `salloc`/`sbatch`/`scron`/
+      `srun`. This fixes a number of missing symbol problems that can manifest
+      for executables linked against libslurm (and not `libslurmfull`).
+    + Allow cloud_reg_addrs to update dynamically registered node's addrs on
+      subsequent registrations.
+    + Revert a change in 22.05.5 that prevented tasks from sharing a core if
+      `--cpus-per-task` > threads per core, but caused incorrect accounting and
+      cpu binding. Instead, `--ntasks-per-core=1` may be requested to prevent
+      tasks from sharing a core.
+    + Correctly send `assoc_mgr` lock to mcs plugin.
+    + Avoid unnecessary `gres/gpumem` and `gres/gpuutil` `TRES` position
+      lookups.
+    + `sacct` - when printing `PLANNED` time, use end time instead of start
+      time for jobs cancelled before they started.
+    + Hold the job with "`(Reservation ... invalid)`" state reason if the
+      reservation is not usable by the job.
+    + `sbatch` - Added new `--export=NIL` option.
 - Removed:
   * Fix-test-3.13.patch
   * Fix-test-38.11.patch as both tests changed upstream