forked from pool/slurm

- Updated to 23.02.4 with the following changes:

* Bug Fixes:
    + Fix main scheduler loop not starting after a failover to backup
      controller. Avoid slurmctld segfault when specifying
      `AccountingStorageExternalHost` (bsc#1214983).
    + Fix sbatch return code when `--wait` is requested on a job array.
    + Fix collected `GPUUtilization` values for `acct_gather_profile` plugins.
    + Fix `slurmrestd` handling of job hold/release operations.
    + Fix step running indefinitely when slurmctld takes more than
      `MessageTimeout` to respond. Now, `slurmctld` will cancel the step when
      detected, preventing following steps from getting stuck waiting for
      resources to be released.
    + Fix regression to make `job_desc.min_cpus` accurate again in `job_submit`
      when requesting a job with `--ntasks-per-node`.
    + Fix handling of `ArrayTaskThrottle` in backfill.
    + Fix regression in 23.02.2 when checking gres state on `slurmctld`
      startup or reconfigure. Gres changes in the configuration were not
      updated on slurmctld startup. On startup or reconfigure, these messages
      were present in the log: `error: Attempt to change gres/gpu Count`.
    + Fix potential double count of gres when dealing with limits.
    + Fix `slurmstepd` segfault when `ContainerPath` is not set in `oci.conf`.
    + Fixed an issue where jobs requesting licenses were incorrectly rejected.
    + `scrontab` - Fix cutting off the final character of quoted variables.
    + `smail` - Fix issues where e-mails at job completion were not being sent.
    + `scontrol/slurmctld` - fix comma parsing when updating a reservation's
      nodes.
    + Fix `--gpu-bind=single` binding tasks to wrong gpus, leading to some gpus
      having more tasks than they should and other gpus being unused.
    + Fix regression in 23.02 that causes slurmstepd to crash when `srun`
      requests more than `TreeWidth` nodes in a step and uses the pmi2 or

OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=260
This commit is contained in:
Egbert Eich 2023-09-11 07:21:32 +00:00 committed by Git OBS Bridge
parent c63b605916
commit f9646ba945


@ -1,164 +1,171 @@
-------------------------------------------------------------------
Mon Aug 21 09:43:08 UTC 2023 - Christian Goll <cgoll@suse.com>
- Updated to 23.02.4 with the following changes:
  * Bug Fixes:
    + Fix main scheduler loop not starting after a failover to backup
      controller. Avoid slurmctld segfault when specifying
      `AccountingStorageExternalHost` (bsc#1214983).
    + Fix sbatch return code when `--wait` is requested on a job array.
    + Fix collected `GPUUtilization` values for `acct_gather_profile` plugins.
    + Fix `slurmrestd` handling of job hold/release operations.
    + Fix step running indefinitely when slurmctld takes more than
      `MessageTimeout` to respond. Now, `slurmctld` will cancel the step when
      detected, preventing following steps from getting stuck waiting for
      resources to be released.
    + Fix regression to make `job_desc.min_cpus` accurate again in `job_submit`
      when requesting a job with `--ntasks-per-node`.
    + Fix handling of `ArrayTaskThrottle` in backfill.
    + Fix regression in 23.02.2 when checking gres state on `slurmctld`
      startup or reconfigure. Gres changes in the configuration were not
      updated on slurmctld startup. On startup or reconfigure, these messages
      were present in the log: `error: Attempt to change gres/gpu Count`.
    + Fix potential double count of gres when dealing with limits.
    + Fix `slurmstepd` segfault when `ContainerPath` is not set in `oci.conf`.
    + Fixed an issue where jobs requesting licenses were incorrectly rejected.
    + `scrontab` - Fix cutting off the final character of quoted variables.
    + `smail` - Fix issues where e-mails at job completion were not being sent.
    + `scontrol/slurmctld` - fix comma parsing when updating a reservation's
      nodes.
    + Fix `--gpu-bind=single` binding tasks to wrong gpus, leading to some gpus
      having more tasks than they should and other gpus being unused.
    + Fix regression in 23.02 that causes slurmstepd to crash when `srun`
      requests more than `TreeWidth` nodes in a step and uses the pmi2 or
      pmix plugin.
    + `job_container/tmpfs` - Fix `%h` and `%n` substitution in `BasePath`
      where `%h` was substituted as the NodeName instead of the hostname,
      and `%n` was substituted as an empty string.
    + Fix regression where `--cpu-bind=verbose` would override
      `TaskPluginParam`.
    + `scancel` - Fix `--clusters/-M` for federations. Only filtered jobs
      (e.g. `-A`, `-u`, `-p`, etc.) from the specified clusters will be
      canceled, rather than all jobs in the federation. Specific jobids
      will still be routed to the origin cluster for cancellation.
  * Other changes:
    + Make spank `S_JOB_ARGV` item value hold the requested command `argv`
      instead of the `srun --bcast` value when `--bcast` requested (only in
      local context).
    + `scontrol` - Permit changes to StdErr and StdIn for pending jobs.
    + `scontrol` - Reset `std{err,in,out}` when set to empty string.
    + `slurmrestd` - mark environment as a required field for job submission
      descriptions.
    + `slurmrestd` - avoid dumping null in OpenAPI schema required fields.
    + `data_parser/v0.0.39` - avoid rejecting valid `memory_per_node` formatted
      as dictionary provided with a job description.
    + `data_parser/v0.0.39` - avoid rejecting valid `memory_per_cpu` formatted
      as dictionary provided with a job description.
    + `slurmrestd` - Return HTTP error code 404 when job query fails.
    + `slurmrestd` - Add return schema to error response to job and license
      query.
    + Change the log message warning for rate limited users from debug to
      verbose.
    + `cgroup/v2` - Avoid capturing log output for ebpf when constraining
      devices, as this can lead to inadvertent failure if the log buffer is
      too small.
    + Added error message when attempting to use sattach on batch or extern
      steps.
    + Reject job `ArrayTaskThrottle` update requests from unprivileged users.
    + `data_parser/v0.0.39` - populate description fields of property objects
      in generated OpenAPI specifications where defined.
    + `slurmstepd` - Avoid segfault caused by `ContainerPath` not being
      terminated by `/` in `oci.conf`.
    + `data_parser/v0.0.39` - Change `v0.0.39_job_info` response to tag
      `exit_code` field as being complex instead of only an unsigned integer.
- Updated to 23.02.3 with the following changes:
  * Bug Fixes:
    + `slurmctld` - Fix backup slurmctld crash when it takes control
      multiple times.
    + Fix regression in 23.02.2 that ignored the partition `DefCpuPerGPU`
      setting on the first pass of scheduling a job requesting
      `--gpus --ntasks`.
    + `srun` - fix issue creating regular and interactive steps because
      `*_PACK_GROUP*` environment variables were incorrectly set on
      non-HetSteps.
    + Fix dynamic nodes getting stuck in allocated states when reconfiguring.
    + Fix regression in 23.02.2 that set the `SLURM_NTASKS` environment
      variable in sbatch jobs from `--ntasks-per-node` when `--ntasks` was not
      requested.
    + Fix regression in 23.02 that caused sbatch jobs to set the wrong number
      of tasks when requesting `--ntasks-per-node` without `--ntasks`, and also
      requesting one of the following options: `--sockets-per-node`,
      `--cores-per-socket`, `--threads-per-core` (or `--hint=nomultithread`),
      or `-B,--extra-node-info`.
    + Fix double counting suspended job counts on nodes when reconfiguring,
      which prevented nodes with suspended jobs from being powered down or
      rebooted once the jobs completed.
    + Fix backfill not scheduling jobs submitted with `--prefer` and
      `--constraint` properly.
    + `mpi/pmix` - fix regression introduced in 23.02.2 which caused PMIx shmem
      backed files permissions to be incorrect.
    + `api/submit` - fix memory leaks when submission of batch regular jobs
      or batch HetJobs fails (response data is a return code).
    + Fix regression in 23.02 leading to `error()` messages being sent at `INFO`
      instead of `ERR` in syslog.
    + Fix `TresUsageIn[Tot|Ave]` calculation for `gres/gpumem` and
      `gres/gpuutil`.
    + Fix issue in the gpu plugins where gpu frequencies would only be set if
      both gpu memory and gpu frequencies were set, while one or the other
      suffices.
    + Fix reservations group ACLs not working with the root group.
    + Fix updating a job with a ReqNodeList greater than the job's node count.
    + Fix inadvertent permission denied error for `--task-prolog` and
      `--task-epilog` with filesystems mounted with `root_squash`.
    + Fix missing detailed cpu and gres information in json/yaml output from
      `scontrol`, `squeue` and `sinfo`.
    + Fix regression in 23.02 that causes a failure to allocate job steps that
      request `--cpus-per-gpu` and gpus with types.
    + Fix potentially waiting indefinitely for a defunct process to finish,
      which affects various scripts including `Prolog` and `Epilog`. This could
      have various symptoms, such as jobs getting stuck in a completing state.
    + Fix losing list of reservations on job when updating job with list of
      reservations and restarting the controller.
    + Fix nodes resuming after down and drain state update requests from
      clients older than 23.02.
    + Fix advanced reservation creation/update when an association that should
      have access to it is composed with partition(s).
    + Fix job layout calculations with `--ntasks-per-gpu`, especially when
      `--nodes` has not been explicitly provided.
    + Fix X11 forwarding for jobs submitted from the slurmctld host.
    + When a job requests `--no-kill` and one or more nodes fail during the
      job, fix subsequent job steps unable to use some of the remaining
      resources allocated to the job.
    + Fix shared gres allocation when using `--tres-per-task` with tasks that
      span multiple sockets.
    + `auth/jwt` - Fix memory leak.
  * Other changes:
    + `openapi/dbv0.0.39/users` - If a default account update failed, resulting
      in a no-op, the query returned success without any warning. Now a warning
      is sent back to the client that the default account wasn't modified.
    + Avoid job write lock when nodes are dynamically added/removed.
    + `burst_buffer/lua` - allow jobs to get scheduled sooner after
      `slurm_bb_data_in` completes.
    + `openapi/v0.0.39` - fix memory leak in `_job_post_het_submit()`.
    + Avoid possible `slurmctld` segfault caused by race condition with already
      completed `slurmdbd_conn` connections.
    + `slurmdbd.conf` checks included conf files for 0600 permissions.
    + `slurmrestd` - fix regression where "oversubscribe" fields were removed
      from job descriptions and submissions from v0.0.39 end points.
    + `accounting_storage/mysql` - Query for individual QOS correctly when you
      have more than 10.
    + Add warning message about ignoring `--tres-per-task=license` when used
      on a step.
    + `sshare` - Fix command to work when using `priority/basic`.
    + Avoid loading `cli_filter` plugins outside of `salloc`/`sbatch`/`scron`/
      `srun`. This fixes a number of missing symbol problems that can manifest
      for executables linked against `libslurm` (and not `libslurmfull`).
    + Allow `cloud_reg_addrs` to update dynamically registered node's addrs on
      subsequent registrations.
    + Revert a change in 22.05.5 that prevented tasks from sharing a core if
      `--cpus-per-task` > threads per core, but caused incorrect accounting and
      cpu binding. Instead, `--ntasks-per-core=1` may be requested to prevent
      tasks from sharing a core.
    + Correctly send `assoc_mgr` lock to mcs plugin.
    + Avoid unnecessary `gres/gpumem` and `gres/gpuutil` `TRES` position
      lookups.
    + `sacct` - when printing `PLANNED` time, use end time instead of start
      time for jobs cancelled before they started.
    + Hold the job with "`(Reservation ... invalid)`" state reason if the
      reservation is not usable by the job.
    + `sbatch` - Added new `--export=NIL` option.
- Removed:
  * Fix-test-3.13.patch
  * Fix-test-38.11.patch as both tests changed upstream
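
The `job_container/tmpfs` entry in the 23.02.4 list describes `%h` and `%n` substitution in `BasePath`: `%h` should expand to the hostname and `%n` to the NodeName. The following is a minimal sketch of that intended behaviour only. The function `expand_base_path` and the sample values are hypothetical and are not taken from the Slurm source.

```python
def expand_base_path(template: str, hostname: str, node_name: str) -> str:
    """Expand %h and %n in a BasePath-style template.

    Illustrative only: this is NOT Slurm's implementation, just a model of
    the behaviour the changelog entry describes.
    """
    out = []
    i = 0
    while i < len(template):
        # Look for a two-character specifier starting with '%'.
        if template[i] == "%" and i + 1 < len(template):
            spec = template[i + 1]
            if spec == "h":
                out.append(hostname)   # %h -> hostname
                i += 2
                continue
            if spec == "n":
                out.append(node_name)  # %n -> NodeName
                i += 2
                continue
        # Any other character (including an unknown %x) is copied verbatim.
        out.append(template[i])
        i += 1
    return "".join(out)


if __name__ == "__main__":
    # Hypothetical values chosen so the two substitutions are distinguishable.
    print(expand_base_path("/var/tmp/%h/%n", "host01.example.org", "node01"))
```

With the bug described in the entry, `%h` would have produced the NodeName and `%n` an empty string, so the two path components above would not have matched the configured layout.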