* Fix security issue where a coordinator could add a user with
elevated privileges. CVE-2025-43904.
* Return error to `scontrol reboot` on bad nodelists.
* `slurmrestd` - Report an error when QOS resolution fails for
v0.0.40 endpoints.
* `slurmrestd` - Report an error when QOS resolution fails for
v0.0.41 endpoints.
* `slurmrestd` - Report an error when QOS resolution fails for
v0.0.42 endpoints.
* `data_parser/v0.0.42` - Added `+inline_enums` flag which
modifies the output when generating OpenAPI specification.
It causes enum arrays to not be defined in their own schema
with references (`$ref`) to them. Instead they will be dumped
inline.
* Fix binding error with `tres-bind map/mask` on partial node
allocations.
* Fix `stepmgr` enabled steps being able to request features.
* Reject step creation if requested feature is not available
in job.
* `slurmd` - Restrict listening for new incoming RPC requests
further into startup.
* `slurmd` - Avoid `auth/slurm` related hangs of CLI commands
during startup and shutdown.
* `slurmctld` - Restrict processing new incoming RPC requests
further into startup. Stop processing requests sooner during
shutdown.
* `slurmctld` - Avoid `auth/slurm` related hangs of CLI commands
during startup and shutdown.
* `slurmctld` - Avoid race condition during shutdown or
OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=315
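The `+inline_enums` entry above describes a change in how enum arrays appear in the generated OpenAPI specification: inlined rather than placed in their own schema and referenced via `$ref`. The following is a minimal sketch of that effect on a plain dictionary. The schema names (`example_flags`, `example_job`) and the helper `inline_enum_refs` are hypothetical and only illustrate the idea; this is not how `slurmrestd` implements the flag.

```python
import copy

# Hypothetical schema fragment; names are illustrative only and are not
# taken from the real slurmrestd specification.
SPEC = {
    "components": {
        "schemas": {
            "example_flags": {
                "type": "array",
                "items": {"type": "string", "enum": ["A", "B"]},
            },
            "example_job": {
                "type": "object",
                "properties": {
                    "flags": {"$ref": "#/components/schemas/example_flags"},
                },
            },
        }
    }
}


def _is_enum_array(schema):
    """True for schemas shaped like {"type": "array", "items": {"enum": [...]}}."""
    return schema.get("type") == "array" and "enum" in schema.get("items", {})


def inline_enum_refs(spec):
    """Return a new spec where $refs to enum arrays are replaced by the enum itself."""
    schemas = spec["components"]["schemas"]
    enum_names = {name for name, s in schemas.items() if _is_enum_array(s)}
    prefix = "#/components/schemas/"

    def resolve(node):
        if isinstance(node, dict):
            ref = node.get("$ref")
            if (isinstance(ref, str) and ref.startswith(prefix)
                    and ref[len(prefix):] in enum_names):
                # Dump the enum array inline instead of keeping the reference.
                return copy.deepcopy(schemas[ref[len(prefix):]])
            return {key: resolve(value) for key, value in node.items()}
        if isinstance(node, list):
            return [resolve(item) for item in node]
        return node

    result = resolve(spec)
    # The enum arrays no longer need a schema of their own.
    for name in enum_names:
        del result["components"]["schemas"][name]
    return result
```

Calling `inline_enum_refs(SPEC)` leaves `SPEC` untouched and returns a copy in which `example_job.properties.flags` carries the enum array directly.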
Update to version 24.11.1:
* With client commands `MIN_MEMORY` will show `mem_per_tres` if
specified.
* Fix errno message about bad constraint.
* `slurmctld` - Fix crash and possible split brain issue if the
backup controller handles an scontrol reconfigure while in control
before the primary resumes operation.
* Fix `stepmgr` not getting dynamic node addrs from the controller
* `stepmgr` - avoid "`Unexpected missing socket`" errors.
* Fix `scontrol show steps` with dynamic stepmgr.
* Deny jobs using the "`R:`" option of `--signal` if `PreemptMode=OFF`
globally.
* Force jobs using the "`R:`" option of `--signal` to be
preemptable by requeue or cancel only. If `PreemptMode` on the
partition or QOS is off or suspend, the job will default to using
`PreemptMode=cancel`.
* If `--mem-per-cpu` exceeds `MaxMemPerCPU`, the number of CPUs
per task will always be increased even if `--cpus-per-task` was
specified. This is needed to ensure each task gets the expected
amount of memory.
* Fix compilation issue on openSUSE Leap 15.
* Fix jobs using more nodes than needed when not using `-N`.
* Fix issue with allocation being allocated fewer resources
than needed when using `--gres-flags=enforce-binding`.
* `select/cons_tres` - Fix errors with `MaxCpusPerSocket`
partition limit. Used CPUs/cores weren't counted properly,
and free ones weren't limited to those available, when the
socket was partially allocated or the job request went beyond
this limit.
* Fix issue when jobs were preempted for licenses even if there (forwarded request 1244326 from eeich)
OBS-URL: https://build.opensuse.org/request/show/1244329
OBS-URL: https://build.opensuse.org/package/show/openSUSE:Factory/slurm?expand=0&rev=112
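The `--mem-per-cpu` / `MaxMemPerCPU` entry in the 24.11.1 list above can be illustrated with a little arithmetic. The sketch below is a simplified model, assuming that the memory requested per task is preserved and that the CPU count is raised until the per-CPU share fits under the limit. The function name and the rounding choices are assumptions made for illustration, not Slurm's actual code.

```python
import math


def adjust_cpus_per_task(mem_per_cpu_mb, cpus_per_task, max_mem_per_cpu_mb):
    """Return (cpus_per_task, mem_per_cpu_mb) after applying the limit.

    Simplified model (assumption): keep the total memory per task and add
    CPUs until the per-CPU memory no longer exceeds MaxMemPerCPU.
    """
    if mem_per_cpu_mb <= max_mem_per_cpu_mb:
        # Request already fits, nothing to adjust.
        return cpus_per_task, mem_per_cpu_mb

    mem_per_task_mb = mem_per_cpu_mb * cpus_per_task
    new_cpus = math.ceil(mem_per_task_mb / max_mem_per_cpu_mb)
    new_mem_per_cpu = math.ceil(mem_per_task_mb / new_cpus)
    return new_cpus, new_mem_per_cpu
```

For example, under this model a request of 8192 MB per CPU with one CPU per task and a limit of 4096 MB becomes two CPUs per task at 4096 MB each, so the task still receives 8192 MB in total.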
OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=309
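The two 24.11.1 entries about the "`R:`" option of `--signal` describe a small decision rule. A minimal sketch of that rule follows; the function name, the lowercase mode strings, and the choice of exceptions are assumptions for illustration. Modes the changelog does not mention are rejected here rather than guessed at.

```python
def effective_preempt_mode(global_mode, local_mode):
    """Preempt mode applied to a job that uses the R: option of --signal.

    global_mode: cluster-wide PreemptMode.
    local_mode:  PreemptMode of the partition or QOS.
    """
    global_mode = global_mode.lower()
    local_mode = local_mode.lower()

    if global_mode == "off":
        # Such jobs are denied when preemption is off globally.
        raise PermissionError("job denied: PreemptMode=OFF globally")
    if local_mode in ("requeue", "cancel"):
        return local_mode
    if local_mode in ("off", "suspend"):
        # Falls back to cancel, as the entry above states.
        return "cancel"
    # Not covered by the changelog entry; left undefined in this sketch.
    raise ValueError(f"mode not covered by this sketch: {local_mode}")
```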
* `slurmrestd` - Remove deprecated fields from the following
`.result` from `POST /slurm/v0.0.42/job/submit`.
`.job_id`, `.step_id`, `.job_submit_user_msg` from `POST /slurm/v0.0.42/job/{job_id}`.
`.job.exclusive`, `.jobs[].exclusive` to `POST /slurm/v0.0.42/job/submit`.
`.jobs[].exclusive` from `GET /slurm/v0.0.42/job/{job_id}`.
`.jobs[].exclusive` from `GET /slurm/v0.0.42/jobs`.
`.job.oversubscribe`, `.jobs[].oversubscribe` to `POST /slurm/v0.0.42/job/submit`.
`.jobs[].oversubscribe` from `GET /slurm/v0.0.42/job/{job_id}`.
`.jobs[].oversubscribe` from `GET /slurm/v0.0.42/jobs`.
`DELETE /slurm/v0.0.40/jobs`
`DELETE /slurm/v0.0.41/jobs`
`DELETE /slurm/v0.0.42/jobs`
allocation is granted.
`job|socket|task` or `cpus|mem` per GRES.
node update whereas previously only single nodes could be
updated through `/node/<nodename>` endpoint:
`POST /slurm/v0.0.42/nodes`
partition as this is a cluster-wide option.
`REQUEST_NODE_INFO RPC`.
the db server is not reachable.
(`.jobs[].priority_by_partition`) to JSON and YAML output.
connection` error if the error was the result of an
authentication failure.
errors with the `SLURM_PROTOCOL_AUTHENTICATION_ERROR` error
code.
of `Unspecified error` if querying the following endpoints
fails:
`GET /slurm/v0.0.40/diag/`
`GET /slurm/v0.0.41/diag/`
`GET /slurm/v0.0.42/diag/` (forwarded request 1238576 from eeich)
OBS-URL: https://build.opensuse.org/request/show/1238577
OBS-URL: https://build.opensuse.org/package/show/openSUSE:Factory/slurm?expand=0&rev=111
OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=307
- Update to version 24.11
* `slurmctld` - Reject arbitrary distribution jobs that do not
specify a task count.
* Fix backwards compatibility of the `RESPONSE_JOB_INFO RPC`
(used by `squeue`, `scontrol show job`, etc.) with Slurm clients
version 24.05 and below. This was a regression in 24.11.0rc1.
* Do not let `slurmctld`/`slurmd` start if there are more nodes
defined in `slurm.conf` than the maximum supported amount
(64k nodes).
* `slurmctld` - Set job's exit code to 1 when a job fails with
state `JOB_NODE_FAIL`. This fixes `sbatch --wait` not being able
to exit with error code when a job fails for this reason in
some cases.
* Fix certain reservation updates requested from 23.02 clients.
* `slurmrestd` - Fix populating non-required object fields of
objects as `{}` in JSON/YAML instead of `null` causing compiled
OpenAPI clients to reject the response to
`GET /slurm/v0.0.40/jobs` due to validation failure of
`.jobs[].job_resources`.
* Fix issue where older versions of Slurm talking to a 24.11 dbd
could lose step accounting.
* Fix minor memory leaks.
* Fix bad memory reference when `xstrchr` fails to find char.
* Remove duplicate checks for a data structure.
* Fix race condition in `stepmgr` step completion handling.
* `slurm.spec` - add ability to specify patches to apply on the
command line.
* `slurm.spec` - add ability to supply extra version information.
* Fix 24.11 HA issues.
* Fix requeued jobs keeping their priority until the decay thread (forwarded request 1235783 from eeich)
OBS-URL: https://build.opensuse.org/request/show/1235784
OBS-URL: https://build.opensuse.org/package/show/openSUSE:Factory/slurm?expand=0&rev=109
OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=302
- Update to version 24.05.4 & fix for CVE-2024-48936.
* Fix generic int sort functions.
* Fix user look up using possible unrealized uid in the dbd.
* `slurmrestd` - Fix regressions that allowed `slurmrestd` to
be run as SlurmUser when `SlurmUser` was not root.
* `mpi/pmix` - Fix race conditions with het jobs at step start/end
which could cause `srun` to hang.
* Fix not showing some `SelectTypeParameters` in `scontrol show
config`.
* Avoid assert when dumping certain removed fields in JSON/YAML.
* Improve how shards are scheduled with affinity in mind.
* Fix `MaxJobsAccruePU` not being respected when `MaxJobsAccruePA`
is set in the same QOS.
* Prevent backfill from planning jobs that use overlapping
resources for the same time slot if the job's time limit is
less than `bf_resolution`.
* Fix memory leak when requesting typed gres and
`--[cpus|mem]-per-gpu`.
* Prevent backfill from breaking out due to "system state
changed" every 30 seconds if reservations use `REPLACE` or
`REPLACE_DOWN` flags.
* `slurmrestd` - Make sure that the `scheduler_unset` parameter
defaults to true even when the following flags are also set:
`show_duplicates`, `skip_steps`, `disable_truncate_usage_time`,
`run_away_jobs`, `whole_hetjob`, `disable_whole_hetjob`,
`disable_wait_for_result`, `usage_time_as_submit_time`,
`show_batch_script`, and/or `show_job_environment`. Additionally,
always make sure `show_duplicates` and
`disable_truncate_usage_time` default to true when the following
flags are also set: `scheduler_unset`, `scheduled_on_submit`, (forwarded request 1220075 from eeich)
OBS-URL: https://build.opensuse.org/request/show/1220076
OBS-URL: https://build.opensuse.org/package/show/openSUSE:Factory/slurm?expand=0&rev=108
OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=300
- Update to version 24.05.3
* `data_parser/v0.0.40` - Added field descriptions.
* `slurmrestd` - Avoid creating new slurmdbd connection per request
to `* /slurm/slurmctld/*/*` endpoints.
* Fix compilation issue with `switch/hpe_slingshot` plugin.
* Fix gres per task allocation with threads-per-core.
* `data_parser/v0.0.41` - Added field descriptions.
* `slurmrestd` - Change back generated OpenAPI schema for
`DELETE /slurm/v0.0.40/jobs/` to `RequestBody` instead of using
parameters for request. `slurmrestd` will continue to accept
endpoint requests via `RequestBody` or HTTP query.
* `topology/tree` - Fix issues with switch distance optimization.
* Fix potential segfault of secondary `slurmctld` when falling back
to the primary when running with a `JobComp` plugin.
* Enable `--json`/`--yaml=v0.0.39` options on client commands to
dump data using `data_parser/v0.0.39` instead of outputting nothing.
* `switch/hpe_slingshot` - Fix issue that could result in a 0 length
state file.
* Fix unnecessary message protocol downgrade for unregistered nodes.
* Fix unnecessarily packing alias addrs when terminating jobs with
a mix of non-cloud/dynamic nodes and powered down cloud/dynamic
nodes.
* `accounting_storage/mysql` - Fix issue when deleting a qos that
could remove too many commas from the qos and/or delta_qos fields
of the assoc table.
* `slurmctld` - Fix memory leak when using RestrictedCoresPerGPU.
* Fix allowing access to reservations without `MaxStartDelay` set.
* Fix regression introduced in 24.05.0rc1 breaking
`srun --send-libs` parsing.
* Fix slurmd vsize memory leak when using job submission/allocation
OBS-URL: https://build.opensuse.org/request/show/1208086
OBS-URL: https://build.opensuse.org/package/show/openSUSE:Factory/slurm?expand=0&rev=106
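The `accounting_storage/mysql` entry above concerns removing one QOS from a comma-delimited field without eating neighbouring delimiters. The sketch below shows one safe way to do this in general: split, filter, and re-join. The `,1,2,3,` layout with a leading and trailing comma, and the helper name `remove_id`, are assumptions for illustration; the changelog does not describe the actual column format or the SQL used by the plugin.

```python
def remove_id(delimited, target):
    """Remove `target` from a string such as ',1,2,3,' (assumed layout).

    Splitting on the delimiter avoids the pitfalls of substring
    replacement, which can match part of a longer id or strip a comma
    that belongs to a neighbouring entry.
    """
    kept = [tok for tok in delimited.split(",") if tok and tok != target]
    if not kept:
        return ""
    return "," + ",".join(kept) + ","
```

Because whole tokens are compared, removing `2` from `,12,2,` leaves `,12,` intact.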
OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=295
- IMPORTANT NOTES:
If using the slurmdbd (Slurm DataBase Daemon) you must update
this first. NOTE: If using a backup DBD you must start the
primary first to do any database conversion, the backup will not
start until this has happened. The 24.05 slurmdbd will work
with Slurm daemons of version 23.02 and above. You will not
need to update all clusters at the same time, but it is very
important to update slurmdbd first and have it running before
updating any other clusters making use of it.
- HIGHLIGHTS
* Federation - allow client command operation when slurmdbd is
unavailable.
* burst_buffer/lua - Added two new hooks: slurm_bb_test_data_in
and slurm_bb_test_data_out. The syntax and use of the new hooks
are documented in etc/burst_buffer.lua.example. These are
required to exist. slurmctld now checks on startup if the
burst_buffer.lua script loads and contains all required hooks;
slurmctld will exit with a fatal error if this is not
successful. Added PollInterval to burst_buffer.conf. Removed
the arbitrary limit of 512 copies of the script running
simultaneously.
* Add QOS limit MaxTRESRunMinsPerAccount.
* Add QOS limit MaxTRESRunMinsPerUser.
* Add ELIGIBLE environment variable to jobcomp/script plugin.
* Always use the QOS name for SLURM_JOB_QOS environment variables.
Previously the batch environment would use the description field,
which was usually equivalent to the name.
* cgroup/v2 - Require dbus-1 version >= 1.11.16.
* Allow NodeSet names to be used in SuspendExcNodes.
OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=294
- removed Keep-logs-of-skipped-test-when-running-test-cases-sequentially.patch
as incorporated upstream
* Changes in Slurm 23.02.5
* Add the JobId to debug() messages indicating when cpus_per_task/mem_per_cpu
or pn_min_cpus are being automatically adjusted.
* Fix regression in 23.02.2 that caused slurmctld -R to crash on startup if
a node features plugin is configured.
* Fix and prevent reoccurring reservations from overlapping.
* job_container/tmpfs - Avoid attempts to share BasePath between nodes.
* Change the log message warning for rate limited users from verbose to info.
* With CR_Cpu_Memory, fix node selection for jobs that request gres and
*-mem-per-cpu.
* Fix a regression from 22.05.7 in which some jobs were allocated too few
nodes, thus overcommitting cpus to some tasks.
* Fix a job being stuck in the completing state if the job ends while the
primary controller is down or unresponsive and the backup controller has
not yet taken over.
* Fix slurmctld segfault when a node registers with a configured CpuSpecList
while slurmctld configuration has the node without CpuSpecList.
* Fix cloud nodes getting stuck in POWERED_DOWN+NO_RESPOND state after not
registering by ResumeTimeout.
* slurmstepd - Avoid cleanup of config.json-less containers spooldir getting
skipped.
* slurmstepd - Cleanup per task generated environment for containers in
spooldir.
* Fix scontrol segfault when 'completing' command requested repeatedly in
interactive mode.
* Properly handle a race condition between bind() and listen() calls in the
network stack when running with SrunPortRange set.
* Federation - Fix revoked jobs being returned regardless of the -a/--all
OBS-URL: https://build.opensuse.org/request/show/1161499
OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=292
- Update to version 23.11.03
* `slurmrestd` - Reject single HTTP query with multiple path
requests.
* Fix launching Singularity v4.x containers with
`srun --container` by setting `.process.terminal` to true in
generated `config.json` when step has pseudoterminal (`--pty`)
requested.
* Fix loading in `dynamic/cloud` node jobs after `net_cred`
expired.
* Fix cgroup null path error on `slurmd/slurmstepd` tear down.
* `data_parser/v0.0.40` - Prevent failure if accounting is
disabled, instead issue a warning if needed data from the
database can not be retrieved.
* `openapi/slurmctld` - Prevent failure if accounting is disabled.
* Prevent `slurmscriptd` processing delays from blocking other
threads in `slurmctld` while trying to launch various scripts.
This is additional work for a fix in 23.02.6.
* Fix memory leak when receiving alias addrs from controller.
* `scontrol` - Accept `scontrol token lifespan=infinite` to
create tokens that effectively do not expire.
* Avoid errors when Slurmdb accounting is disabled and `--json` or
`--yaml` is invoked with CLI commands and `slurmrestd`. Add
warnings instead of errors when a query would have populated
data from Slurmdb.
* Fix `slurmctld` memory leak when running job with
`--tres-per-task=gres:shard:#`
* Fix backfill trying to start jobs outside of backfill window.
* Fix oversubscription on partitions with `PreemptMode=OFF`.
* Preserve node reason on power up if the node is downed
or drained. (forwarded request 1150524 from eeich)
OBS-URL: https://build.opensuse.org/request/show/1151965
OBS-URL: https://build.opensuse.org/package/show/openSUSE:Factory/slurm?expand=0&rev=104
OBS-URL: https://build.opensuse.org/request/show/1150524
OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=289
- Update to 23.11.1 with the following major improvements and
fixes for CVE-2023-49933, CVE-2023-49934, CVE-2023-49935,
CVE-2023-49936 and CVE-2023-49937:
* Substantially overhauled the SlurmDBD association management
code. For clusters updated to 23.11, account and user
additions or removals are significantly faster than in prior
releases.
* Overhauled `scontrol reconfigure` to prevent configuration
mistakes from disabling slurmctld and slurmd. Instead, an
error will be returned, and the running configuration will
persist. This does require updates to the systemd service
files to use the `--systemd` option to `slurmctld` and `slurmd`.
* Added a new internal `auth/cred` plugin - `auth/slurm`. This
builds off the prior `auth/jwt` model, and permits operation
of the `slurmdbd` and `slurmctld` without access to full
directory information with a suitable configuration.
* Added a new `--external-launcher` option to `srun`, which is
automatically set by common MPI launcher implementations and
ensures processes using those non-srun launchers have full
access to all resources allocated on each node.
* Reworked the dynamic/cloud modes of operation to allow for
"fanout" - where Slurm communication can be automatically
offloaded to compute nodes for increased cluster scalability.
* Overhauled and extended the Reservation subsystem to allow
for most of the same resource requirements as are placed on
the job. Notably, this permits reservations to now reserve
GRES directly.
- Details of changes:
* Fix `scontrol update job=... TimeLimit+=/-=` when used with a
raw JobId of job array element.
OBS-URL: https://build.opensuse.org/request/show/1141442
OBS-URL: https://build.opensuse.org/package/show/openSUSE:Factory/slurm?expand=0&rev=103
* Reject `TimeLimit` increment/decrement when called on job with
`TimeLimit=UNLIMITED`.
OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=285
- Update to 23.11.1 with the following major improvements and fixes for
CVE-2023-49933, CVE-2023-49934, CVE-2023-49935, CVE-2023-49936 and
CVE-2023-49937:
* Substantially overhauled the SlurmDBD association management code. For
clusters updated to 23.11, account and user additions or removals are
significantly faster than in prior releases.
* Overhauled 'scontrol reconfigure' to prevent configuration mistakes from
disabling slurmctld and slurmd. Instead, an error will be returned, and the
running configuration will persist. This does require updates to the
systemd service files to use the --systemd option to slurmctld and slurmd.
* Added a new internal auth/cred plugin - "auth/slurm". This builds off the
prior auth/jwt model, and permits operation of the slurmdbd and slurmctld
without access to full directory information with a suitable configuration.
* Added a new --external-launcher option to srun, which is automatically set
by common MPI launcher implementations and ensures processes using those
non-srun launchers have full access to all resources allocated on each
node.
* Reworked the dynamic/cloud modes of operation to allow for "fanout" - where
Slurm communication can be automatically offloaded to compute nodes for
increased cluster scalability.
* Added initial official Debian packaging support.
* Overhauled and extended the Reservation subsystem to allow for most of the
same resource requirements as are placed on the job. Notably, this permits
reservations to now reserve GRES directly.
- Details of changes:
* Fix scontrol update job=... TimeLimit+=/-= when used with a raw JobId of job
array element.
* Reject TimeLimit increment/decrement when called on job with
TimeLimit=UNLIMITED.
* Fix issue with requesting a job with *licenses as well as
OBS-URL: https://build.opensuse.org/request/show/1138332
OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=284
- Update to 23.02.6 to fix (CVE-2023-49933 - bsc#1218046, CVE-2023-49935 -
bsc#1218049, CVE-2023-49936 - bsc#1218050, CVE-2023-49937 - bsc#1218051,
CVE-2023-49938 - bsc#1218053)
* Security Fixes:
+ Add `JobAcctGatherParams=DisableGPUAcct` to disable gpu accounting.
+ `acct_gather_energy/ipmi` - Improve logging of DCMI issues.
+ `gpu/oneapi` - Add support for new env vars `ZE_FLAT_DEVICE_HIERARCHY`
and `ZE_ENABLE_PCI_ID_DEVICE_ORDER`.
+ `data_parser/v0.0.39` - skip empty string when parsing QOS ids.
+ Remove error message from `assoc_mgr_update_assocs` when purposefully
resetting the default QOS.
* Bug Fixes:
+ `libslurm_nss` - Avoid causing glibc to assert due to an unexpected
return from slurm_nss due to an error during lookup.
+ Fix job requests with `--tres-per-task` sometimes resulting in bad
allocations that cannot run subsequent job steps.
+ Fix issue with `slurmd` where `srun` fails to be warned when a node
prolog script runs beyond `MsgTimeout` set in `slurm.conf`.
+ `gres/shard` - Fix plugin functions to have matching parameter orders.
+ `gpu/nvml` - Fix issue that resulted in the wrong MIG devices being
constrained to a job
+ `gpu/nvml` - Fix linking issue with MIGs that prevented multiple MIGs
being used in a single job for certain MIG configurations
+ Fix file descriptor leak in slurmd when using `acct_gather_energy/ipmi`
with DCMI devices.
+ `sview` - avoid crash when job has a node list string > 49 characters.
+ Prevent `slurmctld` crash during reconfigure when packing job start
messages.
+ Preserve reason uid on reconfig.
+ Update node reason with updated `INVAL` state reason if different from (forwarded request 1136624 from eeich)
OBS-URL: https://build.opensuse.org/request/show/1137045
OBS-URL: https://build.opensuse.org/package/show/openSUSE:Factory/slurm?expand=0&rev=102
OBS-URL: https://build.opensuse.org/request/show/1136624
OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=282
- Explicitly create an Obsoletes: entry for each package version
that is obsoleted by the present version. These are all published
versions of the last two major releases as well as all minor
versions of the present release lower than the current one
(bsc#1216869 2nd part).
This prevents the current version from upgrading an old Slurm
version for which no upgrade path exists.
OBS-URL: https://build.opensuse.org/request/show/1129638
OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=279
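The `Obsoletes:` entry above describes emitting one explicit entry per published version that the current package replaces. A rough sketch of generating such lines is shown below. The helper name, the package name, and the version list in the usage note are illustrative assumptions; the real spec file may build these entries differently, for example with RPM macros.

```python
def obsoletes_lines(package, published_versions, current):
    """Build one 'Obsoletes:' line per published version older than `current`.

    Versions are assumed to be dotted numeric strings such as '23.02.5'.
    """
    def key(version):
        # Compare numerically so that 22.05.10 sorts after 22.05.9.
        return tuple(int(part) for part in version.split("."))

    older = sorted(
        (v for v in set(published_versions) if key(v) < key(current)),
        key=key,
    )
    return [f"Obsoletes: {package} = {v}" for v in older]
```

For instance, `obsoletes_lines("slurm", ["23.02.5", "22.05.10"], "23.02.6")` yields one line for each of the two older versions, oldest first, and nothing for the current version itself.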
- Update to 23.02.6 to fix CVE-2023-41914.
* Removed Fix-test-32.8.patch as fixed upstream
* Bug Fixes:
+ Fix `CpusPerTres=` not being updatable with `scontrol update`.
+ Fix unintentional gres removal when validating the gres job state.
+ Fix `--without-hpe-slingshot` configure option.
+ Fix cgroup v2 memory calculations when transparent huge pages are used.
+ Fix parsing of `sgather --timeout` option.
+ Fix regression from 22.05.0 that caused `srun --cpu-bind "=verbose"`
and `"=v"` options to give different CPU bind masks.
+ Fix "_find_node_record: lookup failure for node" error message appearing
for all dynamic nodes during reconfigure.
+ Avoid segfault if loading serializer plugin fails.
+ `slurmrestd` - Correct OpenAPI format for `GET /slurm/v0.0.39/licenses`.
+ `slurmrestd` - Correct OpenAPI format for
`GET /slurm/v0.0.39/job/{job_id}`.
+ `slurmrestd` - Change format to multiple fields in
`GET /slurmdb/v0.0.39/associations` and `GET /slurmdb/v0.0.39/qos` to
handle infinite and unset states.
+ When a node fails in a job with `--no-kill`, preserve the extern step on the
remaining nodes to avoid breaking features that rely on the extern step
such as `pam_slurm_adopt`, `x11`, and `job_container/tmpfs`.
+ `auth/jwt` - Ignore `x5c` field in JWKS files.
+ `auth/jwt` - Treat `alg` field as optional in JWKS files.
+ Allow job_desc.selinux_context to be read from the job_submit.lua script.
+ Skip check in slurmstepd that causes a large number of errors in the
munge log: "Unauthorized credential for client UID=0 GID=0".
This error will still appear on `slurmd`/`slurmctld`/`slurmdbd` start up
and is not a cause for concern.
+ `slurmctld` - Allow startup with zero partitions.
OBS-URL: https://build.opensuse.org/request/show/1117163
OBS-URL: https://build.opensuse.org/package/show/openSUSE:Factory/slurm?expand=0&rev=96