* `slurmrestd` - Remove deprecated fields from the following
`.result` from `POST /slurm/v0.0.42/job/submit`.
`.job_id`, `.step_id`, `.job_submit_user_msg` from `POST /slurm/v0.0.42/job/{job_id}`.
`.job.exclusive`, `.jobs[].exclusive` to `POST /slurm/v0.0.42/job/submit`.
`.jobs[].exclusive` from `GET /slurm/v0.0.42/job/{job_id}`.
`.jobs[].exclusive` from `GET /slurm/v0.0.42/jobs`.
`.job.oversubscribe`, `.jobs[].oversubscribe` to `POST /slurm/v0.0.42/job/submit`.
`.jobs[].oversubscribe` from `GET /slurm/v0.0.42/job/{job_id}`.
`.jobs[].oversubscribe` from `GET /slurm/v0.0.42/jobs`.
`DELETE /slurm/v0.0.40/jobs`
`DELETE /slurm/v0.0.41/jobs`
`DELETE /slurm/v0.0.42/jobs`
allocation is granted.
`job|socket|task` or `cpus|mem` per GRES.
node update whereas previously only single nodes could be
updated through `/node/<nodename>` endpoint:
`POST /slurm/v0.0.42/nodes`
partition as this is a cluster-wide option.
`REQUEST_NODE_INFO RPC`.
the db server is not reachable.
(`.jobs[].priority_by_partition`) to JSON and YAML output.
connection` error if the error was the result of an
authentication failure.
errors with the `SLURM_PROTOCOL_AUTHENTICATION_ERROR` error
code.
of `Unspecified error` if querying the following endpoints
fails:
`GET /slurm/v0.0.40/diag/`
`GET /slurm/v0.0.41/diag/`
`GET /slurm/v0.0.42/diag/`
(forwarded request 1238576 from eeich)
OBS-URL: https://build.opensuse.org/request/show/1238577
OBS-URL: https://build.opensuse.org/package/show/openSUSE:Factory/slurm?expand=0&rev=111
OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=307
- Update to version 24.11
* `slurmctld` - Reject arbitrary distribution jobs that do not
specify a task count.
* Fix backwards compatibility of the `RESPONSE_JOB_INFO RPC`
(used by `squeue`, `scontrol show job`, etc.) with Slurm clients
version 24.05 and below. This was a regression in 24.11.0rc1.
* Do not let `slurmctld`/`slurmd` start if there are more nodes
defined in `slurm.conf` than the maximum supported amount
(64k nodes).
* `slurmctld` - Set job's exit code to 1 when a job fails with
state `JOB_NODE_FAIL`. This fixes `sbatch --wait` not being able
to exit with error code when a job fails for this reason in
some cases.
* Fix certain reservation updates requested from 23.02 clients.
* `slurmrestd` - Fix populating non-required object fields of
objects as `{}` in JSON/YAML instead of `null` causing compiled
OpenAPI clients to reject the response to
`GET /slurm/v0.0.40/jobs` due to validation failure of
`.jobs[].job_resources`.
* Fix issue where older versions of Slurm talking to a 24.11 dbd
could lose step accounting.
* Fix minor memory leaks.
* Fix bad memory reference when `xstrchr` fails to find char.
* Remove duplicate checks for a data structure.
* Fix race condition in `stepmgr` step completion handling.
* `slurm.spec` - add ability to specify patches to apply on the
command line.
* `slurm.spec` - add ability to supply extra version information.
* Fix 24.11 HA issues.
* Fix requeued jobs keeping their priority until the decay thread
(forwarded request 1235783 from eeich)
OBS-URL: https://build.opensuse.org/request/show/1235784
OBS-URL: https://build.opensuse.org/package/show/openSUSE:Factory/slurm?expand=0&rev=109
OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=302
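The `slurmrestd` entry above about `.jobs[].job_resources` can be illustrated with a small sketch. It assumes a strict client that treats an optional object as "either `null` or fully populated"; the validator and its required key are hypothetical stand-ins, not the real OpenAPI schema or any generated client.

```python
import json

# Hypothetical required key, for illustration only (not the real schema).
REQUIRED_JOB_RESOURCES_KEYS = ("nodes",)


def validate_job(job: dict) -> None:
    """Raise ValueError if an optional object is present but incomplete."""
    resources = job.get("job_resources")
    if resources is None:
        # null (or absent): the optional field is unset, nothing to check.
        return
    missing = [k for k in REQUIRED_JOB_RESOURCES_KEYS if k not in resources]
    if missing:
        raise ValueError(f"job_resources missing required keys: {missing}")


def response_is_accepted(raw: str) -> bool:
    """Return True if every job in a JSON response passes validation."""
    try:
        for job in json.loads(raw)["jobs"]:
            validate_job(job)
    except ValueError:
        return False
    return True
```

Under these assumptions a response carrying `"job_resources": null` is accepted, while one carrying `"job_resources": {}` is rejected because the empty object lacks its required members.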
- Update to version 24.05.4 & fix for CVE-2024-48936.
* Fix generic int sort functions.
* Fix user look up using possible unrealized uid in the dbd.
* `slurmrestd` - Fix regressions that allowed `slurmrestd` to
be run as SlurmUser when `SlurmUser` was not root.
* `mpi/pmix` - Fix race conditions with het jobs at step start/end
which could make `srun` hang.
* Fix not showing some `SelectTypeParameters` in `scontrol show
config`.
* Avoid assert when dumping certain removed fields in JSON/YAML.
* Improve how shards are scheduled with affinity in mind.
* Fix `MaxJobsAccruePU` not being respected when `MaxJobsAccruePA`
is set in the same QOS.
* Prevent backfill from planning jobs that use overlapping
resources for the same time slot if the job's time limit is
less than `bf_resolution`.
* Fix memory leak when requesting typed gres and
`--[cpus|mem]-per-gpu`.
* Prevent backfill from breaking out due to "system state
changed" every 30 seconds if reservations use `REPLACE` or
`REPLACE_DOWN` flags.
* `slurmrestd` - Make sure that the `scheduler_unset` parameter defaults
to true even when the following flags are also set:
`show_duplicates`, `skip_steps`, `disable_truncate_usage_time`,
`run_away_jobs`, `whole_hetjob`, `disable_whole_hetjob`,
`disable_wait_for_result`, `usage_time_as_submit_time`,
`show_batch_script`, and/or `show_job_environment`. Additionally,
always make sure `show_duplicates` and
`disable_truncate_usage_time` default to true when the following
flags are also set: `scheduler_unset`, `scheduled_on_submit`,
(forwarded request 1220075 from eeich)
OBS-URL: https://build.opensuse.org/request/show/1220076
OBS-URL: https://build.opensuse.org/package/show/openSUSE:Factory/slurm?expand=0&rev=108
OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=300
- Update to version 24.05.3
* `data_parser/v0.0.40` - Added field descriptions.
* `slurmrestd` - Avoid creating new slurmdbd connection per request
to `* /slurm/slurmctld/*/*` endpoints.
* Fix compilation issue with `switch/hpe_slingshot` plugin.
* Fix gres per task allocation with threads-per-core.
* `data_parser/v0.0.41` - Added field descriptions.
* `slurmrestd` - Change back generated OpenAPI schema for
`DELETE /slurm/v0.0.40/jobs/` to `RequestBody` instead of using
parameters for request. `slurmrestd` will continue to accept endpoint
requests via `RequestBody` or HTTP query.
* `topology/tree` - Fix issues with switch distance optimization.
* Fix potential segfault of secondary `slurmctld` when falling back
to the primary when running with a `JobComp` plugin.
* Enable `--json`/`--yaml=v0.0.39` options on client commands to
dump data using `data_parser/v0.0.39` instead of outputting nothing.
* `switch/hpe_slingshot` - Fix issue that could result in a 0 length
state file.
* Fix unnecessary message protocol downgrade for unregistered nodes.
* Fix unnecessarily packing alias addrs when terminating jobs with
a mix of non-cloud/dynamic nodes and powered down cloud/dynamic
nodes.
* `accounting_storage/mysql` - Fix issue when deleting a qos that
could remove too many commas from the qos and/or delta_qos fields
of the assoc table.
* `slurmctld` - Fix memory leak when using RestrictedCoresPerGPU.
* Fix allowing access to reservations without `MaxStartDelay` set.
* Fix regression introduced in 24.05.0rc1 breaking
`srun --send-libs` parsing.
* Fix slurmd vsize memory leak when using job submission/allocation
OBS-URL: https://build.opensuse.org/request/show/1208086
OBS-URL: https://build.opensuse.org/package/show/openSUSE:Factory/slurm?expand=0&rev=106
OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=295
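The `accounting_storage/mysql` entry above mentions comma handling in the `qos` and `delta_qos` fields. As a minimal sketch only, assuming those fields are comma-delimited id lists with leading and trailing commas (for example `,1,12,2,` or `,+3,-12,`), removing one id by splitting on the delimiter avoids touching neighbouring entries. The function name is hypothetical and this is not the plugin's actual SQL.

```python
def remove_qos_id(field: str, qos_id: int) -> str:
    """Drop one id from a comma-delimited field such as ',1,12,2,'.

    Entries may carry a '+' or '-' prefix (assumed delta_qos style).
    """
    if not field:
        return field
    target = str(qos_id)
    kept = [
        item
        for item in field.split(",")
        if item and item.lstrip("+-") != target
    ]
    # Re-wrap the remaining ids in delimiters; an empty list gives "".
    return ("," + ",".join(kept) + ",") if kept else ""
```

Comparing whole tokens rather than substrings keeps id `12` intact when id `1` is removed.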
- IMPORTANT NOTES:
If using the slurmdbd (Slurm DataBase Daemon) you must update
this first. NOTE: If using a backup DBD you must start the
primary first to do any database conversion, the backup will not
start until this has happened. The 24.05 slurmdbd will work
with Slurm daemons of version 23.02 and above. You will not
need to update all clusters at the same time, but it is very
important to update slurmdbd first and have it running before
updating any other clusters making use of it.
- HIGHLIGHTS
* Federation - allow client command operation when slurmdbd is
unavailable.
* burst_buffer/lua - Added two new hooks: slurm_bb_test_data_in
and slurm_bb_test_data_out. The syntax and use of the new hooks
are documented in etc/burst_buffer.lua.example. These are
required to exist. slurmctld now checks on startup if the
burst_buffer.lua script loads and contains all required hooks;
slurmctld will exit with a fatal error if this is not
successful. Added PollInterval to burst_buffer.conf. Removed
the arbitrary limit of 512 copies of the script running
simultaneously.
* Add QOS limit MaxTRESRunMinsPerAccount.
* Add QOS limit MaxTRESRunMinsPerUser.
* Add ELIGIBLE environment variable to jobcomp/script plugin.
* Always use the QOS name for SLURM_JOB_QOS environment variables.
Previously the batch environment would use the description field,
which was usually equivalent to the name.
* cgroup/v2 - Require dbus-1 version >= 1.11.16.
* Allow NodeSet names to be used in SuspendExcNodes.
OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=294
- Removed Keep-logs-of-skipped-test-when-running-test-cases-sequentially.patch
as it was incorporated upstream
* Changes in Slurm 23.02.5
* Add the JobId to debug() messages indicating when cpus_per_task/mem_per_cpu
or pn_min_cpus are being automatically adjusted.
* Fix regression in 23.02.2 that caused slurmctld -R to crash on startup if
a node features plugin is configured.
* Fix and prevent reoccurring reservations from overlapping.
* job_container/tmpfs - Avoid attempts to share BasePath between nodes.
* Change the log message warning for rate limited users from verbose to info.
* With CR_Cpu_Memory, fix node selection for jobs that request gres and
--mem-per-cpu.
* Fix a regression from 22.05.7 in which some jobs were allocated too few
nodes, thus overcommitting cpus to some tasks.
* Fix a job being stuck in the completing state if the job ends while the
primary controller is down or unresponsive and the backup controller has
not yet taken over.
* Fix slurmctld segfault when a node registers with a configured CpuSpecList
while slurmctld configuration has the node without CpuSpecList.
* Fix cloud nodes getting stuck in POWERED_DOWN+NO_RESPOND state after not
registering by ResumeTimeout.
* slurmstepd - Avoid cleanup of config.json-less containers spooldir getting
skipped.
* slurmstepd - Cleanup per task generated environment for containers in
spooldir.
* Fix scontrol segfault when 'completing' command requested repeatedly in
interactive mode.
* Properly handle a race condition between bind() and listen() calls in the
network stack when running with SrunPortRange set.
* Federation - Fix revoked jobs being returned regardless of the -a/--all
OBS-URL: https://build.opensuse.org/request/show/1161499
OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=292
- Update to version 23.11.03
* slurmrestd - Reject single http query with multiple path
requests.
* Fix launching Singularity v4.x containers with
`srun --container` by setting .process.terminal to true in
generated `config.json` when step has pseudoterminal (`--pty`)
requested.
* Fix loading in `dynamic/cloud` node jobs after `net_cred`
expired.
* Fix cgroup null path error on `slurmd/slurmstepd` tear down.
* `data_parser/v0.0.40` - Prevent failure if accounting is
disabled, instead issue a warning if needed data from the
database can not be retrieved.
* `openapi/slurmctld` - Prevent failure if accounting is disabled.
* Prevent `slurmscriptd` processing delays from blocking other
threads in `slurmctld` while trying to launch various scripts.
This is additional work for a fix in 23.02.6.
* Fix memory leak when receiving alias addrs from controller.
* `scontrol` - Accept `scontrol token lifespan=infinite` to
create tokens that effectively do not expire.
* Avoid errors when Slurmdb accounting is disabled and `--json` or
`--yaml` is invoked with CLI commands and `slurmrestd`. Add
warnings instead of errors when a query would have populated
data from Slurmdb.
* Fix `slurmctld` memory leak when running job with
`--tres-per-task=gres:shard:#`
* Fix backfill trying to start jobs outside of backfill window.
* Fix oversubscription on partitions with `PreemptMode=OFF`.
* Preserve node reason on power up if the node is downed
or drained.
(forwarded request 1150524 from eeich)
OBS-URL: https://build.opensuse.org/request/show/1151965
OBS-URL: https://build.opensuse.org/package/show/openSUSE:Factory/slurm?expand=0&rev=104
OBS-URL: https://build.opensuse.org/request/show/1150524
OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=289
- Update to 23.11.1 with the following major improvements and fixing
CVE-2023-49933, CVE-2023-49934, CVE-2023-49935, CVE-2023-49936
and CVE-2023-49937
* Substantially overhauled the SlurmDBD association management
code. For clusters updated to 23.11, account and user
additions or removals are significantly faster than in prior
releases.
* Overhauled `scontrol reconfigure` to prevent configuration
mistakes from disabling slurmctld and slurmd. Instead, an
error will be returned, and the running configuration will
persist. This does require updates to the systemd service
files to use the `--systemd` option to `slurmctld` and `slurmd`.
* Added a new internal `auth/cred` plugin - `auth/slurm`. This
builds off the prior `auth/jwt` model, and permits operation
of the `slurmdbd` and `slurmctld` without access to full
directory information with a suitable configuration.
* Added a new `--external-launcher` option to `srun`, which is
automatically set by common MPI launcher implementations and
ensures processes using those non-srun launchers have full
access to all resources allocated on each node.
* Reworked the dynamic/cloud modes of operation to allow for
"fanout" - where Slurm communication can be automatically
offloaded to compute nodes for increased cluster scalability.
* Overhauled and extended the Reservation subsystem to allow
for most of the same resource requirements as are placed on
the job. Notably, this permits reservations to now reserve
GRES directly.
- Details of changes:
* Fix `scontrol update job=... TimeLimit+=/-=` when used with a
raw JobId of job array element.
OBS-URL: https://build.opensuse.org/request/show/1141442
OBS-URL: https://build.opensuse.org/package/show/openSUSE:Factory/slurm?expand=0&rev=103
and CVE-2023-49937
* Substantially overhauled the SlurmDBD association management
code. For clusters updated to 23.11, account and user
additions or removals are significantly faster than in prior
releases.
* Overhauled `scontrol reconfigure` to prevent configuration
mistakes from disabling slurmctld and slurmd. Instead, an
error will be returned, and the running configuration will
persist. This does require updates to the systemd service
files to use the `--systemd` option to `slurmctld` and `slurmd`.
* Added a new internal `auth/cred` plugin - `auth/slurm`. This
builds off the prior `auth/jwt` model, and permits operation
of the `slurmdbd` and `slurmctld` without access to full
directory information with a suitable configuration.
* Added a new `--external-launcher` option to `srun`, which is
automatically set by common MPI launcher implementations and
ensures processes using those non-srun launchers have full
access to all resources allocated on each node.
* Reworked the dynamic/cloud modes of operation to allow for
"fanout" - where Slurm communication can be automatically
offloaded to compute nodes for increased cluster scalability.
* Overhauled and extended the Reservation subsystem to allow
for most of the same resource requirements as are placed on
the job. Notably, this permits reservations to now reserve
GRES directly.
* Fix `scontrol update job=... TimeLimit+=/-=` when used with a
raw JobId of job array element.
* Reject `TimeLimit` increment/decrement when called on job with
`TimeLimit=UNLIMITED`.
OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=285
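The `TimeLimit` entries above describe relative updates (`TimeLimit+=` / `TimeLimit-=`) being rejected for jobs with `TimeLimit=UNLIMITED`. A minimal sketch of that rule, assuming limits are tracked in minutes and using a hypothetical `UNLIMITED` sentinel (this is not Slurm's internal representation):

```python
# Hypothetical sentinel standing in for TimeLimit=UNLIMITED.
UNLIMITED = None


def adjust_time_limit(current_minutes, delta_minutes: int) -> int:
    """Apply a relative TimeLimit change, rejecting unlimited jobs."""
    if current_minutes is UNLIMITED:
        # There is no finite value to add to or subtract from.
        raise ValueError("cannot increment or decrement TimeLimit=UNLIMITED")
    return current_minutes + delta_minutes
```

With this sketch, `adjust_time_limit(60, 30)` yields a 90 minute limit, while any delta applied to `UNLIMITED` raises an error instead of producing a meaningless value.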
- Update to 23.11.1 with the following major improvements and fixing
CVE-2023-49933, CVE-2023-49934, CVE-2023-49935, CVE-2023-49936 and
CVE-2023-49937
* Substantially overhauled the SlurmDBD association management code. For
clusters updated to 23.11, account and user additions or removals are
significantly faster than in prior releases.
* Overhauled 'scontrol reconfigure' to prevent configuration mistakes from
disabling slurmctld and slurmd. Instead, an error will be returned, and the
running configuration will persist. This does require updates to the
systemd service files to use the --systemd option to slurmctld and slurmd.
* Added a new internal auth/cred plugin - "auth/slurm". This builds off the
prior auth/jwt model, and permits operation of the slurmdbd and slurmctld
without access to full directory information with a suitable configuration.
* Added a new --external-launcher option to srun, which is automatically set
by common MPI launcher implementations and ensures processes using those
non-srun launchers have full access to all resources allocated on each
node.
* Reworked the dynamic/cloud modes of operation to allow for "fanout" - where
Slurm communication can be automatically offloaded to compute nodes for
increased cluster scalability.
* Added initial official Debian packaging support.
* Overhauled and extended the Reservation subsystem to allow for most of the
same resource requirements as are placed on the job. Notably, this permits
reservations to now reserve GRES directly.
- Details of changes:
* Fix scontrol update job=... TimeLimit+=/-= when used with a raw JobId of job
array element.
* Reject TimeLimit increment/decrement when called on job with
TimeLimit=UNLIMITED.
* Fix issue with requesting a job with --licenses as well as
OBS-URL: https://build.opensuse.org/request/show/1138332
OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=284
- Update to 23.02.6 to fix (CVE-2023-49933 - bsc#1218046, CVE-2023-49935 -
bsc#1218049, CVE-2023-49936 - bsc#1218050, CVE-2023-49937 - bsc#1218051,
CVE-2023-49938 - bsc#1218053)
* Security Fixes:
+ Add `JobAcctGatherParams=DisableGPUAcct` to disable gpu accounting.
+ `acct_gather_energy/ipmi` - Improve logging of DCMI issues.
+ `gpu/oneapi` - Add support for new env vars `ZE_FLAT_DEVICE_HIERARCHY`
and `ZE_ENABLE_PCI_ID_DEVICE_ORDER`.
+ `data_parser/v0.0.39` - skip empty string when parsing QOS ids.
+ Remove error message from `assoc_mgr_update_assocs` when purposefully
resetting the default QOS.
* Bug Fixes:
+ `libslurm_nss` - Avoid causing glibc to assert due to an unexpected
return from slurm_nss due to an error during lookup.
+ Fix job requests with `--tres-per-task` sometimes resulting in bad
allocations that cannot run subsequent job steps.
+ Fix issue with `slurmd` where `srun` fails to be warned when a node
prolog script runs beyond `MsgTimeout` set in `slurm.conf`.
+ `gres/shard` - Fix plugin functions to have matching parameter orders.
+ `gpu/nvml` - Fix issue that resulted in the wrong MIG devices being
constrained to a job
+ `gpu/nvml` - Fix linking issue with MIGs that prevented multiple MIGs
being used in a single job for certain MIG configurations
+ Fix file descriptor leak in slurmd when using `acct_gather_energy/ipmi`
with DCMI devices.
+ `sview` - avoid crash when job has a node list string > 49 characters.
+ Prevent `slurmctld` crash during reconfigure when packing job start
messages.
+ Preserve reason uid on reconfig.
+ Update node reason with updated `INVAL` state reason if different from
(forwarded request 1136624 from eeich)
OBS-URL: https://build.opensuse.org/request/show/1137045
OBS-URL: https://build.opensuse.org/package/show/openSUSE:Factory/slurm?expand=0&rev=102
OBS-URL: https://build.opensuse.org/request/show/1136624
OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=282
- Explicitly create an Obsoletes: entry for each package version
that is obsoleted by the present version. These are all published
versions of the last two major releases as well as all minor
versions of the present release lower than the current one
(bsc#1216869 2nd part).
This prevents the current version from upgrading an old Slurm version
for which no upgrade path exists.
OBS-URL: https://build.opensuse.org/request/show/1129638
OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=279
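The selection rule described above (all published versions of the last two major releases, plus lower minor versions of the present release) could be sketched as follows. This is an illustration only: the helper names, the package name, and the exact form of the emitted `Obsoletes:` lines are assumptions and may differ from what the spec file actually contains.

```python
def major(version: str) -> str:
    """'23.02.6' -> '23.02' (Slurm major releases are named YY.MM)."""
    return ".".join(version.split(".")[:2])


def minor(version: str) -> int:
    """'23.02.6' -> 6."""
    return int(version.split(".")[2])


def obsoleted_versions(published, current, previous_majors):
    """Select the published versions the current package should obsolete."""
    selected = []
    for version in published:
        if major(version) in previous_majors:
            # Every published version of the last two major releases.
            selected.append(version)
        elif major(version) == major(current) and minor(version) < minor(current):
            # Lower minor versions of the present release.
            selected.append(version)
    return selected


def obsoletes_lines(package, versions):
    """Render one Obsoletes: line per selected version."""
    return [f"Obsoletes: {package} = {v}" for v in versions]
```

Versions from older major releases are deliberately left out of the list, so the package manager will not treat the new package as a replacement for them.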
- update to 23.02.6 to fix (CVE-2023-41914)
* Removed Fix-test-32.8.patch as fixed upstream
* Bug Fixes:
+ Fix `CpusPerTres=` not updatable with `scontrol update`
+ Fix unintentional gres removal when validating the gres job state.
+ Fix `--without-hpe-slingshot` configure option.
+ Fix cgroup v2 memory calculations when transparent huge pages are used.
+ Fix parsing of `sgather --timeout` option.
+ Fix regression from 22.05.0 that caused `srun --cpu-bind "=verbose"`
and `"=v"` options to give different CPU bind masks.
+ Fix "_find_node_record: lookup failure for node" error message appearing
for all dynamic nodes during reconfigure.
+ Avoid segfault if loading serializer plugin fails.
+ `slurmrestd` - Correct OpenAPI format for `GET /slurm/v0.0.39/licenses`.
+ `slurmrestd` - Correct OpenAPI format for
`GET /slurm/v0.0.39/job/{job_id}`.
+ `slurmrestd` - Change format to multiple fields in
`GET /slurmdb/v0.0.39/associations` and `GET /slurmdb/v0.0.39/qos` to
handle infinite and unset states.
+ When a node fails in a job with `--no-kill`, preserve the extern step on the
remaining nodes to avoid breaking features that rely on the extern step
such as `pam_slurm_adopt`, `x11`, and `job_container/tmpfs`.
+ `auth/jwt` - Ignore `x5c` field in JWKS files.
+ `auth/jwt` - Treat 'alg' field as optional in JWKS files.
+ Allow job_desc.selinux_context to be read from the job_submit.lua script.
+ Skip check in slurmstepd that causes a large number of errors in the
munge log: "Unauthorized credential for client UID=0 GID=0".
This error will still appear on `slurmd`/`slurmctld`/`slurmdbd` start up
and is not a cause for concern.
+ `slurmctld` - Allow startup with zero partitions.
OBS-URL: https://build.opensuse.org/request/show/1117163
OBS-URL: https://build.opensuse.org/package/show/openSUSE:Factory/slurm?expand=0&rev=96
* Bug Fixes:
+ Fix CpusPerTres= not updatable with scontrol update
+ Fix unintentional gres removal when validating the gres job state.
+ Fix --without-hpe-slingshot configure option.
+ Fix cgroup v2 memory calculations when transparent huge pages are used.
+ Fix parsing of sgather --timeout option.
+ Fix regression from 22.05.0 that caused srun --cpu-bind "=verbose" and "=v"
options to give different CPU bind masks.
+ Fix "_find_node_record: lookup failure for node" error message appearing
for all dynamic nodes during reconfigure.
+ Avoid segfault if loading serializer plugin fails.
+ slurmrestd - Correct OpenAPI format for 'GET /slurm/v0.0.39/licenses'.
+ slurmrestd - Correct OpenAPI format for 'GET /slurm/v0.0.39/job/{job_id}'.
+ slurmrestd - Change format to multiple fields in 'GET
/slurmdb/v0.0.39/associations' and 'GET /slurmdb/v0.0.39/qos' to handle
infinite and unset states.
+ When a node fails in a job with --no-kill, preserve the extern step on the
remaining nodes to avoid breaking features that rely on the extern step
such as pam_slurm_adopt, x11, and job_container/tmpfs.
+ auth/jwt - Ignore 'x5c' field in JWKS files.
+ auth/jwt - Treat 'alg' field as optional in JWKS files.
+ Allow job_desc.selinux_context to be read from the job_submit.lua script.
+ Skip check in slurmstepd that causes a large number of errors in the munge
log: "Unauthorized credential for client UID=0 GID=0". This error will
still appear on slurmd/slurmctld/slurmdbd start up and is not a cause for
concern.
+ slurmctld - Allow startup with zero partitions.
+ Fix some mig profile names in slurm not matching nvidia mig profiles.
+ Prevent slurmscriptd processing delays from blocking other threads in
slurmctld while trying to launch {Prolog|Epilog}Slurmctld.
OBS-URL: https://build.opensuse.org/request/show/1117145
OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=268
- Updated to version 23.02.5 with the following changes:
* Bug Fixes:
+ Revert a change in 23.02 where `SLURM_NTASKS` was no longer set in the
job's environment when `--ntasks-per-node` was requested.
The method by which it is set, however, is different and should be more
accurate in more situations.
+ Change pmi2 plugin to honor the `SrunPortRange` option. This matches the
new behavior of the pmix plugin in 23.02.0. Note that neither of these
plugins makes use of the `MpiParams=ports=` option, and previously
were only limited by the system's ephemeral port range.
+ Fix regression in 23.02.2 that caused slurmctld -R to crash on startup if
a node features plugin is configured.
+ Fix and prevent recurring reservations from overlapping.
+ `job_container/tmpfs` - Avoid attempts to share BasePath between nodes.
+ With `CR_Cpu_Memory`, fix node selection for jobs that request gres and
`--mem-per-cpu`.
+ Fix a regression from 22.05.7 in which some jobs were allocated too few
nodes, thus overcommitting cpus to some tasks.
+ Fix a job being stuck in the completing state if the job ends while the
primary controller is down or unresponsive and the backup controller has
not yet taken over.
+ Fix `slurmctld` segfault when a node registers with a configured
`CpuSpecList` while `slurmctld` configuration has the node without
`CpuSpecList`.
+ Fix cloud nodes getting stuck in `POWERED_DOWN+NO_RESPOND` state after
not registering by `ResumeTimeout`.
+ `slurmstepd` - Avoid cleanup of `config.json`-less containers' spooldir
getting skipped.
+ Fix scontrol segfault when 'completing' command requested repeatedly in
interactive mode.
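For job scripts that have to run across releases with and without the `SLURM_NTASKS` revert noted above, a defensive lookup along the following lines may help. This is only a sketch: the helper name `task_count` is hypothetical, and the fallback assumes that `SLURM_NNODES` and `SLURM_NTASKS_PER_NODE` are present in the job environment when `--ntasks-per-node` was requested.

```python
import os


def task_count(env=None):
    """Best-effort task count for the current job (hypothetical helper)."""
    env = os.environ if env is None else env
    # Prefer the value Slurm exports directly.
    if env.get("SLURM_NTASKS"):
        return int(env["SLURM_NTASKS"])
    # Assumed fallback: node count times the requested tasks per node.
    nodes = env.get("SLURM_NNODES")
    per_node = env.get("SLURM_NTASKS_PER_NODE")
    if nodes and per_node:
        return int(nodes) * int(per_node)
    # Not enough information to tell.
    return None
```

Calling `task_count()` without arguments reads the real process environment; passing a dictionary makes the helper easy to exercise outside of a job.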
OBS-URL: https://build.opensuse.org/request/show/1111943
OBS-URL: https://build.opensuse.org/package/show/openSUSE:Factory/slurm?expand=0&rev=95
features with the `|` operator, which could prevent jobs from
+ `node_features/helpers` - Fix inconsistent handling of `&` and `|`,
instead of just the current set. E.g. `foo|bar&baz` was interpreted
`{foo} or {bar,baz}`.
tasks fewer than GPUs, which resulted in incorrectly rejecting these
jobs.
+ `slurmrestd` - For `GET /slurm/v0.0.39/node[s]`, change format of
node's energy field `current_watts` to a dictionary to account for
+ `slurmrestd` - For `GET /slurm/v0.0.39/qos`, change format of QOS's
+ slurmrestd - For `GET /slurm/v0.0.39/job[s]`, the 'return code'
`GET /slurmdb/v0.0.39/jobs` from slurmrestd.
were present in the log: `error: Attempt to change gres/gpu Count`.
+ Hold the job with `(Reservation ... invalid)` state reason if the
OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=265
* Bug Fixes:
+ Revert a change in 23.02 where `SLURM_NTASKS` was no longer set in the
job's environment when `--ntasks-per-node` was requested.
The method by which it is set, however, is different and should be more
accurate in more situations.
+ Change pmi2 plugin to honor the `SrunPortRange` option. This matches the
new behavior of the pmix plugin in 23.02.0. Note that neither of these
plugins makes use of the `MpiParams=ports=` option, and previously
were only limited by the system's ephemeral port range.
+ Fix regression in 23.02.2 that caused slurmctld -R to crash on startup if
a node features plugin is configured.
+ Fix and prevent recurring reservations from overlapping.
+ `job_container/tmpfs` - Avoid attempts to share BasePath between nodes.
+ With `CR_Cpu_Memory`, fix node selection for jobs that request gres and
`--mem-per-cpu`.
+ Fix a regression from 22.05.7 in which some jobs were allocated too few
nodes, thus overcommitting cpus to some tasks.
+ Fix a job being stuck in the completing state if the job ends while the
primary controller is down or unresponsive and the backup controller has
not yet taken over.
+ Fix `slurmctld` segfault when a node registers with a configured
`CpuSpecList` while `slurmctld` configuration has the node without
`CpuSpecList`.
+ Fix cloud nodes getting stuck in `POWERED_DOWN+NO_RESPOND` state after
not registering by `ResumeTimeout`.
+ `slurmstepd` - Avoid cleanup of `config.json`-less containers' spooldir
getting skipped.
+ Fix scontrol segfault when 'completing' command requested repeatedly in
interactive mode.
OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=264
- Updated to 23.02.4 with the following changes:
* Bug Fixes:
+ Fix main scheduler loop not starting after a failover to backup
controller. Avoid slurmctld segfault when specifying
`AccountingStorageExternalHost` (bsc#1214983).
+ Fix sbatch return code when `--wait` is requested on a job array.
+ Fix collected `GPUUtilization` values for `acct_gather_profile` plugins.
+ Fix `slurmrestd` handling of job hold/release operations.
+ Fix step running indefinitely when slurmctld takes more than
`MessageTimeout` to respond. Now, `slurmctld` will cancel the step when
detected, preventing following steps from getting stuck waiting for
resources to be released.
+ Fix regression to make `job_desc.min_cpus` accurate again in `job_submit`
when requesting a job with `--ntasks-per-node`.
+ Fix handling of `ArrayTaskThrottle` in backfill.
+ Fix regression in 23.02.2 when checking gres state on `slurmctld`
startup or reconfigure. Gres changes in the configuration were not
updated on slurmctld startup. On startup or reconfigure, these messages
were present in the log: `error: Attempt to change gres/gpu Count`.
+ Fix potential double count of gres when dealing with limits.
+ Fix `slurmstepd` segfault when `ContainerPath` is not set in `oci.conf`.
+ Fixed an issue where jobs requesting licenses were incorrectly rejected.
+ `scrontab` - Fix cutting off the final character of quoted variables.
+ `smail` - Fix issues where e-mails at job completion were not being sent.
+ `scontrol/slurmctld` - Fix comma parsing when updating a reservation's
nodes.
+ Fix `--gpu-bind=single` binding tasks to wrong GPUs, leading to some GPUs
having more tasks than they should and other GPUs being unused.
+ Fix regression in 23.02 that causes slurmstepd to crash when `srun`
requests more than `TreeWidth` nodes in a step and uses the pmi2 or
OBS-URL: https://build.opensuse.org/request/show/1110259
OBS-URL: https://build.opensuse.org/package/show/openSUSE:Factory/slurm?expand=0&rev=93
* Bug Fixes:
+ Fix main scheduler loop not starting after a failover to backup
controller. Avoid slurmctld segfault when specifying
`AccountingStorageExternalHost` (bsc#1214983).
+ Fix sbatch return code when `--wait` is requested on a job array.
+ Fix collected `GPUUtilization` values for `acct_gather_profile` plugins.
+ Fix `slurmrestd` handling of job hold/release operations.
+ Fix step running indefinitely when slurmctld takes more than
`MessageTimeout` to respond. Now, `slurmctld` will cancel the step when
detected, preventing following steps from getting stuck waiting for
resources to be released.
+ Fix regression to make `job_desc.min_cpus` accurate again in `job_submit`
when requesting a job with `--ntasks-per-node`.
+ Fix handling of `ArrayTaskThrottle` in backfill.
+ Fix regression in 23.02.2 when checking gres state on `slurmctld`
startup or reconfigure. Gres changes in the configuration were not
updated on slurmctld startup. On startup or reconfigure, these messages
were present in the log: `error: Attempt to change gres/gpu Count`.
+ Fix potential double count of gres when dealing with limits.
+ Fix `slurmstepd` segfault when `ContainerPath` is not set in `oci.conf`.
+ Fixed an issue where jobs requesting licenses were incorrectly rejected.
+ `scrontab` - Fix cutting off the final character of quoted variables.
+ `smail` - Fix issues where e-mails at job completion were not being sent.
+ `scontrol/slurmctld` - Fix comma parsing when updating a reservation's
nodes.
+ Fix `--gpu-bind=single` binding tasks to wrong GPUs, leading to some GPUs
having more tasks than they should and other GPUs being unused.
+ Fix regression in 23.02 that causes slurmstepd to crash when `srun`
requests more than `TreeWidth` nodes in a step and uses the pmi2 or
OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=260
- Fixes since 23.02.3:
Highlights:
* Fix main scheduler loop not starting after a failover to backup controller.
* Avoid slurmctld segfault when specifying `AccountingStorageExternalHost`
(bsc#1214983).
Other:
* Fix sbatch return code when `--wait` is requested on a job array.
* Fix collected `GPUUtilization` values for `acct_gather_profile` plugins.
* Fix `slurmrestd` handling of job hold/release operations.
* Make spank `S_JOB_ARGV` item value hold the requested command `argv`
instead of the `srun --bcast` value when `--bcast` requested (only in local
context).
* Fix step running indefinitely when slurmctld takes more than
`MessageTimeout` to respond. Now, slurmctld will cancel the step when
detected, preventing following steps from getting stuck waiting for
resources to be released.
* Fix regression to make `job_desc.min_cpus` accurate again in job_submit when
requesting a job with `--ntasks-per-node`.
* Fix handling of `ArrayTaskThrottle` in backfill.
* Fix regression in 23.02.2 when checking gres state on `slurmctld` startup or
reconfigure. Gres changes in the configuration were not updated on slurmctld
startup. On startup or reconfigure, these messages were present in the log:
`error: Attempt to change gres/gpu Count`.
* Fix potential double count of gres when dealing with limits.
* Fix `slurmstepd` segfault when `ContainerPath` is not set in `oci.conf`.
* Fixed an issue where jobs requesting licenses were incorrectly rejected.
* `scrontab` - Fix cutting off the final character of quoted variables.
* `smail` - Fix issues where e-mails at job completion were not being sent.
* `scontrol/slurmctld` - Fix comma parsing when updating a reservation's
nodes.
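To show why commas in a node list need care (the last item above), here is a small Python sketch that splits a node expression only on commas outside square brackets. It is an illustration under assumptions, not the parser used by `scontrol` or `slurmctld`; the function name `split_node_list` and the sample node names are hypothetical.

```python
def split_node_list(expr):
    """Split a node expression on top-level commas only (hypothetical helper)."""
    parts, current, depth = [], [], 0
    for ch in expr:
        # Track bracket nesting so commas inside ranges are kept together.
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    if current:
        parts.append("".join(current))
    return parts


nodes = split_node_list("node[1-3,7],gpu01,gpu[02,04]")
```

Here the commas inside `node[1-3,7]` and `gpu[02,04]` stay within their bracketed groups, while the two top-level commas separate the three entries.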
OBS-URL: https://build.opensuse.org/request/show/1109308
OBS-URL: https://build.opensuse.org/package/show/openSUSE:Factory/slurm?expand=0&rev=92