- Update to version 24.05.3

  * `data_parser/v0.0.40` - Added field descriptions.
  * `slurmrestd` - Avoid creating new slurmdbd connection per request
    to `* /slurm/slurmctld/*/*` endpoints.
  * Fix compilation issue with `switch/hpe_slingshot` plugin.
  * Fix gres per task allocation with threads-per-core.
  * `data_parser/v0.0.41` - Added field descriptions.
  * `slurmrestd` - Change back generated OpenAPI schema for
    `DELETE /slurm/v0.0.40/jobs/` to `RequestBody` instead of using
    parameters for request. `slurmrestd` will continue to accept
    endpoint requests via `RequestBody` or HTTP query.
  * `topology/tree` - Fix issues with switch distance optimization.
  * Fix potential segfault of secondary `slurmctld` when falling back
    to the primary when running with a `JobComp` plugin.
  * Enable `--json`/`--yaml=v0.0.39` options on client commands to
    dump data using `data_parser/v0.0.39` instead of outputting nothing.
  * `switch/hpe_slingshot` - Fix issue that could result in a 0 length
    state file.
  * Fix unnecessary message protocol downgrade for unregistered nodes.
  * Fix unnecessarily packing alias addrs when terminating jobs with
    a mix of non-cloud/dynamic nodes and powered down cloud/dynamic
    nodes.
  * `accounting_storage/mysql` - Fix issue when deleting a qos that
    could remove too many commas from the qos and/or delta_qos fields
    of the assoc table.
  * `slurmctld` - Fix memory leak when using RestrictedCoresPerGPU.
  * Fix allowing access to reservations without `MaxStartDelay` set.
  * Fix regression introduced in 24.05.0rc1 breaking
    `srun --send-libs` parsing.
  * Fix slurmd vsize memory leak when using job submission/allocation
    commands that implicitly or explicitly use `--get-user-env`.

OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=295
Egbert Eich 2024-10-15 06:51:09 +00:00 committed by Git OBS Bridge
parent fc209e050f
commit b2f6e848a1
4 changed files with 629 additions and 302 deletions


@ -1,3 +0,0 @@
version https://git-lfs.github.com/spec/v1
oid sha256:a6d3e95f2bbda3c9567060efc3d7090ad8eac257fa3578798c89321957946e49
size 7117445

3
slurm-24.05.3.tar.bz2 Normal file

@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:b0b40513e9b6ae867ddb95d60b950bcb980c15b735b5d0dea37a9a00cc64ae24
size 7189600


@ -1,8 +1,275 @@
-------------------------------------------------------------------
Mon Oct 14 10:40:10 UTC 2024 - Egbert Eich <eich@suse.com>
- Update to version 24.05.3
* `data_parser/v0.0.40` - Added field descriptions.
* `slurmrestd` - Avoid creating new slurmdbd connection per request
to `* /slurm/slurmctld/*/*` endpoints.
* Fix compilation issue with `switch/hpe_slingshot` plugin.
* Fix gres per task allocation with threads-per-core.
* `data_parser/v0.0.41` - Added field descriptions.
* `slurmrestd` - Change back generated OpenAPI schema for
`DELETE /slurm/v0.0.40/jobs/` to `RequestBody` instead of using
parameters for request. `slurmrestd` will continue to accept
endpoint requests via `RequestBody` or HTTP query.
* `topology/tree` - Fix issues with switch distance optimization.
* Fix potential segfault of secondary `slurmctld` when falling back
to the primary when running with a `JobComp` plugin.
* Enable `--json`/`--yaml=v0.0.39` options on client commands to
dump data using `data_parser/v0.0.39` instead of outputting nothing.
* `switch/hpe_slingshot` - Fix issue that could result in a 0 length
state file.
* Fix unnecessary message protocol downgrade for unregistered nodes.
* Fix unnecessarily packing alias addrs when terminating jobs with
a mix of non-cloud/dynamic nodes and powered down cloud/dynamic
nodes.
* `accounting_storage/mysql` - Fix issue when deleting a qos that
could remove too many commas from the qos and/or delta_qos fields
of the assoc table.
* `slurmctld` - Fix memory leak when using RestrictedCoresPerGPU.
* Fix allowing access to reservations without `MaxStartDelay` set.
* Fix regression introduced in 24.05.0rc1 breaking
`srun --send-libs` parsing.
* Fix slurmd vsize memory leak when using job submission/allocation
commands that implicitly or explicitly use --get-user-env.
* `slurmd` - Fix node going into invalid state when using
`CPUSpecList` and setting CPUs to the number of cores on a
multithreaded node.
* Fix reboot asap nodes being considered in backfill after a restart.
* Fix `--clusters`/`-M` queries for clusters outside of a
federation when `fed_display` is configured.
* Fix `scontrol` allowing updating job with bad cpus-per-task value.
* `sattach` - Fix regression from 24.05.2 security fix leading to
crash.
* `mpi/pmix` - Fix assertion when built under `--enable-debug`.
- Changes in Slurm 24.05.2
* Fix energy gathering rpc counter underflow in
`_rpc_acct_gather_energy` when more than 10 threads try to get
energy at the same time. The underflow prevented any step from
getting energy from slurmd until slurmd was restarted, so energy
accounting metrics for the node were lost.
* `accounting_storage/mysql` - Fix issue where new user with `wckey`
did not have a default wckey sent to the slurmctld.
* `slurmrestd` - Prevent slurmrestd segfault when handling the
following endpoints when none of the optional parameters are
specified:
`DELETE /slurm/v0.0.40/jobs`
`DELETE /slurm/v0.0.41/jobs`
`GET /slurm/v0.0.40/shares`
`GET /slurm/v0.0.41/shares`
`GET /slurmdb/v0.0.40/instance`
`GET /slurmdb/v0.0.41/instance`
`GET /slurmdb/v0.0.40/instances`
`GET /slurmdb/v0.0.41/instances`
`POST /slurm/v0.0.40/job/{job_id}`
`POST /slurm/v0.0.41/job/{job_id}`
* Fix IPMI energy gathering when no IPMIPowerSensors are specified
in `acct_gather.conf`. This situation resulted in an accounted
energy of 0 for job steps.
* Fix a minor memory leak in slurmctld when updating a job dependency.
* `scontrol`, `squeue` - Fix regression that caused incorrect values
for multisocket nodes at `.jobs[].job_resources.nodes.allocation`
for `scontrol show jobs --(json|yaml)` and `squeue --(json|yaml)`.
* `slurmrestd` - Fix regression that caused incorrect values for
multisocket nodes at `.jobs[].job_resources.nodes.allocation` to
be dumped with endpoints:
`GET /slurm/v0.0.41/job/{job_id}`
`GET /slurm/v0.0.41/jobs`
* `jobcomp/filetxt` - Fix truncation of job record lines > 1024
characters.
* `switch/hpe_slingshot` - Drain node on failure to delete CXI
services.
* Fix a performance regression from 23.11.0 in cpu frequency
handling when no `CpuFreqDef` is defined.
* Fix one-task-per-sharing not working across multiple nodes.
* Fix inconsistent number of cpus when creating a reservation
using the TRESPerNode option.
* `data_parser/v0.0.40+` - Fix job state parsing which could
break filtering.
* Prevent `cpus-per-task` from being modified in jobs where a `-c`
value has been explicitly specified and the requested memory
constraints implicitly increase the number of CPUs to allocate.
* `slurmrestd` - Fix regression where args `-s v0.0.39,dbv0.0.39`
and `-d v0.0.39` would result in `GET /openapi/v3` not
registering as a valid possible query resulting in 404 errors.
* `slurmrestd` - Fix memory leak for dbv0.0.39 jobs query which
occurred if the query parameters specified account, association,
cluster, constraints, format, groups, job_name, partition, qos,
reason, reservation, state, users, or wckey. This affects the
following endpoints:
`GET /slurmdb/v0.0.39/jobs`
* `slurmrestd` - In the case the slurmdbd does not respond to a
persistent connection init message, prevent the closed fd from
being used, and instead emit an error or warning depending on
if the connection was required.
* Fix 24.05.0 regression that caused the slurmdbd not to send back
an error message if there is an error initializing a persistent
connection.
* Reduce latency of forwarded x11 packets.
* Add `curr_dependency` (representing the current dependency of
the job) and `orig_dependency` (representing the original
requested dependency of the job) fields to the job record in
`job_submit.lua` (for job update) and `jobcomp.lua`.
* Fix potential segfault of slurmctld configured with
`SlurmctldParameters=enable_rpc_queue` from happening on
reconfigure.
* Fix potential segfault of slurmctld on its shutdown when rate
limiting is enabled.
* `slurmrestd` - Fix missing job environment for `SLURM_JOB_NAME`,
`SLURM_OPEN_MODE`, `SLURM_JOB_DEPENDENCY`, `SLURM_PROFILE`,
`SLURM_ACCTG_FREQ`, `SLURM_NETWORK` and `SLURM_CPU_FREQ_REQ` to
match sbatch.
* Fix GRES environment variable indices being incorrect when only
using a subset of all GPUs on a node and the
`--gres-flags=allow-task-sharing` option.
* Prevent `scontrol` from segfaulting when requesting
`scontrol show reservation --json` or `--yaml` if there is an error
retrieving reservations from the `slurmctld`.
* `switch/hpe_slingshot` - Fix security issue around managing VNI
access. CVE-2024-42511.
* `switch/nvidia_imex` - Fix security issue managing IMEX channel
access. CVE-2024-42511.
* `switch/nvidia_imex` - Allow for compatibility with
`job_container/tmpfs`.
- Changes in Slurm 24.05.1
* Fix `slurmctld` and `slurmdbd` potentially stopping instead of
performing a logrotate when receiving `SIGUSR2` when using
`auth/slurm`.
* `switch/hpe_slingshot` - Fix slurmctld crash when upgrading
from 23.02.
* Fix "Could not find group" errors from `validate_group()` when
using `AllowGroups` with large `/etc/group` files.
* Add `AccountingStoreFlags=no_stdio` which, when set, prevents
the stdio paths of the job from being recorded.
* `slurmrestd` - Prevent a slurmrestd segfault when parsing the
`crontab` field, which was never usable. Now it explicitly
ignores the value and emits a warning if it is used for the
following endpoints:
`POST /slurm/v0.0.39/job/{job_id}`
`POST /slurm/v0.0.39/job/submit`
`POST /slurm/v0.0.40/job/{job_id}`
`POST /slurm/v0.0.40/job/submit`
`POST /slurm/v0.0.41/job/{job_id}`
`POST /slurm/v0.0.41/job/submit`
`POST /slurm/v0.0.41/job/allocate`
* `mpi/pmi2` - Fix communication issue leading to task launch
failure with "`invalid kvs seq from node`".
* Fix getting user environment when using sbatch with
`--get-user-env` or `--export=` when there is a user profile
script that reads `/proc`.
* Prevent slurmd from crashing if `acct_gather_energy/gpu` is
configured but `GresTypes` is not configured.
* Do not log the following errors when `AcctGatherEnergyType`
plugins are used but a node does not have or cannot find sensors:
"`error: _get_joules_task: can't get info from slurmd`"
"`error: slurm_get_node_energy: Zero Bytes were transmitted or
received`"
However, the following error will continue to be logged:
"`error: Can't get energy data. No power sensors are available.
Try later`"
* `sbatch`, `srun` - Set `SLURM_NETWORK` environment variable if
`--network` is set.
* Fix cloud nodes not being able to forward to nodes that restarted
with new IP addresses.
* Fix cwd not being set correctly when running a SPANK plugin with a
`spank_user_init()` hook and the new "`contain_spank`" option set.
* `slurmctld` - Avoid deadlock during shutdown when `auth/slurm`
is active.
* Fix segfault in `slurmctld` with `topology/block`.
* `sacct` - Fix printing of job group for job steps.
* `scrun` - Log when an invalid environment variable causes the
job submission to be rejected.
* `accounting_storage/mysql` - Fix problem where listing or
modifying an association when specifying a qos list could hang
or take a very long time.
* `gpu/nvml` - Fix `gpuutil/gpumem` only tracking last GPU in step.
Now, `gpuutil/gpumem` will record sums of all GPUs in the step.
* Fix error in `scrontab` jobs when using
`slurm.conf:PropagatePrioProcess=1`.
* Fix `slurmctld` crash on a batch job submission with
`--nodes 0,...`.
* Fix dynamic IP address fanout forwarding when using `auth/slurm`.
* Restrict listening sockets in the `mpi/pmix` plugin and `sattach`
to the `SrunPortRange`.
* `slurmrestd` - Limit mime types returned from query to
`GET /openapi/v3` to only return one mime type per serializer
plugin to fix issues with OpenAPI client generators that are
unable to handle multiple mime type aliases.
* Fix many commands possibly reporting an "`Unexpected Message
Received`" when in reality the connection timed out.
* Prevent slurmctld from starting if there is not a json
serializer present and the `extra_constraints` feature is enabled.
* Fix heterogeneous job components not being signaled with
`scancel --ctld` and `DELETE /slurm/v0.0.40/jobs` if the job ids
are not explicitly given, the heterogeneous job components match
the given filters, and the heterogeneous job leader does not
match the given filters.
* Fix regression from 23.02 impeding job licenses from being cleared.
* Move to `log_flag` the `_get_joules_task` error that was
logged to the user when too many rpcs were queued in slurmd
for gathering energy.
* For `scancel --ctld` and the associated rest api endpoints:
`DELETE /slurm/v0.0.40/jobs`
`DELETE /slurm/v0.0.41/jobs`
Fix canceling the final array task in a job array when the task
is pending and all array tasks have been split into separate job
records. Previously this task was not canceled.
* Fix `power_save` operation after recovering from a failed
reconfigure.
* `slurmctld` - Skip removing the pidfile when running under
systemd. In that situation it is never created in the first place.
* Fix issue where, after altering the flags on a Slurm account
(`UsersAreCoords`), several limits on the account's association
would be set to 0 in Slurm's internal cache.
* Fix memory leak in the controller when relaying `stepmgr` step
accounting to the dbd.
* Fix segfault when submitting stepmgr jobs within an existing
allocation.
* Added `disable_slurm_hydra_bootstrap` as a possible `MpiParams`
parameter in `slurm.conf`. Using this will disable env variable
injection to allocations for the following variables:
`I_MPI_HYDRA_BOOTSTRAP`, `I_MPI_HYDRA_BOOTSTRAP_EXEC_EXTRA_ARGS`,
`HYDRA_BOOTSTRAP`, `HYDRA_LAUNCHER_EXTRA_ARGS`.
* `scrun` - Delay shutdown until after start requested.
Previously `scrun` could fail to start or shut down and hang
forever when using `--tty`.
* Fix backup `slurmctld` potentially not running the agent when
taking over as the primary controller.
* Fix primary controller not running the agent when a reconfigure
of the `slurmctld` fails.
* `slurmd` - Fix premature timeout waiting for
`REQUEST_LAUNCH_PROLOG` with large array jobs causing node to
drain.
* `jobcomp/{elasticsearch,kafka}` - Avoid sending fields with
invalid date/time.
* `jobcomp/elasticsearch` - Fix `slurmctld` memory leak from
curl usage.
* `acct_gather_profile/influxdb` - Fix slurmstepd memory leak from
curl usage.
* Fix 24.05.0 regression not deleting job hash dirs after
`MinJobAge`.
* Fix filtering arguments being ignored when using `squeue --json`.
* `switch/nvidia_imex` - Move setup call after `spank_init()` to
allow namespace manipulation within the SPANK plugin.
* `switch/nvidia_imex` - Skip plugin operation if
`nvidia-caps-imex-channels` device is not present rather than
preventing slurmd from starting.
* `switch/nvidia_imex` - Skip plugin operation if
`job_container/tmpfs` is configured due to incompatibility.
* `switch/nvidia_imex` - Remove any pre-existing channels when
`slurmd` starts.
* `rpc_queue` - Add support for an optional `rpc_queue.yaml`
configuration file.
* `slurmrestd` - Add new `+prefer_refs` flag to `data_parser/v0.0.41`
plugin. This flag will avoid inlining single referenced schemas
in the OpenAPI schema.
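
The 24.05.2 entries above refer to the JSON path `.jobs[].job_resources.nodes.allocation` in the output of `squeue --json` and `scontrol show jobs --json`. As a minimal sketch of how a script might follow that path when checking the values after an upgrade, the Python below walks a parsed document. The sample document is hypothetical and heavily trimmed: only the keys named by the path come from the changelog, while `job_id`, `name`, and all values are invented for illustration.

```python
import json

# Hypothetical, trimmed stand-in for `squeue --json` output. Only the keys
# along .jobs[].job_resources.nodes.allocation are taken from the changelog;
# every value (and the shape of each allocation item) is an assumption.
SAMPLE = """
{
  "jobs": [
    {"job_id": 101, "job_resources": {"nodes": {"allocation": [{"name": "n1"}]}}},
    {"job_id": 102, "job_resources": {"nodes": {"allocation": [{"name": "n2"}, {"name": "n3"}]}}},
    {"job_id": 103}
  ]
}
"""


def node_allocations(document):
    """Map job_id to the value found at job_resources.nodes.allocation.

    Jobs that lack any key along the path are skipped.
    """
    result = {}
    for job in document.get("jobs", []):
        node = job
        for key in ("job_resources", "nodes", "allocation"):
            # Stop descending as soon as an intermediate value is not an object.
            node = node.get(key) if isinstance(node, dict) else None
        if node is not None:
            result[job.get("job_id")] = node
    return result


allocations = node_allocations(json.loads(SAMPLE))
for job_id, allocation in allocations.items():
    print(job_id, [entry["name"] for entry in allocation])
```

In practice the document would come from the command output rather than an embedded string; the path walk itself stays the same.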
-------------------------------------------------------------------
Tue Jun 4 09:36:54 UTC 2024 - Christian Goll <cgoll@suse.com>
- Updated to new release 24.05.0 with the following major changes
* Important Notes:
If using the slurmdbd (Slurm DataBase Daemon) you must update
this first. NOTE: If using a backup DBD you must start the
primary first to do any database conversion, the backup will not
@ -11,302 +278,360 @@ Tue Jun 4 09:36:54 UTC 2024 - Christian Goll <cgoll@suse.com>
need to update all clusters at the same time, but it is very
important to update slurmdbd first and have it running before
updating any other clusters making use of it.
* Highlights
+ Federation - Allow client command operation when slurmdbd is
unavailable.
+ `burst_buffer/lua` - Added two new hooks: `slurm_bb_test_data_in`
and `slurm_bb_test_data_out`. The syntax and use of the new hooks
are documented in `etc/burst_buffer.lua.example`. These are
required to exist. `slurmctld` now checks on startup if the
`burst_buffer.lua` script loads and contains all required hooks;
`slurmctld` will exit with a fatal error if this is not
successful. Added `PollInterval` to `burst_buffer.conf`. Removed
the arbitrary limit of 512 copies of the script running
simultaneously.
+ Add QOS limit `MaxTRESRunMinsPerAccount`.
+ Add QOS limit `MaxTRESRunMinsPerUser`.
+ Add `ELIGIBLE` environment variable to `jobcomp/script` plugin.
+ Always use the QOS name for `SLURM_JOB_QOS` environment variables.
Previously the batch environment would use the description field,
which was usually equivalent to the name.
+ `cgroup/v2` - Require dbus-1 version >= 1.11.16.
+ Allow `NodeSet` names to be used in `SuspendExcNodes`.
+ `SuspendExcNodes=<nodes>:N` now counts allocated nodes in `N`.
The first `N` powered up nodes in `<nodes>` are protected from
being suspended.
+ Store job output, input and error paths in `SlurmDBD`.
+ Add `USER_DELETE` reservation flag to allow users with access
to a reservation to delete it.
+ Add `SlurmctldParameters=enable_stepmgr` to enable step
management through the `slurmstepd` instead of the controller.
+ Added `PrologFlags=RunInJob` to make prolog and epilog run
inside the job extern step to include it in the job's cgroup.
+ Add ability to reserve MPI ports at the job level for stepmgr
jobs and subdivide them at the step level.
+ `slurmrestd` - Add `--generate-openapi-spec` argument.
* Configuration File Changes (see appropriate man page for details)
+ `CoreSpecPlugin` has been removed.
+ Removed `TopologyPlugin` tree and dragonfly support from
`select/linear`. If those topology plugins are desired please
switch to `select/cons_tres`.
+ Changed the default value for `UnkillableStepTimeout` to 60
seconds or five times the value of `MessageTimeout`, whichever
is greater.
+ An error log has been added if `JobAcctGatherParams` '`UsePss`'
or '`NoShare`' are configured with a plugin other than
`jobacct_gather/linux`. In such case these parameters are ignored.
+ `helpers.conf` - Added `Flags=rebootless` parameter allowing
feature changes without rebooting compute nodes.
+ `topology/block` - Replaced the `BlockLevels` with `BlockSizes`
in `topology.conf`.
+ Add `contain_spank` option to `SlurmdParameters`. When set,
`spank_user_init()`, `spank_task_post_fork()`, and
`spank_task_exit()` will execute within the
`job_container/tmpfs` plugin namespace.
+ Add `SlurmctldParameters=max_powered_nodes=N`, which prevents
powering up nodes after the max is reached.
+ Add `ExclusiveTopo` to a partition definition in `slurm.conf`.
+ Add `AccountingStorageParameters=max_step_records` to limit how
many steps are recorded in the database for each job - excluding
batch.
* Command Changes (see man pages for details)
+ Add support for "elevenses" as an additional time specification.
+ Add support for `sbcast --preserve` when `job_container/tmpfs`
configured (previously documented as unsupported).
+ `scontrol` - Add new subcommand `power` for node power control.
+ `squeue` - Adjust `StdErr`, `StdOut`, and `StdIn` output formats.
These will now consistently print "`(null)`" if a value is
unavailable. `StdErr` will no longer display `StdOut` if it is
not distinctly set. `StdOut` will now correctly display the
default filename pattern for job arrays, and no longer show it
for non-batch jobs. However, the expansion patterns will
no longer be substituted by default.
+ Add `--segment` to job allocation to be used in `topology/block`.
+ Add `--exclusive=topo` for use with `topology/block`.
+ `squeue` - Add `--expand-patterns` option to expand `StdErr`,
`StdOut`, `StdIn` filename patterns as best as possible.
+ `sacct` - Add `--expand-patterns` option to expand `StdErr`,
`StdOut`, `StdIn` filename patterns as best as possible.
+ `sreport` - Requesting `format=Planned` will now return the
expected `Planned` time as documented, instead of `PlannedDown`.
To request `Planned Down`, one must now use `format=PLNDDown`
or `format=PlannedDown` explicitly. The abbreviations
"`Pl`" or "`Pla`" will now make reference to `Planned` instead
of `PlannedDown`.
* API Changes
+ Removed `ListIterator` type from `<slurm/slurm.h>`.
+ Removed `slurm_xlate_job_id()` from `<slurm/slurm.h>`.
* SLURMRESTD Changes
+ `openapi/dbv0.0.38` and `openapi/v0.0.38` plugins have been
removed.
+ `openapi/dbv0.0.39` and `openapi/v0.0.39` plugins have been
tagged as deprecated to warn of their removal in the next release.
+ Changed `slurmrestd.service` to only listen on TCP socket by
default. Environments with existing drop-in units for the
service may need further adjustments to work after upgrading.
+ `slurmrestd` - Tagged `script` field as deprecated in
`POST /slurm/v0.0.41/job/submit` in anticipation of removal in
future OpenAPI plugin versions. Job submissions should set the
`job.script` (or `jobs[0].script` for HetJobs) fields instead.
+ `slurmrestd` - Attempt to automatically convert enumerated
string arrays with incoming non-string values into strings.
Add warning when incoming value for enumerated string arrays
cannot be converted to string and silently ignore it instead of
rejecting the entire request. This change affects any endpoint
that uses an enumerated string as given in the OpenAPI
specification. An example of this conversion would be to
`POST /slurm/v0.0.41/job/submit` with `.job.exclusive = true`.
While the JSON (boolean) true value matches a possible
enumeration, it is not the expected "true" string. This change
automatically converts the (boolean) `true` to (string) "`true`"
avoiding a parsing failure.
+ `slurmrestd` - Add `POST /slurm/v0.0.41/job/allocate` endpoint.
This endpoint will create a new job allocation without any steps.
The allocation will need to be ended via signaling the job or
it will run to the timelimit.
+ `slurmrestd` - Allow startup when `slurmdbd` is not configured
and avoid loading `slurmdbd` specific plugins.
* MPI/PMI2 Changes
+ Jobs submitted with the `SLURM_HOSTFILE` environment variable
set imply using an arbitrary distribution. Nevertheless, the
logic used in PMI2 when generating their associated
`PMI_process_mapping` values has been changed and will now be
the same used for the plane distribution, as if `-m plane` were
used. This has been changed because the original arbitrary
distribution implementation did not account for multiple
instances of the same host being present in `SLURM_HOSTFILE`,
providing an incorrect process mapping in such case. This
change also enables distributing tasks in blocks when using
arbitrary distribution, which was not the case before. This
only affects the `mpi/pmi2` plugin.
* Removed Fix-test-21.41.patch as upstream test changed.
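
The 24.05.0 `slurmrestd` note above describes converting non-string values in enumerated string arrays to strings, and warning about (then ignoring) values that cannot be converted. The Python below is only a sketch of that described behaviour, not the `slurmrestd` implementation. The function name `coerce_to_strings` is hypothetical, and passing the single `.job.exclusive = true` value as a one-element list is an assumption made for the example.

```python
import warnings


def coerce_to_strings(values):
    """Convert scalar JSON values to strings; warn about and skip the rest."""
    converted = []
    for value in values:
        if isinstance(value, str):
            converted.append(value)
        elif isinstance(value, bool):
            # Check bool before int: in Python, bool is a subclass of int.
            converted.append("true" if value else "false")
        elif isinstance(value, (int, float)):
            converted.append(str(value))
        else:
            # Objects, arrays and null have no obvious string form here, so
            # the value is dropped instead of failing the whole request.
            warnings.warn(f"ignoring value that cannot be converted to a string: {value!r}")
    return converted


# The changelog's example: a JSON boolean true where the string "true" is expected.
print(coerce_to_strings([True]))  # ['true']
```

Checking the converted strings against the allowed enumeration values would be a separate step and is left out of the sketch.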
------------------------------------------------------------------- -------------------------------------------------------------------
Mon Mar 25 15:16:44 UTC 2024 - Christian Goll <cgoll@suse.com> Mon Mar 25 15:16:44 UTC 2024 - Christian Goll <cgoll@suse.com>
- removed Keep-logs-of-skipped-test-when-running-test-cases-sequentially.patch - removed Keep-logs-of-skipped-test-when-running-test-cases-sequentially.patch
as incoperated upstream as incoperated upstream
* Changes in Slurm 23.02.5 - Changes in Slurm 23.02.5
* Add the JobId to debug() messages indicating when cpus_per_task/mem_per_cpu * Add the `JobId` to `debug()` messages indicating when
or pn_min_cpus are being automatically adjusted. `cpus_per_task/mem_per_cpu` or `pn_min_cpus` are being
* Fix regression in 23.02.2 that caused slurmctld -R to crash on startup if automatically adjusted.
a node features plugin is configured. * Fix regression in 23.02.2 that caused `slurmctld -R` to crash on
startup if a node features plugin is configured.
* Fix and prevent reoccurring reservations from overlapping. * Fix and prevent reoccurring reservations from overlapping.
* job_container/tmpfs - Avoid attempts to share BasePath between nodes. * `job_container/tmpfs` - Avoid attempts to share `BasePath`
* Change the log message warning for rate limited users from verbose to info. between nodes.
* With CR_Cpu_Memory, fix node selection for jobs that request gres and * Change the log message warning for rate limited users from
*-mem-per-cpu. verbose to info.
* Fix a regression from 22.05.7 in which some jobs were allocated too few * With `CR_Cpu_Memory`, fix node selection for jobs that request
nodes, thus overcommitting cpus to some tasks. gres and `--mem-per-cpu`.
* Fix a job being stuck in the completing state if the job ends while the * Fix a regression from 22.05.7 in which some jobs were allocated
primary controller is down or unresponsive and the backup controller has too few nodes, thus overcommitting cpus to some tasks.
not yet taken over. * Fix a job being stuck in the completing state if the job ends
* Fix slurmctld segfault when a node registers with a configured CpuSpecList while the primary controller is down or unresponsive and the
while slurmctld configuration has the node without CpuSpecList. backup controller has not yet taken over.
* Fix cloud nodes getting stuck in POWERED_DOWN+NO_RESPOND state after not * Fix `slurmctld` segfault when a node registers with a configured
registering by ResumeTimeout. `CpuSpecList` while slurmctld configuration has the node without
* slurmstepd - Avoid cleanup of config.json-less containers spooldir getting `CpuSpecList`.
skipped. * Fix cloud nodes getting stuck in `POWERED_DOWN+NO_RESPOND` state
* slurmstepd - Cleanup per task generated environment for containers in after not registering by `ResumeTimeout`.
spooldir. * `slurmstepd` - Avoid cleanup of `config.json`-less containers
* Fix scontrol segfault when 'completing' command requested repeatedly in spooldir getting skipped.
interactive mode. * `slurmstepd` - Cleanup per task generated environment for
* Properly handle a race condition between bind() and listen() calls in the containers in spooldir.
network stack when running with SrunPortRange set. * Fix `scontrol` segfault when 'completing' command requested
* Federation - Fix revoked jobs being returned regardless of the -a/--all repeatedly in interactive mode.
option for privileged users. * Properly handle a race condition between `bind()` and `listen()`
* Federation - Fix canceling pending federated jobs from non-origin clusters calls in the network stack when running with `SrunPortRange` set.
which could leave federated jobs orphaned from the origin cluster. * Federation - Fix revoked jobs being returned regardless of the
* Fix sinfo segfault when printing multiple clusters with --noheader option. `-a`/`--all` option for privileged users.
* Federation - fix clusters not syncing if clusters are added to a federation * Federation - Fix canceling pending federated jobs from non-origin
before they have registered with the dbd. clusters which could leave federated jobs orphaned from the origin
* Change pmi2 plugin to honor the SrunPortRange option. This matches the new cluster.
behavior of the pmix plugin in 23.02.0. Note that neither of these plugins * Fix sinfo segfault when printing multiple clusters with
makes use of the "MpiParams=ports=" option, and previously were only limited `--noheader` option.
by the systems ephemeral port range. * Federation - fix clusters not syncing if clusters are added to
* node_features/helpers - Fix node selection for jobs requesting changeable a federation before they have registered with the dbd.
features with the '|' operator, which could prevent jobs from running on * Change `pmi2` plugin to honor the `SrunPortRange` option. This
some valid nodes. matches the new behavior of the pmix plugin in 23.02.0. Note that
* node_features/helpers - Fix inconsistent handling of '&' and '|', where an neither of these plugins makes use of the "`MpiParams=ports=`"
AND'd feature was sometimes AND'd to all sets of features instead of just option, and previously were only limited by the systems ephemeral
the current set. E.g. "foo|bar&baz" was interpreted as {foo,baz} or port range.
{bar,baz} instead of how it is documented: "{foo} or {bar,baz}". * `node_features/helpers` - Fix node selection for jobs requesting
* Fix job accounting so that when a job is requeued its allocated node count changeable features with the '`|`' operator, which could prevent
is cleared. After the requeue, sacct will correctly show that the job has jobs from running on some valid nodes.
0 AllocNodes while it is pending or if it is canceled before restarting. * `node_features/helpers` - Fix inconsistent handling of '`&`' and
* sacct - AllocCPUS now correctly shows 0 if a job has not yet received an '`|`', where an AND'd feature was sometimes AND'd to all sets of
allocation or if the job was canceled before getting one. features instead of just the current set. E.g. "`foo|bar&baz`" was
* Fix intel oneapi autodetect: detect the /dev/dri/renderD[0-9]+ gpus, and do interpreted as `{foo,baz}` or `{bar,baz}` instead of how it is
not detect /dev/dri/card[0*9]+. documented: "`{foo} or {bar,baz}`".
* Format batch, extern, interactive, and pending step ids into strings that * Fix job accounting so that when a job is requeued its allocated
are human readable. node count is cleared. After the requeue, sacct will correctly
* Fix node selection for jobs that request --gpus and a number of tasks fewer show that the job has 0 `AllocNodes` while it is pending or if
than gpus, which resulted in incorrectly rejecting these jobs. it is canceled before restarting.
* Remove MYSQL_OPT_RECONNECT completely. * `sacct` - `AllocCPUS` now correctly shows 0 if a job has not yet
* Fix cloud nodes in POWERING_UP state disappearing (getting set to FUTURE) received an allocation or if the job was canceled before getting
when an `scontrol reconfigure` happens. one.
* openapi/dbv0.0.39 - Avoid assert / segfault on missing coordinators list. * Fix intel oneapi autodetect: detect the `/dev/dri/renderD[0-9]+`
* slurmrestd - Correct memory leak while parsing OpenAPI specification gpus, and do not detect `/dev/dri/card[0-9]+`.
templates with server overrides. * Format batch, extern, interactive, and pending step ids into
* slurmrestd - Reduce memory usage when printing out job CPU frequency. strings that are human readable.
* Fix node selection for jobs that request `--gpus` and a number
of tasks fewer than gpus, which resulted in incorrectly rejecting
these jobs.
* Remove `MYSQL_OPT_RECONNECT` completely.
* Fix cloud nodes in `POWERING_UP` state disappearing (getting set
to `FUTURE`) when an `scontrol reconfigure` happens.
* `openapi/dbv0.0.39` - Avoid assert / segfault on missing
coordinators list.
* `slurmrestd` - Correct memory leak while parsing OpenAPI
specification templates with server overrides.
* `slurmrestd` - Reduce memory usage when printing out job CPU
frequency.
* Fix overwriting user node reason with system message. * Fix overwriting user node reason with system message.
* Remove --uid / --gid options from salloc and srun commands. * Remove `--uid` / `--gid` options from salloc and srun commands.
* Prevent deadlock when rpc_queue is enabled. * Prevent deadlock when rpc_queue is enabled.
* slurmrestd - Correct OpenAPI specification generation bug where fields with * `slurmrestd` - Correct OpenAPI specification generation bug where
overlapping parent paths would not get generated. fields with overlapping parent paths would not get generated.
* Fix memory leak as a result of a partition info query. * Fix memory leak as a result of a partition info query.
* Fix memory leak as a result of a job info query. * Fix memory leak as a result of a job info query.
* slurmrestd - For 'GET /slurm/v0.0.39/node[s]', change format of node's * slurmrestd - For `GET /slurm/v0.0.39/node[s]`, change format of
energy field "current_watts" to a dictionary to account for unset value node's energy field `current_watts` to a dictionary to account
instead of dumping 4294967294. for unset value instead of dumping `4294967294`.
* slurmrestd - For 'GET /slurm/v0.0.39/qos', change format of QOS's * `slurmrestd` - For `GET /slurm/v0.0.39/qos`, change format of
field "priority" to a dictionary to account for unset value instead of QOS's field `priority` to a dictionary to account for unset
dumping 4294967294. value instead of dumping `4294967294`.
* slurmrestd - For 'GET /slurm/v0.0.39/job[s]', the 'return code' code field * `slurmrestd` - For `GET /slurm/v0.0.39/job[s]`, the `return code`
in v0.0.39_job_exit_code will be set to *127 instead of being left unset code field in `v0.0.39_job_exit_code` will be set to 127 instead
where job does not have a relevant return code. of being left unset where job does not have a relevant return code.
* data_parser/v0.0.39 - Add required/memory_per_cpu and * `data_parser/v0.0.39` - Add `required/memory_per_cpu` and
required/memory_per_node to `sacct *-json` and `sacct --yaml` and required/memory_per_node to `sacct --json` and `sacct --yaml` and
'GET /slurmdb/v0.0.39/jobs' from slurmrestd. `GET /slurmdb/v0.0.39/jobs` from `slurmrestd`.
* For step allocations, fix --gres=none sometimes not ignoring gres from the * For step allocations, fix `--gres=none` sometimes not ignoring
job. gres from the job.
* Fix --exclusive jobs incorrectly gang-scheduling where they shouldn't. * Fix `--exclusive` jobs incorrectly gang-scheduling where they
* Fix allocations with CR_SOCKET, gres not assigned to a specific socket, and shouldn't.
block core distribion potentially allocating more sockets than required. * Fix allocations with `CR_SOCKET`, gres not assigned to a specific
* gpu/oneapi - Store cores correctly so CPU affinity is tracked. socket, and block core distribion potentially allocating more
* Revert a change in 23.02.3 where Slurm would kill a script's process group sockets than required.
as soon as the script ended instead of waiting as long as any process in * `gpu/oneapi` - Store cores correctly so CPU affinity is tracked.
that process group held the stdout/stderr file descriptors open. That change * Revert a change in 23.02.3 where Slurm would kill a script's
broke some scripts that relied on the previous behavior. Setting time limits process group as soon as the script ended instead of waiting as
for scripts (such as PrologEpilogTimeout) is strongly encouraged to avoid long as any process in
Slurm waiting indefinitely for scripts to finish. that process group held the stdout/stderr file descriptors open.
That change broke some scripts that relied on the previous
behavior. Setting time limits for scripts (such as
`PrologEpilogTimeout`) is strongly encouraged to avoid Slurm
waiting indefinitely for scripts to finish.
* Allow slurmdbd -R to work if the root assoc id is not 1. * Allow slurmdbd -R to work if the root assoc id is not 1.
* Fix slurmdbd -R not returning an error under certain conditions. * Fix `slurmdbd -R` not returning an error under certain conditions.
* slurmdbd - Avoid potential NULL pointer dereference in the mysql plugin. * `slurmdbd` - Avoid potential NULL pointer dereference in the
* Revert a change in 23.02 where SLURM_NTASKS was no longer set in the job's mysql plugin.
environment when *-ntasks-per-node was requested. * Revert a change in 23.02 where `SLURM_NTASKS` was no longer
* Limit periodic node registrations to 50 instead of the full TreeWidth. set in the job's environment when `--ntasks-per-node` was
Since unresolvable cloud/dynamic nodes must disable fanout by setting requested.
TreeWidth to a large number, this would cause all nodes to register at * Limit periodic node registrations to 50 instead of the full
once. `TreeWidth`.
* Fix regression in 23.02.3 which broken x11 forwarding for hosts when Since unresolvable `cloud/dynamic` nodes must disable fanout by
MUNGE sends a localhost address in the encode host field. This is caused setting `TreeWidth` to a large number, this would cause all nodes
when the node hostname is mapped to 127.0.0.1 (or similar) in /etc/hosts. to register at once.
* openapi/[db]v0.0.39 - fix memory leak on parsing error. * Fix regression in 23.02.3 which broken x11 forwarding for hosts
* data_parser/v0.0.39 - fix updating qos for associations. when `MUNGE` sends a localhost address in the encode host field.
* openapi/dbv0.0.39 - fix updating values for associations with null users. This is caused when the node hostname is mapped to 127.0.0.1
* Fix minor memory leak with --tres-per-task and licenses. (or similar) in `/etc/hosts`.
* `openapi/[db]v0.0.39` - fix memory leak on parsing error.
* `data_parser/v0.0.39` - fix updating qos for associations.
* `openapi/dbv0.0.39` - fix updating values for associations with
null users.
* Fix minor memory leak with `--tres-per-task` and licenses.
* Fix cyclic socket cpu distribution for tasks in a step where * Fix cyclic socket cpu distribution for tasks in a step where
--cpus-per-task < usable threads per core. `--cpus-per-task` < usable threads per core.
- Changes in Slurm 23.02.4 - Changes in Slurm 23.02.4
* Fix sbatch return code when **wait is requested on a job array. * Fix `sbatch` return code when --wait is requested on a job array.
* switch/hpe_slingshot * avoid segfault when running with old libcxi. * `switch/hpe_slingshot` - avoid segfault when running with old
* Avoid slurmctld segfault when specifying AccountingStorageExternalHost. libcxi.
* Fix collected GPUUtilization values for acct_gather_profile plugins. * Avoid slurmctld segfault when specifying
`AccountingStorageExternalHost`.
* Fix collected `GPUUtilization` values for `acct_gather_profile`
plugins.
* Fix slurmrestd handling of job hold/release operations. * Fix slurmrestd handling of job hold/release operations.
* Make spank S_JOB_ARGV item value hold the requested command argv instead of * Make spank `S_JOB_ARGV` item value hold the requested command
the srun **bcast value when **bcast requested (only in local context). argv instead of the srun `--bcast` value when `--bcast` requested
* Fix step running indefinitely when slurmctld takes more than MessageTimeout (only in local context).
to respond. Now, slurmctld will cancel the step when detected, preventing * Fix step running indefinitely when slurmctld takes more than
following steps from getting stuck waiting for resources to be released. `MessageTimeout` to respond. Now, `slurmctld` will cancel the
* Fix regression to make job_desc.min_cpus accurate again in job_submit when step when detected, preventing following steps from getting stuck
requesting a job with **ntasks*per*node. waiting for resources to be released.
* scontrol * Permit changes to StdErr and StdIn for pending jobs. * Fix regression to make job_desc.min_cpus accurate again in
* scontrol * Reset std{err,in,out} when set to empty string. job_submit when requesting a job with `--ntasks-per-node`.
* slurmrestd * mark environment as a required field for job submission * `scontrol` - Permit changes to `StdErr` and `StdIn` for pending
descriptions. jobs.
* slurmrestd * avoid dumping null in OpenAPI schema required fields. * `scontrol` - Reset std{err,in,out} when set to empty string.
* data_parser/v0.0.39 * avoid rejecting valid memory_per_node formatted as * `slurmrestd` - mark environment as a required field for job
dictionary provided with a job description. submission descriptions.
* data_parser/v0.0.39 * avoid rejecting valid memory_per_cpu formatted as * `slurmrestd` - avoid dumping null in OpenAPI schema required
dictionary provided with a job description. fields.
* slurmrestd * Return HTTP error code 404 when job query fails. * `data_parser/v0.0.39` - avoid rejecting valid `memory_per_node`
* slurmrestd * Add return schema to error response to job and license query. formatted as dictionary provided with a job description.
* `data_parser/v0.0.39` - avoid rejecting valid `memory_per_cpu`
formatted as dictionary provided with a job description.
* `slurmrestd` - Return HTTP error code 404 when job query fails.
* `slurmrestd` - Add return schema to error response to job and
license query.
* Fix handling of ArrayTaskThrottle in backfill. * Fix handling of ArrayTaskThrottle in backfill.
* Fix regression in 23.02.2 when checking gres state on slurmctld startup or * Fix regression in 23.02.2 when checking gres state on `slurmctld`
reconfigure. Gres changes in the configuration were not updated on slurmctld startup or reconfigure. Gres changes in the configuration were
startup. On startup or reconfigure, these messages were present in the log: not updated on `slurmctld` startup. On startup or reconfigure,
"error: Attempt to change gres/gpu Count". these messages were present in the log:
"`error: Attempt to change gres/gpu Count`".
* Fix potential double count of gres when dealing with limits. * Fix potential double count of gres when dealing with limits.
* switch/hpe_slingshot * support alternate traffic class names with "TC_" * `switch/hpe_slingshot` - support alternate traffic class names
prefix. with "`TC_`" prefix.
* scrontab * Fix cutting off the final character of quoted variables. * `scrontab` - Fix cutting off the final character of quoted
* Fix slurmstepd segfault when ContainerPath is not set in oci.conf variables.
* Change the log message warning for rate limited users from debug to verbose. * Fix `slurmstepd` segfault when `ContainerPath` is not set in
* Fixed an issue where jobs requesting licenses were incorrectly rejected. `oci.conf`.
* smail * Fix issues where e*mails at job completion were not being sent. * Change the log message warning for rate limited users from
* scontrol/slurmctld * fix comma parsing when updating a reservation's nodes. debug to verbose.
* cgroup/v2 * Avoid capturing log output for ebpf when constraining devices, * Fixed an issue where jobs requesting licenses were incorrectly
as this can lead to inadvertent failure if the log buffer is too small. rejected.
* Fix **gpu*bind=single binding tasks to wrong gpus, leading to some gpus * `smail` - Fix issues where emails at job completion were not
having more tasks than they should and other gpus being unused. being sent.
* Fix main scheduler loop not starting after failover to backup controller. * `scontrol/slurmctld` - fix comma parsing when updating a
* Added error message when attempting to use sattach on batch or extern steps. reservation's nodes.
* Fix regression in 23.02 that causes slurmstepd to crash when srun requests * `cgroup/v2` - Avoid capturing log output for ebpf when
more than TreeWidth nodes in a step and uses the pmi2 or pmix plugin. constraining devices, as this can lead to inadvertent failure
* Reject job ArrayTaskThrottle update requests from unprivileged users. if the log buffer is too small.
* data_parser/v0.0.39 * populate description fields of property objects in * Fix --gpu-bind=single binding tasks to wrong gpus, leading to
generated OpenAPI specifications where defined. some gpus having more tasks than they should and other gpus being
* slurmstepd * Avoid segfault caused by ContainerPath not being terminated by unused.
'/' in oci.conf. * Fix main scheduler loop not starting after failover to backup
* data_parser/v0.0.39 * Change v0.0.39_job_info response to tag exit_code controller.
field as being complex instead of only an unsigned integer. * Added error message when attempting to use sattach on batch or
* job_container/tmpfs * Fix %h and %n substitution in BasePath where %h was extern steps.
substituted as the NodeName instead of the hostname, and %n was substituted * Fix regression in 23.02 that causes slurmstepd to crash when
as an empty string. `srun` requests more than `TreeWidth` nodes in a step and uses
* Fix regression where **cpu*bind=verbose would override TaskPluginParam. the `pmi2` or `pmix` plugin.
* scancel * Fix **clusters/*M for federations. Only filtered jobs (e.g. *A, * Reject job `ArrayTaskThrottle` update requests from unprivileged
*u, *p, etc.) from the specified clusters will be canceled, rather than all users.
jobs in the federation. Specific jobids will still be routed to the origin * `data_parser/v0.0.39` - populate description fields of property
cluster for cancellation. objects in generated OpenAPI specifications where defined.
* `slurmstepd` - Avoid segfault caused by ContainerPath not being
terminated by '`/`' in `oci.conf`.
* `data_parser/v0.0.39` - Change `v0.0.39_job_info` response to tag
`exit_code` field as being complex instead of only an unsigned
integer.
* `job_container/tmpfs` - Fix %h and %n substitution in `BasePath`
where `%h` was substituted as the `NodeName` instead of the
hostname, and `%n` was substituted as an empty string.
* Fix regression where --cpu-bind=verbose would override
`TaskPluginParam`.
* `scancel` - Fix `--clusters`/`-M` for federations. Only filtered
jobs (e.g. -A, -u, -p, etc.) from the specified clusters will be
canceled, rather than all jobs in the federation.
Specific jobids will still be routed to the origin cluster
for cancellation.
------------------------------------------------------------------- -------------------------------------------------------------------
Mon Jan 29 13:47:55 UTC 2024 - Egbert Eich <eich@suse.com> Mon Jan 29 13:47:55 UTC 2024 - Egbert Eich <eich@suse.com>
@ -2337,7 +2662,6 @@ Fri Jul 2 08:01:32 UTC 2021 - Christian Goll <cgoll@suse.com>
- Updated to 20.11.8: - Updated to 20.11.8:
* slurmctld - fix erroneous "StepId=CORRUPT" messages in error logs. * slurmctld - fix erroneous "StepId=CORRUPT" messages in error logs.
* Correct the error given when auth plugin fails to pack a credential. * Correct the error given when auth plugin fails to pack a credential.
* Fix unused-variable compiler warning on FreeBSD in fd_resolve_path().
* acct_gather_filesystem/lustre - only emit collection error once per step. * acct_gather_filesystem/lustre - only emit collection error once per step.
* Add GRES environment variables (e.g., CUDA_VISIBLE_DEVICES) into the * Add GRES environment variables (e.g., CUDA_VISIBLE_DEVICES) into the
interactive step, the same as is done for the batch step. interactive step, the same as is done for the batch step.

@ -19,7 +19,7 @@
# Check file META in sources: update so_version to (API_CURRENT - API_AGE) # Check file META in sources: update so_version to (API_CURRENT - API_AGE)
%define so_version 41 %define so_version 41
# Make sure to update `upgrades` as well! # Make sure to update `upgrades` as well!
%define ver 24.05.0 %define ver 24.05.3
%define _ver _24_05 %define _ver _24_05
%define dl_ver %{ver} %define dl_ver %{ver}
# so-version is 0 and seems to be stable # so-version is 0 and seems to be stable
@ -59,6 +59,9 @@ ExclusiveArch: do_not_build
%if 0%{?sle_version} == 150500 || 0%{?sle_version} == 150600 %if 0%{?sle_version} == 150500 || 0%{?sle_version} == 150600
%define base_ver 2302 %define base_ver 2302
%endif %endif
%if 0%{?sle_version} == 150500 || 0%{?sle_version} == 150600
%define base_ver 2302
%endif
%define ver_m %{lua:x=string.gsub(rpm.expand("%ver"),"%.[^%.]*$","");print(x)} %define ver_m %{lua:x=string.gsub(rpm.expand("%ver"),"%.[^%.]*$","");print(x)}
# Keep format_spec_file from botching the define below: # Keep format_spec_file from botching the define below: