- Update to version 24.05.3
* `data_parser/v0.0.40` - Added field descriptions.
* `slurmrestd` - Avoid creating new slurmdbd connection per request to `* /slurm/slurmctld/*/*` endpoints.
* Fix compilation issue with `switch/hpe_slingshot` plugin.
* Fix gres per task allocation with threads-per-core.
* `data_parser/v0.0.41` - Added field descriptions.
* `slurmrestd` - Change back generated OpenAPI schema for `DELETE /slurm/v0.0.40/jobs/` to `RequestBody` instead of using parameters for request. `slurmrestd` will continue to accept endpoint requests via `RequestBody` or HTTP query.
* `topology/tree` - Fix issues with switch distance optimization.
* Fix potential segfault of secondary `slurmctld` when falling back to the primary when running with a `JobComp` plugin.
* Enable `--json`/`--yaml=v0.0.39` options on client commands to dump data using data_parser/v0.0.39 instead of outputting nothing.
* `switch/hpe_slingshot` - Fix issue that could result in a 0 length state file.
* Fix unnecessary message protocol downgrade for unregistered nodes.
* Fix unnecessarily packing alias addrs when terminating jobs with a mix of non-cloud/dynamic nodes and powered down cloud/dynamic nodes.
* `accounting_storage/mysql` - Fix issue when deleting a qos that could remove too many commas from the qos and/or delta_qos fields of the assoc table.
* `slurmctld` - Fix memory leak when using RestrictedCoresPerGPU.
* Fix allowing access to reservations without `MaxStartDelay` set.
* Fix regression introduced in 24.05.0rc1 breaking `srun --send-libs` parsing.
* Fix slurmd vsize memory leak when using job submission/allocation

OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=295
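As a rough illustration of the two request styles that the `slurmrestd` entry mentions for `DELETE /slurm/v0.0.40/jobs/`, the sketch below builds one request with the filters in the HTTP query and one with the filters in a JSON request body. It only constructs the requests and never sends them. The base URL, the `job_name` filter key, and the helper names are assumptions made for illustration and are not taken from the changelog or from the Slurm OpenAPI schema; authentication headers are omitted.

```python
import json
import urllib.parse
import urllib.request

# Assumed values for illustration only; adjust to the local slurmrestd setup.
BASE_URL = "http://localhost:6820"
# Endpoint path as written in the changelog entry.
ENDPOINT = "/slurm/v0.0.40/jobs/"


def delete_via_query(filters: dict) -> urllib.request.Request:
    """Build a DELETE request that carries the filters in the HTTP query string."""
    query = urllib.parse.urlencode(filters)
    return urllib.request.Request(f"{BASE_URL}{ENDPOINT}?{query}", method="DELETE")


def delete_via_body(filters: dict) -> urllib.request.Request:
    """Build a DELETE request that carries the filters as a JSON request body."""
    return urllib.request.Request(
        f"{BASE_URL}{ENDPOINT}",
        data=json.dumps(filters).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="DELETE",
    )


# "job_name" is a placeholder filter key, not a confirmed schema field.
by_query = delete_via_query({"job_name": "demo"})
by_body = delete_via_body({"job_name": "demo"})
print(by_query.get_method(), by_query.full_url)
print(by_body.get_method(), by_body.full_url, by_body.data)
```

Which filter keys are accepted, and whether both forms are treated identically, should be checked against the OpenAPI specification served by the local `slurmrestd`.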
parent fc209e050f
commit b2f6e848a1
@@ -1,3 +0,0 @@
- version https://git-lfs.github.com/spec/v1
- oid sha256:a6d3e95f2bbda3c9567060efc3d7090ad8eac257fa3578798c89321957946e49
- size 7117445
3
slurm-24.05.3.tar.bz2
Normal file
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:b0b40513e9b6ae867ddb95d60b950bcb980c15b735b5d0dea37a9a00cc64ae24
+ size 7189600
872
slurm.changes
@@ -1,8 +1,275 @@
+ -------------------------------------------------------------------
+ Mon Oct 14 10:40:10 UTC 2024 - Egbert Eich <eich@suse.com>
+
+ - Update to version 24.05.3
+ * `data_parser/v0.0.40` - Added field descriptions.
+ * `slurmrestd` - Avoid creating new slurmdbd connection per request
+ to `* /slurm/slurmctld/*/*` endpoints.
+ * Fix compilation issue with `switch/hpe_slingshot` plugin.
+ * Fix gres per task allocation with threads-per-core.
+ * `data_parser/v0.0.41` - Added field descriptions.
+ * `slurmrestd` - Change back generated OpenAPI schema for
+ `DELETE /slurm/v0.0.40/jobs/` to `RequestBody` instead of using
+ parameters for request. `slurmrestd` will continue to accept endpoint
+ requests via `RequestBody` or HTTP query.
+ * `topology/tree` - Fix issues with switch distance optimization.
+ * Fix potential segfault of secondary `slurmctld` when falling back
+ to the primary when running with a `JobComp` plugin.
+ * Enable `--json`/`--yaml=v0.0.39` options on client commands to
+ dump data using data_parser/v0.0.39 instead of outputting nothing.
+ * `switch/hpe_slingshot` - Fix issue that could result in a 0 length
+ state file.
+ * Fix unnecessary message protocol downgrade for unregistered nodes.
+ * Fix unnecessarily packing alias addrs when terminating jobs with
+ a mix of non-cloud/dynamic nodes and powered down cloud/dynamic
+ nodes.
+ * `accounting_storage/mysql` - Fix issue when deleting a qos that
+ could remove too many commas from the qos and/or delta_qos fields
+ of the assoc table.
+ * `slurmctld` - Fix memory leak when using RestrictedCoresPerGPU.
+ * Fix allowing access to reservations without `MaxStartDelay` set.
+ * Fix regression introduced in 24.05.0rc1 breaking
+ `srun --send-libs` parsing.
+ * Fix slurmd vsize memory leak when using job submission/allocation
+ commands that implicitly or explicitly use --get-user-env.
+ * `slurmd` - Fix node going into invalid state when using
+ `CPUSpecList` and setting CPUs to the # of cores on a
+ multithreaded node.
+ * Fix reboot asap nodes being considered in backfill after a restart.
+ * Fix `--clusters`/`-M` queries for clusters outside of a
+ federation when `fed_display` is configured.
+ * Fix `scontrol` allowing updating job with bad cpus-per-task value.
+ * `sattach` - Fix regression from 24.05.2 security fix leading to
+ crash.
+ * `mpi/pmix` - Fix assertion when built under `--enable-debug`.
+ - Changes from Slurm 24.05.2
+ * Fix energy gathering rpc counter underflow in
+ `_rpc_acct_gather_energy` when more than 10 threads try to get
+ energy at the same time. This prevented the possibility to get
+ energy from slurmd by any step until slurmd was restarted,
+ so losing energy accounting metrics in the node.
+ * `accounting_storage/mysql` - Fix issue where new user with `wckey`
+ did not have a default wckey sent to the slurmctld.
+ * `slurmrestd` - Prevent slurmrestd segfault when handling the
+ following endpoints when none of the optional parameters are
+ specified:
+ `DELETE /slurm/v0.0.40/jobs`
+ `DELETE /slurm/v0.0.41/jobs`
+ `GET /slurm/v0.0.40/shares`
+ `GET /slurm/v0.0.41/shares`
+ `GET /slurmdb/v0.0.40/instance`
+ `GET /slurmdb/v0.0.41/instance`
+ `GET /slurmdb/v0.0.40/instances`
+ `GET /slurmdb/v0.0.41/instances`
+ `POST /slurm/v0.0.40/job/{job_id}`
+ `POST /slurm/v0.0.41/job/{job_id}`
+ * Fix IPMI energy gathering when no IPMIPowerSensors are specified
+ in `acct_gather.conf`. This situation resulted in an accounted
+ energy of 0 for job steps.
+ * Fix a minor memory leak in slurmctld when updating a job dependency.
+ * `scontrol`,`squeue` - Fix regression that caused incorrect values
+ for multisocket nodes at `.jobs[].job_resources.nodes.allocation`
+ for `scontrol show jobs --(json|yaml)` and `squeue --(json|yaml)`.
+ * `slurmrestd` - Fix regression that caused incorrect values for
+ multisocket nodes at `.jobs[].job_resources.nodes.allocation` to
+ be dumped with endpoints:
+ `GET /slurm/v0.0.41/job/{job_id}`
+ `GET /slurm/v0.0.41/jobs`
+ * `jobcomp/filetxt` - Fix truncation of job record lines > 1024
+ characters.
+ * `switch/hpe_slingshot` - Drain node on failure to delete CXI
+ services.
+ * Fix a performance regression from 23.11.0 in cpu frequency
+ handling when no `CpuFreqDef` is defined.
+ * Fix one-task-per-sharing not working across multiple nodes.
+ * Fix inconsistent number of cpus when creating a reservation
+ using the TRESPerNode option.
+ * `data_parser/v0.0.40+` - Fix job state parsing which could
+ break filtering.
+ * Prevent `cpus-per-task` to be modified in jobs where a `-c`
+ value has been explicitly specified and the requested memory
+ constraints implicitly increase the number of CPUs to allocate.
+ * `slurmrestd` - Fix regression where args `-s v0.0.39,dbv0.0.39`
+ and `-d v0.0.39` would result in `GET /openapi/v3` not
+ registering as a valid possible query resulting in 404 errors.
+ * `slurmrestd` - Fix memory leak for dbv0.0.39 jobs query which
+ occurred if the query parameters specified account, association,
+ cluster, constraints, format, groups, job_name, partition, qos,
+ reason, reservation, state, users, or wckey. This affects the
+ following endpoints:
+ `GET /slurmdb/v0.0.39/jobs`
+ * `slurmrestd` - In the case the slurmdbd does not respond to a
+ persistent connection init message, prevent the closed fd from
+ being used, and instead emit an error or warning depending on
+ if the connection was required.
+ * Fix 24.05.0 regression that caused the slurmdbd not to send back
+ an error message if there is an error initializing a persistent
+ connection.
+ * Reduce latency of forwarded x11 packets.
+ * Add `curr_dependency` (representing the current dependency of
+ the job)
+ and `orig_dependency` (representing the original requested
+ dependency of the job) fields to the job record in
+ `job_submit.lua` (for job update) and `jobcomp.lua`.
+ * Fix potential segfault of slurmctld configured with
+ `SlurmctldParameters=enable_rpc_queue` from happening on
+ reconfigure.
+ * Fix potential segfault of slurmctld on its shutdown when rate
+ limiting is enabled.
+ * `slurmrestd` - Fix missing job environment for `SLURM_JOB_NAME`,
+ `SLURM_OPEN_MODE`, `SLURM_JOB_DEPENDENCY`, `SLURM_PROFILE`,
+ `SLURM_ACCTG_FREQ`, `SLURM_NETWORK` and `SLURM_CPU_FREQ_REQ` to
+ match sbatch.
+ * Fix GRES environment variable indices being incorrect when only
+ using a subset of all GPUs on a node and the
+ `--gres-flags=allow-task-sharing` option.
+ * Prevent `scontrol` from segfaulting when requesting scontrol
+ show reservation `--json` or `--yaml` if there is an error
+ retrieving reservations from the `slurmctld`.
+ * `switch/hpe_slingshot` - Fix security issue around managing VNI
+ access. CVE-2024-42511.
+ * `switch/nvidia_imex` - Fix security issue managing IMEX channel
+ access. CVE-2024-42511.
+ * `switch/nvidia_imex` - Allow for compatibility with
+ `job_container/tmpfs`.
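The `.jobs[].job_resources.nodes.allocation` path named in the two regression fixes of the 24.05.2 entry can be read from a JSON dump roughly as sketched below. Only the key path comes from the changelog; the sample payload, the shape of the `allocation` value, and the helper name `node_allocations` are made up for illustration, so real output from `squeue --json` or `slurmrestd` may look different.

```python
import json


def node_allocations(document: dict) -> list:
    """Collect every `.jobs[].job_resources.nodes.allocation` value that is present."""
    found = []
    for job in document.get("jobs", []):
        # Either level may be missing or null, so fall back to an empty dict.
        nodes = (job.get("job_resources") or {}).get("nodes") or {}
        if "allocation" in nodes:
            found.append(nodes["allocation"])
    return found


# Made-up payload: the "placeholder" entry stands in for whatever Slurm reports.
sample = json.loads(
    '{"jobs": ['
    '{"job_resources": {"nodes": {"allocation": ["placeholder"]}}},'
    '{"job_resources": null}'
    ']}'
)
print(node_allocations(sample))
```

Comparing the collected values before and after the update is one way to check whether a multisocket node was affected by the regression.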
+ - Changes in Slurm 24.05.1
+ * Fix `slurmctld` and `slurmdbd` potentially stopping instead of
+ performing a logrotate when receiving `SIGUSR2` when using
+ `auth/slurm`.
+ * `switch/hpe_slingshot` - Fix slurmctld crash when upgrading
+ from 23.02.
+ * Fix "Could not find group" errors from `validate_group()` when
+ using `AllowGroups` with large `/etc/group` files.
+ * Add `AccountingStoreFlags=no_stdio` which allows to not record
+ the stdio paths of the job when set.
+ * `slurmrestd` - Prevent a slurmrestd segfault when parsing the
+ `crontab` field, which was never usable. Now it explicitly
+ ignores the value and emits a warning if it is used for the
+ following endpoints:
+ `POST /slurm/v0.0.39/job/{job_id}`
+ `POST /slurm/v0.0.39/job/submit`
+ `POST /slurm/v0.0.40/job/{job_id}`
+ `POST /slurm/v0.0.40/job/submit`
+ `POST /slurm/v0.0.41/job/{job_id}`
+ `POST /slurm/v0.0.41/job/submit`
+ `POST /slurm/v0.0.41/job/allocate`
+ * `mpi/pmi2` - Fix communication issue leading to task launch
+ failure with "`invalid kvs seq from node`".
+ * Fix getting user environment when using sbatch with
+ `--get-user-env` or `--export=` when there is a user profile
+ script that reads `/proc`.
+ * Prevent slurmd from crashing if `acct_gather_energy/gpu` is
+ configured but `GresTypes` is not configured.
+ * Do not log the following errors when `AcctGatherEnergyType`
+ plugins are used but a node does not have or cannot find sensors:
+ "`error: _get_joules_task: can't get info from slurmd`"
+ "`error: slurm_get_node_energy: Zero Bytes were transmitted or
+ received`"
+ However, the following error will continue to be logged:
+ "`error: Can't get energy data. No power sensors are available.
+ Try later`"
+ * `sbatch`, `srun` - Set `SLURM_NETWORK` environment variable if
+ `--network` is set.
+ * Fix cloud nodes not being able to forward to nodes that restarted
+ with new IP addresses.
+ * Fix cwd not being set correctly when running a SPANK plugin with a
+ `spank_user_init()` hook and the new "`contain_spank`" option set.
+ * `slurmctld` - Avoid deadlock during shutdown when `auth/slurm`
+ is active.
+ * Fix segfault in `slurmctld` with `topology/block`.
+ * `sacct` - Fix printing of job group for job steps.
+ * `scrun` - Log when an invalid environment variable causes the
+ job submission to be rejected.
+ * `accounting_storage/mysql` - Fix problem where listing or
+ modifying an association when specifying a qos list could hang
+ or take a very long time.
+ * `gpu/nvml` - Fix `gpuutil/gpumem` only tracking last GPU in step.
+ Now, `gpuutil/gpumem` will record sums of all GPUs in the step.
+ * Fix error in `scrontab` jobs when using
+ `slurm.conf:PropagatePrioProcess=1`.
+ * Fix `slurmctld` crash on a batch job submission with
+ `--nodes 0,...`.
+ * Fix dynamic IP address fanout forwarding when using `auth/slurm`.
+ * Restrict listening sockets in the `mpi/pmix` plugin and `sattach`
+ to the `SrunPortRange`.
+ * `slurmrestd` - Limit mime types returned from query to
+ `GET /openapi/v3` to only return one mime type per serializer
+ plugin to fix issues with OpenAPI client generators that are
+ unable to handle multiple mime type aliases.
+ * Fix many commands possibly reporting an "`Unexpected Message
+ Received`" when in reality the connection timed out.
+ * Prevent slurmctld from starting if there is not a json
+ serializer present and the `extra_constraints` feature is enabled.
+ * Fix heterogeneous job components not being signaled with
+ `scancel --ctld` and `DELETE /slurm/v0.0.40/jobs` if the job ids
+ are not explicitly given, the heterogeneous job components match
+ the given filters, and the heterogeneous job leader does not
+ match the given filters.
+ * Fix regression from 23.02 impeding job licenses from being cleared.
+ * Move error to `log_flag` which made `_get_joules_task` error to
+ be logged to the user when too many rpcs were queued in slurmd
+ for gathering energy.
+ * For `scancel --ctld` and the associated rest api endpoints:
+ `DELETE /slurm/v0.0.40/jobs`
+ `DELETE /slurm/v0.0.41/jobs`
+ Fix canceling the final array task in a job array when the task
+ is pending and all array tasks have been split into separate job
+ records. Previously this task was not canceled.
+ * Fix `power_save` operation after recovering from a failed
+ reconfigure.
+ * `slurmctld` - Skip removing the pidfile when running under
+ systemd. In that situation it is never created in the first place.
+ * Fix issue where altering the flags on a Slurm account
+ (`UsersAreCoords`) several limits on the account's association
+ would be set to 0 in Slurm's internal cache.
+ * Fix memory leak in the controller when relaying `stepmgr` step
+ accounting to the dbd.
+ * Fix segfault when submitting stepmgr jobs within an existing
+ allocation.
+ * Added `disable_slurm_hydra_bootstrap` as a possible `MpiParams`
+ parameter in `slurm.conf`. Using this will disable env variable
+ injection to allocations for the following variables:
+ `I_MPI_HYDRA_BOOTSTRAP`, `I_MPI_HYDRA_BOOTSTRAP_EXEC_EXTRA_ARGS`,
+ `HYDRA_BOOTSTRAP`, `HYDRA_LAUNCHER_EXTRA_ARGS`.
+ * `scrun` - Delay shutdown until after start requested.
+ This caused `scrun` to never start or shutdown and hang forever
+ when using `--tty`.
+ * Fix backup `slurmctld` potentially not running the agent when
+ taking over as the primary controller.
+ * Fix primary controller not running the agent when a reconfigure
+ of the `slurmctld` fails.
+ * `slurmd` - Fix premature timeout waiting for
+ `REQUEST_LAUNCH_PROLOG` with large array jobs causing node to
+ drain.
+ * `jobcomp/{elasticsearch,kafka}` - Avoid sending fields with
+ invalid date/time.
+ * `jobcomp/elasticsearch` - Fix `slurmctld` memory leak from
+ curl usage.
+ * `acct_gather_profile/influxdb` - Fix slurmstepd memory leak from
+ curl usage.
+ * Fix 24.05.0 regression not deleting job hash dirs after
+ `MinJobAge`.
+ * Fix filtering arguments being ignored when using squeue `--json`.
+ * `switch/nvidia_imex` - Move setup call after `spank_init()` to
+ allow namespace manipulation within the SPANK plugin.
+ * `switch/nvidia_imex` - Skip plugin operation if
+ `nvidia-caps-imex-channels` device is not present rather than
+ preventing slurmd from starting.
+ * `switch/nvidia_imex` - Skip plugin operation if
+ `job_container/tmpfs` is configured due to incompatibility.
+ * `switch/nvidia_imex` - Remove any pre-existing channels when
+ `slurmd` starts.
+ * `rpc_queue` - Add support for an optional `rpc_queue.yaml`
+ configuration file.
+ * `slurmrestd` - Add new +prefer_refs flag to `data_parser/v0.0.41`
+ plugin. This flag will avoid inlining single referenced schemas
+ in the OpenAPI schema.
+
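For the `disable_slurm_hydra_bootstrap` entry in the 24.05.1 list, a quick way to see which of the four listed variables are present inside an allocation is sketched below. The variable names are copied from the changelog; the helper name `injected_hydra_vars` is hypothetical. The check only reports whether a variable is set and cannot tell whether Slurm or the user set it.

```python
import os

# Variable names as listed in the 24.05.1 changelog entry.
HYDRA_BOOTSTRAP_VARS = (
    "I_MPI_HYDRA_BOOTSTRAP",
    "I_MPI_HYDRA_BOOTSTRAP_EXEC_EXTRA_ARGS",
    "HYDRA_BOOTSTRAP",
    "HYDRA_LAUNCHER_EXTRA_ARGS",
)


def injected_hydra_vars(env=None) -> dict:
    """Return the listed Hydra bootstrap variables that are set in `env`."""
    # Default to the current process environment when no mapping is given.
    env = os.environ if env is None else env
    return {name: env[name] for name in HYDRA_BOOTSTRAP_VARS if name in env}


# Run inside an allocation; an empty result means none of the variables is set.
print(injected_hydra_vars())
```

Running this once with and once without the parameter in `MpiParams` would be one way to confirm the behaviour described in the entry.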
  -------------------------------------------------------------------
  Tue Jun 4 09:36:54 UTC 2024 - Christian Goll <cgoll@suse.com>

- - updated to new release 24.05.0 with following major changes
- - IMPORTANT NOTES:
+ - Updated to new release 24.05.0 with following major changes
+ * Important Notes:
  If using the slurmdbd (Slurm DataBase Daemon) you must update
  this first. NOTE: If using a backup DBD you must start the
  primary first to do any database conversion, the backup will not
@@ -11,302 +278,360 @@ Tue Jun 4 09:36:54 UTC 2024 - Christian Goll <cgoll@suse.com>
  need to update all clusters at the same time, but it is very
  important to update slurmdbd first and having it running before
  updating any other clusters making use of it.
- - HIGHLIGHTS
- * Federation - allow client command operation when slurmdbd is
+ * Highlights
+ + Federation - allow client command operation when slurmdbd is
  unavailable.
- * burst_buffer/lua - Added two new hooks: slurm_bb_test_data_in
- and slurm_bb_test_data_out. The syntax and use of the new hooks
- are documented in etc/burst_buffer.lua.example. These are
+ + `burst_buffer/lua` - Added two new hooks: `slurm_bb_test_data_in`
+ and `slurm_bb_test_data_out`. The syntax and use of the new hooks
+ are documented in `etc/burst_buffer.lua.example`. These are
  required to exist. slurmctld now checks on startup if the
- burst_buffer.lua script loads and contains all required hooks;
- slurmctld will exit with a fatal error if this is not
- successful. Added PollInterval to burst_buffer.conf. Removed
+ `burst_buffer.lua` script loads and contains all required hooks;
+ `slurmctld` will exit with a fatal error if this is not
+ successful. Added `PollInterval` to `burst_buffer.conf`. Removed
  the arbitrary limit of 512 copies of the script running
  simultaneously.
- * Add QOS limit MaxTRESRunMinsPerAccount.
- * Add QOS limit MaxTRESRunMinsPerUser.
- * Add ELIGIBLE environment variable to jobcomp/script plugin.
- * Always use the QOS name for SLURM_JOB_QOS environment variables.
+ + Add QOS limit `MaxTRESRunMinsPerAccount`.
+ + Add QOS limit `MaxTRESRunMinsPerUser`.
+ + Add `ELIGIBLE` environment variable to `jobcomp/script` plugin.
+ + Always use the QOS name for `SLURM_JOB_QOS` environment variables.
  Previously the batch environment would use the description field,
  which was usually equivalent to the name.
- * cgroup/v2 - Require dbus-1 version >= 1.11.16.
- * Allow NodeSet names to be used in SuspendExcNodes.
- * SuspendExcNodes=<nodes>:N now counts allocated nodes in N. The
- first N powered up nodes in <nodes> are protected from being
- suspended.
- * Store job output, input and error paths in SlurmDBD.
- * Add USER_DELETE reservation flag to allow users with access to
- a reservation to delete it.
- * Add SlurmctldParameters=enable_stepmgr to enable step
- management through the slurmstepd instead of the controller.
- * Added PrologFlags=RunInJob to make prolog and epilog run
+ + `cgroup/v2` - Require dbus-1 version >= 1.11.16.
+ + Allow `NodeSet` names to be used in SuspendExcNodes.
+ + `SuspendExcNodes=<nodes>:N` now counts allocated nodes in `N`.
+ The first `N` powered up nodes in <nodes> are protected from
+ being suspended.
+ + Store job output, input and error paths in `SlurmDBD`.
+ + Add `USER_DELETE` reservation flag to allow users with access
+ to a reservation to delete it.
+ + Add `SlurmctldParameters=enable_stepmgr` to enable step
+ management through the `slurmstepd` instead of the controller.
+ + Added `PrologFlags=RunInJob` to make prolog and epilog run
  inside the job extern step to include it in the job's cgroup.
- * Add ability to reserve MPI ports at the job level for stepmgr
+ + Add ability to reserve MPI ports at the job level for stepmgr
  jobs and subdivide them at the step level.
- * slurmrestd - Add --generate-openapi-spec argument.
- - CONFIGURATION FILE CHANGES (see appropriate man page for details)
- * CoreSpecPlugin has been removed.
- * Removed TopologyPlugin tree and dragonfly support from
- select/linear. If those topology plugins are desired please switch to
- select/cons_tres.
- * Changed the default value for UnkillableStepTimeout to 60
- seconds or five times the value of MessageTimeout, whichever is greater.
- * An error log has been added if JobAcctGatherParams 'UsePss' or
- 'NoShare' are configured with a plugin other than jobacct_gather/linux.
- In such case these parameters are ignored.
- * helpers.conf - Added Flags=rebootless parameter allowing feature changes
- without rebooting compute nodes.
- * topology/block - Replaced the BlockLevels with BlockSizes in topology.conf.
- * Add contain_spank option to SlurmdParameters. When set, spank_user_init(),
- spank_task_post_fork(), and spank_task_exit() will execute within the
- job_container/tmpfs plugin namespace.
- * Add SlurmctldParameters=max_powered_nodes=N, which prevents powering up
- nodes after the max is reached.
- * Add ExclusiveTopo to a partition definition in slurm.conf.
- * Add AccountingStorageParameters=max_step_records to limit how many steps
- are recorded in the database for each job *- excluding batc
- - COMMAND CHANGES (see man pages for details)
- * Add support for "elevenses" as an additional time specification.
- * Add support for sbcast --preserve when job_container/tmpfs configured
- (previously documented as unsupported).
- * scontrol - Add new subcommand 'power' for node power control.
- * squeue - Adjust StdErr, StdOut, and StdIn output formats. These will now
- consistently print "(null)" if a value is unavailable. StdErr will no
- longer display StdOut if it is not distinctly set. StdOut will now
- correctly display the default filename pattern for job arrays, and no
- longer show it for non*batch jobs. However, the expansion patterns will
+ + `slurmrestd` - Add `--generate-openapi-spec` argument.
+ * Configuration File Changes (see appropriate man page for details)
+ + `CoreSpecPlugin` has been removed.
+ + Removed `TopologyPlugin` tree and dragonfly support from
+ `select/linear`. If those topology plugins are desired please
+ switch to `select/cons_tres`.
+ + Changed the default value for `UnkillableStepTimeout` to 60
+ seconds or five times the value of `MessageTimeout`, whichever
+ is greater.
+ + An error log has been added if `JobAcctGatherParams` '`UsePss`'
+ or '`NoShare`' are configured with a plugin other than
+ `jobacct_gather/linux`. In such case these parameters are ignored.
+ + `helpers.conf` - Added `Flags=rebootless` parameter allowing
+ feature changes without rebooting compute nodes.
+ + `topology/block` - Replaced the `BlockLevels` with `BlockSizes`
+ in `topology.conf`.
+ + Add `contain_spank` option to `SlurmdParameters`. When set,
+ `spank_user_init()`, `spank_task_post_fork()`, and
+ `spank_task_exit()` will execute within the
+ `job_container/tmpfs` plugin namespace.
+ + Add `SlurmctldParameters=max_powered_nodes=N`, which prevents
+ powering up nodes after the max is reached.
+ + Add `ExclusiveTopo` to a partition definition in `slurm.conf`.
+ + Add `AccountingStorageParameters=max_step_records` to limit how
+ many steps are recorded in the database for each job - excluding
+ batch.
+ * Command Changes (see man pages for details)
+ + Add support for "elevenses" as an additional time specification.
+ + Add support for `sbcast --preserve` when `job_container/tmpfs`
+ configured (previously documented as unsupported).
+ + `scontrol` - Add new subcommand `power` for node power control.
+ + `squeue` - Adjust `StdErr`, `StdOut`, and `StdIn` output formats.
+ These will now consistently print "`(null)`" if a value is
+ unavailable. `StdErr` will no longer display `StdOut` if it is
+ not distinctly set. `StdOut` will now correctly display the
+ default filename pattern for job arrays, and no longer show it
+ for non-batch jobs. However, the expansion patterns will
  no longer be substituted by default.
- * Add --segment to job allocation to be used in topology/block.
- * Add --exclusive=topo for use with topology/block.
- * squeue - Add --expand-patterns option to expand StdErr, StdOut, StdIn
- filename patterns as best as possible.
- * sacct - Add --expand-patterns option to expand StdErr, StdOut, StdIn
- filename patterns as best as possible.
- * sreport - Requesting format=Planned will now return the expected Planned
- time as documented, instead of PlannedDown. To request Planned Down,
- one must use now format=PLNDDown or format=PlannedDown explicitly. The
- abbreviations "Pl" or "Pla" will now make reference to Planned instead of
- PlannedDown.
- - API CHANGES
- * Removed ListIterator type from <slurm/slurm.h>.
- * Removed slurm_xlate_job_id() from <slurm/slurm.h>
- - SLURMRESTD CHANGES
- * openapi/dbv0.0.38 and openapi/v0.0.38 plugins have been removed.
- * openapi/dbv0.0.39 and openapi/v0.0.39 plugins have been tagged as
- deprecated to warn of their removal in the next release.
- * Changed slurmrestd.service to only listen on TCP socket by default.
- Environments with existing drop*in units for the service may need
- further adjustments to work after upgrading.
- * slurmrestd - Tagged `script` field as deprecated in
- 'POST /slurm/v0.0.41/job/submit' in anticipation of removal in future
- OpenAPI plugin versions. Job submissions should set the `job.script` (or
- `jobs[0].script` for HetJobs) fields instead.
- * slurmrestd - Attempt to automatically convert enumerated string arrays with
- incoming non*string values into strings. Add warning when incoming value for
- enumerated string arrays can not be converted to string and silently ignore
- instead of rejecting entire request. This change affects any endpoint that
- uses an enunmerated string as given in the OpenAPI specification. An
- example of this conversion would be to 'POST /slurm/v0.0.41/job/submit' with
- '.job.exclusive = true'. While the JSON (boolean) true value matches a
- possible enumeration, it is not the expected "true" string. This change
- automatically converts the (boolean) true to (string) "true" avoiding a
- parsing failure.
- * slurmrestd - Add 'POST /slurm/v0.0.41/job/allocate' endpoint. This endpoint
- will create a new job allocation without any steps. The allocation will need
- to be ended via signaling the job or it will run to the timelimit.
- * slurmrestd - Allow startup when slurmdbd is not configured and avoid loading
- slurmdbd specific plugins.
- - MPI/PMI2 CHANGES
- * Jobs submitted with the SLURM_HOSTFILE environment variable set implies
- using an arbitrary distribution. Nevertheless, the logic used in PMI2 when
- generating their associated PMI_process_mapping values has been changed and
- will now be the same used for the plane distribution, as if "-m plane" were
- used. This has been changed because the original arbitrary distribution
- implementation did not account for multiple instances of the same host being
- present in SLURM_HOSTFILE, providing an incorrect process mapping in such
- case. This change also enables distributing tasks in blocks when using
- arbitrary distribution, which was not the case before. This only affects
+ + Add `--segment` to job allocation to be used in topology/block.
+ + Add `--exclusive=topo` for use with topology/block.
+ + `squeue` - Add `--expand-patterns` option to expand `StdErr`,
+ `StdOut`, `StdIn` filename patterns as best as possible.
+ + `sacct` - Add `--expand-patterns` option to expand `StdErr`,
+ `StdOut`, `StdIn` filename patterns as best as possible.
+ + `sreport` - Requesting `format=Planned` will now return the
+ expected `Planned` time as documented, instead of `PlannedDown`.
+ To request `Planned Down`, one must now use `format=PLNDDown`
+ or `format=PlannedDown` explicitly. The abbreviations
+ "`Pl`" or "`Pla`" will now make reference to Planned instead
+ of `PlannedDown`.
+ * API Changes
+ + Removed `ListIterator` type from `<slurm/slurm.h>`.
+ + Removed `slurm_xlate_job_id()` from `<slurm/slurm.h>`.
+ * SLURMRESTD Changes
+ + `openapi/dbv0.0.38` and `openapi/v0.0.38` plugins have been
+ removed.
+ + `openapi/dbv0.0.39` and `openapi/v0.0.39` plugins have been
+ tagged as deprecated to warn of their removal in the next release.
+ + Changed `slurmrestd.service` to only listen on TCP socket by
+ default. Environments with existing drop-in units for the
+ service may need further adjustments to work after upgrading.
+ + `slurmrestd` - Tagged `script` field as deprecated in
+ `POST /slurm/v0.0.41/job/submit` in anticipation of removal in
+ future OpenAPI plugin versions. Job submissions should set the
+ `job.script` (or `jobs[0].script` for HetJobs) fields instead.
+ + `slurmrestd` - Attempt to automatically convert enumerated
+ string arrays with incoming non-string values into strings.
+ Add warning when incoming value for enumerated string arrays
+ can not be converted to string and silently ignore instead of
+ rejecting entire request. This change affects any endpoint that
+ uses an enumerated string as given in the OpenAPI specification.
+ An example of this conversion would be to
+ `POST /slurm/v0.0.41/job/submit` with `.job.exclusive = true`.
+ While the JSON (boolean) true value matches a possible
+ enumeration, it is not the expected "true" string. This change
+ automatically converts the (boolean) `true` to (string) "`true`"
+ avoiding a parsing failure.
+ + `slurmrestd` - Add `POST /slurm/v0.0.41/job/allocate` endpoint.
+ This endpoint will create a new job allocation without any steps.
+ The allocation will need to be ended via signaling the job or
+ it will run to the timelimit.
+ + `slurmrestd` - Allow startup when `slurmdbd` is not configured
+ and avoid loading `slurmdbd` specific plugins.
+ * MPI/PMI2 Changes
+ + Jobs submitted with the `SLURM_HOSTFILE` environment variable
+ set implies using an arbitrary distribution. Nevertheless, the
+ logic used in PMI2 when generating their associated
|
`PMI_process_mapping` values has been changed and will now be
|
||||||
mpi/pmi2 plugin.
|
the same used for the plane distribution, as if `-m plane` were
|
||||||
- removed Fix-test-21.41.patch as upstream test changed
|
used. This has been changed because the original arbitrary
|
||||||
|
distribution implementation did not account for multiple
|
||||||
|
instances of the same host being present in `SLURM_HOSTFILE`,
|
||||||
|
providing an incorrect process mapping in such case. This
|
||||||
|
change also enables distributing tasks in blocks when using
|
||||||
|
arbitrary distribution, which was not the case before. This
|
||||||
|
only affects the `mpi/pmi2` plugin.
|
||||||
|
* Removed Fix-test-21.41.patch as upstream test changed.
|
||||||
|
|
||||||
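The enumerated string conversion described in the `slurmrestd` entry above can be illustrated with a minimal Python sketch. This is not slurmrestd's implementation: the function name `coerce_enum_values` and the exact warning text are assumptions made for illustration. It only shows the stated behavior, where non-string values are converted to strings when possible and unconvertible values are skipped with a warning instead of failing the whole request.

```python
def coerce_enum_values(values):
    """Convert incoming JSON values for an enumerated string array.

    Returns (converted, warnings). Hypothetical helper, not a Slurm API.
    """
    converted = []
    warnings = []
    for value in values:
        if isinstance(value, str):
            converted.append(value)
        elif isinstance(value, bool):
            # JSON booleans become the lowercase strings "true"/"false".
            # bool must be checked before int, since bool subclasses int.
            converted.append("true" if value else "false")
        elif isinstance(value, (int, float)):
            converted.append(str(value))
        else:
            # null, objects, and arrays have no sensible string form:
            # warn and ignore rather than rejecting the entire request.
            warnings.append(f"ignoring unconvertible value: {value!r}")
    return converted, warnings


converted, warnings = coerce_enum_values([True, "user", None])
print(converted)  # ['true', 'user']
print(warnings)   # ['ignoring unconvertible value: None']
```

With this behavior, a request carrying `.job.exclusive = true` is treated the same as one carrying the string "true".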
-------------------------------------------------------------------
|
-------------------------------------------------------------------
|
||||||
Mon Mar 25 15:16:44 UTC 2024 - Christian Goll <cgoll@suse.com>
|
Mon Mar 25 15:16:44 UTC 2024 - Christian Goll <cgoll@suse.com>
|
||||||
|
|
||||||
- removed Keep-logs-of-skipped-test-when-running-test-cases-sequentially.patch
|
- removed Keep-logs-of-skipped-test-when-running-test-cases-sequentially.patch
|
||||||
as incorporated upstream
|
as incorporated upstream
|
||||||
* Changes in Slurm 23.02.5
|
- Changes in Slurm 23.02.5
|
||||||
* Add the JobId to debug() messages indicating when cpus_per_task/mem_per_cpu
|
* Add the `JobId` to `debug()` messages indicating when
|
||||||
or pn_min_cpus are being automatically adjusted.
|
`cpus_per_task/mem_per_cpu` or `pn_min_cpus` are being
|
||||||
* Fix regression in 23.02.2 that caused slurmctld -R to crash on startup if
|
automatically adjusted.
|
||||||
a node features plugin is configured.
|
* Fix regression in 23.02.2 that caused `slurmctld -R` to crash on
|
||||||
|
startup if a node features plugin is configured.
|
||||||
* Fix and prevent reoccurring reservations from overlapping.
|
* Fix and prevent reoccurring reservations from overlapping.
|
||||||
* job_container/tmpfs - Avoid attempts to share BasePath between nodes.
|
* `job_container/tmpfs` - Avoid attempts to share `BasePath`
|
||||||
* Change the log message warning for rate limited users from verbose to info.
|
between nodes.
|
||||||
* With CR_Cpu_Memory, fix node selection for jobs that request gres and
|
* Change the log message warning for rate limited users from
|
||||||
--mem-per-cpu.
|
verbose to info.
|
||||||
* Fix a regression from 22.05.7 in which some jobs were allocated too few
|
* With `CR_Cpu_Memory`, fix node selection for jobs that request
|
||||||
nodes, thus overcommitting cpus to some tasks.
|
gres and `--mem-per-cpu`.
|
||||||
* Fix a job being stuck in the completing state if the job ends while the
|
* Fix a regression from 22.05.7 in which some jobs were allocated
|
||||||
primary controller is down or unresponsive and the backup controller has
|
too few nodes, thus overcommitting cpus to some tasks.
|
||||||
not yet taken over.
|
* Fix a job being stuck in the completing state if the job ends
|
||||||
* Fix slurmctld segfault when a node registers with a configured CpuSpecList
|
while the primary controller is down or unresponsive and the
|
||||||
while slurmctld configuration has the node without CpuSpecList.
|
backup controller has not yet taken over.
|
||||||
* Fix cloud nodes getting stuck in POWERED_DOWN+NO_RESPOND state after not
|
* Fix `slurmctld` segfault when a node registers with a configured
|
||||||
registering by ResumeTimeout.
|
`CpuSpecList` while slurmctld configuration has the node without
|
||||||
* slurmstepd - Avoid cleanup of config.json-less containers spooldir getting
|
`CpuSpecList`.
|
||||||
skipped.
|
* Fix cloud nodes getting stuck in `POWERED_DOWN+NO_RESPOND` state
|
||||||
* slurmstepd - Cleanup per task generated environment for containers in
|
after not registering by `ResumeTimeout`.
|
||||||
spooldir.
|
* `slurmstepd` - Avoid cleanup of `config.json`-less containers
|
||||||
* Fix scontrol segfault when 'completing' command requested repeatedly in
|
spooldir getting skipped.
|
||||||
interactive mode.
|
* `slurmstepd` - Cleanup per task generated environment for
|
||||||
* Properly handle a race condition between bind() and listen() calls in the
|
containers in spooldir.
|
||||||
network stack when running with SrunPortRange set.
|
* Fix `scontrol` segfault when 'completing' command requested
|
||||||
* Federation - Fix revoked jobs being returned regardless of the -a/--all
|
repeatedly in interactive mode.
|
||||||
option for privileged users.
|
* Properly handle a race condition between `bind()` and `listen()`
|
||||||
* Federation - Fix canceling pending federated jobs from non-origin clusters
|
calls in the network stack when running with `SrunPortRange` set.
|
||||||
which could leave federated jobs orphaned from the origin cluster.
|
* Federation - Fix revoked jobs being returned regardless of the
|
||||||
* Fix sinfo segfault when printing multiple clusters with --noheader option.
|
`-a`/`--all` option for privileged users.
|
||||||
* Federation - fix clusters not syncing if clusters are added to a federation
|
* Federation - Fix canceling pending federated jobs from non-origin
|
||||||
before they have registered with the dbd.
|
clusters which could leave federated jobs orphaned from the origin
|
||||||
* Change pmi2 plugin to honor the SrunPortRange option. This matches the new
|
cluster.
|
||||||
behavior of the pmix plugin in 23.02.0. Note that neither of these plugins
|
* Fix sinfo segfault when printing multiple clusters with
|
||||||
makes use of the "MpiParams=ports=" option, and previously were only limited
|
`--noheader` option.
|
||||||
by the system's ephemeral port range.
|
* Federation - fix clusters not syncing if clusters are added to
|
||||||
* node_features/helpers - Fix node selection for jobs requesting changeable
|
a federation before they have registered with the dbd.
|
||||||
features with the '|' operator, which could prevent jobs from running on
|
* Change `pmi2` plugin to honor the `SrunPortRange` option. This
|
||||||
some valid nodes.
|
matches the new behavior of the pmix plugin in 23.02.0. Note that
|
||||||
* node_features/helpers - Fix inconsistent handling of '&' and '|', where an
|
neither of these plugins makes use of the "`MpiParams=ports=`"
|
||||||
AND'd feature was sometimes AND'd to all sets of features instead of just
|
option, and previously were only limited by the system's ephemeral
|
||||||
the current set. E.g. "foo|bar&baz" was interpreted as {foo,baz} or
|
port range.
|
||||||
{bar,baz} instead of how it is documented: "{foo} or {bar,baz}".
|
* `node_features/helpers` - Fix node selection for jobs requesting
|
||||||
* Fix job accounting so that when a job is requeued its allocated node count
|
changeable features with the '`|`' operator, which could prevent
|
||||||
is cleared. After the requeue, sacct will correctly show that the job has
|
jobs from running on some valid nodes.
|
||||||
0 AllocNodes while it is pending or if it is canceled before restarting.
|
* `node_features/helpers` - Fix inconsistent handling of '`&`' and
|
||||||
* sacct - AllocCPUS now correctly shows 0 if a job has not yet received an
|
'`|`', where an AND'd feature was sometimes AND'd to all sets of
|
||||||
allocation or if the job was canceled before getting one.
|
features instead of just the current set. E.g. "`foo|bar&baz`" was
|
||||||
* Fix intel oneapi autodetect: detect the /dev/dri/renderD[0-9]+ gpus, and do
|
interpreted as `{foo,baz}` or `{bar,baz}` instead of how it is
|
||||||
not detect /dev/dri/card[0-9]+.
|
documented: "`{foo} or {bar,baz}`".
|
||||||
* Format batch, extern, interactive, and pending step ids into strings that
|
* Fix job accounting so that when a job is requeued its allocated
|
||||||
are human readable.
|
node count is cleared. After the requeue, sacct will correctly
|
||||||
* Fix node selection for jobs that request --gpus and a number of tasks fewer
|
show that the job has 0 `AllocNodes` while it is pending or if
|
||||||
than gpus, which resulted in incorrectly rejecting these jobs.
|
it is canceled before restarting.
|
||||||
* Remove MYSQL_OPT_RECONNECT completely.
|
* `sacct` - `AllocCPUS` now correctly shows 0 if a job has not yet
|
||||||
* Fix cloud nodes in POWERING_UP state disappearing (getting set to FUTURE)
|
received an allocation or if the job was canceled before getting
|
||||||
when an `scontrol reconfigure` happens.
|
one.
|
||||||
* openapi/dbv0.0.39 - Avoid assert / segfault on missing coordinators list.
|
* Fix intel oneapi autodetect: detect the `/dev/dri/renderD[0-9]+`
|
||||||
* slurmrestd - Correct memory leak while parsing OpenAPI specification
|
gpus, and do not detect `/dev/dri/card[0-9]+`.
|
||||||
templates with server overrides.
|
* Format batch, extern, interactive, and pending step ids into
|
||||||
* slurmrestd - Reduce memory usage when printing out job CPU frequency.
|
strings that are human readable.
|
||||||
|
* Fix node selection for jobs that request `--gpus` and a number
|
||||||
|
of tasks fewer than gpus, which resulted in incorrectly rejecting
|
||||||
|
these jobs.
|
||||||
|
* Remove `MYSQL_OPT_RECONNECT` completely.
|
||||||
|
* Fix cloud nodes in `POWERING_UP` state disappearing (getting set
|
||||||
|
to `FUTURE`) when an `scontrol reconfigure` happens.
|
||||||
|
* `openapi/dbv0.0.39` - Avoid assert / segfault on missing
|
||||||
|
coordinators list.
|
||||||
|
* `slurmrestd` - Correct memory leak while parsing OpenAPI
|
||||||
|
specification templates with server overrides.
|
||||||
|
* `slurmrestd` - Reduce memory usage when printing out job CPU
|
||||||
|
frequency.
|
||||||
* Fix overwriting user node reason with system message.
|
* Fix overwriting user node reason with system message.
|
||||||
* Remove --uid / --gid options from salloc and srun commands.
|
* Remove `--uid` / `--gid` options from salloc and srun commands.
|
||||||
* Prevent deadlock when rpc_queue is enabled.
|
* Prevent deadlock when rpc_queue is enabled.
|
||||||
* slurmrestd - Correct OpenAPI specification generation bug where fields with
|
* `slurmrestd` - Correct OpenAPI specification generation bug where
|
||||||
overlapping parent paths would not get generated.
|
fields with overlapping parent paths would not get generated.
|
||||||
* Fix memory leak as a result of a partition info query.
|
* Fix memory leak as a result of a partition info query.
|
||||||
* Fix memory leak as a result of a job info query.
|
* Fix memory leak as a result of a job info query.
|
||||||
* slurmrestd - For 'GET /slurm/v0.0.39/node[s]', change format of node's
|
* slurmrestd - For `GET /slurm/v0.0.39/node[s]`, change format of
|
||||||
energy field "current_watts" to a dictionary to account for unset value
|
node's energy field `current_watts` to a dictionary to account
|
||||||
instead of dumping 4294967294.
|
for unset value instead of dumping `4294967294`.
|
||||||
* slurmrestd - For 'GET /slurm/v0.0.39/qos', change format of QOS's
|
* `slurmrestd` - For `GET /slurm/v0.0.39/qos`, change format of
|
||||||
field "priority" to a dictionary to account for unset value instead of
|
QOS's field `priority` to a dictionary to account for unset
|
||||||
dumping 4294967294.
|
value instead of dumping `4294967294`.
|
||||||
* slurmrestd - For 'GET /slurm/v0.0.39/job[s]', the 'return code' code field
|
* `slurmrestd` - For `GET /slurm/v0.0.39/job[s]`, the `return code`
|
||||||
in v0.0.39_job_exit_code will be set to -127 instead of being left unset
|
code field in `v0.0.39_job_exit_code` will be set to -127 instead
|
||||||
where job does not have a relevant return code.
|
of being left unset where job does not have a relevant return code.
|
||||||
* data_parser/v0.0.39 - Add required/memory_per_cpu and
|
* `data_parser/v0.0.39` - Add `required/memory_per_cpu` and
|
||||||
required/memory_per_node to `sacct --json` and `sacct --yaml` and
|
required/memory_per_node to `sacct --json` and `sacct --yaml` and
|
||||||
'GET /slurmdb/v0.0.39/jobs' from slurmrestd.
|
`GET /slurmdb/v0.0.39/jobs` from `slurmrestd`.
|
||||||
* For step allocations, fix --gres=none sometimes not ignoring gres from the
|
* For step allocations, fix `--gres=none` sometimes not ignoring
|
||||||
job.
|
gres from the job.
|
||||||
* Fix --exclusive jobs incorrectly gang-scheduling where they shouldn't.
|
* Fix `--exclusive` jobs incorrectly gang-scheduling where they
|
||||||
* Fix allocations with CR_SOCKET, gres not assigned to a specific socket, and
|
shouldn't.
|
||||||
block core distribution potentially allocating more sockets than required.
|
* Fix allocations with `CR_SOCKET`, gres not assigned to a specific
|
||||||
* gpu/oneapi - Store cores correctly so CPU affinity is tracked.
|
socket, and block core distribution potentially allocating more
|
||||||
* Revert a change in 23.02.3 where Slurm would kill a script's process group
|
sockets than required.
|
||||||
as soon as the script ended instead of waiting as long as any process in
|
* `gpu/oneapi` - Store cores correctly so CPU affinity is tracked.
|
||||||
that process group held the stdout/stderr file descriptors open. That change
|
* Revert a change in 23.02.3 where Slurm would kill a script's
|
||||||
broke some scripts that relied on the previous behavior. Setting time limits
|
process group as soon as the script ended instead of waiting as
|
||||||
for scripts (such as PrologEpilogTimeout) is strongly encouraged to avoid
|
long as any process in
|
||||||
Slurm waiting indefinitely for scripts to finish.
|
that process group held the stdout/stderr file descriptors open.
|
||||||
|
That change broke some scripts that relied on the previous
|
||||||
|
behavior. Setting time limits for scripts (such as
|
||||||
|
`PrologEpilogTimeout`) is strongly encouraged to avoid Slurm
|
||||||
|
waiting indefinitely for scripts to finish.
|
||||||
* Allow slurmdbd -R to work if the root assoc id is not 1.
|
* Allow slurmdbd -R to work if the root assoc id is not 1.
|
||||||
* Fix slurmdbd -R not returning an error under certain conditions.
|
* Fix `slurmdbd -R` not returning an error under certain conditions.
|
||||||
* slurmdbd - Avoid potential NULL pointer dereference in the mysql plugin.
|
* `slurmdbd` - Avoid potential NULL pointer dereference in the
|
||||||
* Revert a change in 23.02 where SLURM_NTASKS was no longer set in the job's
|
mysql plugin.
|
||||||
environment when --ntasks-per-node was requested.
|
* Revert a change in 23.02 where `SLURM_NTASKS` was no longer
|
||||||
* Limit periodic node registrations to 50 instead of the full TreeWidth.
|
set in the job's environment when `--ntasks-per-node` was
|
||||||
Since unresolvable cloud/dynamic nodes must disable fanout by setting
|
requested.
|
||||||
TreeWidth to a large number, this would cause all nodes to register at
|
* Limit periodic node registrations to 50 instead of the full
|
||||||
once.
|
`TreeWidth`.
|
||||||
* Fix regression in 23.02.3 which broke x11 forwarding for hosts when
|
Since unresolvable `cloud/dynamic` nodes must disable fanout by
|
||||||
MUNGE sends a localhost address in the encode host field. This is caused
|
setting `TreeWidth` to a large number, this would cause all nodes
|
||||||
when the node hostname is mapped to 127.0.0.1 (or similar) in /etc/hosts.
|
to register at once.
|
||||||
* openapi/[db]v0.0.39 - fix memory leak on parsing error.
|
* Fix regression in 23.02.3 which broke x11 forwarding for hosts
|
||||||
* data_parser/v0.0.39 - fix updating qos for associations.
|
when `MUNGE` sends a localhost address in the encode host field.
|
||||||
* openapi/dbv0.0.39 - fix updating values for associations with null users.
|
This is caused when the node hostname is mapped to 127.0.0.1
|
||||||
* Fix minor memory leak with --tres-per-task and licenses.
|
(or similar) in `/etc/hosts`.
|
||||||
|
* `openapi/[db]v0.0.39` - fix memory leak on parsing error.
|
||||||
|
* `data_parser/v0.0.39` - fix updating qos for associations.
|
||||||
|
* `openapi/dbv0.0.39` - fix updating values for associations with
|
||||||
|
null users.
|
||||||
|
* Fix minor memory leak with `--tres-per-task` and licenses.
|
||||||
* Fix cyclic socket cpu distribution for tasks in a step where
|
* Fix cyclic socket cpu distribution for tasks in a step where
|
||||||
--cpus-per-task < usable threads per core.
|
`--cpus-per-task` < usable threads per core.
|
||||||
- Changes in Slurm 23.02.4
|
- Changes in Slurm 23.02.4
|
||||||
* Fix sbatch return code when --wait is requested on a job array.
|
* Fix `sbatch` return code when --wait is requested on a job array.
|
||||||
* switch/hpe_slingshot - avoid segfault when running with old libcxi.
|
* `switch/hpe_slingshot` - avoid segfault when running with old
|
||||||
* Avoid slurmctld segfault when specifying AccountingStorageExternalHost.
|
libcxi.
|
||||||
* Fix collected GPUUtilization values for acct_gather_profile plugins.
|
* Avoid slurmctld segfault when specifying
|
||||||
|
`AccountingStorageExternalHost`.
|
||||||
|
* Fix collected `GPUUtilization` values for `acct_gather_profile`
|
||||||
|
plugins.
|
||||||
* Fix slurmrestd handling of job hold/release operations.
|
* Fix slurmrestd handling of job hold/release operations.
|
||||||
* Make spank S_JOB_ARGV item value hold the requested command argv instead of
|
* Make spank `S_JOB_ARGV` item value hold the requested command
|
||||||
the srun --bcast value when --bcast requested (only in local context).
|
argv instead of the srun `--bcast` value when `--bcast` requested
|
||||||
* Fix step running indefinitely when slurmctld takes more than MessageTimeout
|
(only in local context).
|
||||||
to respond. Now, slurmctld will cancel the step when detected, preventing
|
* Fix step running indefinitely when slurmctld takes more than
|
||||||
following steps from getting stuck waiting for resources to be released.
|
`MessageTimeout` to respond. Now, `slurmctld` will cancel the
|
||||||
* Fix regression to make job_desc.min_cpus accurate again in job_submit when
|
step when detected, preventing following steps from getting stuck
|
||||||
requesting a job with --ntasks-per-node.
|
waiting for resources to be released.
|
||||||
* scontrol - Permit changes to StdErr and StdIn for pending jobs.
|
* Fix regression to make job_desc.min_cpus accurate again in
|
||||||
* scontrol - Reset std{err,in,out} when set to empty string.
|
job_submit when requesting a job with `--ntasks-per-node`.
|
||||||
* slurmrestd - mark environment as a required field for job submission
|
* `scontrol` - Permit changes to `StdErr` and `StdIn` for pending
|
||||||
descriptions.
|
jobs.
|
||||||
* slurmrestd - avoid dumping null in OpenAPI schema required fields.
|
* `scontrol` - Reset std{err,in,out} when set to empty string.
|
||||||
* data_parser/v0.0.39 - avoid rejecting valid memory_per_node formatted as
|
* `slurmrestd` - mark environment as a required field for job
|
||||||
dictionary provided with a job description.
|
submission descriptions.
|
||||||
* data_parser/v0.0.39 - avoid rejecting valid memory_per_cpu formatted as
|
* `slurmrestd` - avoid dumping null in OpenAPI schema required
|
||||||
dictionary provided with a job description.
|
fields.
|
||||||
* slurmrestd - Return HTTP error code 404 when job query fails.
|
* `data_parser/v0.0.39` - avoid rejecting valid `memory_per_node`
|
||||||
* slurmrestd - Add return schema to error response to job and license query.
|
formatted as dictionary provided with a job description.
|
||||||
|
* `data_parser/v0.0.39` - avoid rejecting valid `memory_per_cpu`
|
||||||
|
formatted as dictionary provided with a job description.
|
||||||
|
* `slurmrestd` - Return HTTP error code 404 when job query fails.
|
||||||
|
* `slurmrestd` - Add return schema to error response to job and
|
||||||
|
license query.
|
||||||
* Fix handling of ArrayTaskThrottle in backfill.
|
* Fix handling of ArrayTaskThrottle in backfill.
|
||||||
* Fix regression in 23.02.2 when checking gres state on slurmctld startup or
|
* Fix regression in 23.02.2 when checking gres state on `slurmctld`
|
||||||
reconfigure. Gres changes in the configuration were not updated on slurmctld
|
startup or reconfigure. Gres changes in the configuration were
|
||||||
startup. On startup or reconfigure, these messages were present in the log:
|
not updated on `slurmctld` startup. On startup or reconfigure,
|
||||||
"error: Attempt to change gres/gpu Count".
|
these messages were present in the log:
|
||||||
|
"`error: Attempt to change gres/gpu Count`".
|
||||||
* Fix potential double count of gres when dealing with limits.
|
* Fix potential double count of gres when dealing with limits.
|
||||||
* switch/hpe_slingshot - support alternate traffic class names with "TC_"
|
* `switch/hpe_slingshot` - support alternate traffic class names
|
||||||
prefix.
|
with "`TC_`" prefix.
|
||||||
* scrontab - Fix cutting off the final character of quoted variables.
|
* `scrontab` - Fix cutting off the final character of quoted
|
||||||
* Fix slurmstepd segfault when ContainerPath is not set in oci.conf
|
variables.
|
||||||
* Change the log message warning for rate limited users from debug to verbose.
|
* Fix `slurmstepd` segfault when `ContainerPath` is not set in
|
||||||
* Fixed an issue where jobs requesting licenses were incorrectly rejected.
|
`oci.conf`.
|
||||||
* smail - Fix issues where e-mails at job completion were not being sent.
|
* Change the log message warning for rate limited users from
|
||||||
* scontrol/slurmctld - fix comma parsing when updating a reservation's nodes.
|
debug to verbose.
|
||||||
* cgroup/v2 - Avoid capturing log output for ebpf when constraining devices,
|
* Fixed an issue where jobs requesting licenses were incorrectly
|
||||||
as this can lead to inadvertent failure if the log buffer is too small.
|
rejected.
|
||||||
* Fix --gpu-bind=single binding tasks to wrong gpus, leading to some gpus
|
* `smail` - Fix issues where emails at job completion were not
|
||||||
having more tasks than they should and other gpus being unused.
|
being sent.
|
||||||
* Fix main scheduler loop not starting after failover to backup controller.
|
* `scontrol/slurmctld` - fix comma parsing when updating a
|
||||||
* Added error message when attempting to use sattach on batch or extern steps.
|
reservation's nodes.
|
||||||
* Fix regression in 23.02 that causes slurmstepd to crash when srun requests
|
* `cgroup/v2` - Avoid capturing log output for ebpf when
|
||||||
more than TreeWidth nodes in a step and uses the pmi2 or pmix plugin.
|
constraining devices, as this can lead to inadvertent failure
|
||||||
* Reject job ArrayTaskThrottle update requests from unprivileged users.
|
if the log buffer is too small.
|
||||||
* data_parser/v0.0.39 - populate description fields of property objects in
|
* Fix --gpu-bind=single binding tasks to wrong gpus, leading to
|
||||||
generated OpenAPI specifications where defined.
|
some gpus having more tasks than they should and other gpus being
|
||||||
* slurmstepd - Avoid segfault caused by ContainerPath not being terminated by
|
unused.
|
||||||
'/' in oci.conf.
|
* Fix main scheduler loop not starting after failover to backup
|
||||||
* data_parser/v0.0.39 - Change v0.0.39_job_info response to tag exit_code
|
controller.
|
||||||
field as being complex instead of only an unsigned integer.
|
* Added error message when attempting to use sattach on batch or
|
||||||
* job_container/tmpfs - Fix %h and %n substitution in BasePath where %h was
|
extern steps.
|
||||||
substituted as the NodeName instead of the hostname, and %n was substituted
|
* Fix regression in 23.02 that causes slurmstepd to crash when
|
||||||
as an empty string.
|
`srun` requests more than `TreeWidth` nodes in a step and uses
|
||||||
* Fix regression where --cpu-bind=verbose would override TaskPluginParam.
|
the `pmi2` or `pmix` plugin.
|
||||||
* scancel - Fix --clusters/-M for federations. Only filtered jobs (e.g. -A,
|
* Reject job `ArrayTaskThrottle` update requests from unprivileged
|
||||||
-u, -p, etc.) from the specified clusters will be canceled, rather than all
|
users.
|
||||||
jobs in the federation. Specific jobids will still be routed to the origin
|
* `data_parser/v0.0.39` - populate description fields of property
|
||||||
cluster for cancellation.
|
objects in generated OpenAPI specifications where defined.
|
||||||
|
* `slurmstepd` - Avoid segfault caused by ContainerPath not being
|
||||||
|
terminated by '`/`' in `oci.conf`.
|
||||||
|
* `data_parser/v0.0.39` - Change `v0.0.39_job_info` response to tag
|
||||||
|
`exit_code` field as being complex instead of only an unsigned
|
||||||
|
integer.
|
||||||
|
* `job_container/tmpfs` - Fix %h and %n substitution in `BasePath`
|
||||||
|
where `%h` was substituted as the `NodeName` instead of the
|
||||||
|
hostname, and `%n` was substituted as an empty string.
|
||||||
|
* Fix regression where --cpu-bind=verbose would override
|
||||||
|
`TaskPluginParam`.
|
||||||
|
* `scancel` - Fix `--clusters`/`-M` for federations. Only filtered
|
||||||
|
jobs (e.g. -A, -u, -p, etc.) from the specified clusters will be
|
||||||
|
canceled, rather than all jobs in the federation.
|
||||||
|
Specific jobids will still be routed to the origin cluster
|
||||||
|
for cancellation.
|
||||||
|
|
||||||
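The documented precedence of `&` over `|` in the `node_features/helpers` entry above ("`foo|bar&baz`" meaning `{foo}` or `{bar,baz}`) can be sketched in Python. This is an illustration only, not Slurm's parser: the function names `parse_feature_sets` and `node_matches` are hypothetical, and the sketch ignores parentheses, counts, and any other operators in Slurm's constraint syntax.

```python
def parse_feature_sets(expression):
    """Split a feature expression into alternative sets of AND'd features.

    '|' separates alternatives; '&' joins features within one alternative,
    so an AND'd feature applies only to the set it appears in.
    """
    alternatives = []
    for alternative in expression.split("|"):
        features = {name for name in alternative.split("&") if name}
        if features:
            alternatives.append(features)
    return alternatives


def node_matches(expression, node_features):
    """Return True if the node satisfies at least one alternative set."""
    available = set(node_features)
    return any(required <= available
               for required in parse_feature_sets(expression))


print(parse_feature_sets("foo|bar&baz") == [{"foo"}, {"bar", "baz"}])  # True
print(node_matches("foo|bar&baz", ["foo"]))         # True
print(node_matches("foo|bar&baz", ["bar"]))         # False
print(node_matches("foo|bar&baz", ["bar", "baz"]))  # True
```

Under the inconsistent handling the entry describes, a node offering only `foo` could be rejected because `baz` was wrongly AND'd to every set.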
-------------------------------------------------------------------
|
-------------------------------------------------------------------
|
||||||
Mon Jan 29 13:47:55 UTC 2024 - Egbert Eich <eich@suse.com>
|
Mon Jan 29 13:47:55 UTC 2024 - Egbert Eich <eich@suse.com>
|
||||||
@ -2337,7 +2662,6 @@ Fri Jul 2 08:01:32 UTC 2021 - Christian Goll <cgoll@suse.com>
|
|||||||
- Updated to 20.11.8:
|
- Updated to 20.11.8:
|
||||||
* slurmctld - fix erroneous "StepId=CORRUPT" messages in error logs.
|
* slurmctld - fix erroneous "StepId=CORRUPT" messages in error logs.
|
||||||
* Correct the error given when auth plugin fails to pack a credential.
|
* Correct the error given when auth plugin fails to pack a credential.
|
||||||
* Fix unused-variable compiler warning on FreeBSD in fd_resolve_path().
|
|
||||||
* acct_gather_filesystem/lustre - only emit collection error once per step.
|
* acct_gather_filesystem/lustre - only emit collection error once per step.
|
||||||
* Add GRES environment variables (e.g., CUDA_VISIBLE_DEVICES) into the
|
* Add GRES environment variables (e.g., CUDA_VISIBLE_DEVICES) into the
|
||||||
interactive step, the same as is done for the batch step.
|
interactive step, the same as is done for the batch step.
|
||||||
|
@ -19,7 +19,7 @@
|
|||||||
# Check file META in sources: update so_version to (API_CURRENT - API_AGE)
|
# Check file META in sources: update so_version to (API_CURRENT - API_AGE)
|
||||||
%define so_version 41
|
%define so_version 41
|
||||||
# Make sure to update `upgrades` as well!
|
# Make sure to update `upgrades` as well!
|
||||||
%define ver 24.05.0
|
%define ver 24.05.3
|
||||||
%define _ver _24_05
|
%define _ver _24_05
|
||||||
%define dl_ver %{ver}
|
%define dl_ver %{ver}
|
||||||
# so-version is 0 and seems to be stable
|
# so-version is 0 and seems to be stable
|
||||||
@ -59,6 +59,9 @@ ExclusiveArch: do_not_build
|
|||||||
%if 0%{?sle_version} == 150500 || 0%{?sle_version} == 150600
|
%if 0%{?sle_version} == 150500 || 0%{?sle_version} == 150600
|
||||||
%define base_ver 2302
|
%define base_ver 2302
|
||||||
%endif
|
%endif
|
||||||
|
%if 0%{?sle_version} == 150500 || 0%{?sle_version} == 150600
|
||||||
|
%define base_ver 2302
|
||||||
|
%endif
|
||||||
|
|
||||||
%define ver_m %{lua:x=string.gsub(rpm.expand("%ver"),"%.[^%.]*$","");print(x)}
|
%define ver_m %{lua:x=string.gsub(rpm.expand("%ver"),"%.[^%.]*$","");print(x)}
|
||||||
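The `ver_m` macro above strips the last dot-separated component from `%ver` using a Lua pattern. As a rough check of what the version bump produces, the same substitution can be written in Python; the function name `major_minor` is an assumption for illustration, and the regular expression is a translation of the Lua pattern `%.[^%.]*$`, not part of the spec file.

```python
import re


def major_minor(version):
    """Drop the final '.<component>' from a version string.

    Mirrors string.gsub(version, "%.[^%.]*$", "") from the spec macro.
    """
    return re.sub(r"\.[^.]*$", "", version)


print(major_minor("24.05.0"))  # 24.05
print(major_minor("24.05.3"))  # 24.05
```

Both the old and the new `%ver` values reduce to the same `ver_m`, so anything keyed on `ver_m` should be unaffected by this update.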
# Keep format_spec_file from botching the define below:
|
# Keep format_spec_file from botching the define below:
|
||||||
|