From b2f6e848a17d4f719632f0dbad4b9d9af642720b906e1b0f824d9754582b13ff Mon Sep 17 00:00:00 2001 From: Egbert Eich Date: Tue, 15 Oct 2024 06:51:09 +0000 Subject: [PATCH] - Update to version 24.05.3 * `data_parser/v0.0.40` - Added field descriptions. * `slurmrestd` - Avoid creating new slurmdbd connection per request to `* /slurm/slurmctld/*/*` endpoints. * Fix compilation issue with `switch/hpe_slingshot` plugin. * Fix gres per task allocation with threads-per-core. * `data_parser/v0.0.41` - Added field descriptions. * `slurmrestd` - Change back generated OpenAPI schema for `DELETE /slurm/v0.0.40/jobs/` to `RequestBody` instead of using parameters for request. `slurmrestd` will continue to accept endpoint requests via `RequestBody` or HTTP query. * `topology/tree` - Fix issues with switch distance optimization. * Fix potential segfault of secondary `slurmctld` when falling back to the primary when running with a `JobComp` plugin. * Enable `--json`/`--yaml=v0.0.39` options on client commands to dump data using data_parser/v0.0.39 instead of outputting nothing. * `switch/hpe_slingshot` - Fix issue that could result in a 0 length state file. * Fix unnecessary message protocol downgrade for unregistered nodes. * Fix unnecessarily packing alias addrs when terminating jobs with a mix of non-cloud/dynamic nodes and powered down cloud/dynamic nodes. * `accounting_storage/mysql` - Fix issue when deleting a qos that could remove too many commas from the qos and/or delta_qos fields of the assoc table. * `slurmctld` - Fix memory leak when using RestrictedCoresPerGPU. * Fix allowing access to reservations without `MaxStartDelay` set. * Fix regression introduced in 24.05.0rc1 breaking `srun --send-libs` parsing.
* Fix slurmd vsize memory leak when using job submission/allocation OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=295 --- slurm-24.05.0.tar.bz2 | 3 - slurm-24.05.3.tar.bz2 | 3 + slurm.changes | 920 ++++++++++++++++++++++++++++-------------- slurm.spec | 5 +- 4 files changed, 629 insertions(+), 302 deletions(-) delete mode 100644 slurm-24.05.0.tar.bz2 create mode 100644 slurm-24.05.3.tar.bz2 diff --git a/slurm-24.05.0.tar.bz2 b/slurm-24.05.0.tar.bz2 deleted file mode 100644 index f03a32b..0000000 --- a/slurm-24.05.0.tar.bz2 +++ /dev/null @@ -1,3 +0,0 @@ -version https://git-lfs.github.com/spec/v1 -oid sha256:a6d3e95f2bbda3c9567060efc3d7090ad8eac257fa3578798c89321957946e49 -size 7117445 diff --git a/slurm-24.05.3.tar.bz2 b/slurm-24.05.3.tar.bz2 new file mode 100644 index 0000000..725eecc --- /dev/null +++ b/slurm-24.05.3.tar.bz2 @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:b0b40513e9b6ae867ddb95d60b950bcb980c15b735b5d0dea37a9a00cc64ae24 +size 7189600 diff --git a/slurm.changes b/slurm.changes index 1d36455..a757ca4 100644 --- a/slurm.changes +++ b/slurm.changes @@ -1,312 +1,637 @@ +------------------------------------------------------------------- +Mon Oct 14 10:40:10 UTC 2024 - Egbert Eich + +- Update to version 24.05.3 + * `data_parser/v0.0.40` - Added field descriptions. + * `slurmrestd` - Avoid creating new slurmdbd connection per request + to `* /slurm/slurmctld/*/*` endpoints. + * Fix compilation issue with `switch/hpe_slingshot` plugin. + * Fix gres per task allocation with threads-per-core. + * `data_parser/v0.0.41` - Added field descriptions. + * `slurmrestd` - Change back generated OpenAPI schema for + `DELETE /slurm/v0.0.40/jobs/` to `RequestBody` instead of using + parameters for request. `slurmrestd` will continue accept endpoint + requests via `RequestBody` or HTTP query. + * `topology/tree` - Fix issues with switch distance optimization. 
+ * Fix potential segfault of secondary `slurmctld` when falling back + to the primary when running with a `JobComp` plugin. + * Enable `--json`/`--yaml=v0.0.39` options on client commands to + dump data using data_parser/v0.0.39 instead of outputting nothing. + * `switch/hpe_slingshot` - Fix issue that could result in a 0 length + state file. + * Fix unnecessary message protocol downgrade for unregistered nodes. + * Fix unnecessarily packing alias addrs when terminating jobs with + a mix of non-cloud/dynamic nodes and powered down cloud/dynamic + nodes. + * `accounting_storage/mysql` - Fix issue when deleting a qos that + could remove too many commas from the qos and/or delta_qos fields + of the assoc table. + * `slurmctld` - Fix memory leak when using RestrictedCoresPerGPU. + * Fix allowing access to reservations without `MaxStartDelay` set. + * Fix regression introduced in 24.05.0rc1 breaking + `srun --send-libs` parsing. + * Fix slurmd vsize memory leak when using job submission/allocation + commands that implicitly or explicitly use --get-user-env. + * `slurmd` - Fix node going into invalid state when using + `CPUSpecList` and setting CPUs to the # of cores on a + multithreaded node. + * Fix reboot asap nodes being considered in backfill after a restart. + * Fix `--clusters`/`-M` queries for clusters outside of a + federation when `fed_display` is configured. + * Fix `scontrol` allowing updating job with bad cpus-per-task value. + * `sattach` - Fix regression from 24.05.2 security fix leading to + crash. + * `mpi/pmix` - Fix assertion when built under `--enable-debug`. +- Changes from Slurm 24.05.2 + * Fix energy gathering rpc counter underflow in + `_rpc_acct_gather_energy` when more than 10 threads try to get + energy at the same time. This prevented the possibility to get + energy from slurmd by any step until slurmd was restarted, + so losing energy accounting metrics in the node.
+ * `accounting_storage/mysql` - Fix issue where new user with `wckey` + did not have a default wckey sent to the slurmctld. + * `slurmrestd` - Prevent slurmrestd segfault when handling the + following endpoints when none of the optional parameters are + specified: + `DELETE /slurm/v0.0.40/jobs` + `DELETE /slurm/v0.0.41/jobs` + `GET /slurm/v0.0.40/shares` + `GET /slurm/v0.0.41/shares` + `GET /slurmdb/v0.0.40/instance` + `GET /slurmdb/v0.0.41/instance` + `GET /slurmdb/v0.0.40/instances` + `GET /slurmdb/v0.0.41/instances` + `POST /slurm/v0.0.40/job/{job_id}` + `POST /slurm/v0.0.41/job/{job_id}` + * Fix IPMI energy gathering when no IPMIPowerSensors are specified + in `acct_gather.conf`. This situation resulted in an accounted + energy of 0 for job steps. + * Fix a minor memory leak in slurmctld when updating a job dependency. + * `scontrol`,`squeue` - Fix regression that caused incorrect values + for multisocket nodes at `.jobs[].job_resources.nodes.allocation` + for `scontrol show jobs --(json|yaml)` and `squeue --(json|yaml)`. + * `slurmrestd` - Fix regression that caused incorrect values for + multisocket nodes at `.jobs[].job_resources.nodes.allocation` to + be dumped with endpoints: + `GET /slurm/v0.0.41/job/{job_id}` + `GET /slurm/v0.0.41/jobs` + * `jobcomp/filetxt` - Fix truncation of job record lines > 1024 + characters. + * `switch/hpe_slingshot` - Drain node on failure to delete CXI + services. + * Fix a performance regression from 23.11.0 in cpu frequency + handling when no `CpuFreqDef` is defined. + * Fix one-task-per-sharing not working across multiple nodes. + * Fix inconsistent number of cpus when creating a reservation + using the TRESPerNode option. + * `data_parser/v0.0.40+` - Fix job state parsing which could + break filtering. + * Prevent `cpus-per-task` to be modified in jobs where a `-c` + value has been explicitly specified and the requested memory + constraints implicitly increase the number of CPUs to allocate. 
+ * `slurmrestd` - Fix regression where args `-s v0.0.39,dbv0.0.39` + and `-d v0.0.39` would result in `GET /openapi/v3` not + registering as a valid possible query resulting in 404 errors. + * `slurmrestd` - Fix memory leak for dbv0.0.39 jobs query which + occurred if the query parameters specified account, association, + cluster, constraints, format, groups, job_name, partition, qos, + reason, reservation, state, users, or wckey. This affects the + following endpoints: + `GET /slurmdb/v0.0.39/jobs` + * `slurmrestd` - In the case the slurmdbd does not respond to a + persistent connection init message, prevent the closed fd from + being used, and instead emit an error or warning depending on + if the connection was required. + * Fix 24.05.0 regression that caused the slurmdbd not to send back + an error message if there is an error initializing a persistent + connection. + * Reduce latency of forwarded x11 packets. + * Add `curr_dependency` (representing the current dependency of + the job) + and `orig_dependency` (representing the original requested + dependency of the job) fields to the job record in + `job_submit.lua` (for job update) and `jobcomp.lua`. + * Fix potential segfault of slurmctld configured with + `SlurmctldParameters=enable_rpc_queue` from happening on + reconfigure. + * Fix potential segfault of slurmctld on its shutdown when rate + limiting is enabled. + * `slurmrestd` - Fix missing job environment for `SLURM_JOB_NAME`, + `SLURM_OPEN_MODE`, `SLURM_JOB_DEPENDENCY`, `SLURM_PROFILE`, + `SLURM_ACCTG_FREQ`, `SLURM_NETWORK` and `SLURM_CPU_FREQ_REQ` to + match sbatch. + * Fix GRES environment variable indices being incorrect when only + using a subset of all GPUs on a node and the + `--gres-flags=allow-task-sharing` option. + * Prevent `scontrol` from segfaulting when requesting scontrol + show reservation `--json` or `--yaml` if there is an error + retrieving reservations from the `slurmctld`.
+ * `switch/hpe_slingshot` - Fix security issue around managing VNI + access. CVE-2024-42511. + * `switch/nvidia_imex` - Fix security issue managing IMEX channel + access. CVE-2024-42511. + * `switch/nvidia_imex` - Allow for compatibility with + `job_container/tmpfs`. +- Changes in Slurm 24.05.1 + * Fix `slurmctld` and `slurmdbd` potentially stopping instead of + performing a logrotate when receiving `SIGUSR2` when using + `auth/slurm`. + * `switch/hpe_slingshot` - Fix slurmctld crash when upgrading + from 23.02. + * Fix "Could not find group" errors from `validate_group()` when + using `AllowGroups` with large `/etc/group` files. + * Add `AccountingStoreFlags=no_stdio` which allows to not record + the stdio paths of the job when set. + * `slurmrestd` - Prevent a slurmrestd segfault when parsing the + `crontab` field, which was never usable. Now it explicitly + ignores the value and emits a warning if it is used for the + following endpoints: + `POST /slurm/v0.0.39/job/{job_id}` + `POST /slurm/v0.0.39/job/submit` + `POST /slurm/v0.0.40/job/{job_id}` + `POST /slurm/v0.0.40/job/submit` + `POST /slurm/v0.0.41/job/{job_id}` + `POST /slurm/v0.0.41/job/submit` + `POST /slurm/v0.0.41/job/allocate` + * `mpi/pmi2` - Fix communication issue leading to task launch + failure with "`invalid kvs seq from node`". + * Fix getting user environment when using sbatch with + `--get-user-env` or `--export=` when there is a user profile + script that reads `/proc`. + * Prevent slurmd from crashing if `acct_gather_energy/gpu` is + configured but `GresTypes` is not configured. + * Do not log the following errors when `AcctGatherEnergyType` + plugins are used but a node does not have or cannot find sensors: + "`error: _get_joules_task: can't get info from slurmd`" + "`error: slurm_get_node_energy: Zero Bytes were transmitted or + received`" + However, the following error will continue to be logged: + "`error: Can't get energy data. No power sensors are available.
+ Try later`" + * `sbatch`, `srun` - Set `SLURM_NETWORK` environment variable if + `--network` is set. + * Fix cloud nodes not being able to forward to nodes that restarted + with new IP addresses. + * Fix cwd not being set correctly when running a SPANK plugin with a + `spank_user_init()` hook and the new "`contain_spank`" option set. + * `slurmctld` - Avoid deadlock during shutdown when `auth/slurm` + is active. + * Fix segfault in `slurmctld` with `topology/block`. + * `sacct` - Fix printing of job group for job steps. + * `scrun` - Log when an invalid environment variable causes the + job submission to be rejected. + * `accounting_storage/mysql` - Fix problem where listing or + modifying an association when specifying a qos list could hang + or take a very long time. + * `gpu/nvml` - Fix `gpuutil/gpumem` only tracking last GPU in step. + Now, `gpuutil/gpumem` will record sums of all GPUS in the step. + * Fix error in `scrontab` jobs when using + `slurm.conf:PropagatePrioProcess=1`. + * Fix `slurmctld` crash on a batch job submission with + `--nodes 0,...`. + * Fix dynamic IP address fanout forwarding when using `auth/slurm`. + * Restrict listening sockets in the `mpi/pmix` plugin and `sattach` + to the `SrunPortRange`. + * `slurmrestd` - Limit mime types returned from query to + `GET /openapi/v3` to only return one mime type per serializer + plugin to fix issues with OpenAPI client generators that are + unable to handle multiple mime type aliases. + * Fix many commands possibly reporting an "`Unexpected Message + Received`" when in reality the connection timed out. + * Prevent slurmctld from starting if there is not a json + serializer present and the `extra_constraints` feature is enabled. 
+ * Fix heterogeneous job components not being signaled with + `scancel --ctld` and `DELETE slurm/v0.0.40/jobs` if the job ids + are not explicitly given, the heterogeneous job components match + the given filters, and the heterogeneous job leader does not + match the given filters. + * Fix regression from 23.02 impeding job licenses from being cleared. + * Move error to `log_flag` which made `_get_joules_task` error to + be logged to the user when too many rpcs were queued in slurmd + for gathering energy. + * For `scancel --ctld` and the associated rest api endpoints: + `DELETE /slurm/v0.0.40/jobs` + `DELETE /slurm/v0.0.41/jobs` + Fix canceling the final array task in a job array when the task + is pending and all array tasks have been split into separate job + records. Previously this task was not canceled. + * Fix `power_save operation` after recovering from a failed + reconfigure. + * `slurmctld` - Skip removing the pidfile when running under + systemd. In that situation it is never created in the first place. + * Fix issue where altering the flags on a Slurm account + (`UsersAreCoords`) several limits on the account's association + would be set to 0 in Slurm's internal cache. + * Fix memory leak in the controller when relaying `stepmgr` step + accounting to the dbd. + * Fix segfault when submitting stepmgr jobs within an existing + allocation. + * Added `disable_slurm_hydra_bootstrap` as a possible `MpiParams` + parameter in `slurm.conf`. Using this will disable env variable + injection to allocations for the following variables: + `I_MPI_HYDRA_BOOTSTRAP,` `I_MPI_HYDRA_BOOTSTRAP_EXEC_EXTRA_ARGS`, + `HYDRA_BOOTSTRAP`, `HYDRA_LAUNCHER_EXTRA_ARGS`. + * `scrun` - Delay shutdown until after start requested. + This caused `scrun` to never start or shutdown and hung forever + when using `--tty`. + * Fix backup `slurmctld` potentially not running the agent when + taking over as the primary controller. 
+ * Fix primary controller not running the agent when a reconfigure + of the `slurmctld` fails. + * `slurmd` - fix premature timeout waiting for + `REQUEST_LAUNCH_PROLOG` with large array jobs causing node to + drain. + * `jobcomp/{elasticsearch,kafka}` - Avoid sending fields with + invalid date/time. + * `jobcomp/elasticsearch` - Fix `slurmctld` memory leak from + curl usage. + * `acct_gather_profile/influxdb` - Fix slurmstepd memory leak from + curl usage + * Fix 24.05.0 regression not deleting job hash dirs after + `MinJobAge`. + * Fix filtering arguments being ignored when using squeue `--json`. + * `switch/nvidia_imex` - Move setup call after `spank_init()` to + allow namespace manipulation within the SPANK plugin. + * `switch/nvidia_imex` - Skip plugin operation if + `nvidia-caps-imex-channels` device is not present rather than + preventing slurmd from starting. + * `switch/nvidia_imex` - Skip plugin operation if + `job_container/tmpfs` is configured due to incompatibility. + * `switch/nvidia_imex` - Remove any pre-existing channels when + `slurmd` starts. + * `rpc_queue` - Add support for an optional `rpc_queue.yaml` + configuration file. + * `slurmrestd` - Add new +prefer_refs flag to `data_parser/v0.0.41` + plugin. This flag will avoid inlining single referenced schemas + in the OpenAPI schema. + ------------------------------------------------------------------- Tue Jun 4 09:36:54 UTC 2024 - Christian Goll -- updated to new release 24.05.0 with following major changes -- IMPORTANT NOTES: - If using the slurmdbd (Slurm DataBase Daemon) you must update - this first. NOTE: If using a backup DBD you must start the - primary first to do any database conversion, the backup will not - start until this has happened. The 24.05 slurmdbd will work - with Slurm daemons of version 23.02 and above. 
You will not - need to update all clusters at the same time, but it is very - important to update slurmdbd first and having it running before - updating any other clusters making use of it. -- HIGHLIGHTS - * Federation - allow client command operation when slurmdbd is - unavailable. - * burst_buffer/lua - Added two new hooks: slurm_bb_test_data_in - and slurm_bb_test_data_out. The syntax and use of the new hooks - are documented in etc/burst_buffer.lua.example. These are - required to exist. slurmctld now checks on startup if the - burst_buffer.lua script loads and contains all required hooks; - slurmctld will exit with a fatal error if this is not - successful. Added PollInterval to burst_buffer.conf. Removed - the arbitrary limit of 512 copies of the script running - simultaneously. - * Add QOS limit MaxTRESRunMinsPerAccount. - * Add QOS limit MaxTRESRunMinsPerUser. - * Add ELIGIBLE environment variable to jobcomp/script plugin. - * Always use the QOS name for SLURM_JOB_QOS environment variables. - Previously the batch environment would use the description field, - which was usually equivalent to the name. - * cgroup/v2 - Require dbus-1 version >= 1.11.16. - * Allow NodeSet names to be used in SuspendExcNodes. - * SuspendExcNodes=:N now counts allocated nodes in N. The - first N powered up nodes in are protected from being - suspended. - * Store job output, input and error paths in SlurmDBD. - * Add USER_DELETE reservation flag to allow users with access to - a reservation to delete it. - * Add SlurmctldParameters=enable_stepmgr to enable step - management through the slurmstepd instead of the controller. - * Added PrologFlags=RunInJob to make prolog and epilog run - inside the job extern step to include it in the job's cgroup. - * Add ability to reserve MPI ports at the job level for stepmgr - jobs and subdivide them at the step level. - * slurmrestd - Add --generate-openapi-spec argument. 
-- CONFIGURATION FILE CHANGES (see appropriate man page for details) - * CoreSpecPlugin has been removed. - * Removed TopologyPlugin tree and dragonfly support from - select/linear. If those topology plugins are desired please switch to - select/cons_tres. - * Changed the default value for UnkillableStepTimeout to 60 - seconds or five times the value of MessageTimeout, whichever is greater. - * An error log has been added if JobAcctGatherParams 'UsePss' or - 'NoShare' are configured with a plugin other than jobacct_gather/linux. - In such case these parameters are ignored. - * helpers.conf - Added Flags=rebootless parameter allowing feature changes - without rebooting compute nodes. - * topology/block - Replaced the BlockLevels with BlockSizes in topology.conf. - * Add contain_spank option to SlurmdParameters. When set, spank_user_init(), - spank_task_post_fork(), and spank_task_exit() will execute within the - job_container/tmpfs plugin namespace. - * Add SlurmctldParameters=max_powered_nodes=N, which prevents powering up - nodes after the max is reached. - * Add ExclusiveTopo to a partition definition in slurm.conf. - * Add AccountingStorageParameters=max_step_records to limit how many steps - are recorded in the database for each job *- excluding batc -- COMMAND CHANGES (see man pages for details) - * Add support for "elevenses" as an additional time specification. - * Add support for sbcast --preserve when job_container/tmpfs configured - (previously documented as unsupported). - * scontrol - Add new subcommand 'power' for node power control. - * squeue - Adjust StdErr, StdOut, and StdIn output formats. These will now - consistently print "(null)" if a value is unavailable. StdErr will no - longer display StdOut if it is not distinctly set. StdOut will now - correctly display the default filename pattern for job arrays, and no - longer show it for non*batch jobs. However, the expansion patterns will - no longer be substituted by default. 
- * Add --segment to job allocation to be used in topology/block. - * Add --exclusive=topo for use with topology/block. - * squeue - Add --expand-patterns option to expand StdErr, StdOut, StdIn - filename patterns as best as possible. - * sacct - Add --expand-patterns option to expand StdErr, StdOut, StdIn - filename patterns as best as possible. - * sreport - Requesting format=Planned will now return the expected Planned - time as documented, instead of PlannedDown. To request Planned Down, - one must use now format=PLNDDown or format=PlannedDown explicitly. The - abbreviations "Pl" or "Pla" will now make reference to Planned instead of - PlannedDown. -- API CHANGES - * Removed ListIterator type from . - * Removed slurm_xlate_job_id() from -- SLURMRESTD CHANGES - * openapi/dbv0.0.38 and openapi/v0.0.38 plugins have been removed. - * openapi/dbv0.0.39 and openapi/v0.0.39 plugins have been tagged as - deprecated to warn of their removal in the next release. - * Changed slurmrestd.service to only listen on TCP socket by default. - Environments with existing drop*in units for the service may need - further adjustments to work after upgrading. - * slurmrestd - Tagged `script` field as deprecated in - 'POST /slurm/v0.0.41/job/submit' in anticipation of removal in future - OpenAPI plugin versions. Job submissions should set the `job.script` (or - `jobs[0].script` for HetJobs) fields instead. - * slurmrestd - Attempt to automatically convert enumerated string arrays with - incoming non*string values into strings. Add warning when incoming value for - enumerated string arrays can not be converted to string and silently ignore - instead of rejecting entire request. This change affects any endpoint that - uses an enunmerated string as given in the OpenAPI specification. An - example of this conversion would be to 'POST /slurm/v0.0.41/job/submit' with - '.job.exclusive = true'. 
While the JSON (boolean) true value matches a - possible enumeration, it is not the expected "true" string. This change - automatically converts the (boolean) true to (string) "true" avoiding a - parsing failure. - * slurmrestd - Add 'POST /slurm/v0.0.41/job/allocate' endpoint. This endpoint - will create a new job allocation without any steps. The allocation will need - to be ended via signaling the job or it will run to the timelimit. - * slurmrestd - Allow startup when slurmdbd is not configured and avoid loading - slurmdbd specific plugins. -- MPI/PMI2 CHANGES - * Jobs submitted with the SLURM_HOSTFILE environment variable set implies - using an arbitrary distribution. Nevertheless, the logic used in PMI2 when - generating their associated PMI_process_mapping values has been changed and - will now be the same used for the plane distribution, as if "-m plane" were - used. This has been changed because the original arbitrary distribution - implementation did not account for multiple instances of the same host being - present in SLURM_HOSTFILE, providing an incorrect process mapping in such - case. This change also enables distributing tasks in blocks when using - arbitrary distribution, which was not the case before. This only affects - mpi/pmi2 plugin. -- removed Fix-test-21.41.patch as upstream test changed +- Updated to new release 24.05.0 with following major changes + * Important Notes: + If using the slurmdbd (Slurm DataBase Daemon) you must update + this first. NOTE: If using a backup DBD you must start the + primary first to do any database conversion, the backup will not + start until this has happened. The 24.05 slurmdbd will work + with Slurm daemons of version 23.02 and above. You will not + need to update all clusters at the same time, but it is very + important to update slurmdbd first and having it running before + updating any other clusters making use of it. 
+ * Highlights + + Federation - allow client command operation when slurmdbd is + unavailable. + + `burst_buffer/lua` - Added two new hooks: `slurm_bb_test_data_in` + and `slurm_bb_test_data_out`. The syntax and use of the new hooks + are documented in `etc/burst_buffer.lua.example`. These are + required to exist. slurmctld now checks on startup if the + `burst_buffer.lua` script loads and contains all required hooks; + `slurmctld` will exit with a fatal error if this is not + successful. Added `PollInterval` to `burst_buffer.conf`. Removed + the arbitrary limit of 512 copies of the script running + simultaneously. + + Add QOS limit `MaxTRESRunMinsPerAccount`. + + Add QOS limit `MaxTRESRunMinsPerUser`. + + Add `ELIGIBLE` environment variable to `jobcomp/script` plugin. + + Always use the QOS name for `SLURM_JOB_QOS` environment variables. + Previously the batch environment would use the description field, + which was usually equivalent to the name. + + `cgroup/v2` - Require dbus-1 version >= 1.11.16. + + Allow `NodeSet` names to be used in SuspendExcNodes. + + `SuspendExcNodes=:N` now counts allocated nodes in `N`. + The first `N` powered up nodes in are protected from + being suspended. + + Store job output, input and error paths in `SlurmDBD`. + + Add `USER_DELETE` reservation flag to allow users with access + to a reservation to delete it. + + Add `SlurmctldParameters=enable_stepmgr` to enable step + management through the `slurmstepd` instead of the controller. + + Added `PrologFlags=RunInJob` to make prolog and epilog run + inside the job extern step to include it in the job's cgroup. + + Add ability to reserve MPI ports at the job level for stepmgr + jobs and subdivide them at the step level. + + `slurmrestd` - Add `--generate-openapi-spec argument`. + * Configuration File Changes (see appropriate man page for details) + + `CoreSpecPlugin` has been removed. + + Removed `TopologyPlugin` tree and dragonfly support from + `select/linear`. 
If those topology plugins are desired please + switch to `select/cons_tres`. + + Changed the default value for `UnkillableStepTimeout` to 60 + seconds or five times the value of `MessageTimeout`, whichever + is greater. + + An error log has been added if `JobAcctGatherParams` '`UsePss`' + or '`NoShare`' are configured with a plugin other than + `jobacct_gather/linux`. In such case these parameters are ignored. + + `helpers.conf` - Added `Flags=rebootless` parameter allowing + feature changes without rebooting compute nodes. + + `topology/block` - Replaced the `BlockLevels` with `BlockSizes` + in `topology.conf`. + + Add `contain_spank` option to `SlurmdParameters`. When set, + `spank_user_init()`, `spank_task_post_fork()`, and + `spank_task_exit()` will execute within the + `job_container/tmpfs` plugin namespace. + + Add `SlurmctldParameters=max_powered_nodes=N`, which prevents + powering up nodes after the max is reached. + + Add `ExclusiveTopo` to a partition definition in `slurm.conf`. + + Add `AccountingStorageParameters=max_step_records` to limit how + many steps are recorded in the database for each job - excluding + batch. + * Command Changes (see man pages for details) + + Add support for "elevenses" as an additional time specification. + + Add support for `sbcast --preserve` when `job_container/tmpfs` + configured (previously documented as unsupported). + + `scontrol` - Add new subcommand `power` for node power control. + + `squeue` - Adjust `StdErr`, `StdOut`, and `StdIn` output formats. + These will now consistently print "`(null)`" if a value is + unavailable. `StdErr` will no longer display `StdOut` if it is + not distinctly set. `StdOut` will now correctly display the + default filename pattern for job arrays, and no longer show it + for non-batch jobs. However, the expansion patterns will + no longer be substituted by default. + + Add `--segment` to job allocation to be used in topology/block. + + Add `--exclusive=topo` for use with topology/block. 
+ + `squeue` - Add `--expand-patterns` option to expand `StdErr`, + `StdOut`, `StdIn` filename patterns as best as possible. + + `sacct` - Add `--expand-patterns` option to expand `StdErr`, + `StdOut`, `StdIn` filename patterns as best as possible. + + `sreport` - Requesting `format=Planned` will now return the + expected `Planned` time as documented, instead of `PlannedDown`. + To request `Planned Down`, one must use now `format=PLNDDown` + or `format=PlannedDown` explicitly. The abbreviations + "`Pl`" or "`Pla`" will now make reference to Planned instead + of `PlannedDown`. + * API Changes + + Removed `ListIterator` type from ``. + + Removed `slurm_xlate_job_id()` from `` + * SLURMRESTD Changes + + `openapi/dbv0.0.38` and `openapi/v0.0.38` plugins have been + removed. + + `openapi/dbv0.0.39` and `openapi/v0.0.39` plugins have been + tagged as deprecated to warn of their removal in the next release. + + Changed `slurmrestd.service` to only listen on TCP socket by + default. Environments with existing drop-in units for the + service may need further adjustments to work after upgrading. + + `slurmrestd` - Tagged `script` field as deprecated in + `POST /slurm/v0.0.41/job/submit` in anticipation of removal in + future OpenAPI plugin versions. Job submissions should set the + `job.script` (or `jobs[0].script` for HetJobs) fields instead. + + `slurmrestd` - Attempt to automatically convert enumerated + string arrays with incoming non-string values into strings. + Add warning when incoming value for enumerated string arrays + can not be converted to string and silently ignore instead of + rejecting entire request. This change affects any endpoint that + uses an enumerated string as given in the OpenAPI specification. + An example of this conversion would be to + `POST /slurm/v0.0.41/job/submit` with `.job.exclusive = true`. + While the JSON (boolean) true value matches a possible + enumeration, it is not the expected "true" string.
This change + automatically converts the (boolean) `true` to (string) "`true`" + avoiding a parsing failure. + + `slurmrestd` - Add `POST /slurm/v0.0.41/job/allocate` endpoint. + This endpoint will create a new job allocation without any steps. + The allocation will need to be ended via signaling the job or + it will run to the timelimit. + + `slurmrestd` - Allow startup when `slurmdbd` is not configured + and avoid loading `slurmdbd` specific plugins. + * MPI/PMI2 Changes + + Jobs submitted with the `SLURM_HOSTFILE` environment variable + set implies using an arbitrary distribution. Nevertheless, the + logic used in PMI2 when generating their associated + `PMI_process_mapping` values has been changed and will now be + the same used for the plane distribution, as if `-m plane` were + used. This has been changed because the original arbitrary + distribution implementation did not account for multiple + instances of the same host being present in `SLURM_HOSTFILE`, + providing an incorrect process mapping in such case. This + change also enables distributing tasks in blocks when using + arbitrary distribution, which was not the case before. This + only affects `mpi`/`pmi2` plugin. + * Removed Fix-test-21.41.patch as upstream test changed. + ------------------------------------------------------------------- Mon Mar 25 15:16:44 UTC 2024 - Christian Goll - removed Keep-logs-of-skipped-test-when-running-test-cases-sequentially.patch as incoperated upstream -* Changes in Slurm 23.02.5 - * Add the JobId to debug() messages indicating when cpus_per_task/mem_per_cpu - or pn_min_cpus are being automatically adjusted. - * Fix regression in 23.02.2 that caused slurmctld -R to crash on startup if - a node features plugin is configured. - * Fix and prevent reoccurring reservations from overlapping. - * job_container/tmpfs - Avoid attempts to share BasePath between nodes. - * Change the log message warning for rate limited users from verbose to info. 
- * With CR_Cpu_Memory, fix node selection for jobs that request gres and - *-mem-per-cpu. - * Fix a regression from 22.05.7 in which some jobs were allocated too few - nodes, thus overcommitting cpus to some tasks. - * Fix a job being stuck in the completing state if the job ends while the - primary controller is down or unresponsive and the backup controller has - not yet taken over. - * Fix slurmctld segfault when a node registers with a configured CpuSpecList - while slurmctld configuration has the node without CpuSpecList. - * Fix cloud nodes getting stuck in POWERED_DOWN+NO_RESPOND state after not - registering by ResumeTimeout. - * slurmstepd - Avoid cleanup of config.json-less containers spooldir getting - skipped. - * slurmstepd - Cleanup per task generated environment for containers in - spooldir. - * Fix scontrol segfault when 'completing' command requested repeatedly in - interactive mode. - * Properly handle a race condition between bind() and listen() calls in the - network stack when running with SrunPortRange set. - * Federation - Fix revoked jobs being returned regardless of the -a/--all - option for privileged users. - * Federation - Fix canceling pending federated jobs from non-origin clusters - which could leave federated jobs orphaned from the origin cluster. - * Fix sinfo segfault when printing multiple clusters with --noheader option. - * Federation - fix clusters not syncing if clusters are added to a federation - before they have registered with the dbd. - * Change pmi2 plugin to honor the SrunPortRange option. This matches the new - behavior of the pmix plugin in 23.02.0. Note that neither of these plugins - makes use of the "MpiParams=ports=" option, and previously were only limited - by the systems ephemeral port range. - * node_features/helpers - Fix node selection for jobs requesting changeable - features with the '|' operator, which could prevent jobs from running on - some valid nodes. 
- * node_features/helpers - Fix inconsistent handling of '&' and '|', where an - AND'd feature was sometimes AND'd to all sets of features instead of just - the current set. E.g. "foo|bar&baz" was interpreted as {foo,baz} or - {bar,baz} instead of how it is documented: "{foo} or {bar,baz}". - * Fix job accounting so that when a job is requeued its allocated node count - is cleared. After the requeue, sacct will correctly show that the job has - 0 AllocNodes while it is pending or if it is canceled before restarting. - * sacct - AllocCPUS now correctly shows 0 if a job has not yet received an - allocation or if the job was canceled before getting one. - * Fix intel oneapi autodetect: detect the /dev/dri/renderD[0-9]+ gpus, and do - not detect /dev/dri/card[0*9]+. - * Format batch, extern, interactive, and pending step ids into strings that - are human readable. - * Fix node selection for jobs that request --gpus and a number of tasks fewer - than gpus, which resulted in incorrectly rejecting these jobs. - * Remove MYSQL_OPT_RECONNECT completely. - * Fix cloud nodes in POWERING_UP state disappearing (getting set to FUTURE) - when an `scontrol reconfigure` happens. - * openapi/dbv0.0.39 - Avoid assert / segfault on missing coordinators list. - * slurmrestd - Correct memory leak while parsing OpenAPI specification - templates with server overrides. - * slurmrestd - Reduce memory usage when printing out job CPU frequency. - * Fix overwriting user node reason with system message. - * Remove --uid / --gid options from salloc and srun commands. - * Prevent deadlock when rpc_queue is enabled. - * slurmrestd - Correct OpenAPI specification generation bug where fields with - overlapping parent paths would not get generated. - * Fix memory leak as a result of a partition info query. - * Fix memory leak as a result of a job info query. 
- * slurmrestd - For 'GET /slurm/v0.0.39/node[s]', change format of node's - energy field "current_watts" to a dictionary to account for unset value - instead of dumping 4294967294. - * slurmrestd - For 'GET /slurm/v0.0.39/qos', change format of QOS's - field "priority" to a dictionary to account for unset value instead of - dumping 4294967294. - * slurmrestd - For 'GET /slurm/v0.0.39/job[s]', the 'return code' code field - in v0.0.39_job_exit_code will be set to *127 instead of being left unset - where job does not have a relevant return code. - * data_parser/v0.0.39 - Add required/memory_per_cpu and - required/memory_per_node to `sacct *-json` and `sacct --yaml` and - 'GET /slurmdb/v0.0.39/jobs' from slurmrestd. - * For step allocations, fix --gres=none sometimes not ignoring gres from the - job. - * Fix --exclusive jobs incorrectly gang-scheduling where they shouldn't. - * Fix allocations with CR_SOCKET, gres not assigned to a specific socket, and - block core distribion potentially allocating more sockets than required. - * gpu/oneapi - Store cores correctly so CPU affinity is tracked. - * Revert a change in 23.02.3 where Slurm would kill a script's process group - as soon as the script ended instead of waiting as long as any process in - that process group held the stdout/stderr file descriptors open. That change - broke some scripts that relied on the previous behavior. Setting time limits - for scripts (such as PrologEpilogTimeout) is strongly encouraged to avoid - Slurm waiting indefinitely for scripts to finish. - * Allow slurmdbd -R to work if the root assoc id is not 1. - * Fix slurmdbd -R not returning an error under certain conditions. - * slurmdbd - Avoid potential NULL pointer dereference in the mysql plugin. - * Revert a change in 23.02 where SLURM_NTASKS was no longer set in the job's - environment when *-ntasks-per-node was requested. - * Limit periodic node registrations to 50 instead of the full TreeWidth. 
- Since unresolvable cloud/dynamic nodes must disable fanout by setting - TreeWidth to a large number, this would cause all nodes to register at - once. - * Fix regression in 23.02.3 which broken x11 forwarding for hosts when - MUNGE sends a localhost address in the encode host field. This is caused - when the node hostname is mapped to 127.0.0.1 (or similar) in /etc/hosts. - * openapi/[db]v0.0.39 - fix memory leak on parsing error. - * data_parser/v0.0.39 - fix updating qos for associations. - * openapi/dbv0.0.39 - fix updating values for associations with null users. - * Fix minor memory leak with --tres-per-task and licenses. - * Fix cyclic socket cpu distribution for tasks in a step where - --cpus-per-task < usable threads per core. +- Changes in Slurm 23.02.5 + * Add the `JobId` to `debug()` messages indicating when + `cpus_per_task/mem_per_cpu` or `pn_min_cpus` are being + automatically adjusted. + * Fix regression in 23.02.2 that caused `slurmctld -R` to crash on + startup if a node features plugin is configured. + * Fix and prevent reoccurring reservations from overlapping. + * `job_container/tmpfs` - Avoid attempts to share `BasePath` + between nodes. + * Change the log message warning for rate limited users from + verbose to info. + * With `CR_Cpu_Memory`, fix node selection for jobs that request + gres and `--mem-per-cpu`. + * Fix a regression from 22.05.7 in which some jobs were allocated + too few nodes, thus overcommitting cpus to some tasks. + * Fix a job being stuck in the completing state if the job ends + while the primary controller is down or unresponsive and the + backup controller has not yet taken over. + * Fix `slurmctld` segfault when a node registers with a configured + `CpuSpecList` while slurmctld configuration has the node without + `CpuSpecList`. + * Fix cloud nodes getting stuck in `POWERED_DOWN+NO_RESPOND` state + after not registering by `ResumeTimeout`. 
+ * `slurmstepd` - Avoid cleanup of `config.json`-less containers + spooldir getting skipped. + * `slurmstepd` - Cleanup per task generated environment for + containers in spooldir. + * Fix `scontrol` segfault when 'completing' command requested + repeatedly in interactive mode. + * Properly handle a race condition between `bind()` and `listen()` + calls in the network stack when running with `SrunPortRange` set. + * Federation - Fix revoked jobs being returned regardless of the + `-a`/`--all` option for privileged users. + * Federation - Fix canceling pending federated jobs from non-origin + clusters which could leave federated jobs orphaned from the origin + cluster. + * Fix sinfo segfault when printing multiple clusters with + `--noheader` option. + * Federation - fix clusters not syncing if clusters are added to + a federation before they have registered with the dbd. + * Change `pmi2` plugin to honor the `SrunPortRange` option. This + matches the new behavior of the pmix plugin in 23.02.0. Note that + neither of these plugins makes use of the "`MpiParams=ports=`" + option, and previously were only limited by the system's ephemeral + port range. + * `node_features/helpers` - Fix node selection for jobs requesting + changeable features with the '`|`' operator, which could prevent + jobs from running on some valid nodes. + * `node_features/helpers` - Fix inconsistent handling of '`&`' and + '`|`', where an AND'd feature was sometimes AND'd to all sets of + features instead of just the current set. E.g. "`foo|bar&baz`" was + interpreted as `{foo,baz}` or `{bar,baz}` instead of how it is + documented: "`{foo} or {bar,baz}`". + * Fix job accounting so that when a job is requeued its allocated + node count is cleared. After the requeue, sacct will correctly + show that the job has 0 `AllocNodes` while it is pending or if + it is canceled before restarting.
+ * `sacct` - `AllocCPUS` now correctly shows 0 if a job has not yet + received an allocation or if the job was canceled before getting + one. + * Fix intel oneapi autodetect: detect the `/dev/dri/renderD[0-9]+` + gpus, and do not detect `/dev/dri/card[0-9]+`. + * Format batch, extern, interactive, and pending step ids into + strings that are human readable. + * Fix node selection for jobs that request `--gpus` and a number + of tasks fewer than gpus, which resulted in incorrectly rejecting + these jobs. + * Remove `MYSQL_OPT_RECONNECT` completely. + * Fix cloud nodes in `POWERING_UP` state disappearing (getting set + to `FUTURE`) when an `scontrol reconfigure` happens. + * `openapi/dbv0.0.39` - Avoid assert / segfault on missing + coordinators list. + * `slurmrestd` - Correct memory leak while parsing OpenAPI + specification templates with server overrides. + * `slurmrestd` - Reduce memory usage when printing out job CPU + frequency. + * Fix overwriting user node reason with system message. + * Remove `--uid` / `--gid` options from salloc and srun commands. + * Prevent deadlock when rpc_queue is enabled. + * `slurmrestd` - Correct OpenAPI specification generation bug where + fields with overlapping parent paths would not get generated. + * Fix memory leak as a result of a partition info query. + * Fix memory leak as a result of a job info query. + * slurmrestd - For `GET /slurm/v0.0.39/node[s]`, change format of + node's energy field `current_watts` to a dictionary to account + for unset value instead of dumping `4294967294`. + * `slurmrestd` - For `GET /slurm/v0.0.39/qos`, change format of + QOS's field `priority` to a dictionary to account for unset + value instead of dumping `4294967294`. + * `slurmrestd` - For `GET /slurm/v0.0.39/job[s]`, the `return code` + code field in `v0.0.39_job_exit_code` will be set to 127 instead + of being left unset where job does not have a relevant return code. 
+ * `data_parser/v0.0.39` - Add `required/memory_per_cpu` and + `required/memory_per_node` to `sacct --json` and `sacct --yaml` and + `GET /slurmdb/v0.0.39/jobs` from `slurmrestd`. + * For step allocations, fix `--gres=none` sometimes not ignoring + gres from the job. + * Fix `--exclusive` jobs incorrectly gang-scheduling where they + shouldn't. + * Fix allocations with `CR_SOCKET`, gres not assigned to a specific + socket, and block core distribution potentially allocating more + sockets than required. + * `gpu/oneapi` - Store cores correctly so CPU affinity is tracked. + * Revert a change in 23.02.3 where Slurm would kill a script's + process group as soon as the script ended instead of waiting as + long as any process in + that process group held the stdout/stderr file descriptors open. + That change broke some scripts that relied on the previous + behavior. Setting time limits for scripts (such as + `PrologEpilogTimeout`) is strongly encouraged to avoid Slurm + waiting indefinitely for scripts to finish. + * Allow `slurmdbd -R` to work if the root assoc id is not 1. + * Fix `slurmdbd -R` not returning an error under certain conditions. + * `slurmdbd` - Avoid potential NULL pointer dereference in the + mysql plugin. + * Revert a change in 23.02 where `SLURM_NTASKS` was no longer + set in the job's environment when `--ntasks-per-node` was + requested. + * Limit periodic node registrations to 50 instead of the full + `TreeWidth`. + Since unresolvable `cloud/dynamic` nodes must disable fanout by + setting `TreeWidth` to a large number, this would cause all nodes + to register at once. + * Fix regression in 23.02.3 which broke x11 forwarding for hosts + when `MUNGE` sends a localhost address in the encode host field. + This is caused when the node hostname is mapped to 127.0.0.1 + (or similar) in `/etc/hosts`. + * `openapi/[db]v0.0.39` - fix memory leak on parsing error. + * `data_parser/v0.0.39` - fix updating qos for associations.
+ * `openapi/dbv0.0.39` - fix updating values for associations with + null users. + * Fix minor memory leak with `--tres-per-task` and licenses. + * Fix cyclic socket cpu distribution for tasks in a step where + `--cpus-per-task` < usable threads per core. - Changes in Slurm 23.02.4 - * Fix sbatch return code when **wait is requested on a job array. - * switch/hpe_slingshot * avoid segfault when running with old libcxi. - * Avoid slurmctld segfault when specifying AccountingStorageExternalHost. - * Fix collected GPUUtilization values for acct_gather_profile plugins. + * Fix `sbatch` return code when --wait is requested on a job array. + * `switch/hpe_slingshot` - avoid segfault when running with old + libcxi. + * Avoid slurmctld segfault when specifying + `AccountingStorageExternalHost`. + * Fix collected `GPUUtilization` values for `acct_gather_profile` + plugins. * Fix slurmrestd handling of job hold/release operations. - * Make spank S_JOB_ARGV item value hold the requested command argv instead of - the srun **bcast value when **bcast requested (only in local context). - * Fix step running indefinitely when slurmctld takes more than MessageTimeout - to respond. Now, slurmctld will cancel the step when detected, preventing - following steps from getting stuck waiting for resources to be released. - * Fix regression to make job_desc.min_cpus accurate again in job_submit when - requesting a job with **ntasks*per*node. - * scontrol * Permit changes to StdErr and StdIn for pending jobs. - * scontrol * Reset std{err,in,out} when set to empty string. - * slurmrestd * mark environment as a required field for job submission - descriptions. - * slurmrestd * avoid dumping null in OpenAPI schema required fields. - * data_parser/v0.0.39 * avoid rejecting valid memory_per_node formatted as - dictionary provided with a job description. - * data_parser/v0.0.39 * avoid rejecting valid memory_per_cpu formatted as - dictionary provided with a job description. 
- * slurmrestd * Return HTTP error code 404 when job query fails. - * slurmrestd * Add return schema to error response to job and license query. + * Make spank `S_JOB_ARGV` item value hold the requested command + argv instead of the srun `--bcast` value when `--bcast` requested + (only in local context). + * Fix step running indefinitely when slurmctld takes more than + `MessageTimeout` to respond. Now, `slurmctld` will cancel the + step when detected, preventing following steps from getting stuck + waiting for resources to be released. + * Fix regression to make job_desc.min_cpus accurate again in + job_submit when requesting a job with `--ntasks-per-node`. + * `scontrol` - Permit changes to `StdErr` and `StdIn` for pending + jobs. + * `scontrol` - Reset std{err,in,out} when set to empty string. + * `slurmrestd` - mark environment as a required field for job + submission descriptions. + * `slurmrestd` - avoid dumping null in OpenAPI schema required + fields. + * `data_parser/v0.0.39` - avoid rejecting valid `memory_per_node` + formatted as dictionary provided with a job description. + * `data_parser/v0.0.39` - avoid rejecting valid `memory_per_cpu` + formatted as dictionary provided with a job description. + * `slurmrestd` - Return HTTP error code 404 when job query fails. + * `slurmrestd` - Add return schema to error response to job and + license query. * Fix handling of ArrayTaskThrottle in backfill. - * Fix regression in 23.02.2 when checking gres state on slurmctld startup or - reconfigure. Gres changes in the configuration were not updated on slurmctld - startup. On startup or reconfigure, these messages were present in the log: - "error: Attempt to change gres/gpu Count". + * Fix regression in 23.02.2 when checking gres state on `slurmctld` + startup or reconfigure. Gres changes in the configuration were + not updated on `slurmctld` startup. On startup or reconfigure, + these messages were present in the log: + "`error: Attempt to change gres/gpu Count`".
* Fix potential double count of gres when dealing with limits. - * switch/hpe_slingshot * support alternate traffic class names with "TC_" - prefix. - * scrontab * Fix cutting off the final character of quoted variables. - * Fix slurmstepd segfault when ContainerPath is not set in oci.conf - * Change the log message warning for rate limited users from debug to verbose. - * Fixed an issue where jobs requesting licenses were incorrectly rejected. - * smail * Fix issues where e*mails at job completion were not being sent. - * scontrol/slurmctld * fix comma parsing when updating a reservation's nodes. - * cgroup/v2 * Avoid capturing log output for ebpf when constraining devices, - as this can lead to inadvertent failure if the log buffer is too small. - * Fix **gpu*bind=single binding tasks to wrong gpus, leading to some gpus - having more tasks than they should and other gpus being unused. - * Fix main scheduler loop not starting after failover to backup controller. - * Added error message when attempting to use sattach on batch or extern steps. - * Fix regression in 23.02 that causes slurmstepd to crash when srun requests - more than TreeWidth nodes in a step and uses the pmi2 or pmix plugin. - * Reject job ArrayTaskThrottle update requests from unprivileged users. - * data_parser/v0.0.39 * populate description fields of property objects in - generated OpenAPI specifications where defined. - * slurmstepd * Avoid segfault caused by ContainerPath not being terminated by - '/' in oci.conf. - * data_parser/v0.0.39 * Change v0.0.39_job_info response to tag exit_code - field as being complex instead of only an unsigned integer. - * job_container/tmpfs * Fix %h and %n substitution in BasePath where %h was - substituted as the NodeName instead of the hostname, and %n was substituted - as an empty string. - * Fix regression where **cpu*bind=verbose would override TaskPluginParam. - * scancel * Fix **clusters/*M for federations. Only filtered jobs (e.g. *A, - *u, *p, etc.) 
from the specified clusters will be canceled, rather than all - jobs in the federation. Specific jobids will still be routed to the origin - cluster for cancellation. - + * `switch/hpe_slingshot` - support alternate traffic class names + with "`TC_`" prefix. + * `scrontab` - Fix cutting off the final character of quoted + variables. + * Fix `slurmstepd` segfault when `ContainerPath` is not set in + `oci.conf`. + * Change the log message warning for rate limited users from + debug to verbose. + * Fixed an issue where jobs requesting licenses were incorrectly + rejected. + * `smail` - Fix issues where emails at job completion were not + being sent. + * `scontrol/slurmctld` - fix comma parsing when updating a + reservation's nodes. + * `cgroup/v2` - Avoid capturing log output for ebpf when + constraining devices, as this can lead to inadvertent failure + if the log buffer is too small. + * Fix --gpu-bind=single binding tasks to wrong gpus, leading to + some gpus having more tasks than they should and other gpus being + unused. + * Fix main scheduler loop not starting after failover to backup + controller. + * Added error message when attempting to use sattach on batch or + extern steps. + * Fix regression in 23.02 that causes slurmstepd to crash when + `srun` requests more than `TreeWidth` nodes in a step and uses + the `pmi2` or `pmix` plugin. + * Reject job `ArrayTaskThrottle` update requests from unprivileged + users. + * `data_parser/v0.0.39` - populate description fields of property + objects in generated OpenAPI specifications where defined. + * `slurmstepd` - Avoid segfault caused by ContainerPath not being + terminated by '`/`' in `oci.conf`. + * `data_parser/v0.0.39` - Change `v0.0.39_job_info` response to tag + `exit_code` field as being complex instead of only an unsigned + integer. 
+ * `job_container/tmpfs` - Fix %h and %n substitution in `BasePath` + where `%h` was substituted as the `NodeName` instead of the + hostname, and `%n` was substituted as an empty string. + * Fix regression where --cpu-bind=verbose would override + `TaskPluginParam`. + * `scancel` - Fix `--clusters`/`-M` for federations. Only filtered + jobs (e.g. -A, -u, -p, etc.) from the specified clusters will be + canceled, rather than all jobs in the federation. + Specific jobids will still be routed to the origin cluster + for cancellation. ------------------------------------------------------------------- Mon Jan 29 13:47:55 UTC 2024 - Egbert Eich @@ -2337,7 +2662,6 @@ Fri Jul 2 08:01:32 UTC 2021 - Christian Goll - Updated to 20.11.8: * slurmctld - fix erroneous "StepId=CORRUPT" messages in error logs. * Correct the error given when auth plugin fails to pack a credential. - * Fix unused-variable compiler warning on FreeBSD in fd_resolve_path(). * acct_gather_filesystem/lustre - only emit collection error once per step. * Add GRES environment variables (e.g., CUDA_VISIBLE_DEVICES) into the interactive step, the same as is done for the batch step. diff --git a/slurm.spec b/slurm.spec index 40d7b7d..885b003 100644 --- a/slurm.spec +++ b/slurm.spec @@ -19,7 +19,7 @@ # Check file META in sources: update so_version to (API_CURRENT - API_AGE) %define so_version 41 # Make sure to update `upgrades` as well! -%define ver 24.05.0 +%define ver 24.05.3 %define _ver _24_05 %define dl_ver %{ver} # so-version is 0 and seems to be stable @@ -59,6 +59,9 @@ ExclusiveArch: do_not_build %if 0%{?sle_version} == 150500 || 0%{?sle_version} == 150600 %define base_ver 2302 %endif +%if 0%{?sle_version} == 150500 || 0%{?sle_version} == 150600 +%define base_ver 2302 +%endif %define ver_m %{lua:x=string.gsub(rpm.expand("%ver"),"%.[^%.]*$","");print(x)} # Keep format_spec_file from botching the define below: