Compare commits


No commits in common. "factory" and "factory" have entirely different histories.

5 changed files with 281 additions and 748 deletions

Fix-test-21.41.patch Normal file

@@ -0,0 +1,65 @@
From: Egbert Eich <eich@suse.com>
Date: Wed Jun 22 14:39:10 2022 +0200
Subject: Fix test 21.41
Patch-mainline: Not yet
Git-repo: https://github.com/SchedMD/slurm
Git-commit: 21619ffa15d1d656ee11a477ebb8215a06387fdd
References:
Since expect is not line-oriented, the output is not matched line by line.
Thus the order in which results are returned by sacctmgr actually matters:
if the first test case matches what is returned first, that part of the
buffer is consumed. If the second test case then matches what is left
over, the test succeeds. If this is not the case, i.e. if the first
pattern matches a part that is actually sent later, the earlier parts
are discarded and will never match.
To make the test resilient to a different order of results, the test has
been rewritten to contain only a single match pattern.
Signed-off-by: Egbert Eich <eich@suse.com>
Signed-off-by: Egbert Eich <eich@suse.de>
---
testsuite/expect/test21.41 | 30 +++++++++++++++---------------
1 file changed, 15 insertions(+), 15 deletions(-)
diff --git a/testsuite/expect/test21.41 b/testsuite/expect/test21.41
index c0961522db..1fd921a48f 100755
--- a/testsuite/expect/test21.41
+++ b/testsuite/expect/test21.41
@@ -372,21 +372,21 @@ expect {
-re "There was a problem" {
fail "There was a problem with the sacctmgr command"
}
- -re "$user1.$wckey1.($number)." {
- set user1wckey1 $expect_out(1,string)
- exp_continue
- }
- -re "$user2.$wckey1.($number)." {
- set user2wckey1 $expect_out(1,string)
- exp_continue
- }
- -re "$user1.$wckey2.($number)." {
- set user1wckey2 $expect_out(1,string)
- exp_continue
- }
- -re "$user2.$wckey2.($number)." {
- set user2wckey2 $expect_out(1,string)
- exp_continue
+ -re "($user1|$user2).($wckey1|$wckey2).($number)." {
+ if { $expect_out(1,string) eq $user1 } {
+ if { $expect_out(2,string) eq $wckey1 } {
+ set user1wckey1 $expect_out(3,string)
+ } elseif { $expect_out(2,string) eq $wckey2 } {
+ set user1wckey2 $expect_out(3,string)
+ }
+ } elseif { $expect_out(1,string) eq $user2 } {
+ if { $expect_out(2,string) eq $wckey1 } {
+ set user2wckey1 $expect_out(3,string)
+ } elseif { $expect_out(2,string) eq $wckey2 } {
+ set user2wckey2 $expect_out(3,string)
+ }
+ }
+ exp_continue
}
timeout {
fail "sacctmgr wckeys not responding"
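The ordering problem described in the patch header can be sketched outside of Tcl. The following Python sketch (hypothetical names and separators, not part of the test suite) mimics expect's behavior of discarding everything in the buffer up to and including a match, and shows why a single combined pattern is order-insensitive:

```python
import re

def match_sequentially(stream, patterns):
    """Mimic per-pattern expect matching: each successful match
    discards the buffer up to and including the matched text, so a
    later pattern only sees what the previous match left behind."""
    buf, results = stream, {}
    for name, pat in patterns:
        m = re.search(pat, buf)
        if m:
            results[name] = m.group(1)
            buf = buf[m.end():]  # earlier, not-yet-matched output is lost here
    return results

def match_combined(stream):
    """Mimic the rewritten test: one pattern matches every result line,
    so the order in which the results arrive is irrelevant."""
    return {m.group(1) + m.group(2): m.group(3)
            for m in re.finditer(r"(user1|user2)\.(wckey1|wckey2)\.(\d+)",
                                 stream)}

# Results arrive in the "wrong" order for the sequential patterns:
out = "user2.wckey1.20\nuser1.wckey1.10\n"
seq = match_sequentially(out, [("user1wckey1", r"user1\.wckey1\.(\d+)"),
                               ("user2wckey1", r"user2\.wckey1\.(\d+)")])
print(seq)                  # the user1 match consumed the user2 line
print(match_combined(out))  # both values are recovered
```

The first pattern's match consumes the earlier `user2` line, so the second pattern finds nothing; the combined pattern recovers both values regardless of ordering.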

slurm-23.11.5.tar.bz2 Normal file

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:7a8f4b1b46d3a8ec9a95066b04635c97f9095877f6189a8ff7388e5e74daeef3
size 7365175


@@ -1,3 +0,0 @@
version https://git-lfs.github.com/spec/v1
oid sha256:240a2105c8801bc0d222fa2bbcf46f71392ef94cce9253357e5f43f029adaf9b
size 7183430


@@ -1,733 +1,182 @@
-------------------------------------------------------------------
Fri Nov 1 12:50:27 UTC 2024 - Egbert Eich <eich@suse.com>
- Update to version 24.05.4 & fix for CVE-2024-48936.
* Fix generic int sort functions.
* Fix user look up using possible unrealized uid in the dbd.
* `slurmrestd` - Fix regressions that allowed `slurmrestd` to
be run as SlurmUser when `SlurmUser` was not root.
* `mpi/pmix` - Fix race conditions with het jobs at step start/end
which could make `srun` hang.
* Fix not showing some `SelectTypeParameters` in `scontrol show
config`.
* Avoid assert when dumping removed certain fields in JSON/YAML.
* Improve how shards are scheduled with affinity in mind.
* Fix `MaxJobsAccruePU` not being respected when `MaxJobsAccruePA`
is set in the same QOS.
* Prevent backfill from planning jobs that use overlapping
resources for the same time slot if the job's time limit is
less than `bf_resolution`.
* Fix memory leak when requesting typed gres and
`--[cpus|mem]-per-gpu`.
* Prevent backfill from breaking out due to "system state
changed" every 30 seconds if reservations use `REPLACE` or
`REPLACE_DOWN` flags.
* `slurmrestd` - Make sure that scheduler_unset parameter defaults
to true even when the following flags are also set:
`show_duplicates`, `skip_steps`, `disable_truncate_usage_time`,
`run_away_jobs`, `whole_hetjob`, `disable_whole_hetjob`,
`disable_wait_for_result`, `usage_time_as_submit_time`,
`show_batch_script`, and/or `show_job_environment`. Additionally,
always make sure `show_duplicates` and
`disable_truncate_usage_time` default to true when the following
flags are also set: `scheduler_unset`, `scheduled_on_submit`,
`scheduled_by_main`, `scheduled_by_backfill`, and/or `job_started`.
This affects the following endpoints:
`GET /slurmdb/v0.0.40/jobs`
`GET /slurmdb/v0.0.41/jobs`
* Ignore `--json` and `--yaml` options for `scontrol` show config
to prevent mixing output types.
* Fix not considering nodes in reservations with Maintenance or
Overlap flags when creating new reservations with `nodecnt` or
when they replace down nodes.
* Fix suspending/resuming steps running under a 23.02 `slurmstepd`
process.
* Fix options like `sprio --me` and `squeue --me` for users with
a uid greater than 2147483647.
* `fatal()` if `BlockSizes=0`. This value is invalid and would
otherwise cause the `slurmctld` to crash.
* `sacctmgr` - Fix issue where clearing out a preemption list using
`preempt=''` would cause the given qos to no longer be preempt-able
until set again.
* Fix `stepmgr` creating job steps concurrently.
* `data_parser/v0.0.40` - Avoid dumping "Infinity" for `NO_VAL` tagged
"number" fields.
* `data_parser/v0.0.41` - Avoid dumping "Infinity" for `NO_VAL` tagged
"number" fields.
* `slurmctld` - Fix a potential leak while updating a reservation.
* `slurmctld` - Fix state save with reservation flags when an update
fails.
* Fix reservation update issues with parameters Accounts and Users, when
using +/- signs.
* `slurmrestd` - Don't dump warning on empty wckeys in:
`GET /slurmdb/v0.0.40/config`
`GET /slurmdb/v0.0.41/config`
* Fix slurmd possibly leaving zombie processes on start up in configless
when the initial attempt to fetch the config fails.
* Fix crash when trying to drain a non-existing node (possibly deleted
before).
* `slurmctld` - Fix segfault when calculating limit decay for jobs with
an invalid association.
* Fix IPMI energy gathering with multiple sensors.
* `data_parser/v0.0.39` - Remove xassert requiring errors and warnings
to have a source string.
* `slurmrestd` - Prevent potential segfault when there is an error
parsing an array field which could lead to a double xfree. This
applies to several endpoints in `data_parser` v0.0.39, v0.0.40 and
v0.0.41.
* `scancel` - Fix a regression from 23.11.6 where using both the
`--ctld` and `--sibling` options would cancel the federated job on
all clusters instead of only the cluster(s) specified by `--sibling`.
* `accounting_storage/mysql` - Fix bug when removing an association
specified with an empty partition.
* Fix setting multiple partition state restore on a job correctly.
* Fix difference in behavior when swapping partition order in job
submission.
* Fix security issue in stepmgr that could permit an attacker to
execute processes under other users' jobs. CVE-2024-48936.
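One of the fixes above concerns `sprio --me` and `squeue --me` for uids greater than 2147483647 (INT32_MAX). As a minimal sketch of the presumed failure mode (an assumption, not the actual Slurm code), a uid that round-trips through a signed 32-bit field wraps negative:

```python
import ctypes

def through_int32(uid):
    """Round-trip a uid through a signed 32-bit slot, as a C int would."""
    return ctypes.c_int32(uid).value

print(through_int32(1000))        # small uids survive unchanged
print(through_int32(3000000000))  # uids past INT32_MAX wrap negative
```

Any comparison against the caller's real uid then fails for such users, which is consistent with the `--me` filters misbehaving.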
-------------------------------------------------------------------
Wed Oct 23 08:54:29 UTC 2024 - Egbert Eich <eich@suse.com>
- Add `%{?sysusers_requires}` to slurm-config.
This fixes issues when building against Slurm.
-------------------------------------------------------------------
Mon Oct 14 10:40:10 UTC 2024 - Egbert Eich <eich@suse.com>
- Update to version 24.05.3
* `data_parser/v0.0.40` - Added field descriptions.
* `slurmrestd` - Avoid creating new slurmdbd connection per request
to `* /slurm/slurmctld/*/*` endpoints.
* Fix compilation issue with `switch/hpe_slingshot` plugin.
* Fix gres per task allocation with threads-per-core.
* `data_parser/v0.0.41` - Added field descriptions.
* `slurmrestd` - Change back generated OpenAPI schema for
`DELETE /slurm/v0.0.40/jobs/` to `RequestBody` instead of using
parameters for request. `slurmrestd` will continue to accept
endpoint requests via `RequestBody` or HTTP query.
* `topology/tree` - Fix issues with switch distance optimization.
* Fix potential segfault of secondary `slurmctld` when falling back
to the primary when running with a `JobComp` plugin.
* Enable `--json`/`--yaml=v0.0.39` options on client commands to
dump data using data_parser/v0.0.39 instead of outputting nothing.
* `switch/hpe_slingshot` - Fix issue that could result in a 0 length
state file.
* Fix unnecessary message protocol downgrade for unregistered nodes.
* Fix unnecessarily packing alias addrs when terminating jobs with
a mix of non-cloud/dynamic nodes and powered down cloud/dynamic
nodes.
* `accounting_storage/mysql` - Fix issue when deleting a qos that
could remove too many commas from the qos and/or delta_qos fields
of the assoc table.
* `slurmctld` - Fix memory leak when using RestrictedCoresPerGPU.
* Fix allowing access to reservations without `MaxStartDelay` set.
* Fix regression introduced in 24.05.0rc1 breaking
`srun --send-libs` parsing.
* Fix slurmd vsize memory leak when using job submission/allocation
commands that implicitly or explicitly use `--get-user-env`.
* `slurmd` - Fix node going into invalid state when using
`CPUSpecList` and setting CPUs to the number of cores on a
multithreaded node.
* Fix reboot asap nodes being considered in backfill after a restart.
* Fix `--clusters`/`-M queries` for clusters outside of a
federation when `fed_display` is configured.
* Fix `scontrol` allowing updating job with bad cpus-per-task value.
* `sattach` - Fix regression from 24.05.2 security fix leading to
crash.
* `mpi/pmix` - Fix assertion when built under `--enable-debug`.
- Changes from Slurm 24.05.2
* Fix energy gathering rpc counter underflow in
`_rpc_acct_gather_energy` when more than 10 threads try to get
energy at the same time. This prevented any step from getting
energy from slurmd until slurmd was restarted, losing energy
accounting metrics on the node.
* `accounting_storage/mysql` - Fix issue where new user with `wckey`
did not have a default wckey sent to the slurmctld.
* `slurmrestd` - Prevent slurmrestd segfault when handling the
following endpoints when none of the optional parameters are
specified:
`DELETE /slurm/v0.0.40/jobs`
`DELETE /slurm/v0.0.41/jobs`
`GET /slurm/v0.0.40/shares`
`GET /slurm/v0.0.41/shares`
`GET /slurmdb/v0.0.40/instance`
`GET /slurmdb/v0.0.41/instance`
`GET /slurmdb/v0.0.40/instances`
`GET /slurmdb/v0.0.41/instances`
`POST /slurm/v0.0.40/job/{job_id}`
`POST /slurm/v0.0.41/job/{job_id}`
* Fix IPMI energy gathering when no IPMIPowerSensors are specified
in `acct_gather.conf`. This situation resulted in an accounted
energy of 0 for job steps.
* Fix a minor memory leak in slurmctld when updating a job dependency.
* `scontrol`,`squeue` - Fix regression that caused incorrect values
for multisocket nodes at `.jobs[].job_resources.nodes.allocation`
for `scontrol show jobs --(json|yaml)` and `squeue --(json|yaml)`.
* `slurmrestd` - Fix regression that caused incorrect values for
multisocket nodes at `.jobs[].job_resources.nodes.allocation` to
be dumped with endpoints:
`GET /slurm/v0.0.41/job/{job_id}`
`GET /slurm/v0.0.41/jobs`
* `jobcomp/filetxt` - Fix truncation of job record lines > 1024
characters.
* `switch/hpe_slingshot` - Drain node on failure to delete CXI
services.
* Fix a performance regression from 23.11.0 in cpu frequency
handling when no `CpuFreqDef` is defined.
* Fix one-task-per-sharing not working across multiple nodes.
* Fix inconsistent number of cpus when creating a reservation
using the TRESPerNode option.
* `data_parser/v0.0.40+` - Fix job state parsing which could
break filtering.
* Prevent `cpus-per-task` to be modified in jobs where a `-c`
value has been explicitly specified and the requested memory
constraints implicitly increase the number of CPUs to allocate.
* `slurmrestd` - Fix regression where args `-s v0.0.39,dbv0.0.39`
and `-d v0.0.39` would result in `GET /openapi/v3` not
registering as a valid possible query resulting in 404 errors.
* `slurmrestd` - Fix memory leak for dbv0.0.39 jobs query which
occurred if the query parameters specified account, association,
cluster, constraints, format, groups, job_name, partition, qos,
reason, reservation, state, users, or wckey. This affects the
following endpoints:
`GET /slurmdb/v0.0.39/jobs`
* `slurmrestd` - In the case the slurmdbd does not respond to a
persistent connection init message, prevent the closed fd from
being used, and instead emit an error or warning depending on
if the connection was required.
* Fix 24.05.0 regression that caused the slurmdbd not to send back
an error message if there is an error initializing a persistent
connection.
* Reduce latency of forwarded x11 packets.
* Add `curr_dependency` (representing the current dependency of
the job) and `orig_dependency` (representing the original requested
dependency of the job) fields to the job record in
`job_submit.lua` (for job update) and `jobcomp.lua`.
* Fix potential segfault of slurmctld configured with
`SlurmctldParameters=enable_rpc_queue` from happening on
reconfigure.
* Fix potential segfault of slurmctld on its shutdown when rate
limiting is enabled.
* `slurmrestd` - Fix missing job environment for `SLURM_JOB_NAME`,
`SLURM_OPEN_MODE`, `SLURM_JOB_DEPENDENCY`, `SLURM_PROFILE`,
`SLURM_ACCTG_FREQ`, `SLURM_NETWORK` and `SLURM_CPU_FREQ_REQ` to
match sbatch.
* Fix GRES environment variable indices being incorrect when only
using a subset of all GPUs on a node and the
`--gres-flags=allow-task-sharing` option.
* Prevent `scontrol` from segfaulting when requesting scontrol
show reservation `--json` or `--yaml` if there is an error
retrieving reservations from the `slurmctld`.
* `switch/hpe_slingshot` - Fix security issue around managing VNI
access. CVE-2024-42511.
* `switch/nvidia_imex` - Fix security issue managing IMEX channel
access. CVE-2024-42511.
* `switch/nvidia_imex` - Allow for compatibility with
`job_container/tmpfs`.
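The energy-gathering fix at the top of this 24.05.2 list is a counter underflow. A generic sketch (not Slurm's code) of why underflow is harmful: decrementing an unsigned counter past zero wraps to a huge value, so any "too many in flight" threshold check then trips permanently:

```python
MASK32 = 0xFFFFFFFF  # unsigned 32-bit arithmetic, as in C

def dec(counter):
    """Decrement with C-style unsigned 32-bit wraparound."""
    return (counter - 1) & MASK32

c = 0
c = dec(c)       # one extra decrement past zero...
print(c)         # ...wraps to a huge positive value, not -1
print(c > 10)    # a "too many concurrent requests?" guard now always fires
```

Once wrapped, the counter never recovers without a restart, matching the described symptom of energy gathering staying broken until slurmd was restarted.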
- Changes in Slurm 24.05.1
* Fix `slurmctld` and `slurmdbd` potentially stopping instead of
performing a logrotate when receiving `SIGUSR2` when using
`auth/slurm`.
* `switch/hpe_slingshot` - Fix slurmctld crash when upgrading
from 23.02.
* Fix "Could not find group" errors from `validate_group()` when
using `AllowGroups` with large `/etc/group` files.
* Add `AccountingStoreFlags=no_stdio`, which when set avoids
recording the stdio paths of the job.
* `slurmrestd` - Prevent a slurmrestd segfault when parsing the
`crontab` field, which was never usable. Now it explicitly
ignores the value and emits a warning if it is used for the
following endpoints:
`POST /slurm/v0.0.39/job/{job_id}`
`POST /slurm/v0.0.39/job/submit`
`POST /slurm/v0.0.40/job/{job_id}`
`POST /slurm/v0.0.40/job/submit`
`POST /slurm/v0.0.41/job/{job_id}`
`POST /slurm/v0.0.41/job/submit`
`POST /slurm/v0.0.41/job/allocate`
* `mpi/pmi2` - Fix communication issue leading to task launch
failure with "`invalid kvs seq from node`".
* Fix getting user environment when using sbatch with
`--get-user-env` or `--export=` when there is a user profile
script that reads `/proc`.
* Prevent slurmd from crashing if `acct_gather_energy/gpu` is
configured but `GresTypes` is not configured.
* Do not log the following errors when `AcctGatherEnergyType`
plugins are used but a node does not have or cannot find sensors:
"`error: _get_joules_task: can't get info from slurmd`"
"`error: slurm_get_node_energy: Zero Bytes were transmitted or
received`"
However, the following error will continue to be logged:
"`error: Can't get energy data. No power sensors are available.
Try later`"
* `sbatch`, `srun` - Set `SLURM_NETWORK` environment variable if
`--network` is set.
* Fix cloud nodes not being able to forward to nodes that restarted
with new IP addresses.
* Fix cwd not being set correctly when running a SPANK plugin with a
`spank_user_init()` hook and the new "`contain_spank`" option set.
* `slurmctld` - Avoid deadlock during shutdown when `auth/slurm`
is active.
* Fix segfault in `slurmctld` with `topology/block`.
* `sacct` - Fix printing of job group for job steps.
* `scrun` - Log when an invalid environment variable causes the
job submission to be rejected.
* `accounting_storage/mysql` - Fix problem where listing or
modifying an association when specifying a qos list could hang
or take a very long time.
* `gpu/nvml` - Fix `gpuutil/gpumem` only tracking last GPU in step.
Now, `gpuutil/gpumem` will record sums of all GPUS in the step.
* Fix error in `scrontab` jobs when using
`slurm.conf:PropagatePrioProcess=1`.
* Fix `slurmctld` crash on a batch job submission with
`--nodes 0,...`.
* Fix dynamic IP address fanout forwarding when using `auth/slurm`.
* Restrict listening sockets in the `mpi/pmix` plugin and `sattach`
to the `SrunPortRange`.
* `slurmrestd` - Limit mime types returned from query to
`GET /openapi/v3` to only return one mime type per serializer
plugin to fix issues with OpenAPI client generators that are
unable to handle multiple mime type aliases.
* Fix many commands possibly reporting an "`Unexpected Message
Received`" when in reality the connection timed out.
* Prevent slurmctld from starting if there is not a json
serializer present and the `extra_constraints` feature is enabled.
* Fix heterogeneous job components not being signaled with
`scancel --ctld` and `DELETE slurm/v0.0.40/jobs` if the job ids
are not explicitly given, the heterogeneous job components match
the given filters, and the heterogeneous job leader does not
match the given filters.
* Fix regression from 23.02 impeding job licenses from being cleared.
* Move to a `log_flag` the `_get_joules_task` error that was logged
to the user when too many rpcs were queued in slurmd for gathering
energy.
* For `scancel --ctld` and the associated rest api endpoints:
`DELETE /slurm/v0.0.40/jobs`
`DELETE /slurm/v0.0.41/jobs`
Fix canceling the final array task in a job array when the task
is pending and all array tasks have been split into separate job
records. Previously this task was not canceled.
* Fix `power_save` operation after recovering from a failed
reconfigure.
* `slurmctld` - Skip removing the pidfile when running under
systemd. In that situation it is never created in the first place.
* Fix issue where altering the flags on a Slurm account
(`UsersAreCoords`) caused several limits on the account's
association to be set to 0 in Slurm's internal cache.
* Fix memory leak in the controller when relaying `stepmgr` step
accounting to the dbd.
* Fix segfault when submitting stepmgr jobs within an existing
allocation.
* Added `disable_slurm_hydra_bootstrap` as a possible `MpiParams`
parameter in `slurm.conf`. Using this will disable env variable
injection to allocations for the following variables:
`I_MPI_HYDRA_BOOTSTRAP`, `I_MPI_HYDRA_BOOTSTRAP_EXEC_EXTRA_ARGS`,
`HYDRA_BOOTSTRAP`, `HYDRA_LAUNCHER_EXTRA_ARGS`.
* `scrun` - Delay shutdown until after start is requested. Previously
`scrun` could never start or shut down and hung forever when using
`--tty`.
* Fix backup `slurmctld` potentially not running the agent when
taking over as the primary controller.
* Fix primary controller not running the agent when a reconfigure
of the `slurmctld` fails.
* `slurmd` - fix premature timeout waiting for
`REQUEST_LAUNCH_PROLOG` with large array jobs causing node to
drain.
* `jobcomp/{elasticsearch,kafka}` - Avoid sending fields with
invalid date/time.
* `jobcomp/elasticsearch` - Fix `slurmctld` memory leak from
curl usage.
* `acct_gather_profile/influxdb` - Fix slurmstepd memory leak from
curl usage.
* Fix 24.05.0 regression not deleting job hash dirs after
`MinJobAge`.
* Fix filtering arguments being ignored when using squeue `--json`.
* `switch/nvidia_imex` - Move setup call after `spank_init()` to
allow namespace manipulation within the SPANK plugin.
* `switch/nvidia_imex` - Skip plugin operation if
`nvidia-caps-imex-channels` device is not present rather than
preventing slurmd from starting.
* `switch/nvidia_imex` - Skip plugin operation if
`job_container/tmpfs` is configured due to incompatibility.
* `switch/nvidia_imex` - Remove any pre-existing channels when
`slurmd` starts.
* `rpc_queue` - Add support for an optional `rpc_queue.yaml`
configuration file.
* `slurmrestd` - Add new +prefer_refs flag to `data_parser/v0.0.41`
plugin. This flag will avoid inlining single referenced schemas
in the OpenAPI schema.
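Several 24.05.1 entries above restrict listening sockets (`mpi/pmix`, `sattach`) to the configured `SrunPortRange`. The idea can be sketched generically (a hypothetical helper, not Slurm's implementation): walk the range and keep the first port that binds, instead of letting the OS pick an ephemeral port:

```python
import socket

def bind_in_range(lo, hi, host="127.0.0.1"):
    """Return a listening socket bound to the first free port in [lo, hi]."""
    for port in range(lo, hi + 1):
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            s.bind((host, port))
            s.listen(1)
            return s, port
        except OSError:
            s.close()  # port taken; try the next one in the range
    raise OSError(f"no free port in {lo}-{hi}")

s, port = bind_in_range(60001, 60100)
print(port)  # somewhere within the requested range
s.close()
```

Keeping all listeners inside one administrator-chosen range is what makes firewalling srun-related traffic practical.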
-------------------------------------------------------------------
Tue Jun 4 09:36:54 UTC 2024 - Christian Goll <cgoll@suse.com>
- Updated to new release 24.05.0 with the following major changes
* Important Notes:
If using the slurmdbd (Slurm DataBase Daemon) you must update
this first. NOTE: If using a backup DBD you must start the
primary first to do any database conversion; the backup will not
start until this has happened. The 24.05 slurmdbd will work
with Slurm daemons of version 23.02 and above. You will not
need to update all clusters at the same time, but it is very
important to update slurmdbd first and have it running before
updating any other clusters making use of it.
* Highlights
+ Federation - allow client command operation when slurmdbd is
unavailable.
+ `burst_buffer/lua` - Added two new hooks: `slurm_bb_test_data_in`
and `slurm_bb_test_data_out`. The syntax and use of the new hooks
are documented in `etc/burst_buffer.lua.example`. These are
required to exist. slurmctld now checks on startup if the
`burst_buffer.lua` script loads and contains all required hooks;
`slurmctld` will exit with a fatal error if this is not
successful. Added `PollInterval` to `burst_buffer.conf`. Removed
the arbitrary limit of 512 copies of the script running
simultaneously.
+ Add QOS limit `MaxTRESRunMinsPerAccount`.
+ Add QOS limit `MaxTRESRunMinsPerUser`.
+ Add `ELIGIBLE` environment variable to `jobcomp/script` plugin.
+ Always use the QOS name for `SLURM_JOB_QOS` environment variables.
Previously the batch environment would use the description field,
which was usually equivalent to the name.
+ `cgroup/v2` - Require dbus-1 version >= 1.11.16.
+ Allow `NodeSet` names to be used in SuspendExcNodes.
+ `SuspendExcNodes=<nodes>:N` now counts allocated nodes in `N`.
The first `N` powered up nodes in `<nodes>` are protected from
being suspended.
+ Store job output, input and error paths in `SlurmDBD`.
+ Add `USER_DELETE` reservation flag to allow users with access
to a reservation to delete it.
+ Add `SlurmctldParameters=enable_stepmgr` to enable step
management through the `slurmstepd` instead of the controller.
+ Added `PrologFlags=RunInJob` to make prolog and epilog run
inside the job extern step to include it in the job's cgroup.
+ Add ability to reserve MPI ports at the job level for stepmgr
jobs and subdivide them at the step level.
+ `slurmrestd` - Add `--generate-openapi-spec argument`.
* Configuration File Changes (see appropriate man page for details)
+ `CoreSpecPlugin` has been removed.
+ Removed `TopologyPlugin` tree and dragonfly support from
`select/linear`. If those topology plugins are desired please
switch to `select/cons_tres`.
+ Changed the default value for `UnkillableStepTimeout` to 60
seconds or five times the value of `MessageTimeout`, whichever
is greater.
+ An error log has been added if `JobAcctGatherParams` '`UsePss`'
or '`NoShare`' are configured with a plugin other than
`jobacct_gather/linux`. In such case these parameters are ignored.
+ `helpers.conf` - Added `Flags=rebootless` parameter allowing
feature changes without rebooting compute nodes.
+ `topology/block` - Replaced the `BlockLevels` with `BlockSizes`
in `topology.conf`.
+ Add `contain_spank` option to `SlurmdParameters`. When set,
`spank_user_init()`, `spank_task_post_fork()`, and
`spank_task_exit()` will execute within the
`job_container/tmpfs` plugin namespace.
+ Add `SlurmctldParameters=max_powered_nodes=N`, which prevents
powering up nodes after the max is reached.
+ Add `ExclusiveTopo` to a partition definition in `slurm.conf`.
+ Add `AccountingStorageParameters=max_step_records` to limit how
many steps are recorded in the database for each job - excluding
batch.
* Command Changes (see man pages for details)
+ Add support for "elevenses" as an additional time specification.
+ Add support for `sbcast --preserve` when `job_container/tmpfs`
configured (previously documented as unsupported).
+ `scontrol` - Add new subcommand `power` for node power control.
+ `squeue` - Adjust `StdErr`, `StdOut`, and `StdIn` output formats.
These will now consistently print "`(null)`" if a value is
unavailable. `StdErr` will no longer display `StdOut` if it is
not distinctly set. `StdOut` will now correctly display the
default filename pattern for job arrays, and no longer show it
for non-batch jobs. However, the expansion patterns will
no longer be substituted by default.
+ Add `--segment` to job allocation to be used in topology/block.
+ Add `--exclusive=topo` for use with topology/block.
+ `squeue` - Add `--expand-patterns` option to expand `StdErr`,
`StdOut`, `StdIn` filename patterns as best as possible.
+ `sacct` - Add `--expand-patterns` option to expand `StdErr`,
`StdOut`, `StdIn` filename patterns as best as possible.
+ `sreport` - Requesting `format=Planned` will now return the
expected `Planned` time as documented, instead of `PlannedDown`.
To request `Planned Down`, one must now use `format=PLNDDown`
or `format=PlannedDown` explicitly. The abbreviations
"`Pl`" or "`Pla`" will now make reference to Planned instead
of `PlannedDown`.
* API Changes
+ Removed `ListIterator` type from `<slurm/slurm.h>`.
+ Removed `slurm_xlate_job_id()` from `<slurm/slurm.h>`
* SLURMRESTD Changes
+ `openapi/dbv0.0.38` and `openapi/v0.0.38` plugins have been
removed.
+ `openapi/dbv0.0.39` and `openapi/v0.0.39` plugins have been
tagged as deprecated to warn of their removal in the next release.
+ Changed `slurmrestd.service` to only listen on TCP socket by
default. Environments with existing drop-in units for the
service may need further adjustments to work after upgrading.
+ `slurmrestd` - Tagged `script` field as deprecated in
`POST /slurm/v0.0.41/job/submit` in anticipation of removal in
future OpenAPI plugin versions. Job submissions should set the
`job.script` (or `jobs[0].script` for HetJobs) fields instead.
+ `slurmrestd` - Attempt to automatically convert enumerated
string arrays with incoming non-string values into strings.
Add a warning when an incoming value for enumerated string arrays
cannot be converted to string, and silently ignore it instead of
rejecting the entire request. This change affects any endpoint
that uses an enumerated string as given in the OpenAPI specification.
An example of this conversion would be to
`POST /slurm/v0.0.41/job/submit` with `.job.exclusive = true`.
While the JSON (boolean) true value matches a possible
enumeration, it is not the expected "true" string. This change
automatically converts the (boolean) `true` to (string) "`true`"
avoiding a parsing failure.
+ `slurmrestd` - Add `POST /slurm/v0.0.41/job/allocate` endpoint.
This endpoint will create a new job allocation without any steps.
The allocation will need to be ended via signaling the job or
it will run to the timelimit.
+ `slurmrestd` - Allow startup when `slurmdbd` is not configured
and avoid loading `slurmdbd` specific plugins.
* MPI/PMI2 Changes
+ Jobs submitted with the `SLURM_HOSTFILE` environment variable
set imply using an arbitrary distribution. Nevertheless, the
logic used in PMI2 when generating their associated
`PMI_process_mapping` values has been changed and will now be
the same used for the plane distribution, as if `-m plane` were
used. This has been changed because the original arbitrary
distribution implementation did not account for multiple
instances of the same host being present in `SLURM_HOSTFILE`,
providing an incorrect process mapping in such case. This
change also enables distributing tasks in blocks when using
arbitrary distribution, which was not the case before. This
only affects `mpi`/`pmi2` plugin.
- Removed Fix-test-21.41.patch as upstream test changed.
- Dropped package plugin-ext-sensors-rrd as the plugin module no
longer exists.
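The MPI/PMI2 change above can be illustrated with a small sketch (hypothetical, under the assumption that duplicate hostfile entries collapse into per-host task counts): ranks are handed out in blocks per host, in order of first appearance, so repeated hosts no longer break the process mapping:

```python
def block_mapping(hostfile_lines):
    """Collapse duplicate hosts into counts (first-appearance order),
    then assign consecutive ranks to each host in blocks."""
    counts = {}
    for h in hostfile_lines:
        counts[h] = counts.get(h, 0) + 1  # dict preserves insertion order
    mapping, rank = {}, 0
    for h, n in counts.items():
        mapping[h] = list(range(rank, rank + n))
        rank += n
    return mapping

# A hostfile listing the same hosts more than once:
print(block_mapping(["n1", "n2", "n1", "n2"]))
# n1 gets ranks 0-1 as a block, n2 gets ranks 2-3
```

The earlier arbitrary-distribution code did not account for repeated hosts at all, which is what produced the incorrect `PMI_process_mapping` values.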
-------------------------------------------------------------------
Mon Mar 25 15:16:44 UTC 2024 - Christian Goll <cgoll@suse.com>
- Removed Keep-logs-of-skipped-test-when-running-test-cases-sequentially.patch
as it was incorporated upstream
- Changes in Slurm 23.02.5
* Add the `JobId` to `debug()` messages indicating when
`cpus_per_task/mem_per_cpu` or `pn_min_cpus` are being
automatically adjusted.
* Fix regression in 23.02.2 that caused `slurmctld -R` to crash on
startup if a node features plugin is configured.
* Fix and prevent reoccurring reservations from overlapping.
* `job_container/tmpfs` - Avoid attempts to share `BasePath`
between nodes.
* Change the log message warning for rate limited users from
verbose to info.
* With `CR_Cpu_Memory`, fix node selection for jobs that request
gres and `--mem-per-cpu`.
* Fix a regression from 22.05.7 in which some jobs were allocated
too few nodes, thus overcommitting cpus to some tasks.
* Fix a job being stuck in the completing state if the job ends
while the primary controller is down or unresponsive and the
backup controller has not yet taken over.
* Fix `slurmctld` segfault when a node registers with a configured
`CpuSpecList` while slurmctld configuration has the node without
`CpuSpecList`.
* Fix cloud nodes getting stuck in `POWERED_DOWN+NO_RESPOND` state
after not registering by `ResumeTimeout`.
* `slurmstepd` - Avoid cleanup of `config.json`-less containers
spooldir getting skipped.
* `slurmstepd` - Cleanup per task generated environment for
containers in spooldir.
* Fix `scontrol` segfault when the 'completing' command is requested
repeatedly in interactive mode.
* Properly handle a race condition between `bind()` and `listen()`
calls in the network stack when running with `SrunPortRange` set.
* Federation - Fix revoked jobs being returned regardless of the
`-a`/`--all` option for privileged users.
* Federation - Fix canceling pending federated jobs from non-origin
clusters which could leave federated jobs orphaned from the origin
cluster.
* Fix sinfo segfault when printing multiple clusters with
`--noheader` option.
* Federation - fix clusters not syncing if clusters are added to
a federation before they have registered with the dbd.
* Change `pmi2` plugin to honor the `SrunPortRange` option. This
matches the new behavior of the pmix plugin in 23.02.0. Note that
neither of these plugins makes use of the "`MpiParams=ports=`"
option, and previously they were only limited by the system's
ephemeral port range.
* `node_features/helpers` - Fix node selection for jobs requesting
changeable features with the '`|`' operator, which could prevent
jobs from running on some valid nodes.
* `node_features/helpers` - Fix inconsistent handling of '`&`' and
'`|`', where an AND'd feature was sometimes AND'd to all sets of
features instead of just the current set. E.g. "`foo|bar&baz`" was
interpreted as `{foo,baz}` or `{bar,baz}` instead of how it is
documented: "`{foo} or {bar,baz}`".
* Fix job accounting so that when a job is requeued its allocated
node count is cleared. After the requeue, sacct will correctly
show that the job has 0 `AllocNodes` while it is pending or if
it is canceled before restarting.
* `sacct` - `AllocCPUS` now correctly shows 0 if a job has not yet
received an allocation or if the job was canceled before getting
one.
* Fix intel oneapi autodetect: detect the `/dev/dri/renderD[0-9]+`
gpus, and do not detect `/dev/dri/card[0-9]+`.
* Format batch, extern, interactive, and pending step ids into
strings that are human readable.
* Fix node selection for jobs that request `--gpus` and a number
of tasks fewer than gpus, which resulted in incorrectly rejecting
these jobs.
* Remove `MYSQL_OPT_RECONNECT` completely.
* Fix cloud nodes in `POWERING_UP` state disappearing (getting set
to `FUTURE`) when an `scontrol reconfigure` happens.
* `openapi/dbv0.0.39` - Avoid assert / segfault on missing
coordinators list.
* `slurmrestd` - Correct memory leak while parsing OpenAPI
specification templates with server overrides.
* `slurmrestd` - Reduce memory usage when printing out job CPU
frequency.
* Fix overwriting user node reason with system message.
* Remove `--uid` / `--gid` options from salloc and srun commands.
* Prevent deadlock when rpc_queue is enabled.
* `slurmrestd` - Correct OpenAPI specification generation bug where
fields with overlapping parent paths would not get generated.
* Fix memory leak as a result of a partition info query.
* Fix memory leak as a result of a job info query.
* `slurmrestd` - For `GET /slurm/v0.0.39/node[s]`, change format of
node's energy field `current_watts` to a dictionary to account
for unset value instead of dumping `4294967294`.
* `slurmrestd` - For `GET /slurm/v0.0.39/qos`, change format of
QOS's field `priority` to a dictionary to account for unset
value instead of dumping `4294967294`.
* `slurmrestd` - For `GET /slurm/v0.0.39/job[s]`, the `return_code`
field in `v0.0.39_job_exit_code` will be set to -127 instead of
being left unset when the job does not have a relevant return code.
* `data_parser/v0.0.39` - Add `required/memory_per_cpu` and
`required/memory_per_node` to `sacct --json` and `sacct --yaml` and
`GET /slurmdb/v0.0.39/jobs` from `slurmrestd`.
* For step allocations, fix `--gres=none` sometimes not ignoring
gres from the job.
* Fix `--exclusive` jobs incorrectly gang-scheduling where they
shouldn't.
* Fix allocations with `CR_SOCKET`, gres not assigned to a specific
socket, and block core distribution potentially allocating more
sockets than required.
* `gpu/oneapi` - Store cores correctly so CPU affinity is tracked.
* Revert a change in 23.02.3 where Slurm would kill a script's
process group as soon as the script ended instead of waiting as
long as any process in
that process group held the stdout/stderr file descriptors open.
That change broke some scripts that relied on the previous
behavior. Setting time limits for scripts (such as
`PrologEpilogTimeout`) is strongly encouraged to avoid Slurm
waiting indefinitely for scripts to finish.
* Allow `slurmdbd -R` to work if the root assoc id is not 1.
* Fix `slurmdbd -R` not returning an error under certain conditions.
* `slurmdbd` - Avoid potential NULL pointer dereference in the
mysql plugin.
* Revert a change in 23.02 where `SLURM_NTASKS` was no longer
set in the job's environment when `--ntasks-per-node` was
requested.
* Limit periodic node registrations to 50 instead of the full
`TreeWidth`.
Since unresolvable `cloud/dynamic` nodes must disable fanout by
setting `TreeWidth` to a large number, this would cause all nodes
to register at once.
* Fix regression in 23.02.3 which broke X11 forwarding for hosts
when `MUNGE` sends a localhost address in the encode host field.
This is caused when the node hostname is mapped to 127.0.0.1
(or similar) in `/etc/hosts`.
* `openapi/[db]v0.0.39` - fix memory leak on parsing error.
* `data_parser/v0.0.39` - fix updating qos for associations.
* `openapi/dbv0.0.39` - fix updating values for associations with
null users.
* Fix minor memory leak with `--tres-per-task` and licenses.
* Fix cyclic socket cpu distribution for tasks in a step where
`--cpus-per-task` < usable threads per core.
- Changes in Slurm 23.02.5
* Add the JobId to debug() messages indicating when cpus_per_task/mem_per_cpu
or pn_min_cpus are being automatically adjusted.
* Fix regression in 23.02.2 that caused slurmctld -R to crash on startup if
a node features plugin is configured.
* Fix and prevent reoccurring reservations from overlapping.
* job_container/tmpfs - Avoid attempts to share BasePath between nodes.
* Change the log message warning for rate limited users from verbose to info.
* With CR_Cpu_Memory, fix node selection for jobs that request gres and
--mem-per-cpu.
* Fix a regression from 22.05.7 in which some jobs were allocated too few
nodes, thus overcommitting cpus to some tasks.
* Fix a job being stuck in the completing state if the job ends while the
primary controller is down or unresponsive and the backup controller has
not yet taken over.
* Fix slurmctld segfault when a node registers with a configured CpuSpecList
while slurmctld configuration has the node without CpuSpecList.
* Fix cloud nodes getting stuck in POWERED_DOWN+NO_RESPOND state after not
registering by ResumeTimeout.
* slurmstepd - Avoid cleanup of config.json-less containers spooldir getting
skipped.
* slurmstepd - Cleanup per task generated environment for containers in
spooldir.
* Fix scontrol segfault when 'completing' command requested repeatedly in
interactive mode.
* Properly handle a race condition between bind() and listen() calls in the
network stack when running with SrunPortRange set.
- Changes in Slurm 23.02.4
* Fix `sbatch` return code when `--wait` is requested on a job array.
* `switch/hpe_slingshot` - avoid segfault when running with old
libcxi.
* Avoid slurmctld segfault when specifying
`AccountingStorageExternalHost`.
* Fix collected `GPUUtilization` values for `acct_gather_profile`
plugins.
* Fix slurmrestd handling of job hold/release operations.
* Make spank `S_JOB_ARGV` item value hold the requested command
argv instead of the srun `--bcast` value when `--bcast` requested
(only in local context).
* Fix step running indefinitely when slurmctld takes more than
`MessageTimeout` to respond. Now, `slurmctld` will cancel the
step when detected, preventing following steps from getting stuck
waiting for resources to be released.
* Fix regression to make job_desc.min_cpus accurate again in
job_submit when requesting a job with `--ntasks-per-node`.
* `scontrol` - Permit changes to `StdErr` and `StdIn` for pending
jobs.
* `scontrol` - Reset std{err,in,out} when set to empty string.
* `slurmrestd` - mark environment as a required field for job
submission descriptions.
* `slurmrestd` - avoid dumping null in OpenAPI schema required
fields.
* `data_parser/v0.0.39` - avoid rejecting valid `memory_per_node`
formatted as dictionary provided with a job description.
* `data_parser/v0.0.39` - avoid rejecting valid `memory_per_cpu`
formatted as dictionary provided with a job description.
* `slurmrestd` - Return HTTP error code 404 when job query fails.
* `slurmrestd` - Add return schema to error response to job and
license query.
* Fix handling of ArrayTaskThrottle in backfill.
* Fix regression in 23.02.2 when checking gres state on `slurmctld`
startup or reconfigure. Gres changes in the configuration were
not updated on `slurmctld` startup. On startup or reconfigure,
these messages were present in the log:
"`error: Attempt to change gres/gpu Count`".
* Fix potential double count of gres when dealing with limits.
* `switch/hpe_slingshot` - support alternate traffic class names
with "`TC_`" prefix.
* `scrontab` - Fix cutting off the final character of quoted
variables.
* Fix `slurmstepd` segfault when `ContainerPath` is not set in
`oci.conf`.
* Change the log message warning for rate limited users from
debug to verbose.
* Fixed an issue where jobs requesting licenses were incorrectly
rejected.
* `smail` - Fix issues where emails at job completion were not
being sent.
* `scontrol/slurmctld` - fix comma parsing when updating a
reservation's nodes.
* `cgroup/v2` - Avoid capturing log output for ebpf when
constraining devices, as this can lead to inadvertent failure
if the log buffer is too small.
* Fix `--gpu-bind=single` binding tasks to wrong gpus, leading to
some gpus having more tasks than they should and other gpus being
unused.
* Fix main scheduler loop not starting after failover to backup
controller.
* Added error message when attempting to use sattach on batch or
extern steps.
* Fix regression in 23.02 that causes slurmstepd to crash when
`srun` requests more than `TreeWidth` nodes in a step and uses
the `pmi2` or `pmix` plugin.
* Reject job `ArrayTaskThrottle` update requests from unprivileged
users.
* `data_parser/v0.0.39` - populate description fields of property
objects in generated OpenAPI specifications where defined.
* `slurmstepd` - Avoid segfault caused by ContainerPath not being
terminated by '`/`' in `oci.conf`.
* `data_parser/v0.0.39` - Change `v0.0.39_job_info` response to tag
`exit_code` field as being complex instead of only an unsigned
integer.
* `job_container/tmpfs` - Fix %h and %n substitution in `BasePath`
where `%h` was substituted as the `NodeName` instead of the
hostname, and `%n` was substituted as an empty string.
* Fix regression where `--cpu-bind=verbose` would override
`TaskPluginParam`.
* `scancel` - Fix `--clusters`/`-M` for federations. Only filtered
jobs (e.g. -A, -u, -p, etc.) from the specified clusters will be
canceled, rather than all jobs in the federation.
Specific jobids will still be routed to the origin cluster
for cancellation.
-------------------------------------------------------------------
Mon Jan 29 13:47:55 UTC 2024 - Egbert Eich <eich@suse.com>
@ -2758,6 +2207,7 @@ Fri Jul 2 08:01:32 UTC 2021 - Christian Goll <cgoll@suse.com>
- Updated to 20.11.8:
* slurmctld - fix erroneous "StepId=CORRUPT" messages in error logs.
* Correct the error given when auth plugin fails to pack a credential.
* Fix unused-variable compiler warning on FreeBSD in fd_resolve_path().
* acct_gather_filesystem/lustre - only emit collection error once per step.
* Add GRES environment variables (e.g., CUDA_VISIBLE_DEVICES) into the
interactive step, the same as is done for the batch step.
View File
@ -1,5 +1,5 @@
#
# spec file for package slurm
# spec file
#
# Copyright (c) 2024 SUSE LLC
#
@ -17,10 +17,10 @@
# Check file META in sources: update so_version to (API_CURRENT - API_AGE)
%define so_version 41
%define so_version 40
# Make sure to update `upgrades` as well!
%define ver 24.05.4
%define _ver _24_05
%define ver 23.11.5
%define _ver _23_11
%define dl_ver %{ver}
# so-version is 0 and seems to be stable
%define pmi_so 0
@ -59,9 +59,6 @@ ExclusiveArch: do_not_build
%if 0%{?sle_version} == 150500 || 0%{?sle_version} == 150600
%define base_ver 2302
%endif
%if 0%{?sle_version} == 150500 || 0%{?sle_version} == 150600
%define base_ver 2302
%endif
%define ver_m %{lua:x=string.gsub(rpm.expand("%ver"),"%.[^%.]*$","");print(x)}
# Keep format_spec_file from botching the define below:
@ -173,6 +170,8 @@ Source20: test_setup.tar.gz
Source21: README_Testsuite.md
Patch0: Remove-rpath-from-build.patch
Patch2: pam_slurm-Initialize-arrays-and-pass-sizes.patch
Patch10: Fix-test-21.41.patch
#Patch14: Keep-logs-of-skipped-test-when-running-test-cases-sequentially.patch
Patch15: Fix-test7.2-to-find-libpmix-under-lib64-as-well.patch
%{upgrade_dep %pname}
@ -407,6 +406,19 @@ Requires: %{name}-config = %{version}
%description plugins
This package contains the SLURM plugins (loadable shared objects)
%package plugin-ext-sensors-rrd
Summary: SLURM ext_sensors/rrd Plugin (loadable shared objects)
Group: Productivity/Clustering/Computing
Requires: %{name}-plugins = %{version}
%{upgrade_dep %{pname}-plugin-ext-sensors-rrd}
# file was moved from slurm-plugins to here
Conflicts: %{pname}-plugins < %{version}
%description plugin-ext-sensors-rrd
This package contains the ext_sensors/rrd plugin used to read data
using RRD, a tool that creates and manages a linear database for
sampling and logging data.
%package torque
Summary: Wrappers for transition from Torque/PBS to SLURM
Group: Productivity/Clustering/Computing
@ -517,7 +529,6 @@ This package contains just the minimal code to run a compute node.
%package config
Summary: Config files and directories for slurm services
Group: Productivity/Clustering/Computing
%{?sysusers_requires}
Requires: logrotate
BuildArch: noarch
%if 0%{?suse_version} <= 1140
@ -751,15 +762,9 @@ rm -rf %{buildroot}/%{_libdir}/slurm/*.{a,la} \
%{buildroot}/%{_libdir}/*.la \
%{buildroot}/%_lib/security/*.la
# Fix perl
rm %{buildroot}%{perl_archlib}/perllocal.pod \
%{buildroot}%{perl_sitearch}/auto/Slurm/.packlist \
%{buildroot}%{perl_sitearch}/auto/Slurmdb/.packlist
mkdir -p %{buildroot}%{perl_vendorarch}
mv %{buildroot}%{perl_sitearch}/* \
%{buildroot}%{perl_vendorarch}
rm %{buildroot}/%{perl_archlib}/perllocal.pod \
%{buildroot}/%{perl_vendorarch}/auto/Slurm/.packlist \
%{buildroot}/%{perl_vendorarch}/auto/Slurmdb/.packlist
# Remove Cray specific binaries
rm -f %{buildroot}/%{_sbindir}/capmc_suspend \
@ -1081,6 +1086,7 @@ rm -rf /srv/slurm-testsuite/src /srv/slurm-testsuite/testsuite \
%{?have_netloc:%{_bindir}/netloc_to_topology}
%{_sbindir}/sackd
%{_sbindir}/slurmctld
%{_sbindir}/slurmsmwd
%dir %{_libdir}/slurm/src
%{_unitdir}/slurmctld.service
%{_sbindir}/rcslurmctld
@ -1158,10 +1164,9 @@ rm -rf /srv/slurm-testsuite/src /srv/slurm-testsuite/testsuite \
%files -n perl-%{name}
%{perl_vendorarch}/Slurm.pm
%{perl_vendorarch}/Slurm
%{perl_vendorarch}/Slurmdb.pm
%{perl_vendorarch}/auto/Slurm
%{perl_vendorarch}/Slurmdb.pm
%{perl_vendorarch}/auto/Slurmdb
%dir %{perl_vendorarch}/auto
%{_mandir}/man3/Slurm*.3pm.*
%files slurmdbd
@ -1184,7 +1189,6 @@ rm -rf /srv/slurm-testsuite/src /srv/slurm-testsuite/testsuite \
%dir %{_libdir}/slurm
%{_libdir}/slurm/libslurmfull.so
%{_libdir}/slurm/accounting_storage_slurmdbd.so
%{_libdir}/slurm/accounting_storage_ctld_relay.so
%{_libdir}/slurm/acct_gather_energy_pm_counters.so
%{_libdir}/slurm/acct_gather_energy_gpu.so
%{_libdir}/slurm/acct_gather_energy_ibmaem.so
@ -1193,7 +1197,6 @@ rm -rf /srv/slurm-testsuite/src /srv/slurm-testsuite/testsuite \
%{_libdir}/slurm/acct_gather_filesystem_lustre.so
%{_libdir}/slurm/burst_buffer_lua.so
%{_libdir}/slurm/burst_buffer_datawarp.so
%{_libdir}/slurm/data_parser_v0_0_41.so
%{_libdir}/slurm/data_parser_v0_0_40.so
%{_libdir}/slurm/data_parser_v0_0_39.so
%{_libdir}/slurm/cgroup_v1.so
@ -1211,13 +1214,12 @@ rm -rf /srv/slurm-testsuite/src /srv/slurm-testsuite/testsuite \
%{_libdir}/slurm/gres_nic.so
%{_libdir}/slurm/gres_shard.so
%{_libdir}/slurm/hash_k12.so
%{_libdir}/slurm/hash_sha3.so
%{_libdir}/slurm/tls_none.so
%{_libdir}/slurm/jobacct_gather_cgroup.so
%{_libdir}/slurm/jobacct_gather_linux.so
%{_libdir}/slurm/jobcomp_filetxt.so
%{_libdir}/slurm/jobcomp_lua.so
%{_libdir}/slurm/jobcomp_script.so
%{_libdir}/slurm/job_container_cncu.so
%{_libdir}/slurm/job_container_tmpfs.so
%{_libdir}/slurm/job_submit_all_partitions.so
%{_libdir}/slurm/job_submit_defaults.so
@ -1251,7 +1253,6 @@ rm -rf /srv/slurm-testsuite/src /srv/slurm-testsuite/testsuite \
%{_libdir}/slurm/serializer_url_encoded.so
%{_libdir}/slurm/serializer_yaml.so
%{_libdir}/slurm/site_factor_example.so
%{_libdir}/slurm/switch_nvidia_imex.so
%{_libdir}/slurm/task_affinity.so
%{_libdir}/slurm/task_cgroup.so
%{_libdir}/slurm/topology_3d_torus.so
@ -1271,6 +1272,9 @@ rm -rf /srv/slurm-testsuite/src /srv/slurm-testsuite/testsuite \
%{_libdir}/slurm/acct_gather_profile_influxdb.so
%{_libdir}/slurm/jobcomp_elasticsearch.so
%files plugin-ext-sensors-rrd
%{_libdir}/slurm/ext_sensors_rrd.so
%files lua
%{_libdir}/slurm/job_submit_lua.so
@ -1306,6 +1310,8 @@ rm -rf /srv/slurm-testsuite/src /srv/slurm-testsuite/testsuite \
%{_libdir}/slurm/openapi_slurmdbd.so
%{_libdir}/slurm/openapi_dbv0_0_39.so
%{_libdir}/slurm/openapi_v0_0_39.so
%{_libdir}/slurm/openapi_dbv0_0_38.so
%{_libdir}/slurm/openapi_v0_0_38.so
%{_libdir}/slurm/rest_auth_local.so
%endif
@ -1342,10 +1348,12 @@ rm -rf /srv/slurm-testsuite/src /srv/slurm-testsuite/testsuite \
%files config-man
%{_mandir}/man5/acct_gather.conf.*
%{_mandir}/man5/burst_buffer.conf.*
%{_mandir}/man5/ext_sensors.conf.*
%{_mandir}/man5/slurm.*
%{_mandir}/man5/cgroup.*
%{_mandir}/man5/gres.*
%{_mandir}/man5/helpers.*
#%%{_mandir}/man5/nonstop.conf.5.*
%{_mandir}/man5/oci.conf.5.gz
%{_mandir}/man5/topology.*
%{_mandir}/man5/knl.conf.5.*
@ -1360,7 +1368,17 @@ rm -rf /srv/slurm-testsuite/src /srv/slurm-testsuite/testsuite \
%endif
%files cray
# do not remove cray specific packages from SLES update
# Only for Cray
%{_libdir}/slurm/core_spec_cray_aries.so
%{_libdir}/slurm/job_submit_cray_aries.so
%{_libdir}/slurm/select_cray_aries.so
%{_libdir}/slurm/switch_cray_aries.so
%{_libdir}/slurm/task_cray_aries.so
%{_libdir}/slurm/proctrack_cray_aries.so
%{_libdir}/slurm/mpi_cray_shasta.so
%{_libdir}/slurm/node_features_knl_cray.so
%{_libdir}/slurm/power_cray_aries.so
%if 0%{?slurm_testsuite}
%files testsuite