Compare commits
No commits in common. "factory" and "factory" have entirely different histories.
65
Fix-test-21.41.patch
Normal file
@@ -0,0 +1,65 @@
From: Egbert Eich <eich@suse.com>
Date: Wed Jun 22 14:39:10 2022 +0200
Subject: Fix test 21.41
Patch-mainline: Not yet
Git-repo: https://github.com/SchedMD/slurm
Git-commit: 21619ffa15d1d656ee11a477ebb8215a06387fdd
References:

Since expect is not line oriented, the output is not matched line by line.
Thus the order in which results are returned by sacctmgr actually matters:
If the first test case matches what is returned first, this part will be
consumed. If the 2nd test case will then match what is left over, the
test will actually succeed.
If this is not the case, ie if the first test matches a part that is
actually sent later, the earlier parts will actually be forgotten and
won't match at all.
To make the test resilient to different order of results, the test has
been rewritten to only contain a single match line.

Signed-off-by: Egbert Eich <eich@suse.com>
Signed-off-by: Egbert Eich <eich@suse.de>
---
 testsuite/expect/test21.41 | 30 +++++++++++++++---------------
 1 file changed, 15 insertions(+), 15 deletions(-)
diff --git a/testsuite/expect/test21.41 b/testsuite/expect/test21.41
index c0961522db..1fd921a48f 100755
--- a/testsuite/expect/test21.41
+++ b/testsuite/expect/test21.41
@@ -372,21 +372,21 @@ expect {
 	-re "There was a problem" {
 		fail "There was a problem with the sacctmgr command"
 	}
-	-re "$user1.$wckey1.($number)." {
-		set user1wckey1 $expect_out(1,string)
-		exp_continue
-	}
-	-re "$user2.$wckey1.($number)." {
-		set user2wckey1 $expect_out(1,string)
-		exp_continue
-	}
-	-re "$user1.$wckey2.($number)." {
-		set user1wckey2 $expect_out(1,string)
-		exp_continue
-	}
-	-re "$user2.$wckey2.($number)." {
-		set user2wckey2 $expect_out(1,string)
-		exp_continue
+	-re "($user1|$user2).($wckey1|$wckey2).($number)." {
+		if { $expect_out(1,string) eq $user1 } {
+			if { $expect_out(2,string) eq $wckey1 } {
+				set user1wckey1 $expect_out(3,string)
+			} elseif { $expect_out(2,string) eq $wckey2 } {
+				set user1wckey2 $expect_out(3,string)
+			}
+		} elseif { $expect_out(1,string) eq $user2 } {
+			if { $expect_out(2,string) eq $wckey1 } {
+				set user2wckey1 $expect_out(3,string)
+			} elseif { $expect_out(2,string) eq $wckey2 } {
+				set user2wckey2 $expect_out(3,string)
+			}
+		}
+		exp_continue
 	}
 	timeout {
 		fail "sacctmgr wckeys not responding"
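The ordering problem described in the patch message can be modeled outside of expect. The following is a rough Python sketch, not the test suite's code: `expect_like` is a hypothetical helper that assumes the whole sacctmgr output is already buffered, tries the patterns in listed order, and discards everything up to the end of each match, which is how an expect block with `exp_continue` behaves. The user names, wckey name and `\d+` number pattern are made-up sample values.

```python
import re


def expect_like(buffer, patterns):
    """Rough model of an expect block whose actions all call exp_continue.

    Patterns are tried in the order listed; the first one that matches
    anywhere in the buffer wins, and the buffer is cut off after that match.
    """
    matches = []
    while True:
        for pattern in patterns:
            found = re.search(pattern, buffer)
            if found:
                matches.append(found.groups())
                # Everything before and including the match is forgotten.
                buffer = buffer[found.end():]
                break
        else:
            # No pattern matched what is left in the buffer.
            return matches


# Hypothetical output where the row for "bob" arrives before "alice".
output = "bob|wk1|1|\nalice|wk1|2|\n"

# One pattern per user: the "alice" pattern is listed first, matches the
# later row, and the earlier "bob" row is discarded without being seen.
separate = [r"(alice).(wk1).(\d+).", r"(bob).(wk1).(\d+)."]

# One combined pattern: the leftmost row always matches first, so rows are
# consumed in the order they were sent.
combined = [r"(alice|bob).(wk1).(\d+)."]

print(expect_like(output, separate))
print(expect_like(output, combined))
```

With the separate patterns only the `alice` row is captured, while the combined pattern captures both rows, which is the effect the rewritten test relies on.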
3
slurm-23.11.5.tar.bz2
Normal file
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:7a8f4b1b46d3a8ec9a95066b04635c97f9095877f6189a8ff7388e5e74daeef3
size 7365175
@@ -1,3 +0,0 @@
version https://git-lfs.github.com/spec/v1
oid sha256:240a2105c8801bc0d222fa2bbcf46f71392ef94cce9253357e5f43f029adaf9b
size 7183430
892
slurm.changes
@@ -1,733 +1,182 @@
-------------------------------------------------------------------
Fri Nov 1 12:50:27 UTC 2024 - Egbert Eich <eich@suse.com>

- Update to version 24.05.4 & fix for CVE-2024-48936.
  * Fix generic int sort functions.
  * Fix user look up using possible unrealized uid in the dbd.
  * `slurmrestd` - Fix regressions that allowed `slurmrestd` to
    be run as SlurmUser when `SlurmUser` was not root.
  * mpi/pmix fix race conditions with het jobs at step start/end
    which could cause srun to hang.
  * Fix not showing some `SelectTypeParameters` in `scontrol show
    config`.
  * Avoid assert when dumping certain removed fields in JSON/YAML.
  * Improve how shards are scheduled with affinity in mind.
  * Fix `MaxJobsAccruePU` not being respected when `MaxJobsAccruePA`
    is set in the same QOS.
  * Prevent backfill from planning jobs that use overlapping
    resources for the same time slot if the job's time limit is
    less than `bf_resolution`.
  * Fix memory leak when requesting typed gres and
    `--[cpus|mem]-per-gpu`.
  * Prevent backfill from breaking out due to "system state
    changed" every 30 seconds if reservations use `REPLACE` or
    `REPLACE_DOWN` flags.
  * `slurmrestd` - Make sure that `scheduler_unset` parameter defaults
    to true even when the following flags are also set:
    `show_duplicates`, `skip_steps`, `disable_truncate_usage_time`,
    `run_away_jobs`, `whole_hetjob`, `disable_whole_hetjob`,
    `disable_wait_for_result`, `usage_time_as_submit_time`,
    `show_batch_script`, and/or `show_job_environment`. Additionally,
    always make sure `show_duplicates` and
    `disable_truncate_usage_time` default to true when the following
    flags are also set: `scheduler_unset`, `scheduled_on_submit`,
    `scheduled_by_main`, `scheduled_by_backfill`, and/or `job_started`.
    This affects the following endpoints:
    `GET /slurmdb/v0.0.40/jobs`
    `GET /slurmdb/v0.0.41/jobs`
  * Ignore `--json` and `--yaml` options for `scontrol show config`
    to prevent mixing output types.
  * Fix not considering nodes in reservations with Maintenance or
    Overlap flags when creating new reservations with `nodecnt` or
    when they replace down nodes.
  * Fix suspending/resuming steps running under a 23.02 `slurmstepd`
    process.
  * Fix options like `sprio --me` and `squeue --me` for users with
    a uid greater than 2147483647.
  * `fatal()` if `BlockSizes=0`. This value is invalid and would
    otherwise cause the `slurmctld` to crash.
  * `sacctmgr` - Fix issue where clearing out a preemption list using
    `preempt=''` would cause the given qos to no longer be preempt-able
    until set again.
  * Fix `stepmgr` creating job steps concurrently.
  * `data_parser/v0.0.40` - Avoid dumping "Infinity" for `NO_VAL` tagged
    "number" fields.
  * `data_parser/v0.0.41` - Avoid dumping "Infinity" for `NO_VAL` tagged
    "number" fields.
  * `slurmctld` - Fix a potential leak while updating a reservation.
  * `slurmctld` - Fix state save with reservation flags when an update
    fails.
  * Fix reservation update issues with parameters Accounts and Users, when
    using +/- signs.
  * `slurmrestd` - Don't dump warning on empty wckeys in:
    `GET /slurmdb/v0.0.40/config`
    `GET /slurmdb/v0.0.41/config`
  * Fix slurmd possibly leaving zombie processes on start up in configless
    when the initial attempt to fetch the config fails.
  * Fix crash when trying to drain a non-existing node (possibly deleted
    before).
  * `slurmctld` - fix segfault when calculating limit decay for jobs with
    an invalid association.
  * Fix IPMI energy gathering with multiple sensors.
  * `data_parser/v0.0.39` - Remove xassert requiring errors and warnings
    to have a source string.
  * `slurmrestd` - Prevent potential segfault when there is an error
    parsing an array field which could lead to a double xfree. This
    applies to several endpoints in `data_parser` v0.0.39, v0.0.40 and
    v0.0.41.
  * `scancel` - Fix a regression from 23.11.6 where using both the
    `--ctld` and `--sibling` options would cancel the federated job on
    all clusters instead of only the cluster(s) specified by `--sibling`.
  * `accounting_storage/mysql` - Fix bug when removing an association
    specified with an empty partition.
  * Fix setting multiple partition state restore on a job correctly.
  * Fix difference in behavior when swapping partition order in job
    submission.
  * Fix security issue in stepmgr that could permit an attacker to
    execute processes under other users' jobs. CVE-2024-48936.

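The `--me` entry above names 2147483647, which is the largest value a signed 32-bit integer can hold. The changelog does not say what the actual defect was; the Python sketch below only illustrates, as an assumption, how a uid above that limit changes when it passes through signed 32-bit storage. The helper names are hypothetical and not part of Slurm.

```python
import ctypes

INT32_MAX = 2**31 - 1  # 2147483647, the threshold named in the entry above


def as_signed_32(uid):
    """Reinterpret a 32-bit uid as a signed 32-bit integer (wraps silently)."""
    return ctypes.c_int32(uid).value


def me_filter_matches(job_uid, my_uid, signed_storage):
    """Hypothetical '--me' style filter: does the job belong to my_uid?

    With signed_storage=True the caller's uid is squeezed through a signed
    32-bit value first, which is the assumed failure mode being illustrated.
    """
    stored = as_signed_32(my_uid) if signed_storage else my_uid
    return job_uid == stored


big_uid = 3_000_000_000  # made-up uid above INT32_MAX
print(as_signed_32(INT32_MAX))                    # unchanged at the limit
print(as_signed_32(big_uid))                      # becomes negative
print(me_filter_matches(big_uid, big_uid, True))  # filter no longer matches
```

Keeping the uid in an unsigned 32-bit (or wider) type throughout avoids the wrap.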
-------------------------------------------------------------------
Wed Oct 23 08:54:29 UTC 2024 - Egbert Eich <eich@suse.com>

- Add %{?sysusers_requires} to slurm-config.
  This fixes issues when building against Slurm.

-------------------------------------------------------------------
Mon Oct 14 10:40:10 UTC 2024 - Egbert Eich <eich@suse.com>

- Update to version 24.05.3
  * `data_parser/v0.0.40` - Added field descriptions.
  * `slurmrestd` - Avoid creating new slurmdbd connection per request
    to `* /slurm/slurmctld/*/*` endpoints.
  * Fix compilation issue with `switch/hpe_slingshot` plugin.
  * Fix gres per task allocation with threads-per-core.
  * `data_parser/v0.0.41` - Added field descriptions.
  * `slurmrestd` - Change back generated OpenAPI schema for
    `DELETE /slurm/v0.0.40/jobs/` to `RequestBody` instead of using
    parameters for request. `slurmrestd` will continue to accept endpoint
    requests via `RequestBody` or HTTP query.
  * `topology/tree` - Fix issues with switch distance optimization.
  * Fix potential segfault of secondary `slurmctld` when falling back
    to the primary when running with a `JobComp` plugin.
  * Enable `--json`/`--yaml=v0.0.39` options on client commands to
    dump data using data_parser/v0.0.39 instead of outputting nothing.
  * `switch/hpe_slingshot` - Fix issue that could result in a 0 length
    state file.
  * Fix unnecessary message protocol downgrade for unregistered nodes.
  * Fix unnecessarily packing alias addrs when terminating jobs with
    a mix of non-cloud/dynamic nodes and powered down cloud/dynamic
    nodes.
  * `accounting_storage/mysql` - Fix issue when deleting a qos that
    could remove too many commas from the qos and/or delta_qos fields
    of the assoc table.
  * `slurmctld` - Fix memory leak when using RestrictedCoresPerGPU.
  * Fix allowing access to reservations without `MaxStartDelay` set.
  * Fix regression introduced in 24.05.0rc1 breaking
    `srun --send-libs` parsing.
  * Fix slurmd vsize memory leak when using job submission/allocation
    commands that implicitly or explicitly use `--get-user-env`.
  * `slurmd` - Fix node going into invalid state when using
    `CPUSpecList` and setting CPUs to the # of cores on a
    multithreaded node.
  * Fix reboot asap nodes being considered in backfill after a restart.
  * Fix `--clusters`/`-M` queries for clusters outside of a
    federation when `fed_display` is configured.
  * Fix `scontrol` allowing updating job with bad cpus-per-task value.
  * `sattach` - Fix regression from 24.05.2 security fix leading to
    crash.
  * `mpi/pmix` - Fix assertion when built under `--enable-debug`.
- Changes from Slurm 24.05.2
  * Fix energy gathering rpc counter underflow in
    `_rpc_acct_gather_energy` when more than 10 threads try to get
    energy at the same time. This prevented any step from getting
    energy from slurmd until slurmd was restarted, so energy
    accounting metrics for the node were lost.
  * `accounting_storage/mysql` - Fix issue where new user with `wckey`
    did not have a default wckey sent to the slurmctld.
  * `slurmrestd` - Prevent slurmrestd segfault when handling the
    following endpoints when none of the optional parameters are
    specified:
    `DELETE /slurm/v0.0.40/jobs`
    `DELETE /slurm/v0.0.41/jobs`
    `GET /slurm/v0.0.40/shares`
    `GET /slurm/v0.0.41/shares`
    `GET /slurmdb/v0.0.40/instance`
    `GET /slurmdb/v0.0.41/instance`
    `GET /slurmdb/v0.0.40/instances`
    `GET /slurmdb/v0.0.41/instances`
    `POST /slurm/v0.0.40/job/{job_id}`
    `POST /slurm/v0.0.41/job/{job_id}`
  * Fix IPMI energy gathering when no IPMIPowerSensors are specified
    in `acct_gather.conf`. This situation resulted in an accounted
    energy of 0 for job steps.
  * Fix a minor memory leak in slurmctld when updating a job dependency.
  * `scontrol`,`squeue` - Fix regression that caused incorrect values
    for multisocket nodes at `.jobs[].job_resources.nodes.allocation`
    for `scontrol show jobs --(json|yaml)` and `squeue --(json|yaml)`.
  * `slurmrestd` - Fix regression that caused incorrect values for
    multisocket nodes at `.jobs[].job_resources.nodes.allocation` to
    be dumped with endpoints:
    `GET /slurm/v0.0.41/job/{job_id}`
    `GET /slurm/v0.0.41/jobs`
  * `jobcomp/filetxt` - Fix truncation of job record lines > 1024
    characters.
  * `switch/hpe_slingshot` - Drain node on failure to delete CXI
    services.
  * Fix a performance regression from 23.11.0 in cpu frequency
    handling when no `CpuFreqDef` is defined.
  * Fix one-task-per-sharing not working across multiple nodes.
  * Fix inconsistent number of cpus when creating a reservation
    using the TRESPerNode option.
  * `data_parser/v0.0.40+` - Fix job state parsing which could
    break filtering.
  * Prevent `cpus-per-task` from being modified in jobs where a `-c`
    value has been explicitly specified and the requested memory
    constraints implicitly increase the number of CPUs to allocate.
  * `slurmrestd` - Fix regression where args `-s v0.0.39,dbv0.0.39`
    and `-d v0.0.39` would result in `GET /openapi/v3` not
    registering as a valid possible query resulting in 404 errors.
  * `slurmrestd` - Fix memory leak for dbv0.0.39 jobs query which
    occurred if the query parameters specified account, association,
    cluster, constraints, format, groups, job_name, partition, qos,
    reason, reservation, state, users, or wckey. This affects the
    following endpoints:
    `GET /slurmdb/v0.0.39/jobs`
  * `slurmrestd` - In the case the slurmdbd does not respond to a
    persistent connection init message, prevent the closed fd from
    being used, and instead emit an error or warning depending on
    if the connection was required.
  * Fix 24.05.0 regression that caused the slurmdbd not to send back
    an error message if there is an error initializing a persistent
    connection.
  * Reduce latency of forwarded x11 packets.
  * Add `curr_dependency` (representing the current dependency of
    the job) and `orig_dependency` (representing the original
    requested dependency of the job) fields to the job record in
    `job_submit.lua` (for job update) and `jobcomp.lua`.
  * Fix potential segfault of slurmctld configured with
    `SlurmctldParameters=enable_rpc_queue` from happening on
    reconfigure.
  * Fix potential segfault of slurmctld on its shutdown when rate
    limiting is enabled.
  * `slurmrestd` - Fix missing job environment for `SLURM_JOB_NAME`,
    `SLURM_OPEN_MODE`, `SLURM_JOB_DEPENDENCY`, `SLURM_PROFILE`,
    `SLURM_ACCTG_FREQ`, `SLURM_NETWORK` and `SLURM_CPU_FREQ_REQ` to
    match sbatch.
  * Fix GRES environment variable indices being incorrect when only
    using a subset of all GPUs on a node and the
    `--gres-flags=allow-task-sharing` option.
  * Prevent `scontrol` from segfaulting when requesting
    `scontrol show reservation` with `--json` or `--yaml` if there is
    an error retrieving reservations from the `slurmctld`.
  * `switch/hpe_slingshot` - Fix security issue around managing VNI
    access. CVE-2024-42511.
  * `switch/nvidia_imex` - Fix security issue managing IMEX channel
    access. CVE-2024-42511.
  * `switch/nvidia_imex` - Allow for compatibility with
    `job_container/tmpfs`.
- Changes in Slurm 24.05.1
  * Fix `slurmctld` and `slurmdbd` potentially stopping instead of
    performing a logrotate when receiving `SIGUSR2` when using
    `auth/slurm`.
  * `switch/hpe_slingshot` - Fix slurmctld crash when upgrading
    from 23.02.
  * Fix "Could not find group" errors from `validate_group()` when
    using `AllowGroups` with large `/etc/group` files.
  * Add `AccountingStoreFlags=no_stdio` which, when set, prevents
    the stdio paths of the job from being recorded.
  * `slurmrestd` - Prevent a slurmrestd segfault when parsing the
    `crontab` field, which was never usable. Now it explicitly
    ignores the value and emits a warning if it is used for the
    following endpoints:
    `POST /slurm/v0.0.39/job/{job_id}`
    `POST /slurm/v0.0.39/job/submit`
    `POST /slurm/v0.0.40/job/{job_id}`
    `POST /slurm/v0.0.40/job/submit`
    `POST /slurm/v0.0.41/job/{job_id}`
    `POST /slurm/v0.0.41/job/submit`
    `POST /slurm/v0.0.41/job/allocate`
  * `mpi/pmi2` - Fix communication issue leading to task launch
    failure with "`invalid kvs seq from node`".
  * Fix getting user environment when using sbatch with
    `--get-user-env` or `--export=` when there is a user profile
    script that reads `/proc`.
  * Prevent slurmd from crashing if `acct_gather_energy/gpu` is
    configured but `GresTypes` is not configured.
  * Do not log the following errors when `AcctGatherEnergyType`
    plugins are used but a node does not have or cannot find sensors:
    "`error: _get_joules_task: can't get info from slurmd`"
    "`error: slurm_get_node_energy: Zero Bytes were transmitted or
    received`"
    However, the following error will continue to be logged:
    "`error: Can't get energy data. No power sensors are available.
    Try later`"
  * `sbatch`, `srun` - Set `SLURM_NETWORK` environment variable if
    `--network` is set.
  * Fix cloud nodes not being able to forward to nodes that restarted
    with new IP addresses.
  * Fix cwd not being set correctly when running a SPANK plugin with a
    `spank_user_init()` hook and the new "`contain_spank`" option set.
  * `slurmctld` - Avoid deadlock during shutdown when `auth/slurm`
    is active.
  * Fix segfault in `slurmctld` with `topology/block`.
  * `sacct` - Fix printing of job group for job steps.
  * `scrun` - Log when an invalid environment variable causes the
    job submission to be rejected.
  * `accounting_storage/mysql` - Fix problem where listing or
    modifying an association when specifying a qos list could hang
    or take a very long time.
  * `gpu/nvml` - Fix `gpuutil/gpumem` only tracking last GPU in step.
    Now, `gpuutil/gpumem` will record sums of all GPUs in the step.
  * Fix error in `scrontab` jobs when using
    `slurm.conf:PropagatePrioProcess=1`.
  * Fix `slurmctld` crash on a batch job submission with
    `--nodes 0,...`.
  * Fix dynamic IP address fanout forwarding when using `auth/slurm`.
  * Restrict listening sockets in the `mpi/pmix` plugin and `sattach`
    to the `SrunPortRange`.
  * `slurmrestd` - Limit mime types returned from query to
    `GET /openapi/v3` to only return one mime type per serializer
    plugin to fix issues with OpenAPI client generators that are
    unable to handle multiple mime type aliases.
  * Fix many commands possibly reporting an "`Unexpected Message
    Received`" when in reality the connection timed out.
  * Prevent slurmctld from starting if there is not a json
    serializer present and the `extra_constraints` feature is enabled.
  * Fix heterogeneous job components not being signaled with
    `scancel --ctld` and `DELETE slurm/v0.0.40/jobs` if the job ids
    are not explicitly given, the heterogeneous job components match
    the given filters, and the heterogeneous job leader does not
    match the given filters.
  * Fix regression from 23.02 impeding job licenses from being cleared.
  * Move error to `log_flag` which caused the `_get_joules_task` error
    to be logged to the user when too many rpcs were queued in slurmd
    for gathering energy.
  * For `scancel --ctld` and the associated rest api endpoints:
    `DELETE /slurm/v0.0.40/jobs`
    `DELETE /slurm/v0.0.41/jobs`
    Fix canceling the final array task in a job array when the task
    is pending and all array tasks have been split into separate job
    records. Previously this task was not canceled.
  * Fix `power_save` operation after recovering from a failed
    reconfigure.
  * `slurmctld` - Skip removing the pidfile when running under
    systemd. In that situation it is never created in the first place.
  * Fix issue where altering the flags on a Slurm account
    (`UsersAreCoords`) would set several limits on the account's
    association to 0 in Slurm's internal cache.
  * Fix memory leak in the controller when relaying `stepmgr` step
    accounting to the dbd.
  * Fix segfault when submitting stepmgr jobs within an existing
    allocation.
  * Added `disable_slurm_hydra_bootstrap` as a possible `MpiParams`
    parameter in `slurm.conf`. Using this will disable env variable
    injection to allocations for the following variables:
    `I_MPI_HYDRA_BOOTSTRAP`, `I_MPI_HYDRA_BOOTSTRAP_EXEC_EXTRA_ARGS`,
    `HYDRA_BOOTSTRAP`, `HYDRA_LAUNCHER_EXTRA_ARGS`.
  * `scrun` - Delay shutdown until after start requested.
    This caused `scrun` to never start or shut down and to hang
    forever when using `--tty`.
  * Fix backup `slurmctld` potentially not running the agent when
    taking over as the primary controller.
  * Fix primary controller not running the agent when a reconfigure
    of the `slurmctld` fails.
  * `slurmd` - fix premature timeout waiting for
    `REQUEST_LAUNCH_PROLOG` with large array jobs causing node to
    drain.
  * `jobcomp/{elasticsearch,kafka}` - Avoid sending fields with
    invalid date/time.
  * `jobcomp/elasticsearch` - Fix `slurmctld` memory leak from
    curl usage.
  * `acct_gather_profile/influxdb` - Fix slurmstepd memory leak from
    curl usage.
  * Fix 24.05.0 regression not deleting job hash dirs after
    `MinJobAge`.
  * Fix filtering arguments being ignored when using `squeue --json`.
  * `switch/nvidia_imex` - Move setup call after `spank_init()` to
    allow namespace manipulation within the SPANK plugin.
  * `switch/nvidia_imex` - Skip plugin operation if
    `nvidia-caps-imex-channels` device is not present rather than
    preventing slurmd from starting.
  * `switch/nvidia_imex` - Skip plugin operation if
    `job_container/tmpfs` is configured due to incompatibility.
  * `switch/nvidia_imex` - Remove any pre-existing channels when
    `slurmd` starts.
  * `rpc_queue` - Add support for an optional `rpc_queue.yaml`
    configuration file.
  * `slurmrestd` - Add new `+prefer_refs` flag to `data_parser/v0.0.41`
    plugin. This flag will avoid inlining single referenced schemas
    in the OpenAPI schema.

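The 24.05.2 entry on the energy gathering rpc counter mentions an underflow that blocked all later requests until slurmd was restarted. The changelog does not show the code involved, so the Python sketch below is only a generic model of that class of bug, assuming an unsigned 32-bit counter and a limit of 10 concurrent requests. All names here are hypothetical.

```python
UINT32_MASK = 0xFFFFFFFF
MAX_CONCURRENT = 10  # assumed limit, taken from the "more than 10 threads" wording


def decrement_u32(value):
    """Decrement the way an unsigned 32-bit C integer would (wraps below 0)."""
    return (value - 1) & UINT32_MASK


def decrement_guarded(value):
    """Decrement that refuses to go below zero."""
    return value - 1 if value > 0 else 0


def may_start(active):
    """Hypothetical admission check: allow a request only below the limit."""
    return active < MAX_CONCURRENT


# One decrement too many on an idle counter wraps to the maximum value,
# so the admission check fails for every later request.
wrapped = decrement_u32(0)
print(wrapped, may_start(wrapped))

# With the guarded decrement the counter stays at zero and requests proceed.
safe = decrement_guarded(0)
print(safe, may_start(safe))
```

Because the wrapped counter never drops back below the limit on its own, only resetting it (for example by restarting the process) clears the condition, which matches the symptom described in the entry.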
-------------------------------------------------------------------
|
||||
Tue Jun 4 09:36:54 UTC 2024 - Christian Goll <cgoll@suse.com>
|
||||
|
||||
- Updated to new release 24.05.0 with following major changes
|
||||
* Important Notes:
|
||||
If using the slurmdbd (Slurm DataBase Daemon) you must update
|
||||
this first. NOTE: If using a backup DBD you must start the
|
||||
primary first to do any database conversion, the backup will not
|
||||
start until this has happened. The 24.05 slurmdbd will work
|
||||
with Slurm daemons of version 23.02 and above. You will not
|
||||
need to update all clusters at the same time, but it is very
|
||||
important to update slurmdbd first and having it running before
|
||||
updating any other clusters making use of it.
|
||||
* Highlights
|
||||
+ Federation - allow client command operation when slurmdbd is
|
||||
unavailable.
|
||||
+ `burst_buffer/lua` - Added two new hooks: `slurm_bb_test_data_in`
|
||||
and `slurm_bb_test_data_out`. The syntax and use of the new hooks
|
||||
are documented in `etc/burst_buffer.lua.example`. These are
|
||||
required to exist. slurmctld now checks on startup if the
|
||||
`burst_buffer.lua` script loads and contains all required hooks;
|
||||
`slurmctld` will exit with a fatal error if this is not
|
||||
successful. Added `PollInterval` to `burst_buffer.conf`. Removed
|
||||
the arbitrary limit of 512 copies of the script running
|
||||
simultaneously.
|
||||
+ Add QOS limit `MaxTRESRunMinsPerAccount`.
|
||||
+ Add QOS limit `MaxTRESRunMinsPerUser`.
|
||||
+ Add `ELIGIBLE` environment variable to `jobcomp/script` plugin.
|
||||
+ Always use the QOS name for `SLURM_JOB_QOS` environment variables.
|
||||
Previously the batch environment would use the description field,
|
||||
which was usually equivalent to the name.
|
||||
+ `cgroup/v2` - Require dbus-1 version >= 1.11.16.
|
||||
+ Allow `NodeSet` names to be used in SuspendExcNodes.
|
||||
+ `SuspendExcNodes=<nodes>:N` now counts allocated nodes in `N`.
|
||||
The first `N` powered up nodes in <nodes> are protected from
|
||||
being suspended.
|
||||
+ Store job output, input and error paths in `SlurmDBD`.
|
||||
+ Add `USER_DELETE` reservation flag to allow users with access
|
||||
to a reservation to delete it.
|
||||
+ Add `SlurmctldParameters=enable_stepmgr` to enable step
|
||||
management through the `slurmstepd` instead of the controller.
|
||||
+ Added `PrologFlags=RunInJob` to make prolog and epilog run
|
||||
inside the job extern step to include it in the job's cgroup.
|
||||
+ Add ability to reserve MPI ports at the job level for stepmgr
|
||||
jobs and subdivide them at the step level.
|
||||
+ `slurmrestd` - Add `--generate-openapi-spec argument`.
|
||||
* Configuration File Changes (see appropriate man page for details)
|
||||
+ `CoreSpecPlugin` has been removed.
|
||||
+ Removed `TopologyPlugin` tree and dragonfly support from
|
||||
`select/linear`. If those topology plugins are desired please
|
||||
switch to `select/cons_tres`.
|
||||
+ Changed the default value for `UnkillableStepTimeout` to 60
|
||||
seconds or five times the value of `MessageTimeout`, whichever
|
||||
is greater.
|
||||
+ An error log has been added if `JobAcctGatherParams` '`UsePss`'
|
||||
or '`NoShare`' are configured with a plugin other than
|
||||
`jobacct_gather/linux`. In such case these parameters are ignored.
|
||||
+ `helpers.conf` - Added `Flags=rebootless` parameter allowing
|
||||
feature changes without rebooting compute nodes.
|
||||
+ `topology/block` - Replaced the `BlockLevels` with `BlockSizes`
|
||||
in `topology.conf`.
|
||||
+ Add `contain_spank` option to `SlurmdParameters`. When set,
|
||||
`spank_user_init()`, `spank_task_post_fork()`, and
|
||||
`spank_task_exit()` will execute within the
|
||||
`job_container/tmpfs` plugin namespace.
|
||||
+ Add `SlurmctldParameters=max_powered_nodes=N`, which prevents
|
||||
powering up nodes after the max is reached.
|
||||
+ Add `ExclusiveTopo` to a partition definition in `slurm.conf`.
|
||||
+ Add `AccountingStorageParameters=max_step_records` to limit how
|
||||
many steps are recorded in the database for each job - excluding
|
||||
batch.
|
||||
* Command Changes (see man pages for details)
|
||||
+ Add support for "elevenses" as an additional time specification.
|
||||
+ Add support for `sbcast --preserve` when `job_container/tmpfs`
|
||||
configured (previously documented as unsupported).
|
||||
+ `scontrol` - Add new subcommand `power` for node power control.
|
||||
+ `squeue` - Adjust `StdErr`, `StdOut`, and `StdIn` output formats.
|
||||
These will now consistently print "`(null)`" if a value is
|
||||
unavailable. `StdErr` will no longer display `StdOut` if it is
|
||||
not distinctly set. `StdOut` will now correctly display the
|
||||
default filename pattern for job arrays, and no longer show it
|
||||
for non-batch jobs. However, the expansion patterns will
|
||||
no longer be substituted by default.
|
||||
+ Add `--segment` to job allocation to be used in topology/block.
|
||||
+ Add `--exclusive=topo` for use with topology/block.
|
||||
+ `squeue` - Add `--expand-patterns` option to expand `StdErr`,
|
||||
`StdOut`, `StdIn` filename patterns as best as possible.
|
||||
+ `sacct` - Add `--expand-patterns` option to expand `StdErr`,
|
||||
`StdOut`, `StdIn` filename patterns as best as possible.
|
||||
+ `sreport` - Requesting `format=Planned` will now return the
|
||||
expected `Planned` time as documented, instead of `PlannedDown`.
|
||||
To request `Planned Down`, one must use now `format=PLNDDown`
|
||||
or `format=PlannedDown` explicitly. The abbreviations
|
||||
"`Pl`" or "`Pla`" will now make reference to Planned instead
|
||||
of `PlannedDown`.
|
||||
* API Changes
|
||||
+ Removed `ListIterator` type from `<slurm/slurm.h>`.
|
||||
+ Removed `slurm_xlate_job_id()` from `<slurm/slurm.h>`
|
||||
* SLURMRESTD Changes
|
||||
+ `openapi/dbv0.0.38` and `openapi/v0.0.38` plugins have been
|
||||
removed.
|
||||
+ `openapi/dbv0.0.39` and `openapi/v0.0.39` plugins have been
|
||||
tagged as deprecated to warn of their removal in the next release.
|
||||
+ Changed `slurmrestd.service` to only listen on TCP socket by
|
||||
default. Environments with existing drop-in units for the
|
||||
service may need further adjustments to work after upgrading.
|
||||
+ `slurmrestd` - Tagged `script` field as deprecated in
|
||||
`POST /slurm/v0.0.41/job/submit` in anticipation of removal in
|
||||
future OpenAPI plugin versions. Job submissions should set the
|
||||
`job.script` (or `jobs[0].script` for HetJobs) fields instead.
|
||||
+ `slurmrestd` - Attempt to automatically convert enumerated
|
||||
string arrays with incoming non-string values into strings.
|
||||
Add warning when incoming value for enumerated string arrays
|
||||
can not be converted to string and silently ignore instead of
|
||||
rejecting entire request. This change affects any endpoint that
|
||||
uses an enunmerated string as given in the OpenAPI specification.
|
||||
An example of this conversion would be to
|
||||
`POST /slurm/v0.0.41/job/submit` with `.job.exclusive = true`.
|
||||
While the JSON (boolean) true value matches a possible
|
||||
enumeration, it is not the expected "true" string. This change
|
||||
automatically converts the (boolean) `true` to (string) "`true`"
|
||||
avoiding a parsing failure.
|
||||
+ `slurmrestd` - Add `POST /slurm/v0.0.41/job/allocate` endpoint.
|
||||
This endpoint will create a new job allocation without any steps.
|
||||
The allocation will need to be ended via signaling the job or
|
||||
it will run to the timelimit.
|
||||
+ `slurmrestd` - Allow startup when `slurmdbd` is not configured
|
||||
and avoid loading `slurmdbd` specific plugins.
|
||||
* MPI/PMI2 Changes
|
||||
+ Jobs submitted with the `SLURM_HOSTFILE` environment variable
|
||||
set implies using an arbitrary distribution. Nevertheless, the
|
||||
logic used in PMI2 when generating their associated
|
||||
`PMI_process_mapping` values has been changed and will now be
|
||||
the same used for the plane distribution, as if `-m plane` were
|
||||
used. This has been changed because the original arbitrary
|
||||
distribution implementation did not account for multiple
|
||||
instances of the same host being present in `SLURM_HOSTFILE`,
|
||||
providing an incorrect process mapping in such case. This
|
||||
change also enables distributing tasks in blocks when using
|
||||
arbitrary distribution, which was not the case before. This
|
||||
only affects the `mpi/pmi2` plugin.
|
||||
- Removed Fix-test-21.41.patch as upstream test changed.
|
||||
- Dropped package plugin-ext-sensors-rrd as the plugin module no
|
||||
longer exists.
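
The `slurmrestd` enumerated-string conversion described in the entry above can be illustrated with a minimal sketch. This is not the slurmrestd parser; the function name `coerce_enum_strings` is hypothetical, and the sketch assumes that a non-string scalar is converted to its JSON spelling (so boolean `true` becomes the string "true") while values that cannot be converted are skipped with a warning rather than failing the whole request.

```python
import json


def coerce_enum_strings(values):
    """Convert incoming values of an enumerated string array into strings.

    Hypothetical helper for illustration only. Scalars are converted using
    their JSON spelling; anything else is ignored and reported as a warning.
    """
    converted = []
    warnings = []
    for value in values:
        if isinstance(value, str):
            # Already the expected type, keep as-is.
            converted.append(value)
        elif isinstance(value, (bool, int, float)):
            # json.dumps(True) yields "true", matching the enumeration name.
            converted.append(json.dumps(value))
        else:
            # Not convertible: warn and continue instead of rejecting.
            warnings.append(f"ignoring unconvertible value: {value!r}")
    return converted, warnings


if __name__ == "__main__":
    print(coerce_enum_strings([True, "false", {"x": 1}]))
```

The point of the design is tolerance: a client that sends `.job.exclusive = true` as a JSON boolean still gets a usable request instead of a parsing failure.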
|
||||
|
||||
-------------------------------------------------------------------
|
||||
Mon Mar 25 15:16:44 UTC 2024 - Christian Goll <cgoll@suse.com>
|
||||
|
||||
- removed Keep-logs-of-skipped-test-when-running-test-cases-sequentially.patch
|
||||
as incorporated upstream
|
||||
- Changes in Slurm 23.02.5
|
||||
* Add the `JobId` to `debug()` messages indicating when
|
||||
`cpus_per_task/mem_per_cpu` or `pn_min_cpus` are being
|
||||
automatically adjusted.
|
||||
* Fix regression in 23.02.2 that caused `slurmctld -R` to crash on
|
||||
startup if a node features plugin is configured.
|
||||
* Fix and prevent reoccurring reservations from overlapping.
|
||||
* `job_container/tmpfs` - Avoid attempts to share `BasePath`
|
||||
between nodes.
|
||||
* Change the log message warning for rate limited users from
|
||||
verbose to info.
|
||||
* With `CR_Cpu_Memory`, fix node selection for jobs that request
|
||||
gres and `--mem-per-cpu`.
|
||||
* Fix a regression from 22.05.7 in which some jobs were allocated
|
||||
too few nodes, thus overcommitting cpus to some tasks.
|
||||
* Fix a job being stuck in the completing state if the job ends
|
||||
while the primary controller is down or unresponsive and the
|
||||
backup controller has not yet taken over.
|
||||
* Fix `slurmctld` segfault when a node registers with a configured
|
||||
`CpuSpecList` while slurmctld configuration has the node without
|
||||
`CpuSpecList`.
|
||||
* Fix cloud nodes getting stuck in `POWERED_DOWN+NO_RESPOND` state
|
||||
after not registering by `ResumeTimeout`.
|
||||
* `slurmstepd` - Avoid cleanup of `config.json`-less containers
|
||||
spooldir getting skipped.
|
||||
* `slurmstepd` - Cleanup per task generated environment for
|
||||
containers in spooldir.
|
||||
* Fix `scontrol` segfault when 'completing' command requested
|
||||
repeatedly in interactive mode.
|
||||
* Properly handle a race condition between `bind()` and `listen()`
|
||||
calls in the network stack when running with `SrunPortRange` set.
|
||||
* Federation - Fix revoked jobs being returned regardless of the
|
||||
`-a`/`--all` option for privileged users.
|
||||
* Federation - Fix canceling pending federated jobs from non-origin
|
||||
clusters which could leave federated jobs orphaned from the origin
|
||||
cluster.
|
||||
* Fix sinfo segfault when printing multiple clusters with
|
||||
`--noheader` option.
|
||||
* Federation - fix clusters not syncing if clusters are added to
|
||||
a federation before they have registered with the dbd.
|
||||
* Change `pmi2` plugin to honor the `SrunPortRange` option. This
|
||||
matches the new behavior of the pmix plugin in 23.02.0. Note that
|
||||
neither of these plugins makes use of the "`MpiParams=ports=`"
|
||||
option, and previously were only limited by the system's ephemeral
|
||||
port range.
|
||||
* `node_features/helpers` - Fix node selection for jobs requesting
|
||||
changeable features with the '`|`' operator, which could prevent
|
||||
jobs from running on some valid nodes.
|
||||
* `node_features/helpers` - Fix inconsistent handling of '`&`' and
|
||||
'`|`', where an AND'd feature was sometimes AND'd to all sets of
|
||||
features instead of just the current set. E.g. "`foo|bar&baz`" was
|
||||
interpreted as `{foo,baz}` or `{bar,baz}` instead of how it is
|
||||
documented: "`{foo} or {bar,baz}`".
|
||||
* Fix job accounting so that when a job is requeued its allocated
|
||||
node count is cleared. After the requeue, sacct will correctly
|
||||
show that the job has 0 `AllocNodes` while it is pending or if
|
||||
it is canceled before restarting.
|
||||
* `sacct` - `AllocCPUS` now correctly shows 0 if a job has not yet
|
||||
received an allocation or if the job was canceled before getting
|
||||
one.
|
||||
* Fix intel oneapi autodetect: detect the `/dev/dri/renderD[0-9]+`
|
||||
gpus, and do not detect `/dev/dri/card[0-9]+`.
|
||||
* Format batch, extern, interactive, and pending step ids into
|
||||
strings that are human readable.
|
||||
* Fix node selection for jobs that request `--gpus` and a number
|
||||
of tasks fewer than gpus, which resulted in incorrectly rejecting
|
||||
these jobs.
|
||||
* Remove `MYSQL_OPT_RECONNECT` completely.
|
||||
* Fix cloud nodes in `POWERING_UP` state disappearing (getting set
|
||||
to `FUTURE`) when an `scontrol reconfigure` happens.
|
||||
* `openapi/dbv0.0.39` - Avoid assert / segfault on missing
|
||||
coordinators list.
|
||||
* `slurmrestd` - Correct memory leak while parsing OpenAPI
|
||||
specification templates with server overrides.
|
||||
* `slurmrestd` - Reduce memory usage when printing out job CPU
|
||||
frequency.
|
||||
* Fix overwriting user node reason with system message.
|
||||
* Remove `--uid` / `--gid` options from salloc and srun commands.
|
||||
* Prevent deadlock when rpc_queue is enabled.
|
||||
* `slurmrestd` - Correct OpenAPI specification generation bug where
|
||||
fields with overlapping parent paths would not get generated.
|
||||
* Fix memory leak as a result of a partition info query.
|
||||
* Fix memory leak as a result of a job info query.
|
||||
* `slurmrestd` - For `GET /slurm/v0.0.39/node[s]`, change format of
|
||||
node's energy field `current_watts` to a dictionary to account
|
||||
for unset value instead of dumping `4294967294`.
|
||||
* `slurmrestd` - For `GET /slurm/v0.0.39/qos`, change format of
|
||||
QOS's field `priority` to a dictionary to account for unset
|
||||
value instead of dumping `4294967294`.
|
||||
* `slurmrestd` - For `GET /slurm/v0.0.39/job[s]`, the `return code`
|
||||
code field in `v0.0.39_job_exit_code` will be set to 127 instead
|
||||
of being left unset where job does not have a relevant return code.
|
||||
* `data_parser/v0.0.39` - Add `required/memory_per_cpu` and
|
||||
`required/memory_per_node` to `sacct --json` and `sacct --yaml` and
|
||||
`GET /slurmdb/v0.0.39/jobs` from `slurmrestd`.
|
||||
* For step allocations, fix `--gres=none` sometimes not ignoring
|
||||
gres from the job.
|
||||
* Fix `--exclusive` jobs incorrectly gang-scheduling where they
|
||||
shouldn't.
|
||||
* Fix allocations with `CR_SOCKET`, gres not assigned to a specific
|
||||
socket, and block core distribution potentially allocating more
|
||||
sockets than required.
|
||||
* `gpu/oneapi` - Store cores correctly so CPU affinity is tracked.
|
||||
* Revert a change in 23.02.3 where Slurm would kill a script's
|
||||
process group as soon as the script ended instead of waiting as
|
||||
long as any process in
|
||||
that process group held the stdout/stderr file descriptors open.
|
||||
That change broke some scripts that relied on the previous
|
||||
behavior. Setting time limits for scripts (such as
|
||||
`PrologEpilogTimeout`) is strongly encouraged to avoid Slurm
|
||||
waiting indefinitely for scripts to finish.
|
||||
* Allow `slurmdbd -R` to work if the root assoc id is not 1.
|
||||
* Fix `slurmdbd -R` not returning an error under certain conditions.
|
||||
* `slurmdbd` - Avoid potential NULL pointer dereference in the
|
||||
mysql plugin.
|
||||
* Revert a change in 23.02 where `SLURM_NTASKS` was no longer
|
||||
set in the job's environment when `--ntasks-per-node` was
|
||||
requested.
|
||||
* Limit periodic node registrations to 50 instead of the full
|
||||
`TreeWidth`.
|
||||
Since unresolvable `cloud/dynamic` nodes must disable fanout by
|
||||
setting `TreeWidth` to a large number, this would cause all nodes
|
||||
to register at once.
|
||||
* Fix regression in 23.02.3 which broke X11 forwarding for hosts
|
||||
when `MUNGE` sends a localhost address in the encode host field.
|
||||
This is caused when the node hostname is mapped to 127.0.0.1
|
||||
(or similar) in `/etc/hosts`.
|
||||
* `openapi/[db]v0.0.39` - fix memory leak on parsing error.
|
||||
* `data_parser/v0.0.39` - fix updating qos for associations.
|
||||
* `openapi/dbv0.0.39` - fix updating values for associations with
|
||||
null users.
|
||||
* Fix minor memory leak with `--tres-per-task` and licenses.
|
||||
* Fix cyclic socket cpu distribution for tasks in a step where
|
||||
`--cpus-per-task` < usable threads per core.
|
||||
* Changes in Slurm 23.02.5
|
||||
* Add the JobId to debug() messages indicating when cpus_per_task/mem_per_cpu
|
||||
or pn_min_cpus are being automatically adjusted.
|
||||
* Fix regression in 23.02.2 that caused slurmctld -R to crash on startup if
|
||||
a node features plugin is configured.
|
||||
* Fix and prevent reoccurring reservations from overlapping.
|
||||
* job_container/tmpfs - Avoid attempts to share BasePath between nodes.
|
||||
* Change the log message warning for rate limited users from verbose to info.
|
||||
* With CR_Cpu_Memory, fix node selection for jobs that request gres and
|
||||
*-mem-per-cpu.
|
||||
* Fix a regression from 22.05.7 in which some jobs were allocated too few
|
||||
nodes, thus overcommitting cpus to some tasks.
|
||||
* Fix a job being stuck in the completing state if the job ends while the
|
||||
primary controller is down or unresponsive and the backup controller has
|
||||
not yet taken over.
|
||||
* Fix slurmctld segfault when a node registers with a configured CpuSpecList
|
||||
while slurmctld configuration has the node without CpuSpecList.
|
||||
* Fix cloud nodes getting stuck in POWERED_DOWN+NO_RESPOND state after not
|
||||
registering by ResumeTimeout.
|
||||
* slurmstepd - Avoid cleanup of config.json-less containers spooldir getting
|
||||
skipped.
|
||||
* slurmstepd - Cleanup per task generated environment for containers in
|
||||
spooldir.
|
||||
* Fix scontrol segfault when 'completing' command requested repeatedly in
|
||||
interactive mode.
|
||||
* Properly handle a race condition between bind() and listen() calls in the
|
||||
network stack when running with SrunPortRange set.
|
||||
* Federation - Fix revoked jobs being returned regardless of the -a/--all
|
||||
option for privileged users.
|
||||
* Federation - Fix canceling pending federated jobs from non-origin clusters
|
||||
which could leave federated jobs orphaned from the origin cluster.
|
||||
* Fix sinfo segfault when printing multiple clusters with --noheader option.
|
||||
* Federation - fix clusters not syncing if clusters are added to a federation
|
||||
before they have registered with the dbd.
|
||||
* Change pmi2 plugin to honor the SrunPortRange option. This matches the new
|
||||
behavior of the pmix plugin in 23.02.0. Note that neither of these plugins
|
||||
makes use of the "MpiParams=ports=" option, and previously were only limited
|
||||
by the systems ephemeral port range.
|
||||
* node_features/helpers - Fix node selection for jobs requesting changeable
|
||||
features with the '|' operator, which could prevent jobs from running on
|
||||
some valid nodes.
|
||||
* node_features/helpers - Fix inconsistent handling of '&' and '|', where an
|
||||
AND'd feature was sometimes AND'd to all sets of features instead of just
|
||||
the current set. E.g. "foo|bar&baz" was interpreted as {foo,baz} or
|
||||
{bar,baz} instead of how it is documented: "{foo} or {bar,baz}".
|
||||
* Fix job accounting so that when a job is requeued its allocated node count
|
||||
is cleared. After the requeue, sacct will correctly show that the job has
|
||||
0 AllocNodes while it is pending or if it is canceled before restarting.
|
||||
* sacct - AllocCPUS now correctly shows 0 if a job has not yet received an
|
||||
allocation or if the job was canceled before getting one.
|
||||
* Fix intel oneapi autodetect: detect the /dev/dri/renderD[0-9]+ gpus, and do
|
||||
not detect /dev/dri/card[0*9]+.
|
||||
* Format batch, extern, interactive, and pending step ids into strings that
|
||||
are human readable.
|
||||
* Fix node selection for jobs that request --gpus and a number of tasks fewer
|
||||
than gpus, which resulted in incorrectly rejecting these jobs.
|
||||
* Remove MYSQL_OPT_RECONNECT completely.
|
||||
* Fix cloud nodes in POWERING_UP state disappearing (getting set to FUTURE)
|
||||
when an `scontrol reconfigure` happens.
|
||||
* openapi/dbv0.0.39 - Avoid assert / segfault on missing coordinators list.
|
||||
* slurmrestd - Correct memory leak while parsing OpenAPI specification
|
||||
templates with server overrides.
|
||||
* slurmrestd - Reduce memory usage when printing out job CPU frequency.
|
||||
* Fix overwriting user node reason with system message.
|
||||
* Remove --uid / --gid options from salloc and srun commands.
|
||||
* Prevent deadlock when rpc_queue is enabled.
|
||||
* slurmrestd - Correct OpenAPI specification generation bug where fields with
|
||||
overlapping parent paths would not get generated.
|
||||
* Fix memory leak as a result of a partition info query.
|
||||
* Fix memory leak as a result of a job info query.
|
||||
* slurmrestd - For 'GET /slurm/v0.0.39/node[s]', change format of node's
|
||||
energy field "current_watts" to a dictionary to account for unset value
|
||||
instead of dumping 4294967294.
|
||||
* slurmrestd - For 'GET /slurm/v0.0.39/qos', change format of QOS's
|
||||
field "priority" to a dictionary to account for unset value instead of
|
||||
dumping 4294967294.
|
||||
* slurmrestd - For 'GET /slurm/v0.0.39/job[s]', the 'return code' code field
|
||||
in v0.0.39_job_exit_code will be set to *127 instead of being left unset
|
||||
where job does not have a relevant return code.
|
||||
* data_parser/v0.0.39 - Add required/memory_per_cpu and
|
||||
required/memory_per_node to `sacct *-json` and `sacct --yaml` and
|
||||
'GET /slurmdb/v0.0.39/jobs' from slurmrestd.
|
||||
* For step allocations, fix --gres=none sometimes not ignoring gres from the
|
||||
job.
|
||||
* Fix --exclusive jobs incorrectly gang-scheduling where they shouldn't.
|
||||
* Fix allocations with CR_SOCKET, gres not assigned to a specific socket, and
|
||||
block core distribution potentially allocating more sockets than required.
|
||||
* gpu/oneapi - Store cores correctly so CPU affinity is tracked.
|
||||
* Revert a change in 23.02.3 where Slurm would kill a script's process group
|
||||
as soon as the script ended instead of waiting as long as any process in
|
||||
that process group held the stdout/stderr file descriptors open. That change
|
||||
broke some scripts that relied on the previous behavior. Setting time limits
|
||||
for scripts (such as PrologEpilogTimeout) is strongly encouraged to avoid
|
||||
Slurm waiting indefinitely for scripts to finish.
|
||||
* Allow slurmdbd -R to work if the root assoc id is not 1.
|
||||
* Fix slurmdbd -R not returning an error under certain conditions.
|
||||
* slurmdbd - Avoid potential NULL pointer dereference in the mysql plugin.
|
||||
* Revert a change in 23.02 where SLURM_NTASKS was no longer set in the job's
|
||||
environment when *-ntasks-per-node was requested.
|
||||
* Limit periodic node registrations to 50 instead of the full TreeWidth.
|
||||
Since unresolvable cloud/dynamic nodes must disable fanout by setting
|
||||
TreeWidth to a large number, this would cause all nodes to register at
|
||||
once.
|
||||
* Fix regression in 23.02.3 which broke x11 forwarding for hosts when
|
||||
MUNGE sends a localhost address in the encode host field. This is caused
|
||||
when the node hostname is mapped to 127.0.0.1 (or similar) in /etc/hosts.
|
||||
* openapi/[db]v0.0.39 - fix memory leak on parsing error.
|
||||
* data_parser/v0.0.39 - fix updating qos for associations.
|
||||
* openapi/dbv0.0.39 - fix updating values for associations with null users.
|
||||
* Fix minor memory leak with --tres-per-task and licenses.
|
||||
* Fix cyclic socket cpu distribution for tasks in a step where
|
||||
--cpus-per-task < usable threads per core.
|
||||
- Changes in Slurm 23.02.4
|
||||
* Fix `sbatch` return code when `--wait` is requested on a job array.
|
||||
* `switch/hpe_slingshot` - avoid segfault when running with old
|
||||
libcxi.
|
||||
* Avoid slurmctld segfault when specifying
|
||||
`AccountingStorageExternalHost`.
|
||||
* Fix collected `GPUUtilization` values for `acct_gather_profile`
|
||||
plugins.
|
||||
* Fix sbatch return code when **wait is requested on a job array.
|
||||
* switch/hpe_slingshot * avoid segfault when running with old libcxi.
|
||||
* Avoid slurmctld segfault when specifying AccountingStorageExternalHost.
|
||||
* Fix collected GPUUtilization values for acct_gather_profile plugins.
|
||||
* Fix slurmrestd handling of job hold/release operations.
|
||||
* Make spank `S_JOB_ARGV` item value hold the requested command
|
||||
argv instead of the srun `--bcast` value when `--bcast` requested
|
||||
(only in local context).
|
||||
* Fix step running indefinitely when slurmctld takes more than
|
||||
`MessageTimeout` to respond. Now, `slurmctld` will cancel the
|
||||
step when detected, preventing following steps from getting stuck
|
||||
waiting for resources to be released.
|
||||
* Fix regression to make job_desc.min_cpus accurate again in
|
||||
job_submit when requesting a job with `--ntasks-per-node`.
|
||||
* `scontrol` - Permit changes to `StdErr` and `StdIn` for pending
|
||||
jobs.
|
||||
* `scontrol` - Reset std{err,in,out} when set to empty string.
|
||||
* `slurmrestd` - mark environment as a required field for job
|
||||
submission descriptions.
|
||||
* `slurmrestd` - avoid dumping null in OpenAPI schema required
|
||||
fields.
|
||||
* `data_parser/v0.0.39` - avoid rejecting valid `memory_per_node`
|
||||
formatted as dictionary provided with a job description.
|
||||
* `data_parser/v0.0.39` - avoid rejecting valid `memory_per_cpu`
|
||||
formatted as dictionary provided with a job description.
|
||||
* `slurmrestd` - Return HTTP error code 404 when job query fails.
|
||||
* `slurmrestd` - Add return schema to error response to job and
|
||||
license query.
|
||||
* Make spank S_JOB_ARGV item value hold the requested command argv instead of
|
||||
the srun **bcast value when **bcast requested (only in local context).
|
||||
* Fix step running indefinitely when slurmctld takes more than MessageTimeout
|
||||
to respond. Now, slurmctld will cancel the step when detected, preventing
|
||||
following steps from getting stuck waiting for resources to be released.
|
||||
* Fix regression to make job_desc.min_cpus accurate again in job_submit when
|
||||
requesting a job with **ntasks*per*node.
|
||||
* scontrol * Permit changes to StdErr and StdIn for pending jobs.
|
||||
* scontrol * Reset std{err,in,out} when set to empty string.
|
||||
* slurmrestd * mark environment as a required field for job submission
|
||||
descriptions.
|
||||
* slurmrestd * avoid dumping null in OpenAPI schema required fields.
|
||||
* data_parser/v0.0.39 * avoid rejecting valid memory_per_node formatted as
|
||||
dictionary provided with a job description.
|
||||
* data_parser/v0.0.39 * avoid rejecting valid memory_per_cpu formatted as
|
||||
dictionary provided with a job description.
|
||||
* slurmrestd * Return HTTP error code 404 when job query fails.
|
||||
* slurmrestd * Add return schema to error response to job and license query.
|
||||
* Fix handling of ArrayTaskThrottle in backfill.
|
||||
* Fix regression in 23.02.2 when checking gres state on `slurmctld`
|
||||
startup or reconfigure. Gres changes in the configuration were
|
||||
not updated on `slurmctld` startup. On startup or reconfigure,
|
||||
these messages were present in the log:
|
||||
"`error: Attempt to change gres/gpu Count`".
|
||||
* Fix regression in 23.02.2 when checking gres state on slurmctld startup or
|
||||
reconfigure. Gres changes in the configuration were not updated on slurmctld
|
||||
startup. On startup or reconfigure, these messages were present in the log:
|
||||
"error: Attempt to change gres/gpu Count".
|
||||
* Fix potential double count of gres when dealing with limits.
|
||||
* `switch/hpe_slingshot` - support alternate traffic class names
|
||||
with "`TC_`" prefix.
|
||||
* `scrontab` - Fix cutting off the final character of quoted
|
||||
variables.
|
||||
* Fix `slurmstepd` segfault when `ContainerPath` is not set in
|
||||
`oci.conf`.
|
||||
* Change the log message warning for rate limited users from
|
||||
debug to verbose.
|
||||
* Fixed an issue where jobs requesting licenses were incorrectly
|
||||
rejected.
|
||||
* `smail` - Fix issues where emails at job completion were not
|
||||
being sent.
|
||||
* `scontrol/slurmctld` - fix comma parsing when updating a
|
||||
reservation's nodes.
|
||||
* `cgroup/v2` - Avoid capturing log output for ebpf when
|
||||
constraining devices, as this can lead to inadvertent failure
|
||||
if the log buffer is too small.
|
||||
* Fix `--gpu-bind=single` binding tasks to wrong gpus, leading to
|
||||
some gpus having more tasks than they should and other gpus being
|
||||
unused.
|
||||
* Fix main scheduler loop not starting after failover to backup
|
||||
controller.
|
||||
* Added error message when attempting to use sattach on batch or
|
||||
extern steps.
|
||||
* Fix regression in 23.02 that causes slurmstepd to crash when
|
||||
`srun` requests more than `TreeWidth` nodes in a step and uses
|
||||
the `pmi2` or `pmix` plugin.
|
||||
* Reject job `ArrayTaskThrottle` update requests from unprivileged
|
||||
users.
|
||||
* `data_parser/v0.0.39` - populate description fields of property
|
||||
objects in generated OpenAPI specifications where defined.
|
||||
* `slurmstepd` - Avoid segfault caused by ContainerPath not being
|
||||
terminated by '`/`' in `oci.conf`.
|
||||
* `data_parser/v0.0.39` - Change `v0.0.39_job_info` response to tag
|
||||
`exit_code` field as being complex instead of only an unsigned
|
||||
integer.
|
||||
* `job_container/tmpfs` - Fix `%h` and `%n` substitution in `BasePath`
|
||||
where `%h` was substituted as the `NodeName` instead of the
|
||||
hostname, and `%n` was substituted as an empty string.
|
||||
* Fix regression where `--cpu-bind=verbose` would override
|
||||
`TaskPluginParam`.
|
||||
* `scancel` - Fix `--clusters`/`-M` for federations. Only filtered
|
||||
jobs (e.g. `-A`, `-u`, `-p`, etc.) from the specified clusters will be
|
||||
canceled, rather than all jobs in the federation.
|
||||
Specific jobids will still be routed to the origin cluster
|
||||
for cancellation.
|
||||
* switch/hpe_slingshot * support alternate traffic class names with "TC_"
|
||||
prefix.
|
||||
* scrontab * Fix cutting off the final character of quoted variables.
|
||||
* Fix slurmstepd segfault when ContainerPath is not set in oci.conf
|
||||
* Change the log message warning for rate limited users from debug to verbose.
|
||||
* Fixed an issue where jobs requesting licenses were incorrectly rejected.
|
||||
* smail * Fix issues where e*mails at job completion were not being sent.
|
||||
* scontrol/slurmctld * fix comma parsing when updating a reservation's nodes.
|
||||
* cgroup/v2 * Avoid capturing log output for ebpf when constraining devices,
|
||||
as this can lead to inadvertent failure if the log buffer is too small.
|
||||
* Fix **gpu*bind=single binding tasks to wrong gpus, leading to some gpus
|
||||
having more tasks than they should and other gpus being unused.
|
||||
* Fix main scheduler loop not starting after failover to backup controller.
|
||||
* Added error message when attempting to use sattach on batch or extern steps.
|
||||
* Fix regression in 23.02 that causes slurmstepd to crash when srun requests
|
||||
more than TreeWidth nodes in a step and uses the pmi2 or pmix plugin.
|
||||
* Reject job ArrayTaskThrottle update requests from unprivileged users.
|
||||
* data_parser/v0.0.39 * populate description fields of property objects in
|
||||
generated OpenAPI specifications where defined.
|
||||
* slurmstepd * Avoid segfault caused by ContainerPath not being terminated by
|
||||
'/' in oci.conf.
|
||||
* data_parser/v0.0.39 * Change v0.0.39_job_info response to tag exit_code
|
||||
field as being complex instead of only an unsigned integer.
|
||||
* job_container/tmpfs * Fix %h and %n substitution in BasePath where %h was
|
||||
substituted as the NodeName instead of the hostname, and %n was substituted
|
||||
as an empty string.
|
||||
* Fix regression where **cpu*bind=verbose would override TaskPluginParam.
|
||||
* scancel * Fix **clusters/*M for federations. Only filtered jobs (e.g. *A,
|
||||
*u, *p, etc.) from the specified clusters will be canceled, rather than all
|
||||
jobs in the federation. Specific jobids will still be routed to the origin
|
||||
cluster for cancellation.
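
The operator precedence described in the `node_features/helpers` entry of the 23.02.5 changes (an AND'd feature applies only to its own set, so "foo|bar&baz" means "{foo} or {bar,baz}") can be shown with a short sketch. This is not the plugin's code; the function name `feature_sets` is hypothetical, and the sketch deliberately ignores parentheses, counts and any other operators that a real feature expression may contain.

```python
def feature_sets(expression):
    """Return the alternative feature sets for a simple feature expression.

    Illustration only: splitting on '|' first and on '&' second makes '&'
    bind tighter than '|', which is the documented interpretation.
    """
    return [set(term.split("&")) for term in expression.split("|")]


if __name__ == "__main__":
    for features in feature_sets("foo|bar&baz"):
        print(sorted(features))
```

Under the old, inconsistent behavior the same expression was read as {foo,baz} or {bar,baz}; splitting on `|` before `&` avoids attaching `baz` to every alternative.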
|
||||
|
||||
|
||||
-------------------------------------------------------------------
|
||||
Mon Jan 29 13:47:55 UTC 2024 - Egbert Eich <eich@suse.com>
|
||||
@ -2758,6 +2207,7 @@ Fri Jul 2 08:01:32 UTC 2021 - Christian Goll <cgoll@suse.com>
|
||||
- Updated to 20.11.8:
|
||||
* slurmctld - fix erroneous "StepId=CORRUPT" messages in error logs.
|
||||
* Correct the error given when auth plugin fails to pack a credential.
|
||||
* Fix unused-variable compiler warning on FreeBSD in fd_resolve_path().
|
||||
* acct_gather_filesystem/lustre - only emit collection error once per step.
|
||||
* Add GRES environment variables (e.g., CUDA_VISIBLE_DEVICES) into the
|
||||
interactive step, the same as is done for the batch step.
|
||||
|
66
slurm.spec
@ -1,5 +1,5 @@
|
||||
#
|
||||
# spec file for package slurm
|
||||
# spec file
|
||||
#
|
||||
# Copyright (c) 2024 SUSE LLC
|
||||
#
|
||||
@ -17,10 +17,10 @@
|
||||
|
||||
|
||||
# Check file META in sources: update so_version to (API_CURRENT - API_AGE)
|
||||
%define so_version 41
|
||||
%define so_version 40
|
||||
# Make sure to update `upgrades` as well!
|
||||
%define ver 24.05.4
|
||||
%define _ver _24_05
|
||||
%define ver 23.11.5
|
||||
%define _ver _23_11
|
||||
%define dl_ver %{ver}
|
||||
# so-version is 0 and seems to be stable
|
||||
%define pmi_so 0
|
||||
@ -59,9 +59,6 @@ ExclusiveArch: do_not_build
|
||||
%if 0%{?sle_version} == 150500 || 0%{?sle_version} == 150600
|
||||
%define base_ver 2302
|
||||
%endif
|
||||
%if 0%{?sle_version} == 150500 || 0%{?sle_version} == 150600
|
||||
%define base_ver 2302
|
||||
%endif
|
||||
|
||||
%define ver_m %{lua:x=string.gsub(rpm.expand("%ver"),"%.[^%.]*$","");print(x)}
|
||||
# Keep format_spec_file from botching the define below:
|
||||
@ -173,6 +170,8 @@ Source20: test_setup.tar.gz
|
||||
Source21: README_Testsuite.md
|
||||
Patch0: Remove-rpath-from-build.patch
|
||||
Patch2: pam_slurm-Initialize-arrays-and-pass-sizes.patch
|
||||
Patch10: Fix-test-21.41.patch
|
||||
#Patch14: Keep-logs-of-skipped-test-when-running-test-cases-sequentially.patch
|
||||
Patch15: Fix-test7.2-to-find-libpmix-under-lib64-as-well.patch
|
||||
|
||||
%{upgrade_dep %pname}
|
||||
@ -407,6 +406,19 @@ Requires: %{name}-config = %{version}
|
||||
%description plugins
|
||||
This package contains the SLURM plugins (loadable shared objects)
|
||||
|
||||
%package plugin-ext-sensors-rrd
|
||||
Summary: SLURM ext_sensors/rrd Plugin (loadable shared objects)
|
||||
Group: Productivity/Clustering/Computing
|
||||
Requires: %{name}-plugins = %{version}
|
||||
%{upgrade_dep %{pname}-plugin-ext-sensors-rrd}
|
||||
# file was moved from slurm-plugins to here
|
||||
Conflicts: %{pname}-plugins < %{version}
|
||||
|
||||
%description plugin-ext-sensors-rrd
|
||||
This package contains the ext_sensors/rrd plugin used to read data
|
||||
using RRD, a tool that creates and manages a linear database for
|
||||
sampling and logging data.
|
||||
|
||||
%package torque
|
||||
Summary: Wrappers for transition from Torque/PBS to SLURM
|
||||
Group: Productivity/Clustering/Computing
|
||||
@ -517,7 +529,6 @@ This package contains just the minmal code to run a compute node.
|
||||
%package config
|
||||
Summary: Config files and directories for slurm services
|
||||
Group: Productivity/Clustering/Computing
|
||||
%{?sysusers_requires}
|
||||
Requires: logrotate
|
||||
BuildArch: noarch
|
||||
%if 0%{?suse_version} <= 1140
|
||||
@ -751,15 +762,9 @@ rm -rf %{buildroot}/%{_libdir}/slurm/*.{a,la} \
|
||||
%{buildroot}/%{_libdir}/*.la \
|
||||
%{buildroot}/%_lib/security/*.la
|
||||
|
||||
# Fix perl
|
||||
rm %{buildroot}%{perl_archlib}/perllocal.pod \
|
||||
%{buildroot}%{perl_sitearch}/auto/Slurm/.packlist \
|
||||
%{buildroot}%{perl_sitearch}/auto/Slurmdb/.packlist
|
||||
|
||||
mkdir -p %{buildroot}%{perl_vendorarch}
|
||||
|
||||
mv %{buildroot}%{perl_sitearch}/* \
|
||||
%{buildroot}%{perl_vendorarch}
|
||||
rm %{buildroot}/%{perl_archlib}/perllocal.pod \
|
||||
%{buildroot}/%{perl_vendorarch}/auto/Slurm/.packlist \
|
||||
%{buildroot}/%{perl_vendorarch}/auto/Slurmdb/.packlist
|
||||
|
||||
# Remove Cray specific binaries
|
||||
rm -f %{buildroot}/%{_sbindir}/capmc_suspend \
|
||||
@ -1081,6 +1086,7 @@ rm -rf /srv/slurm-testsuite/src /srv/slurm-testsuite/testsuite \
|
||||
%{?have_netloc:%{_bindir}/netloc_to_topology}
|
||||
%{_sbindir}/sackd
|
||||
%{_sbindir}/slurmctld
|
||||
%{_sbindir}/slurmsmwd
|
||||
%dir %{_libdir}/slurm/src
|
||||
%{_unitdir}/slurmctld.service
|
||||
%{_sbindir}/rcslurmctld
|
||||
@ -1158,10 +1164,9 @@ rm -rf /srv/slurm-testsuite/src /srv/slurm-testsuite/testsuite \
|
||||
%files -n perl-%{name}
|
||||
%{perl_vendorarch}/Slurm.pm
|
||||
%{perl_vendorarch}/Slurm
|
||||
%{perl_vendorarch}/Slurmdb.pm
|
||||
%{perl_vendorarch}/auto/Slurm
|
||||
%{perl_vendorarch}/Slurmdb.pm
|
||||
%{perl_vendorarch}/auto/Slurmdb
|
||||
%dir %{perl_vendorarch}/auto
|
||||
%{_mandir}/man3/Slurm*.3pm.*
|
||||
|
||||
%files slurmdbd
|
||||
@ -1184,7 +1189,6 @@ rm -rf /srv/slurm-testsuite/src /srv/slurm-testsuite/testsuite \
|
||||
%dir %{_libdir}/slurm
|
||||
%{_libdir}/slurm/libslurmfull.so
|
||||
%{_libdir}/slurm/accounting_storage_slurmdbd.so
|
||||
%{_libdir}/slurm/accounting_storage_ctld_relay.so
|
||||
%{_libdir}/slurm/acct_gather_energy_pm_counters.so
|
||||
%{_libdir}/slurm/acct_gather_energy_gpu.so
|
||||
%{_libdir}/slurm/acct_gather_energy_ibmaem.so
|
||||
@ -1193,7 +1197,6 @@ rm -rf /srv/slurm-testsuite/src /srv/slurm-testsuite/testsuite \
|
||||
%{_libdir}/slurm/acct_gather_filesystem_lustre.so
|
||||
%{_libdir}/slurm/burst_buffer_lua.so
|
||||
%{_libdir}/slurm/burst_buffer_datawarp.so
|
||||
%{_libdir}/slurm/data_parser_v0_0_41.so
|
||||
%{_libdir}/slurm/data_parser_v0_0_40.so
|
||||
%{_libdir}/slurm/data_parser_v0_0_39.so
|
||||
%{_libdir}/slurm/cgroup_v1.so
|
||||
@ -1211,13 +1214,12 @@ rm -rf /srv/slurm-testsuite/src /srv/slurm-testsuite/testsuite \
|
||||
%{_libdir}/slurm/gres_nic.so
|
||||
%{_libdir}/slurm/gres_shard.so
|
||||
%{_libdir}/slurm/hash_k12.so
|
||||
%{_libdir}/slurm/hash_sha3.so
|
||||
%{_libdir}/slurm/tls_none.so
|
||||
%{_libdir}/slurm/jobacct_gather_cgroup.so
|
||||
%{_libdir}/slurm/jobacct_gather_linux.so
|
||||
%{_libdir}/slurm/jobcomp_filetxt.so
|
||||
%{_libdir}/slurm/jobcomp_lua.so
|
||||
%{_libdir}/slurm/jobcomp_script.so
|
||||
%{_libdir}/slurm/job_container_cncu.so
|
||||
%{_libdir}/slurm/job_container_tmpfs.so
|
||||
%{_libdir}/slurm/job_submit_all_partitions.so
|
||||
%{_libdir}/slurm/job_submit_defaults.so
|
||||
@ -1251,7 +1253,6 @@ rm -rf /srv/slurm-testsuite/src /srv/slurm-testsuite/testsuite \
|
||||
%{_libdir}/slurm/serializer_url_encoded.so
|
||||
%{_libdir}/slurm/serializer_yaml.so
|
||||
%{_libdir}/slurm/site_factor_example.so
|
||||
%{_libdir}/slurm/switch_nvidia_imex.so
|
||||
%{_libdir}/slurm/task_affinity.so
|
||||
%{_libdir}/slurm/task_cgroup.so
|
||||
%{_libdir}/slurm/topology_3d_torus.so
|
||||
@ -1271,6 +1272,9 @@ rm -rf /srv/slurm-testsuite/src /srv/slurm-testsuite/testsuite \
|
||||
%{_libdir}/slurm/acct_gather_profile_influxdb.so
|
||||
%{_libdir}/slurm/jobcomp_elasticsearch.so
|
||||
|
||||
%files plugin-ext-sensors-rrd
|
||||
%{_libdir}/slurm/ext_sensors_rrd.so
|
||||
|
||||
%files lua
|
||||
%{_libdir}/slurm/job_submit_lua.so
|
||||
|
||||
@ -1306,6 +1310,8 @@ rm -rf /srv/slurm-testsuite/src /srv/slurm-testsuite/testsuite \
|
||||
%{_libdir}/slurm/openapi_slurmdbd.so
|
||||
%{_libdir}/slurm/openapi_dbv0_0_39.so
|
||||
%{_libdir}/slurm/openapi_v0_0_39.so
|
||||
%{_libdir}/slurm/openapi_dbv0_0_38.so
|
||||
%{_libdir}/slurm/openapi_v0_0_38.so
|
||||
%{_libdir}/slurm/rest_auth_local.so
|
||||
%endif
|
||||
|
||||
@ -1342,10 +1348,12 @@ rm -rf /srv/slurm-testsuite/src /srv/slurm-testsuite/testsuite \
|
||||
%files config-man
|
||||
%{_mandir}/man5/acct_gather.conf.*
|
||||
%{_mandir}/man5/burst_buffer.conf.*
|
||||
%{_mandir}/man5/ext_sensors.conf.*
|
||||
%{_mandir}/man5/slurm.*
|
||||
%{_mandir}/man5/cgroup.*
|
||||
%{_mandir}/man5/gres.*
|
||||
%{_mandir}/man5/helpers.*
|
||||
#%%{_mandir}/man5/nonstop.conf.5.*
|
||||
%{_mandir}/man5/oci.conf.5.gz
|
||||
%{_mandir}/man5/topology.*
|
||||
%{_mandir}/man5/knl.conf.5.*
|
||||
@ -1360,7 +1368,17 @@ rm -rf /srv/slurm-testsuite/src /srv/slurm-testsuite/testsuite \
|
||||
%endif
|
||||
|
||||
%files cray
|
||||
# do not remove cray specific packages from SLES update
|
||||
# Only for Cray
|
||||
%{_libdir}/slurm/core_spec_cray_aries.so
|
||||
%{_libdir}/slurm/job_submit_cray_aries.so
|
||||
%{_libdir}/slurm/select_cray_aries.so
|
||||
%{_libdir}/slurm/switch_cray_aries.so
|
||||
%{_libdir}/slurm/task_cray_aries.so
|
||||
%{_libdir}/slurm/proctrack_cray_aries.so
|
||||
%{_libdir}/slurm/mpi_cray_shasta.so
|
||||
%{_libdir}/slurm/node_features_knl_cray.so
|
||||
%{_libdir}/slurm/power_cray_aries.so
|
||||
|
||||
%if 0%{?slurm_testsuite}
|
||||
%files testsuite
|
||||
|
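
The `ver_m` macro in `slurm.spec` uses a Lua `string.gsub` with the pattern `%.[^%.]*$` to drop the last dot-separated component of `%ver`. A rough Python equivalent is sketched below to show what the macro evaluates to. The function names are hypothetical, and deriving `_ver` automatically is only an illustration: the spec file defines `_ver` by hand.

```python
import re


def major_version(ver):
    """Drop the final '.component', mirroring the Lua pattern in the spec."""
    return re.sub(r"\.[^.]*$", "", ver)


def underscore_version(ver):
    """Illustrative only: build the '_24_05' style string from a version."""
    return "_" + major_version(ver).replace(".", "_")


if __name__ == "__main__":
    for ver in ("24.05.4", "23.11.5"):
        print(ver, major_version(ver), underscore_version(ver))
```

This makes it easy to check that the hand-maintained `_ver` define stays in step with `ver` when the package version is changed.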