b318dfaa59
- Removed the testsuite as it's not part of normal release any more - Removed Fix-test7.2-to-find-libpmix-under-lib64-as-well.patch - Update to version 25.11.2.1 * slurmstepd - Revert regression that would apply job environment to container runtime invocation. * Fix issue where reservations may start while required GRES resources are still being used by jobs. * Fix slurmctld segfault when using --consolidate-segments. * Expose slurm.CONSOLIDATE_SEGMENTS flag in lua. * Expose the job record's segment_size in lua. * job_submit/lua - Expose the job_desc's segment_size in lua. * Prevent PMIx 5.0.8 and 5.0.9 clients from hanging when connecting to the PMIx server. * Clarify warning when BPF tokens are not supported. * slurmctld - Ensure we close already accepted conn before RPC flush check * slurmctld - Fix rpc_queue feature causing statesave corruption while shutdown * slurmctld - Ensure backfill has finished before saving state. * slurmctld - Ensure main scheduler has finished before saving state. * slurmctld - Fix error message while shutting down and state cannot be saved. * Fix slurmctld double free that occurs when purging array jobs from memory only when using the topology/block plugin. * Fix steps being rejected inside a batch job when using --cpus-per-task and --mem-per-cpu, and the job was submitted to multiple partitions, but not all of them had the same MaxMemPerCPU limit in place. * slurmctld - Fix crash after failed reconfiguration while running jobs and priority/multifactor enabled. * slurmctld - Fix jobs' QOS/association usage leading to potential underflow errors after a failed reconfiguration attempt.
Egbert Eich2026-02-11 19:19:57 +00:00
fc445417ff
Accepting request 1280446 from network:cluster
Ana Guerrero2025-05-27 16:43:12 +00:00
fd9c2e5d9a
Accepting request 1272424 from network:cluster
Ana Guerrero2025-04-24 15:27:08 +00:00
933ba9114e
- removed openmpi4-hpc dependency for test suite
Christian Goll2025-04-24 12:32:13 +00:00
2fafb11ecf
Accepting request 1272423 from home:mslacken:branches:network:cluster
Christian Goll2025-04-24 12:32:13 +00:00
a78ab40480
* Update to version 24.11.3. * Sync upgrades file to reflect last updated versions. * Pass '-DH5_USE_112_API -DDH5Oget_info_vers=1' to CFLAGS to allow building with hdf5 1.14 as slurm does not yet support HDF5 v114 API.
Christian Goll2025-04-24 12:18:13 +00:00
d8ba8ce721
Accepting request 1253003 from home:badshah400:hdf5-update:Staging
Christian Goll2025-04-24 12:18:13 +00:00
a28d9f9bdb
Update to version 24.11.1: * With client commands MIN_MEMORY will show mem_per_tres if specified. * Fix errno message about bad constraint. * slurmctld - Fix crash and possible split brain issue if the backup controller handles an scontrol reconfigure while in control before the primary resumes operation. * Fix stepmgr not getting dynamic node addrs from the controller. * stepmgr - avoid "Unexpected missing socket" errors. * Fix scontrol show steps with dynamic stepmgr. * Deny jobs using the "R:" option of --signal if PreemptMode=OFF globally. * Force jobs using the "R:" option of --signal to be preemptable by requeue or cancel only. If PreemptMode on the partition or QOS is off or suspend, the job will default to using PreemptMode=cancel. * If --mem-per-cpu exceeds MaxMemPerCPU, the number of CPUs per task will always be increased even if --cpus-per-task was specified. This is needed to ensure each task gets the expected amount of memory. * Fix compilation issue on OpenSUSE Leap 15. * Fix jobs using more nodes than needed when not using -N. * Fix issue with allocation being allocated fewer resources than needed when using --gres-flags=enforce-binding. * select/cons_tres - Fix errors with MaxCpusPerSocket partition limit. Used CPUs/cores weren't counted properly, nor limiting free ones to avail, when the socket was partially allocated, or the job request went beyond this limit. * Fix issue when jobs were preempted for licenses even if there
Egbert Eich2025-02-08 09:34:28 +00:00
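The 24.11.1 note on --mem-per-cpu and MaxMemPerCPU can be illustrated with a small arithmetic sketch. This is not Slurm's code: the function name adjust_cpus_per_task and the rounding rule (keep the memory per task constant, round the CPU count up) are assumptions made for illustration only.

```python
import math


def adjust_cpus_per_task(mem_per_cpu_mb, cpus_per_task, max_mem_per_cpu_mb):
    """Hypothetical helper: raise the CPU count so that no CPU exceeds the limit.

    Assumption: the memory each task expects (mem_per_cpu * cpus_per_task)
    stays constant and is spread over more CPUs when the per-CPU request
    exceeds the limit.
    """
    if mem_per_cpu_mb <= max_mem_per_cpu_mb:
        # Request is within the limit, nothing to adjust.
        return cpus_per_task, mem_per_cpu_mb

    mem_per_task_mb = mem_per_cpu_mb * cpus_per_task
    # Smallest CPU count that keeps every CPU at or below the limit.
    new_cpus = math.ceil(mem_per_task_mb / max_mem_per_cpu_mb)
    new_mem_per_cpu_mb = math.ceil(mem_per_task_mb / new_cpus)
    return new_cpus, new_mem_per_cpu_mb


if __name__ == "__main__":
    # A task asking for 1 CPU with 2000 MB under a 1000 MB per-CPU limit.
    print(adjust_cpus_per_task(2000, 1, 1000))
```

Under these assumptions the CPU count grows even when --cpus-per-task was given explicitly, which is the behaviour the entry describes.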
991730f2b9
Accepting request 1244326 from home:eeich:branches:network:cluster
Egbert Eich2025-02-08 09:34:28 +00:00
247a29f2a0
* slurmrestd - Remove deprecated fields from the following .result from POST /slurm/v0.0.42/job/submit. .job_id, .step_id, .job_submit_user_msg from POST /slurm/v0.0.42/job/{job_id}. .job.exclusive, .jobs[].exclusive to POST /slurm/v0.0.42/job/submit. .jobs[].exclusive from GET /slurm/v0.0.42/job/{job_id}. .jobs[].exclusive from GET /slurm/v0.0.42/jobs. .job.oversubscribe, .jobs[].oversubscribe to POST /slurm/v0.0.42/job/submit. .jobs[].oversubscribe from GET /slurm/v0.0.42/job/{job_id}. .jobs[].oversubscribe from GET /slurm/v0.0.42/jobs. DELETE /slurm/v0.0.40/jobs DELETE /slurm/v0.0.41/jobs DELETE /slurm/v0.0.42/jobs allocation is granted. job|socket|task or cpus|mem per GRES. node update whereas previously only single nodes could be updated through /node/<nodename> endpoint: POST /slurm/v0.0.42/nodes partition as this is a cluster-wide option. REQUEST_NODE_INFO RPC. the db server is not reachable. (.jobs[].priority_by_partition) to JSON and YAML output. connection error if the error was the result of an authentication failure. errors with the SLURM_PROTOCOL_AUTHENTICATION_ERROR error code. of Unspecified error if querying the following endpoints fails: GET /slurm/v0.0.40/diag/ GET /slurm/v0.0.41/diag/ GET /slurm/v0.0.42/diag/
Egbert Eich2025-01-17 21:14:19 +00:00
e91b0b1f3d
Accepting request 1238576 from home:eeich:branches:network:cluster
Egbert Eich2025-01-17 21:14:19 +00:00
3a3588a812
- Make test suite package work on SLE-12.
Egbert Eich2025-01-17 20:34:50 +00:00
074269c693
Accepting request 1238572 from home:eeich:branches:network:cluster
Egbert Eich2025-01-17 20:34:50 +00:00
3b4d2235f3
Accepting request 1236247 from network:cluster
Ana Guerrero2025-01-12 10:14:54 +00:00
deb23af74b
Accepting request 1236247 from network:cluster
Ana Guerrero2025-01-12 10:14:54 +00:00
9b10244202
Accepting request 1236246 from home:eeich:slurmtest
Egbert Eich2025-01-09 15:43:36 +00:00
e8b6930a42
Accepting request 1235784 from network:cluster
Ana Guerrero2025-01-09 14:07:22 +00:00
ffa6c9dc65
Accepting request 1235784 from network:cluster
Ana Guerrero2025-01-09 14:07:22 +00:00
626fb47a3b
- Update to version 24.11 * slurmctld - Reject arbitrary distribution jobs that do not specify a task count. * Fix backwards compatibility of the RESPONSE_JOB_INFO RPC (used by squeue, scontrol show job, etc.) with Slurm clients version 24.05 and below. This was a regression in 24.11.0rc1. * Do not let slurmctld/slurmd start if there are more nodes defined in slurm.conf than the maximum supported amount (64k nodes). * slurmctld - Set job's exit code to 1 when a job fails with state JOB_NODE_FAIL. This fixes sbatch --wait not being able to exit with error code when a job fails for this reason in some cases. * Fix certain reservation updates requested from 23.02 clients. * slurmrestd - Fix populating non-required object fields of objects as {} in JSON/YAML instead of null causing compiled OpenAPI clients to reject the response to GET /slurm/v0.0.40/jobs due to validation failure of .jobs[].job_resources. * Fix issue where older versions of Slurm talking to a 24.11 dbd could lose step accounting. * Fix minor memory leaks. * Fix bad memory reference when xstrchr fails to find char. * Remove duplicate checks for a data structure. * Fix race condition in stepmgr step completion handling. * slurm.spec - add ability to specify patches to apply on the command line. * slurm.spec - add ability to supply extra version information. * Fix 24.11 HA issues. * Fix requeued jobs keeping their priority until the decay thread
Egbert Eich2025-01-08 06:03:29 +00:00
eb440fa877
Accepting request 1235783 from home:eeich:branches:network:cluster
Egbert Eich2025-01-08 06:03:29 +00:00
b1107f7a34
- Update to version 24.05.4 & fix for CVE-2024-48936. * Fix generic int sort functions. * Fix user look up using possible unrealized uid in the dbd. * slurmrestd - Fix regressions that allowed slurmrestd to be run as SlurmUser when SlurmUser was not root. * mpi/pmix - Fix race conditions with het jobs at step start/end which could make srun hang. * Fix not showing some SelectTypeParameters in scontrol show config. * Avoid assert when dumping certain removed fields in JSON/YAML. * Improve how shards are scheduled with affinity in mind. * Fix MaxJobsAccruePU not being respected when MaxJobsAccruePA is set in the same QOS. * Prevent backfill from planning jobs that use overlapping resources for the same time slot if the job's time limit is less than bf_resolution. * Fix memory leak when requesting typed gres and --[cpus|mem]-per-gpu. * Prevent backfill from breaking out due to "system state changed" every 30 seconds if reservations use REPLACE or REPLACE_DOWN flags. * slurmrestd - Make sure that scheduler_unset parameter defaults to true even when the following flags are also set: show_duplicates, skip_steps, disable_truncate_usage_time, run_away_jobs, whole_hetjob, disable_whole_hetjob, disable_wait_for_result, usage_time_as_submit_time, show_batch_script, and/or show_job_environment. Additionally, always make sure show_duplicates and disable_truncate_usage_time default to true when the following flags are also set: scheduler_unset, scheduled_on_submit,
Egbert Eich2024-11-01 13:22:34 +00:00
b5c2003459
Accepting request 1220075 from home:eeich:branches:network:cluster
Egbert Eich2024-11-01 13:22:34 +00:00
3133935d61
Accepting request 1217321 from network:cluster
Ana Guerrero2024-10-24 13:42:28 +00:00
098440c057
Accepting request 1217321 from network:cluster
Ana Guerrero2024-10-24 13:42:28 +00:00
427f09ad29
- Add %{?sysusers_requires} to slurm-config. This fixes issues when building against Slurm.
Egbert Eich2024-10-23 09:42:56 +00:00
08067e260e
Accepting request 1217300 from home:eeich:branches:network:cluster
Egbert Eich2024-10-23 09:42:56 +00:00
de9dc95156
Accepting request 1208086 from network:cluster
Ana Guerrero2024-10-15 13:01:34 +00:00
9d6a481e12
Accepting request 1208086 from network:cluster
Ana Guerrero2024-10-15 13:01:34 +00:00
1cc2983ebe
- Removed Fix-test-21.41.patch as upstream test changed. - Dropped package plugin-ext-sensors-rrd as the plugin module no longer exists.
Egbert Eich2024-10-15 10:19:24 +00:00
fa96a81dac
- Removed Fix-test-21.41.patch as upstream test changed. - Dropped package plugin-ext-sensors-rrd as the plugin module no longer exists.
Egbert Eich2024-10-15 10:19:24 +00:00
b2f6e848a1
- Update to version 24.05.3 * data_parser/v0.0.40 - Added field descriptions. * slurmrestd - Avoid creating new slurmdbd connection per request to /slurm/slurmctld/*/* endpoints. * Fix compilation issue with switch/hpe_slingshot plugin. * Fix gres per task allocation with threads-per-core. * data_parser/v0.0.41 - Added field descriptions. * slurmrestd - Change back generated OpenAPI schema for DELETE /slurm/v0.0.40/jobs/ to RequestBody instead of using parameters for request. slurmrestd will continue to accept endpoint requests via RequestBody or HTTP query. * topology/tree - Fix issues with switch distance optimization. * Fix potential segfault of secondary slurmctld when falling back to the primary when running with a JobComp plugin. * Enable --json/--yaml=v0.0.39 options on client commands to dump data using data_parser/v0.0.39 instead of outputting nothing. * switch/hpe_slingshot - Fix issue that could result in a 0 length state file. * Fix unnecessary message protocol downgrade for unregistered nodes. * Fix unnecessarily packing alias addrs when terminating jobs with a mix of non-cloud/dynamic nodes and powered down cloud/dynamic nodes. * accounting_storage/mysql - Fix issue when deleting a qos that could remove too many commas from the qos and/or delta_qos fields of the assoc table. * slurmctld - Fix memory leak when using RestrictedCoresPerGPU. * Fix allowing access to reservations without MaxStartDelay set. * Fix regression introduced in 24.05.0rc1 breaking srun --send-libs parsing. * Fix slurmd vsize memory leak when using job submission/allocation
Egbert Eich2024-10-15 06:51:09 +00:00
4da5a0dbb6
Accepting request 1208035 from home:eeich:branches:network:cluster
Egbert Eich2024-10-15 06:51:09 +00:00
fc209e050f
- Updated to new release 24.05.0 with the following major changes - IMPORTANT NOTES: If using the slurmdbd (Slurm DataBase Daemon) you must update this first. NOTE: If using a backup DBD you must start the primary first to do any database conversion, the backup will not start until this has happened. The 24.05 slurmdbd will work with Slurm daemons of version 23.02 and above. You will not need to update all clusters at the same time, but it is very important to update slurmdbd first and have it running before updating any other clusters making use of it. - HIGHLIGHTS * Federation - allow client command operation when slurmdbd is unavailable. * burst_buffer/lua - Added two new hooks: slurm_bb_test_data_in and slurm_bb_test_data_out. The syntax and use of the new hooks are documented in etc/burst_buffer.lua.example. These are required to exist. slurmctld now checks on startup if the burst_buffer.lua script loads and contains all required hooks; slurmctld will exit with a fatal error if this is not successful. Added PollInterval to burst_buffer.conf. Removed the arbitrary limit of 512 copies of the script running simultaneously. * Add QOS limit MaxTRESRunMinsPerAccount. * Add QOS limit MaxTRESRunMinsPerUser. * Add ELIGIBLE environment variable to jobcomp/script plugin. * Always use the QOS name for SLURM_JOB_QOS environment variables. Previously the batch environment would use the description field, which was usually equivalent to the name. * cgroup/v2 - Require dbus-1 version >= 1.11.16. * Allow NodeSet names to be used in SuspendExcNodes.
Egbert Eich2024-10-14 10:03:00 +00:00
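The upgrade-order note in the 24.05.0 entry (slurmdbd first; the 24.05 slurmdbd works with daemons of version 23.02 and above) can be sketched as a pre-upgrade check. The helper names major_minor and dbd_accepts are hypothetical and not part of Slurm. The upper bound (daemons must not be newer than slurmdbd) is an assumption drawn from the "update slurmdbd first" rule rather than something the entry states explicitly.

```python
def major_minor(version):
    """Return the (major, minor) pair of a version string such as '24.05.4'."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def dbd_accepts(daemon_version, dbd_version="24.05", oldest_supported="23.02"):
    """Hypothetical check: is this daemon release inside the window slurmdbd serves?

    Assumes the window is oldest_supported <= daemon <= dbd_version,
    compared on major.minor only.
    """
    daemon = major_minor(daemon_version)
    return major_minor(oldest_supported) <= daemon <= major_minor(dbd_version)


if __name__ == "__main__":
    for release in ("22.05.9", "23.02.7", "23.11.4", "24.05.0"):
        print(release, dbd_accepts(release))
```

A check like this could be run against every cluster that shares a slurmdbd before the database daemon is upgraded.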
8817725c9b
Accepting request 1178495 from home:mslacken:branches:network:cluster
Egbert Eich2024-10-14 10:03:00 +00:00
61add11d2b
Accepting request 1161658 from network:cluster
Ana Guerrero2024-03-26 18:27:40 +00:00
4ea9d6675e
Accepting request 1161658 from network:cluster
Ana Guerrero2024-03-26 18:27:40 +00:00
cda5ce024e
Accepting request 1161499 from home:mslacken:branches:network:cluster
Christian Goll2024-03-26 08:40:44 +00:00
2f8f05b750
Accepting request 1161499 from home:mslacken:branches:network:cluster
Christian Goll2024-03-26 08:40:44 +00:00
4ec0f5cd48
Accepting request 1151965 from network:cluster
Ana Guerrero2024-02-27 21:47:57 +00:00
5b034001ae
Accepting request 1151965 from network:cluster
Ana Guerrero2024-02-27 21:47:57 +00:00
fb460ebe6a
Accepting request 1150524 from home:eeich:branches:network:cluster
Egbert Eich2024-02-26 21:40:59 +00:00
c65ee9d140
Accepting request 1150524 from home:eeich:branches:network:cluster
Egbert Eich2024-02-26 21:40:59 +00:00
6a021ebb80
Accepting request 1141442 from network:cluster
Ana Guerrero2024-01-25 17:41:05 +00:00
d9576d9326
Accepting request 1141442 from network:cluster
Ana Guerrero2024-01-25 17:41:05 +00:00
f98ecb23d5
- Remove last change. This is not how it is intended to work
Egbert Eich2024-01-25 07:58:54 +00:00
27950e7688
- Remove last change. This is not how it is intended to work
Egbert Eich2024-01-25 07:58:54 +00:00
a95f2355d0
Accepting request 1141020 from home:dimstar:Factory
Christian Goll2024-01-24 14:43:56 +00:00
ee5117c41a
Accepting request 1141020 from home:dimstar:Factory
Christian Goll2024-01-24 14:43:56 +00:00
e59754da76
CVE-2023-49933, CVE-2023-49934, CVE-2023-49935, CVE-2023-49936 and CVE-2023-49937 * Substantially overhauled the SlurmDBD association management code. For clusters updated to 23.11, account and user additions or removals are significantly faster than in prior releases. * Overhauled scontrol reconfigure to prevent configuration mistakes from disabling slurmctld and slurmd. Instead, an error will be returned, and the running configuration will persist. This does require updates to the systemd service files to use the --systemd option to slurmctld and slurmd. * Added a new internal auth/cred plugin - auth/slurm. This builds off the prior auth/jwt model, and permits operation of the slurmdbd and slurmctld without access to full directory information with a suitable configuration. * Added a new --external-launcher option to srun, which is automatically set by common MPI launcher implementations and ensures processes using those non-srun launchers have full access to all resources allocated on each node. * Reworked the dynamic/cloud modes of operation to allow for "fanout" - where Slurm communication can be automatically offloaded to compute nodes for increased cluster scalability. * Overhauled and extended the Reservation subsystem to allow for most of the same resource requirements as are placed on the job. Notably, this permits reservations to now reserve GRES directly. * Fix scontrol update job=... TimeLimit+=/-= when used with a raw JobId of job array element. * Reject TimeLimit increment/decrement when called on job with TimeLimit=UNLIMITED.
Egbert Eich2024-01-22 16:26:43 +00:00
f99aa61fe3
CVE-2023-49933, CVE-2023-49934, CVE-2023-49935, CVE-2023-49936 and CVE-2023-49937 * Substantially overhauled the SlurmDBD association management code. For clusters updated to 23.11, account and user additions or removals are significantly faster than in prior releases. * Overhauled scontrol reconfigure to prevent configuration mistakes from disabling slurmctld and slurmd. Instead, an error will be returned, and the running configuration will persist. This does require updates to the systemd service files to use the --systemd option to slurmctld and slurmd. * Added a new internal auth/cred plugin - auth/slurm. This builds off the prior auth/jwt model, and permits operation of the slurmdbd and slurmctld without access to full directory information with a suitable configuration. * Added a new --external-launcher option to srun, which is automatically set by common MPI launcher implementations and ensures processes using those non-srun launchers have full access to all resources allocated on each node. * Reworked the dynamic/cloud modes of operation to allow for "fanout" - where Slurm communication can be automatically offloaded to compute nodes for increased cluster scalability. * Overhauled and extended the Reservation subsystem to allow for most of the same resource requirements as are placed on the job. Notably, this permits reservations to now reserve GRES directly. * Fix scontrol update job=... TimeLimit+=/-= when used with a raw JobId of job array element. * Reject TimeLimit increment/decrement when called on job with TimeLimit=UNLIMITED.
Egbert Eich2024-01-22 16:26:43 +00:00
e7275730c8
Accepting request 1138332 from home:mslacken:branches:network:cluster
Egbert Eich2024-01-22 15:21:33 +00:00
b53ef1c220
Accepting request 1138332 from home:mslacken:branches:network:cluster
Egbert Eich2024-01-22 15:21:33 +00:00