247a29f2a0 * slurmrestd - Remove deprecated fields from the following:
  .result from POST /slurm/v0.0.42/job/submit.
  .job_id, .step_id, .job_submit_user_msg from POST /slurm/v0.0.42/job/{job_id}.
  .job.exclusive, .jobs[].exclusive to POST /slurm/v0.0.42/job/submit.
  .jobs[].exclusive from GET /slurm/v0.0.42/job/{job_id}.
  .jobs[].exclusive from GET /slurm/v0.0.42/jobs.
  .job.oversubscribe, .jobs[].oversubscribe to POST /slurm/v0.0.42/job/submit.
  .jobs[].oversubscribe from GET /slurm/v0.0.42/job/{job_id}.
  .jobs[].oversubscribe from GET /slurm/v0.0.42/jobs.
  DELETE /slurm/v0.0.40/jobs, DELETE /slurm/v0.0.41/jobs, DELETE /slurm/v0.0.42/jobs
* allocation is granted.
* job|socket|task or cpus|mem per GRES.
* node update, whereas previously only single nodes could be updated through the /node/<nodename> endpoint: POST /slurm/v0.0.42/nodes
* partition as this is a cluster-wide option.
* REQUEST_NODE_INFO RPC.
* the db server is not reachable.
* (.jobs[].priority_by_partition) to JSON and YAML output.
* connection error if the error was the result of an authentication failure.
* errors with the SLURM_PROTOCOL_AUTHENTICATION_ERROR error code.
* of Unspecified error if querying the following endpoints fails: GET /slurm/v0.0.40/diag/, GET /slurm/v0.0.41/diag/, GET /slurm/v0.0.42/diag/
devel
Egbert Eich2025-01-17 21:14:19 +0000
3a3588a812- Make test suite package work on SLE-12.Egbert Eich2025-01-17 20:34:50 +0000
3b4d2235f3Accepting request 1236247 from network:cluster
Ana Guerrero
2025-01-12 10:14:54 +0000
e8b6930a42Accepting request 1235784 from network:cluster
Ana Guerrero
2025-01-09 14:07:22 +0000
626fb47a3b - Update to version 24.11
  * slurmctld - Reject arbitrary distribution jobs that do not specify a task count.
  * Fix backwards compatibility of the RESPONSE_JOB_INFO RPC (used by squeue, scontrol show job, etc.) with Slurm clients version 24.05 and below. This was a regression in 24.11.0rc1.
  * Do not let slurmctld/slurmd start if there are more nodes defined in slurm.conf than the maximum supported amount (64k nodes).
  * slurmctld - Set a job's exit code to 1 when the job fails with state JOB_NODE_FAIL. This fixes sbatch --wait not being able to exit with an error code when a job fails for this reason in some cases.
  * Fix certain reservation updates requested from 23.02 clients.
  * slurmrestd - Fix populating non-required object fields of objects as {} in JSON/YAML instead of null, causing compiled OpenAPI clients to reject the response to GET /slurm/v0.0.40/jobs due to validation failure of .jobs[].job_resources.
  * Fix issue where older versions of Slurm talking to a 24.11 dbd could lose step accounting.
  * Fix minor memory leaks.
  * Fix bad memory reference when xstrchr fails to find a char.
  * Remove duplicate checks for a data structure.
  * Fix race condition in stepmgr step completion handling.
  * slurm.spec - Add ability to specify patches to apply on the command line.
  * slurm.spec - Add ability to supply extra version information.
  * Fix 24.11 HA issues.
  * Fix requeued jobs keeping their priority until the decay thread
Egbert Eich 2025-01-08 06:03:29 +0000
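The slurmrestd fix above concerns optional object fields being serialized as {} instead of null, which strict compiled OpenAPI clients reject during schema validation. A minimal sketch of a client-side workaround for responses from an unpatched server; the function name, the nullable-field set, and the sample payload are illustrative, not part of Slurm:

```python
import json

def normalize_nullables(obj, nullable_keys):
    """Recursively replace {} with None for known nullable object fields."""
    if isinstance(obj, dict):
        return {
            k: None if k in nullable_keys and v == {} else normalize_nullables(v, nullable_keys)
            for k, v in obj.items()
        }
    if isinstance(obj, list):
        return [normalize_nullables(v, nullable_keys) for v in obj]
    return obj

# Shaped like a GET /slurm/v0.0.40/jobs response with an empty job_resources.
raw = '{"jobs": [{"job_id": 42, "job_resources": {}}]}'
cleaned = normalize_nullables(json.loads(raw), {"job_resources"})
print(cleaned["jobs"][0]["job_resources"])  # None
```

Applied before validation, this makes such a response acceptable to a client that models .jobs[].job_resources as a nullable object.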
17d576bce0Accepting request 1220076 from network:cluster
Dominique Leuenberger
2024-11-01 20:07:50 +0000
b1107f7a34 - Update to version 24.05.4 & fix for CVE-2024-48936.
  * Fix generic int sort functions.
  * Fix user look-up using a possibly unrealized uid in the dbd.
  * slurmrestd - Fix regressions that allowed slurmrestd to be run as SlurmUser when SlurmUser was not root.
  * mpi/pmix - Fix race conditions with het jobs at step start/end which could make srun hang.
  * Fix not showing some SelectTypeParameters in scontrol show config.
  * Avoid assert when dumping removed certain fields in JSON/YAML.
  * Improve how shards are scheduled with affinity in mind.
  * Fix MaxJobsAccruePU not being respected when MaxJobsAccruePA is set in the same QOS.
  * Prevent backfill from planning jobs that use overlapping resources for the same time slot if the job's time limit is less than bf_resolution.
  * Fix memory leak when requesting typed gres and --[cpus|mem]-per-gpu.
  * Prevent backfill from breaking out due to "system state changed" every 30 seconds if reservations use REPLACE or REPLACE_DOWN flags.
  * slurmrestd - Make sure that the scheduler_unset parameter defaults to true even when the following flags are also set: show_duplicates, skip_steps, disable_truncate_usage_time, run_away_jobs, whole_hetjob, disable_whole_hetjob, disable_wait_for_result, usage_time_as_submit_time, show_batch_script, and/or show_job_environment. Additionally, always make sure show_duplicates and disable_truncate_usage_time default to true when the following flags are also set: scheduler_unset, scheduled_on_submit,
Egbert Eich 2024-11-01 13:22:34 +0000
3133935d61Accepting request 1217321 from network:cluster
Ana Guerrero
2024-10-24 13:42:28 +0000
427f09ad29- Add %{?sysusers_requires} to slurm-config. This fixes issues when building against Slurm.Egbert Eich2024-10-23 09:42:56 +0000
de9dc95156Accepting request 1208086 from network:cluster
Ana Guerrero
2024-10-15 13:01:34 +0000
1cc2983ebe- Removed Fix-test-21.41.patch as upstream test changed. - Dropped package plugin-ext-sensors-rrd as the plugin module no longer exists.Egbert Eich2024-10-15 10:19:24 +0000
b2f6e848a1 - Update to version 24.05.3
  * data_parser/v0.0.40 - Added field descriptions.
  * slurmrestd - Avoid creating a new slurmdbd connection per request to /slurm/slurmctld/*/* endpoints.
  * Fix compilation issue with switch/hpe_slingshot plugin.
  * Fix gres per-task allocation with threads-per-core.
  * data_parser/v0.0.41 - Added field descriptions.
  * slurmrestd - Change back the generated OpenAPI schema for DELETE /slurm/v0.0.40/jobs/ to use a RequestBody instead of parameters for the request. slurmrestd will continue to accept endpoint requests via RequestBody or HTTP query.
  * topology/tree - Fix issues with switch distance optimization.
  * Fix potential segfault of the secondary slurmctld when falling back to the primary when running with a JobComp plugin.
  * Enable --json/--yaml=v0.0.39 options on client commands to dump data using data_parser/v0.0.39 instead of outputting nothing.
  * switch/hpe_slingshot - Fix issue that could result in a 0-length state file.
  * Fix unnecessary message protocol downgrade for unregistered nodes.
  * Fix unnecessarily packing alias addrs when terminating jobs with a mix of non-cloud/dynamic nodes and powered-down cloud/dynamic nodes.
  * accounting_storage/mysql - Fix issue when deleting a qos that could remove too many commas from the qos and/or delta_qos fields of the assoc table.
  * slurmctld - Fix memory leak when using RestrictedCoresPerGPU.
  * Fix allowing access to reservations without MaxStartDelay set.
  * Fix regression introduced in 24.05.0rc1 breaking srun --send-libs parsing.
  * Fix slurmd vsize memory leak when using job submission/allocation
Egbert Eich 2024-10-15 06:51:09 +0000
fc209e050f - Updated to new release 24.05.0 with the following major changes:
  - IMPORTANT NOTES: If using the slurmdbd (Slurm DataBase Daemon) you must update this first. NOTE: If using a backup DBD you must start the primary first to do any database conversion; the backup will not start until this has happened. The 24.05 slurmdbd will work with Slurm daemons of version 23.02 and above. You will not need to update all clusters at the same time, but it is very important to update slurmdbd first and have it running before updating any other clusters making use of it.
  - HIGHLIGHTS
    * Federation - Allow client command operation when slurmdbd is unavailable.
    * burst_buffer/lua - Added two new hooks: slurm_bb_test_data_in and slurm_bb_test_data_out. The syntax and use of the new hooks are documented in etc/burst_buffer.lua.example. These are required to exist. slurmctld now checks on startup whether the burst_buffer.lua script loads and contains all required hooks; slurmctld will exit with a fatal error if this is not successful. Added PollInterval to burst_buffer.conf. Removed the arbitrary limit of 512 copies of the script running simultaneously.
    * Add QOS limit MaxTRESRunMinsPerAccount.
    * Add QOS limit MaxTRESRunMinsPerUser.
    * Add ELIGIBLE environment variable to the jobcomp/script plugin.
    * Always use the QOS name for SLURM_JOB_QOS environment variables. Previously the batch environment would use the description field, which was usually equivalent to the name.
    * cgroup/v2 - Require dbus-1 version >= 1.11.16.
    * Allow NodeSet names to be used in SuspendExcNodes.
Egbert Eich 2024-10-14 10:03:00 +0000
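The new MaxTRESRunMinsPerUser/PerAccount QOS limits above bound TRES-run-minutes, i.e. allocated TRES multiplied by remaining run time across a user's (or account's) running jobs. A hedged sketch of how such a limit can be evaluated; the job dictionaries, field names, and the focus on the cpu TRES are illustrative assumptions, not Slurm's internal representation:

```python
# Hedged sketch: evaluating a MaxTRESRunMinsPerUser-style limit.
# Job records here are hypothetical, not Slurm's internal structures.

def tres_run_mins(jobs, tres="cpu"):
    """Sum TRES count x remaining time limit (minutes) over running jobs."""
    return sum(j["tres"][tres] * j["remaining_min"] for j in jobs)

def would_exceed(jobs, new_job, limit, tres="cpu"):
    """Would admitting new_job push the user's TRES-run-minutes past limit?"""
    usage = tres_run_mins(jobs, tres) + new_job["tres"][tres] * new_job["remaining_min"]
    return usage > limit

running = [
    {"tres": {"cpu": 16}, "remaining_min": 60},   # 960 cpu-run-minutes
    {"tres": {"cpu": 4},  "remaining_min": 30},   # 120 cpu-run-minutes
]
new = {"tres": {"cpu": 8}, "remaining_min": 120}  # 960 cpu-run-minutes

print(tres_run_mins(running))            # 1080
print(would_exceed(running, new, 2000))  # True (1080 + 960 = 2040 > 2000)
```

Unlike a plain TRES cap, this kind of limit lets a user run many short jobs or a few long ones, since the product of size and remaining time is what is bounded.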
61add11d2bAccepting request 1161658 from network:cluster
Ana Guerrero
2024-03-26 18:27:40 +0000
cda5ce024eAccepting request 1161499 from home:mslacken:branches:network:clusterChristian Goll2024-03-26 08:40:44 +0000
4ec0f5cd48Accepting request 1151965 from network:cluster
Ana Guerrero
2024-02-27 21:47:57 +0000
fb460ebe6aAccepting request 1150524 from home:eeich:branches:network:clusterEgbert Eich2024-02-26 21:40:59 +0000
6a021ebb80Accepting request 1141442 from network:cluster
Ana Guerrero
2024-01-25 17:41:05 +0000
f98ecb23d5- Remove last change. This is not how it is intended to work.Egbert Eich2024-01-25 07:58:54 +0000
a95f2355d0Accepting request 1141020 from home:dimstar:FactoryChristian Goll2024-01-24 14:43:56 +0000
e59754da76 CVE-2023-49933, CVE-2023-49934, CVE-2023-49935, CVE-2023-49936 and CVE-2023-49937
  * Substantially overhauled the SlurmDBD association management code. For clusters updated to 23.11, account and user additions or removals are significantly faster than in prior releases.
  * Overhauled scontrol reconfigure to prevent configuration mistakes from disabling slurmctld and slurmd. Instead, an error will be returned, and the running configuration will persist. This does require updates to the systemd service files to use the --systemd option to slurmctld and slurmd.
  * Added a new internal auth/cred plugin - auth/slurm. This builds off the prior auth/jwt model, and permits operation of the slurmdbd and slurmctld without access to full directory information with a suitable configuration.
  * Added a new --external-launcher option to srun, which is automatically set by common MPI launcher implementations and ensures processes using those non-srun launchers have full access to all resources allocated on each node.
  * Reworked the dynamic/cloud modes of operation to allow for "fanout" - where Slurm communication can be automatically offloaded to compute nodes for increased cluster scalability.
  * Overhauled and extended the Reservation subsystem to allow for most of the same resource requirements as are placed on the job. Notably, this permits reservations to now reserve GRES directly.
  * Fix scontrol update job=... TimeLimit+=/-= when used with a raw JobId of a job array element.
  * Reject TimeLimit increment/decrement when called on a job with TimeLimit=UNLIMITED.
Egbert Eich 2024-01-22 16:26:43 +0000
e7275730c8Accepting request 1138332 from home:mslacken:branches:network:clusterEgbert Eich2024-01-22 15:21:33 +0000
1f813cb386Accepting request 1137045 from network:cluster
Dominique Leuenberger
2024-01-05 20:45:15 +0000
af603b8163Accepting request 1136624 from home:eeich:branches:network:clusterEgbert Eich2024-01-05 12:29:13 +0000
0db8ed8d95Accepting request 1130097 from network:cluster
Ana Guerrero
2023-12-04 21:59:28 +0000
bbe01bb79fAccepting request 1130096 from home:eeich:branches:network:clusterEgbert Eich2023-11-30 19:27:08 +0000
5a1d72f62cAccepting request 1129638 from home:eeich:branches:network:clusterEgbert Eich2023-11-28 18:02:52 +0000
1e8971e87aAccepting request 1129192 from network:cluster
Ana Guerrero
2023-11-27 21:44:42 +0000
cd2c5bfc50Accepting request 1117145 from home:mslacken:branches:network:clusterChristian Goll2023-10-12 09:09:32 +0000
90bba6a8aaAccepting request 1117137 from home:mslacken:branches:network:clusterEgbert Eich2023-10-12 08:49:44 +0000
12bf38b1d0Accepting request 1111943 from network:cluster
Dominique Leuenberger
2023-09-20 11:26:46 +0000
f0b994e220
  plugins makes use of the MpiParams=ports= option, and previously
  features with the | operator, which could prevent jobs from
  + node_features/helpers - Fix inconsistent handling of & and |, instead of just the current set. E.g. foo|bar&baz was interpreted {foo} or {bar,baz}.
  tasks fewer than GPUs, which resulted in incorrectly rejecting these jobs.
  + slurmrestd - For GET /slurm/v0.0.39/node[s], change format of node's energy field current_watts to a dictionary to account for
  + slurmrestd - For GET /slurm/v0.0.39/qos, change format of QOS's
  + slurmrestd - For GET /slurm/v0.0.39/job[s], the 'return code'
  GET /slurmdb/v0.0.39/jobs from slurmrestd.
  were present in the log: error: Attempt to change gres/gpu Count.
  + Hold the job with (Reservation ... invalid) state reason if the
Egbert Eich 2023-09-18 05:43:58 +0000
74529b6cc2 - Updated to version 23.02.5 with the following changes:
  * Bug Fixes:
    + Revert a change in 23.02 where SLURM_NTASKS was no longer set in the job's environment when --ntasks-per-node was requested. The method by which it is being set, however, is different and should be more accurate in more situations.
    + Change the pmi2 plugin to honor the SrunPortRange option. This matches the new behavior of the pmix plugin in 23.02.0. Note that neither of these plugins makes use of the "MpiParams=ports=" option; previously they were only limited by the system's ephemeral port range.
    + Fix regression in 23.02.2 that caused slurmctld -R to crash on startup if a node features plugin is configured.
    + Fix and prevent reoccurring reservations from overlapping.
    + job_container/tmpfs - Avoid attempts to share BasePath between nodes.
    + With CR_Cpu_Memory, fix node selection for jobs that request gres and --mem-per-cpu.
    + Fix a regression from 22.05.7 in which some jobs were allocated too few nodes, thus overcommitting cpus to some tasks.
    + Fix a job being stuck in the completing state if the job ends while the primary controller is down or unresponsive and the backup controller has not yet taken over.
    + Fix slurmctld segfault when a node registers with a configured CpuSpecList while the slurmctld configuration has the node without CpuSpecList.
    + Fix cloud nodes getting stuck in POWERED_DOWN+NO_RESPOND state after not registering by ResumeTimeout.
    + slurmstepd - Avoid cleanup of config.json-less containers' spooldir getting skipped.
    + Fix scontrol segfault when the 'completing' command is requested repeatedly in interactive mode.
Egbert Eich 2023-09-18 05:24:51 +0000
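The pmi2/SrunPortRange change above refers to a slurm.conf option; a minimal illustrative fragment (the port range values are arbitrary examples, not a recommendation):

```
# slurm.conf (fragment): restrict the ephemeral ports srun listens on.
# Per the 23.02.5 fix, the pmi2 plugin now honors this range as well,
# matching the pmix plugin's behavior since 23.02.0.
SrunPortRange=60001-63000
```

Without this option set, both plugins fall back to the system's ephemeral port range, which can conflict with restrictive firewall rules between compute nodes.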
3825e9fab0Accepting request 1110422 from network:cluster
Ana Guerrero
2023-09-12 19:02:53 +0000
a323feff42Accepting request 1110421 from home:eeich:branches:network:clusterEgbert Eich2023-09-12 04:52:56 +0000
3bcde4bfd9Accepting request 1110259 from network:cluster
Ana Guerrero
2023-09-11 19:22:19 +0000
f9646ba945 - Updated to 23.02.4 with the following changes:
  * Bug Fixes:
    + Fix main scheduler loop not starting after a failover to the backup controller.
    + Avoid slurmctld segfault when specifying AccountingStorageExternalHost (bsc#1214983).
    + Fix sbatch return code when --wait is requested on a job array.
    + Fix collected GPUUtilization values for acct_gather_profile plugins.
    + Fix slurmrestd handling of job hold/release operations.
    + Fix step running indefinitely when slurmctld takes more than MessageTimeout to respond. Now slurmctld will cancel the step when detected, preventing following steps from getting stuck waiting for resources to be released.
    + Fix regression to make job_desc.min_cpus accurate again in job_submit when requesting a job with --ntasks-per-node.
    + Fix handling of ArrayTaskThrottle in backfill.
    + Fix regression in 23.02.2 when checking gres state on slurmctld startup or reconfigure. Gres changes in the configuration were not updated on slurmctld startup. On startup or reconfigure, these messages were present in the log: "error: Attempt to change gres/gpu Count".
    + Fix potential double count of gres when dealing with limits.
    + Fix slurmstepd segfault when ContainerPath is not set in oci.conf.
    + Fix an issue where jobs requesting licenses were incorrectly rejected.
    + scrontab - Fix cutting off the final character of quoted variables.
    + smail - Fix issues where e-mails at job completion were not being sent.
    + scontrol/slurmctld - Fix comma parsing when updating a reservation's nodes.
    + Fix --gpu-bind=single binding tasks to wrong gpus, leading to some gpus having more tasks than they should and other gpus being unused.
    + Fix regression in 23.02 that causes slurmstepd to crash when srun requests more than TreeWidth nodes in a step and uses the pmi2 or
Egbert Eich 2023-09-11 07:21:32 +0000
6b47182efeAccepting request 1109308 from network:cluster
Ana Guerrero
2023-09-07 19:12:41 +0000
c63b605916 - Fixes since 23.02.03:
  Highlights:
  * Fix main scheduler loop not starting after a failover to the backup controller.
  * Avoid slurmctld segfault when specifying AccountingStorageExternalHost (bsc#1214983).
  Other:
  * Fix sbatch return code when --wait is requested on a job array.
  * Fix collected GPUUtilization values for acct_gather_profile plugins.
  * Fix slurmrestd handling of job hold/release operations.
  * Make the spank S_JOB_ARGV item value hold the requested command argv instead of the srun --bcast value when --bcast is requested (only in local context).
  * Fix step running indefinitely when slurmctld takes more than MessageTimeout to respond. Now slurmctld will cancel the step when detected, preventing following steps from getting stuck waiting for resources to be released.
  * Fix regression to make job_desc.min_cpus accurate again in job_submit when requesting a job with --ntasks-per-node.
  * Fix handling of ArrayTaskThrottle in backfill.
  * Fix regression in 23.02.2 when checking gres state on slurmctld startup or reconfigure. Gres changes in the configuration were not updated on slurmctld startup. On startup or reconfigure, these messages were present in the log: "error: Attempt to change gres/gpu Count".
  * Fix potential double count of gres when dealing with limits.
  * Fix slurmstepd segfault when ContainerPath is not set in oci.conf.
  * Fix an issue where jobs requesting licenses were incorrectly rejected.
  * scrontab - Fix cutting off the final character of quoted variables.
  * smail - Fix issues where e-mails at job completion were not being sent.
  * scontrol/slurmctld - Fix comma parsing when updating a reservation's nodes.
Egbert Eich 2023-09-06 17:11:37 +0000
51bec69223Accepting request 1109029 from network:cluster
Ana Guerrero
2023-09-06 16:57:11 +0000
47d665607bAccepting request 1109009 from home:mslacken:branches:network:clusterChristian Goll2023-09-05 11:47:06 +0000
03d2eefa9eAccepting request 1085677 from network:cluster
Dominique Leuenberger
2023-05-09 11:09:16 +0000
532aa1e96dAccepting request 1085668 from home:mslacken:branches:network:clusterEgbert Eich2023-05-09 10:35:16 +0000
0d5e08df4bAccepting request 1083466 from network:cluster
Dominique Leuenberger
2023-04-28 14:23:13 +0000
33bf8791ac- Require slurm-munge if munge authentication is installed. - Replace 'Require: config(pam)' by 'Require: pam'.Egbert Eich2023-04-28 07:46:44 +0000
392bec3223Accepting request 1082770 from home:eeich:branches:network:clusterChristian Goll2023-04-27 13:24:37 +0000
e27e58c1b6Accepting request 1076522 from network:cluster
Dominique Leuenberger
2023-04-01 17:32:20 +0000
5a68fc8e5f- Updated to 23.02.1 with the following changes: - Removed right-pmix-path.patch as fixed upstream.Egbert Eich2023-03-31 15:48:27 +0000
d2a2e0a1e8Accepting request 1076461 from home:mslacken:branches:network:clusterEgbert Eich2023-03-31 15:44:08 +0000
c7d67ed696Accepting request 1072592 from network:cluster
Dominique Leuenberger
2023-03-17 16:05:03 +0000
5c3d4865a1Accepting request 1072591 from home:mslacken:branches:network:clusterChristian Goll2023-03-17 10:52:44 +0000
9883ad6d58Accepting request 1072585 from home:mslacken:branches:network:clusterChristian Goll2023-03-17 10:42:09 +0000
2de2dcca49Accepting request 1072087 from network:cluster
Dominique Leuenberger
2023-03-15 17:56:12 +0000
521f372d87Accepting request 1072084 from home:mslacken:branches:network:clusterChristian Goll2023-03-15 10:57:09 +0000
c224ea00c3Accepting request 1070214 from network:cluster
Dominique Leuenberger
2023-03-09 16:45:23 +0000
e85b508441Accepting request 1070212 from home:eeich:branches:network:clusterEgbert Eich2023-03-08 15:43:28 +0000
86940cb8c4Accepting request 1070094 from home:eeich:branches:network:clusterEgbert Eich2023-03-08 07:58:58 +0000
0f04c66747Accepting request 1070043 from home:eeich:branches:network:clusterEgbert Eich2023-03-07 22:14:15 +0000
da464bfaaeAccepting request 1070038 from home:eeich:branches:network:clusterEgbert Eich2023-03-07 21:33:03 +0000
50b2b76a05Accepting request 1068523 from network:cluster
Dominique Leuenberger
2023-03-02 22:03:34 +0000
6997bacde0Accepting request 1068522 from home:eeich:branches:network:clusterEgbert Eich2023-03-01 17:58:54 +0000
8a8f7dcb78Accepting request 1068320 from network:cluster
Dominique Leuenberger
2023-03-01 15:14:17 +0000
8899aac00b- testsuite: on later SUSE versions claim ownership of directoryEgbert Eich2023-02-28 20:34:03 +0000
18aa012ab9Accepting request 1068316 from home:eeich:branches:network:clusterEgbert Eich2023-02-28 20:30:32 +0000
ef6d6521aaAccepting request 1067475 from home:eeich:branches:network:clusterEgbert Eich2023-02-23 19:32:51 +0000
d1ebf00ba6Accepting request 1063957 from network:cluster
Dominique Leuenberger
2023-02-09 15:23:26 +0000
4693e39860Accepting request 1063954 from home:eeich:branches:network:clusterEgbert Eich2023-02-09 08:22:55 +0000
a4484c7dc2Accepting request 1042071 from network:cluster
Dominique Leuenberger
2022-12-11 16:16:58 +0000
6f080824a4Accepting request 1039957 from home:eeich:branches:network:clusterEgbert Eich2022-12-11 07:58:12 +0000
30dd030610Accepting request 1031255 from network:cluster
Dominique Leuenberger
2022-10-26 10:32:00 +0000
212048404b* Improve setup-testsuite.sh: copy ssh fingerprints from all nodes.Egbert Eich2022-10-26 06:23:36 +0000
776ce8f23b - Test Suite fixes:
  * Update README_Testsuite.md.
  * Clean up left-over files when de-installing the test suite.
  * Adjustment to test suite package: for SLE, mark the openmpi4 devel package and slurm-hdf5 optional.
  * Add -ffat-lto-objects to the build flags when LTO is set to make sure the object files we ship with the test suite still work correctly.
Egbert Eich 2022-10-25 11:33:49 +0000
642a47efa7- Adjustment to test suite package: only recommend openmpi4Egbert Eich2022-10-24 08:54:35 +0000
52046053d5Accepting request 1030610 from home:eeich:branches:network:clusterEgbert Eich2022-10-24 05:31:40 +0000
220eec76a4Accepting request 1030432 from network:cluster
Dominique Leuenberger
2022-10-22 12:13:18 +0000
c2551ab47fAccepting request 1010642 from home:mslacken:branches:network:clusterEgbert Eich2022-10-21 15:00:25 +0000
edd405b2c8Accepting request 1006180 from network:cluster
Dominique Leuenberger
2022-09-26 16:48:44 +0000
09aecc2015Accepting request 1005746 from home:eeich:branches:network:clusterEgbert Eich2022-09-26 15:01:51 +0000
ae04ec8787Accepting request 1005247 from network:cluster
Dominique Leuenberger
2022-09-22 12:49:55 +0000
3f68233e21Accepting request 1005246 from home:eeich:branches:network:clusterEgbert Eich2022-09-21 15:33:09 +0000
d3bcbab808Accepting request 992362 from network:cluster
Dominique Leuenberger
2022-08-02 20:09:54 +0000
b60ac5f569Accepting request 992353 from home:eeich:branches:network:clusterEgbert Eich2022-08-02 15:34:01 +0000
fd509c0258Accepting request 990637 from home:bmwiedemann:branches:network:clusterEgbert Eich2022-08-02 13:14:07 +0000
7a8e082057Accepting request 990643 from network:cluster
Richard Brown
2022-07-22 17:21:25 +0000
e067a36989- Fix a typo which prevented the nproc limit for slurmd from being raised for the test suite.Egbert Eich2022-07-15 07:15:34 +0000
69890cab1eAccepting request 989256 from home:eeich:branches:network:clusterEgbert Eich2022-07-15 07:13:32 +0000