f0b994e220
plugins makes use of the MpiParams=ports= option, and previously features with the | operator, which could prevent jobs from + node_features/helpers - Fix inconsistent handling of & and |, instead of just the current set. E.g. foo|bar&baz was interpreted {foo} or {bar,baz}. tasks fewer than GPUs, which resulted in incorrectly rejecting these jobs. + slurmrestd - For GET /slurm/v0.0.39/node[s], change format of node's energy field current_watts to a dictionary to account for + slurmrestd - For GET /slurm/v0.0.39/qos, change format of QOS's + slurmrestd - For GET /slurm/v0.0.39/job[s], the 'return code' GET /slurmdb/v0.0.39/jobs from slurmrestd. were present in the log: error: Attempt to change gres/gpu Count. + Hold the job with (Reservation ... invalid) state reason if the
Egbert Eich 2023-09-18 05:43:58 +00:00
a4f697f06d
plugins makes use of the MpiParams=ports= option, and previously features with the | operator, which could prevent jobs from + node_features/helpers - Fix inconsistent handling of & and |, instead of just the current set. E.g. foo|bar&baz was interpreted {foo} or {bar,baz}. tasks fewer than GPUs, which resulted in incorrectly rejecting these jobs. + slurmrestd - For GET /slurm/v0.0.39/node[s], change format of node's energy field current_watts to a dictionary to account for + slurmrestd - For GET /slurm/v0.0.39/qos, change format of QOS's + slurmrestd - For GET /slurm/v0.0.39/job[s], the 'return code' GET /slurmdb/v0.0.39/jobs from slurmrestd. were present in the log: error: Attempt to change gres/gpu Count. + Hold the job with (Reservation ... invalid) state reason if the
Egbert Eich 2023-09-18 05:43:58 +00:00
74529b6cc2
- Updated to version 23.02.5 with the following changes:
  * Bug Fixes:
    + Revert a change in 23.02 where SLURM_NTASKS was no longer set in the job's environment when --ntasks-per-node was requested. The method by which it is being set, however, is different and should be more accurate in more situations.
    + Change pmi2 plugin to honor the SrunPortRange option. This matches the new behavior of the pmix plugin in 23.02.0. Note that neither of these plugins makes use of the "MpiParams=ports=" option, and previously they were only limited by the system's ephemeral port range.
    + Fix regression in 23.02.2 that caused slurmctld -R to crash on startup if a node features plugin is configured.
    + Fix and prevent reoccurring reservations from overlapping.
    + job_container/tmpfs - Avoid attempts to share BasePath between nodes.
    + With CR_Cpu_Memory, fix node selection for jobs that request gres and --mem-per-cpu.
    + Fix a regression from 22.05.7 in which some jobs were allocated too few nodes, thus overcommitting cpus to some tasks.
    + Fix a job being stuck in the completing state if the job ends while the primary controller is down or unresponsive and the backup controller has not yet taken over.
    + Fix slurmctld segfault when a node registers with a configured CpuSpecList while slurmctld configuration has the node without CpuSpecList.
    + Fix cloud nodes getting stuck in POWERED_DOWN+NO_RESPOND state after not registering by ResumeTimeout.
    + slurmstepd - Avoid cleanup of config.json-less containers spooldir getting skipped.
    + Fix scontrol segfault when 'completing' command is requested repeatedly in interactive mode.
Egbert Eich 2023-09-18 05:24:51 +00:00
7c740289ad
- Updated to version 23.02.5 with the following changes:
  * Bug Fixes:
    + Revert a change in 23.02 where SLURM_NTASKS was no longer set in the job's environment when --ntasks-per-node was requested. The method by which it is being set, however, is different and should be more accurate in more situations.
    + Change pmi2 plugin to honor the SrunPortRange option. This matches the new behavior of the pmix plugin in 23.02.0. Note that neither of these plugins makes use of the "MpiParams=ports=" option, and previously they were only limited by the system's ephemeral port range.
    + Fix regression in 23.02.2 that caused slurmctld -R to crash on startup if a node features plugin is configured.
    + Fix and prevent reoccurring reservations from overlapping.
    + job_container/tmpfs - Avoid attempts to share BasePath between nodes.
    + With CR_Cpu_Memory, fix node selection for jobs that request gres and --mem-per-cpu.
    + Fix a regression from 22.05.7 in which some jobs were allocated too few nodes, thus overcommitting cpus to some tasks.
    + Fix a job being stuck in the completing state if the job ends while the primary controller is down or unresponsive and the backup controller has not yet taken over.
    + Fix slurmctld segfault when a node registers with a configured CpuSpecList while slurmctld configuration has the node without CpuSpecList.
    + Fix cloud nodes getting stuck in POWERED_DOWN+NO_RESPOND state after not registering by ResumeTimeout.
    + slurmstepd - Avoid cleanup of config.json-less containers spooldir getting skipped.
    + Fix scontrol segfault when 'completing' command is requested repeatedly in interactive mode.
Egbert Eich 2023-09-18 05:24:51 +00:00
3825e9fab0
Accepting request 1110422 from network:cluster
Ana Guerrero 2023-09-12 19:02:53 +00:00
5e252b9d68
Accepting request 1110422 from network:cluster
Ana Guerrero 2023-09-12 19:02:53 +00:00
a323feff42
Accepting request 1110421 from home:eeich:branches:network:cluster
Egbert Eich 2023-09-12 04:52:56 +00:00
09d371a40e
Accepting request 1110421 from home:eeich:branches:network:cluster
Egbert Eich 2023-09-12 04:52:56 +00:00
3bcde4bfd9
Accepting request 1110259 from network:cluster
Ana Guerrero 2023-09-11 19:22:19 +00:00
9f1000ec21
Accepting request 1110259 from network:cluster
Ana Guerrero 2023-09-11 19:22:19 +00:00
f9646ba945
- Updated to 23.02.4 with the following changes:
  * Bug Fixes:
    + Fix main scheduler loop not starting after a failover to backup controller.
    + Avoid slurmctld segfault when specifying AccountingStorageExternalHost (bsc#1214983).
    + Fix sbatch return code when --wait is requested on a job array.
    + Fix collected GPUUtilization values for acct_gather_profile plugins.
    + Fix slurmrestd handling of job hold/release operations.
    + Fix step running indefinitely when slurmctld takes more than MessageTimeout to respond. Now, slurmctld will cancel the step when detected, preventing following steps from getting stuck waiting for resources to be released.
    + Fix regression to make job_desc.min_cpus accurate again in job_submit when requesting a job with --ntasks-per-node.
    + Fix handling of ArrayTaskThrottle in backfill.
    + Fix regression in 23.02.2 when checking gres state on slurmctld startup or reconfigure. Gres changes in the configuration were not updated on slurmctld startup. On startup or reconfigure, these messages were present in the log: "error: Attempt to change gres/gpu Count".
    + Fix potential double count of gres when dealing with limits.
    + Fix slurmstepd segfault when ContainerPath is not set in oci.conf.
    + Fixed an issue where jobs requesting licenses were incorrectly rejected.
    + scrontab - Fix cutting off the final character of quoted variables.
    + smail - Fix issues where e-mails at job completion were not being sent.
    + scontrol/slurmctld - fix comma parsing when updating a reservation's nodes.
    + Fix --gpu-bind=single binding tasks to wrong gpus, leading to some gpus having more tasks than they should and other gpus being unused.
    + Fix regression in 23.02 that causes slurmstepd to crash when srun requests more than TreeWidth nodes in a step and uses the pmi2 or
Egbert Eich 2023-09-11 07:21:32 +00:00
6ad091ecc0
- Updated to 23.02.4 with the following changes:
  * Bug Fixes:
    + Fix main scheduler loop not starting after a failover to backup controller.
    + Avoid slurmctld segfault when specifying AccountingStorageExternalHost (bsc#1214983).
    + Fix sbatch return code when --wait is requested on a job array.
    + Fix collected GPUUtilization values for acct_gather_profile plugins.
    + Fix slurmrestd handling of job hold/release operations.
    + Fix step running indefinitely when slurmctld takes more than MessageTimeout to respond. Now, slurmctld will cancel the step when detected, preventing following steps from getting stuck waiting for resources to be released.
    + Fix regression to make job_desc.min_cpus accurate again in job_submit when requesting a job with --ntasks-per-node.
    + Fix handling of ArrayTaskThrottle in backfill.
    + Fix regression in 23.02.2 when checking gres state on slurmctld startup or reconfigure. Gres changes in the configuration were not updated on slurmctld startup. On startup or reconfigure, these messages were present in the log: "error: Attempt to change gres/gpu Count".
    + Fix potential double count of gres when dealing with limits.
    + Fix slurmstepd segfault when ContainerPath is not set in oci.conf.
    + Fixed an issue where jobs requesting licenses were incorrectly rejected.
    + scrontab - Fix cutting off the final character of quoted variables.
    + smail - Fix issues where e-mails at job completion were not being sent.
    + scontrol/slurmctld - fix comma parsing when updating a reservation's nodes.
    + Fix --gpu-bind=single binding tasks to wrong gpus, leading to some gpus having more tasks than they should and other gpus being unused.
    + Fix regression in 23.02 that causes slurmstepd to crash when srun requests more than TreeWidth nodes in a step and uses the pmi2 or
Egbert Eich 2023-09-11 07:21:32 +00:00
6b47182efe
Accepting request 1109308 from network:cluster
Ana Guerrero 2023-09-07 19:12:41 +00:00
e167499b83
Accepting request 1109308 from network:cluster
Ana Guerrero 2023-09-07 19:12:41 +00:00
c63b605916
- Fixes since 23.02.03:
  Highlights:
  * Fix main scheduler loop not starting after a failover to backup controller.
  * Avoid slurmctld segfault when specifying AccountingStorageExternalHost (bsc#1214983).
  Other:
  * Fix sbatch return code when --wait is requested on a job array.
  * Fix collected GPUUtilization values for acct_gather_profile plugins.
  * Fix slurmrestd handling of job hold/release operations.
  * Make spank S_JOB_ARGV item value hold the requested command argv instead of the srun --bcast value when --bcast requested (only in local context).
  * Fix step running indefinitely when slurmctld takes more than MessageTimeout to respond. Now, slurmctld will cancel the step when detected, preventing following steps from getting stuck waiting for resources to be released.
  * Fix regression to make job_desc.min_cpus accurate again in job_submit when requesting a job with --ntasks-per-node.
  * Fix handling of ArrayTaskThrottle in backfill.
  * Fix regression in 23.02.2 when checking gres state on slurmctld startup or reconfigure. Gres changes in the configuration were not updated on slurmctld startup. On startup or reconfigure, these messages were present in the log: "error: Attempt to change gres/gpu Count".
  * Fix potential double count of gres when dealing with limits.
  * Fix slurmstepd segfault when ContainerPath is not set in oci.conf.
  * Fixed an issue where jobs requesting licenses were incorrectly rejected.
  * scrontab - Fix cutting off the final character of quoted variables.
  * smail - Fix issues where e-mails at job completion were not being sent.
  * scontrol/slurmctld - fix comma parsing when updating a reservation's nodes.
Egbert Eich 2023-09-06 17:11:37 +00:00
8b706ae37a
- Fixes since 23.02.03:
  Highlights:
  * Fix main scheduler loop not starting after a failover to backup controller.
  * Avoid slurmctld segfault when specifying AccountingStorageExternalHost (bsc#1214983).
  Other:
  * Fix sbatch return code when --wait is requested on a job array.
  * Fix collected GPUUtilization values for acct_gather_profile plugins.
  * Fix slurmrestd handling of job hold/release operations.
  * Make spank S_JOB_ARGV item value hold the requested command argv instead of the srun --bcast value when --bcast requested (only in local context).
  * Fix step running indefinitely when slurmctld takes more than MessageTimeout to respond. Now, slurmctld will cancel the step when detected, preventing following steps from getting stuck waiting for resources to be released.
  * Fix regression to make job_desc.min_cpus accurate again in job_submit when requesting a job with --ntasks-per-node.
  * Fix handling of ArrayTaskThrottle in backfill.
  * Fix regression in 23.02.2 when checking gres state on slurmctld startup or reconfigure. Gres changes in the configuration were not updated on slurmctld startup. On startup or reconfigure, these messages were present in the log: "error: Attempt to change gres/gpu Count".
  * Fix potential double count of gres when dealing with limits.
  * Fix slurmstepd segfault when ContainerPath is not set in oci.conf.
  * Fixed an issue where jobs requesting licenses were incorrectly rejected.
  * scrontab - Fix cutting off the final character of quoted variables.
  * smail - Fix issues where e-mails at job completion were not being sent.
  * scontrol/slurmctld - fix comma parsing when updating a reservation's nodes.
Egbert Eich 2023-09-06 17:11:37 +00:00
51bec69223
Accepting request 1109029 from network:cluster
Ana Guerrero 2023-09-06 16:57:11 +00:00
5e2c599785
Accepting request 1109029 from network:cluster
Ana Guerrero 2023-09-06 16:57:11 +00:00
47d665607b
Accepting request 1109009 from home:mslacken:branches:network:cluster
Christian Goll 2023-09-05 11:47:06 +00:00
8f857e2839
Accepting request 1109009 from home:mslacken:branches:network:cluster
Christian Goll 2023-09-05 11:47:06 +00:00
212048404b
* Improve setup-testsuite.sh: copy ssh fingerprints from all nodes.
Egbert Eich 2022-10-26 06:23:36 +00:00
eac06a3bc4
* Improve setup-testsuite.sh: copy ssh fingerprints from all nodes.
Egbert Eich 2022-10-26 06:23:36 +00:00
776ce8f23b
- Test Suite fixes:
  * Update README_Testsuite.md.
  * Clean up leftover files when uninstalling the test suite.
  * Adjustment to test suite package: for SLE, mark the openmpi4 devel package and slurm-hdf5 as optional.
  * Add -ffat-lto-objects to the build flags when LTO is set to make sure the object files we ship with the test suite still work correctly.
Egbert Eich 2022-10-25 11:33:49 +00:00
371becf26d
- Test Suite fixes:
  * Update README_Testsuite.md.
  * Clean up leftover files when uninstalling the test suite.
  * Adjustment to test suite package: for SLE, mark the openmpi4 devel package and slurm-hdf5 as optional.
  * Add -ffat-lto-objects to the build flags when LTO is set to make sure the object files we ship with the test suite still work correctly.
Egbert Eich 2022-10-25 11:33:49 +00:00
642a47efa7
- Adjustment to test suite package: only recommend openmpi4
Egbert Eich 2022-10-24 08:54:35 +00:00
4a0e30d273
- Adjustment to test suite package: only recommend openmpi4
Egbert Eich 2022-10-24 08:54:35 +00:00
52046053d5
Accepting request 1030610 from home:eeich:branches:network:cluster
Egbert Eich 2022-10-24 05:31:40 +00:00
c7e02dc61a
Accepting request 1030610 from home:eeich:branches:network:cluster
Egbert Eich 2022-10-24 05:31:40 +00:00