slurm

pool/slurm

Author	SHA256	Message	Date
Egbert Eich	74529b6cc2	- Updated to version 23.02.5 with the following changes: * Bug Fixes: + Revert a change in 23.02 where `SLURM_NTASKS` was no longer set in the job's environment when `--ntasks-per-node` was requested. The method by which it is being set, however, is different and should be more accurate in more situations. + Change pmi2 plugin to honor the `SrunPortRange` option. This matches the new behavior of the pmix plugin in 23.02.0. Note that neither of these plugins makes use of the "`MpiParams=ports=`" option, and previously were only limited by the system's ephemeral port range. + Fix regression in 23.02.2 that caused slurmctld -R to crash on startup if a node features plugin is configured. + Fix and prevent reoccurring reservations from overlapping. + `job_container/tmpfs` - Avoid attempts to share BasePath between nodes. + With `CR_Cpu_Memory`, fix node selection for jobs that request gres and `--mem-per-cpu`. + Fix a regression from 22.05.7 in which some jobs were allocated too few nodes, thus overcommitting cpus to some tasks. + Fix a job being stuck in the completing state if the job ends while the primary controller is down or unresponsive and the backup controller has not yet taken over. + Fix `slurmctld` segfault when a node registers with a configured `CpuSpecList` while `slurmctld` configuration has the node without `CpuSpecList`. + Fix cloud nodes getting stuck in `POWERED_DOWN+NO_RESPOND` state after not registering by `ResumeTimeout`. + `slurmstepd` - Avoid cleanup of `config.json`-less containers spooldir getting skipped. + Fix scontrol segfault when 'completing' command requested repeatedly in interactive mode. OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=264	2023-09-18 05:24:51 +00:00
Egbert Eich	a323feff42	Accepting request 1110421 from home:eeich:branches:network:cluster - Create a macro for upgrade dependency to ensure uniform handling. OBS-URL: https://build.opensuse.org/request/show/1110421 OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=262	2023-09-12 04:52:56 +00:00
Egbert Eich	f9646ba945	- Updated to 23.02.4 with the following changes: * Bug Fixes: + Fix main scheduler loop not starting after a failover to backup controller. Avoid slurmctld segfault when specifying `AccountingStorageExternalHost` (bsc#1214983). + Fix sbatch return code when `--wait` is requested on a job array. + Fix collected `GPUUtilization` values for `acct_gather_profile` plugins. + Fix `slurmrestd` handling of job hold/release operations. + Fix step running indefinitely when slurmctld takes more than `MessageTimeout` to respond. Now, `slurmctld` will cancel the step when detected, preventing following steps from getting stuck waiting for resources to be released. + Fix regression to make `job_desc.min_cpus` accurate again in `job_submit` when requesting a job with `--ntasks-per-node`. + Fix handling of `ArrayTaskThrottle` in backfill. + Fix regression in 23.02.2 when checking gres state on `slurmctld` startup or reconfigure. Gres changes in the configuration were not updated on slurmctld startup. On startup or reconfigure, these messages were present in the log: `"error: Attempt to change gres/gpu Count`". + Fix potential double count of gres when dealing with limits. + Fix `slurmstepd` segfault when `ContainerPath` is not set in `oci.conf` + Fixed an issue where jobs requesting licenses were incorrectly rejected. + `scrontab` - Fix cutting off the final character of quoted variables. + `smail` - Fix issues where e-mails at job completion were not being sent. + `scontrol/slurmctld` - fix comma parsing when updating a reservation's nodes. + Fix `--gpu-bind=single binding` tasks to wrong gpus, leading to some gpus having more tasks than they should and other gpus being unused. + Fix regression in 23.02 that causes slurmstepd to crash when `srun` requests more than `TreeWidth` nodes in a step and uses the pmi2 or OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=260	2023-09-11 07:21:32 +00:00
Egbert Eich	c63b605916	- Fixes since 23.02.03: Highlights: * Fix main scheduler loop not starting after a failover to backup controller. * Avoid slurmctld segfault when specifying `AccountingStorageExternalHost` (bsc#1214983). Other: * Fix sbatch return code when `--wait` is requested on a job array. * Fix collected `GPUUtilization` values for `acct_gather_profile` plugins. * Fix `slurmrestd` handling of job hold/release operations. * Make spank `S_JOB_ARGV` item value hold the requested command `argv` instead of the `srun --bcast` value when `--bcast` requested (only in local context). * Fix step running indefinitely when slurmctld takes more than `MessageTimeout` to respond. Now, slurmctld will cancel the step when detected, preventing following steps from getting stuck waiting for resources to be released. * Fix regression to make `job_desc.min_cpus` accurate again in job_submit when requesting a job with `--ntasks-per-node`. * Fix handling of `ArrayTaskThrottle` in backfill. * Fix regression in 23.02.2 when checking gres state on `slurmctld` startup or reconfigure. Gres changes in the configuration were not updated on slurmctld startup. On startup or reconfigure, these messages were present in the log: `"error: Attempt to change gres/gpu Count`". * Fix potential double count of gres when dealing with limits. * Fix slurmstepd segfault when ContainerPath is not set in `oci.conf` * Fixed an issue where jobs requesting licenses were incorrectly rejected. * `scrontab` - Fix cutting off the final character of quoted variables. * `smail` - Fix issues where e-mails at job completion were not being sent. * `scontrol/slurmctld` - fix comma parsing when updating a reservation's nodes. OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=258	2023-09-06 17:11:37 +00:00
Christian Goll	47d665607b	Accepting request 1109009 from home:mslacken:branches:network:cluster - updated to 23.02.04 which includes the following changes: * fixing the main scheduler loop not starting on the backup controller after a failover event, a segfault when attempting to use AccountingStorageExternalHost, and an issue where steps could continue running indefinitely if the slurmctld takes too long to respond (bsc#1214983) * includes a fix for potential slurmctld crashes when the backup slurmctld takes over. * This also fixes some issues when using older versions of the command line tools with a 23.02 controller. * srun/sbatch/salloc - In order to support user namespaces, process user and group ids are no longer used unless explicitly requested as an argument and are left as nobody(99) by default. Any cli_filters or SPANK plugins need to ignore any uid or gid that equals SLURM_AUTH_NOBODY (99). User and group ids are now resolved by the active auth plugin. To determine the actual job uid or gid you should use the RESPONSE_RESOURCE_ALLOCATION RPC. - removed Fix-test-3.13.patch as fixed upstream - removed Fix-test-38.11.patch as test changed upstream OBS-URL: https://build.opensuse.org/request/show/1109009 OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=256	2023-09-05 11:47:06 +00:00
Egbert Eich	532aa1e96d	Accepting request 1085668 from home:mslacken:branches:network:cluster - updated to 23.02.02 which includes a number of fixes to Slurm stability * Includes a fix for a regression in 23.02 that caused openmpi mpirun to fail to launch tasks. * It also includes two functional changes: Don't update the cron job tasks if the whole crontab file is left untouched after opening it with scrontab -e * Sort dynamic nodes and include them in topology after scontrol reconfigure or a slurmctld restart. OBS-URL: https://build.opensuse.org/request/show/1085668 OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=254	2023-05-09 10:35:16 +00:00
Egbert Eich	33bf8791ac	- Require slurm-munge if munge authentication is installed. - Replace 'Require: config(pam)' by 'Require: pam'. OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=252	2023-04-28 07:46:44 +00:00
Christian Goll	392bec3223	Accepting request 1082770 from home:eeich:branches:network:cluster - Web-configurator: changed presets to SUSE defaults. - If %_restart_on_update is no longer defined replace by own macro. - Marked slurm-openlava, slurm-seff and slurm-sjstat noarch. - rpmlint: * dropped some rpmlint filters which are no longer relevant. * added/refreshed filters. For details, see rpmlintrc. - Remove workaround to fix the restart issue in a Slurm package described in bsc#1088693. The Slurm version in this package was 16.05. Any attempt to directly migrate to the current version is bound to fail anyway. OBS-URL: https://build.opensuse.org/request/show/1082770 OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=251	2023-04-27 13:24:37 +00:00
Egbert Eich	5a68fc8e5f	- updated to 23.02.1 with the following changes: - removed right-pmix-path.patch as fixed upstream OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=249	2023-03-31 15:48:27 +00:00
Egbert Eich	d2a2e0a1e8	Accepting request 1076461 from home:mslacken:branches:network:cluster - updated to 23.02.1 with the following changes: * job_container/tmpfs - cleanup job container even if namespace mount is already unmounted. * openapi/dbv0.0.38 - Fix not displaying an error when updating QOS or associations fails. * Fix nodes remaining as PLANNED after slurmctld save state recovery. * Add cgroup.conf EnableControllers option for cgroup/v2. * Get correct cgroup root to allow slurmd to run in containers like Docker. * slurmctld - add missing PrivateData=jobs check to step ContainerID lookup requests originated from 'scontrol show step container-id=<id>' or certain scrun operations when container state can't be directly queried. * Fix nodes un-draining after being drained due to unkillable step. * Fix remote licenses allowed percentages reset to 0 during upgrade. * sacct - Avoid truncating time strings when using SLURM_TIME_FORMAT with the --parsable option. * Fix regression in 22.05.0rc1 that broke Nodes=ALL in a NodeSet. * openapi/v0.0.39 - fix jobs submitted via slurmrestd being allocated fewer CPUs than tasks when requesting multiple tasks. * Fix job not being scheduled on valid nodes and potentially being rejected when using parentheses at the beginning of square brackets in a feature request, for example: "feat1&[(feat2\|feat3)]". * Fix regression in 23.02.0rc1 which made --gres-flags=enforce-binding no longer enforce optimal core-gpu job placement. * mpi/pmix - Fix v5 to load correctly when libpmix.so isn't in the normal lib path. * data_parser/v0.0.39 - fix regression where "memory_per_node" would be rejected for job submission. * data_parser/v0.0.39 - fix regression where "memory_per_cpu" would be rejected for job submission.
* slurmctld - add an assert to check for magic number presence before deleting OBS-URL: https://build.opensuse.org/request/show/1076461 OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=248	2023-03-31 15:44:08 +00:00
Christian Goll	5c3d4865a1	Accepting request 1072591 from home:mslacken:branches:network:cluster added: right-pmix-path.patch OBS-URL: https://build.opensuse.org/request/show/1072591 OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=246	2023-03-17 10:52:44 +00:00
Christian Goll	9883ad6d58	Accepting request 1072585 from home:mslacken:branches:network:cluster - use libpmix.so.2 instead of libpmix.so to fix (bsc#1209260) this removes the need of pmix-pluginlib OBS-URL: https://build.opensuse.org/request/show/1072585 OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=245	2023-03-17 10:42:09 +00:00
Christian Goll	521f372d87	Accepting request 1072084 from home:mslacken:branches:network:cluster - slurm-plugins need to require pmix-pluginlib (bsc#1209260) OBS-URL: https://build.opensuse.org/request/show/1072084 OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=243	2023-03-15 10:57:09 +00:00
Egbert Eich	e85b508441	Accepting request 1070212 from home:eeich:branches:network:cluster - Fixing dependencies for slurm--plugin-ext-sensors-rrd again. OBS-URL: https://build.opensuse.org/request/show/1070212 OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=241	2023-03-08 15:43:28 +00:00
Egbert Eich	86940cb8c4	Accepting request 1070094 from home:eeich:branches:network:cluster - Fix conflicts for plugin-ext-sensors-rrd OBS-URL: https://build.opensuse.org/request/show/1070094 OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=240	2023-03-08 07:58:58 +00:00
Egbert Eich	0f04c66747	Accepting request 1070043 from home:eeich:branches:network:cluster - Fixup previous submission. OBS-URL: https://build.opensuse.org/request/show/1070043 OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=239	2023-03-07 22:14:15 +00:00
Egbert Eich	da464bfaae	Accepting request 1070038 from home:eeich:branches:network:cluster - Stop pulling firewall rules from github. There is no benefit to host these separately. - Remove pre-sle12 pieces. - Add missing Provides:, Conflicts: and Obsoletes: to slurm-cray, slurm-hdf5 and slurm-testsuite to avoid package conflicts. - Unify Obsoletes:. - Consolidate spec files between different Slurm releases in Leap/SLE maintenance. OBS-URL: https://build.opensuse.org/request/show/1070038 OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=238	2023-03-07 21:33:03 +00:00
Egbert Eich	6997bacde0	Accepting request 1068522 from home:eeich:branches:network:cluster - Add missing Provides: and Obsoletes: to slurm-cray, slurm-hdf5 and slurm-testsuite to avoid package conflicts. - Add dependency for the general plugin package to the AcctGatherProfile HDF5 plugin. - Adjust node RealMemory in slurm.conf of test suite for 8G test nodes. OBS-URL: https://build.opensuse.org/request/show/1068522 OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=236	2023-03-01 17:58:54 +00:00
Egbert Eich	e60f39a466	- updated to 23.02.0 OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=234	2023-02-28 20:50:48 +00:00
Egbert Eich	8899aac00b	- testsuite: on later SUSE versions claim ownership of directory OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=233	2023-02-28 20:34:03 +00:00
Egbert Eich	18aa012ab9	Accepting request 1068316 from home:eeich:branches:network:cluster + Fixed GpuFreqDef option. When set in slurm.conf, it will be used if --gpu-freq was not explicitly set by the job step. + topology/tree - Add new TopologyParam=SwitchAsNodeRank option to reorder nodes based on switch layout. This can be useful if the naming convention for the nodes does not naturally map to the network topology. + Removed the default setting for GpuFreqDef. If unset, no attempt to change the GPU frequency will be made if --gpu-freq is not set for the step. OBS-URL: https://build.opensuse.org/request/show/1068316 OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=232	2023-02-28 20:30:32 +00:00
Egbert Eich	ef6d6521aa	Accepting request 1067475 from home:eeich:branches:network:cluster - updated to 23.02.0-0rc1 * Highlights + slurmctld - Add new RPC rate limiting feature. This is enabled through SlurmctldParameters=rl_enable, otherwise disabled by default. + Make scontrol reconfigure and sending a SIGHUP to the slurmctld behave the same. If you were using SIGHUP as a 'lighter' scontrol reconfigure to rotate logs please update your scripts to use SIGUSR2 instead. + Change cloud nodes to show by default. PrivateData=cloud is no longer needed. + sreport - Count planned (FKA reserved) time for jobs running in IGNORE_JOBS reservations. Previously was lumped into IDLE time. + job_container/tmpfs - Support running with an arbitrary list of private mount points (/tmp and /dev/shm are the default, but not required). + job_container/tmpfs - Set more environment variables in InitScript. + Make all cgroup directories created by Slurm owned by root. This was the behavior in cgroup/v2 but not in cgroup/v1 where by default the step directories ownership were set to the user and group of the job. + accounting_storage/mysql - change purge/archive to calculate record ages based on end time, rather than start or submission times. + job_submit/lua - add support for log_user() from slurm_job_modify(). + Run the following scripts in slurmscriptd instead of slurmctld: ResumeProgram, ResumeFailProgram, SuspendProgram, ResvProlog, ResvEpilog, and RebootProgram (only with SlurmctldParameters=reboot_from_controller). + Only permit changing log levels with 'srun --slurmd-debug' by root or SlurmUser. + slurmctld will fatal() when reconfiguring the job_submit plugin fails. + Add PowerDownOnIdle partition option to power down nodes after nodes become idle. + Add "[jobid.stepid]" prefix from slurmstepd and "slurmscriptd" prefix from slurmscriptd to Syslog logging.
Previously was only happening when OBS-URL: https://build.opensuse.org/request/show/1067475 OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=231	2023-02-23 19:32:51 +00:00
Egbert Eich	4693e39860	Accepting request 1063954 from home:eeich:branches:network:cluster - testsuite: on later SUSE versions claim ownership of directory /etc/security/limits.d. OBS-URL: https://build.opensuse.org/request/show/1063954 OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=229	2023-02-09 08:22:55 +00:00
Egbert Eich	6f080824a4	Accepting request 1039957 from home:eeich:branches:network:cluster - Move the ext_sensors/rrd plugin to a separate package: this plugin requires librrd which in turn requires huge parts of the client side X Window System stack. There is probably no use in cluttering up a system for a plugin that is probably only used by a few. OBS-URL: https://build.opensuse.org/request/show/1039957 OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=227	2022-12-11 07:58:12 +00:00
Egbert Eich	212048404b	* Improve setup-testsuite.sh: copy ssh fingerprints from all nodes. OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=225	2022-10-26 06:23:36 +00:00
Egbert Eich	776ce8f23b	- Test Suite fixes: * Update README_Testsuite.md. * Clean up left over files when de-installing test suite. * Adjustment to test suite package: for SLE mark the openmpi4 devel package and slurm-hdf5 optional. * Add -ffat-lto-objects to the build flags when LTO is set to make sure the object files we ship with the test suite still work correctly. OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=224	2022-10-25 11:33:49 +00:00
Egbert Eich	642a47efa7	- Adjustment to test suite package: only recommend openmpi4 OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=223	2022-10-24 08:54:35 +00:00
Egbert Eich	52046053d5	Accepting request 1030610 from home:eeich:branches:network:cluster - Update README_Testsuite.md. - Make hdf5 package optional for test suite. - Clean up left over files when de-installing test suite. - set environment variable SUSE_ZNOW to 0 in %build to avoid module load failures due to unresolved symbols as modules take advantage of lazy bindings (bsc#1200030). OBS-URL: https://build.opensuse.org/request/show/1030610 OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=222	2022-10-24 05:31:40 +00:00
Egbert Eich	c2551ab47f	Accepting request 1010642 from home:mslacken:branches:network:cluster - updated to 22.05.5 - NOTE: Slurm validates that libraries are of the same version. Unfortunately, due to an oversight, we failed to notice that the slurmstepd loads the hash_k12 library only after a job has completed. This means that if the hash_k12 library is upgraded before a job finishes, the slurmstepd will load the new library when the job finishes, and will fail due to a mismatch of versions. This results in nodes with slurmstepd processes stuck indefinitely. These processes require manual intervention to clean up. There is no clean way to resolve these hung slurmstepd processes. The only recommended way to upgrade between minor versions of 22.05 with RPMs or upgrades that replace current binaries and libraries is to drain the nodes of running jobs first. - Fixes a number of moderate severity issues, notable are: * Load hash plugin at slurmstepd launch time to prevent issues loading the plugin at step completion if the Slurm installation is upgraded. * Update nvml plugin to match the unique id format for MIG devices in new Nvidia drivers. * Fix multi-node step launch failure when nodes in the controller aren't in natural order. This can happen with inconsistent node naming (such as node15 and node052) or with dynamic nodes which can register in any order. * job_container/tmpfs - cleanup containers even when the .ns file isn't mounted anymore. * Wait up to PrologEpilogTimeout before shutting down slurmd to allow prolog and epilog scripts to complete or timeout. Previously, slurmd waited 120 seconds before timing out and killing prolog and epilog scripts. OBS-URL: https://build.opensuse.org/request/show/1010642 OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=220	2022-10-21 15:00:25 +00:00
Egbert Eich	09aecc2015	Accepting request 1005746 from home:eeich:branches:network:cluster - Do not deduplicate files of testsuite Slurm configuration. This directory is supposed to be mounted over /etc/slurm therefore it must not contain softlinks to the files in this directory. - Improve .a and .o file collection for test suite: find these files even if there are multiple ones in a single line. OBS-URL: https://build.opensuse.org/request/show/1005746 OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=218	2022-09-26 15:01:51 +00:00
Egbert Eich	3f68233e21	Accepting request 1005246 from home:eeich:branches:network:cluster - Fix build for older product version. OBS-URL: https://build.opensuse.org/request/show/1005246 OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=216	2022-09-21 15:33:09 +00:00
Egbert Eich	b60ac5f569	Accepting request 992353 from home:eeich:branches:network:cluster - Fix a potential security vulnerability in the test package (bsc#1201674, CVE-2022-31251). - Patch NOFILE Limit in the slurmd.service copy for the testsuite. OBS-URL: https://build.opensuse.org/request/show/992353 OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=214	2022-08-02 15:34:01 +00:00
Egbert Eich	fd509c0258	Accepting request 990637 from home:bmwiedemann:branches:network:cluster make slurmtest.tar reproducible OBS-URL: https://build.opensuse.org/request/show/990637 OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=213	2022-08-02 13:14:07 +00:00
Egbert Eich	e067a36989	- Fix a typo which prevented the nproc limit for slurmd to be up-ed for the test suite. OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=211	2022-07-15 07:15:34 +00:00
Egbert Eich	69890cab1e	Accepting request 989256 from home:eeich:branches:network:cluster - Improve check for mpicc in testsuite package: if binary isn't found, don't crash. OBS-URL: https://build.opensuse.org/request/show/989256 OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=210	2022-07-15 07:13:32 +00:00
Egbert Eich	167150eca6	- Fix a typo OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=209	2022-07-15 07:12:53 +00:00
Egbert Eich	7d13a7ba97	Accepting request 988732 from home:eeich:branches:network:cluster - Package the Slurm testsuite for QA purposes. * Fixes for test suite: Keep-logs-of-skipped-test-when-running-test-cases-sequentially.patch Fix-test-21.41.patch Fix-test-38.11.patch Fix-test-32.8.patch Fix-test-3.13.patch Fix-test7.2-to-find-libpmix-under-lib64-as-well.patch * Add documentation: README_Testsuite.md - Allow log in as user 'slurm'. This allows admins to run certain privileged commands more easily without becoming root. OBS-URL: https://build.opensuse.org/request/show/988732 OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=207	2022-07-12 20:03:18 +00:00
Christian Goll	52adf61c22	Accepting request 983910 from home:mslacken:branches:network:cluster - update to 22.05.2 with following fixes: * Fix regression which allowed the oversubscription of licenses. * Fix a segfault in slurmctld when requesting gres in job arrays. OBS-URL: https://build.opensuse.org/request/show/983910 OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=206	2022-06-20 11:58:11 +00:00
Egbert Eich	2951a00ce2	- Package the Slurm testsuite for QA purposes. NOTE: This package is not meant to be used for testing by the user but rather for testing by the maintainers to ensure the package is working properly. DO NOT report test suite failures unless you are able to confirm that the failure is really a bug. OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=205	2022-06-08 13:21:55 +00:00
Christian Goll	faa19fe22b	Accepting request 980093 from home:mslacken:branches:network:cluster - update to 22.05.0 with the following changes: - Support for dynamic node addition and removal - Support for native Linux cgroup v2 operation - Newly added plugins to support HPE Slingshot 11 networks (switch/hpe_slingshot), and Intel Xe GPUs (gpu/oneapi) - Added new acct_gather_interconnect/sysfs plugin to collect statistics from arbitrary network interfaces. - Expanded and synced set of environment variables available in the Prolog/Epilog/PrologSlurmctld/EpilogSlurmctld scripts. - New "--prefer" option to job submissions to allow for a "soft constraint" request to influence node selection. - Optional support for license planning in the backfill scheduler with "bf_licenses" option in SchedulerParameters. - removed file slurm-2.4.4-init.patch as sysvinit is now really deprecated - removed file load-pmix-major-version.patch as fixed upstream OBS-URL: https://build.opensuse.org/request/show/980093 OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=203	2022-05-31 13:38:54 +00:00
Egbert Eich	a07f819c2f	- Update to 21.08.8 which fixes CVE-2022-29500 (bsc#1199278), CVE-2022-29501 (bsc#1199279), and CVE-2022-29502 (bsc#1199281). OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=201	2022-05-11 10:26:59 +00:00
Egbert Eich	5f6ca5dea6	Accepting request 976056 from home:eeich:branches:network:cluster - Add a comment about the CommunicationParameters=block_null_hash option warning users who migrate - just in case. OBS-URL: https://build.opensuse.org/request/show/976056 OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=200	2022-05-11 10:25:15 +00:00
Christian Goll	950ae37e78	Accepting request 975374 from home:mslacken:branches:network:cluster - Update to 21.08.8 which fixes CVE-2022-29500, CVE-2022-29501 and CVE-2022-29502 - Added 'CommunicationParameters=block_null_hash' to slurm.conf, please add this parameter to existing configurations. OBS-URL: https://build.opensuse.org/request/show/975374 OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=198	2022-05-06 15:13:12 +00:00
Christian Goll	30c749c9e0	Accepting request 974433 from home:mslacken:branches:network:cluster - Update to 21.08.7 with following changes: * openapi/v0.0.37 - correct calculation for bf_queue_len_mean in /diag. * Avoid shrinking a reservation when overlapping with downed nodes. * Only check TRES limits against current usage for TRES requested by the job. * Do not allocate shared gres (MPS) in whole-node allocations * Constrain slurmstepd to job/step cgroup like in previous versions of Slurm. * Fix warnings on 32-bit compilers related to printf() formats. * Fix reconfigure issues after disabling/reenabling the GANG PreemptMode. * Fix race condition where a cgroup was being deleted while another step was creating it. * Set the slurmd port correctly if multi-slurmd * Fix FAIL mail not being sent if a job was cancelled due to preemption. * slurmrestd - move debug logs for HTTP handling to be gated by debugflag NETWORK to avoid unnecessary logging of communication contents. * Fix issue with bad memory access when shrinking running steps. * Fix various issues with internal job accounting with GRES when jobs are shrunk. * Fix ipmi polling on slurmd reconfig or restart. * Fix srun crash when reserved ports are being used and het step fails to launch. * openapi/dbv0.0.37 - fix DELETE execution path on /user/{user_name}. * slurmctld - Properly requeue all components of a het job if PrologSlurmctld fails. * rlimits - remove final calls to limit nofiles to 4096 but to instead use the max possible nofiles in slurmd and slurmdbd. * Allow the DBD agent to load large messages (up to MAX_BUF_SIZE) from state. * Fix potential deadlock during slurmctld restart when there is a completing job. 
* slurmstepd - reduce user requested soft rlimits when they are above max hard rlimits to avoid rlimit request being completely ignored and OBS-URL: https://build.opensuse.org/request/show/974433 OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=196	2022-05-02 17:06:13 +00:00
Christian Goll	d442993ff4	Accepting request 942081 from home:mslacken:branches:network:cluster - update to 21.08.5 with the following changes: * Fix issue where typeless GRES node updates were not immediately reflected. * Fix setting the default scrontab job working directory so that it's the home of the different user (u <user>) and not that of root or SlurmUser editor. Fix stepd not respecting SlurmdSyslogDebug. * Fix concurrency issue with squeue. * Fix job start time not being reset after launch when job is packed onto already booting node. * Fix updating SLURM_NODE_ALIASES for jobs packed onto powering up nodes. * Cray - Fix issues with starting hetjobs. * auth/jwks - Print fatal() message when jwks is configured but file could not be opened. * If sacctmgr has an association with an unknown qos as the default qos print 'UNKN###' instead of leaving a blank name. Correctly determine task count when giving --cpus-per-gpu, --gpus and -ntasks-per-node without task count. slurmctld - Fix places where the global last_job_update was not being set to the time of update when a job's reason and description were updated. * slurmctld - Fix case where a job submitted with more than one partition would not have its reason updated while waiting to start. * Fix memory leak in node feature rebooting. * Fix time limit permanently set to 1 minute by backfill for job array tasks higher than the first with QOS NoReserve flag and PreemptMode configured. * Fix sacct -N to show jobs that started in the current second * Fix issue on running steps where both SLURM_NTASKS_PER_TRES and SLURM_NTASKS_PER_GPU are set. * Handle oversubscription request correctly when also requesting -ntasks-per-tres. Correctly detect when a step requests bad gres inside an allocation. * slurmstepd - Correct possible deadlock when UnkillableStepTimeout triggers.
OBS-URL: https://build.opensuse.org/request/show/942081 OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=195	2021-12-23 10:26:41 +00:00
Christian Goll	350be975f5	Accepting request 932063 from home:aginies:branches:network:cluster add a ref to SLE-22741 (firewall config) in changelog OBS-URL: https://build.opensuse.org/request/show/932063 OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=194	2021-11-18 09:37:45 +00:00
Christian Goll	d4c2b2bcf3	- updated to 21.08.4 which fixes (CVE-2021-43337) which is only present in 21.08 tree. * CVE-2021-43337: For sites using the new AccountingStoreFlags=job_script and/or job_env options, an issue was reported with the access control rules in SlurmDBD that will permit users to request job scripts and environment files that they should not have access to. (Scripts/environments are meant to only be accessible by user accounts with administrator privileges, by account coordinators for jobs submitted under their account, and by the user themselves.) - changes from 21.08.3: * This includes a number of fixes since the last release a month ago, including one critical fix to prevent a communication issue between slurmctld and slurmdbd for sites that have started using the new AccountingStoreFlags=job_script functionality. OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=193	2021-11-17 08:37:51 +00:00
Egbert Eich	c67f43163f	Accepting request 928191 from home:eeich:branches:network:cluster - Utilize sysuser infrastructure to set user/group slurm. For munge authentication slurm should have a fixed UID across all nodes including the management server. Set it to 120 - Limit firewalld service definitions to SUSE versions >= 15. OBS-URL: https://build.opensuse.org/request/show/928191 OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=192	2021-10-29 17:38:05 +00:00
Christian Goll	f4a3f06e75	Accepting request 926016 from home:mslacken:branches:network:cluster - added service definitions for firewalld OBS-URL: https://build.opensuse.org/request/show/926016 OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=191	2021-10-29 14:17:34 +00:00
Christian Goll	7a20fda376	Accepting request 923425 from home:mslacken:branches:network:cluster - update to 21.08.2 - major change: * removed support for the TaskAffinity=yes option in cgroup.conf. Please consider using "TaskPlugins=cgroup,affinity" in slurm.conf as an option. - minor changes and bugfixes: * slurmctld - fix how the max number of cores on a node in a partition are calculated when the partition contains multisocket nodes. This in turn corrects certain jobs node count estimations displayed clientside. * job_submit/cray_aries - fix "craynetwork" GRES specification after changes introduced in 21.08.0rc1 that made TRES always have a type prefix. * Ignore nonsensical check in the slurmd for [Pro\|Epi]logSlurmctld. * Fix writing to stderr/syslog when systemd runs slurmctld in the foreground. * Fix issue with updating job started with node range. * Fix issue with nodes not clearing state in the database when the slurmctld is started with cleanstart. Fix hetjob components > 1 timing out due to InactiveLimit. * Fix sprio printing -nan for normalized association priority if PriorityWeightAssoc was not defined. * Disallow FirstJobId=0. * Preserve job start info in the database for a requeued job that hadn't registered the first time in the database yet. * Only send one message on prolog failure from the slurmd. * Remove support for TaskAffinity=yes in cgroup.conf. * accounting_storage/mysql - fix issue where querying jobs via sacct -whole-hetjob=yes or slurmrestd (which automatically includes this flag) could in some cases return more records than expected. Fix issue for preemption of job array task that makes afterok dependency fail. Additionally, send emails when requeueing happens due to preemption. * Fix sending requeue mail type. * Properly resize a job's GRES bitmaps and counts when resizing the job.
OBS-URL: https://build.opensuse.org/request/show/923425 OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=190	2021-10-11 08:40:56 +00:00

180 Commits