slurm

pool/slurm


Author	SHA256	Message	Date
Egbert Eich	52046053d5	Accepting request 1030610 from home:eeich:branches:network:cluster - Update README_Testsuite.md. - Make hdf5 package optional for test suite. - Clean up left over files when de-installing test suite. - set environment variable SUSE_ZNOW to 0 in %build to avoid module load failures due to unresolved symbols as module take advantage of lazy bindings (bsc#1200030). OBS-URL: https://build.opensuse.org/request/show/1030610 OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=222	2022-10-24 05:31:40 +00:00
Egbert Eich	c2551ab47f	Accepting request 1010642 from home:mslacken:branches:network:cluster - updated to 22.05.5 - NOTE: Slurm validates that libraries are of the same version. Unfortunately, due to an oversight, we failed to notice that the slurmstepd loads the hash_k12 library only after a job has completed. This means that if the hash_k12 library is upgraded before a job finishes, the slurmstepd will load the new library when the job finishes, and will fail due to a mismatch of versions. This results in nodes with slurmstepd processes stuck indefinitely. These processes require manual intervention to clean up. There is no clean way to resolve these hung slurmstepd processes. The only recommended way to upgrade between minor versions of 22.05 with RPMs or upgrades that replace current binaries and libraries is to drain the nodes of running jobs first. - Fixes a number of moderate severity issues, notable are: * Load hash plugin at slurmstepd launch time to prevent issues loading the plugin at step completion if the Slurm installation is upgraded. * Update nvml plugin to match the unique id format for MIG devices in new Nvidia drivers. * Fix multi-node step launch failure when nodes in the controller aren't in natural order. This can happen with inconsistent node naming (such as node15 and node052) or with dynamic nodes which can register in any order. * job_container/tmpfs - cleanup containers even when the .ns file isn't mounted anymore. * Wait up to PrologEpilogTimeout before shutting down slurmd to allow prolog and epilog scripts to complete or timeout. Previously, slurmd waited 120 seconds before timing out and killing prolog and epilog scripts. OBS-URL: https://build.opensuse.org/request/show/1010642 OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=220	2022-10-21 15:00:25 +00:00
Egbert Eich	09aecc2015	Accepting request 1005746 from home:eeich:branches:network:cluster - Do not deduplicate files of testsuite Slurm configuration. This directory is supposed to be mounted over /etc/slurm therefore it must not contain softlinks to the files in this directory. - Improve .a and .o file collection for test suite: find these files even if there are multiple ones in a single line. OBS-URL: https://build.opensuse.org/request/show/1005746 OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=218	2022-09-26 15:01:51 +00:00
Egbert Eich	3f68233e21	Accepting request 1005246 from home:eeich:branches:network:cluster - Fix build for older product version. OBS-URL: https://build.opensuse.org/request/show/1005246 OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=216	2022-09-21 15:33:09 +00:00
Egbert Eich	b60ac5f569	Accepting request 992353 from home:eeich:branches:network:cluster - Fix a potential security vulnerability in the test package (bsc#1201674, CVE-2022-31251). - Patch NOFILE Limit in the slurmd.service copy for the testsuite. OBS-URL: https://build.opensuse.org/request/show/992353 OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=214	2022-08-02 15:34:01 +00:00
Egbert Eich	fd509c0258	Accepting request 990637 from home:bmwiedemann:branches:network:cluster make slurmtest.tar reproducible OBS-URL: https://build.opensuse.org/request/show/990637 OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=213	2022-08-02 13:14:07 +00:00
Egbert Eich	e067a36989	- Fix a typo which prevented the nproc limit for slurmd from being raised for the test suite. OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=211	2022-07-15 07:15:34 +00:00
Egbert Eich	69890cab1e	Accepting request 989256 from home:eeich:branches:network:cluster - Improve check for mpicc in testsuite package: if binary isn't found, don't crash. OBS-URL: https://build.opensuse.org/request/show/989256 OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=210	2022-07-15 07:13:32 +00:00
Egbert Eich	167150eca6	- Fix a typo OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=209	2022-07-15 07:12:53 +00:00
Egbert Eich	7d13a7ba97	Accepting request 988732 from home:eeich:branches:network:cluster - Package the Slurm testsuite for QA purposes. * Fixes for test suite: Keep-logs-of-skipped-test-when-running-test-cases-sequentially.patch Fix-test-21.41.patch Fix-test-38.11.patch Fix-test-32.8.patch Fix-test-3.13.patch Fix-test7.2-to-find-libpmix-under-lib64-as-well.patch * Add documentation: README_Testsuite.md - Allow log in as user 'slurm'. This allows admins to run certain privileged commands more easily without becoming root. OBS-URL: https://build.opensuse.org/request/show/988732 OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=207	2022-07-12 20:03:18 +00:00
Christian Goll	52adf61c22	Accepting request 983910 from home:mslacken:branches:network:cluster - update to 22.05.2 with following fixes: * Fix regression which allowed the oversubscription of licenses. * Fix a segfault in slurmctld when requesting gres in job arrays. OBS-URL: https://build.opensuse.org/request/show/983910 OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=206	2022-06-20 11:58:11 +00:00
Egbert Eich	2951a00ce2	- Package the Slurm testsuite for QA purposes. NOTE: This package is not meant to be used for testing by the user but rather for testing by the maintainers to ensure the package is working properly. DO NOT report test suite failures unless you are able to confirm that the failure is really a bug. OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=205	2022-06-08 13:21:55 +00:00
Christian Goll	faa19fe22b	Accepting request 980093 from home:mslacken:branches:network:cluster - update to 22.05.0 with following changes: - Support for dynamic node addition and removal - Support for native Linux cgroup v2 operation - Newly added plugins to support HPE Slingshot 11 networks (switch/hpe_slingshot), and Intel Xe GPUs (gpu/oneapi) - Added new acct_gather_interconnect/sysfs plugin to collect statistics from arbitrary network interfaces. - Expanded and synced set of environment variables available in the Prolog/Epilog/PrologSlurmctld/EpilogSlurmctld scripts. - New "--prefer" option to job submissions to allow for a "soft constraint" request to influence node selection. - Optional support for license planning in the backfill scheduler with "bf_licenses" option in SchedulerParameters. - removed file slurm-2.4.4-init.patch as sysvinit is now really deprecated - removed file load-pmix-major-version.patch as fixed upstream OBS-URL: https://build.opensuse.org/request/show/980093 OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=203	2022-05-31 13:38:54 +00:00
Egbert Eich	a07f819c2f	- Update to 21.08.8 which fixes CVE-2022-29500 (bsc#1199278), CVE-2022-29501 (bsc#1199279), and CVE-2022-29502 (bsc#1199281). OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=201	2022-05-11 10:26:59 +00:00
Egbert Eich	5f6ca5dea6	Accepting request 976056 from home:eeich:branches:network:cluster - Add a comment about the CommunicationParameters=block_null_hash option warning users who migrate - just in case. OBS-URL: https://build.opensuse.org/request/show/976056 OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=200	2022-05-11 10:25:15 +00:00
Christian Goll	950ae37e78	Accepting request 975374 from home:mslacken:branches:network:cluster - Update to 21.08.8 which fixes CVE-2022-29500, CVE-2022-29501 and CVE-2022-29502 - Added 'CommunicationParameters=block_null_hash' to slurm.conf, please add this parameter to existing configurations. OBS-URL: https://build.opensuse.org/request/show/975374 OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=198	2022-05-06 15:13:12 +00:00
Christian Goll	30c749c9e0	Accepting request 974433 from home:mslacken:branches:network:cluster - Update to 21.08.7 with following changes: * openapi/v0.0.37 - correct calculation for bf_queue_len_mean in /diag. * Avoid shrinking a reservation when overlapping with downed nodes. * Only check TRES limits against current usage for TRES requested by the job. * Do not allocate shared gres (MPS) in whole-node allocations * Constrain slurmstepd to job/step cgroup like in previous versions of Slurm. * Fix warnings on 32-bit compilers related to printf() formats. * Fix reconfigure issues after disabling/reenabling the GANG PreemptMode. * Fix race condition where a cgroup was being deleted while another step was creating it. * Set the slurmd port correctly if multi-slurmd * Fix FAIL mail not being sent if a job was cancelled due to preemption. * slurmrestd - move debug logs for HTTP handling to be gated by debugflag NETWORK to avoid unnecessary logging of communication contents. * Fix issue with bad memory access when shrinking running steps. * Fix various issues with internal job accounting with GRES when jobs are shrunk. * Fix ipmi polling on slurmd reconfig or restart. * Fix srun crash when reserved ports are being used and het step fails to launch. * openapi/dbv0.0.37 - fix DELETE execution path on /user/{user_name}. * slurmctld - Properly requeue all components of a het job if PrologSlurmctld fails. * rlimits - remove final calls to limit nofiles to 4096 but to instead use the max possible nofiles in slurmd and slurmdbd. * Allow the DBD agent to load large messages (up to MAX_BUF_SIZE) from state. * Fix potential deadlock during slurmctld restart when there is a completing job. 
* slurmstepd - reduce user requested soft rlimits when they are above max hard rlimits to avoid rlimit request being completely ignored and OBS-URL: https://build.opensuse.org/request/show/974433 OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=196	2022-05-02 17:06:13 +00:00
Christian Goll	d442993ff4	Accepting request 942081 from home:mslacken:branches:network:cluster - update to 21.08.5 with following changes: * Fix issue where typeless GRES node updates were not immediately reflected. * Fix setting the default scrontab job working directory so that it's the home of the different user (-u <user>) and not that of root or SlurmUser editor. * Fix stepd not respecting SlurmdSyslogDebug. * Fix concurrency issue with squeue. * Fix job start time not being reset after launch when job is packed onto already booting node. * Fix updating SLURM_NODE_ALIASES for jobs packed onto powering up nodes. * Cray - Fix issues with starting hetjobs. * auth/jwks - Print fatal() message when jwks is configured but file could not be opened. * If sacctmgr has an association with an unknown qos as the default qos print 'UNKN###' instead of leaving a blank name. * Correctly determine task count when giving --cpus-per-gpu, --gpus and --ntasks-per-node without task count. * slurmctld - Fix places where the global last_job_update was not being set to the time of update when a job's reason and description were updated. * slurmctld - Fix case where a job submitted with more than one partition would not have its reason updated while waiting to start. * Fix memory leak in node feature rebooting. * Fix time limit permanently set to 1 minute by backfill for job array tasks higher than the first with QOS NoReserve flag and PreemptMode configured. * Fix sacct -N to show jobs that started in the current second * Fix issue on running steps where both SLURM_NTASKS_PER_TRES and SLURM_NTASKS_PER_GPU are set. * Handle oversubscription request correctly when also requesting --ntasks-per-tres. * Correctly detect when a step requests bad gres inside an allocation. * slurmstepd - Correct possible deadlock when UnkillableStepTimeout triggers.
OBS-URL: https://build.opensuse.org/request/show/942081 OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=195	2021-12-23 10:26:41 +00:00
Christian Goll	350be975f5	Accepting request 932063 from home:aginies:branches:network:cluster add a ref to SLE-22741 (firewall config) in changelog OBS-URL: https://build.opensuse.org/request/show/932063 OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=194	2021-11-18 09:37:45 +00:00
Christian Goll	d4c2b2bcf3	- updated to 21.08.4 which fixes (CVE-2021-43337) which is only present in 21.08 tree. * CVE-2021-43337: For sites using the new AccountingStoreFlags=job_script and/or job_env options, an issue was reported with the access control rules in SlurmDBD that will permit users to request job scripts and environment files that they should not have access to. (Scripts/environments are meant to only be accessible by user accounts with administrator privileges, by account coordinators for jobs submitted under their account, and by the user themselves.) - changes from 21.08.3: * This includes a number of fixes since the last release a month ago, including one critical fix to prevent a communication issue between slurmctld and slurmdbd for sites that have started using the new AccountingStoreFlags=job_script functionality. OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=193	2021-11-17 08:37:51 +00:00
Egbert Eich	c67f43163f	Accepting request 928191 from home:eeich:branches:network:cluster - Utilize sysuser infrastructure to set user/group slurm. For munge authentication slurm should have a fixed UID across all nodes including the management server. Set it to 120 - Limit firewalld service definitions to SUSE versions >= 15. OBS-URL: https://build.opensuse.org/request/show/928191 OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=192	2021-10-29 17:38:05 +00:00
Christian Goll	f4a3f06e75	Accepting request 926016 from home:mslacken:branches:network:cluster - added service definitions for firewalld OBS-URL: https://build.opensuse.org/request/show/926016 OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=191	2021-10-29 14:17:34 +00:00
Christian Goll	7a20fda376	Accepting request 923425 from home:mslacken:branches:network:cluster - update to 21.08.2 - major change: * removed support for the TaskAffinity=yes option in cgroup.conf. Please consider using "TaskPlugins=cgroup,affinity" in slurm.conf as an option. - minor changes and bugfixes: * slurmctld - fix how the max number of cores on a node in a partition are calculated when the partition contains multisocket nodes. This in turn corrects certain jobs node count estimations displayed clientside. * job_submit/cray_aries - fix "craynetwork" GRES specification after changes introduced in 21.08.0rc1 that made TRES always have a type prefix. * Ignore nonsensical check in the slurmd for [Pro|Epi]logSlurmctld. * Fix writing to stderr/syslog when systemd runs slurmctld in the foreground. * Fix issue with updating job started with node range. * Fix issue with nodes not clearing state in the database when the slurmctld is started with cleanstart. * Fix hetjob components > 1 timing out due to InactiveLimit. * Fix sprio printing -nan for normalized association priority if PriorityWeightAssoc was not defined. * Disallow FirstJobId=0. * Preserve job start info in the database for a requeued job that hadn't registered the first time in the database yet. * Only send one message on prolog failure from the slurmd. * Remove support for TaskAffinity=yes in cgroup.conf. * accounting_storage/mysql - fix issue where querying jobs via sacct --whole-hetjob=yes or slurmrestd (which automatically includes this flag) could in some cases return more records than expected. * Fix issue for preemption of job array task that makes afterok dependency fail. Additionally, send emails when requeueing happens due to preemption. * Fix sending requeue mail type. * Properly resize a job's GRES bitmaps and counts when resizing the job.
OBS-URL: https://build.opensuse.org/request/show/923425 OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=190	2021-10-11 08:40:56 +00:00
Christian Goll	64b9f7f60a	macro fixed OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=189	2021-09-29 07:35:03 +00:00
Christian Goll	1b26b8910b	via the macro %_pam_moduledir OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=188	2021-09-29 07:08:48 +00:00
Christian Goll	728a1b3c1e	updated major version OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=187	2021-09-28 15:54:50 +00:00
Christian Goll	5b07269e3d	Accepting request 919668 from home:mslacken:branches:network:cluster - updated to 21.08.1 with following bug fixes: * Fix potential memory leak if a problem happens while allocating GRES for a job. * If an overallocation of GRES happens terminate the creation of a job. * AutoDetect=nvml: Fatal if no devices found in MIG mode. * Print federation and cluster sacctmgr error messages to stderr. * Fix off by one error in --gpu-bind=mask_gpu. * Add --gpu-bind=none to disable gpu binding when using --gpus-per-task. * Handle the burst buffer state "alloc-revoke" which previously would not display in the job correctly. * Fix issue in the slurmstepd SPANK prolog/epilog handler where configuration values were used before being initialized. * Restore a step's ability to utilize all of an allocations memory if --mem=0. * Fix --cpu-bind=verbose garbage taskid. * Fix cgroup task affinity issues from garbage taskid info. * Make gres_job_state_validate() client logging behavior as before 44466a4641. * Fix steps with --hint overriding an allocation with --threads-per-core. * Require requesting a GPU if --mem-per-gpu is requested. * Return error early if a job is requesting --ntasks-per-gpu and no gpus or task count. * Properly clear out pending step if unavailable to run with available resources. * Kill all processes spawned by burst_buffer.lua including descendants. * openapi/v0.0.{35,36,37} - Avoid setting default values of min_cpus, job name, cwd, mail_type, and contiguous on job update. * openapi/v0.0.{35,36,37} - Clear user hold on job update if hold=false. * Prevent CRON_JOB flag from being cleared when loading job state. * sacctmgr - Fix deleting WCKeys when not specifying a cluster. * Fix getting memory for a step when the first node in the step isn't the first node in the allocation.
OBS-URL: https://build.opensuse.org/request/show/919668 OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=186	2021-09-27 09:23:35 +00:00
Christian Goll	e22daa9ce5	Accepting request 917243 from home:eeich:branches:network:cluster - Fix-statement-condition-in-netloc-autoconf-macro.patch: Fix netloc check, reestablish netloc disable code. - Make configure arg '--with-pmix' conditional. - Move openapi plugins to package slurm-restd. OBS-URL: https://build.opensuse.org/request/show/917243 OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=185	2021-09-08 07:34:10 +00:00
Christian Goll	562a595d05	Accepting request 915777 from home:mslacken:slurm_update - updated to 21.08.1, major changes: * A new "AccountingStoreFlags=job_script" option to store the job scripts directly in SlurmDBD. * Added "sacct -o SubmitLine" format option to get the submit line of a job/step. * Changes to the node state management so that nodes are marked as PLANNED instead of IDLE if the scheduler is still accumulating resources while waiting to launch a job on them. * RS256 token support in auth/jwt. * Overhaul of the cgroup subsystems to simplify operation, mitigate a number of inherent race conditions, and prepare for future cgroup v2 support. * Further improvements to cloud node power state management. * A new child process of the Slurm controller called "slurmscriptd" responsible for executing PrologSlurmctld and EpilogSlurmctld scripts, which significantly reduces performance issues associated with enabling those options. * A new burst_buffer/lua plugin allowing for site-specific asynchronous job data management. * Fixes to the job_container/tmpfs plugin to allow the slurmd process to be restarted while the job is running without issue. * Added json/yaml output to sacct, squeue, and sinfo commands. * Added a new node_features/helpers plugin to provide a generic way to change settings on a compute node across a reboot. * Added support for automatically detecting and broadcasting shared libraries for an executable launched with "srun --bcast". * Added initial OCI container execution support with a new --container option to sbatch and srun. * Improved "configless" support by allowing multiple control servers to be specified through the slurmd --conf-server option, and send additional configuration files at startup including cli_filter.lua. OBS-URL: https://build.opensuse.org/request/show/915777 OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=184	2021-09-06 13:29:00 +00:00
Christian Goll	b61c5b25fa	Accepting request 903744 from home:mslacken:slurm_update - Updated to 20.11.8: * slurmctld - fix erroneous "StepId=CORRUPT" messages in error logs. * Correct the error given when auth plugin fails to pack a credential. * Fix unused-variable compiler warning on FreeBSD in fd_resolve_path(). * acct_gather_filesystem/lustre - only emit collection error once per step. * Add GRES environment variables (e.g., CUDA_VISIBLE_DEVICES) into the interactive step, the same as is done for the batch step. * Fix various potential deadlocks when altering objects in the database dealing with every cluster in the database. * slurmrestd: - handle slurmdbd connection failures without segfaulting. - fix segfault for searches in slurmdb/v0.0.36/jobs. - remove (non-functioning) users query parameter for slurmdb/v0.0.36/jobs from openapi.json - fix segfault in slurmrestd db/jobs with numeric queries - add argv handling for job/submit endpoint. - add description for slurmdb/job endpoint. * slurmrestd/dbv0.0.36: - Fix values dumped in job state/current and job step state. - Correct description for previous state property. * srun: - fix broken node step allocation in a heterogeneous allocation. - leave SLURM_DIST_UNKNOWN as default for --interactive. * Fail step creation if -n is not multiple of --ntasks-per-gpu. * job_container/tmpfs - Fix slowdown on teardown. * Fix problem with SlurmctldProlog where requeued jobs would never launch. * job_container/tmpfs - Fix issue when restarting slurmd where the namespace mount points could disappear. * sacct: OBS-URL: https://build.opensuse.org/request/show/903744 OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=183	2021-07-02 15:32:26 +00:00
Egbert Eich	b4f7e9209d	- New features in 20.11.7: - New features in 20.11.6: - Fix Provides:/Conflicts: for libnss_slurm (bsc#1180700). OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=181	2021-05-19 18:34:28 +00:00
Christian Goll	89b4ed3f9f	- Updated to 20.11.7 which fixes CVE-2021-31215 (bsc#1186024) - New features from 20.11.7: * slurmd - handle configless failures gracefully instead of hanging indefinitely. * select/cons_tres - fix Dragonfly topology not selecting nodes in the same leaf switch when it should as well as requests with --switches option. * Fix issue where certain step requests wouldn't run if the first node in the job allocation was full and there were idle resources on other nodes in the job allocation. * Fix deadlock issue with <Prolog|Epilog>Slurmctld. * torque/qstat - fix printf error message in output. * When adding associations or wckeys avoid checking multiple times a user or cluster name. * Fix wrong jobacctgather information on a step on multiple nodes due to timeouts sending its the information gathered on its node. * Fix missing xstrdup which could result in slurmctld segfault on array jobs. * Fix security issue in PrologSlurmctld and EpilogSlurmctld by always prepending SPANK_ to all user-set environment variables. CVE-2021-31215. - New features from 20.11.6: * Fix sacct assert with the --qos option. * Use pkg-config --atleast-version instead of --modversion for systemd. * common/fd - fix getsockopt() call in fd_get_socket_error(). * Properly handle the return from fd_get_socket_error() in _conn_readable(). * cons_res - Fix issue where running jobs were not taken into consideration when creating a reservation. * Avoid a deadlock between job_list for_each and assoc QOS_LOCK. * Fix TRESRunMins usage for partition qos on restart/reconfig. * Fix printing of number of tasks on a completed job that didn't request tasks. * Fix updating GrpTRESRunMins when decrementing job time is bigger than it. OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=179	2021-05-14 10:35:47 +00:00
Egbert Eich	47fc726263	Accepting request 890261 from home:eeich:branches:network:cluster - Ship REST API version and auth plugins with slurmrestd. - Add YAML support for REST API to build (bsc#1185603). OBS-URL: https://build.opensuse.org/request/show/890261 OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=177	2021-05-04 08:36:53 +00:00
Ana Guerrero	ff5dc58526	Accepting request 879659 from home:anag:branches:home:mslacken:slurm_up update + typo fix OBS-URL: https://build.opensuse.org/request/show/879659 OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=175	2021-03-17 10:26:51 +00:00
Christian Goll	927cd6ab24	Accepting request 874647 from home:mslacken:branches:network:cluster - Update to 20.11.04 * Fix node selection for advanced reservations with features. * mpi/pmix: Handle pipe failure better when using ucx. * mpi/pmix: include PMIX_NODEID for each process entry. * Fix job getting rejected after being requeued on same node that died. * job_submit/lua - add "network" field. * Fix situations when a recurring reservation could erroneously skip a period. * Ensure that a reservations [pro|epi]log are run on recurring reservations. * Fix threads-per-core memory allocation issue when using CR_CPU_MEMORY. * Fix scheduling issue with --gpus. * Fix gpu allocations that request --cpus-per-task. * mpi/pmix: fixed print messages for all PMIXP_* macros * Add mapping for XCPU to --signal option. * Fix regression in 20.11 that prevented a full pass of the main scheduler from ever executing. * Work around a glibc bug in which "0" is incorrectly printed as "nan" which will result in corrupted association state on restart. * Fix regression in 20.11 which made slurmd incorrectly attempt to find the parent slurmd address when not applicable and send incorrect reversetree info to the slurmstepd. * Fix cgroup ns detection when using containers (e.g. LXC or Docker). * scrontab - change temporary file handling to work with emacs. - Removed check-for-lipmix.so.MAJOR.patch - Added: load-pmix-major-version.patch OBS-URL: https://build.opensuse.org/request/show/874647 OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=173	2021-02-24 09:49:16 +00:00
Ana Guerrero	4ab9986278	Accepting request 864993 from home:anag:branches:network:cluster - Update to 20.11.03 - This release includes a major functional change to how job step launch is handled compared to the previous 20.11 releases. This affects srun as well as MPI stacks - such as Open MPI - which may use srun internally as part of the process launch. One of the changes made in the Slurm 20.11 release was to the semantics for job steps launched through the 'srun' command. This also inadvertently impacts many MPI releases that use srun underneath their own mpiexec/mpirun command. For 20.11.{0,1,2} releases, the default behavior for srun was changed such that each step was allocated exactly what was requested by the options given to srun, and did not have access to all resources assigned to the job on the node by default. This change was equivalent to Slurm setting the --exclusive option by default on all job steps. Job steps desiring all resources on the node needed to explicitly request them through the new '--whole' option. In the 20.11.3 release, we have reverted to the 20.02 and older behavior of assigning all resources on a node to the job step by default. This reversion is a major behavioral change which we would not generally do on a maintenance release, but is being done in the interest of restoring compatibility with the large number of existing Open MPI (and other MPI flavors) and job scripts that exist in production, and to remove what has proven to be a significant hurdle in moving to the new release. Please note that one change to step launch remains - by default, in 20.11 steps are no longer permitted to overlap on the resources they have been assigned. If that behavior is desired, all steps must explicitly opt-in through the newly added '--overlap' option. 
Further details and a full explanation of the issue can be found at: https://bugs.schedmd.com/show_bug.cgi?id=10383#c63 OBS-URL: https://build.opensuse.org/request/show/864993 OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=171	2021-01-20 13:58:46 +00:00
Egbert Eich	82c61d739d	Accepting request 861776 from home:eeich:branches:network:cluster - Fix fallout introduced by: "Replace '%service_del_postun -n' with '%service_del_postun_without_restart'" for older Leap/SLE versions. OBS-URL: https://build.opensuse.org/request/show/861776 OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=169	2021-01-08 17:40:48 +00:00
Egbert Eich	0d02ad4cfa	- Fix Provides:/Conflicts: for libnss_slurm. OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=167	2021-01-08 12:21:49 +00:00
Egbert Eich	c50d4048dc	Accepting request 845752 from home:fbui:branches:network:cluster - Replace '%service_del_postun -n' with '%service_del_postun_without_restart' '-n' is deprecated and will be removed in the future. OBS-URL: https://build.opensuse.org/request/show/845752 OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=166	2021-01-08 12:18:52 +00:00
Ana Guerrero	08c7233b38	Accepting request 860690 from home:anag:branches:network:cluster - Add support for configuration files from external plugins. While built-in plugins have their configuration added in slurm.conf, external SPANK plugins add their configuration to plugstack.conf To allow packaging easily spank plugins, their configuration files should be added independently at /etc/spack/plugstack.conf.d and plugstack.conf should be left with an oneliner including all the files under /etc/spack/plugstack.conf.d OBS-URL: https://build.opensuse.org/request/show/860690 OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=164	2021-01-06 10:42:08 +00:00
Ana Guerrero	caa18eaeaa	Accepting request 859114 from home:anag:branches:network:cluster - Update to 20.11.02 * Fix older versions of sacct not working with 20.11. * Fix slurmctld crash when using a pre-20.11 srun in a job allocation. * Correct logic problem in _validate_user_access. * Fix libpmi to initialize Slurm configuration correctly. - Update to 20.11.01 * Fix spelling of "overcomited" to "overcomitted" in sreport's cluster utilization report. * Silence debug message about shutting down backup controllers if none are configured. * Don't create interactive srun until PrologSlurmctld is done. * Fix fd symlink path resolution. * Fix slurmctld segfault on subnode reservation restore after node configuration change. * Fix resource allocation response message environment allocation size. * Ensure that details->env_sup is NULL terminated. * select/cray_aries - Correctly remove jobs/steps from blades using NPC. * cons_tres - Avoid max_node_gres when entire node is allocated with --ntasks-per-gpu. * Allow NULL arg to data_get_type(). * In sreport have usage for a reservation contain all jobs that ran in the reservation instead of just the ones that ran in the time specified. This matches the report for the reservation is not truncated for a time period. * Fix issue with sending wrong batch step id to a < 20.11 slurmd. * Add a job's alloc_node to lua for job modification and completion. * Fix regression getting a slurmdbd connection through the perl API. * Stop the extern step terminate monitor right after proctrack_g_wait(). * Fix removing the normalized priority of assocs. * slurmrestd/v0.0.36 - Use correct name for partition field: "min nodes per job" -"min_nodes_per_job". OBS-URL: https://build.opensuse.org/request/show/859114 OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=162	2020-12-29 03:15:30 +00:00
Egbert Eich	d5d3aa2162	Accepting request 852039 from home:eeich:branches:network:cluster - Update to version 20.11.0 Slurm 20.11 includes a number of new features including: * Overhaul of the job step management and launch code, alongside improved GPU task placement support. * A new "Interactive Step" mode of operation for salloc. * A new "scrontab" command that can be used to submit and manage periodically repeating jobs. * IPv6 support. * Changes to the reservation logic, with new options allowing users to delete reservations, allowing admins to skip the next occurrence of a repeated reservation, and allowing for a job to be submitted and eligible to run within multiple reservations. * Dynamic Future Nodes - automatically associate a dynamically provisioned (or "cloud") node against a NodeName definition with matching hardware. * An experimental new RPC queuing mode for slurmctld to reduce thread contention on heavily loaded clusters. * SlurmDBD integration with the Slurm REST API. Also check https://github.com/SchedMD/slurm/blob/slurm-20-11-0-1/RELEASE_NOTES OBS-URL: https://build.opensuse.org/request/show/852039 OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=160	2020-12-05 14:46:07 +00:00
Ana Guerrero	370ac32279	Accepting request 849252 from home:anag:branches:network:cluster - Updated to 20.02.6, addresses two security fixes: * PMIx - fix potential buffer overflows from use of unpackmem(). CVE-2020-27745 (bsc#1178890) * X11 forwarding - fix potential leak of the magic cookie when sent as an argument to the xauth command. CVE-2020-27746 (bsc#1178891) - And many other bugfixes, full log and details available at: * https://lists.schedmd.com/pipermail/slurm-announce/2020/000045.html OBS-URL: https://build.opensuse.org/request/show/849252 OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=158	2020-11-18 09:57:56 +00:00
Egbert Eich	e481851f5a	Accepting request 845108 from home:anag:branches:network:cluster - Updated to 20.02.5, changes: * Fix leak of TRESRunMins when job time is changed with --time-min * pam_slurm - explicitly initialize slurm config to support configless mode. * scontrol - Fix exit code when creating/updating reservations with wrong Flags. * When a GRES has a no_consume flag, report 0 for allocated. * Fix cgroup cleanup by jobacct_gather/cgroup. * When creating reservations/jobs don't allow counts on a feature unless using an XOR. * Improve number of boards discovery * Fix updating a reservation NodeCnt on a zero-count reservation. * slurmrestd - provide explicit error messages when PSK auth fails. * cons_tres - fix job requesting single gres per-node getting two or more nodes with less CPUs than requested per-task. * cons_tres - fix calculation of cores when using gres and cpus-per-task. * cons_tres - fix job not getting access to socket without GPU or with less than --gpus-per-socket when not enough cpus available on required socket and not using --gres-flags=enforce binding. * Fix HDF5 type version build error. * Fix creation of CoreCnt only reservations when the first node isn't available. * Fix wrong DBD Agent queue size in sdiag when using accounting_storage/none. * Improve job constraints XOR option logic. * Fix preemption of hetjobs when needed nodes not in leader component. * Fix wrong bit_or() messing potential preemptor jobs node bitmap, causing bad node deallocations and even allocation of nodes from other partitions. * Fix double-deallocation of preempted non-leader hetjob components. * slurmdbd - prevent truncation of the step nodelists over 4095. * Fix nodes remaining in drain state after rebooting with ASAP option. - changes from 20.02.4: OBS-URL: https://build.opensuse.org/request/show/845108 OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=156	2020-11-02 13:42:03 +00:00
Egbert Eich	e3512185d8	- Disable build on s390 (requires 64bit). OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=154	2020-07-07 20:14:00 +00:00
Egbert Eich	361d99b111	- Add support for openPMIx also for Leap/SLE 15.0/1 (bsc#1173805). OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=153	2020-07-07 16:20:06 +00:00
Christian Goll	4b04d88697	Accepting request 819233 from home:eeich:branches:network:cluster - Add support for openPMIx also for Leap/SLE 15.0/1. - Do not run %check on SLE-12-SP2: Some incompatibility in tcl makes this fail. - Remove unneeded build dependency to postgresql-devel. OBS-URL: https://build.opensuse.org/request/show/819233 OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=152	2020-07-07 13:08:10 +00:00
Christian Goll	e8d4b0e920	Accepting request 811475 from home:eeich:branches:network:cluster - Bring QA to the package build: add %%check stage. - Remove cruft that isn't needed any longer. - Add 'ghosted' run-file. - Add rpmlint filter to handle issues with library packages for Leap and enterprise upgrade versions. - Treat libnss_slurm like any other package: add version string to upgrade package. OBS-URL: https://build.opensuse.org/request/show/811475 OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=150	2020-06-17 11:15:39 +00:00
Egbert Eich	85a31ae1b5	- Updated to 20.02.3 which fixes CVE-2020-12693 (bsc#1172004). OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=148	2020-05-25 05:01:16 +00:00
Christian Goll	6f1a2e50da	Accepting request 808130 from home:mslacken:branches:network:cluster - Updated to 20.02.3 which fixes CVE-2020-12693 - Other changes are: * Factor in ntasks-per-core=1 with cons_tres. * Fix formatting in error message in cons_tres. * Fix calling stat on a NULL variable. * Fix minor memory leak when using reservations with flags=first_cores. * Fix gpu bind issue when CPUs=Cores and ThreadsPerCore > 1 on a node. * Fix --mem-per-gpu for heterogeneous --gres requests. * Fix slurmctld load order in load_all_part_state(). * Fix race condition not finding jobacct gather task cgroup entry. * Suppress error message when selecting nodes on disjoint topologies. * Improve performance of _pack_default_job_details() with large number of job arguments. * Fix archive loading previous to 17.11 jobs per-node req_mem. * Fix regression validating that --gpus-per-socket requires --sockets-per-node for steps. Should only validate allocation requests. * error() instead of fatal() when parsing an invalid hostlist. * nss_slurm - fix potential deadlock in slurmstepd on overloaded systems. * cons_tres - fix --gres-flags=enforce-binding and related --cpus-per-gres. * cons_tres - Allocate lowest numbered cores when filtering cores with gres. * Fix getting system counts for named GRES/TRES. * MySQL - Fix for handling typed GRES for association rollups. * Fix step allocations when tasks_per_core > 1. * Fix allocating more GRES than requested when asking for multiple GRES types. OBS-URL: https://build.opensuse.org/request/show/808130 OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=147	2020-05-22 09:31:56 +00:00

153 Commits