- Test Suite fixes:
* Update README_Testsuite.md.
* Clean up left over files when de-installing test suite.
* Adjustment to test suite package: for SLE mark the openmpi4
devel package and slurm-hdf5 optional.
* Add -ffat-lto-objects to the build flags when LTO is set to
make sure the object files we ship with the test suite still
work correctly.
* Improve setup-testsuite.sh: copy ssh fingerprints from all nodes.
- Set environment variable SUSE_ZNOW to 0 in %build to avoid module load
failures due to unresolved symbols, as modules take advantage of lazy
binding (bsc#1200030).
OBS-URL: https://build.opensuse.org/request/show/1031255
OBS-URL: https://build.opensuse.org/package/show/openSUSE:Factory/slurm?expand=0&rev=80
* Update README_Testsuite.md.
* Clean up left over files when de-installing test suite.
* Adjustment to test suite package: for SLE mark the openmpi4
devel package and slurm-hdf5 optional.
* Add -ffat-lto-objects to the build flags when LTO is set to
make sure the object files we ship with the test suite still
work correctly.
OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=224
- updated to 22.05.5
- NOTE: Slurm validates that libraries are of the same version. Unfortunately,
due to an oversight, we failed to notice that the slurmstepd loads the
hash_k12 library only after a job has completed. This means that if the
hash_k12 library is upgraded before a job finishes, the slurmstepd will load
the new library when the job finishes, and will fail due to a mismatch of
versions. This results in nodes with slurmstepd processes stuck
indefinitely. These processes require manual intervention to clean up. There
is no clean way to resolve these hung slurmstepd processes.
The only recommended way to upgrade between minor versions of 22.05 with
RPMs or upgrades that replace current binaries and libraries is to drain the
nodes of running jobs first.
- Fixes a number of moderate severity issues, notable are:
* Load hash plugin at slurmstepd launch time to prevent issues loading the
plugin at step completion if the Slurm installation is upgraded.
* Update nvml plugin to match the unique id format for MIG devices in new
Nvidia drivers.
* Fix multi-node step launch failure when nodes in the controller aren't in
natural order. This can happen with inconsistent node naming (such as
node15 and node052) or with dynamic nodes which can register in any order.
* job_container/tmpfs - clean up containers even when the .ns file isn't
mounted anymore.
* Wait up to PrologEpilogTimeout before shutting down slurmd to allow prolog
and epilog scripts to complete or time out. Previously, slurmd waited 120
seconds before timing out and killing prolog and epilog scripts. (forwarded request 1010642 from mslacken)
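The node-ordering fix above can be illustrated with a small sketch. It is not Slurm's code; it only shows, under one common reading of "natural order", why names such as node15 and node052 sort differently depending on the comparison used. The helper name natural_key and the sample node list are hypothetical.

```python
import re


def natural_key(name):
    """Split a node name into text and integer chunks (hypothetical helper).

    re.split with a capturing group always yields alternating text and
    digit chunks, so keys of different names stay comparable.
    """
    return [int(part) if part.isdigit() else part
            for part in re.split(r"(\d+)", name)]


# Sample names only; node15 and node052 are the ones named in the changelog.
nodes = ["node15", "node052", "node7"]

# Plain string comparison looks at characters, so "node052" sorts first.
lexical = sorted(nodes)

# Comparing the numeric suffix as an integer gives 7 < 15 < 52.
natural = sorted(nodes, key=natural_key)

print(lexical)
print(natural)
```

With inconsistent zero padding the two orderings disagree, which is the kind of mismatch the entry describes.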
OBS-URL: https://build.opensuse.org/request/show/1030432
OBS-URL: https://build.opensuse.org/package/show/openSUSE:Factory/slurm?expand=0&rev=79
- updated to 22.05.5
- NOTE: Slurm validates that libraries are of the same version. Unfortunately,
due to an oversight, we failed to notice that the slurmstepd loads the
hash_k12 library only after a job has completed. This means that if the
hash_k12 library is upgraded before a job finishes, the slurmstepd will load
the new library when the job finishes, and will fail due to a mismatch of
versions. This results in nodes with slurmstepd processes stuck
indefinitely. These processes require manual intervention to clean up. There
is no clean way to resolve these hung slurmstepd processes.
The only recommended way to upgrade between minor versions of 22.05 with
RPMs or upgrades that replace current binaries and libraries is to drain the
nodes of running jobs first.
- Fixes a number of moderate severity issues, notable are:
* Load hash plugin at slurmstepd launch time to prevent issues loading the
plugin at step completion if the Slurm installation is upgraded.
* Update nvml plugin to match the unique id format for MIG devices in new
Nvidia drivers.
* Fix multi-node step launch failure when nodes in the controller aren't in
natural order. This can happen with inconsistent node naming (such as
node15 and node052) or with dynamic nodes which can register in any order.
* job_container/tmpfs - clean up containers even when the .ns file isn't
mounted anymore.
* Wait up to PrologEpilogTimeout before shutting down slurmd to allow prolog
and epilog scripts to complete or time out. Previously, slurmd waited 120
seconds before timing out and killing prolog and epilog scripts.
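The NOTE in this entry recommends draining nodes of running jobs before upgrading between 22.05 minor versions. A minimal sketch of how an admin script might prepare that step is shown below. It only builds the command lines and does not run them; the function names, node names and reason text are hypothetical, and the script assumes scontrol and squeue are available on the admin host.

```python
import shlex


def drain_command(nodes, reason="upgrade"):
    """Build an scontrol call that sets the given nodes to DRAIN (hypothetical helper)."""
    return ["scontrol", "update",
            "NodeName=" + ",".join(nodes),
            "State=DRAIN",
            "Reason=" + reason]


def running_jobs_command(nodes):
    """Build an squeue call listing jobs still present on the given nodes."""
    return ["squeue", "--noheader", "--nodelist=" + ",".join(nodes)]


# Hypothetical node names, for illustration only.
nodes = ["node1", "node2"]

drain = drain_command(nodes, reason="upgrade to 22.05.5")
check = running_jobs_command(nodes)

# Print shell-quoted command lines for review before running them by hand.
print(shlex.join(drain))
print(shlex.join(check))
```

The packages would only be upgraded once the squeue call returns no jobs for the drained nodes.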
OBS-URL: https://build.opensuse.org/request/show/1010642
OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=220
- Package the Slurm testsuite for QA purposes.
* Fixes for test suite:
Keep-logs-of-skipped-test-when-running-test-cases-sequentially.patch
Fix-test-21.41.patch
Fix-test-38.11.patch
Fix-test-32.8.patch
Fix-test-3.13.patch
Fix-test7.2-to-find-libpmix-under-lib64-as-well.patch
* Add documentation:
README_Testsuite.md
- Allow log in as user 'slurm'. This allows admins to run certain
privileged commands more easily without becoming root. (forwarded request 988732 from eeich)
OBS-URL: https://build.opensuse.org/request/show/988733
OBS-URL: https://build.opensuse.org/package/show/openSUSE:Factory/slurm?expand=0&rev=74
- Package the Slurm testsuite for QA purposes.
* Fixes for test suite:
Keep-logs-of-skipped-test-when-running-test-cases-sequentially.patch
Fix-test-21.41.patch
Fix-test-38.11.patch
Fix-test-32.8.patch
Fix-test-3.13.patch
Fix-test7.2-to-find-libpmix-under-lib64-as-well.patch
* Add documentation:
README_Testsuite.md
- Allow log in as user 'slurm'. This allows admins to run certain
privileged commands more easily without becoming root.
OBS-URL: https://build.opensuse.org/request/show/988732
OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=207
NOTE: This package is not meant to be used for testing by the
user but rather for testing by the maintainers to ensure the
package is working properly.
DO NOT report test suite failures unless you are able to confirm
that the failure is really a bug.
OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=205
- update to 22.05.0 with following changes:
- Support for dynamic node addition and removal
- Support for native Linux cgroup v2 operation
- Newly added plugins to support HPE Slingshot 11 networks
(switch/hpe_slingshot), and Intel Xe GPUs (gpu/oneapi)
- Added new acct_gather_interconnect/sysfs plugin to collect statistics
from arbitrary network interfaces.
- Expanded and synced set of environment variables available in the
Prolog/Epilog/PrologSlurmctld/EpilogSlurmctld scripts.
- New "--prefer" option to job submissions to allow for a "soft
constraint" request to influence node selection.
- Optional support for license planning in the backfill scheduler with
"bf_licenses" option in SchedulerParameters.
- removed file slurm-2.4.4-init.patch as sysvinit is now really deprecated
- removed file load-pmix-major-version.patch as fixed upstream
OBS-URL: https://build.opensuse.org/request/show/980093
OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=203
- Update to 21.08.7 with following changes:
* openapi/v0.0.37 - correct calculation for bf_queue_len_mean in /diag.
* Avoid shrinking a reservation when overlapping with downed nodes.
* Only check TRES limits against current usage for TRES requested by the job.
* Do not allocate shared gres (MPS) in whole-node allocations
* Constrain slurmstepd to job/step cgroup like in previous versions of Slurm.
* Fix warnings on 32-bit compilers related to printf() formats.
* Fix reconfigure issues after disabling/reenabling the GANG PreemptMode.
* Fix race condition where a cgroup was being deleted while another step
was creating it.
* Set the slurmd port correctly if multi-slurmd
* Fix FAIL mail not being sent if a job was cancelled due to preemption.
* slurmrestd - move debug logs for HTTP handling to be gated by debugflag
NETWORK to avoid unnecessary logging of communication contents.
* Fix issue with bad memory access when shrinking running steps.
* Fix various issues with internal job accounting with GRES when jobs are
shrunk.
* Fix ipmi polling on slurmd reconfig or restart.
* Fix srun crash when reserved ports are being used and het step fails
to launch.
* openapi/dbv0.0.37 - fix DELETE execution path on /user/{user_name}.
* slurmctld - Properly requeue all components of a het job if PrologSlurmctld
fails.
* rlimits - remove the final calls that limit nofiles to 4096 and instead use
the max possible nofiles in slurmd and slurmdbd.
* Allow the DBD agent to load large messages (up to MAX_BUF_SIZE) from state.
* Fix potential deadlock during slurmctld restart when there is a completing
job.
* slurmstepd - reduce user requested soft rlimits when they are above max
hard rlimits to avoid rlimit request being completely ignored and
OBS-URL: https://build.opensuse.org/request/show/974433
OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=196
- update to 21.08.5 with following changes:
* Fix issue where typeless GRES node updates were not immediately reflected.
* Fix setting the default scrontab job working directory so that it's the home
of the different user (-u <user>) and not that of root or SlurmUser editor.
* Fix stepd not respecting SlurmdSyslogDebug.
* Fix concurrency issue with squeue.
* Fix job start time not being reset after launch when job is packed onto
already booting node.
* Fix updating SLURM_NODE_ALIASES for jobs packed onto powering up nodes.
* Cray - Fix issues with starting hetjobs.
* auth/jwks - Print fatal() message when jwks is configured but file could
not be opened.
* If sacctmgr has an association with an unknown qos as the default qos
print 'UNKN*###' instead of leaving a blank name.
* Correctly determine task count when giving --cpus-per-gpu, --gpus and
--ntasks-per-node without task count.
* slurmctld - Fix places where the global last_job_update was not being set
to the time of update when a job's reason and description were updated.
* slurmctld - Fix case where a job submitted with more than one partition
would not have its reason updated while waiting to start.
* Fix memory leak in node feature rebooting.
* Fix time limit permanently set to 1 minute by backfill for job array tasks
higher than the first with QOS NoReserve flag and PreemptMode configured.
* Fix sacct -N to show jobs that started in the current second
* Fix issue on running steps where both SLURM_NTASKS_PER_TRES and
SLURM_NTASKS_PER_GPU are set.
* Handle oversubscription request correctly when also requesting
--ntasks-per-tres.
* Correctly detect when a step requests bad gres inside an allocation.
* slurmstepd - Correct possible deadlock when UnkillableStepTimeout triggers.
OBS-URL: https://build.opensuse.org/request/show/942081
OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=195
in 21.08 tree.
* CVE-2021-43337:
For sites using the new AccountingStoreFlags=job_script and/or job_env
options, an issue was reported with the access control rules in SlurmDBD
that will permit users to request job scripts and environment files that
they should not have access to. (Scripts/environments are meant to only be
accessible by user accounts with administrator privileges, by account
coordinators for jobs submitted under their account, and by the user
themselves.)
- changes from 21.08.3:
* This includes a number of fixes since the last release a month ago,
including one critical fix to prevent a communication issue between
slurmctld and slurmdbd for sites that have started using the new
AccountingStoreFlags=job_script functionality.
OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=193
- update to 21.08.2
- major change:
* Removed support for the TaskAffinity=yes option in cgroup.conf. Please
consider using "TaskPlugin=task/cgroup,task/affinity" in slurm.conf as an option.
- minor changes and bugfixes:
* slurmctld - fix how the max number of cores on a node in a partition are
calculated when the partition contains multi-socket nodes. This in turn
corrects certain jobs node count estimations displayed client-side.
* job_submit/cray_aries - fix "craynetwork" GRES specification after changes
introduced in 21.08.0rc1 that made TRES always have a type prefix.
* Ignore nonsensical check in the slurmd for [Pro|Epi]logSlurmctld.
* Fix writing to stderr/syslog when systemd runs slurmctld in the foreground.
* Fix issue with updating job started with node range.
* Fix issue with nodes not clearing state in the database when the slurmctld
is started with clean-start.
* Fix hetjob components > 1 timing out due to InactiveLimit.
* Fix sprio printing -nan for normalized association priority if
PriorityWeightAssoc was not defined.
* Disallow FirstJobId=0.
* Preserve job start info in the database for a requeued job that hadn't
registered the first time in the database yet.
* Only send one message on prolog failure from the slurmd.
* Remove support for TaskAffinity=yes in cgroup.conf.
* accounting_storage/mysql - fix issue where querying jobs via sacct
--whole-hetjob=yes or slurmrestd (which automatically includes this flag)
could in some cases return more records than expected.
* Fix issue for preemption of job array task that makes afterok dependency
fail. Additionally, send emails when requeueing happens due to preemption.
* Fix sending requeue mail type.
* Properly resize a job's GRES bitmaps and counts when resizing the job.
OBS-URL: https://build.opensuse.org/request/show/923425
OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=190
- updated to 21.08.1 with following bug fixes:
* Fix potential memory leak if a problem happens while allocating GRES for
a job.
* If an overallocation of GRES happens, terminate the creation of a job.
* AutoDetect=nvml: Fatal if no devices found in MIG mode.
* Print federation and cluster sacctmgr error messages to stderr.
* Fix off by one error in --gpu-bind=mask_gpu.
* Add --gpu-bind=none to disable gpu binding when using --gpus-per-task.
* Handle the burst buffer state "alloc-revoke" which previously would not
display in the job correctly.
* Fix issue in the slurmstepd SPANK prolog/epilog handler where configuration
values were used before being initialized.
* Restore a step's ability to utilize all of an allocation's memory if --mem=0.
* Fix --cpu-bind=verbose garbage taskid.
* Fix cgroup task affinity issues from garbage taskid info.
* Make gres_job_state_validate() client logging behavior as before 44466a4641.
* Fix steps with --hint overriding an allocation with --threads-per-core.
* Require requesting a GPU if --mem-per-gpu is requested.
* Return error early if a job is requesting --ntasks-per-gpu and no gpus or
task count.
* Properly clear out pending step if unavailable to run with available
resources.
* Kill all processes spawned by burst_buffer.lua including descendants.
* openapi/v0.0.{35,36,37} - Avoid setting default values of min_cpus,
job name, cwd, mail_type, and contiguous on job update.
* openapi/v0.0.{35,36,37} - Clear user hold on job update if hold=false.
* Prevent CRON_JOB flag from being cleared when loading job state.
* sacctmgr - Fix deleting WCKeys when not specifying a cluster.
* Fix getting memory for a step when the first node in the step isn't the
first node in the allocation.
OBS-URL: https://build.opensuse.org/request/show/919668
OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=186