- Updated to 20.11.7 which fixes CVE-2021-31215 (bsc#1186024)
- New features from 20.11.7:
* slurmd - handle configless failures gracefully instead of hanging
indefinitely.
* select/cons_tres - fix Dragonfly topology not selecting nodes in the same
leaf switch when it should as well as requests with --switches option.
* Fix issue where certain step requests wouldn't run if the first node in the
job allocation was full and there were idle resources on other nodes in
the job allocation.
* Fix deadlock issue with <Prolog|Epilog>Slurmctld.
* torque/qstat - fix printf error message in output.
* When adding associations or wckeys, avoid checking a user or cluster name
multiple times.
* Fix wrong jobacctgather information on a step on multiple nodes
due to timeouts when sending the information gathered on its node.
* Fix missing xstrdup which could result in slurmctld segfault on array jobs.
* Fix security issue in PrologSlurmctld and EpilogSlurmctld by always
prepending SPANK_ to all user-set environment variables. CVE-2021-31215.
- New features from 20.11.6:
* Fix sacct assert with the --qos option.
* Use pkg-config --atleast-version instead of --modversion for systemd.
* common/fd - fix getsockopt() call in fd_get_socket_error().
* Properly handle the return from fd_get_socket_error() in _conn_readable().
* cons_res - Fix issue where running jobs were not taken into consideration
when creating a reservation.
* Avoid a deadlock between job_list for_each and assoc QOS_LOCK.
* Fix TRESRunMins usage for partition qos on restart/reconfig.
* Fix printing of number of tasks on a completed job that didn't request
tasks.
* Fix updating GrpTRESRunMins when decrementing job time is bigger than it.
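The CVE-2021-31215 fix listed for 20.11.7 can be pictured with a minimal sketch. This is not slurmctld's code (which is written in C). The function name `prefix_user_env` and the sample variables are hypothetical. The sketch only illustrates why an unconditional SPANK_ prefix stops a user-set name such as LD_PRELOAD from reaching PrologSlurmctld or EpilogSlurmctld under its original name:

```python
def prefix_user_env(user_env, prefix="SPANK_"):
    """Return a copy of user_env with prefix prepended to every name."""
    return {prefix + name: value for name, value in user_env.items()}


# Hypothetical user-set variables, for illustration only.
user_env = {"LD_PRELOAD": "/tmp/payload.so", "MY_OPTION": "1"}

for name, value in prefix_user_env(user_env).items():
    print(f"{name}={value}")
```

The prefix is applied to every name, including names that already start with SPANK_, which mirrors the "always prepending" wording of the entry.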
OBS-URL: https://build.opensuse.org/request/show/893087
OBS-URL: https://build.opensuse.org/package/show/openSUSE:Factory/slurm?expand=0&rev=59
- New features from 20.11.7:
* slurmd - handle configless failures gracefully instead of hanging
indefinitely.
* select/cons_tres - fix Dragonfly topology not selecting nodes in the same
leaf switch when it should as well as requests with --switches option.
* Fix issue where certain step requests wouldn't run if the first node in the
job allocation was full and there were idle resources on other nodes in
the job allocation.
* Fix deadlock issue with <Prolog|Epilog>Slurmctld.
* torque/qstat - fix printf error message in output.
* When adding associations or wckeys, avoid checking a user or cluster name
multiple times.
* Fix wrong jobacctgather information on a step on multiple nodes
due to timeouts when sending the information gathered on its node.
* Fix missing xstrdup which could result in slurmctld segfault on array jobs.
* Fix security issue in PrologSlurmctld and EpilogSlurmctld by always
prepending SPANK_ to all user-set environment variables. CVE-2021-31215.
- New features from 20.11.6:
* Fix sacct assert with the --qos option.
* Use pkg-config --atleast-version instead of --modversion for systemd.
* common/fd - fix getsockopt() call in fd_get_socket_error().
* Properly handle the return from fd_get_socket_error() in _conn_readable().
* cons_res - Fix issue where running jobs were not taken into consideration
when creating a reservation.
* Avoid a deadlock between job_list for_each and assoc QOS_LOCK.
* Fix TRESRunMins usage for partition qos on restart/reconfig.
* Fix printing of number of tasks on a completed job that didn't request
tasks.
* Fix updating GrpTRESRunMins when decrementing job time is bigger than it.
OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=179
- Update to 20.11.04
* Fix node selection for advanced reservations with features.
* mpi/pmix: Handle pipe failure better when using ucx.
* mpi/pmix: include PMIX_NODEID for each process entry.
* Fix job getting rejected after being requeued on same node that died.
* job_submit/lua - add "network" field.
* Fix situations when a reoccurring reservation could erroneously skip a
period.
* Ensure that a reservation's [pro|epi]log are run on reoccurring reservations.
* Fix threads-per-core memory allocation issue when using CR_CPU_MEMORY.
* Fix scheduling issue with --gpus.
* Fix gpu allocations that request --cpus-per-task.
* mpi/pmix: fixed print messages for all PMIXP_* macros
* Add mapping for XCPU to --signal option.
* Fix regression in 20.11 that prevented a full pass of the main scheduler
from ever executing.
* Work around a glibc bug in which "0" is incorrectly printed as "nan"
which will result in corrupted association state on restart.
* Fix regression in 20.11 which made slurmd incorrectly attempt to find the
parent slurmd address when not applicable and send incorrect reverse-tree
info to the slurmstepd.
* Fix cgroup ns detection when using containers (e.g. LXC or Docker).
* scrontab - change temporary file handling to work with emacs.
- Removed check-for-lipmix.so.MAJOR.patch
- Added: load-pmix-major-version.patch
OBS-URL: https://build.opensuse.org/request/show/874647
OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=173
- Update to 20.11.03
- This release includes a major functional change to how job step launch is
handled compared to the previous 20.11 releases. This affects srun as
well as MPI stacks - such as Open MPI - which may use srun internally as
part of the process launch.
One of the changes made in the Slurm 20.11 release was to the semantics
for job steps launched through the 'srun' command. This also
inadvertently impacts many MPI releases that use srun underneath their
own mpiexec/mpirun command.
For 20.11.{0,1,2} releases, the default behavior for srun was changed
such that each step was allocated exactly what was requested by the
options given to srun, and did not have access to all resources assigned
to the job on the node by default. This change was equivalent to Slurm
setting the --exclusive option by default on all job steps. Job steps
desiring all resources on the node needed to explicitly request them
through the new '--whole' option.
In the 20.11.3 release, we have reverted to the 20.02 and older behavior
of assigning all resources on a node to the job step by default.
This reversion is a major behavioral change which we would not generally
do on a maintenance release, but is being done in the interest of
restoring compatibility with the large number of existing Open MPI (and
other MPI flavors) and job scripts that exist in production, and to
remove what has proven to be a significant hurdle in moving to the new
release.
Please note that one change to step launch remains - by default, in
20.11 steps are no longer permitted to overlap on the resources they
have been assigned. If that behavior is desired, all steps must
explicitly opt-in through the newly added '--overlap' option.
Further details and a full explanation of the issue can be found at:
https://bugs.schedmd.com/show_bug.cgi?id=10383#c63
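For job scripts that drive srun programmatically, the step options discussed in this entry can be sketched as follows. The helper `srun_argv` and the program names are hypothetical. Only the srun options --ntasks, --overlap and --whole are taken from Slurm itself. On 20.11.3 and later, --whole is no longer needed to get all resources on the node, while --overlap must still be passed by every step that is meant to share resources:

```python
import shlex


def srun_argv(command, ntasks=None, overlap=False, whole=False):
    """Build an srun argument vector for one job step (illustrative helper)."""
    argv = ["srun"]
    if ntasks is not None:
        argv.append(f"--ntasks={ntasks}")
    if overlap:
        # Steps may only share resources if each of them opts in.
        argv.append("--overlap")
    if whole:
        # Only needed on 20.11.0 through 20.11.2 to get the whole node.
        argv.append("--whole")
    argv.extend(command)
    return argv


# Two steps that are meant to run side by side on the same resources.
monitor = srun_argv(["./monitor.sh"], ntasks=1, overlap=True)
solver = srun_argv(["./solver"], ntasks=4, overlap=True)

print(shlex.join(monitor))
print(shlex.join(solver))
```

The helper only builds the command lines; passing the resulting lists to `subprocess.run` inside a job allocation would launch the steps.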
OBS-URL: https://build.opensuse.org/request/show/864993
OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=171
- Add support for configuration files from external plugins.
While built-in plugins have their configuration added in slurm.conf,
external SPANK plugins add their configuration to plugstack.conf.
To make SPANK plugins easy to package, their configuration files
should be added independently under /etc/slurm/plugstack.conf.d, and
plugstack.conf should be left as a one-liner that includes all the
files under /etc/slurm/plugstack.conf.d.
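As a sketch of the resulting layout, assuming the drop-in directory is /etc/slurm/plugstack.conf.d and that drop-in files use a .conf suffix (both are assumptions, and the plugin path and file name below are hypothetical), plugstack.conf only needs an include line, and each packaged plugin ships its own file:

```
# /etc/slurm/plugstack.conf
include /etc/slurm/plugstack.conf.d/*.conf

# /etc/slurm/plugstack.conf.d/example.conf (hypothetical drop-in)
optional /usr/lib64/slurm/example_spank.so
```

With this split, installing or removing a plugin package never has to edit plugstack.conf itself.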
OBS-URL: https://build.opensuse.org/request/show/860690
OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=164
- Update to 20.11.02
* Fix older versions of sacct not working with 20.11.
* Fix slurmctld crash when using a pre-20.11 srun in a job allocation.
* Correct logic problem in _validate_user_access.
* Fix libpmi to initialize Slurm configuration correctly.
- Update to 20.11.01
* Fix spelling of "overcomited" to "overcomitted" in sreport's cluster
utilization report.
* Silence debug message about shutting down backup controllers if none are
configured.
* Don't create interactive srun until PrologSlurmctld is done.
* Fix fd symlink path resolution.
* Fix slurmctld segfault on subnode reservation restore after node
configuration change.
* Fix resource allocation response message environment allocation size.
* Ensure that details->env_sup is NULL terminated.
* select/cray_aries - Correctly remove jobs/steps from blades using NPC.
* cons_tres - Avoid max_node_gres when entire node is allocated with
--ntasks-per-gpu.
* Allow NULL arg to data_get_type().
* In sreport have usage for a reservation contain all jobs that ran in the
reservation instead of just the ones that ran in the time specified. This
matches the report for the reservation, which is not truncated for a time period.
* Fix issue with sending wrong batch step id to a < 20.11 slurmd.
* Add a job's alloc_node to lua for job modification and completion.
* Fix regression getting a slurmdbd connection through the perl API.
* Stop the extern step terminate monitor right after proctrack_g_wait().
* Fix removing the normalized priority of assocs.
* slurmrestd/v0.0.36 - Use correct name for partition field:
"min nodes per job" -> "min_nodes_per_job".
OBS-URL: https://build.opensuse.org/request/show/859114
OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=162
- Update to version 20.11.0
Slurm 20.11 includes a number of new features including:
* Overhaul of the job step management and launch code, alongside improved
GPU task placement support.
* A new "Interactive Step" mode of operation for salloc.
* A new "scrontab" command that can be used to submit and manage
periodically repeating jobs.
* IPv6 support.
* Changes to the reservation logic, with new options allowing users
to delete reservations, allowing admins to skip the next occurrence of a
repeated reservation, and allowing for a job to be submitted and eligible
to run within multiple reservations.
* Dynamic Future Nodes - automatically associate a dynamically
provisioned (or "cloud") node against a NodeName definition with matching
hardware.
* An experimental new RPC queuing mode for slurmctld to reduce thread
contention on heavily loaded clusters.
* SlurmDBD integration with the Slurm REST API.
Also check
https://github.com/SchedMD/slurm/blob/slurm-20-11-0-1/RELEASE_NOTES (forwarded request 852039 from eeich)
OBS-URL: https://build.opensuse.org/request/show/853268
OBS-URL: https://build.opensuse.org/package/show/openSUSE:Factory/slurm?expand=0&rev=50
- Update to version 20.11.0
Slurm 20.11 includes a number of new features including:
* Overhaul of the job step management and launch code, alongside improved
GPU task placement support.
* A new "Interactive Step" mode of operation for salloc.
* A new "scrontab" command that can be used to submit and manage
periodically repeating jobs.
* IPv6 support.
* Changes to the reservation logic, with new options allowing users
to delete reservations, allowing admins to skip the next occurrence of a
repeated reservation, and allowing for a job to be submitted and eligible
to run within multiple reservations.
* Dynamic Future Nodes - automatically associate a dynamically
provisioned (or "cloud") node against a NodeName definition with matching
hardware.
* An experimental new RPC queuing mode for slurmctld to reduce thread
contention on heavily loaded clusters.
* SlurmDBD integration with the Slurm REST API.
Also check
https://github.com/SchedMD/slurm/blob/slurm-20-11-0-1/RELEASE_NOTES
OBS-URL: https://build.opensuse.org/request/show/852039
OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=160
- Updated to 20.02.5, changes:
* Fix leak of TRESRunMins when job time is changed with --time-min
* pam_slurm - explicitly initialize slurm config to support configless mode.
* scontrol - Fix exit code when creating/updating reservations with wrong
Flags.
* When a GRES has a no_consume flag, report 0 for allocated.
* Fix cgroup cleanup by jobacct_gather/cgroup.
* When creating reservations/jobs don't allow counts on a feature unless
using an XOR.
* Improve number of boards discovery
* Fix updating a reservation NodeCnt on a zero-count reservation.
* slurmrestd - provide an explicit error message when PSK auth fails.
* cons_tres - fix job requesting single gres per-node getting two or more
nodes with fewer CPUs than requested per-task.
* cons_tres - fix calculation of cores when using gres and cpus-per-task.
* cons_tres - fix job not getting access to socket without GPU or with fewer
than --gpus-per-socket when not enough cpus available on required socket
and not using --gres-flags=enforce-binding.
* Fix HDF5 type version build error.
* Fix creation of CoreCnt only reservations when the first node isn't
available.
* Fix wrong DBD Agent queue size in sdiag when using accounting_storage/none.
* Improve job constraints XOR option logic.
* Fix preemption of hetjobs when needed nodes not in leader component.
* Fix wrong bit_or() messing up potential preemptor jobs' node bitmap, causing
bad node deallocations and even allocation of nodes from other partitions.
* Fix double-deallocation of preempted non-leader hetjob components.
* slurmdbd - prevent truncation of the step nodelists over 4095.
* Fix nodes remaining in drain state after rebooting with ASAP option.
- changes from 20.02.4:
OBS-URL: https://build.opensuse.org/request/show/845108
OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=156
- Updated to 20.02.3 which fixes CVE-2020-12693 (bsc#1172004).
- Other changes are:
* Factor in ntasks-per-core=1 with cons_tres.
* Fix formatting in error message in cons_tres.
* Fix calling stat on a NULL variable.
* Fix minor memory leak when using reservations with flags=first_cores.
* Fix gpu bind issue when CPUs=Cores and ThreadsPerCore > 1 on a node.
* Fix --mem-per-gpu for heterogeneous --gres requests.
* Fix slurmctld load order in load_all_part_state().
* Fix race condition not finding jobacct gather task cgroup entry.
* Suppress error message when selecting nodes on disjoint topologies.
* Improve performance of _pack_default_job_details() with large number of job
arguments.
* Fix archive loading previous to 17.11 jobs per-node req_mem.
* Fix regression validating that --gpus-per-socket requires --sockets-per-node
for steps. Should only validate allocation requests.
* error() instead of fatal() when parsing an invalid hostlist.
* nss_slurm - fix potential deadlock in slurmstepd on overloaded systems.
* cons_tres - fix --gres-flags=enforce-binding and related --cpus-per-gres.
* cons_tres - Allocate lowest numbered cores when filtering cores with gres.
* Fix getting system counts for named GRES/TRES.
* MySQL - Fix for handling typed GRES for association rollups.
* Fix step allocations when tasks_per_core > 1.
* Fix allocating more GRES than requested when asking for multiple GRES types.
OBS-URL: https://build.opensuse.org/request/show/808569
OBS-URL: https://build.opensuse.org/package/show/openSUSE:Factory/slurm?expand=0&rev=45
- Updated to 20.02.3 which fixes CVE-2020-12693
- Other changes are:
* Factor in ntasks-per-core=1 with cons_tres.
* Fix formatting in error message in cons_tres.
* Fix calling stat on a NULL variable.
* Fix minor memory leak when using reservations with flags=first_cores.
* Fix gpu bind issue when CPUs=Cores and ThreadsPerCore > 1 on a node.
* Fix --mem-per-gpu for heterogeneous --gres requests.
* Fix slurmctld load order in load_all_part_state().
* Fix race condition not finding jobacct gather task cgroup entry.
* Suppress error message when selecting nodes on disjoint topologies.
* Improve performance of _pack_default_job_details() with large number of job
arguments.
* Fix archive loading previous to 17.11 jobs per-node req_mem.
* Fix regression validating that --gpus-per-socket requires --sockets-per-node
for steps. Should only validate allocation requests.
* error() instead of fatal() when parsing an invalid hostlist.
* nss_slurm - fix potential deadlock in slurmstepd on overloaded systems.
* cons_tres - fix --gres-flags=enforce-binding and related --cpus-per-gres.
* cons_tres - Allocate lowest numbered cores when filtering cores with gres.
* Fix getting system counts for named GRES/TRES.
* MySQL - Fix for handling typed GRES for association rollups.
* Fix step allocations when tasks_per_core > 1.
* Fix allocating more GRES than requested when asking for multiple GRES types.
OBS-URL: https://build.opensuse.org/request/show/808130
OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=147
- Updated to 20.02.1 with the following changes:
* Improve job state reason for jobs hitting partition_job_depth.
* Speed up testing of singleton dependencies.
* Fix negative loop bound in cons_tres.
* srun - capture the MPI plugin return code from mpi_hook_client_fini() and
use as final return code for step failure.
* Fix segfault in cli_filter/lua.
* Fix --gpu-bind=map_gpu reusability if tasks > elements.
* Make sure config_flags on a gres are sent to the slurmctld on node
registration.
* Prolog/Epilog - Fix missing GPU information.
* Fix segfault when using config parser for expanded lines.
* Fix bit overlap test function.
* Don't accrue time if job begin time is in the future.
* Remove accrue time when updating a job start/eligible time to the future.
* Fix regression in 20.02.0 that broke --depend=expand.
* Reset begin time on job release if it's not in the future.
* Fix for recovering burst buffers when using high-availability.
* Fix invalid read due to freeing an incorrectly allocated env array.
* Update slurmctld -i message to warn about losing data.
* Fix scontrol cancel_reboot so it clears the DRAIN flag and node reason for a
pending ASAP reboot.
OBS-URL: https://build.opensuse.org/request/show/788905
OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=145
- Update to version 20.02.0 (jsc#SLE-8491)
* Fix minor memory leak in slurmd on reconfig.
* Fix invalid ptr reference when rolling up data in the database.
* Change shtml2html.py to require python3 for RHEL8 support, and match
man2html.py.
* slurm.spec - override "hardening" linker flags to ensure RHEL8 builds
in a usable manner.
* Fix type mismatches in the perl API.
* Prevent use of uninitialized slurmctld_diag_stats.
* Fixed various Coverity issues.
* Only show warning about root-less topology in daemons.
* Fix accounting of jobs in IGNORE_JOBS reservations.
* Fix issue with batch steps state not loading correctly when upgrading from
19.05.
* Deprecate max_depend_depth in SchedulerParameters and move it to
DependencyParameters.
* Silence erroneous error on slurmctld upgrade when loading federation state.
* Break infinite loop in cons_tres dealing with incorrect tasks per tres
request resulting in slurmctld hang.
* Improve handling of --gpus-per-task to make sure appropriate number of GPUs
is assigned to job.
* Fix seg fault on cons_res when requesting --spread-job.
- Move to python3 for everything but SLE-11-SP4
* For SLE-11-SP4 add a workaround to handle a python3 script (python2.7
compliant).
OBS-URL: https://build.opensuse.org/request/show/779379
OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=136
- Update to version 20.02.0-rc1
* sbatch - fix segfault when no newline at the end of a burst buffer file.
* Change scancel to only check job's base state when matching -t options.
* Save job dependency list in state files.
* cons_tres - allow jobs to be run on systems with root-less topologies.
* Restore pre-20.02pre1 PrologSlurmctld synchronization behavior to avoid
various race conditions, and ensure proper batch job launch.
* Add new slurmrestd command/daemon which implements the Slurm REST API.
- Update to version 20.02.0-0pre1, highlights are
OBS-URL: https://build.opensuse.org/request/show/774250
OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=134