forked from pool/slurm
Commit Graph

272 Commits

Author SHA256 Message Date
Ana Guerrero
3825e9fab0 Accepting request 1110422 from network:cluster
- Create a macro for upgrade dependency to ensure uniform handling. (forwarded request 1110421 from eeich)

OBS-URL: https://build.opensuse.org/request/show/1110422
OBS-URL: https://build.opensuse.org/package/show/openSUSE:Factory/slurm?expand=0&rev=94
2023-09-12 19:02:53 +00:00
a323feff42 Accepting request 1110421 from home:eeich:branches:network:cluster
- Create a macro for upgrade dependency to ensure uniform handling.

OBS-URL: https://build.opensuse.org/request/show/1110421
OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=262
2023-09-12 04:52:56 +00:00
Ana Guerrero
3bcde4bfd9 Accepting request 1110259 from network:cluster
- Updated to 23.02.4 with the following changes:
  * Bug Fixes:
    + Fix main scheduler loop not starting after a failover to backup
      controller. Avoid slurmctld segfault when specifying
     `AccountingStorageExternalHost` (bsc#1214983).
    + Fix sbatch return code when `--wait` is requested on a job array.
    + Fix collected `GPUUtilization` values for `acct_gather_profile` plugins.
    + Fix `slurmrestd` handling of job hold/release operations.
    + Fix step running indefinitely when slurmctld takes more than
      `MessageTimeout` to respond. Now, `slurmctld` will cancel the step when
       detected, preventing following steps from getting stuck waiting for
       resources to be released.
    + Fix regression to make `job_desc.min_cpus` accurate again in `job_submit`
      when requesting a job with `--ntasks-per-node`.
    + Fix handling of `ArrayTaskThrottle` in backfill.
    + Fix regression in 23.02.2 when checking gres state on `slurmctld`
      startup  or reconfigure. Gres changes in the configuration were not
      updated on slurmctld startup. On startup or reconfigure, these messages
      were present in the log: `"error: Attempt to change gres/gpu Count"`.
    + Fix potential double count of gres when dealing with limits.
    + Fix `slurmstepd` segfault when `ContainerPath` is not set in `oci.conf`
    + Fixed an issue where jobs requesting licenses were incorrectly rejected.
    + `scrontab` - Fix cutting off the final character of quoted variables.
    + `smail` - Fix issues where e-mails at job completion were not being sent.
    + `scontrol/slurmctld` - fix comma parsing when updating a reservation's
       nodes.
    + Fix `--gpu-bind=single` binding tasks to wrong gpus, leading to some gpus
      having more tasks than they should and other gpus being unused.
    + Fix regression in 23.02 that causes slurmstepd to crash when `srun`
      requests more than `TreeWidth` nodes in a step and uses the pmi2 or

OBS-URL: https://build.opensuse.org/request/show/1110259
OBS-URL: https://build.opensuse.org/package/show/openSUSE:Factory/slurm?expand=0&rev=93
2023-09-11 19:22:19 +00:00
f9646ba945 - Updated to 23.02.4 with the following changes:
  * Bug Fixes:
    + Fix main scheduler loop not starting after a failover to backup
      controller. Avoid slurmctld segfault when specifying
     `AccountingStorageExternalHost` (bsc#1214983).
    + Fix sbatch return code when `--wait` is requested on a job array.
    + Fix collected `GPUUtilization` values for `acct_gather_profile` plugins.
    + Fix `slurmrestd` handling of job hold/release operations.
    + Fix step running indefinitely when slurmctld takes more than
      `MessageTimeout` to respond. Now, `slurmctld` will cancel the step when
       detected, preventing following steps from getting stuck waiting for
       resources to be released.
    + Fix regression to make `job_desc.min_cpus` accurate again in `job_submit`
      when requesting a job with `--ntasks-per-node`.
    + Fix handling of `ArrayTaskThrottle` in backfill.
    + Fix regression in 23.02.2 when checking gres state on `slurmctld`
      startup  or reconfigure. Gres changes in the configuration were not
      updated on slurmctld startup. On startup or reconfigure, these messages
      were present in the log: `"error: Attempt to change gres/gpu Count"`.
    + Fix potential double count of gres when dealing with limits.
    + Fix `slurmstepd` segfault when `ContainerPath` is not set in `oci.conf`
    + Fixed an issue where jobs requesting licenses were incorrectly rejected.
    + `scrontab` - Fix cutting off the final character of quoted variables.
    + `smail` - Fix issues where e-mails at job completion were not being sent.
    + `scontrol/slurmctld` - fix comma parsing when updating a reservation's
       nodes.
    + Fix `--gpu-bind=single` binding tasks to wrong gpus, leading to some gpus
      having more tasks than they should and other gpus being unused.
    + Fix regression in 23.02 that causes slurmstepd to crash when `srun`
      requests more than `TreeWidth` nodes in a step and uses the pmi2 or

OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=260
2023-09-11 07:21:32 +00:00
Ana Guerrero
6b47182efe Accepting request 1109308 from network:cluster
- Fixes since 23.02.03:
  Highlights:
  * Fix main scheduler loop not starting after a failover to backup controller.
  * Avoid slurmctld segfault when specifying `AccountingStorageExternalHost`
    (bsc#1214983).
  Other:
  * Fix sbatch return code when `--wait` is requested on a job array.
  * Fix collected `GPUUtilization` values for `acct_gather_profile` plugins.
  * Fix `slurmrestd` handling of job hold/release operations.
  * Make spank `S_JOB_ARGV` item value hold the requested command `argv`
    instead of the `srun --bcast` value when `--bcast` requested (only in local
    context).
  * Fix step running indefinitely when slurmctld takes more than
    `MessageTimeout` to respond. Now, slurmctld will cancel the step when
    detected, preventing following steps from getting stuck waiting for
    resources to be released.
  * Fix regression to make `job_desc.min_cpus` accurate again in job_submit when
    requesting a job with `--ntasks-per-node`.
  * Fix handling of `ArrayTaskThrottle` in backfill.
  * Fix regression in 23.02.2 when checking gres state on `slurmctld` startup or
    reconfigure. Gres changes in the configuration were not updated on slurmctld
    startup. On startup or reconfigure, these messages were present in the log:
    `"error: Attempt to change gres/gpu Count"`.
  * Fix potential double count of gres when dealing with limits.
  * Fix slurmstepd segfault when ContainerPath is not set in `oci.conf`
  * Fixed an issue where jobs requesting licenses were incorrectly rejected.
  * `scrontab` - Fix cutting off the final character of quoted variables.
  * `smail` - Fix issues where e-mails at job completion were not being sent.
  * `scontrol/slurmctld` - fix comma parsing when updating a reservation's
    nodes.

OBS-URL: https://build.opensuse.org/request/show/1109308
OBS-URL: https://build.opensuse.org/package/show/openSUSE:Factory/slurm?expand=0&rev=92
2023-09-07 19:12:41 +00:00
c63b605916 - Fixes since 23.02.03:
  Highlights:
  * Fix main scheduler loop not starting after a failover to backup controller.
  * Avoid slurmctld segfault when specifying `AccountingStorageExternalHost`
    (bsc#1214983).
  Other:
  * Fix sbatch return code when `--wait` is requested on a job array.
  * Fix collected `GPUUtilization` values for `acct_gather_profile` plugins.
  * Fix `slurmrestd` handling of job hold/release operations.
  * Make spank `S_JOB_ARGV` item value hold the requested command `argv`
    instead of the `srun --bcast` value when `--bcast` requested (only in local
    context).
  * Fix step running indefinitely when slurmctld takes more than
    `MessageTimeout` to respond. Now, slurmctld will cancel the step when
    detected, preventing following steps from getting stuck waiting for
    resources to be released.
  * Fix regression to make `job_desc.min_cpus` accurate again in job_submit when
    requesting a job with `--ntasks-per-node`.
  * Fix handling of `ArrayTaskThrottle` in backfill.
  * Fix regression in 23.02.2 when checking gres state on `slurmctld` startup or
    reconfigure. Gres changes in the configuration were not updated on slurmctld
    startup. On startup or reconfigure, these messages were present in the log:
    `"error: Attempt to change gres/gpu Count"`.
  * Fix potential double count of gres when dealing with limits.
  * Fix slurmstepd segfault when ContainerPath is not set in `oci.conf`
  * Fixed an issue where jobs requesting licenses were incorrectly rejected.
  * `scrontab` - Fix cutting off the final character of quoted variables.
  * `smail` - Fix issues where e-mails at job completion were not being sent.
  * `scontrol/slurmctld` - fix comma parsing when updating a reservation's
    nodes.

OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=258
2023-09-06 17:11:37 +00:00
Ana Guerrero
51bec69223 Accepting request 1109029 from network:cluster
- updated to 23.02.04 which includes the following changes:
  * fixes for the main scheduler loop not starting on the backup controller
    after a failover event, a segfault when attempting to use
    AccountingStorageExternalHost, and an issue where steps could continue
    running indefinitely if the slurmctld takes too long to respond (bsc#1214983)
  * includes a fix for a potential slurmctld crash when the backup slurmctld
    takes over.
  * This also fixes some issues when using older versions of the command line
    tools with a 23.02 controller.
  * srun/sbatch/salloc - In order to support user namespaces, process user and
    group ids are no longer used unless explicitly requested as an argument and
    are left as nobody(99) by default. Any cli_filters or SPANK plugins need to
    ignore any uid or gid that equals SLURM_AUTH_NOBODY (99). User and group ids
    are now resolved by the active auth plugin. To determine the actual job uid
    or gid you should use the RESPONSE_RESOURCE_ALLOCATION RPC.
- removed Fix-test-3.13.patch as fixed upstream
- removed Fix-test-38.11.patch as test changed upstream (forwarded request 1109009 from mslacken)
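
A minimal Python sketch of the check described in the srun/sbatch/salloc entry above, assuming a filter that receives numeric ids. The helper names `resolved_id` and `usable_ids` are hypothetical and not part of Slurm; real cli_filter and SPANK plugins are written against Slurm's own plugin interfaces, so this only illustrates the comparison against the placeholder value.

```python
# SLURM_AUTH_NOBODY is the placeholder id (99) named in the changelog entry.
SLURM_AUTH_NOBODY = 99


def resolved_id(value):
    """Return value if it is a real uid/gid, or None for the placeholder."""
    return None if value == SLURM_AUTH_NOBODY else value


def usable_ids(uid, gid):
    """Hypothetical helper: drop placeholder ids before a filter acts on them."""
    return {"uid": resolved_id(uid), "gid": resolved_id(gid)}
```

A filter built this way treats `None` as "not yet resolved" instead of mistaking the placeholder for a real account.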

OBS-URL: https://build.opensuse.org/request/show/1109029
OBS-URL: https://build.opensuse.org/package/show/openSUSE:Factory/slurm?expand=0&rev=91
2023-09-06 16:57:11 +00:00
47d665607b Accepting request 1109009 from home:mslacken:branches:network:cluster
- updated to 23.02.04 which includes the following changes:
  * fixes for the main scheduler loop not starting on the backup controller
    after a failover event, a segfault when attempting to use
    AccountingStorageExternalHost, and an issue where steps could continue
    running indefinitely if the slurmctld takes too long to respond (bsc#1214983)
  * includes a fix for a potential slurmctld crash when the backup slurmctld
    takes over.
  * This also fixes some issues when using older versions of the command line
    tools with a 23.02 controller.
  * srun/sbatch/salloc - In order to support user namespaces, process user and
    group ids are no longer used unless explicitly requested as an argument and
    are left as nobody(99) by default. Any cli_filters or SPANK plugins need to
    ignore any uid or gid that equals SLURM_AUTH_NOBODY (99). User and group ids
    are now resolved by the active auth plugin. To determine the actual job uid
    or gid you should use the RESPONSE_RESOURCE_ALLOCATION RPC.
- removed Fix-test-3.13.patch as fixed upstream
- removed Fix-test-38.11.patch as test changed upstream

OBS-URL: https://build.opensuse.org/request/show/1109009
OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=256
2023-09-05 11:47:06 +00:00
Dominique Leuenberger
03d2eefa9e Accepting request 1085677 from network:cluster
OBS-URL: https://build.opensuse.org/request/show/1085677
OBS-URL: https://build.opensuse.org/package/show/openSUSE:Factory/slurm?expand=0&rev=90
2023-05-09 11:09:16 +00:00
532aa1e96d Accepting request 1085668 from home:mslacken:branches:network:cluster
- updated to 23.02.02 which includes a number of fixes to Slurm stability
  * Includes a fix for a regression in 23.02 that caused openmpi mpirun to fail
    to launch tasks.
  * It also includes two functional changes:
    + Don't update the cron job tasks if the whole crontab file is left
      untouched after opening it with scrontab -e.
    + Sort dynamic nodes and include them in topology after scontrol
      reconfigure or a slurmctld restart.

OBS-URL: https://build.opensuse.org/request/show/1085668
OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=254
2023-05-09 10:35:16 +00:00
Dominique Leuenberger
0d5e08df4b Accepting request 1083466 from network:cluster
- Web-configurator: changed presets to SUSE defaults.
- If %_restart_on_update is no longer defined, replace it with our own
  macro.
- Marked slurm-openlava, slurm-seff and slurm-sjstat noarch.
- rpmlint:
  * dropped some rpmlint filters which are no longer relevant.
  * added/refreshed filters. For details, see rpmlintrc.
- Remove workaround to fix the restart issue in a Slurm package
  described in bsc#1088693.
  The Slurm version in this package was 16.05. Any attempt to
  directly migrate to the current version is bound to fail
  anyway.
- Now require slurm-munge if munge authentication is installed.

OBS-URL: https://build.opensuse.org/request/show/1083466
OBS-URL: https://build.opensuse.org/package/show/openSUSE:Factory/slurm?expand=0&rev=89
2023-04-28 14:23:13 +00:00
33bf8791ac - Require slurm-munge if munge authentication is installed.
- Replace 'Require: config(pam)' by 'Require: pam'.

OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=252
2023-04-28 07:46:44 +00:00
392bec3223 Accepting request 1082770 from home:eeich:branches:network:cluster
- Web-configurator: changed presets to SUSE defaults.
- If %_restart_on_update is no longer defined, replace it with our own
  macro.
- Marked slurm-openlava, slurm-seff and slurm-sjstat noarch.
- rpmlint:
  * dropped some rpmlint filters which are no longer relevant.
  * added/refreshed filters. For details, see rpmlintrc.
- Remove workaround to fix the restart issue in a Slurm package
  described in bsc#1088693.
  The Slurm version in this package was 16.05. Any attempt to
  directly migrate to the current version is bound to fail
  anyway.

OBS-URL: https://build.opensuse.org/request/show/1082770
OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=251
2023-04-27 13:24:37 +00:00
Dominique Leuenberger
e27e58c1b6 Accepting request 1076522 from network:cluster
- updated to 23.02.1 with the following changes:
  * job_container/tmpfs - cleanup job container even if namespace mount is
    already unmounted.
  * openapi/dbv0.0.38 - Fix not displaying an error when updating QOS or
    associations fails.
  * Fix nodes remaining as PLANNED after slurmctld save state recovery.
  * Add cgroup.conf EnableControllers option for cgroup/v2.
  * Get correct cgroup root to allow slurmd to run in containers like Docker.
  * slurmctld - add missing PrivateData=jobs check to step ContainerID lookup
    requests originated from 'scontrol show step container-id=<id>' or certain
    scrun operations when container state can't be directly queried.
  * Fix nodes un-draining after being drained due to unkillable step.
  * Fix remote licenses allowed percentages reset to 0 during upgrade.
  * sacct - Avoid truncating time strings when using SLURM_TIME_FORMAT with
    the --parsable option.
  * Fix regression in 22.05.0rc1 that broke Nodes=ALL in a NodeSet.
  * openapi/v0.0.39 - fix jobs submitted via slurmrestd being allocated fewer
    CPUs than tasks when requesting multiple tasks.
  * Fix job not being scheduled on valid nodes and potentially being rejected
    when using parentheses at the beginning of square brackets in a feature
    request, for example: "feat1&[(feat2|feat3)]".
  * Fix regression in 23.02.0rc1 which made --gres-flags=enforce-binding no
    longer enforce optimal core-gpu job placement.
  * mpi/pmix - Fix v5 to load correctly when libpmix.so isn't in the normal
    lib path.
  * data_parser/v0.0.39 - fix regression where "memory_per_node" would be
    rejected for job submission.
  * data_parser/v0.0.39 - fix regression where "memory_per_cpu" would be
    rejected for job submission.
  * slurmctld - add an assert to check for magic number presence before deleting

OBS-URL: https://build.opensuse.org/request/show/1076522
OBS-URL: https://build.opensuse.org/package/show/openSUSE:Factory/slurm?expand=0&rev=88
2023-04-01 17:32:20 +00:00
5a68fc8e5f - updated to 23.02.1 with the following changes:
- removed right-pmix-path.patch as fixed upstream

OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=249
2023-03-31 15:48:27 +00:00
d2a2e0a1e8 Accepting request 1076461 from home:mslacken:branches:network:cluster
- updated to 23.02.1 with the following changes:
  * job_container/tmpfs - cleanup job container even if namespace mount is
    already unmounted.
  * openapi/dbv0.0.38 - Fix not displaying an error when updating QOS or
    associations fails.
  * Fix nodes remaining as PLANNED after slurmctld save state recovery.
  * Add cgroup.conf EnableControllers option for cgroup/v2.
  * Get correct cgroup root to allow slurmd to run in containers like Docker.
  * slurmctld - add missing PrivateData=jobs check to step ContainerID lookup
    requests originated from 'scontrol show step container-id=<id>' or certain
    scrun operations when container state can't be directly queried.
  * Fix nodes un-draining after being drained due to unkillable step.
  * Fix remote licenses allowed percentages reset to 0 during upgrade.
  * sacct - Avoid truncating time strings when using SLURM_TIME_FORMAT with
    the --parsable option.
  * Fix regression in 22.05.0rc1 that broke Nodes=ALL in a NodeSet.
  * openapi/v0.0.39 - fix jobs submitted via slurmrestd being allocated fewer
    CPUs than tasks when requesting multiple tasks.
  * Fix job not being scheduled on valid nodes and potentially being rejected
    when using parentheses at the beginning of square brackets in a feature
    request, for example: "feat1&[(feat2|feat3)]".
  * Fix regression in 23.02.0rc1 which made --gres-flags=enforce-binding no
    longer enforce optimal core-gpu job placement.
  * mpi/pmix - Fix v5 to load correctly when libpmix.so isn't in the normal
    lib path.
  * data_parser/v0.0.39 - fix regression where "memory_per_node" would be
    rejected for job submission.
  * data_parser/v0.0.39 - fix regression where "memory_per_cpu" would be
    rejected for job submission.
  * slurmctld - add an assert to check for magic number presence before deleting

OBS-URL: https://build.opensuse.org/request/show/1076461
OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=248
2023-03-31 15:44:08 +00:00
Dominique Leuenberger
c7d67ed696 Accepting request 1072592 from network:cluster
added: right-pmix-path.patch (forwarded request 1072591 from mslacken)

OBS-URL: https://build.opensuse.org/request/show/1072592
OBS-URL: https://build.opensuse.org/package/show/openSUSE:Factory/slurm?expand=0&rev=87
2023-03-17 16:05:03 +00:00
5c3d4865a1 Accepting request 1072591 from home:mslacken:branches:network:cluster
added: right-pmix-path.patch

OBS-URL: https://build.opensuse.org/request/show/1072591
OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=246
2023-03-17 10:52:44 +00:00
9883ad6d58 Accepting request 1072585 from home:mslacken:branches:network:cluster
- use libpmix.so.2 instead of libpmix.so to fix bsc#1209260;
  this removes the need for pmix-pluginlib

OBS-URL: https://build.opensuse.org/request/show/1072585
OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=245
2023-03-17 10:42:09 +00:00
Dominique Leuenberger
2de2dcca49 Accepting request 1072087 from network:cluster
- slurm-plugins need to require pmix-pluginlib (bsc#1209260) (forwarded request 1072084 from mslacken)

OBS-URL: https://build.opensuse.org/request/show/1072087
OBS-URL: https://build.opensuse.org/package/show/openSUSE:Factory/slurm?expand=0&rev=86
2023-03-15 17:56:12 +00:00
521f372d87 Accepting request 1072084 from home:mslacken:branches:network:cluster
- slurm-plugins need to require pmix-pluginlib (bsc#1209260)

OBS-URL: https://build.opensuse.org/request/show/1072084
OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=243
2023-03-15 10:57:09 +00:00
Dominique Leuenberger
c224ea00c3 Accepting request 1070214 from network:cluster
- Fixing dependencies for slurm-plugin-ext-sensors-rrd again. (forwarded request 1070212 from eeich)

OBS-URL: https://build.opensuse.org/request/show/1070214
OBS-URL: https://build.opensuse.org/package/show/openSUSE:Factory/slurm?expand=0&rev=85
2023-03-09 16:45:23 +00:00
e85b508441 Accepting request 1070212 from home:eeich:branches:network:cluster
- Fixing dependencies for slurm-plugin-ext-sensors-rrd again.

OBS-URL: https://build.opensuse.org/request/show/1070212
OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=241
2023-03-08 15:43:28 +00:00
86940cb8c4 Accepting request 1070094 from home:eeich:branches:network:cluster
- Fix conflicts for plugin-ext-sensors-rrd

OBS-URL: https://build.opensuse.org/request/show/1070094
OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=240
2023-03-08 07:58:58 +00:00
0f04c66747 Accepting request 1070043 from home:eeich:branches:network:cluster
- Fixup previous submission.

OBS-URL: https://build.opensuse.org/request/show/1070043
OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=239
2023-03-07 22:14:15 +00:00
da464bfaae Accepting request 1070038 from home:eeich:branches:network:cluster
- Stop pulling firewall rules from github. There is no benefit in
  hosting these separately.
- Remove pre-sle12 pieces.

- Add missing Provides:, Conflicts: and Obsoletes: to slurm-cray,
  slurm-hdf5 and slurm-testsuite to avoid package conflicts.
- Unify Obsoletes:.
- Consolidate spec files between different Slurm releases in
  Leap/SLE maintenance.

OBS-URL: https://build.opensuse.org/request/show/1070038
OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=238
2023-03-07 21:33:03 +00:00
Dominique Leuenberger
50b2b76a05 Accepting request 1068523 from network:cluster
- Add missing Provides: and Obsoletes: to slurm-cray, slurm-hdf5
  and slurm-testsuite to avoid package conflicts.
- Add dependency for the general plugin package to the
  AcctGatherProfile HDF5 plugin.
- Adjust node RealMemory in slurm.conf of test suite for 8G test
  nodes. (forwarded request 1068522 from eeich)

OBS-URL: https://build.opensuse.org/request/show/1068523
OBS-URL: https://build.opensuse.org/package/show/openSUSE:Factory/slurm?expand=0&rev=84
2023-03-02 22:03:34 +00:00
6997bacde0 Accepting request 1068522 from home:eeich:branches:network:cluster
- Add missing Provides: and Obsoletes: to slurm-cray, slurm-hdf5
  and slurm-testsuite to avoid package conflicts.
- Add dependency for the general plugin package to the
  AcctGatherProfile HDF5 plugin.
- Adjust node RealMemory in slurm.conf of test suite for 8G test
  nodes.

OBS-URL: https://build.opensuse.org/request/show/1068522
OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=236
2023-03-01 17:58:54 +00:00
Dominique Leuenberger
8a8f7dcb78 Accepting request 1068320 from network:cluster
- updated to 23.02.0
  * Highlights
    + slurmctld - Add new RPC rate limiting feature. This is enabled through
      SlurmctldParameters=rl_enable, otherwise disabled by default.
    + Make scontrol reconfigure and sending a SIGHUP to the slurmctld behave
      the same. If you were using SIGHUP as a 'lighter' scontrol reconfigure
      to rotate logs please update your scripts to use SIGUSR2 instead.
    + Change cloud nodes to show by default. PrivateData=cloud is no longer
      needed.
    + sreport - Count planned (FKA reserved) time for jobs running in
      IGNORE_JOBS reservations. Previously was lumped into IDLE time.
    + job_container/tmpfs - Support running with an arbitrary list of private
      mount points (/tmp and /dev/shm are the default, but not required).
    + job_container/tmpfs - Set more environment variables in InitScript.
    + Make all cgroup directories created by Slurm owned by root. This was the
      behavior in cgroup/v2 but not in cgroup/v1 where by default the step
      directories ownership were set to the user and group of the job.
    + accounting_storage/mysql - change purge/archive to calculate record ages
      based on end time, rather than start or submission times.
    + job_submit/lua - add support for log_user() from slurm_job_modify().
    + Run the following scripts in slurmscriptd instead of slurmctld:
      ResumeProgram, ResumeFailProgram, SuspendProgram, ResvProlog, ResvEpilog,
      and RebootProgram (only with SlurmctldParameters=reboot_from_controller).
    + Only permit changing log levels with 'srun --slurmd-debug' by root
      or SlurmUser.
    + slurmctld will fatal() when reconfiguring the job_submit plugin fails.
    + Add PowerDownOnIdle partition option to power down nodes after nodes
      become idle.
    + Add "[jobid.stepid]" prefix from slurmstepd and "slurmscriptd" prefix
      from slurmscriptd to Syslog logging. Previously was only happening when

OBS-URL: https://build.opensuse.org/request/show/1068320
OBS-URL: https://build.opensuse.org/package/show/openSUSE:Factory/slurm?expand=0&rev=83
2023-03-01 15:14:17 +00:00
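
The 23.02.0 highlights in the commit above ask sites that used SIGHUP for log rotation to switch to SIGUSR2. A minimal Python sketch of such a script follows, assuming the caller knows where the slurmctld pidfile lives; the function names and the injectable `send` parameter are hypothetical, added only to make the sketch testable.

```python
import os
import signal


def read_pid(pidfile):
    """Read a process id from a pidfile containing a single integer."""
    with open(pidfile) as handle:
        return int(handle.read().strip())


def rotate_slurmctld_logs(pidfile, send=os.kill):
    """Send SIGUSR2 (not SIGHUP) to the process named in pidfile."""
    pid = read_pid(pidfile)
    send(pid, signal.SIGUSR2)
    return pid
```

In practice the path passed in would be whatever `SlurmctldPidFile` is set to in the site's slurm.conf, and the script would need permission to signal that process.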
e60f39a466 - updated to 23.02.0
OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=234
2023-02-28 20:50:48 +00:00
8899aac00b - testsuite: on later SUSE versions claim ownership of directory
OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=233
2023-02-28 20:34:03 +00:00
18aa012ab9 Accepting request 1068316 from home:eeich:branches:network:cluster
+ Fixed GpuFreqDef option. When set in slurm.conf, it will be used if
      --gpu-freq was not explicitly set by the job step.
    + topology/tree - Add new TopologyParam=SwitchAsNodeRank option to reorder
      nodes based on switch layout. This can be useful if the naming convention
      for the nodes does not naturally map to the network topology.
    + Removed the default setting for GpuFreqDef. If unset, no attempt to change
      the GPU frequency will be made if --gpu-freq is not set for the step.

OBS-URL: https://build.opensuse.org/request/show/1068316
OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=232
2023-02-28 20:30:32 +00:00
ef6d6521aa Accepting request 1067475 from home:eeich:branches:network:cluster
- updated to 23.02.0-0rc1
  * Highlights
    + slurmctld - Add new RPC rate limiting feature. This is enabled through
      SlurmctldParameters=rl_enable, otherwise disabled by default.
    + Make scontrol reconfigure and sending a SIGHUP to the slurmctld behave
      the same. If you were using SIGHUP as a 'lighter' scontrol reconfigure
      to rotate logs please update your scripts to use SIGUSR2 instead.
    + Change cloud nodes to show by default. PrivateData=cloud is no longer
      needed.
    + sreport - Count planned (FKA reserved) time for jobs running in
      IGNORE_JOBS reservations. Previously was lumped into IDLE time.
    + job_container/tmpfs - Support running with an arbitrary list of private
      mount points (/tmp and /dev/shm are the default, but not required).
    + job_container/tmpfs - Set more environment variables in InitScript.
    + Make all cgroup directories created by Slurm owned by root. This was the
      behavior in cgroup/v2 but not in cgroup/v1 where by default the step
      directories ownership were set to the user and group of the job.
    + accounting_storage/mysql - change purge/archive to calculate record ages
      based on end time, rather than start or submission times.
    + job_submit/lua - add support for log_user() from slurm_job_modify().
    + Run the following scripts in slurmscriptd instead of slurmctld:
      ResumeProgram, ResumeFailProgram, SuspendProgram, ResvProlog, ResvEpilog,
      and RebootProgram (only with SlurmctldParameters=reboot_from_controller).
    + Only permit changing log levels with 'srun --slurmd-debug' by root
      or SlurmUser.
    + slurmctld will fatal() when reconfiguring the job_submit plugin fails.
    + Add PowerDownOnIdle partition option to power down nodes after nodes
      become idle.
    + Add "[jobid.stepid]" prefix from slurmstepd and "slurmscriptd" prefix
      from slurmscriptd to Syslog logging. Previously was only happening when

OBS-URL: https://build.opensuse.org/request/show/1067475
OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=231
2023-02-23 19:32:51 +00:00
Dominique Leuenberger
d1ebf00ba6 Accepting request 1063957 from network:cluster
- testsuite: on later SUSE versions claim ownership of directory
  /etc/security/limits.d. (forwarded request 1063954 from eeich)

OBS-URL: https://build.opensuse.org/request/show/1063957
OBS-URL: https://build.opensuse.org/package/show/openSUSE:Factory/slurm?expand=0&rev=82
2023-02-09 15:23:26 +00:00
4693e39860 Accepting request 1063954 from home:eeich:branches:network:cluster
- testsuite: on later SUSE versions claim ownership of directory
  /etc/security/limits.d.

OBS-URL: https://build.opensuse.org/request/show/1063954
OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=229
2023-02-09 08:22:55 +00:00
Dominique Leuenberger
a4484c7dc2 Accepting request 1042071 from network:cluster
OBS-URL: https://build.opensuse.org/request/show/1042071
OBS-URL: https://build.opensuse.org/package/show/openSUSE:Factory/slurm?expand=0&rev=81
2022-12-11 16:16:58 +00:00
6f080824a4 Accepting request 1039957 from home:eeich:branches:network:cluster
- Move the ext_sensors/rrd plugin to a separate package: this
  plugin requires librrd which in turn requires huge parts of
  the client side X Window System stack.
  There is probably no use in cluttering up a system for a
  plugin that is probably only used by a few.

OBS-URL: https://build.opensuse.org/request/show/1039957
OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=227
2022-12-11 07:58:12 +00:00
Dominique Leuenberger
30dd030610 Accepting request 1031255 from network:cluster
- Test Suite fixes:
  * Update README_Testsuite.md.
  * Clean up left over files when de-installing test suite.
  * Adjustment to test suite package: for SLE mark the openmpi4
    devel package and slurm-hdf5 optional.
  * Add -ffat-lto-objects to the build flags when LTO is set to
    make sure the object files we ship with the test suite still
    work correctly.
  * Improve setup-testsuite.sh: copy ssh fingerprints from all nodes.

- set environment variable SUSE_ZNOW to 0 in %build to avoid module load
  failures due to unresolved symbols as modules take advantage of lazy
  bindings (bsc#1200030).

OBS-URL: https://build.opensuse.org/request/show/1031255
OBS-URL: https://build.opensuse.org/package/show/openSUSE:Factory/slurm?expand=0&rev=80
2022-10-26 10:32:00 +00:00
212048404b * Improve setup-testsuite.sh: copy ssh fingerprints from all nodes.
OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=225
2022-10-26 06:23:36 +00:00
776ce8f23b - Test Suite fixes:
  * Update README_Testsuite.md.
  * Clean up left over files when de-installing test suite.
  * Adjustment to test suite package: for SLE mark the openmpi4
    devel package and slurm-hdf5 optional.
  * Add -ffat-lto-objects to the build flags when LTO is set to
    make sure the object files we ship with the test suite still
    work correctly.

OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=224
2022-10-25 11:33:49 +00:00
642a47efa7 - Adjustment to test suite package: only recommend openmpi4
OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=223
2022-10-24 08:54:35 +00:00
52046053d5 Accepting request 1030610 from home:eeich:branches:network:cluster
- Update README_Testsuite.md.
- Make hdf5 package optional for test suite.
- Clean up leftover files when de-installing test suite.

- Set environment variable SUSE_ZNOW to 0 in %build to avoid module load
  failures due to unresolved symbols, as modules take advantage of lazy
  binding (bsc#1200030).

OBS-URL: https://build.opensuse.org/request/show/1030610
OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=222
2022-10-24 05:31:40 +00:00
Dominique Leuenberger
220eec76a4 Accepting request 1030432 from network:cluster
- Updated to 22.05.5.
- NOTE: Slurm validates that libraries are of the same version. Unfortunately,
  due to an oversight, we failed to notice that the slurmstepd loads the
  hash_k12 library only after a job has completed. This means that if the
  hash_k12 library is upgraded before a job finishes, the slurmstepd will load
  the new library when the job finishes, and will fail due to a mismatch of
  versions.  This results in nodes with slurmstepd processes stuck
  indefinitely. These processes require manual intervention to clean up. There
  is no clean way to resolve these hung slurmstepd processes.
  The only recommended way to upgrade between minor versions of 22.05 with
  RPMs or upgrades that replace current binaries and libraries is to drain the
  nodes of running jobs first.
- Fixes a number of moderate-severity issues, notably:
  * Load hash plugin at slurmstepd launch time to prevent issues loading the
    plugin at step completion if the Slurm installation is upgraded.
  * Update nvml plugin to match the unique id format for MIG devices in new
    Nvidia drivers.
  * Fix multi-node step launch failure when nodes in the controller aren't in
    natural order. This can happen with inconsistent node naming (such as
    node15 and node052) or with dynamic nodes which can register in any order.
  * job_container/tmpfs - clean up containers even when the .ns file isn't
    mounted anymore.
  * Wait up to PrologEpilogTimeout before shutting down slurmd to allow prolog
    and epilog scripts to complete or time out. Previously, slurmd waited 120
    seconds before timing out and killing prolog and epilog scripts. (forwarded request 1010642 from mslacken)

OBS-URL: https://build.opensuse.org/request/show/1030432
OBS-URL: https://build.opensuse.org/package/show/openSUSE:Factory/slurm?expand=0&rev=79
2022-10-22 12:13:18 +00:00
c2551ab47f Accepting request 1010642 from home:mslacken:branches:network:cluster
- Updated to 22.05.5.
- NOTE: Slurm validates that libraries are of the same version. Unfortunately,
  due to an oversight, we failed to notice that the slurmstepd loads the
  hash_k12 library only after a job has completed. This means that if the
  hash_k12 library is upgraded before a job finishes, the slurmstepd will load
  the new library when the job finishes, and will fail due to a mismatch of
  versions.  This results in nodes with slurmstepd processes stuck
  indefinitely. These processes require manual intervention to clean up. There
  is no clean way to resolve these hung slurmstepd processes.
  The only recommended way to upgrade between minor versions of 22.05 with
  RPMs or upgrades that replace current binaries and libraries is to drain the
  nodes of running jobs first.
- Fixes a number of moderate-severity issues, notably:
  * Load hash plugin at slurmstepd launch time to prevent issues loading the
    plugin at step completion if the Slurm installation is upgraded.
  * Update nvml plugin to match the unique id format for MIG devices in new
    Nvidia drivers.
  * Fix multi-node step launch failure when nodes in the controller aren't in
    natural order. This can happen with inconsistent node naming (such as
    node15 and node052) or with dynamic nodes which can register in any order.
  * job_container/tmpfs - clean up containers even when the .ns file isn't
    mounted anymore.
  * Wait up to PrologEpilogTimeout before shutting down slurmd to allow prolog
    and epilog scripts to complete or time out. Previously, slurmd waited 120
    seconds before timing out and killing prolog and epilog scripts.

OBS-URL: https://build.opensuse.org/request/show/1010642
OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=220
2022-10-21 15:00:25 +00:00
Dominique Leuenberger
edd405b2c8 Accepting request 1006180 from network:cluster
- Do not deduplicate files of testsuite Slurm configuration.
  This directory is supposed to be mounted over /etc/slurm
  therefore it must not contain softlinks to the files in
  this directory.
- Improve .a and .o file collection for test suite: find these
  files even if there are multiple ones in a single line. (forwarded request 1005746 from eeich)

OBS-URL: https://build.opensuse.org/request/show/1006180
OBS-URL: https://build.opensuse.org/package/show/openSUSE:Factory/slurm?expand=0&rev=78
2022-09-26 16:48:44 +00:00
09aecc2015 Accepting request 1005746 from home:eeich:branches:network:cluster
- Do not deduplicate files of testsuite Slurm configuration.
  This directory is supposed to be mounted over /etc/slurm
  therefore it must not contain softlinks to the files in
  this directory.
- Improve .a and .o file collection for test suite: find these
  files even if there are multiple ones in a single line.

OBS-URL: https://build.opensuse.org/request/show/1005746
OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=218
2022-09-26 15:01:51 +00:00
Dominique Leuenberger
ae04ec8787 Accepting request 1005247 from network:cluster
- Fix build for older product version. (forwarded request 1005246 from eeich)

OBS-URL: https://build.opensuse.org/request/show/1005247
OBS-URL: https://build.opensuse.org/package/show/openSUSE:Factory/slurm?expand=0&rev=77
2022-09-22 12:49:55 +00:00
3f68233e21 Accepting request 1005246 from home:eeich:branches:network:cluster
- Fix build for older product version.

OBS-URL: https://build.opensuse.org/request/show/1005246
OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=216
2022-09-21 15:33:09 +00:00
Dominique Leuenberger
d3bcbab808 Accepting request 992362 from network:cluster
- Fix a potential security vulnerability in the test package
  (bsc#1201674, CVE-2022-31251).

- Patch NOFILE Limit in the slurmd.service copy for the testsuite. (forwarded request 992353 from eeich)

OBS-URL: https://build.opensuse.org/request/show/992362
OBS-URL: https://build.opensuse.org/package/show/openSUSE:Factory/slurm?expand=0&rev=76
2022-08-02 20:09:54 +00:00
b60ac5f569 Accepting request 992353 from home:eeich:branches:network:cluster
- Fix a potential security vulnerability in the test package
  (bsc#1201674, CVE-2022-31251).

- Patch NOFILE Limit in the slurmd.service copy for the testsuite.

OBS-URL: https://build.opensuse.org/request/show/992353
OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=214
2022-08-02 15:34:01 +00:00