Commit Graph

317 Commits

Author SHA256 Message Date
Dominique Leuenberger
8a2be70840 Accepting request 1238577 from network:cluster
* `slurmrestd` - Remove deprecated fields from the following
     `.result` from `POST /slurm/v0.0.42/job/submit`.  
     `.job_id`, `.step_id`, `.job_submit_user_msg` from `POST /slurm/v0.0.42/job/{job_id}`.  
     `.job.exclusive`, `.jobs[].exclusive` to `POST /slurm/v0.0.42/job/submit`.  
     `.jobs[].exclusive` from `GET /slurm/v0.0.42/job/{job_id}`.  
     `.jobs[].exclusive` from `GET /slurm/v0.0.42/jobs`.  
     `.job.oversubscribe`, `.jobs[].oversubscribe` to `POST /slurm/v0.0.42/job/submit`.  
     `.jobs[].oversubscribe` from `GET /slurm/v0.0.42/job/{job_id}`.  
     `.jobs[].oversubscribe` from `GET /slurm/v0.0.42/jobs`.  
     `DELETE /slurm/v0.0.40/jobs`  
     `DELETE /slurm/v0.0.41/jobs`  
     `DELETE /slurm/v0.0.42/jobs`  
    allocation is granted.
    `job|socket|task` or `cpus|mem` per GRES.
    node update whereas previously only single nodes could be
    updated through `/node/<nodename>` endpoint:
    `POST /slurm/v0.0.42/nodes`
    partition as this is a cluster-wide option.
    `REQUEST_NODE_INFO RPC`.
    the db server is not reachable.
    (`.jobs[].priority_by_partition`) to JSON and YAML output.
    connection` error if the error was the result of an
    authentication failure.
    errors with the `SLURM_PROTOCOL_AUTHENTICATION_ERROR` error
    code.
    of `Unspecified error` if querying the following endpoints
    fails:  
    `GET /slurm/v0.0.40/diag/`  
    `GET /slurm/v0.0.41/diag/`  
    `GET /slurm/v0.0.42/diag/` (forwarded request 1238576 from eeich)

OBS-URL: https://build.opensuse.org/request/show/1238577
OBS-URL: https://build.opensuse.org/package/show/openSUSE:Factory/slurm?expand=0&rev=111
2025-01-18 12:18:25 +00:00
247a29f2a0 * slurmrestd - Remove deprecated fields from the following
`.result` from `POST /slurm/v0.0.42/job/submit`.  
     `.job_id`, `.step_id`, `.job_submit_user_msg` from `POST /slurm/v0.0.42/job/{job_id}`.  
     `.job.exclusive`, `.jobs[].exclusive` to `POST /slurm/v0.0.42/job/submit`.  
     `.jobs[].exclusive` from `GET /slurm/v0.0.42/job/{job_id}`.  
     `.jobs[].exclusive` from `GET /slurm/v0.0.42/jobs`.  
     `.job.oversubscribe`, `.jobs[].oversubscribe` to `POST /slurm/v0.0.42/job/submit`.  
     `.jobs[].oversubscribe` from `GET /slurm/v0.0.42/job/{job_id}`.  
     `.jobs[].oversubscribe` from `GET /slurm/v0.0.42/jobs`.  
     `DELETE /slurm/v0.0.40/jobs`  
     `DELETE /slurm/v0.0.41/jobs`  
     `DELETE /slurm/v0.0.42/jobs`  
    allocation is granted.
    `job|socket|task` or `cpus|mem` per GRES.
    node update whereas previously only single nodes could be
    updated through `/node/<nodename>` endpoint:
    `POST /slurm/v0.0.42/nodes`
    partition as this is a cluster-wide option.
    `REQUEST_NODE_INFO RPC`.
    the db server is not reachable.
    (`.jobs[].priority_by_partition`) to JSON and YAML output.
    connection` error if the error was the result of an
    authentication failure.
    errors with the `SLURM_PROTOCOL_AUTHENTICATION_ERROR` error
    code.
    of `Unspecified error` if querying the following endpoints
    fails:  
    `GET /slurm/v0.0.40/diag/`  
    `GET /slurm/v0.0.41/diag/`  
    `GET /slurm/v0.0.42/diag/`

OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=307
2025-01-17 21:14:19 +00:00
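The versioned endpoints named in the entry above all follow the same `/slurm/<version>/<resource>` pattern. As a rough illustration only, the Python sketch below enumerates such paths for the three data-parser versions mentioned. The helper name `versioned_paths` is hypothetical and is not part of Slurm or `slurmrestd`.

```python
# Hypothetical helper, not part of Slurm: builds the "METHOD path" strings
# used in the changelog for each listed data-parser version.
def versioned_paths(method, resource, versions):
    """Return 'METHOD /slurm/<version>/<resource>' strings, one per version."""
    return [f"{method} /slurm/{v}/{resource}" for v in versions]


# Data-parser versions named in the changelog entry.
VERSIONS = ["v0.0.40", "v0.0.41", "v0.0.42"]

if __name__ == "__main__":
    for line in versioned_paths("GET", "diag/", VERSIONS):
        print(line)
```

The same helper reproduces the `DELETE .../jobs` lines by passing `"DELETE"` and `"jobs"`.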
3a3588a812 - Make test suite package work on SLE-12.
OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=306
2025-01-17 20:34:50 +00:00
Ana Guerrero
3b4d2235f3 Accepting request 1236247 from network:cluster
- Fix testsuite:
  Cater for erroneous: `#include </src/[slurm_internal_header]>`
  statements. (forwarded request 1236246 from eeich)

OBS-URL: https://build.opensuse.org/request/show/1236247
OBS-URL: https://build.opensuse.org/package/show/openSUSE:Factory/slurm?expand=0&rev=110
2025-01-12 10:14:54 +00:00
bf43fd9d06 - Fix testsuite:
  Cater for erroneous: `#include </src/[slurm_internal_header]>`
  statements.

OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=304
2025-01-09 15:43:36 +00:00
Ana Guerrero
e8b6930a42 Accepting request 1235784 from network:cluster
- Update to version 24.11
  * `slurmctld` - Reject arbitrary distribution jobs that do not
    specify a task count.
  * Fix backwards compatibility of the `RESPONSE_JOB_INFO RPC`
    (used by `squeue`, `scontrol show job`, etc.) with Slurm clients
    version 24.05 and below. This was a regression in 24.11.0rc1.
  * Do not let `slurmctld`/`slurmd` start if there are more nodes
    defined in `slurm.conf` than the maximum supported amount
    (64k nodes).
  * `slurmctld` - Set job's exit code to 1 when a job fails with
    state `JOB_NODE_FAIL`. This fixes `sbatch --wait` not being able
    to exit with error code when a job fails for this reason in
    some cases.
  * Fix certain reservation updates requested from 23.02 clients.
  * `slurmrestd` - Fix populating non-required object fields of
    objects as `{}` in JSON/YAML instead of `null` causing compiled
    OpenAPI clients to reject the response to
    `GET /slurm/v0.0.40/jobs` due to validation failure of
    `.jobs[].job_resources`.
  * Fix issue where older versions of Slurm talking to a 24.11 dbd
    could lose step accounting.
  * Fix minor memory leaks.
  * Fix bad memory reference when `xstrchr` fails to find char.
  * Remove duplicate checks for a data structure.
  * Fix race condition in `stepmgr` step completion handling.
  * `slurm.spec` - add ability to specify patches to apply on the
    command line.
  * `slurm.spec` - add ability to supply extra version information.
  * Fix 24.11 HA issues.
  * Fix requeued jobs keeping their priority until the decay thread (forwarded request 1235783 from eeich)

OBS-URL: https://build.opensuse.org/request/show/1235784
OBS-URL: https://build.opensuse.org/package/show/openSUSE:Factory/slurm?expand=0&rev=109
2025-01-09 14:07:22 +00:00
626fb47a3b - Update to version 24.11
  * `slurmctld` - Reject arbitrary distribution jobs that do not
    specify a task count.
  * Fix backwards compatibility of the `RESPONSE_JOB_INFO RPC`
    (used by `squeue`, `scontrol show job`, etc.) with Slurm clients
    version 24.05 and below. This was a regression in 24.11.0rc1.
  * Do not let `slurmctld`/`slurmd` start if there are more nodes
    defined in `slurm.conf` than the maximum supported amount
    (64k nodes).
  * `slurmctld` - Set job's exit code to 1 when a job fails with
    state `JOB_NODE_FAIL`. This fixes `sbatch --wait` not being able
    to exit with error code when a job fails for this reason in
    some cases.
  * Fix certain reservation updates requested from 23.02 clients.
  * `slurmrestd` - Fix populating non-required object fields of
    objects as `{}` in JSON/YAML instead of `null` causing compiled
    OpenAPI clients to reject the response to
    `GET /slurm/v0.0.40/jobs` due to validation failure of
    `.jobs[].job_resources`.
  * Fix issue where older versions of Slurm talking to a 24.11 dbd
    could lose step accounting.
  * Fix minor memory leaks.
  * Fix bad memory reference when `xstrchr` fails to find char.
  * Remove duplicate checks for a data structure.
  * Fix race condition in `stepmgr` step completion handling.
  * `slurm.spec` - add ability to specify patches to apply on the
    command line.
  * `slurm.spec` - add ability to supply extra version information.
  * Fix 24.11 HA issues.
  * Fix requeued jobs keeping their priority until the decay thread

OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=302
2025-01-08 06:03:29 +00:00
Dominique Leuenberger
17d576bce0 Accepting request 1220076 from network:cluster
- Update to version 24.05.4 & fix for CVE-2024-48936.
  * Fix generic int sort functions.
  * Fix user look up using possible unrealized uid in the dbd.
  * `slurmrestd` - Fix regressions that allowed `slurmrestd` to
    be run as SlurmUser when `SlurmUser` was not root.
  * mpi/pmix fix race conditions with het jobs at step start/end
    which could make srun hang.
  * Fix not showing some `SelectTypeParameters` in `scontrol show
    config`.
  * Avoid assert when dumping certain removed fields in JSON/YAML.
  * Improve how shards are scheduled with affinity in mind.
  * Fix `MaxJobsAccruePU` not being respected when `MaxJobsAccruePA`
    is set in the same QOS.
  * Prevent backfill from planning jobs that use overlapping
    resources for the same time slot if the job's time limit is
    less than `bf_resolution`.
  * Fix memory leak when requesting typed gres and
    `--[cpus|mem]-per-gpu`.
  * Prevent backfill from breaking out due to "system state
    changed" every 30 seconds if reservations use `REPLACE` or
    `REPLACE_DOWN` flags.
  * `slurmrestd` - Make sure that scheduler_unset parameter defaults
    to true even when the following flags are also set:
    `show_duplicates`, `skip_steps`, `disable_truncate_usage_time`,
    `run_away_jobs`, `whole_hetjob`, `disable_whole_hetjob`,
    `disable_wait_for_result`, `usage_time_as_submit_time`,
    `show_batch_script`, and/or `show_job_environment`. Additionally,
    always make sure show_duplicates and
    `disable_truncate_usage_time` default to true when the following
    flags are also set: `scheduler_unset`, `scheduled_on_submit`, (forwarded request 1220075 from eeich)

OBS-URL: https://build.opensuse.org/request/show/1220076
OBS-URL: https://build.opensuse.org/package/show/openSUSE:Factory/slurm?expand=0&rev=108
2024-11-01 20:07:50 +00:00
b1107f7a34 - Update to version 24.05.4 & fix for CVE-2024-48936.
  * Fix generic int sort functions.
  * Fix user look up using possible unrealized uid in the dbd.
  * `slurmrestd` - Fix regressions that allowed `slurmrestd` to
    be run as SlurmUser when `SlurmUser` was not root.
  * mpi/pmix fix race conditions with het jobs at step start/end
    which could make srun hang.
  * Fix not showing some `SelectTypeParameters` in `scontrol show
    config`.
  * Avoid assert when dumping certain removed fields in JSON/YAML.
  * Improve how shards are scheduled with affinity in mind.
  * Fix `MaxJobsAccruePU` not being respected when `MaxJobsAccruePA`
    is set in the same QOS.
  * Prevent backfill from planning jobs that use overlapping
    resources for the same time slot if the job's time limit is
    less than `bf_resolution`.
  * Fix memory leak when requesting typed gres and
    `--[cpus|mem]-per-gpu`.
  * Prevent backfill from breaking out due to "system state
    changed" every 30 seconds if reservations use `REPLACE` or
    `REPLACE_DOWN` flags.
  * `slurmrestd` - Make sure that scheduler_unset parameter defaults
    to true even when the following flags are also set:
    `show_duplicates`, `skip_steps`, `disable_truncate_usage_time`,
    `run_away_jobs`, `whole_hetjob`, `disable_whole_hetjob`,
    `disable_wait_for_result`, `usage_time_as_submit_time`,
    `show_batch_script`, and/or `show_job_environment`. Additionally,
    always make sure show_duplicates and
    `disable_truncate_usage_time` default to true when the following
    flags are also set: `scheduler_unset`, `scheduled_on_submit`,

OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=300
2024-11-01 13:22:34 +00:00
Ana Guerrero
3133935d61 Accepting request 1217321 from network:cluster
- Add %{?sysusers_requires} to slurm-config.
  This fixes issues when building against Slurm. (forwarded request 1217300 from eeich)

OBS-URL: https://build.opensuse.org/request/show/1217321
OBS-URL: https://build.opensuse.org/package/show/openSUSE:Factory/slurm?expand=0&rev=107
2024-10-24 13:42:28 +00:00
427f09ad29 - Add %{?sysusers_requires} to slurm-config.
  This fixes issues when building against Slurm.

OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=298
2024-10-23 09:42:56 +00:00
Ana Guerrero
de9dc95156 Accepting request 1208086 from network:cluster
- Update to version 24.05.3
  * `data_parser/v0.0.40` - Added field descriptions.
  * `slurmrestd` - Avoid creating new slurmdbd connection per request
    to `* /slurm/slurmctld/*/*` endpoints.
  * Fix compilation issue with `switch/hpe_slingshot` plugin.
  * Fix gres per task allocation with threads-per-core.
  * `data_parser/v0.0.41` - Added field descriptions.
  * `slurmrestd` - Change back generated OpenAPI schema for
    `DELETE /slurm/v0.0.40/jobs/` to `RequestBody` instead of using
    parameters for request. `slurmrestd` will continue to accept endpoint
    requests via `RequestBody` or HTTP query.
  * `topology/tree` - Fix issues with switch distance optimization.
  * Fix potential segfault of secondary `slurmctld` when falling back
    to the primary when running with a `JobComp` plugin.
  * Enable `--json`/`--yaml=v0.0.39` options on client commands to
    dump data using data_parser/v0.0.39 instead of outputting nothing.
  * `switch/hpe_slingshot` - Fix issue that could result in a 0 length
    state file.
  * Fix unnecessary message protocol downgrade for unregistered nodes.
  * Fix unnecessarily packing alias addrs when terminating jobs with
    a mix of non-cloud/dynamic nodes and powered down cloud/dynamic
    nodes.
  * `accounting_storage/mysql` - Fix issue when deleting a qos that
    could remove too many commas from the qos and/or delta_qos fields
    of the assoc table.
  * `slurmctld` - Fix memory leak when using RestrictedCoresPerGPU.
  * Fix allowing access to reservations without `MaxStartDelay` set.
  * Fix regression introduced in 24.05.0rc1 breaking
    `srun --send-libs` parsing.
  * Fix slurmd vsize memory leak when using job submission/allocation

OBS-URL: https://build.opensuse.org/request/show/1208086
OBS-URL: https://build.opensuse.org/package/show/openSUSE:Factory/slurm?expand=0&rev=106
2024-10-15 13:01:34 +00:00
1cc2983ebe - Removed Fix-test-21.41.patch as upstream test changed.
- Dropped package plugin-ext-sensors-rrd as the plugin module no
  longer exists.

OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=296
2024-10-15 10:19:24 +00:00
b2f6e848a1 - Update to version 24.05.3
  * `data_parser/v0.0.40` - Added field descriptions.
  * `slurmrestd` - Avoid creating new slurmdbd connection per request
    to `* /slurm/slurmctld/*/*` endpoints.
  * Fix compilation issue with `switch/hpe_slingshot` plugin.
  * Fix gres per task allocation with threads-per-core.
  * `data_parser/v0.0.41` - Added field descriptions.
  * `slurmrestd` - Change back generated OpenAPI schema for
    `DELETE /slurm/v0.0.40/jobs/` to `RequestBody` instead of using
    parameters for request. `slurmrestd` will continue to accept endpoint
    requests via `RequestBody` or HTTP query.
  * `topology/tree` - Fix issues with switch distance optimization.
  * Fix potential segfault of secondary `slurmctld` when falling back
    to the primary when running with a `JobComp` plugin.
  * Enable `--json`/`--yaml=v0.0.39` options on client commands to
    dump data using data_parser/v0.0.39 instead of outputting nothing.
  * `switch/hpe_slingshot` - Fix issue that could result in a 0 length
    state file.
  * Fix unnecessary message protocol downgrade for unregistered nodes.
  * Fix unnecessarily packing alias addrs when terminating jobs with
    a mix of non-cloud/dynamic nodes and powered down cloud/dynamic
    nodes.
  * `accounting_storage/mysql` - Fix issue when deleting a qos that
    could remove too many commas from the qos and/or delta_qos fields
    of the assoc table.
  * `slurmctld` - Fix memory leak when using RestrictedCoresPerGPU.
  * Fix allowing access to reservations without `MaxStartDelay` set.
  * Fix regression introduced in 24.05.0rc1 breaking
    `srun --send-libs` parsing.
  * Fix slurmd vsize memory leak when using job submission/allocation

OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=295
2024-10-15 06:51:09 +00:00
fc209e050f - updated to new release 24.05.0 with following major changes
- IMPORTANT NOTES:
  If using the slurmdbd (Slurm DataBase Daemon) you must update
  this first.  NOTE: If using a backup DBD you must start the
  primary first to do any database conversion, the backup will not
  start until this has happened.  The 24.05 slurmdbd will work
  with Slurm daemons of version 23.02 and above.  You will not
  need to update all clusters at the same time, but it is very
  important to update slurmdbd first and have it running before
  updating any other clusters making use of it.
- HIGHLIGHTS
  * Federation - allow client command operation when slurmdbd is
    unavailable.
  * burst_buffer/lua - Added two new hooks: slurm_bb_test_data_in
    and slurm_bb_test_data_out. The syntax and use of the new hooks
    are documented in etc/burst_buffer.lua.example. These are
    required to exist. slurmctld now checks on startup if the
    burst_buffer.lua script loads and contains all required hooks;
    slurmctld will exit with a fatal error if this is not
    successful. Added PollInterval to burst_buffer.conf. Removed
    the arbitrary limit of 512 copies of the script running
    simultaneously.
  * Add QOS limit MaxTRESRunMinsPerAccount. 
  * Add QOS limit MaxTRESRunMinsPerUser.
  * Add ELIGIBLE environment variable to jobcomp/script plugin.
  * Always use the QOS name for SLURM_JOB_QOS environment variables.
    Previously the batch environment would use the description field,
    which was usually equivalent to the name. 
  * cgroup/v2 - Require dbus-1 version >= 1.11.16.
  * Allow NodeSet names to be used in SuspendExcNodes.

OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=294
2024-10-14 10:03:00 +00:00
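The compatibility rule in the IMPORTANT NOTES above (a 24.05 slurmdbd works with Slurm daemons of version 23.02 and above) can be sketched as a simple version comparison. This is an illustration under assumptions, not Slurm code: the function names are hypothetical, versions are assumed to be plain `major.minor[.patch]` strings, and the upper bound (a daemon no newer than the slurmdbd) is inferred from the advice to update slurmdbd first rather than stated in the notes.

```python
# Hypothetical sketch of the slurmdbd compatibility window described above.
def parse_version(text):
    """Turn '23.02' or '24.05.3' into a (major, minor) tuple of ints."""
    parts = text.split(".")
    return int(parts[0]), int(parts[1])


# Oldest daemon release a 24.05 slurmdbd accepts, per the notes above.
OLDEST_SUPPORTED = parse_version("23.02")
# Assumed upper bound: daemons should not be newer than the slurmdbd.
DBD_VERSION = parse_version("24.05")


def daemon_can_talk_to_dbd(daemon_version):
    """True if a daemon of this version falls inside the supported window."""
    version = parse_version(daemon_version)
    return OLDEST_SUPPORTED <= version <= DBD_VERSION


if __name__ == "__main__":
    for candidate in ["22.05.7", "23.02.5", "23.11.1", "24.05.0", "24.11"]:
        print(candidate, daemon_can_talk_to_dbd(candidate))
```

Comparing `(major, minor)` tuples avoids the pitfalls of comparing version strings lexically.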
Ana Guerrero
61add11d2b Accepting request 1161658 from network:cluster
OBS-URL: https://build.opensuse.org/request/show/1161658
OBS-URL: https://build.opensuse.org/package/show/openSUSE:Factory/slurm?expand=0&rev=105
2024-03-26 18:27:40 +00:00
cda5ce024e Accepting request 1161499 from home:mslacken:branches:network:cluster
- removed Keep-logs-of-skipped-test-when-running-test-cases-sequentially.patch
  as incorporated upstream
* Changes in Slurm 23.02.5
 * Add the JobId to debug() messages indicating when cpus_per_task/mem_per_cpu
   or pn_min_cpus are being automatically adjusted.
 * Fix regression in 23.02.2 that caused slurmctld -R to crash on startup if
   a node features plugin is configured.
 * Fix and prevent reoccurring reservations from overlapping.
 * job_container/tmpfs - Avoid attempts to share BasePath between nodes.
 * Change the log message warning for rate limited users from verbose to info.
 * With CR_Cpu_Memory, fix node selection for jobs that request gres and
   *-mem-per-cpu.
 * Fix a regression from 22.05.7 in which some jobs were allocated too few
   nodes, thus overcommitting cpus to some tasks.
 * Fix a job being stuck in the completing state if the job ends while the
   primary controller is down or unresponsive and the backup controller has
   not yet taken over.
 * Fix slurmctld segfault when a node registers with a configured CpuSpecList
   while slurmctld configuration has the node without CpuSpecList.
 * Fix cloud nodes getting stuck in POWERED_DOWN+NO_RESPOND state after not
   registering by ResumeTimeout.
 * slurmstepd - Avoid cleanup of config.json-less containers spooldir getting
   skipped.
 * slurmstepd - Cleanup per task generated environment for containers in
   spooldir.
 * Fix scontrol segfault when 'completing' command requested repeatedly in
   interactive mode.
 * Properly handle a race condition between bind() and listen() calls in the
   network stack when running with SrunPortRange set.
 * Federation - Fix revoked jobs being returned regardless of the -a/--all

OBS-URL: https://build.opensuse.org/request/show/1161499
OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=292
2024-03-26 08:40:44 +00:00
2bd53c8d44 work correctly (boo#1204697).
OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=291
2024-03-23 10:05:59 +00:00
Ana Guerrero
4ec0f5cd48 Accepting request 1151965 from network:cluster
- Update to version 23.11.03
  * slurmrestd - Reject single http query with multiple path
    requests.
  * Fix launching Singularity v4.x containers with
    `srun --container` by setting .process.terminal to true in
    generated `config.json` when step has pseudoterminal (`--pty`)
    requested.
  * Fix loading in `dynamic/cloud` node jobs after `net_cred`
    expired.
  * Fix cgroup null path error on `slurmd/slurmstepd` tear down.
  * `data_parser/v0.0.40` - Prevent failure if accounting is
    disabled, instead issue a warning if needed data from the
    database can not be retrieved.
  * `openapi/slurmctld` - Prevent failure if accounting is disabled.
  * Prevent `slurmscriptd` processing delays from blocking other
    threads in `slurmctld` while trying to launch various scripts.
    This is additional work for a fix in 23.02.6.
  * Fix memory leak when receiving alias addrs from controller.
  * `scontrol` - Accept `scontrol token lifespan=infinite` to
    create tokens that effectively do not expire.
  * Avoid errors when Slurmdb accounting disabled when `--json` or
    `--yaml` is invoked with CLI commands and `slurmrestd`. Add
    warnings when query would have populated data from Slurmdb
    instead of errors.
  * Fix `slurmctld` memory leak when running job with
    `--tres-per-task=gres:shard:#`
  * Fix backfill trying to start jobs outside of backfill window.
  * Fix oversubscription on partitions with `PreemptMode=OFF`.
  * Preserve node reason on power up if the node is downed
    or drained. (forwarded request 1150524 from eeich)

OBS-URL: https://build.opensuse.org/request/show/1151965
OBS-URL: https://build.opensuse.org/package/show/openSUSE:Factory/slurm?expand=0&rev=104
2024-02-27 21:47:57 +00:00
fb460ebe6a Accepting request 1150524 from home:eeich:branches:network:cluster
- Update to version 23.11.03
  * slurmrestd - Reject single http query with multiple path
    requests.
  * Fix launching Singularity v4.x containers with
    `srun --container` by setting .process.terminal to true in
    generated `config.json` when step has pseudoterminal (`--pty`)
    requested.
  * Fix loading in `dynamic/cloud` node jobs after `net_cred`
    expired.
  * Fix cgroup null path error on `slurmd/slurmstepd` tear down.
  * `data_parser/v0.0.40` - Prevent failure if accounting is
    disabled, instead issue a warning if needed data from the
    database can not be retrieved.
  * `openapi/slurmctld` - Prevent failure if accounting is disabled.
  * Prevent `slurmscriptd` processing delays from blocking other
    threads in `slurmctld` while trying to launch various scripts.
    This is additional work for a fix in 23.02.6.
  * Fix memory leak when receiving alias addrs from controller.
  * `scontrol` - Accept `scontrol token lifespan=infinite` to
    create tokens that effectively do not expire.
  * Avoid errors when Slurmdb accounting disabled when `--json` or
    `--yaml` is invoked with CLI commands and `slurmrestd`. Add
    warnings when query would have populated data from Slurmdb
    instead of errors.
  * Fix `slurmctld` memory leak when running job with
    `--tres-per-task=gres:shard:#`
  * Fix backfill trying to start jobs outside of backfill window.
  * Fix oversubscription on partitions with `PreemptMode=OFF`.
  * Preserve node reason on power up if the node is downed
    or drained.

OBS-URL: https://build.opensuse.org/request/show/1150524
OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=289
2024-02-26 21:40:59 +00:00
Ana Guerrero
6a021ebb80 Accepting request 1141442 from network:cluster
- Update to 23.11.1 with following major improvements and fixing
  CVE-2023-49933, CVE-2023-49934, CVE-2023-49935, CVE-2023-49936
  and CVE-2023-49937
  * Substantially overhauled the SlurmDBD association management
    code. For clusters updated to 23.11, account and user
    additions or removals are significantly faster than in prior
    releases.
  * Overhauled `scontrol reconfigure` to prevent configuration
    mistakes from disabling slurmctld and slurmd. Instead, an
    error will be returned, and the running configuration will
    persist. This does require updates to the systemd service
    files to use the `--systemd` option to `slurmctld` and `slurmd`.
  * Added a new internal `auth/cred` plugin - `auth/slurm`. This
    builds off the prior `auth/jwt` model, and permits operation
    of the `slurmdbd` and `slurmctld` without access to full
    directory information with a suitable configuration.
  * Added a new `--external-launcher` option to `srun`, which is
    automatically set by common MPI launcher implementations and
    ensures processes using those non-srun launchers have full
    access to all resources allocated on each node.
  * Reworked the dynamic/cloud modes of operation to allow for
    "fanout" - where Slurm communication can be automatically
    offloaded to compute nodes for increased cluster scalability.
  * Overhauled and extended the Reservation subsystem to allow
    for most of the same resource requirements as are placed on
    the job. Notably, this permits reservations to now reserve
    GRES directly.
- Details of changes:
  * Fix `scontrol update job=... TimeLimit+=/-=` when used with a
    raw JobId of job array element.

OBS-URL: https://build.opensuse.org/request/show/1141442
OBS-URL: https://build.opensuse.org/package/show/openSUSE:Factory/slurm?expand=0&rev=103
2024-01-25 17:41:05 +00:00
f98ecb23d5 - Remove last change. This is not how it is intended to work
OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=287
2024-01-25 07:58:54 +00:00
a95f2355d0 Accepting request 1141020 from home:dimstar:Factory
- Fix dependency of testsuite when building without hdf5
  (have_hdf5=0). The previously used construct
  %{?have_hdf5:%ts_depends: does not behave as the line's author
  intended: %{?…:} does not test the value of the variable, only
  whether it is defined or undefined.

OBS-URL: https://build.opensuse.org/request/show/1141020
OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=286
2024-01-24 14:43:56 +00:00
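The pitfall described in the entry above is that the RPM conditional form `%{?name:value}` expands whenever `name` is defined, even when it is defined as `0`. The Python sketch below only models that behaviour for illustration; it is not how rpm implements macro expansion, and the function names and the placeholder string `<ts_depends>` are hypothetical.

```python
# Toy model of RPM's %{?name:value} conditional; not rpm's implementation.
def expand_if_defined(macros, name, value):
    """Mimic %{?name:value}: expand to value when name is defined at all."""
    return value if name in macros else ""


def expand_if_nonzero(macros, name, value):
    """Test the macro's value as well, treating undefined and '0' as off."""
    return value if macros.get(name, "0") != "0" else ""


if __name__ == "__main__":
    macros = {"have_hdf5": "0"}  # defined, but switched off
    # The defined-only test still emits the dependency:
    print(repr(expand_if_defined(macros, "have_hdf5", "<ts_depends>")))
    # A value test suppresses it:
    print(repr(expand_if_nonzero(macros, "have_hdf5", "<ts_depends>")))
```

With `have_hdf5` defined as `0`, the first call still returns the dependency string while the second returns an empty string, which is the distinction the fix relies on.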
e59754da76 CVE-2023-49933, CVE-2023-49934, CVE-2023-49935, CVE-2023-49936
and CVE-2023-49937
  * Substantially overhauled the SlurmDBD association management
    code. For clusters updated to 23.11, account and user
    additions or removals are significantly faster than in prior
    releases.
  * Overhauled `scontrol reconfigure` to prevent configuration
    mistakes from disabling slurmctld and slurmd. Instead, an
    error will be returned, and the running configuration will
    persist. This does require updates to the systemd service
    files to use the `--systemd` option to `slurmctld` and `slurmd`.
  * Added a new internal `auth/cred` plugin - `auth/slurm`. This
    builds off the prior `auth/jwt` model, and permits operation
    of the `slurmdbd` and `slurmctld` without access to full
    directory information with a suitable configuration.
  * Added a new `--external-launcher` option to `srun`, which is
    automatically set by common MPI launcher implementations and
    ensures processes using those non-srun launchers have full
    access to all resources allocated on each node.
  * Reworked the dynamic/cloud modes of operation to allow for
    "fanout" - where Slurm communication can be automatically
    offloaded to compute nodes for increased cluster scalability.
  * Overhauled and extended the Reservation subsystem to allow
    for most of the same resource requirements as are placed on
    the job. Notably, this permits reservations to now reserve
    GRES directly.
  * Fix `scontrol update job=... TimeLimit+=/-=` when used with a
    raw JobId of job array element.
  * Reject `TimeLimit` increment/decrement when called on job with
    `TimeLimit=UNLIMITED`.

OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=285
2024-01-22 16:26:43 +00:00
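The last two items in the entry above concern `TimeLimit+=` / `TimeLimit-=` updates and their rejection for jobs with `TimeLimit=UNLIMITED`. The following Python sketch shows one way such a rule could look. It is an assumption-laden illustration, not Slurm's implementation: the function name is hypothetical, limits are assumed to be whole minutes, and Slurm's real time-format parsing is not modelled.

```python
# Hypothetical sketch of relative TimeLimit updates; not Slurm source code.
UNLIMITED = "UNLIMITED"


def adjust_time_limit(current_minutes, op, delta_minutes):
    """Apply '+=' or '-=' to a time limit given in minutes (assumed unit)."""
    if current_minutes == UNLIMITED:
        # A relative change has no meaning without a finite starting point.
        raise ValueError("cannot increment or decrement TimeLimit=UNLIMITED")
    if op == "+=":
        return current_minutes + delta_minutes
    if op == "-=":
        return current_minutes - delta_minutes
    raise ValueError(f"unsupported operator: {op}")


if __name__ == "__main__":
    print(adjust_time_limit(60, "+=", 30))
    try:
        adjust_time_limit(UNLIMITED, "+=", 30)
    except ValueError as exc:
        print("rejected:", exc)
```

Rejecting the request outright, rather than silently ignoring it, matches the wording "Reject" in the changelog.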
e7275730c8 Accepting request 1138332 from home:mslacken:branches:network:cluster
- Update to 23.11.1 with following major improvements and fixing
  CVE-2023-49933, CVE-2023-49934, CVE-2023-49935, CVE-2023-49936 and
  CVE-2023-49937
  * Substantially overhauled the SlurmDBD association management code. For
    clusters updated to 23.11, account and user additions or removals are
    significantly faster than in prior releases.
  * Overhauled 'scontrol reconfigure' to prevent configuration mistakes from
    disabling slurmctld and slurmd. Instead, an error will be returned, and the
    running configuration will persist. This does require updates to the
    systemd service files to use the --systemd option to slurmctld and slurmd.
  * Added a new internal auth/cred plugin - "auth/slurm". This builds off the
    prior auth/jwt model, and permits operation of the slurmdbd and slurmctld
    without access to full directory information with a suitable configuration.
  * Added a new --external-launcher option to srun, which is automatically set
    by common MPI launcher implementations and ensures processes using those
    non-srun launchers have full access to all resources allocated on each
    node.
  * Reworked the dynamic/cloud modes of operation to allow for "fanout" - where
    Slurm communication can be automatically offloaded to compute nodes for
    increased cluster scalability.
  * Added initial official Debian packaging support.
  * Overhauled and extended the Reservation subsystem to allow for most of the
    same resource requirements as are placed on the job. Notably, this permits
    reservations to now reserve GRES directly.
- Details of changes:
  * Fix scontrol update job=... TimeLimit+=/-= when used with a raw JobId of job
    array element.
  * Reject TimeLimit increment/decrement when called on job with
    TimeLimit=UNLIMITED.
  * Fix issue with requesting a job with  *licenses as well as

OBS-URL: https://build.opensuse.org/request/show/1138332
OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=284
2024-01-22 15:21:33 +00:00
Dominique Leuenberger
1f813cb386 Accepting request 1137045 from network:cluster
- Update to 23.02.6 to fix (CVE-2023-49933 - bsc#1218046, CVE-2023-49935 -
  bsc#1218049, CVE-2023-49936 - bsc#1218050, CVE-2023-49937 - bsc#1218051,
  CVE-2023-49938 - bsc#1218053)
  * Security Fixes:
    + Add `JobAcctGatherParams=DisableGPUAcct` to disable gpu accounting.
    + `acct_gather_energy/ipmi` - Improve logging of DCMI issues.
    + `gpu/oneapi` - Add support for new env vars `ZE_FLAT_DEVICE_HIERARCHY`
      and `ZE_ENABLE_PCI_ID_DEVICE_ORDER`.
    + `data_parser/v0.0.39` - skip empty string when parsing QOS ids.
    + Remove error message from `assoc_mgr_update_assocs` when purposefully
      resetting the default QOS.
  * Bug Fixes:
    + `libslurm_nss` - Avoid causing glibc to assert due to an unexpected
      return from slurm_nss due to an error during lookup.
    + Fix job requests with `--tres-per-task` sometimes resulting in bad
      allocations that cannot run subsequent job steps.
    + Fix issue with `slurmd` where `srun` fails to be warned when a node
      prolog script runs beyond `MsgTimeout` set in `slurm.conf`.
    + `gres/shard` - Fix plugin functions to have matching parameter orders.
    + `gpu/nvml` - Fix issue that resulted in the wrong MIG devices being
      constrained to a job
    + `gpu/nvml` - Fix linking issue with MIGs that prevented multiple MIGs
      being used in a single job for certain MIG configurations
    + Fix file descriptor leak in slurmd when using `acct_gather_energy/ipmi`
      with DCMI devices.
    + `sview` - avoid crash when job has a node list string > 49 characters.
    + Prevent `slurmctld` crash during reconfigure when packing job start
      messages.
    + Preserve reason uid on reconfig.
    + Update node reason with updated `INVAL` state reason if different from (forwarded request 1136624 from eeich)

OBS-URL: https://build.opensuse.org/request/show/1137045
OBS-URL: https://build.opensuse.org/package/show/openSUSE:Factory/slurm?expand=0&rev=102
2024-01-05 20:45:15 +00:00
af603b8163 Accepting request 1136624 from home:eeich:branches:network:cluster
- Update to 23.02.6 to fix (CVE-2023-49933 - bsc#1218046, CVE-2023-49935 -
  bsc#1218049, CVE-2023-49936 - bsc#1218050, CVE-2023-49937 - bsc#1218051,
  CVE-2023-49938 - bsc#1218053)
  * Security Fixes:
    + Add `JobAcctGatherParams=DisableGPUAcct` to disable gpu accounting.
    + `acct_gather_energy/ipmi` - Improve logging of DCMI issues.
    + `gpu/oneapi` - Add support for new env vars `ZE_FLAT_DEVICE_HIERARCHY`
      and `ZE_ENABLE_PCI_ID_DEVICE_ORDER`.
    + `data_parser/v0.0.39` - skip empty string when parsing QOS ids.
    + Remove error message from `assoc_mgr_update_assocs` when purposefully
      resetting the default QOS.
  * Bug Fixes:
    + `libslurm_nss` - Avoid causing glibc to assert due to an unexpected
      return from slurm_nss due to an error during lookup.
    + Fix job requests with `--tres-per-task` sometimes resulting in bad
      allocations that cannot run subsequent job steps.
    + Fix issue with `slurmd` where `srun` fails to be warned when a node
      prolog script runs beyond `MsgTimeout` set in `slurm.conf`.
    + `gres/shard` - Fix plugin functions to have matching parameter orders.
    + `gpu/nvml` - Fix issue that resulted in the wrong MIG devices being
      constrained to a job
    + `gpu/nvml` - Fix linking issue with MIGs that prevented multiple MIGs
      being used in a single job for certain MIG configurations
    + Fix file descriptor leak in slurmd when using `acct_gather_energy/ipmi`
      with DCMI devices.
    + `sview` - avoid crash when job has a node list string > 49 characters.
    + Prevent `slurmctld` crash during reconfigure when packing job start
      messages.
    + Preserve reason uid on reconfig.
    + Update node reason with updated `INVAL` state reason if different from

OBS-URL: https://build.opensuse.org/request/show/1136624
OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=282
2024-01-05 12:29:13 +00:00
Ana Guerrero
0db8ed8d95 Accepting request 1130097 from network:cluster
- Add missing service file for slurmrestd (boo#1217711). (forwarded request 1130096 from eeich)

OBS-URL: https://build.opensuse.org/request/show/1130097
OBS-URL: https://build.opensuse.org/package/show/openSUSE:Factory/slurm?expand=0&rev=101
2023-12-04 21:59:28 +00:00
bbe01bb79f Accepting request 1130096 from home:eeich:branches:network:cluster
- Add missing service file for slurmrestd (boo#1217711).

OBS-URL: https://build.opensuse.org/request/show/1130096
OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=280
2023-11-30 19:27:08 +00:00
5a1d72f62c Accepting request 1129638 from home:eeich:branches:network:cluster
- Explicitly create an Obsoletes: entry for each package version
  that is obsoleted by the present version. These are all published
  versions of the last two major releases as well as all minor
  versions of the present release lower than the current one
  (bsc#1216869 2nd part).
  This prevents the current version from upgrading an old Slurm version
  for which no upgrade path exists.
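
  The idea of listing every obsoleted version explicitly can be sketched
  as follows. This is only a minimal illustration in Python: the helper
  name `obsoletes_entries`, the package name and the version list are
  hypothetical placeholders, and the actual spec file may generate these
  entries differently (for example with RPM macros).

```python
def obsoletes_entries(package, versions):
    """Return one RPM 'Obsoletes:' line per explicitly listed version.

    Only the versions passed in are obsoleted, so any version that is not
    listed (e.g. one with no supported upgrade path) is left untouched.
    """
    return [f"Obsoletes: {package} = {version}" for version in versions]


if __name__ == "__main__":
    # Hypothetical version list, for illustration only.
    published = ["22.05.10", "23.02.5", "23.02.6"]
    for line in obsoletes_entries("slurm", published):
        print(line)
```

  Because each entry pins an exact version, an installed version that is
  absent from the list is not replaced by the new package.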

OBS-URL: https://build.opensuse.org/request/show/1129638
OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=279
2023-11-28 18:02:52 +00:00
Ana Guerrero
1e8971e87a Accepting request 1129192 from network:cluster
Automatic submission by obs-autosubmit

OBS-URL: https://build.opensuse.org/request/show/1129192
OBS-URL: https://build.opensuse.org/package/show/openSUSE:Factory/slurm?expand=0&rev=100
2023-11-27 21:44:42 +00:00
db15cbcf3e - On SLE-12 exclude build for s390x.
OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=277
2023-11-20 15:31:39 +00:00
Ana Guerrero
ccb26326c7 Accepting request 1123596 from network:cluster
- Add missing dependencies to slurm-config to plugins package.
  These should help to tie down the slurm version and help to avoid
  a package mix (bsc#1216869). (forwarded request 1123595 from eeich)

OBS-URL: https://build.opensuse.org/request/show/1123596
OBS-URL: https://build.opensuse.org/package/show/openSUSE:Factory/slurm?expand=0&rev=99
2023-11-06 20:14:38 +00:00
961668403a Accepting request 1123595 from home:eeich:branches:network:cluster
- Add missing dependencies to slurm-config to plugins package.
  These should help to tie down the slurm version and help to avoid
  a package mix (bsc#1216869).

OBS-URL: https://build.opensuse.org/request/show/1123595
OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=275
2023-11-06 14:56:24 +00:00
Dominique Leuenberger
b28d182fe8 Accepting request 1121548 from network:cluster
Automatic submission by obs-autosubmit

OBS-URL: https://build.opensuse.org/request/show/1121548
OBS-URL: https://build.opensuse.org/package/show/openSUSE:Factory/slurm?expand=0&rev=98
2023-11-01 21:09:57 +00:00
c9c235c313 Format fix to changes file:
`GET /slurmdb/v0.0.39/assocations` and `GET /slurmdb/v0.0.39/qos` to

OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=273
2023-10-25 07:12:31 +00:00
Ana Guerrero
150d433676 Accepting request 1118220 from network:cluster
- update to 23.02.6 to fix (CVE-2023-41914, bsc#1216207)

OBS-URL: https://build.opensuse.org/request/show/1118220
OBS-URL: https://build.opensuse.org/package/show/openSUSE:Factory/slurm?expand=0&rev=97
2023-10-17 18:24:48 +00:00
37c34593a9 - update to 23.02.6 to fix (CVE-2023-41914, bsc#1216207)
OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=271
2023-10-17 08:09:39 +00:00
Ana Guerrero
f946358d8c Accepting request 1117163 from network:cluster
- update to 23.02.6 to fix (CVE-2023-41914)
  * Removed Fix-test-32.8.patch as fixed upstream
  * Bug Fixes:
    + Fix `CpusPerTres=` not updatable with scontrol update
    + Fix unintentional gres removal when validating the gres job state.
    + Fix `--without-hpe-slingshot` configure option.
    + Fix cgroup v2 memory calculations when transparent huge pages are used.
    + Fix parsing of `sgather --timeout` option.
    + Fix regression from 22.05.0 that caused `srun --cpu-bind "=verbose"`
      and `"=v"` options to give different CPU bind masks.
    + Fix "_find_node_record: lookup failure for node" error message appearing
      for all dynamic nodes during reconfigure.
    + Avoid segfault if loading serializer plugin fails.
    + `slurmrestd` - Correct OpenAPI format for `GET /slurm/v0.0.39/licenses`.
    + `slurmrestd` - Correct OpenAPI format for
      `GET /slurm/v0.0.39/job/{job_id}`.
    + `slurmrestd` - Change format to multiple fields in
      `GET /slurmdb/v0.0.39/assocations` and `GET /slurmdb/v0.0.39/qos` to
      handle infinite and unset states.
    + When a node fails in a job with `--no-kill`, preserve the extern step on the
      remaining nodes to avoid breaking features that rely on the extern step
      such as `pam_slurm_adopt`, `x11`, and `job_container/tmpfs`.
    + `auth/jwt` - Ignore `x5c` field in JWKS files.
    + `auth/jwt` - Treat 'alg' field as optional in JWKS files.
    + Allow job_desc.selinux_context to be read from the job_submit.lua script.
    + Skip check in slurmstepd that causes a large number of errors in the
      munge log: "Unauthorized credential for client UID=0 GID=0".
      This error will still appear on `slurmd`/`slurmctld`/`slurmdbd` start up
      and is not a cause for concern.
    + `slurmctld` - Allow startup with zero partitions.
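
  The two `auth/jwt` entries above describe how fields in a JWKS file are
  handled. A minimal Python sketch of that behaviour is shown below; it is
  not Slurm's implementation, and the function name `load_jwks_keys` and the
  returned dictionary layout are assumptions made for illustration.

```python
import json


def load_jwks_keys(text):
    """Parse a JWKS document, ignoring 'x5c' and treating 'alg' as optional."""
    keys = []
    for entry in json.loads(text).get("keys", []):
        entry = dict(entry)
        entry.pop("x5c", None)  # certificate chain is present but ignored
        keys.append(
            {
                "kid": entry.get("kid"),
                "kty": entry.get("kty"),
                "alg": entry.get("alg"),  # None when the key omits 'alg'
            }
        )
    return keys


if __name__ == "__main__":
    # Hypothetical JWKS content: one key with 'x5c' and without 'alg'.
    sample = '{"keys": [{"kid": "key1", "kty": "RSA", "x5c": ["MIIB"]}]}'
    print(load_jwks_keys(sample))
```

  A key that omits `alg` is still accepted, and the presence of `x5c` does
  not affect the result.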

OBS-URL: https://build.opensuse.org/request/show/1117163
OBS-URL: https://build.opensuse.org/package/show/openSUSE:Factory/slurm?expand=0&rev=96
2023-10-12 21:41:42 +00:00
449ea49bf9 - Fix changes file formatting
OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=269
2023-10-12 10:02:10 +00:00
cd2c5bfc50 Accepting request 1117145 from home:mslacken:branches:network:cluster
* Bug Fixes:
   + Fix CpusPerTres= not updatable with scontrol update
   + Fix unintentional gres removal when validating the gres job state.
   + Fix --without-hpe-slingshot configure option.
   + Fix cgroup v2 memory calculations when transparent huge pages are used.
   + Fix parsing of sgather --timeout option.
   + Fix regression from 22.05.0 that caused srun --cpu-bind "=verbose" and "=v"
     options to give different CPU bind masks.
   + Fix "_find_node_record: lookup failure for node" error message appearing
     for all dynamic nodes during reconfigure.
   + Avoid segfault if loading serializer plugin fails.
   + slurmrestd - Correct OpenAPI format for 'GET /slurm/v0.0.39/licenses'.
   + slurmrestd - Correct OpenAPI format for 'GET /slurm/v0.0.39/job/{job_id}'.
   + slurmrestd - Change format to multiple fields in 'GET
     /slurmdb/v0.0.39/assocations' and 'GET /slurmdb/v0.0.39/qos' to handle
     infinite and unset states.
   + When a node fails in a job with --no-kill, preserve the extern step on the
     remaining nodes to avoid breaking features that rely on the extern step
     such as pam_slurm_adopt, x11, and job_container/tmpfs.
   + auth/jwt - Ignore 'x5c' field in JWKS files.
   + auth/jwt - Treat 'alg' field as optional in JWKS files.
   + Allow job_desc.selinux_context to be read from the job_submit.lua script.
   + Skip check in slurmstepd that causes a large number of errors in the munge
     log: "Unauthorized credential for client UID=0 GID=0".  This error will
     still appear on slurmd/slurmctld/slurmdbd start up and is not a cause for
     concern.
   + slurmctld - Allow startup with zero partitions.
   + Fix some mig profile names in slurm not matching nvidia mig profiles.
   + Prevent slurmscriptd processing delays from blocking other threads in
     slurmctld while trying to launch {Prolog|Epilog}Slurmctld.

OBS-URL: https://build.opensuse.org/request/show/1117145
OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=268
2023-10-12 09:09:32 +00:00
90bba6a8aa Accepting request 1117137 from home:mslacken:branches:network:cluster
- update to 23.02.6 to fix (CVE-2023-41914) 
  * Removed Fix-test-32.8.patch as fixed upstream

OBS-URL: https://build.opensuse.org/request/show/1117137
OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=267
2023-10-12 08:49:44 +00:00
Dominique Leuenberger
12bf38b1d0 Accepting request 1111943 from network:cluster
- Updated to version 23.02.5 with the following changes:
  * Bug Fixes:
    + Revert a change in 23.02 where `SLURM_NTASKS` was no longer set in the
      job's environment when `--ntasks-per-node` was requested.
      The method by which it is set, however, is different and should be more
      accurate in more situations.
    + Change pmi2 plugin to honor the `SrunPortRange` option. This matches the
      new behavior of the pmix plugin in 23.02.0. Note that neither of these
      plugins makes use of the `MpiParams=ports=` option, and previously
      were only limited by the system's ephemeral port range.
    + Fix regression in 23.02.2 that caused slurmctld -R to crash on startup if
      a node features plugin is configured.
    + Fix and prevent reoccurring reservations from overlapping.
    + `job_container/tmpfs` - Avoid attempts to share BasePath between nodes.
    + With `CR_Cpu_Memory`, fix node selection for jobs that request gres and
      `--mem-per-cpu`.
    + Fix a regression from 22.05.7 in which some jobs were allocated too few
      nodes, thus overcommitting cpus to some tasks.
    + Fix a job being stuck in the completing state if the job ends while the
      primary controller is down or unresponsive and the backup controller has
      not yet taken over.
    + Fix `slurmctld` segfault when a node registers with a configured
      `CpuSpecList` while `slurmctld` configuration has the node without
      `CpuSpecList`.
    + Fix cloud nodes getting stuck in `POWERED_DOWN+NO_RESPOND` state after
      not registering by `ResumeTimeout`.
    + `slurmstepd` - Avoid cleanup of `config.json`-less containers' spooldir
      getting skipped.
    + Fix scontrol segfault when 'completing' command requested repeatedly in
      interactive mode.

OBS-URL: https://build.opensuse.org/request/show/1111943
OBS-URL: https://build.opensuse.org/package/show/openSUSE:Factory/slurm?expand=0&rev=95
2023-09-20 11:26:46 +00:00
f0b994e220 plugins makes use of the MpiParams=ports= option, and previously
features with the `|` operator, which could prevent jobs from
    + `node_features/helpers` - Fix inconsistent handling of `&` and `|`,
      instead of just the current set. E.g. `foo|bar&baz` was interpreted
      `{foo} or {bar,baz}`.
      tasks fewer than GPUs, which resulted in incorrectly rejecting these
      jobs.
    + `slurmrestd` - For `GET /slurm/v0.0.39/node[s]`, change format of
      node's energy field `current_watts` to a dictionary to account for
    + `slurmrestd` - For `GET /slurm/v0.0.39/qos`, change format of QOS's
    + slurmrestd - For `GET /slurm/v0.0.39/job[s]`, the 'return code'
      `GET /slurmdb/v0.0.39/jobs` from slurmrestd.
      were present in the log: `error: Attempt to change gres/gpu Count`.
    + Hold the job with `(Reservation ... invalid)` state reason if the

OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=265
2023-09-18 05:43:58 +00:00
74529b6cc2 - Updated to version 23.02.5 with the following changes:
* Bug Fixes:
    + Revert a change in 23.02 where `SLURM_NTASKS` was no longer set in the
      job's environment when `--ntasks-per-node` was requested.
      The method by which it is set, however, is different and should be more
      accurate in more situations.
    + Change pmi2 plugin to honor the `SrunPortRange` option. This matches the
      new behavior of the pmix plugin in 23.02.0. Note that neither of these
      plugins makes use of the `MpiParams=ports=` option, and previously
      were only limited by the system's ephemeral port range.
    + Fix regression in 23.02.2 that caused slurmctld -R to crash on startup if
      a node features plugin is configured.
    + Fix and prevent reoccurring reservations from overlapping.
    + `job_container/tmpfs` - Avoid attempts to share BasePath between nodes.
    + With `CR_Cpu_Memory`, fix node selection for jobs that request gres and
      `--mem-per-cpu`.
    + Fix a regression from 22.05.7 in which some jobs were allocated too few
      nodes, thus overcommitting cpus to some tasks.
    + Fix a job being stuck in the completing state if the job ends while the
      primary controller is down or unresponsive and the backup controller has
      not yet taken over.
    + Fix `slurmctld` segfault when a node registers with a configured
      `CpuSpecList` while `slurmctld` configuration has the node without
      `CpuSpecList`.
    + Fix cloud nodes getting stuck in `POWERED_DOWN+NO_RESPOND` state after
      not registering by `ResumeTimeout`.
    + `slurmstepd` - Avoid cleanup of `config.json`-less containers' spooldir
      getting skipped.
    + Fix scontrol segfault when 'completing' command requested repeatedly in
      interactive mode.

OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=264
2023-09-18 05:24:51 +00:00
Ana Guerrero
3825e9fab0 Accepting request 1110422 from network:cluster
- Create a macro for upgrade dependency to ensure uniform handling. (forwarded request 1110421 from eeich)

OBS-URL: https://build.opensuse.org/request/show/1110422
OBS-URL: https://build.opensuse.org/package/show/openSUSE:Factory/slurm?expand=0&rev=94
2023-09-12 19:02:53 +00:00
a323feff42 Accepting request 1110421 from home:eeich:branches:network:cluster
- Create a macro for upgrade dependency to ensure uniform handling.

OBS-URL: https://build.opensuse.org/request/show/1110421
OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=262
2023-09-12 04:52:56 +00:00
Ana Guerrero
3bcde4bfd9 Accepting request 1110259 from network:cluster
- Updated to 23.02.4 with the following changes:
  * Bug Fixes:
    + Fix main scheduler loop not starting after a failover to backup
      controller. Avoid slurmctld segfault when specifying
      `AccountingStorageExternalHost` (bsc#1214983).
    + Fix sbatch return code when `--wait` is requested on a job array.
    + Fix collected `GPUUtilization` values for `acct_gather_profile` plugins.
    + Fix `slurmrestd` handling of job hold/release operations.
    + Fix step running indefinitely when slurmctld takes more than
      `MessageTimeout` to respond. Now, `slurmctld` will cancel the step when
      detected, preventing following steps from getting stuck waiting for
      resources to be released.
    + Fix regression to make `job_desc.min_cpus` accurate again in `job_submit`
      when requesting a job with `--ntasks-per-node`.
    + Fix handling of `ArrayTaskThrottle` in backfill.
    + Fix regression in 23.02.2 when checking gres state on `slurmctld`
      startup or reconfigure. Gres changes in the configuration were not
      updated on slurmctld startup. On startup or reconfigure, these messages
      were present in the log: `error: Attempt to change gres/gpu Count`.
    + Fix potential double count of gres when dealing with limits.
    + Fix `slurmstepd` segfault when `ContainerPath` is not set in `oci.conf`
    + Fixed an issue where jobs requesting licenses were incorrectly rejected.
    + `scrontab` - Fix cutting off the final character of quoted variables.
    + `smail` - Fix issues where e-mails at job completion were not being sent.
    + `scontrol/slurmctld` - fix comma parsing when updating a reservation's
      nodes.
    + Fix `--gpu-bind=single` binding tasks to wrong gpus, leading to some gpus
      having more tasks than they should and other gpus being unused.
    + Fix regression in 23.02 that causes slurmstepd to crash when `srun`
      requests more than `TreeWidth` nodes in a step and uses the pmi2 or

OBS-URL: https://build.opensuse.org/request/show/1110259
OBS-URL: https://build.opensuse.org/package/show/openSUSE:Factory/slurm?expand=0&rev=93
2023-09-11 19:22:19 +00:00
f9646ba945 - Updated to 23.02.4 with the following changes:
* Bug Fixes:
    + Fix main scheduler loop not starting after a failover to backup
      controller. Avoid slurmctld segfault when specifying
      `AccountingStorageExternalHost` (bsc#1214983).
    + Fix sbatch return code when `--wait` is requested on a job array.
    + Fix collected `GPUUtilization` values for `acct_gather_profile` plugins.
    + Fix `slurmrestd` handling of job hold/release operations.
    + Fix step running indefinitely when slurmctld takes more than
      `MessageTimeout` to respond. Now, `slurmctld` will cancel the step when
      detected, preventing following steps from getting stuck waiting for
      resources to be released.
    + Fix regression to make `job_desc.min_cpus` accurate again in `job_submit`
      when requesting a job with `--ntasks-per-node`.
    + Fix handling of `ArrayTaskThrottle` in backfill.
    + Fix regression in 23.02.2 when checking gres state on `slurmctld`
      startup or reconfigure. Gres changes in the configuration were not
      updated on slurmctld startup. On startup or reconfigure, these messages
      were present in the log: `error: Attempt to change gres/gpu Count`.
    + Fix potential double count of gres when dealing with limits.
    + Fix `slurmstepd` segfault when `ContainerPath` is not set in `oci.conf`
    + Fixed an issue where jobs requesting licenses were incorrectly rejected.
    + `scrontab` - Fix cutting off the final character of quoted variables.
    + `smail` - Fix issues where e-mails at job completion were not being sent.
    + `scontrol/slurmctld` - fix comma parsing when updating a reservation's
      nodes.
    + Fix `--gpu-bind=single` binding tasks to wrong gpus, leading to some gpus
      having more tasks than they should and other gpus being unused.
    + Fix regression in 23.02 that causes slurmstepd to crash when `srun`
      requests more than `TreeWidth` nodes in a step and uses the pmi2 or

OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=260
2023-09-11 07:21:32 +00:00
Ana Guerrero
6b47182efe Accepting request 1109308 from network:cluster
- Fixes since 23.02.03:
  Highlights:
  * Fix main scheduler loop not starting after a failover to backup controller.
  * Avoid slurmctld segfault when specifying `AccountingStorageExternalHost`
    (bsc#1214983).
  Other:
  * Fix sbatch return code when `--wait` is requested on a job array.
  * Fix collected `GPUUtilization` values for `acct_gather_profile` plugins.
  * Fix `slurmrestd` handling of job hold/release operations.
  * Make spank `S_JOB_ARGV` item value hold the requested command `argv`
    instead of the `srun --bcast` value when `--bcast` requested (only in local
    context).
  * Fix step running indefinitely when slurmctld takes more than
    `MessageTimeout` to respond. Now, slurmctld will cancel the step when
    detected, preventing following steps from getting stuck waiting for
    resources to be released.
  * Fix regression to make `job_desc.min_cpus` accurate again in job_submit when
    requesting a job with `--ntasks-per-node`.
  * Fix handling of `ArrayTaskThrottle` in backfill.
  * Fix regression in 23.02.2 when checking gres state on `slurmctld` startup or
    reconfigure. Gres changes in the configuration were not updated on slurmctld
    startup. On startup or reconfigure, these messages were present in the log:
    `error: Attempt to change gres/gpu Count`.
  * Fix potential double count of gres when dealing with limits.
  * Fix slurmstepd segfault when ContainerPath is not set in `oci.conf`
  * Fixed an issue where jobs requesting licenses were incorrectly rejected.
  * `scrontab` - Fix cutting off the final character of quoted variables.
  * `smail` - Fix issues where e-mails at job completion were not being sent.
  * `scontrol/slurmctld` - fix comma parsing when updating a reservation's
    nodes.

OBS-URL: https://build.opensuse.org/request/show/1109308
OBS-URL: https://build.opensuse.org/package/show/openSUSE:Factory/slurm?expand=0&rev=92
2023-09-07 19:12:41 +00:00