slurm/slurmctld-rerun-agent_init-when-backup-controller-takes-over.patch
Egbert Eich f21d191e3c Accepting request 650545 from home:eeich:branches:network:cluster
- Added missing perl-base dependency.

- Moved HTML docs to doc package.

- Moved config man pages to a separate package: This way, they won't
  get installed on compute nodes.                                                                                                                                  

- Update to 18.08.3
  * Add new burst buffer state of "teardown-fail" to indicate the burst
    buffer teardown operation is failing on specific buffers.
  * Multiple backup slurmctld daemons can be configured
  * Enable jobs with zero node count for creation and/or deletion of persistent
    burst buffers.
  * Add "scontrol show dwstat" command to display Cray burst buffer status.
  * Add "GetSysStatus" option to burst_buffer.conf file.
  * Add node and partition configuration options of "CpuBind" to control
    default task binding.
  * Add "NumaCpuBind" option to knl.conf
  * Add sbatch "--batch" option to identify features required on batch node.
  * Add "BatchFeatures" field to output of "scontrol show job".
  * Add support for "--bb" option to sbatch command.
  * Add new SystemComment field to job data structure and database.
  * Expand reservation "flags" field from 32 to 64 bits.
  * Add job state flag of "SIGNALING" to avoid race condition.
  * Properly handle srun --will-run option when there are jobs in COMPLETING
    state.
  * Properly report who is signaling a step.
  * Don't combine updated reservation records in sreport's reservation report.
  * node_features plugin - Add suport for XOR & XAND of job constraints (node
    feature specifications).

OBS-URL: https://build.opensuse.org/request/show/650545
OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=75
2018-11-20 17:07:44 +00:00

59 lines
2.5 KiB
Diff

From: Egbert Eich <eich@suse.com>
Date: Tue Nov 20 09:22:15 2018 +0100
Subject: slurmctld: rerun agent_init() when backup controller takes over
Patch-mainline: Not yet
Git-commit: 21a7abc02e4a27cc64a213ba1fc8572a20e21ba9
References: bsc#1084917
A slurmctld backup controller often fails to clean up jobs which have
finished, the node appears in an 'IDLE+COMPLETING' state while squeue -l
still shows the job in a completing state.
This situation persists until the primary controller is restarted and
cleans up all tasks in 'COMPLETING' state.
This issue is caused by a race condition in the backup controller:
When the backup controller detects that the primary controller is
inaccessible, it will run thru a restart cycle. To trigger the shutdown
of some entities, it will set slurmctld_config.shutdown_time to a value
!= 0. Before continuing as the controller in charge, it resets this
variable to 0 again.
The agent which handles the request queue - from a separate thread -
wakes up periodically (in a 2 sec interval) and checks for things to do.
If it finds slurmctld_config.shutdown_time set to a value != 0, it will
terminate.
If this wakeup occurs in the 'takeover window' between the variable
being set to !=0 and reset to 0, the agent goes away and will no longer
be available to handle queued requests as there is nothing at the end
of the 'takeover window' that would restart it.
This fix adds a restart of the agent by calling agent_init() after
slurmctld_config.shutdown_time has been reset to 0.
Should an agent still be running (because it didn't wake up during the
'takeover window') it will be caught in agent_init().
Signed-off-by: Egbert Eich <eich@suse.com>
---
src/slurmctld/backup.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/src/slurmctld/backup.c b/src/slurmctld/backup.c
index de74513..2b4c74e 100644
--- a/src/slurmctld/backup.c
+++ b/src/slurmctld/backup.c
@@ -65,6 +65,7 @@
#include "src/slurmctld/read_config.h"
#include "src/slurmctld/slurmctld.h"
#include "src/slurmctld/trigger_mgr.h"
+#include "src/slurmctld/agent.h"
#define _DEBUG 0
#define SHUTDOWN_WAIT 2 /* Time to wait for primary server shutdown */
@@ -258,6 +259,9 @@ void run_backup(slurm_trigger_callbacks_t *callbacks)
error("Unable to recover slurm state");
abort();
}
+ /* Reinit agent in case it has been terminated - agent_init()
+ will check itself */
+ agent_init();
slurmctld_config.shutdown_time = (time_t) 0;
unlock_slurmctld(config_write_lock);
select_g_select_nodeinfo_set_all();