slurm/slurmctld-rerun-agent_init-when-backup-controller-takes-over.patch

From: Egbert Eich <eich@suse.com>
Date: Tue Nov 20 09:22:15 2018 +0100
Subject: slurmctld: rerun agent_init() when backup controller takes over
Patch-mainline: Not yet
Git-commit: 21a7abc02e4a27cc64a213ba1fc8572a20e21ba9
References: bsc#1084917

A slurmctld backup controller often fails to clean up jobs which have
finished, the node appears in an 'IDLE+COMPLETING' state while squeue -l
still shows the job in a completing state.
This situation persists until the primary controller is restarted and
cleans up all tasks in 'COMPLETING' state.
This issue is caused by a race condition in the backup controller:
When the backup controller detects that the primary controller is
inaccessible, it will run thru a restart cycle. To trigger the shutdown
of some entities, it will set slurmctld_config.shutdown_time to a value
!= 0. Before continuing as the controller in charge, it resets this
variable to 0 again.
The agent which handles the request queue - from a separate thread -
wakes up periodically (in a 2 sec interval) and checks for things to do.
If it finds slurmctld_config.shutdown_time set to a value != 0, it will
terminate.
If this wakeup occurs in the 'takeover window' between the variable
being set to !=0 and reset to 0, the agent goes away and will no longer
be available to handle queued requests as there is nothing at the end
of the 'takeover window' that would restart it.

This fix adds a restart of the agent by calling agent_init() after
slurmctld_config.shutdown_time has been reset to 0.
Should an agent still be running (because it didn't wake up during the
'takeover window') it will be caught in agent_init().

Signed-off-by: Egbert Eich <eich@suse.com>
---
 src/slurmctld/backup.c | 4 ++++
 1 file changed, 4 insertions(+)
diff --git a/src/slurmctld/backup.c b/src/slurmctld/backup.c
index de74513..2b4c74e 100644
--- a/src/slurmctld/backup.c
+++ b/src/slurmctld/backup.c
@@ -65,6 +65,7 @@
 #include "src/slurmctld/read_config.h"
 #include "src/slurmctld/slurmctld.h"
 #include "src/slurmctld/trigger_mgr.h"
+#include "src/slurmctld/agent.h"
 
 #define _DEBUG		0
 #define SHUTDOWN_WAIT	2	/* Time to wait for primary server shutdown */
@@ -258,6 +259,9 @@ void run_backup(slurm_trigger_callbacks_t *callbacks)
 		error("Unable to recover slurm state");
 		abort();
 	}
+	/* Reinit agent in case it has been terminated - agent_init()
+	   will check itself */
+	agent_init();
 	slurmctld_config.shutdown_time = (time_t) 0;
 	unlock_slurmctld(config_write_lock);
 	select_g_select_nodeinfo_set_all();