slurm/slurmctld-rerun-agent_init-when-backup-controller-takes-over.patch

Accepting request 629222 from home:eeich:branches:network:cluster

- Update to 17.11.9
  * Fix segfault in slurmctld when a job's node bitmap is NULL during
    a scheduling cycle. Primarily caused by EnforcePartLimits=ALL.
  * Remove erroneous unlock in acct_gather_energy/ipmi.
  * Enable support for hwloc version 2.0.1.
  * Fix 'srun -q' (--qos) option handling.
  * Fix socket communication issue that can lead to lost task
    completion messages, which will cause a permanently stuck srun
    process.
  * Handle creation of TMPDIR if environment variable is set or
    changed in a task prolog script.
  * Avoid node layout fragmentation if running with a fixed CPU count
    but without Sockets and CoresPerSocket defined.
  * burst_buffer/cray - Fix datawarp swap default pool overriding jobdw.
  * Fix incorrect job priority assignment for multi-partition job with
    different PriorityTier settings on the partitions.
  * Fix sinfo to print correct node state.
- When using a remote shared StateSaveLocation, slurmctld needs to be
  started after remote filesystems have become available. Add
  'remote-fs.target' to the 'After=' directive in slurmctld.service
  (boo#1103561).
- Update to 17.11.8
  * Fix incomplete RESPONSE_[RESOURCE|JOB_PACK]_ALLOCATION building path.
  * Do not allocate nodes that were marked down due to the node not
    responding by ResumeTimeout.
  * task/cray plugin - search for "mems" cgroup information in the file
    "cpuset.mems" then fall back to the file "mems".
  * Fix ipmi profile debug uninitialized variable.
  * PMIx: fixed the direct connect inline msg sending.

OBS-URL: https://build.opensuse.org/request/show/629222
OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=64

2018-08-14 15:00:16 +02:00
From: Egbert Eich <eich@suse.com>
Date: Tue Jul 31 17:31:15 2018 +0200
Subject: slurmctld: rerun agent_init() when backup controller takes over
Patch-mainline: Not yet
Git-commit: 169d9522c89a10dcffbf1403c20b4e6249bac79b
References:
A slurmctld backup controller often fails to clean up jobs which have
finished: the node appears in an 'IDLE+COMPLETING' state while
'squeue -l' still shows the job as completing.
This situation persists until the primary controller is restarted and
cleans up all tasks in the 'COMPLETING' state.
This issue is caused by a race condition in the backup controller:
when the backup controller detects that the primary controller is
inaccessible, it runs through a restart cycle. To trigger the shutdown
of some entities, it sets slurmctld_config.shutdown_time to a non-zero
value. Before continuing as the controller in charge, it resets this
variable to 0.
The agent which handles the request queue runs in a separate thread;
it wakes up periodically (at a 2-second interval) and checks for
things to do. If it finds slurmctld_config.shutdown_time set to a
non-zero value, it terminates.
If this wakeup occurs in the 'takeover window' between the variable
being set to non-zero and being reset to 0, the agent goes away and is
no longer available to handle queued requests, as nothing at the end
of the 'takeover window' restarts it.
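To make the race concrete, here is a minimal C sketch of the two
threads involved. It is illustrative only and not taken from the slurm
sources: slurmctld_config is reduced to a stub, and
process_queued_requests(), agent() and takeover() are hypothetical
stand-ins.

	#include <time.h>
	#include <unistd.h>

	/* illustrative stand-ins, not the real slurm definitions */
	static struct { volatile time_t shutdown_time; } slurmctld_config;
	static void process_queued_requests(void) { /* hypothetical work */ }

	/* agent thread: wakes up every 2 seconds, exits on shutdown */
	static void *agent(void *arg)
	{
		for (;;) {
			sleep(2);
			if (slurmctld_config.shutdown_time)
				return NULL;	/* terminates; nothing restarts it */
			process_queued_requests();
		}
	}

	/* backup controller, during takeover */
	static void takeover(void)
	{
		slurmctld_config.shutdown_time = time(NULL); /* shut entities down */
		/* ... wait for the primary's entities to terminate ... */
		slurmctld_config.shutdown_time = (time_t) 0; /* resume as controller */
		/* 'takeover window': an agent wakeup between the two assignments
		 * above kills the agent for good; this is the race. */
	}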
This fix restarts the agent by calling agent_init() after
slurmctld_config.shutdown_time has been reset to 0.
Should an agent still be running (because it did not wake up during
the 'takeover window'), this is caught in agent_init().
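For this to work, agent_init() must be safe to call while an agent is
still running. A sketch of such a guard, assuming a hypothetical
agent_running flag and spawn_agent_thread() helper rather than the
actual slurm bookkeeping:

	#include <pthread.h>
	#include <stdbool.h>

	static bool agent_running;	/* hypothetical guard flag */
	static pthread_mutex_t agent_mutex = PTHREAD_MUTEX_INITIALIZER;

	extern void spawn_agent_thread(void);	/* hypothetical helper */

	void agent_init(void)
	{
		pthread_mutex_lock(&agent_mutex);
		if (!agent_running) {
			agent_running = true;
			spawn_agent_thread();
		}
		/* else: an agent survived the takeover window, so this
		 * call is a harmless no-op */
		pthread_mutex_unlock(&agent_mutex);
	}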
Signed-off-by: Egbert Eich <eich@suse.com>
---
src/slurmctld/backup.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/src/slurmctld/backup.c b/src/slurmctld/backup.c
index 24ddcde..cf3bb43 100644
--- a/src/slurmctld/backup.c
+++ b/src/slurmctld/backup.c
@@ -65,6 +65,7 @@
 #include "src/slurmctld/read_config.h"
 #include "src/slurmctld/slurmctld.h"
 #include "src/slurmctld/trigger_mgr.h"
+#include "src/slurmctld/agent.h"
 
 #define SHUTDOWN_WAIT	2	/* Time to wait for primary server shutdown */
 
@@ -225,6 +226,9 @@ void run_backup(slurm_trigger_callbacks_t *callbacks)
 		abort();
 	}
 	slurmctld_config.shutdown_time = (time_t) 0;
+	/* Reinit agent in case it has been terminated - agent_init()
+	   will check itself */
+	agent_init();
 	unlock_slurmctld(config_write_lock);
 
 	select_g_select_nodeinfo_set_all();