From: Egbert Eich
Date: Tue Jul 31 17:31:15 2018 +0200
Subject: slurmctld: rerun agent_init() when backup controller takes over
Patch-mainline: Not yet
Git-commit: 169d9522c89a10dcffbf1403c20b4e6249bac79b
References: bsc#1084917

A slurmctld backup controller often fails to clean up jobs which have
finished: the node appears in an 'IDLE+COMPLETING' state while
squeue -l still shows the job in a completing state.
This situation persists until the primary controller is restarted and
cleans up all tasks in 'COMPLETING' state.

This issue is caused by a race condition in the backup controller:
When the backup controller detects that the primary controller is
inaccessible, it will run through a restart cycle. To trigger the
shutdown of some entities, it will set slurmctld_config.shutdown_time
to a value != 0. Before continuing as the controller in charge, it
resets this variable to 0 again.
The agent which handles the request queue - from a separate thread -
wakes up periodically (at a 2 second interval) and checks for things
to do. If it finds slurmctld_config.shutdown_time set to a value != 0,
it will terminate.
If this wakeup occurs in the 'takeover window' between the variable
being set to != 0 and reset to 0, the agent goes away and will no
longer be available to handle queued requests, as there is nothing at
the end of the 'takeover window' that would restart it.

This fix adds a restart of the agent by calling agent_init() after
slurmctld_config.shutdown_time has been reset to 0.
Should an agent still be running (because it didn't wake up during the
'takeover window'), it will be caught in agent_init().
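The race described above can be modelled with a small, single-threaded
sketch. This is not Slurm code: shutdown_time, agent_running and the
sketch_* functions are hypothetical stand-ins for
slurmctld_config.shutdown_time, the agent thread and agent_init(), and
the agent's periodic wake-up is reduced to an explicit function call so
that the unlucky timing can be reproduced deterministically.

```c
#include <stdbool.h>
#include <time.h>

/* Hypothetical stand-ins, not the real Slurm symbols. */
time_t shutdown_time = 0;	/* models slurmctld_config.shutdown_time */
bool agent_running = false;	/* models whether the agent thread is alive */

/* Idempotent start: does nothing if an agent is already running. */
void sketch_agent_init(void)
{
	if (agent_running)
		return;
	agent_running = true;
}

/* One periodic wake-up: the agent terminates if shutdown_time != 0. */
void sketch_agent_wakeup(void)
{
	if (agent_running && shutdown_time != 0)
		agent_running = false;
}

/*
 * Takeover sequence of the backup controller.
 * wake_in_window: the agent wakes up inside the takeover window.
 * with_fix:       re-run the agent init after the window has closed.
 * Returns whether an agent is running after the takeover.
 */
bool sketch_takeover(bool wake_in_window, bool with_fix)
{
	shutdown_time = time(NULL);	/* opens the takeover window */
	if (wake_in_window)
		sketch_agent_wakeup();
	shutdown_time = (time_t) 0;	/* closes the takeover window */
	if (with_fix)
		sketch_agent_init();
	return agent_running;
}
```

With an agent started beforehand, sketch_takeover(true, false) leaves no
agent running, while sketch_takeover(true, true) ends with one running.
If the agent did not wake up inside the window, the extra init call is a
no-op.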
Signed-off-by: Egbert Eich
---
 src/slurmctld/backup.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/src/slurmctld/backup.c b/src/slurmctld/backup.c
index 24ddcde..cf3bb43 100644
--- a/src/slurmctld/backup.c
+++ b/src/slurmctld/backup.c
@@ -65,6 +65,7 @@
 #include "src/slurmctld/read_config.h"
 #include "src/slurmctld/slurmctld.h"
 #include "src/slurmctld/trigger_mgr.h"
+#include "src/slurmctld/agent.h"
 
 #define SHUTDOWN_WAIT	2	/* Time to wait for primary server shutdown */
 
@@ -225,6 +226,9 @@ void run_backup(slurm_trigger_callbacks_t *callbacks)
 		abort();
 	}
 	slurmctld_config.shutdown_time = (time_t) 0;
+	/* Reinit agent in case it has been terminated - agent_init()
+	   will check itself */
+	agent_init();
 	unlock_slurmctld(config_write_lock);
 	select_g_select_nodeinfo_set_all();
 