From: Egbert Eich
Date: Tue Nov 20 09:22:15 2018 +0100
Subject: slurmctld: rerun agent_init() when backup controller takes over
Patch-mainline: Not yet
Git-commit: 21a7abc02e4a27cc64a213ba1fc8572a20e21ba9
References: bsc#1084917

A slurmctld backup controller often fails to clean up jobs which have
finished: the node appears in an 'IDLE+COMPLETING' state while
squeue -l still shows the job in a completing state.
This situation persists until the primary controller is restarted and
cleans up all tasks in 'COMPLETING' state.

This issue is caused by a race condition in the backup controller:
When the backup controller detects that the primary controller is
inaccessible, it will run through a restart cycle. To trigger the
shutdown of some entities, it will set slurmctld_config.shutdown_time
to a value != 0. Before continuing as the controller in charge, it
resets this variable to 0 again.
The agent which handles the request queue - from a separate thread -
wakes up periodically (every 2 seconds) and checks for things to do.
If it finds slurmctld_config.shutdown_time set to a value != 0, it
will terminate.
If this wakeup occurs in the 'takeover window' between the variable
being set to != 0 and reset to 0, the agent goes away and will no
longer be available to handle queued requests, as there is nothing
at the end of the 'takeover window' that would restart it.

This fix adds a restart of the agent by calling agent_init() after
slurmctld_config.shutdown_time has been reset to 0.
Should an agent still be running (because it didn't wake up during
the 'takeover window'), agent_init() will detect this itself.

Signed-off-by: Egbert Eich
---
 src/slurmctld/backup.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/src/slurmctld/backup.c b/src/slurmctld/backup.c
index de74513..2b4c74e 100644
--- a/src/slurmctld/backup.c
+++ b/src/slurmctld/backup.c
@@ -65,6 +65,7 @@
 #include "src/slurmctld/read_config.h"
 #include "src/slurmctld/slurmctld.h"
 #include "src/slurmctld/trigger_mgr.h"
+#include "src/slurmctld/agent.h"
 
 #define _DEBUG		0
 #define SHUTDOWN_WAIT	2	/* Time to wait for primary server shutdown */
@@ -258,6 +259,9 @@ void run_backup(slurm_trigger_callbacks_t *callbacks)
 		error("Unable to recover slurm state");
 		abort();
 	}
+	/* Reinit agent in case it has been terminated - agent_init()
+	   will check itself */
+	agent_init();
 	slurmctld_config.shutdown_time = (time_t) 0;
 	unlock_slurmctld(config_write_lock);
 	select_g_select_nodeinfo_set_all();
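
The race described in the commit message can be modelled in miniature with plain pthreads. This is only a sketch under stated assumptions, not Slurm code: every name below (sketch_agent, sketch_agent_init, sketch_takeover, sketch_agent_is_running, and the static variables) is hypothetical, and the takeover forces the agent to wake up inside the window with a broadcast so that the outcome is deterministic instead of depending on the 2 second timer.

```c
#include <pthread.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical stand-ins for slurmctld state; not Slurm's real definitions. */
static pthread_mutex_t agent_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t agent_cond = PTHREAD_COND_INITIALIZER;
static time_t shutdown_time = 0;
static bool agent_running = false;
static pthread_t agent_tid;

/* Agent loop: wake at least every 2 seconds, exit once shutdown_time != 0. */
static void *sketch_agent(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&agent_lock);
	while (shutdown_time == 0) {
		struct timespec ts = { .tv_sec = time(NULL) + 2, .tv_nsec = 0 };
		pthread_cond_timedwait(&agent_cond, &agent_lock, &ts);
		/* A real agent would drain its request queue here. */
	}
	agent_running = false;
	pthread_mutex_unlock(&agent_lock);
	return NULL;
}

/* Start the agent unless one is already running.
 * Returns true only if a new thread was started. */
bool sketch_agent_init(void)
{
	bool started = false;

	pthread_mutex_lock(&agent_lock);
	if (!agent_running &&
	    pthread_create(&agent_tid, NULL, sketch_agent, NULL) == 0) {
		agent_running = true;
		started = true;
	}
	pthread_mutex_unlock(&agent_lock);
	return started;
}

/* Report whether the agent thread is currently alive. */
bool sketch_agent_is_running(void)
{
	pthread_mutex_lock(&agent_lock);
	bool running = agent_running;
	pthread_mutex_unlock(&agent_lock);
	return running;
}

/* Model the takeover window. With restart_agent == false the agent that
 * woke up inside the window stays dead; with true it is started again
 * after shutdown_time has been reset. */
void sketch_takeover(bool restart_agent)
{
	pthread_mutex_lock(&agent_lock);
	shutdown_time = time(NULL);           /* open the takeover window */
	pthread_cond_broadcast(&agent_cond);  /* force a wakeup inside it */
	bool was_running = agent_running;
	pthread_mutex_unlock(&agent_lock);

	if (was_running)
		pthread_join(agent_tid, NULL);    /* agent saw != 0 and exited */

	pthread_mutex_lock(&agent_lock);
	shutdown_time = 0;                    /* close the takeover window */
	pthread_mutex_unlock(&agent_lock);

	if (restart_agent)
		sketch_agent_init();
}
```

In this model, sketch_takeover(false) corresponds to the behavior before the fix: the agent is gone afterwards and nothing brings it back. sketch_takeover(true) corresponds to the behavior the commit message describes, and a redundant sketch_agent_init() call is harmless because it returns without starting a second thread.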