slurm/slurmctld-rerun-agent_init-when-backup-controller-takes-over.patch
Egbert Eich fafb5a0196 Accepting request 629226 from home:eeich:branches:network:cluster
- slurmctld-rerun-agent_init-when-backup-controller-takes-over.patch
  Fix an issue where the fallback controller will not be able to idle
  nodes after a failover when a process has terminated (bsc#1084917).

OBS-URL: https://build.opensuse.org/request/show/629226
OBS-URL: https://build.opensuse.org/package/show/network:cluster/slurm?expand=0&rev=65
2018-08-14 13:18:35 +00:00

From: Egbert Eich <eich@suse.com>
Date: Tue Jul 31 17:31:15 2018 +0200
Subject: slurmctld: rerun agent_init() when backup controller takes over
Patch-mainline: Not yet
Git-commit: 169d9522c89a10dcffbf1403c20b4e6249bac79b
References: bsc#1084917
A slurmctld backup controller often fails to clean up jobs which have
finished: the node appears in an 'IDLE+COMPLETING' state while squeue -l
still shows the job in a completing state.
This situation persists until the primary controller is restarted and
cleans up all tasks in 'COMPLETING' state.
This issue is caused by a race condition in the backup controller:
When the backup controller detects that the primary controller is
inaccessible, it will run through a restart cycle. To trigger the shutdown
of some entities, it will set slurmctld_config.shutdown_time to a value
!= 0. Before continuing as the controller in charge, it resets this
variable to 0 again.
The agent which handles the request queue - from a separate thread -
wakes up periodically (every 2 seconds) and checks for things to do.
If it finds slurmctld_config.shutdown_time set to a value != 0, it will
terminate.
If this wakeup occurs in the 'takeover window' between the variable
being set to != 0 and reset to 0, the agent goes away and will no longer
be available to handle queued requests as there is nothing at the end
of the 'takeover window' that would restart it.
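The race described above can be modeled deterministically, without real threads. The following is a minimal sketch, not Slurm code: struct ctld_model, agent_wakeup() and takeover_without_fix() are hypothetical names invented for illustration, and the wake_in_window flag stands in for the timing of the agent's periodic wakeup.

```c
#include <stdbool.h>
#include <time.h>

/* Hypothetical stand-in for the relevant slurmctld state (not the real structs). */
struct ctld_model {
	time_t shutdown_time;	/* != 0 means "shutdown in progress" */
	bool agent_running;	/* true while the agent thread is alive */
};

/* One periodic wakeup of the agent: it terminates if it sees a shutdown. */
static void agent_wakeup(struct ctld_model *m)
{
	if (m->agent_running && m->shutdown_time != 0)
		m->agent_running = false;
}

/*
 * Takeover sequence as it behaved before the fix.
 * wake_in_window selects whether an agent wakeup lands inside the window.
 * Returns true if the agent is still alive after the takeover.
 */
static bool takeover_without_fix(bool wake_in_window)
{
	struct ctld_model m = { .shutdown_time = 0, .agent_running = true };

	m.shutdown_time = (time_t) 1;	/* takeover window opens */
	if (wake_in_window)
		agent_wakeup(&m);	/* agent sees != 0 and exits */
	m.shutdown_time = (time_t) 0;	/* takeover window closes */
	agent_wakeup(&m);		/* later wakeups are harmless */

	return m.agent_running;
}
```

In this model takeover_without_fix(false) leaves the agent alive, while takeover_without_fix(true) leaves it dead with nothing to bring it back, which corresponds to the stuck 'COMPLETING' state.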
This fix adds a restart of the agent by calling agent_init() after
slurmctld_config.shutdown_time has been reset to 0.
Should an agent still be running (because it didn't wake up during the
'takeover window'), agent_init() will detect this itself.
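Why an unconditional restart call is safe can be sketched with a simplified model. This is an illustration under assumptions, not the actual agent_init() implementation: struct takeover_model, model_agent_init(), model_agent_wakeup() and takeover_with_fix() are hypothetical names, and the "already running" check is assumed to be a simple flag test.

```c
#include <stdbool.h>
#include <time.h>

/* Hypothetical model state; the real slurmctld structures differ. */
struct takeover_model {
	time_t shutdown_time;	/* != 0 means "shutdown in progress" */
	bool agent_running;	/* true while the agent is alive */
	int agent_starts;	/* how many times an agent was started */
};

/* Idempotent init (assumed behavior): start an agent only if none is running. */
static void model_agent_init(struct takeover_model *m)
{
	if (m->agent_running)
		return;
	m->agent_running = true;
	m->agent_starts++;
}

/* Periodic agent wakeup: terminate when a shutdown is flagged. */
static void model_agent_wakeup(struct takeover_model *m)
{
	if (m->agent_running && m->shutdown_time != 0)
		m->agent_running = false;
}

/* Takeover sequence with the fix: re-run init after the window closes. */
static struct takeover_model takeover_with_fix(bool wake_in_window)
{
	struct takeover_model m = { 0 };

	model_agent_init(&m);		/* normal startup */
	m.shutdown_time = (time_t) 1;	/* takeover window opens */
	if (wake_in_window)
		model_agent_wakeup(&m);	/* agent may exit here */
	m.shutdown_time = (time_t) 0;	/* takeover window closes */
	model_agent_init(&m);		/* the fix: restart if needed */

	return m;
}
```

In both cases the model ends with exactly one running agent: if the wakeup hit the window, a second agent is started; otherwise the extra init call is a no-op.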
Signed-off-by: Egbert Eich <eich@suse.com>
---
src/slurmctld/backup.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/src/slurmctld/backup.c b/src/slurmctld/backup.c
index 24ddcde..cf3bb43 100644
--- a/src/slurmctld/backup.c
+++ b/src/slurmctld/backup.c
@@ -65,6 +65,7 @@
#include "src/slurmctld/read_config.h"
#include "src/slurmctld/slurmctld.h"
#include "src/slurmctld/trigger_mgr.h"
+#include "src/slurmctld/agent.h"
#define SHUTDOWN_WAIT 2 /* Time to wait for primary server shutdown */
@@ -225,6 +226,9 @@ void run_backup(slurm_trigger_callbacks_t *callbacks)
abort();
}
slurmctld_config.shutdown_time = (time_t) 0;
+ /* Reinit agent in case it has been terminated - agent_init()
+ will check itself */
+ agent_init();
unlock_slurmctld(config_write_lock);
select_g_select_nodeinfo_set_all();