Dirk Mueller
6f2036b23f
- Add ocf-pull-request-63.patch and ocf-pull-request-64.patch: fixes to avoid moving master unnecessarily, and to make start notification handler more reliable.
- Add ocf-pull-request-66.patch: do not consider transient local failures as failures of remote nodes.

OBS-URL: https://build.opensuse.org/request/show/556717
OBS-URL: https://build.opensuse.org/package/show/network:messaging:amqp/rabbitmq-server?expand=0&rev=84
51 lines
2.2 KiB
From 21d14dbe7389c2d0cc8778476ba5c71ad5ad4406 Mon Sep 17 00:00:00 2001
From: Vincent Untz <vuntz@suse.com>
Date: Wed, 13 Dec 2017 12:34:31 +0100
Subject: [PATCH] OCF RA: Do not consider local failures as remote node
 problems

In is_clustered_with(), commands that we run to check if the node is
clustered with us, or partitioned with us may fail. When they fail, it
actually doesn't tell us anything about the remote node.

Until now, we were considering such failures as hints that the remote
node is not in a sane state with us. But doing so has pretty negative
impact, as it can cause rabbitmq to get restarted on the remote node,
causing quite some disruption.

So instead of doing this, ignore the error (it's still logged).

There was a comment in the code wondering what is the best behavior;
based on experience, I think preferring stability is the slightly more
acceptable poison between the two options.
---
 scripts/rabbitmq-server-ha.ocf | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)
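
The behavior described in the commit message can be illustrated with a minimal, standalone sketch. It is not the resource agent's code: query_running is a hypothetical stand-in for the local status query, the node names are made up, and plain echo replaces ocf_log. The point it shows is the three-way outcome: a failed local query is logged and ignored, while only an explicit negative answer marks the node as not clustered.

```shell
#!/bin/sh
# Hypothetical stand-in for the local status query (an assumption, not the
# agent's real command). On success it prints "true" or "false"; a non-zero
# exit status means the local query itself failed.
query_running() {
    case "$1" in
        node-ok)   echo true ;;
        node-down) echo false ;;
        *)         return 1 ;;   # simulate a transient local error
    esac
}

# Returns 0 if the node is still treated as clustered with us, 1 otherwise.
is_clustered_with_sketch() {
    node_name=$1
    seen_as_running=$(query_running "$node_name")
    rc=$?
    if [ "$rc" -ne 0 ]; then
        # Local failure: log it, but draw no conclusion about the remote node.
        echo "err: failed to check whether '$node_name' is running" >&2
    elif [ "$seen_as_running" != true ]; then
        # The query worked and gave a negative answer.
        echo "info: node $node_name is not running" >&2
        return 1
    fi
    return 0
}
```

Under these assumptions, calling is_clustered_with_sketch with node-ok or with an unknown name (the simulated local failure) returns 0, and only node-down returns 1. The partition check in the patch follows the same pattern.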

diff --git a/scripts/rabbitmq-server-ha.ocf b/scripts/rabbitmq-server-ha.ocf
index 87bb7d4..bc6a538 100755
--- a/scripts/rabbitmq-server-ha.ocf
+++ b/scripts/rabbitmq-server-ha.ocf
@@ -870,8 +870,8 @@ is_clustered_with()
     rc=$?
     if [ "$rc" -ne 0 ]; then
         ocf_log err "${LH} Failed to check whether '$node_name' is considered running by us"
-        # XXX Or should we give remote node benefit of a doubt?
-        return 1
+        # We had a transient local error; that doesn't mean the remote node is
+        # not part of the cluster, so ignore this
     elif [ "$seen_as_running" != true ]; then
         ocf_log info "${LH} Node $node_name is not running, considering it not clustered with us"
         return 1
@@ -882,8 +882,8 @@ is_clustered_with()
     rc=$?
     if [ "$rc" -ne 0 ]; then
         ocf_log err "${LH} Failed to check whether '$node_name' is partitioned with us"
-        # XXX Or should we give remote node benefit of a doubt?
-        return 1
+        # We had a transient local error; that doesn't mean the remote node is
+        # not partitioned with us, so ignore this
     elif [ "$seen_as_partitioned" != false ]; then
         ocf_log info "${LH} Node $node_name is partitioned from us"
         return 1