Notes on debugging fencing agents

I recently had to debug my fence_ovh agent, which I suppose I will update soon. I am writing some notes specifically for fence_ovh, but most of them also apply to other fencing agents. If you are trying to develop your own fencing agent these notes might help you. The idea is not to have to learn everything again if I need to develop or improve a fencing agent myself in the future :).

Introduction

You can check the Fedora FAQ about Fencing to get a general idea of why fencing is needed in a High Availability (HA) system. Basically, you make sure that a node in your cluster that has been detected as not working is isolated. If you isolate the node itself (e.g. by turning its power off) you are doing node fencing. If you instead keep the node away from the common resources it can access (usually shared storage), e.g. an iSCSI LUN no longer accepting connections from the isolated node's IPs, you are doing resource fencing.
The Fedora Agent API wiki explains what a fencing agent is. Basically, I understand them as plugins for your HA system. What actually shuts your node down is usually a hardware device which you can communicate with over the network. So what you need in order to shut down that node is the series of commands that the hardware device understands as an order to power it off. On the other hand, the HA system needs to know the status of each node periodically. That same status can also be gathered from these hardware devices, which will tell us whether the server is on or off.
Update 14 July 2014: Although not official yet, I point you to the Fence Arch wiki page, where fence agents are explained in a different way than in the other documentation pages.
So you might begin to understand the fencing agent's role in all of that. Through the fencing agent, Linux-HA can monitor the status of the fencing hardware device, and it can also order that device to shut down a node if it detects that the node is offline.
As you might imagine, not all fencing devices work by electrically turning off a server, and not all fencing devices are hardware devices.

A fencing agent

As I said, you can check the Fedora Agent API wiki to learn what a fencing agent is supposed to do according to the standards enforced by linux-ha.
Basically, a fencing agent reads input lines from stdin in the form:

argument=value
#this line is ignored
argument2=value2

and then processes them. Depending on the fencing hardware device, it returns a different status.
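To make that input format concrete, here is a minimal sketch in Python of how such lines could be parsed. It is only an illustration of the protocol; real agents such as fence_ovh delegate this work to the fencing.py helper library mentioned below.

import sys

def parse_stdin_options(stream):
    """Parse 'argument=value' lines, skipping comments and blank lines."""
    options = {}
    for raw_line in stream:
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # comment or empty line: ignored
        name, _, value = line.partition("=")
        options[name.strip()] = value.strip()
    return options

if __name__ == "__main__":
    # e.g. echo -e "action=off\nplug=ns567890.ip-123-22-44.eu" | python thisfile.py
    print(parse_stdin_options(sys.stdin))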

Fence_ovh installation

I think we have had enough theory and it's getting boring, so let's begin with actual testing and debugging. As I said, this is about fence_ovh, so you are supposed to install a custom fence script (after having developed it).
First of all you need to install the fence_ovh dependencies, which on Proxmox currently means installing the python-suds package.
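python-suds is needed because fence_ovh talks to the OVH SOAP API through it. Roughly, the connection looks like the sketch below; note that the WSDL URL and the login() signature are assumptions for illustration only, so check the OVH SOAPI documentation and the fence_ovh source for the exact calls.

from suds.client import Client

# Assumed WSDL endpoint of the legacy OVH SOAP API; verify it against OVH docs.
OVH_WSDL = "https://www.ovh.com/soapi/soapi-re-1.63.wsdl"

def ovh_login(login, passwd):
    """Open a SOAP session against the OVH API (illustrative only)."""
    client = Client(OVH_WSDL)
    # Assumed signature: nic handle, password, language, multisession flag.
    session = client.service.login(login, passwd, "en", 0)
    return client, session
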
Usually there’s a folder in your system where all the present fence_xxx scripts are found. In proxmox system that folder is:

/usr/sbin/

so I copy fence_ovh there and, to make things easier, I also copy into the same directory the fencing.py library, which is an auxiliary library developed by the linux-ha people.
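For reference, an agent built on top of fencing.py usually follows a skeleton along these lines. This is a rough sketch: the exact helper names and signatures vary between fence-agents versions, so compare it with the fencing.py you just copied.

#!/usr/bin/python
import sys
import atexit

sys.path.append("/usr/sbin")  # wherever fencing.py was copied to
# Helpers provided by the fencing library; availability depends on the version.
from fencing import atexit_handler, check_input, process_input, fence_action

def get_power_status(conn, options):
    # Ask the fencing device whether the requested plug/node is on or off.
    return "on"  # placeholder

def set_power_status(conn, options):
    # Tell the fencing device to apply the requested action (on/off).
    pass

def main():
    device_opt = ["login", "passwd", "port"]
    atexit.register(atexit_handler)
    # process_input() reads argument=value pairs from stdin (or parses argv)
    # and check_input() validates them against the declared device options.
    options = check_input(device_opt, process_input(device_opt))
    conn = None  # here fence_ovh would open its OVH SOAP session
    result = fence_action(conn, options, set_power_status, get_power_status, None)
    sys.exit(result)

if __name__ == "__main__":
    main()
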
So now, if you want fence_ovh and its special options to be recognised by your linux-ha system, the first time you must run:

ccs_update_schema

and if you ever edit fence_ovh you should run again:

ccs_update_schema --force

so that, as I say, its options are taken into account.

fence_ovh testing based on script itself

First of all you need to make sure that fence_ovh is not being called by the linux-ha system. That means cluster.conf must not have the fence_ovh agent set up. That's usually the case when you begin to debug a fencing agent, but I write it down just in case.
Then you need to figure out which options you want to test fence_ovh with.
For me the options are going to be these:

verbose=true
nodename=ns567890.ip-123-22-44.eu
action=off
login=ab12345-ovh
passwd=TOPSECRET
power_wait=5
plug=ns567890.ip-123-22-44.eu

which roughly means:

  • Be verbose
  • Use ns567890.ip-123-22-44.eu as the nodename (not sure if I'm currently using that option)
  • The action to be performed is: off. In terms of fence_ovh that means instructing the OVH API to turn off the server given in the plug option.
  • Use my OVH login: ab12345-ovh to be able to log into the OVH SOAP API
  • Use my OVH password: TOPSECRET to be able to log into the OVH SOAP API
  • I guess this means waiting 5 seconds after issuing the power command. I'm not actually sure that this is implemented by the fence_ovh script itself.
  • The server to be turned off (or to have another action applied to it) is: ns567890.ip-123-22-44.eu

So… what command should you issue to test these particular options? We are going to feed these options into fence_ovh's standard input by using the echo command, separating each of them with a newline. Here it is:

cd /usr/sbin
echo -e -n "verbose=true\n\
nodename=ns567890.ip-123-22-44.eu\n\
action=off\n\
login=ab12345-ovh\n\
passwd=TOPSECRET\n\
power_wait=5\n\
plug=ns567890.ip-123-22-44.eu" | fence_ovh

The idea is that you write the command on a single line, as in this (non-working, shortened) example:

echo -e -n "nodename=ns567890.ip-123-22-44.eu\naction=off" | fence_ovh

I split the former command over multiple lines so that it fits on the web page.
Update 14 July 2014: It seems it's much easier to do these tests using command-line arguments instead of stdin. The manual pages explain each of the options.
So that's it. If the command finishes without any error, that means you are a very lucky person ;).
But that does not mean the execution was successful. Right after running it you need to gather the return code.
That is, you need to run:

echo $?

which will probably output 0, 1 or 2.
Once again, the Fedora Agent API wiki, and in particular its Agent Operations and Return Values section, will tell you whether it's OK.
In this case we have asked for the «off» operation. So if 0 is returned, it was successful: the node has been turned off (actually it reboots into OVH rescue mode). However, if 1 is returned, either the operation was not successful or somehow the SOAP API was not available, so the order could not be sent.
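If you repeat this test often, a small wrapper that feeds the options to the agent and translates the exit code can save some typing. A minimal sketch using only the Python standard library (it assumes fence_ovh is in the PATH and that the options below match your setup):

import subprocess

# Options fed to the agent on stdin, exactly like the echo example above.
OPTIONS = """\
action=off
login=ab12345-ovh
passwd=TOPSECRET
plug=ns567890.ip-123-22-44.eu
"""

def run_agent(agent="fence_ovh", options=OPTIONS):
    proc = subprocess.Popen([agent], stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    output, _ = proc.communicate(options.encode())
    print(output.decode())
    # Agent API convention: 0 = success, 1 = failure or unreachable device,
    # 2 = node reported as off (for status calls).
    print("agent exited with code", proc.returncode)
    return proc.returncode

if __name__ == "__main__":
    run_agent()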

fence_ovh testing based on fence_node

Once you have tested that your script works as the official API expects, you can use a more abstract tool to test it.
fence_node is a command-line tool that I think your Linux-HA system uses internally when fencing (but I might be wrong).
This command needs cluster.conf to be filled in so that fence_node knows which fencing agent should be called. But at the same time you have to ensure that the cluster is not running so that the two do not interfere with one another. I must admit I don't know how to do that. I would try running service rgmanager stop on all of the nodes, but that's just a wild guess that needs confirmation.
Once the requirements are met, I recommend running:

fence_node -vv ns567890.ip-123-22-44.eu

This command not only tells fence_node to fence that node but also makes it show the arguments that will be passed to your fencing agent, so you will see something like:

Fencing node ns567890.ip-123-22-44.eu
verbose=true
nodename=ns567890.ip-123-22-44.eu
action=off
login=ab12345-ovh
passwd=TOPSECRET
power_wait=5
plug=ns567890.ip-123-22-44.eu
[Output of fence_ovh fencing agent]

That way you will not be taken by surprise about what is going on.

fence_ovh testing on cluster.conf

Well, this is not really testing or debugging at all. Basically you set up your cluster.conf to use it and check the logs to see if it works as expected when you simulate a link going down.
So basically I edit:

/etc/pve/cluster.conf

so that it has:

<?xml version="1.0"?>
<cluster config_version="45" name="cluster-fm-ha-1">
  <cman keyfile="/var/lib/pve-cluster/corosync.authkey" transport="udpu"/>
  <fencedevices>
    <fencedevice agent="fence_ovh" plug="ns567890.ip-123-22-44.eu"
 login="ab12345-ovh" name="fence01" passwd="TOPSECRET" power_wait="5"/>
    <fencedevice agent="fence_ovh" plug="ns567891.ip-123-22-45.eu"
 login="ab12345-ovh" name="fence02" passwd="TOPSECRET" power_wait="5"/>
    <fencedevice agent="fence_ovh" plug="ns567892.ip-123-22-46.eu"
 login="ab12345-ovh" name="fence03" passwd="TOPSECRET" power_wait="5"/>
  </fencedevices>
  <clusternodes>
    <clusternode name="ns567890.ip-123-22-44.eu" nodeid="1" votes="1">
      <fence>
        <method name="1">
          <device action="off" name="fence01"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="ns567891.ip-123-22-45.eu" nodeid="2" votes="1">
      <fence>
        <method name="1">
          <device action="off" name="fence02"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="ns567892.ip-123-22-46.eu" nodeid="3" votes="1">
      <fence>
        <method name="1">
          <device action="off" name="fence03"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <rm>
    <pvevm autostart="1" vmid="100"/>
    <pvevm autostart="1" vmid="101"/>
  </rm>
</cluster>

This cluster.conf example matches the former commands as closely as possible and, as you can see, two virtual machines are set up as HA and three nodes are available. So any of the nodes might be detected as unavailable, and one of the other nodes will decide to turn it off.
So basically you check /var/log/cluster/corosync.log to see that a new cluster formation is detected, and also /var/log/cluster/fenced.log where, hopefully, the fencing agent's output will be.
I personally simulate a node being down by issuing:

/sbin/iptables -A INPUT -p udp --destination-port 5404 -j DROP
/sbin/iptables -A INPUT -p udp --destination-port 5405 -j DROP

on the node that I want to be considered down. Once again, I'm not sure if it's the best method for doing so.

Debugging on fenced.log

If you take a look at the current fence_ovh implementation you will see that it has several logging calls. I don't know where that output goes; I have not found it and I will have to ask the linux-ha people about it. I would take a look at the aforementioned /var/log/cluster/fenced.log and other log files such as daemon.log or syslog.
That logging should work in any of the testing procedures I have described here. Probably using something like logging=true as an option might enable it, but I will have to wait for the answer.
Update 14 July 2014: It seems that by default all the logging (at least in the most recent versions) goes to syslog. If you use either debug on stdin or --debug-file as a command-line argument, you can choose an alternate file for saving the logging output.
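As an illustration of that behaviour, this is roughly how an agent can send messages to syslog by default and mirror them to a debug file on request. This is a generic Python sketch, not the actual fence_ovh or fencing.py code.

import logging
from logging.handlers import SysLogHandler

def setup_agent_logging(debug_file=None):
    logger = logging.getLogger("fence_ovh")
    logger.setLevel(logging.DEBUG)
    # Default destination: the local syslog socket.
    logger.addHandler(SysLogHandler(address="/dev/log"))
    # Optional extra destination, e.g. when a debug file is requested.
    if debug_file:
        logger.addHandler(logging.FileHandler(debug_file))
    return logger

log = setup_agent_logging("/tmp/fence_ovh.debug")
log.debug("Powering off ns567890.ip-123-22-44.eu")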

About fence_ovh future

There are two development lines for fence_ovh. The one I am describing here is username and password based, is being updated to the latest Proxmox VE 3.2, and will be released soon. The other one is about using the new API capabilities. The first step would be to enable a user / hash pair that can perform only the OVH SOAP API functions needed by the fencing agent and, if possible, only on the machines that we want the cluster to control. That makes it easier to revoke that user / hash pair's permissions if it gets stolen, and it limits what a cracker can do with it: e.g. they won't be able to reinstall your server or see your bills.
For either release please check (or subscribe to) the Proxmox HA Cluster at OVH – Fencing thread at the Proxmox forum.
Then the idea is to submit the improvements to the Fedora people, who will update the Red Hat Linux-HA component.
Finally, I think I will have to convince the Proxmox people to adopt more fencing devices from the upstream fence-agents package, because its current Debian package already ships fence_ovh in its upstream source tarball, but it is not being built into a usable fencing agent!
That way, in the end, you would just have to take care of writing a correct cluster.conf file that uses the fence_ovh agent 😉