{"id":2080,"date":"2026-06-29T11:03:42","date_gmt":"2026-06-29T05:33:42","guid":{"rendered":"https:\/\/adilfahim.com\/myblog\/sap-hana-hsr-pacemaker-ha-dr-setup\/"},"modified":"2026-06-29T13:04:14","modified_gmt":"2026-06-29T07:34:14","slug":"sap-hana-hsr-pacemaker-ha-dr-setup","status":"publish","type":"post","link":"https:\/\/adilfahim.com\/myblog\/sap-hana-hsr-pacemaker-ha-dr-setup\/","title":{"rendered":"SAP HANA HSR &#038; Pacemaker \u2014 Complete HA\/DR Cluster Setup Guide"},"content":{"rendered":"<p>If you&#8217;ve ever been woken up at 3 AM because your SAP HANA production node went down and the &#8220;high availability&#8221; setup didn&#8217;t quite live up to its name, you know why this post exists. Setting up real high availability for SAP HANA isn&#8217;t just about flipping a switch \u2014 it&#8217;s about understanding how System Replication and Pacemaker actually talk to each other, what happens when a network link flaps silently, and why your failover test in the DR lab might not survive contact with production reality.<\/p>\n<p>In my experience helping Basis teams deploy HANA HA\/DR clusters, the biggest pain point isn&#8217;t the initial setup \u2014 it&#8217;s making sure failover actually works six months later when you haven&#8217;t touched the cluster since go-live. 
This guide walks you through the entire stack: HSR modes, Pacemaker cluster configuration, fencing, resource agents, automatic failover, and the DR drill procedures your auditors want to see.<\/p>\n<div id=\"ez-toc-container\" class=\"ez-toc-v2_0_85 counter-hierarchy ez-toc-counter ez-toc-grey ez-toc-container-direction\">\n<p class=\"ez-toc-title\" style=\"cursor:inherit\">Table of Contents<\/p>\n<label for=\"ez-toc-cssicon-toggle-item-6a43f01c04796\" class=\"ez-toc-cssicon-toggle-label\"><span class=\"\"><span class=\"eztoc-hide\" style=\"display:none;\">Toggle<\/span><span class=\"ez-toc-icon-toggle-span\"><svg style=\"fill: #999;color:#999\" xmlns=\"http:\/\/www.w3.org\/2000\/svg\" class=\"list-377408\" width=\"20px\" height=\"20px\" viewBox=\"0 0 24 24\" fill=\"none\"><path d=\"M6 6H4v2h2V6zm14 0H8v2h12V6zM4 11h2v2H4v-2zm16 0H8v2h12v-2zM4 16h2v2H4v-2zm16 0H8v2h12v-2z\" fill=\"currentColor\"><\/path><\/svg><svg style=\"fill: #999;color:#999\" class=\"arrow-unsorted-368013\" xmlns=\"http:\/\/www.w3.org\/2000\/svg\" width=\"10px\" height=\"10px\" viewBox=\"0 0 24 24\" version=\"1.2\" baseProfile=\"tiny\"><path d=\"M18.2 9.3l-6.2-6.3-6.2 6.3c-.2.2-.3.4-.3.7s.1.5.3.7c.2.2.4.3.7.3h11c.3 0 .5-.1.7-.3.2-.2.3-.5.3-.7s-.1-.5-.3-.7zM5.8 14.7l6.2 6.3 6.2-6.3c.2-.2.3-.5.3-.7s-.1-.5-.3-.7c-.2-.2-.4-.3-.7-.3h-11c-.3 0-.5.1-.7.3-.2.2-.3.5-.3.7s.1.5.3.7z\"\/><\/svg><\/span><\/span><\/label><input type=\"checkbox\"  id=\"ez-toc-cssicon-toggle-item-6a43f01c04796\"  aria-label=\"Toggle\" \/><nav><ul class='ez-toc-list ez-toc-list-level-1 ' ><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-1\" href=\"https:\/\/adilfahim.com\/myblog\/sap-hana-hsr-pacemaker-ha-dr-setup\/#Architecture_Overview_Primary_Secondary_and_Quorum\" >Architecture Overview: Primary, Secondary, and Quorum<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-2\" 
href=\"https:\/\/adilfahim.com\/myblog\/sap-hana-hsr-pacemaker-ha-dr-setup\/#HSR_Replication_Modes_Choosing_the_Right_One\" >HSR Replication Modes: Choosing the Right One<\/a><ul class='ez-toc-list-level-3' ><li class='ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-3\" href=\"https:\/\/adilfahim.com\/myblog\/sap-hana-hsr-pacemaker-ha-dr-setup\/#Mode_1_Synchronous_sync\" >Mode 1: Synchronous (sync)<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-4\" href=\"https:\/\/adilfahim.com\/myblog\/sap-hana-hsr-pacemaker-ha-dr-setup\/#Mode_2_Synchronous_in_Memory_syncmem\" >Mode 2: Synchronous in Memory (syncmem)<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-5\" href=\"https:\/\/adilfahim.com\/myblog\/sap-hana-hsr-pacemaker-ha-dr-setup\/#Mode_3_Asynchronous_async\" >Mode 3: Asynchronous (async)<\/a><\/li><\/ul><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-6\" href=\"https:\/\/adilfahim.com\/myblog\/sap-hana-hsr-pacemaker-ha-dr-setup\/#Pacemaker_Cluster_Components\" >Pacemaker Cluster Components<\/a><ul class='ez-toc-list-level-3' ><li class='ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-7\" href=\"https:\/\/adilfahim.com\/myblog\/sap-hana-hsr-pacemaker-ha-dr-setup\/#Corosync_%E2%80%94_The_Membership_Layer\" >Corosync \u2014 The Membership Layer<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-8\" href=\"https:\/\/adilfahim.com\/myblog\/sap-hana-hsr-pacemaker-ha-dr-setup\/#Fencing_%E2%80%94_STONITH_and_Why_It_Matters\" >Fencing \u2014 STONITH and Why It Matters<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-9\" href=\"https:\/\/adilfahim.com\/myblog\/sap-hana-hsr-pacemaker-ha-dr-setup\/#Resource_Agents_and_Constraints\" >Resource Agents and Constraints<\/a><\/li><\/ul><\/li><li class='ez-toc-page-1 
ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-10\" href=\"https:\/\/adilfahim.com\/myblog\/sap-hana-hsr-pacemaker-ha-dr-setup\/#Step-by-Step_Cluster_Configuration\" >Step-by-Step Cluster Configuration<\/a><ul class='ez-toc-list-level-3' ><li class='ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-11\" href=\"https:\/\/adilfahim.com\/myblog\/sap-hana-hsr-pacemaker-ha-dr-setup\/#Step_1_Prerequisites_Checklist\" >Step 1: Prerequisites Checklist<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-12\" href=\"https:\/\/adilfahim.com\/myblog\/sap-hana-hsr-pacemaker-ha-dr-setup\/#Step_2_Enable_HANA_System_Replication\" >Step 2: Enable HANA System Replication<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-13\" href=\"https:\/\/adilfahim.com\/myblog\/sap-hana-hsr-pacemaker-ha-dr-setup\/#Step_3_Configure_the_Cluster_Infrastructure\" >Step 3: Configure the Cluster Infrastructure<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-14\" href=\"https:\/\/adilfahim.com\/myblog\/sap-hana-hsr-pacemaker-ha-dr-setup\/#Step_4_Configure_Fencing\" >Step 4: Configure Fencing<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-15\" href=\"https:\/\/adilfahim.com\/myblog\/sap-hana-hsr-pacemaker-ha-dr-setup\/#Step_5_Define_Cluster_Resources\" >Step 5: Define Cluster Resources<\/a><\/li><\/ul><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-16\" href=\"https:\/\/adilfahim.com\/myblog\/sap-hana-hsr-pacemaker-ha-dr-setup\/#Automatic_Failover_Testing_DR_Drills\" >Automatic Failover Testing &amp; DR Drills<\/a><ul class='ez-toc-list-level-3' ><li class='ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-17\" href=\"https:\/\/adilfahim.com\/myblog\/sap-hana-hsr-pacemaker-ha-dr-setup\/#Test_Scenario_1_Graceful_Primary_Failure\" >Test 
Scenario 1: Graceful Primary Failure<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-18\" href=\"https:\/\/adilfahim.com\/myblog\/sap-hana-hsr-pacemaker-ha-dr-setup\/#Test_Scenario_2_Hard_Crash_Kernel_Panic\" >Test Scenario 2: Hard Crash (Kernel Panic)<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-19\" href=\"https:\/\/adilfahim.com\/myblog\/sap-hana-hsr-pacemaker-ha-dr-setup\/#Test_Scenario_3_Network_Partition\" >Test Scenario 3: Network Partition<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-20\" href=\"https:\/\/adilfahim.com\/myblog\/sap-hana-hsr-pacemaker-ha-dr-setup\/#Test_Scenario_4_DR_Failover\" >Test Scenario 4: DR Failover<\/a><\/li><\/ul><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-21\" href=\"https:\/\/adilfahim.com\/myblog\/sap-hana-hsr-pacemaker-ha-dr-setup\/#Monitoring_the_Cluster\" >Monitoring the Cluster<\/a><ul class='ez-toc-list-level-3' ><li class='ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-22\" href=\"https:\/\/adilfahim.com\/myblog\/sap-hana-hsr-pacemaker-ha-dr-setup\/#Essential_Cluster_Health_Checks\" >Essential Cluster Health Checks<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-23\" href=\"https:\/\/adilfahim.com\/myblog\/sap-hana-hsr-pacemaker-ha-dr-setup\/#Replication_Lag_Monitoring\" >Replication Lag Monitoring<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-24\" href=\"https:\/\/adilfahim.com\/myblog\/sap-hana-hsr-pacemaker-ha-dr-setup\/#Integration_with_SAP_Solution_Manager\" >Integration with SAP Solution Manager<\/a><\/li><\/ul><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-25\" 
href=\"https:\/\/adilfahim.com\/myblog\/sap-hana-hsr-pacemaker-ha-dr-setup\/#Troubleshooting_Common_Issues\" >Troubleshooting Common Issues<\/a><ul class='ez-toc-list-level-3' ><li class='ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-26\" href=\"https:\/\/adilfahim.com\/myblog\/sap-hana-hsr-pacemaker-ha-dr-setup\/#Split-Brain_Both_Nodes_Primary\" >Split-Brain: Both Nodes Primary<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-27\" href=\"https:\/\/adilfahim.com\/myblog\/sap-hana-hsr-pacemaker-ha-dr-setup\/#Failed_Failover_Secondary_Wont_Promote\" >Failed Failover: Secondary Won&#8217;t Promote<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-3'><a class=\"ez-toc-link ez-toc-heading-28\" href=\"https:\/\/adilfahim.com\/myblog\/sap-hana-hsr-pacemaker-ha-dr-setup\/#Inactive_Replication_After_Maintenance\" >Inactive Replication After Maintenance<\/a><\/li><\/ul><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-29\" href=\"https:\/\/adilfahim.com\/myblog\/sap-hana-hsr-pacemaker-ha-dr-setup\/#Best_Practices_for_Production\" >Best Practices for Production<\/a><\/li><li class='ez-toc-page-1 ez-toc-heading-level-2'><a class=\"ez-toc-link ez-toc-heading-30\" href=\"https:\/\/adilfahim.com\/myblog\/sap-hana-hsr-pacemaker-ha-dr-setup\/#Conclusion\" >Conclusion<\/a><\/li><\/ul><\/nav><\/div>\n<h2><span class=\"ez-toc-section\" id=\"Architecture_Overview_Primary_Secondary_and_Quorum\"><\/span>Architecture Overview: Primary, Secondary, and Quorum<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Before touching a single config file, let&#8217;s get the architecture straight. 
A proper SAP HANA HSR + Pacemaker setup involves at minimum three nodes:<\/p>\n<ul>\n<li><strong>Primary node<\/strong> \u2014 handles all read\/write HANA workloads and replicates data to the secondary.<\/li>\n<li><strong>Secondary node<\/strong> \u2014 receives replication data and stands ready to take over if the primary fails.<\/li>\n<li><strong>Quorum\/Witness node<\/strong> \u2014 breaks ties in split-brain scenarios. In a two-node setup without a witness, you risk both nodes thinking they&#8217;re primary simultaneously \u2014 and that&#8217;s when your data diverges silently.<\/li>\n<\/ul>\n<p>The quorum node can be a lightweight VM \u2014 it doesn&#8217;t run HANA itself. It participates only in the Corosync voting ring. If you&#8217;re running on Azure or AWS, place the witness in a different availability zone from your HANA nodes. Think of it like a referee on the sidelines who can see both players \u2014 if the two players start arguing about who&#8217;s in charge, the referee makes the call.<\/p>\n<p>Pacemaker itself has two major components you need to understand: <strong>Corosync<\/strong> handles cluster communication and membership (who&#8217;s alive), while <strong>Pacemaker<\/strong> manages resources (what runs where and in what order). When someone says &#8220;Pacemaker failed over,&#8221; what actually happened is Corosync detected the node was gone, and Pacemaker&#8217;s policy engine decided to relocate the HANA resource agent to the secondary.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"HSR_Replication_Modes_Choosing_the_Right_One\"><\/span>HSR Replication Modes: Choosing the Right One<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>SAP HANA supports three HSR replication modes, and choosing correctly is the single most important design decision you&#8217;ll make. 
Each mode represents a different tradeoff between zero data loss and performance impact \u2014 there&#8217;s no universal &#8220;best&#8221; option.<\/p>\n<h3><span class=\"ez-toc-section\" id=\"Mode_1_Synchronous_sync\"><\/span>Mode 1: Synchronous (sync)<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>Synchronous replication means every transaction on the primary waits for confirmation that its log was received and persisted on the secondary before committing. This guarantees zero data loss (RPO = 0) but adds latency to every write operation \u2014 typically 1-5 milliseconds over a direct network link.<\/p>\n<p><strong>Use synchronous when:<\/strong> Your RPO is zero (no data loss acceptable), your network latency between nodes is under 2ms, and your workload can tolerate the additional commit delay. This is the default recommendation for most production SAP landscapes.<\/p>\n<h3><span class=\"ez-toc-section\" id=\"Mode_2_Synchronous_in_Memory_syncmem\"><\/span>Mode 2: Synchronous in Memory (syncmem)<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>Syncmem is a middle ground. The secondary confirms receipt of log pages to the primary before they&#8217;re persisted to disk on the secondary. Transactions commit faster than pure sync because the primary doesn&#8217;t wait for the secondary&#8217;s disk I\/O \u2014 it only waits for the log to land in the secondary&#8217;s memory.<\/p>\n<p>The catch? If the secondary crashes before flushing those log pages to disk, whatever was still in its memory is gone, which becomes real data loss if the primary fails at the same moment. In practice, the window is tiny (milliseconds), and SAP rates syncmem as near-zero data loss. I recommend this mode when synchronous adds more than 5ms of latency to your workload but you still need near-zero RPO.<\/p>\n<h3><span class=\"ez-toc-section\" id=\"Mode_3_Asynchronous_async\"><\/span>Mode 3: Asynchronous (async)<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>Asynchronous replication fires and forgets. The primary doesn&#8217;t wait for any acknowledgment from the secondary. 
This gives you the lowest latency on the primary side, but during a failover, you lose whatever transactions were in transit \u2014 potentially minutes of data.<\/p>\n<p><strong>Use async when:<\/strong> The replication link spans long distances (inter-region DR), network latency exceeds 10ms, or you have a tolerance for some data loss in exchange for performance. It&#8217;s common to see sync for intra-data-center HA paired with async for cross-region DR \u2014 a tiered approach.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Pacemaker_Cluster_Components\"><\/span>Pacemaker Cluster Components<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Now let&#8217;s talk about what makes the cluster actually work. Understanding these components will save you hours of &#8220;why didn&#8217;t it failover?&#8221; debugging later.<\/p>\n<h3><span class=\"ez-toc-section\" id=\"Corosync_%E2%80%94_The_Membership_Layer\"><\/span>Corosync \u2014 The Membership Layer<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>Corosync is the cluster communication layer. It manages node membership, message passing, and quorum decisions. 
In a HANA HA setup, Corosync runs on all nodes (including the witness) over a dedicated network interface \u2014 and yes, you should use a separate NIC for Corosync traffic.<\/p>\n<p>Key <code>corosync.conf<\/code> settings you&#8217;ll configure:<\/p>\n<ul>\n<li><code>totem.version: 2<\/code> \u2014 Ring protocol version<\/li>\n<li><code>secauth: on<\/code> \u2014 Encrypt cluster communication (enable this in production)<\/li>\n<li><code>interface ring0_addr<\/code> \u2014 Bind to your dedicated cluster network<\/li>\n<li><code>two_node: 1<\/code> \u2014 Needed only when the cluster has exactly two voting nodes and no witness<\/li>\n<li><code>quorum_votes<\/code> \u2014 Set the witness to 0 votes (non-voting) or adjust for expected failures<\/li>\n<\/ul>\n<h3><span class=\"ez-toc-section\" id=\"Fencing_%E2%80%94_STONITH_and_Why_It_Matters\"><\/span>Fencing \u2014 STONITH and Why It Matters<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>Fencing is the mechanism that guarantees only one node can access shared resources at any time. Without proper fencing, you get a split-brain scenario where both nodes think they&#8217;re the primary \u2014 and each starts accepting writes independently. This corrupts your data silently and is significantly worse than downtime.<\/p>\n<p>Pacemaker uses STONITH (Shoot The Other Node In The Head) fencing agents. Common agents include:<\/p>\n<ul>\n<li><strong>fence_aws \/ fence_azure_arm<\/strong> \u2014 Cloud API-based fencing for virtualized environments<\/li>\n<li><strong>fence_ipmilan<\/strong> \u2014 IPMI\/BMC-based physical server fencing<\/li>\n<li><strong>fence_sbd<\/strong> \u2014 SBD (STONITH Block Device) for shared-disk fencing<\/li>\n<\/ul>\n<p><strong>Critical rule:<\/strong> Never disable STONITH in production because &#8220;it&#8217;s causing problems.&#8221; STONITH triggering is a symptom, not the cause. 
If STONITH fires unexpectedly, fix the root cause (network flapping, resource starvation) rather than disabling the guard rail.<\/p>\n<h3><span class=\"ez-toc-section\" id=\"Resource_Agents_and_Constraints\"><\/span>Resource Agents and Constraints<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>Resource agents are scripts that know how to start, stop, and monitor a specific service \u2014 in our case, the SAP HANA instance. The <code>SAPHana<\/code> and <code>SAPHanaTopology<\/code> resource agents handle the HANA-specific operations: <code>SAPHanaTopology<\/code> gathers replication status and landscape information on each node, and <code>SAPHana<\/code> uses it to start, stop, promote, and demote the instance, including takeover.<\/p>\n<p>The key constraint properties you&#8217;ll configure:<\/p>\n<ul>\n<li><strong>colocation<\/strong> \u2014 HANA primary IP must run on the same node as the HANA primary role<\/li>\n<li><strong>ordering<\/strong> \u2014 Bring up the IP address before starting HANA, and reverse on shutdown<\/li>\n<li><strong>promotable resources<\/strong> \u2014 Define the master\/slave relationship managed by HSR<\/li>\n<\/ul>\n<h2><span class=\"ez-toc-section\" id=\"Step-by-Step_Cluster_Configuration\"><\/span>Step-by-Step Cluster Configuration<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Here&#8217;s the practical sequence I follow when standing up a new HANA HSR + Pacemaker cluster from scratch. Adjust for your specific OS (SLES or RHEL) and HANA version.<\/p>\n<h3><span class=\"ez-toc-section\" id=\"Step_1_Prerequisites_Checklist\"><\/span>Step 1: Prerequisites Checklist<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<ul>\n<li>Both HANA nodes installed with identical version, patch level, and parameter files<\/li>\n<li>Network: at least two network paths between nodes (production + replication)<\/li>\n<li>Time synchronized \u2014 NTP drift breaks Corosync. 
I&#8217;ve seen 60-second drift cause false fencing events.<\/li>\n<li>SSH key-based authentication between nodes (for HSR replication)<\/li>\n<li>SAP Host Agent installed and running on all nodes<\/li>\n<li>Log mounts accessible from both nodes (for shared log access during failover)<\/li>\n<\/ul>\n<h3><span class=\"ez-toc-section\" id=\"Step_2_Enable_HANA_System_Replication\"><\/span>Step 2: Enable HANA System Replication<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>On the primary node, enable HSR:<\/p>\n<pre><code>hdbsql -u SYSTEM -p &lt;password&gt; -i 00 \"ALTER SYSTEM ALTER CONFIGURATION ('global.ini', 'system') SET ('replication', 'mode') = 'logreplay' WITH RECONFIGURE\"<\/code><\/pre>\n<p>For synchronous mode:<\/p>\n<pre><code>hdbsql -u SYSTEM -p &lt;password&gt; -i 00 \"ALTER SYSTEM ALTER CONFIGURATION ('global.ini', 'system') SET ('replication', 'mode') = 'logreplay_sync' WITH RECONFIGURE\"<\/code><\/pre>\n<p>Then enable the primary as the replication source and name its site:<\/p>\n<pre><code>hdbnsutil -sr_enable --name=NODE01<\/code><\/pre>\n<p>On the secondary node, stop HANA and register it for replication from the primary:<\/p>\n<pre><code>hdbnsutil -sr_register --remoteHost=hananode01 --remoteInstance=00 --replicationMode=sync --operationMode=logreplay --name=NODE02<\/code><\/pre>\n<h3><span class=\"ez-toc-section\" id=\"Step_3_Configure_the_Cluster_Infrastructure\"><\/span>Step 3: Configure the Cluster Infrastructure<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>Install cluster packages (RHEL example, to match the <code>pcs<\/code> commands used below; on SLES the equivalents are installed with <code>zypper<\/code> and managed with <code>crmsh<\/code>):<\/p>\n<pre><code>yum install -y pcs pacemaker fence-agents-all resource-agents-sap-hana<\/code><\/pre>\n<p>Initialize the cluster on the primary:<\/p>\n<pre><code>pcs cluster auth hananode01 hananode02 -u hacluster -p &lt;password&gt;\r\npcs cluster setup --name hana-ha --start hananode01 hananode02 --enable<\/code><\/pre>\n<p>Set the cluster-wide quorum and fencing properties:<\/p>\n<pre><code>pcs property set no-quorum-policy=freeze\r\npcs property set stonith-enabled=true<\/code><\/pre>\n<p>The 
<code>freeze<\/code> policy means a partition that loses quorum keeps running the resources it already has but does not take over resources from nodes it can no longer see until quorum returns (for example, once the witness is reachable again).<\/p>\n<h3><span class=\"ez-toc-section\" id=\"Step_4_Configure_Fencing\"><\/span>Step 4: Configure Fencing<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>For cloud deployments (AWS example with IPMI fallback):<\/p>\n<pre><code>pcs stonith create fence-node1 fence_aws region=us-east-1 pcmk_host_map=hananode01:i-0abc123 op monitor interval=30s\r\npcs stonith create fence-node2 fence_aws region=us-east-1 pcmk_host_map=hananode02:i-0def456 op monitor interval=30s<\/code><\/pre>\n<p>Configure fencing levels so the cluster tries the cloud API agent first and falls back to the IPMI agent only if that fails (the <code>fence-ipmilan-*<\/code> resources are assumed to be defined separately):<\/p>\n<pre><code>pcs stonith level add 1 hananode01 fence-node1\r\npcs stonith level add 1 hananode02 fence-node2\r\npcs stonith level add 2 hananode01 fence-ipmilan-node1\r\npcs stonith level add 2 hananode02 fence-ipmilan-node2<\/code><\/pre>\n<h3><span class=\"ez-toc-section\" id=\"Step_5_Define_Cluster_Resources\"><\/span>Step 5: Define Cluster Resources<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>Create the HANA resource with promotable master\/slave semantics. This form assumes a RHEL 8-style <code>pcs<\/code>, where <code>promotable<\/code> creates a clone named <code>hana_SAPHana_PRD-clone<\/code>; the agent also expects a cloned <code>SAPHanaTopology<\/code> resource on both nodes, which is not shown here:<\/p>\n<pre><code>pcs resource create hana_SAPHana_PRD SAPHana SID=PRD InstanceNumber=00 PREFER_SITE_TAKEOVER=true AUTOMATED_REGISTER=false op start timeout=3600 op stop timeout=3600 op promote timeout=3600 op demote timeout=3600 op monitor interval=59 role=Master timeout=700 op monitor interval=61 role=Slave timeout=700 promotable notify=true clone-max=2 clone-node-max=1 interleave=true<\/code><\/pre>\n<p>Create the virtual IP that follows the primary:<\/p>\n<pre><code>pcs resource create vip_SAPHana_PRD IPaddr2 ip=10.0.1.100 op monitor interval=10s<\/code><\/pre>\n<p>Set colocation and ordering constraints:<\/p>\n<pre><code>pcs constraint colocation add vip_SAPHana_PRD with master hana_SAPHana_PRD-clone INFINITY\r\npcs constraint order stop hana_SAPHana_PRD-clone then stop vip_SAPHana_PRD symmetrical=false\r\npcs constraint order start vip_SAPHana_PRD then start hana_SAPHana_PRD-clone<\/code><\/pre>\n<h2><span class=\"ez-toc-section\" 
id=\"Automatic_Failover_Testing_DR_Drills\"><\/span>Automatic Failover Testing &amp; DR Drills<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>This is where I see teams cut corners \u2014 and it&#8217;s exactly where things go wrong in production. Running a proper DR drill means simulating real failure conditions, not just clicking &#8220;stop HANA&#8221; in HANA Studio.<\/p>\n<h3><span class=\"ez-toc-section\" id=\"Test_Scenario_1_Graceful_Primary_Failure\"><\/span>Test Scenario 1: Graceful Primary Failure<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<pre><code># On primary node\r\ncrm_mon -1r  # Note current layout (one-shot output)\r\npcs cluster stop hananode01  # Simulates clean shutdown<\/code><\/pre>\n<p>Expected: Secondary promotes to primary within 30-60 seconds. Verify on hananode02, as the <code>&lt;sid&gt;adm<\/code> user, that it now reports itself as primary:<\/p>\n<pre><code>hdbnsutil -sr_state<\/code><\/pre>\n<h3><span class=\"ez-toc-section\" id=\"Test_Scenario_2_Hard_Crash_Kernel_Panic\"><\/span>Test Scenario 2: Hard Crash (Kernel Panic)<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>This is the real test. On the primary node:<\/p>\n<pre><code>echo c > \/proc\/sysrq-trigger<\/code><\/pre>\n<p>This triggers a kernel panic immediately \u2014 no graceful shutdown. The cluster should detect the node loss within the configured <code>token<\/code> timeout (set in <code>corosync.conf<\/code>; the default depends on your Corosync version and distribution) plus fencing time. 
For a clean test, expected total failover time: 15-30 seconds with sync replication.<\/p>\n<p>Check the cluster log for the fencing event:<\/p>\n<pre><code>journalctl -u pacemaker --since \"5 minutes ago\" | grep stonith<\/code><\/pre>\n<h3><span class=\"ez-toc-section\" id=\"Test_Scenario_3_Network_Partition\"><\/span>Test Scenario 3: Network Partition<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>Simulate network failure by blocking Corosync traffic on the primary:<\/p>\n<pre><code>iptables -A INPUT -s hananode02 -j DROP\r\niptables -A OUTPUT -d hananode02 -j DROP<\/code><\/pre>\n<p>The quorum mechanism should handle this correctly. If you have a witness node, the primary with witness connection continues operating. Remove the iptables rule after testing \u2014 leaving it in will cause permanent split-brain.<\/p>\n<h3><span class=\"ez-toc-section\" id=\"Test_Scenario_4_DR_Failover\"><\/span>Test Scenario 4: DR Failover<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>For cross-site DR replication, test the full takeover procedure to your DR site. Document the time it takes \u2014 your RTO (Recovery Time Objective) number needs to be measured, not guessed. As I covered in my earlier post about <a href=\"https:\/\/adilfahim.com\/myblog\/sap-hana-backup-recovery-complete-guide\/\">SAP HANA Backup &amp; Recovery<\/a>, recovery time validation is something most teams skip until an auditor asks for evidence.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Monitoring_the_Cluster\"><\/span>Monitoring the Cluster<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>A cluster you can&#8217;t monitor is a cluster that fails silently. Set up monitoring from day one \u2014 don&#8217;t wait until after go-live.<\/p>\n<h3><span class=\"ez-toc-section\" id=\"Essential_Cluster_Health_Checks\"><\/span>Essential Cluster Health Checks<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<ul>\n<li><strong>crm_mon -rf<\/strong> \u2014 Real-time resource status. 
For unattended checks, run it in one-shot mode (<code>crm_mon -1rf<\/code>) from a cron job every minute to catch stop-start loops.<\/li>\n<li><strong>SAPHanaSR-showAttr<\/strong> \u2014 Check HSR replication status, sync state, and site-specific attributes.<\/li>\n<li><strong>systemReplicationStatus.py<\/strong> \u2014 Built-in SAP script that checks replication health. Its exit code encodes the state: 10 (no HSR), 11 (error), 12 (unknown), 13 (initializing), 14 (syncing), or 15 (active).<\/li>\n<\/ul>\n<h3><span class=\"ez-toc-section\" id=\"Replication_Lag_Monitoring\"><\/span>Replication Lag Monitoring<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>Watch the <code>logGap<\/code> metric between primary and secondary. If it grows beyond 10MB consistently, your network throughput or secondary write performance is the bottleneck. To see how far the shipped log position trails the last log position for each service, query the primary:<\/p>\n<pre><code>SELECT HOST, PORT, SECONDARY_HOST, REPLICATION_MODE, REPLICATION_STATUS, LAST_LOG_POSITION, SHIPPED_LOG_POSITION FROM M_SERVICE_REPLICATION WHERE REPLICATION_STATUS = 'ACTIVE'<\/code><\/pre>\n<h3><span class=\"ez-toc-section\" id=\"Integration_with_SAP_Solution_Manager\"><\/span>Integration with SAP Solution Manager<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>Register your HANA cluster in Solution Manager&#8217;s Technical Monitoring for central alerting. 
The <code>HDB_ALERT_MONITOR<\/code> configuration handles critical cluster events including failed failovers, fencing actions, and replication lag exceeding thresholds.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Troubleshooting_Common_Issues\"><\/span>Troubleshooting Common Issues<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>After building dozens of these clusters, here are the problems I see most often \u2014 and the fixes that aren&#8217;t in the SAP documentation:<\/p>\n<h3><span class=\"ez-toc-section\" id=\"Split-Brain_Both_Nodes_Primary\"><\/span>Split-Brain: Both Nodes Primary<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>Symptoms: Both nodes report <code>mode: primary<\/code> in <code>hdbnsutil -sr_state<\/code>, and applications report inconsistent data reads.<\/p>\n<p>Root cause: Usually a network partition between Corosync ring members, or a STONITH agent that failed to fire. Check for asymmetric routing \u2014 packets from node1 to node2 taking a different path than node2 to node1.<\/p>\n<p>Fix sequence:<\/p>\n<ol>\n<li>Identify which node has the most recent data (check <code>M_BACKUP_CATALOG<\/code> for latest timestamps)<\/li>\n<li>Manually demote the stale node: stop HANA on it, then re-register it as a secondary of the winning node with <code>hdbnsutil -sr_register<\/code>. Run <code>hdbnsutil -sr_takeover<\/code> on the winning node first only if it is not already in the primary role<\/li>\n<li>Investigate why STONITH didn&#8217;t fire \u2014 check agent logs, API credentials expiry (common in cloud environments)<\/li>\n<\/ol>\n<h3><span class=\"ez-toc-section\" id=\"Failed_Failover_Secondary_Wont_Promote\"><\/span>Failed Failover: Secondary Won&#8217;t Promote<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>Most common cause: The secondary&#8217;s HANA log replay is behind and Pacemaker won&#8217;t promote because the resource agent&#8217;s start operation times out. Check:<\/p>\n<pre><code>SAPHanaSR-showAttr --sid=PRD<\/code><\/pre>\n<p>Look for <code>sync_state<\/code> showing anything other than <code>SOK<\/code>. 
If the secondary is in a corrupt or incomplete state, you may need to re-sync from the current primary before attempting promotion.<\/p>\n<h3><span class=\"ez-toc-section\" id=\"Inactive_Replication_After_Maintenance\"><\/span>Inactive Replication After Maintenance<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>After a planned maintenance cycle where both nodes were briefly stopped, replication often shows &#8220;inactive.&#8221; This usually means the secondary doesn&#8217;t have the correct log position. Resolution:<\/p>\n<pre><code># On the secondary, with HANA stopped:\r\nhdbnsutil -sr_register --remoteHost=hananode01 --remoteInstance=00 --replicationMode=sync --operationMode=logreplay --name=NODE02<\/code><\/pre>\n<p>Verify the replication status returns to ACTIVE before considering the cluster &#8220;healthy&#8221; again.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Best_Practices_for_Production\"><\/span>Best Practices for Production<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>After everything above, here are the non-obvious things that separate a production-grade installation from a lab setup (for cloud-specific considerations, see my <a href=\"https:\/\/adilfahim.com\/myblog\/sap-on-azure-deployment-guide\/\">SAP on Azure Deployment Guide<\/a>):<\/p>\n<ul>\n<li><strong>Prefer the <code>logreplay<\/code> operation mode over <code>delta_datashipping<\/code>.<\/strong> Log replay continuously applies the shipped log on the secondary, which shortens takeover time and reduces the volume of data sent over the replication link.<\/li>\n<li><strong>Dedicate a network interface for HSR replication.<\/strong> Sharing the HSR NIC with production traffic creates latency spikes during peak workloads.<\/li>\n<li><strong>Set the correct <code>max_concurrency<\/code> for your replication.<\/strong> Default values assume uniform workload \u2014 if you have large batch jobs, increase the parallel log shipping threads.<\/li>\n<li><strong>Test your fencing agents monthly.<\/strong> Cloud API credentials rotate, IPMI firmware gets updated, and shared disk SBD devices get 
re-partitioned. Test that your fence agent actually works.<\/li>\n<li><strong>Document the runbook and store it offline.<\/strong> When your cluster is down and your wiki is on that same cluster, you can&#8217;t access the fix procedure. Keep a printed copy or use a separate documentation system.<\/li>\n<li><strong>Use the HANA Lifecycle Manager for updates.<\/strong> Put the cluster, or at least the HANA resources, into maintenance mode before patching a node so Pacemaker does not treat the planned restart as a failure. The cluster won&#8217;t enter maintenance mode for HSR on its own, so script it.<\/li>\n<li><strong>Monitor disk latency on the secondary.<\/strong> If the secondary&#8217;s storage is slower than the primary&#8217;s, replication lag will grow during heavy write workloads even if HSR shows &#8220;active.&#8221;<\/li>\n<\/ul>\n<p>For the sizing and performance math behind HANA memory and storage planning, I covered the detailed calculations in my <a href=\"https:\/\/adilfahim.com\/myblog\/sap-hana-dba-calculations-sizing-backup-memory-performance\/\">SAP HANA DBA Calculations Guide<\/a> \u2014 worth a read if you&#8217;re planning capacity.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Conclusion\"><\/span>Conclusion<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Setting up SAP HANA HSR with Pacemaker gives you genuine automatic failover capability \u2014 but only if you design, configure, and test it properly. The cluster doesn&#8217;t maintain itself. Split-brain scenarios, silent replication failures, and expired fencing credentials are all things that look fine in monitoring until the moment you actually need the cluster to save you.<\/p>\n<p>Start with a clear architecture decision (sync for zero data loss, async for DR, syncmem for the middle ground), configure fencing before you configure resources, and run actual failure tests \u2014 not just &#8220;stop the service&#8221; tests. Your future self at 3 AM will thank you.<\/p>\n<p><strong>Have you set up HANA HSR clusters in production? 
What was your biggest surprise during the first failover test?<\/strong> Drop a comment below or connect with me on LinkedIn \u2014 I&#8217;d love to hear what worked (or didn&#8217;t) in your environment.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Step-by-step guide to SAP HANA System Replication with Pacemaker cluster covering sync modes, fencing, automatic failover, DR drills, and troubleshooting.<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"om_disable_all_campaigns":false,"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"footnotes":""},"categories":[1241,1237],"tags":[1323,1320,1322,982,188,189],"class_list":["post-2080","post","type-post","status-publish","format-standard","hentry","category-sap-basis","category-sap-hana","tag-disaster-recovery","tag-high-availability","tag-hsr","tag-pacemaker","tag-sap-basis","tag-sap-hana"],"_links":{"self":[{"href":"https:\/\/adilfahim.com\/myblog\/wp-json\/wp\/v2\/posts\/2080","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/adilfahim.com\/myblog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/adilfahim.com\/myblog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/adilfahim.com\/myblog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/adilfahim.com\/myblog\/wp-json\/wp\/v2\/comments?post=2080"}],"version-history":[{"count":1,"href":"https:\/\/adilfahim.com\/myblog\/wp-json\/wp\/v2\/posts\/2080\/revisions"}],"predecessor-version":[{"id":2082,"href":"https:\/\/adilfahim.com\/myblog\/wp-json\/wp\/v2\/posts\/2080\/revisions\/2082"}],"wp:attachment":[{"href":"https:\/\/adilfahim.com\/myblog\/wp-json\/wp\/v2\/media?parent=2080"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/adilfahim.com\/myblog\/
wp-json\/wp\/v2\/categories?post=2080"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/adilfahim.com\/myblog\/wp-json\/wp\/v2\/tags?post=2080"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}