PROVE IT !!If you know it then PROVE IT !! Skill Proficiency Test

IBM BigSQL : Steps to Recover BigSQL HA From Failed/Corrupt Cluster Configuration

Summary

Big SQL High Availability (HA) refers to providing high availability for the meta data of the Big SQL database (the Big SQL Metastore ) and for the Big SQL Scheduler component. The Big SQL database is used by Big SQL applications to access Hadoop tables.

Introduction

In this article we will look at steps to recover BigSQL HA in scenarios where resource group is showing up in offline state and HA is not being established. For example, when lssam command is run, db2_bigsql_bigsql_BIGSQL-rg resource group shows up as offline,

lssam
Online IBM.ResourceGroup:BigSQLScheduler_bigsql-rg Nominal=Online
        '- Online IBM.Application:BigSQLScheduler_bigsql-rs
                |- Online IBM.Application:BigSQLScheduler_bigsql-rs:wine2
                '- Online IBM.Application:BigSQLScheduler_bigsql-rs:wine3
Offline IBM.ResourceGroup:db2_bigsql_bigsql_BIGSQL-rg Request=Lock Nominal=Online                                                                    
        '- Offline IBM.Application:db2_bigsql_bigsql_BIGSQL-rs Control=StartInhibitedBecauseSuspended                                                           
                |- Offline IBM.Application:db2_bigsql_bigsql_BIGSQL-rs:wine2
                '- Offline IBM.Application:db2_bigsql_bigsql_BIGSQL-rs:wine3

Running db2pd on the head node to list the hadr status shows up as not active.

db2pd -db bigsql -hadr 
 
Database Member 0 -- Database BIGSQL -- Active -- Up 0 days 04:55:53 -- 
Date 2018-01-30-03.06.34.763327 
 
HADR Information: 
HADR is not active. 

Users can connect to individual BigSQL head nodes and run the queries.

Pre-requisites:

Following assumptions are made while writing this blog, users should only proceed if the following conditions are satisfied

  • The HADR config files created by installer is present on each of the head nodes, in order to verify if the xml config files are present run the following command on each of the head node,
ls /var/lib/ambari-agent/cache/extensions/IBM-Big_SQL/5.0.1.0/services/BIGSQL/package/scripts/*.xml
/var/lib/ambari-agent/cache/extensions/IBM-Big_SQL/5.0.1.0/services/BIGSQL/package/scripts/-hadr.xml
  • There has been no changes made to the network configuration such as firewall, IP address of network interfaces , gateway address on both the head nodes post installation.
  • Users are expected to be familiar with HA terminologies and db2 cli commands

Steps to recover HA.

The following steps should be run as bigsql instance user which by default is bigsql , unless otherwise stated.

1. Remove the HA cluster configuraiton by running db2haicu -delete on both the head nodes.

db2haicu -delete
Welcome to the DB2 High Availability Instance Configuration Utility (db2haicu).

You can find detailed diagnostic information in the DB2 server diagnostic log file called db2diag.log. Also, you can use the utility called db2pd to query the status of the cluster domains you create.

For more information about configuring your clustered environment using db2haicu, see the topic called 'DB2 High Availability Instance Configuration Utility (db2haicu)' in the DB2 Information Center.

db2haicu determined the current DB2 database manager instance is 'bigsql'. The cluster configuration that follows will apply to this instance.

When you use db2haicu to configure your clustered environment, you create cluster domains. For more information, see the topic 'Creating a cluster domain with db2haicu' in the DB2 Information Center. db2haicu is searching the current machine for an existing active cluster domain ...
db2haicu found a cluster domain called 'domain-1' on this machine. The cluster configuration that follows will apply to this domain.

Deleting the domain 'domain-1' from the cluster ...
Deleting the domain 'domain-1' from the cluster was successful.
All cluster configurations have been completed successfully. db2haicu exiting ...

2. Once removed, run the lsrpdomain command , it should not list any domains on both the head nodes

3. Verify if HADR configuration variables are still intact on both the head nodes, by running following command,

db2 get db cfg for bigsql| grep -i HADR
 HADR database role                                      = PRIMARY
 HADR local host name                  (HADR_LOCAL_HOST) = headhost1.ibm.com
 HADR local service name                (HADR_LOCAL_SVC) = 20008
 HADR remote host name                (HADR_REMOTE_HOST) = headhost2.ibm.com
 HADR remote service name              (HADR_REMOTE_SVC) = 20008
 HADR instance name of remote server  (HADR_REMOTE_INST) = bigsql
 HADR timeout value                       (HADR_TIMEOUT) = 120
 HADR target list                     (HADR_TARGET_LIST) = 
 HADR log write synchronization mode     (HADR_SYNCMODE) = NEARSYNC
 HADR spool log data limit (4KB)      (HADR_SPOOL_LIMIT) = AUTOMATIC(384000)
 HADR log replay delay (seconds)     (HADR_REPLAY_DELAY) = 0
 HADR peer window duration (seconds)  (HADR_PEER_WINDOW) = 600
 HADR SSL certificate label             (HADR_SSL_LABEL) = 

4. On the standby head node check if the database is in rollforward pending state ,

db2 get db cfg for bigsql|grep -i rollforward
 Rollforward pending                                     = NO

If the output says NO then proceed to step 5, if the output says DATABASE proceed to step 7.

5. Take a backup of database from primary head node by running following sequence of commands,

          a. db2 terminate
          b. db2 deactivate database bigsql
          c. db2 backup db BIGSQL to ~/backup_dir 

6. Copy the backup image to the standby head node and restore it by using following steps

          a. Make sure no applications are connected to the standby head node

             db2 list applications
             SQL1611W  No data was returned by Database System Monitor 

          b. Check if BigSQL database is currently active

             db2 list active databases

          c. If BigSQL database is listed as active , deactivate it first before restoring the backup

               db2 deactivate db bigsql

          d. From the backup_dir run the restore command, when prompted for overwrite select Y

             db2 restore database bigsql taken at 20180228152811

          e. Now check again if rollforward is pending on database, by running following command

             db2 get db cfg for bigsql|grep -i rollforward
             Rollforward pending                                     = DATABASE

7. Start HADR on Standby node

       db2 start hadr on database bigsql as standby

8. Start HADR on primary head node

       db2 start hadr on database bigsql as primary

9. Now restore the HA configurations using following sequence of steps

    
       a. As the root user run prepnode to prepare the hosts for HA cluster configuration.
            Run the following command by replacing the corresponding hostnames 
        preprpnode 
       b. Run the following command as bigsql user on each of the head node ,
        db2haicu -f /var/lib/ambari-agent/cache/extensions/IBM-Big_SQL/5.0.1.0/services/BIGSQL/package/scripts/-hadr.xml 

If all the steps succeeded , BigSQL HA should be online and working . To confirm list the resources groups using lssam to verify if all the resources are online.
The output should be similar to following excerpt,

lssam
Online IBM.ResourceGroup:BigSQLScheduler_bigsql-rg Nominal=Online
        '- Online IBM.Application:BigSQLScheduler_bigsql-rs
                |- Online IBM.Application:BigSQLScheduler_bigsql-rs:wine2
                '- Online IBM.Application:BigSQLScheduler_bigsql-rs:wine3
Online IBM.ResourceGroup:db2_bigsql_bigsql_BIGSQL-rg Nominal=Online
        '- Online IBM.Application:db2_bigsql_bigsql_BIGSQL-rs
                |- Online IBM.Application:db2_bigsql_bigsql_BIGSQL-rs:wine2
                '- Offline IBM.Application:db2_bigsql_bigsql_BIGSQL-rs:wine3

If any of the above steps fail, please reach out to IBM BigSQL Support for assistance .

Let’s block ads! (Why?)