Monitoring the health of resources in an SSI environment

This section applies mainly to a Blueworx Voice Response Single System Image (SSI) environment running under HACMP. The function it describes does not require HACMP, however, and is also useful in a non-HACMP SSI environment.

This section describes how Blueworx Voice Response detects problems with DB2 and NFS servers, and what it does when it finds a problem.

This section describes DBHEALTH, a Blueworx Voice Response process that monitors DB2 and filesystems containing voice and customer applications. DBHEALTH also provides information on the accessibility of key resources to the rest of the system.

Most installations can use the default values of the system parameters used by DBHEALTH, and you should not need to read this section unless you see error messages about DBHEALTH or you are monitoring system status using SNMP or a custom server.

Resource problem when Blueworx Voice Response starts up

A key problem in an SSI environment is auto-restart after a power failure on the client and server systems in the cluster. At initialization Blueworx Voice Response keeps trying to contact resources on the server until they become available.

Initialization scripts:

Product initialization scripts such as vaeinit and vaeinit.nox (and the scripts they call) keep retrying during a server outage. During this retry cycle, a message such as "Will retry DB2 connection in 20 seconds" is displayed.
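
The retry cycle can be pictured with a minimal sketch. This is not the vaeinit code: the names wait_for_resource, check, sleep, and log are hypothetical, and only the 20-second interval and the message text are taken from the example message above.

```python
import time

def wait_for_resource(check, name="DB2", interval=20, sleep=time.sleep, log=print):
    """Poll check() until it returns True; return the number of retries made.

    Hypothetical illustration only: the real initialization scripts are not
    shown in this section.
    """
    retries = 0
    while not check():
        # Same wording as the message displayed during the retry cycle.
        log(f"Will retry {name} connection in {interval} seconds")
        sleep(interval)
        retries += 1
    return retries
```

The loop has no retry limit, which matches the described behavior of waiting until the resource becomes available; cancelling it is an operator decision.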

Cancelling an unsuccessful startup:

If you want to cancel a Blueworx Voice Response startup that is stuck in retry mode, run DT_shutdown.

Custom servers:

If Blueworx Voice Response detects a resource problem when it starts, it does not start any AUTOEXEC custom servers and does not process other requests to start custom servers until the resources are available.

While Blueworx Voice Response is waiting for a resource problem to clear, the following message is added to DTstatus.out:

CA_CNTL: initialization waiting on DBHEALTH system_state=8

Here, system_state is the DTstatus value returned by SNMP (see Blueworx Voice Response resources information).

A similar message is added to trace.

This condition is very unlikely. It represents a resource problem that occurs after vaeinit has successfully checked the resources, possibly caused by instability of the resource server.

Resource problems when Blueworx Voice Response is running

The DBHEALTH process, which monitors resource availability, is started by NODEM with other programs in $SYS_DIR/tasklist.data.

Configuring DBHEALTH:

The following parameters control the time DBHEALTH waits for a response from a resource before it considers there to be a problem with a resource and Blueworx Voice Response takes action:

Database Availability Check Timeout
File Availability Check Timeout
System Response during Server Outage

These parameters are described in more detail in the Configuring the System guide.

The default timeout is good enough for most systems. A resource which causes a delay greater than 15 seconds is likely to be unacceptable to a caller, since the caller hears nothing during this time.

Operations that put a heavy load on the filesystem, such as backup, can cause a slow response which might be interpreted as a resource problem. If this happens, lower the priority of the operation rather than increasing the timeout values so that the response to your callers is not affected.

If you configure Blueworx Voice Response to disconnect calls in progress when DBHEALTH detects a problem, and an HACMP failover is planned anyway, use the System Monitor or your switch to quiesce trunks first, to minimize the number of disconnected calls.

Note the following about DBHEALTH:

DBHEALTH should be one of the first entries in $SYS_DIR/tasklist.data so that resource monitoring is available as soon as possible.
If DBHEALTH is stopped, the system status remains set according to the state of the resources when DBHEALTH was stopped. If Blueworx Voice Response is running with all resources present, it continues to run.
When DBHEALTH is started and resources become available after a failure, DBHEALTH unlocks mailboxes locked by the system on which it is running.
DBHEALTH must run with standard output and standard error redirected to a file to prevent threads blocking if screen I/O is interrupted.

Detection of resource problems

In most cases the System Administrator need not be aware of these details.

What DBHEALTH monitors:

The resources monitored by DBHEALTH are DB2 itself and a file on five filesystems under $CUR_DIR. These files are:

$CUR_DIR/ca/.dirTalkIDStamp
$CUR_DIR/voice/segment/.dirTalkIDStamp
$CUR_DIR/voice/msg/.dirTalkIDStamp
$CUR_DIR/voice/greet/.dirTalkIDStamp
$CUR_DIR/voice/aname/.dirTalkIDStamp

These are the same resources initialized by $VAETOOLS/fsupdate when Blueworx Voice Response is installed or a Single System Image is configured. If these resources have been deleted, shut down Blueworx Voice Response, then run $VAETOOLS/fsupdate. If problems persist with an SSI configuration, see the section on creating and managing a single system image in the Configuring the System guide.

How DBHEALTH monitors each file:

A thread issues an AIX system call (statx()) to access the file.
The returned data is checked against what was stored in DB2 for that file when the system was set up.
In addition another file (.HOSTNAME) is written and read.
A separate thread checks that the first thread is running and the statx() call returns before the value set in the File Availability Check Timeout system parameter.
One error is logged per resource.
DBHEALTH propagates the error to other Blueworx Voice Response Processes as described in What Blueworx Voice Response does when DBHEALTH detects a resource problem.
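
The two-thread arrangement described above can be sketched as follows. This is an illustration under stated assumptions, not the DBHEALTH implementation: poll_file and its return values are hypothetical names, os.stat stands in for the AIX statx() call, and the comparison with the values stored in DB2 and the .HOSTNAME write and read are left out.

```python
import os
import threading

def poll_file(path, timeout_s, stat=os.stat):
    """Return "ok", "error", or "timeout" for one availability check of path."""
    result = {}

    def worker():
        # First thread: makes the filesystem call, which may block on a dead mount.
        try:
            result["stat"] = stat(path)
        except OSError as exc:
            result["error"] = exc

    t = threading.Thread(target=worker, daemon=True)
    t.start()
    t.join(timeout_s)  # Watcher: waits no longer than the configured timeout.
    if t.is_alive():
        return "timeout"  # The call has not returned: treat as a resource problem.
    return "error" if "error" in result else "ok"
```

Running the call in its own thread matters because an access to a hard-mounted filesystem on an unavailable server can block indefinitely; the watcher can still report a problem when the timeout expires.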

How DBHEALTH monitors DB2:

This is similar to how DBHEALTH monitors each file:

DBHEALTH queries the nodes table on the DB2 server (used by Blueworx Voice Response to record SSI configuration).
DBHEALTH writes a timestamp to the nodes table so that another Blueworx Voice Response client can unlock a mailbox that was locked by a client that has resource problems.
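
One way another client could use such a timestamp is to treat a node as having resource problems when its last timestamp is older than some threshold. The sketch below only illustrates that idea: stale_nodes, the shape of the heartbeats mapping, and the max_age_s threshold are all assumptions, and the actual nodes table layout and unlock logic are not described in this section.

```python
def stale_nodes(heartbeats, now, max_age_s):
    """Return the names of nodes whose last timestamp is older than max_age_s.

    heartbeats maps a node name to its last timestamp in seconds.
    All names and values here are hypothetical.
    """
    return sorted(name for name, ts in heartbeats.items() if now - ts > max_age_s)
```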

Debugging Mode:

Sending the signal SIGUSR1 (kill -30 <DBHEALTH process id>) to DBHEALTH toggles debugging mode.
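
The number 30 is the AIX value of SIGUSR1; other platforms number the signal differently, so a script is more portable if it sends the signal by name. The following sketch shows the general toggle-on-signal pattern and a send by name. It is not DBHEALTH code: state, toggle_debug, and send_toggle are hypothetical names.

```python
import os
import signal

# Hypothetical process state; a real daemon would keep this wherever it keeps its flags.
state = {"debug": False}

def toggle_debug(signum, frame):
    """Flip the debug flag each time the signal arrives."""
    state["debug"] = not state["debug"]

# Install the handler in the process that is to be toggled.
signal.signal(signal.SIGUSR1, toggle_debug)

def send_toggle(pid):
    """Send SIGUSR1 by name, so the platform-specific number is not hard-coded."""
    os.kill(pid, signal.SIGUSR1)
```

Because the signal toggles the mode, sending it a second time returns the process to normal operation.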

Debugging mode is not intended for normal use, but it might be useful when you are debugging system problems or deciding on values for the Database Availability Check Timeout and File Availability Check Timeout system parameters. Debugging mode generates a large amount of information, which causes DTstatus.out to be archived and deleted.

In debugging mode the time taken for each resource to be polled and any error information from the polling command (DB2 query or filesystem access) is recorded. This additional error information might be useful because DBHEALTH records only the first problem per resource in the errorlog.

The fields written to DTstatus.out when the system is running normally are as follows:

Timestamp of this record (which can be aligned with the timestamp in the errorlog).
DBHEALTH (the name of the process doing the logging).
The function in DBHEALTH doing the logging.
The line number causing the log entry.
The time the polling function took.
The name of the resource being polled.

For example:

Timestamp   Process  Function             Line   Time   Resource

14:44:45.86 DBHEALTH DBpollingThread LINE  982   0.003s DB2(dtdbv230)

Similar information is also written to the system trace buffer.
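
When you use these records to choose timeout values, it can help to extract the polling time from each record. The following is a minimal sketch that assumes every record has exactly the layout of the example above; RECORD and parse_record are hypothetical names, and records in other layouts are simply skipped.

```python
import re

# Pattern derived only from the single example record shown above.
RECORD = re.compile(
    r"^(?P<timestamp>\d{2}:\d{2}:\d{2}\.\d+)\s+"
    r"(?P<process>\S+)\s+"
    r"(?P<function>\S+)\s+"
    r"LINE\s+(?P<line>\d+)\s+"
    r"(?P<seconds>\d+(?:\.\d+)?)s\s+"
    r"(?P<resource>\S.*?)\s*$"
)

def parse_record(text):
    """Return the fields of one record as a dict, or None if it does not match."""
    m = RECORD.match(text)
    if m is None:
        return None
    rec = m.groupdict()
    rec["line"] = int(rec["line"])
    rec["seconds"] = float(rec["seconds"])
    return rec
```

Applied to the example record, the sketch yields the process DBHEALTH, the function DBpollingThread, line 982, a polling time of 0.003 seconds, and the resource DB2(dtdbv230).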

What Blueworx Voice Response does when DBHEALTH detects a resource problem

Telephone calls, state tables, and Java applications:

All calls in progress are terminated immediately and no new calls are accepted (configured by the System Response during Server Outage system parameter).
Attempts to enable a trunk during a server outage are blocked. The trunk is shown as "ENABLING" until the server is available again.
All running state tables are terminated immediately.
No Java applications will be started.

Custom servers:

Custom servers are not terminated if a resource problem is detected. This allows a custom server to terminate in an orderly way, for example by rolling back a transaction on a back-end database.
Hard-mount the custom server directory (/ca) so that if there is a request to page in the custom server executable, the custom server waits until the custom server directory becomes available. If you soft-mount the /ca directory, and a paging request fails, the custom server core dumps.
A waiting CA_Receive_DT_Msg() indicates termination of state tables by returning the function id of the close function (see the Custom Servers information).
Calls to the Custom Server Library are executed.
Attempts to access the Direct Talk data fail and many CA library calls set CA_errno to CA_INV_REQUEST.
In other cases CA_errno indicates failure to access the data concerned; for example, CA_Open_CHP_Link() sets CA_errno to CA_NO_LINK_AVAILABLE.
When a resource problem is detected, requests to the Custom Server Controller are all returned immediately with a failure, to prevent buffers building up.
Trace shows an entry like the following:
```
456 5.59218667 0.083580 CA_LIB: [29106]
CA_CNTL Request 110 rejected system_state=8
```
(system_state is the DTstatus as reported by SNMP).
Therefore operations such as building from the GUI and starting custom servers fail.
CA_Install_CA(), CA_Deinstall_CA(), and CA_Set_IPL_State() set CA_errno to CA_INV_REQUEST.
CA_Start_CA() will set CA_errno to CA_START_FAILED.

Windows:

Many of the Blueworx Voice Response windows continue to function after an outage. Some display the message 'Database server unavailable. Access to data including HELP system not possible at this time.', and some appear to hang. However, the following functions are still available:

Logoff
System Monitor
Shutdown

Monitoring the state of the system when DBHEALTH detects a resource problem

When DBHEALTH detects a resource problem it sets a variable which you can monitor as follows:

Using the CA_Get_System_State() custom server subroutine (described in the Custom Servers information).
Using NetView® or a shell script with SNMP to read the DTstatus MIB (see dtStatus).
At a terminal, using the DTmon command. DTmon -p displays the system state at the bottom of the output in the following format:
```
System Stat    :  run
```
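
A monitoring script could pick the state out of the DTmon -p output along the following lines. This is a sketch under stated assumptions: system_state is a hypothetical helper, it relies only on the label and layout shown above, and it makes no assumption about which state values can appear.

```python
def system_state(dtmon_output):
    """Return the value on the 'System Stat' line, or None if no such line exists."""
    for line in dtmon_output.splitlines():
        label, sep, value = line.partition(":")
        if sep and label.strip() == "System Stat":
            return value.strip()
    return None
```

In a real script the text would come from running DTmon -p, for example with subprocess.run(["DTmon", "-p"], capture_output=True, text=True).stdout. Keeping the parsing separate from the command call allows it to be tested on a saved copy of the output.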

Errors logged by DBHEALTH

Errors logged by DBHEALTH (numbers 5301 and 5310 through 5317) are described in the Problem Determination information.

If DBHEALTH issues a red error indicating that the system is experiencing resource problems, there should be a corresponding green error (5314) indicating that the problem has cleared.

Check information in the error message for details of the problem.

Modifying the Technical Difficulties Message:

When your system is experiencing problems and cannot handle new calls, you might want to let callers know. You can do this by changing the message that is played when there are technical difficulties. See the section on changing the technical difficulties message in the Configuring the System guide.