HP OpenVMS Systems

ask the wizard

Cluster votes, shadowing, disaster tolerance?


The Question is:

 
Hi,
 
I would appreciate any comments/hints you may have concerning the following
configurations:
 
Present configuration is a dual node AXP 8200/4100 SCSI cluster with a
quorum disk. All disks are housed in an SW800 cabinet where we have 3 pairs
of dual redundant HSZ50s serving approximately 350 Gigabytes of disk space
configured using various RAID 5, 1, 0 and 10 sets (24 logical VMS disks of
sizes between 2 Gigabytes and 25 Gigabytes). Each node has 2 Gbytes of
memory and twin CPUs.
 
The application is a customer care and billing system using Ingres 6.4, which
unfortunately is only supported on VMS up to 6.2-1H3, which is what we have.
 
Our configuration of Ingres 6.4 does not know about clustering and so the
4100 runs the Ingres server processes whilst the 8200 takes the Ingres
client processes and user logins (about 300 at present). The Ingres server
<-> client traffic as well as the SCA cluster traffic is handled by an
isolated FDDI ring whilst user logins are by Telnet over 100Mbit ethernet.
 
This seems to run fairly well at present, but the next stage is worrying me
somewhat ... We are planning a disaster tolerant cluster by splitting the 2
nodes (and possibly upgrading the 4100 to the equivalent of an 8200) to
different sites using a giga-switched FDDI ring for inter node comms and
reproducing the disk configuration in another SW800 at the second site and
then shadowing the whole lot using VMS volume shadowing. The quorum will be
upheld using a third smaller node (e.g. a 2100) in a third site with a local
system disk (locally shadowed for availability reasons).
 
I have anticipated that flooding the FDDI ring during shadow copies is very
likely and so plan to minimise their frequency by adding a medium sized
extra node in each of the two main sites. These two extra nodes will do
nothing much except ensure that the shadow sets remain intact if and when
one or both of the main application nodes goes down (let's call these
smaller nodes the "shadow watchers"). To do this I figured on making all of
these 4 nodes MSCP servers.
 
I have two questions really ... at first I thought that I would have to
assign cluster votes by giving 0 votes to each of the 2 shadow watchers and
1 vote to each of the two application nodes and 1 to the quorum watcher.
This would ensure a cluster hang if both application nodes crashed whilst
theoretically still keeping both members of all shadow sets intact and so
avoiding the need for expensive shadowset merges or copies when the nodes
were brought back up.
 
However, I have now also thought it may be better to keep the cluster up in
such a circumstance by giving all nodes in the cluster exactly one vote.
What worries me about doing this is what happens to the data in the database
and the RMS flat files if both application nodes go down for some reason?
Will the database or RMS flat files be corrupted?
 
My second concern is for the length of time it will require to rebuild
shadow sets amounting to 450 Gigabytes if and when they are required. Do you
have any figures for this?
 
I guess I also have a couple more questions... Will it work??? and have you
seen a similar cluster elsewhere?
 
Regards
 
Anxious

The Answer is :

 
  Please read and heed the information on VOTES and EXPECTED_VOTES that
  is present in the OpenVMS FAQ.
 
  RMS files will not be corrupted at loss of quorum, nor ensuing recovery.
  (The whole basis of the quorum scheme is to prevent these corruptions,
  by preventing partitioned clusters and thus preventing uncoordinated
  write operations to shared resources.)
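 
  As a rough illustration of the arithmetic behind the two voting schemes
  described in the question, the sketch below applies the OpenVMS quorum
  formula, quorum = (EXPECTED_VOTES + 2) / 2 with the result truncated to an
  integer. It is a minimal sketch and not a configuration recommendation:
  the node names are hypothetical, and EXPECTED_VOTES is assumed to be set
  to the sum of the VOTES values of all five nodes.

```python
def quorum(expected_votes: int) -> int:
    """OpenVMS quorum formula: (EXPECTED_VOTES + 2) / 2, truncated."""
    return (expected_votes + 2) // 2


def has_quorum(votes: dict, up: set) -> bool:
    """True if the nodes in `up` together hold at least quorum votes.

    Assumption: EXPECTED_VOTES equals the sum of all configured VOTES.
    """
    expected = sum(votes.values())
    present = sum(v for node, v in votes.items() if node in up)
    return present >= quorum(expected)


# Hypothetical node names for the five-node layout in the question.
# Scheme A: shadow watchers get 0 votes; everything else gets 1.
scheme_a = {"APP1": 1, "APP2": 1, "QUORUM": 1, "WATCH1": 0, "WATCH2": 0}
# Scheme B: every node gets exactly 1 vote.
scheme_b = {name: 1 for name in scheme_a}

# Both application nodes are down; the other three are still running.
survivors = {"QUORUM", "WATCH1", "WATCH2"}

for label, votes in (("A", scheme_a), ("B", scheme_b)):
    expected = sum(votes.values())
    print(f"scheme {label}: EXPECTED_VOTES={expected}, "
          f"quorum={quorum(expected)}, "
          f"survivors keep quorum: {has_quorum(votes, survivors)}")
```

  Under these assumptions the first scheme needs 2 votes and the survivors
  hold only 1, so cluster activity stalls until a voting node returns; the
  second scheme needs 3 votes and the survivors hold exactly 3, so the
  cluster continues. In neither case does the quorum scheme itself permit
  uncoordinated writes.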
 
  Shadow set merge operations occur when shadow set volumes are not cleanly
  dismounted -- this is not particularly tied to cluster quorum.
 
  Attempts to bypass the quorum mechanism can lead, and have led, to severe
  disk data corruptions.
 
  Discussions of disaster-tolerant configurations generally involve some
  assistance from folks very familiar with the local requirements, with the
  general pitfalls, and with the required final environment. (Disaster
  tolerance can initially look easy, but there are almost always a wide
  variety of considerations -- it is almost never as easy as it initially
  looks.)  The OpenVMS Wizard would encourage you to contact the Compaq
  Customer Support Center for assistance and consulting services.
 

  
     
     answer written or last revised on ( 7-SEP-1999 )