HP OpenVMS Systems: Ask the Wizard
The Question is:

When running on a system with a single CPU, a program calling the system
service $QIO has no problem. After installing a second CPU, it gets a
completion status of 00000000 in the IOSB when calling the system service
$CANTIM (cancel timer for timeout) or $CANCEL (cancel QIO).

1. What does this completion status mean?
2. How can I overcome this situation?
3. Do I have to treat it as an error?

The Answer is:
The Wizard would tend to expect the program in question has one or more
latent errors that are uncovered by the differences in the execution
timings between the SMP and uniprocessor platforms, or in the difference
in performance. Both moving to SMP and moving to faster processors are
notorious for uncovering latent synchronization errors lurking within
various applications. Further, upgrading to an OpenVMS release where
the timing of constituent operations has changed can serve as a trigger
-- operations that are slower are certainly an obvious cause, but faster
completions regularly also trigger synchronization-related errors.
The first thing you will want to do is read and understand the OpenVMS
Programming Concepts Manual, and particularly -- for this question -- the
chapters that cover asynchronous system traps (ASTs), general process
synchronization mechanisms, and interlocked memory synchronization.
You will then want to look for common sources of synchronization errors.
Some of the sources for these errors can include:
o Lack of checking for the completion of an asynchronous operation.
Variations include:
o failure to always use and verify an IOSB. Typically, the IOSB
and (as applicable) the EFN are verified with $synch, or with
another similar synchronization coding technique.
o Erroneously assuming that the setting of an event flag is an
indication that the asynchronous operation has succeeded.
(Use of the lib$get_ef and lib$free_ef calls to allocate unique
event flags is a start, but the application should be coded to
assume spurious event flag changes can arise.) Ask The Wizard
topics discussing event flags include (446), (640), (687),
(811), (819), (923), (1170), (1661), (1894), (2637), (2922),
(3531), (4325), (6099), (6138).
o If you do not want, need, or use an event flag, do not use
event flag zero (the default); rather, use EFN$C_ENF, the
Do Not Care event flag. (See efndef on V7.1 and later.)
o failure to use an IOSB that is valid over the lifetime of the
asynchronous call.
o erroneously sharing the IOSB across multiple asynchronous calls.
o failure to allocate the IOSB in memory that is valid over the
lifetime of the call. Using subroutine-local storage -- this
is often allocated on the stack and is valid only while the
call frame is itself valid -- is one classic example.
o accessing the buffer of an asynchronous read (such as a read
I/O) before the IOSB has been verified as non-zero, and thus
before the current read operation has completed.
o accessing the buffer of an asynchronous write (such as a write
I/O) before the IOSB has been verified as non-zero, and thus
before the last write operation has completed.
o overwriting the contents of the I/O write buffer before the IOSB
from the last write has been verified as non-zero.
o erroneously placing the read or write buffer for an asynchronous
operation in volatile storage, or in storage that is erroneously
shared with other currently-outstanding asynchronous calls.
o Assumptions around the synchronous completion of system services
not listed as synchronous, and accessing the data or buffers
involved before completion, without the expected use of $synch
or other completion synchronization.
o Assumptions around the delivery of or the delivery order of
ASTs originating on calls including $setimr, $dclast or $qio,
and access to the buffers or the data involved in the call
without the expected use of $synch or other synchronization.
o Incorrect shared memory synchronization.
Variations include:
o failure to use interlocked operations.
o failure to correctly account for caching. Depending on the
operation and the platform, memory barriers may be required.
Some of the possible memory caching policies include no
caching, write-through caching, and write-back caching.
(Please see Ask The Wizard topics (2681), (6984) and (7383)
for further information on the correct use of memory barriers
on Alpha systems.)
o incorrect use of the interlocked queue operations, erroneously
adding or removing entries at any location in the queue other
than the header of an interlocked queue.
o Uninterlocked sharing of any data between any AST(s) and the
mainline threads -- all data structures must be interlocked
or otherwise entirely re-entrant.
o On Alpha, failure to use the memory barrier operators (when
necessary) to ensure consistent memory contents -- memory
barriers are used to properly control the (expected) read
and write reordering normally found on Alpha. The barrier
will block execution until all pending memory operations
have completed. (Again, see topic (2681) for details, and
for discussions of the memory barriers and particularly the
granularity of the hardware interlocks, please see topics
(6984) and (7383).)
o Failure to account for "tearing" when performing non-aligned
(non-naturally aligned) memory access on adjacent areas of
memory. You will need to know if the particular platform
requires naturally-aligned quadword references, naturally-aligned
longword references, or some other value. Tearing involves
references to memory that are not naturally aligned -- and
that are otherwise unsynchronized -- and specifically involves
parallel unaligned references to nearby areas of memory.
* A variation of tearing involves the use of IPL-based or
spinlock-based synchronization -- and multiple levels
of these synchronization mechanisms -- within a single
addressable unit of memory. Individual bits within a
status value longword, for instance, cannot be safely
synchronized using multiple IPLs or multiple spinlocks.
* A second and potentially more subtle variation of tearing
involves the granularity of reference used by the particular
compiler (eg: CC/GRANULARITY), where the compiler can generate
code which can read and re-write adjacent values -- if, for
instance, the adjacent memory is an (adjacent) device CSR,
well, then things can get rather interesting rather quickly.
o Random programming errors:
Variations include:
o Failure to correctly deal with spurious sys$wake requests when
using sys$hiber calls. Topics (2637) and (3783) are related.
The alternative to a spurious $wake is a lost $wake, and this
can cause a process to stall waiting for the lost $wake.
The Programming Concepts manual discussion of $hiber and $wake
contains further information on this.
o Failure to correctly size memory allocated and deallocated.
o Failure to insert application-specific debugging into any large
or complex application. This includes logging. (See topic
(4129) for information on dynamic activation of the debugger,
and for information that permits generating a traceback using
a supported and documented API.)
o Failure to centralize error-prone areas of the code into a few
routines, and particularly failure to centralize all memory
management calls into a few routines. Centralization permits the
use of "fenceposts" or similar techniques to track down memory
pool corruptors -- allocating a "hidden" quadword at the front
and the back of any allocation call, filling both quadwords with
known patterns unique to the particular memory allocation call,
and checking for the pattern on deallocation. (Additional
details of using "fenceposts" are included in topic (3257).)
o accessing the contents of a descriptor for a dynamic string for
write through any means other than the provided string descriptor
routines.
o Writing data through an uninitialized pointer.
o Writing beyond the end of a data structure.
o Failure to check for an appropriate return condition value from
a subroutine or system service before continuing the execution
of a routine.
o On VAX, an REI instruction must be executed prior to executing
any instructions (code) that were written by the application
program.
o SYSGEN parameter setting assumptions:
Variations include:
o Failure to specify the entire required process quota list on
a sys$creprc call.
o Failure to specify the mailbox size and buffer quota on a
call to sys$crembx.
o Failure to check the required SYSGEN quota values for the
appropriate minimum values on each application or on each
system startup.
o Incorrect mixing of threads and ASTs:
o See topics (4647) and (6099) for details.
o And the Access Violation (ACCVIO), a brief introduction:
o See previous discussions of the Access Violation (ACCVIO)
and decoding the stackdump here in Ask The Wizard, and
specifically please see topics (837), (1705), (2195),
(2223), (3215), (5533), (6065), (6495), (6776), (7551),
and likely a few other topics.
o For information on the OpenVMS Debugger, on the "divide and
conquer" troubleshooting technique, and for general details
on how to debug an application, please see topics (7552) and
(4129).
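The IOSB rules listed above (keep the status block and the buffer valid
for the life of the operation, and do not touch the buffer until the
status is non-zero) can be illustrated with a minimal sketch in portable
C11 with POSIX threads. This is an analogy only: it is not OpenVMS code
and it does not call $qio or $synch. The names status_block,
async_request, async_start and async_wait are hypothetical, invented for
this sketch. The worker thread stands in for the asynchronous operation,
and the status field stands in for the IOSB condition value.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <string.h>

/* Hypothetical stand-in for an IOSB: zero means "not yet complete". */
typedef struct {
    atomic_uint status;  /* 0 until the operation has completed */
    unsigned    count;   /* bytes transferred; valid once status != 0 */
} status_block;

/* Hypothetical request; everything it points at must outlive the call. */
typedef struct {
    status_block *sb;
    char         *buffer;
    size_t        buflen;
} async_request;

/* Stand-in for the asynchronous operation: fill the buffer, then post. */
static void *async_worker(void *arg)
{
    async_request *req = arg;
    static const char data[] = "payload";
    size_t n = sizeof data < req->buflen ? sizeof data : req->buflen;

    memcpy(req->buffer, data, n);
    req->sb->count = (unsigned)n;
    /* Release store: the buffer and count are visible before the status. */
    atomic_store_explicit(&req->sb->status, 1u, memory_order_release);
    return NULL;
}

/* Clear the status block first, then start the operation. */
int async_start(async_request *req, pthread_t *tid)
{
    assert(req != NULL && req->sb != NULL && req->buffer != NULL);
    atomic_store_explicit(&req->sb->status, 0u, memory_order_relaxed);
    return pthread_create(tid, NULL, async_worker, req);
}

/* Wait until the status is non-zero; only then is the buffer safe to use. */
unsigned async_wait(status_block *sb)
{
    unsigned s;
    while ((s = atomic_load_explicit(&sb->status, memory_order_acquire)) == 0u)
        sched_yield();
    return s;
}
```

A caller would place the status block and buffer in static or heap
storage, call async_start, then async_wait, and only then read the
buffer. Seeing a zero status at any earlier point simply means the
operation is still in progress in this sketch; the release/acquire pair
plays the role the memory barrier discussion above describes.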
Memory allocation routines are commonly referenced as sources of errors
by many programmers. The memory allocation and deallocation routines in
the OpenVMS libraries see extensive use throughout the operating system,
layered products and applications, and throughout the entire customer
application base, and the corresponding incidence of errors in these
calls -- while certainly possible -- is exceedingly rare. Most
often, the application seeing the error has somehow clobbered a key part
of the memory heap or has clobbered part of the stack. This is why the
Wizard recommends centralizing the allocation and deallocation routines,
and using "fenceposts" (details of fenceposts discussed above).
o Related topics include (2624), (2630) and (3748), as well as
(3115), (3257), (4808), (5455), (5640), (6536), (7006), etc.
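The fencepost technique described above can be sketched in portable C as
a thin wrapper around malloc and free. This is a minimal illustration
under stated assumptions, not the code from topic (3257): the names
fence_alloc, fence_check and fence_free and the pattern value are
hypothetical, the wrapper is not thread-safe as written, and an OpenVMS
application might instead wrap lib$get_vm and lib$free_vm. Each
allocation gets a hidden quadword in front of and behind the caller's
memory, both filled with a pattern unique to that allocation, and the
pattern is verified on deallocation.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hidden header placed directly in front of the caller's memory. */
typedef struct {
    size_t   size;      /* size the caller asked for, in bytes */
    uint64_t pattern;   /* pattern unique to this allocation */
    uint64_t reserved;  /* padding; keeps user data 16-byte aligned on common ABIs */
    uint64_t front;     /* front fencepost, adjacent to the user data */
} fence_hdr;

static uint64_t fence_serial = 0;  /* not thread-safe; sketch only */

/* Allocate size bytes with a fencepost quadword at each end. */
void *fence_alloc(size_t size)
{
    if (size > SIZE_MAX - sizeof(fence_hdr) - sizeof(uint64_t))
        return NULL;
    fence_hdr *h = malloc(sizeof *h + size + sizeof(uint64_t));
    if (h == NULL)
        return NULL;
    h->size = size;
    h->pattern = 0xFEEDFACE00000000ull | (++fence_serial & 0xFFFFFFFFull);
    h->reserved = 0;
    h->front = h->pattern;
    unsigned char *user = (unsigned char *)(h + 1);
    /* The back fence may be unaligned, so copy it byte-wise. */
    memcpy(user + size, &h->pattern, sizeof h->pattern);
    return user;
}

/* Returns 0 if intact, 1 if the front fence is damaged,
   2 if the back fence is damaged, 3 if both are damaged. */
int fence_check(const void *p)
{
    assert(p != NULL);
    const fence_hdr *h = (const fence_hdr *)p - 1;
    uint64_t back;
    int bad = 0;
    memcpy(&back, (const unsigned char *)p + h->size, sizeof back);
    if (h->front != h->pattern) bad |= 1;
    if (back != h->pattern)     bad |= 2;
    return bad;
}

/* Check both fences, release the block, and report what was found. */
int fence_free(void *p)
{
    if (p == NULL)
        return 0;
    int bad = fence_check(p);
    free((fence_hdr *)p - 1);
    return bad;
}
```

Because every allocation and deallocation passes through these few
routines, a non-zero result from fence_free points at the specific
block that was overrun or underrun; a real application would log the
failure or signal the debugger at that point rather than just return.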
Also please see the heap analyzer support in the OpenVMS Debugger -- the
Debugger is an invaluable tool for locating and resolving errors, and can
even be programmed to lurk waiting for an error, or to activate (via a
call to lib$signal with SS$_DEBUG) and then display information on the
error and the current application context.
Related topics include (2681) and (6099). Also see (2624) and (2630).
Also the ACCVIO topics: (837), (1705), (2195), (2223), (3215), (5533),
(6065), (6495), (6776), (7551), etc. Also see the ASTs, threads,
reentrancy, and shared memory topics: (2681), (4647), (6099), and (6984).
And debugging and traceback topics such as (4129) and (7552). For virtual
memory debugging and memory heap corruptions, and fenceposts, see (3257).