HP OpenVMS Systems

ask the wizard

Debugging synchronization problems? (SMP)

The Question is:

 
When running a system with a single CPU, a program calling the system
service $QIO has no problem.
After installing a second CPU, it gets a completion status of 00000000 in
the IOSB when calling the system service $CANTIM (cancel timer for timeout)
or $CANCEL (cancel QIO).
 
1. What does this completion status mean ?
2. How can I overcome this situation ?
3. Do I have to treat it as an error ?

The Answer is :

 
  The Wizard would tend to expect the program in question has one or more
  latent errors that are uncovered by the differences in the execution
  timings between the SMP and uniprocessor platforms, or in the difference
  in performance.  Both moving to SMP and moving to faster processors are
  notorious for uncovering latent synchronization errors lurking within
  various applications.  Further, upgrading to an OpenVMS release where
  the timing of constituent operations has changed can serve as a trigger
  -- operations that are slower are certainly an obvious cause, but faster
  completions regularly trigger synchronization-related errors as well.
 
  The first thing you will want to do is read and understand the OpenVMS
  Programming Concepts Manual, and particularly -- for this question -- the
  chapters that cover asynchronous system traps (ASTs), general process
  synchronization mechanisms, and interlocked memory synchronization.
 
  You will then want to look for common sources of synchronization errors.
  Some of the sources for these errors can include:
 
    o Lack of checking for the completion of an asynchronous operation.
 
      Variations include:
 
      o failure to always use and verify an IOSB.  Typically, the IOSB
        and (as applicable) the EFN are verified with $synch, or with
        a similar synchronization coding technique.
      o Erroneously assuming that the setting of an event flag is an
        indication that the asynchronous operation has succeeded.
        (Use of the lib$get_ef and lib$free_ef calls to allocate unique
        event flags is a start, but the application should be coded to
        assume spurious event flag changes can arise.)  Ask The Wizard
        topics discussing event flags include (446), (640), (687),
        (811), (819), (923), (1170), (1661), (1894), (2637), (2922),
        (3531), (4325), (6099), (6138).
      o If you do not want, need, or use an event flag, do not use
        event flag zero (the default); use EFN$C_ENF, the Do Not Care
        event flag, instead.  (See efndef on V7.1 and later.)
      o failure to use an IOSB that is valid over the lifetime of the
        asynchronous call.
      o erroneously sharing the IOSB across multiple asynchronous calls.
      o failure to allocate the IOSB in memory that is valid over the
        lifetime of the call.  Using subroutine-local storage -- this
        is often allocated on the stack and is valid only while the
        call frame is itself valid -- is one classic example.
      o accessing the buffer of an asynchronous read (such as a read
        I/O) before the IOSB has been verified as non-zero, and thus
        before the current read operation has completed.
      o accessing the buffer of an asynchronous write (such as a write
        I/O) before the IOSB has been verified as non-zero, and thus
        before the last write operation has completed.
      o overwriting the contents of the I/O write buffer before the IOSB
        from the last write has been verified as non-zero.
      o erroneously placing the read or write buffer for an asynchronous
        operation in volatile storage, or in storage that is erroneously
        shared with other currently-outstanding asynchronous calls.
      o Assuming the synchronous completion of system services that are
        not listed as synchronous, and accessing the data or buffers
        involved before completion, without the expected use of $synch
        or other completion synchronization.
      o Assumptions around the delivery of or the delivery order of
        ASTs originating on calls including $setimr, $dclast or $qio,
        and access to the buffers or the data involved in the call
        without the expected use of $synch or other synchronization.
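
  The IOSB rules above can be sketched in portable C.  This is a
  simulation under stated assumptions, not OpenVMS code: the sim_iosb
  layout, the sim_* routine names, and the success value of 1 are all
  invented for illustration, and sim_complete_read stands in for whatever
  completes the real request.  What it tries to show is that a zero status
  means the request is still pending, that each request owns its own IOSB
  and buffer, and that both live in storage that outlasts the caller's
  stack frame.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Simplified stand-in for an I/O status block (NOT the real OpenVMS
   layout).  A zero status means the operation is still pending. */
typedef struct {
    volatile uint16_t status;   /* 0 = pending, non-zero = final status */
    uint16_t count;             /* bytes transferred */
} sim_iosb;

/* Each request owns its IOSB and buffer; nothing is shared between
   outstanding requests. */
typedef struct {
    sim_iosb iosb;
    char buffer[64];
} sim_request;

/* Start a simulated read.  The request is heap-allocated so that it stays
   valid after the calling routine returns; calloc leaves status at 0. */
sim_request *sim_start_read(void) {
    return calloc(1, sizeof(sim_request));
}

/* Hypothetical completion path: fill the buffer first, write the
   non-zero status last. */
void sim_complete_read(sim_request *rq, const char *data) {
    size_t n = strlen(data);
    if (n >= sizeof rq->buffer)
        n = sizeof rq->buffer - 1;
    memcpy(rq->buffer, data, n);
    rq->buffer[n] = '\0';
    rq->iosb.count = (uint16_t)n;
    rq->iosb.status = 1;        /* assumed "success" value */
}

/* Return the data only once the IOSB is non-zero.  NULL means the
   request is still pending and the buffer must not be touched. */
const char *sim_read_result(const sim_request *rq) {
    if (rq->iosb.status == 0)
        return NULL;
    return rq->buffer;
}
```

  In this sketch a caller that sees NULL from sim_read_result treats the
  request as not yet complete, rather than as an error, and waits again.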
 
    o Incorrect shared memory synchronization.
 
      Variations include:
 
      o failure to use interlocked operations.
      o failure to correctly account for caching.  Depending on the
        operation and the platform, memory barriers may be required.
        Some of the possible memory caching policies include no
        caching, write-through caching, and write-back caching.
        (Please see Ask The Wizard topics (2681), (6984) and (7383)
        for further information on the correct use of memory barriers
        on Alpha systems.)
      o incorrect use of the interlocked queue operations, erroneously
        adding or removing entries at any location in the queue other
        than the header of an interlocked queue.
      o Uninterlocked sharing of any data between any AST(s) and the
        mainline threads -- all data structures must be interlocked
        or otherwise entirely re-entrant.
      o On Alpha, failure to use the memory barrier operators (when
        necessary) to ensure consistent memory contents -- memory
        barriers are used to properly control the (expected) read
        and write reordering normally found on Alpha.  The barrier
        will block execution until all pending memory operations
        have completed.  (Again, see topic (2681) for details, and
        for discussions of the memory barriers and particularly the
        granularity of the hardware interlocks, please see topics
        (6984) and (7383).)
      o Failure to account for "tearing" when performing non-aligned
        (non-naturally aligned) memory access on adjacent areas of
        memory.  You will need to know if the particular platform
        requires naturally-aligned quadword references, naturally-aligned
        longword references, or some other value.  Tearing involves
        references to memory that are not naturally aligned -- and
        that are otherwise unsynchronized -- and specifically involves
        parallel unaligned references to nearby areas of memory.
        * A variation of tearing involves the use of IPL-based or
          spinlock-based synchronization -- and multiple levels
          of these synchronization mechanisms -- within a single
          addressable unit of memory.  Individual bits within a
          status value longword, for instance, cannot be safely
          synchronized using multiple IPLs or multiple spinlocks.
        * A second and potentially more subtle variation of tearing
          involves the granularity of reference used by the particular
          compiler (e.g., CC/GRANULARITY), where the compiler can generate
          code that reads and re-writes adjacent values.  If, for
          instance, the adjacent memory is a device CSR, things can
          get rather interesting rather quickly.
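
  As a portable analogy for the interlocked-operation and memory-barrier
  items above, here is a minimal sketch using C11 <stdatomic.h>.  It does
  not use the OpenVMS or Alpha built-ins; the shared_status and mailbox
  types and their routine names are assumptions made up for this example.
  All updates to the status longword go through one atomic mechanism, and
  the payload is published with release ordering and read with acquire
  ordering.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define FLAG_READY 0x1u
#define FLAG_ERROR 0x2u

/* Status bits and an event counter shared between contexts.  Every
   sharer uses the same atomic read-modify-write operations. */
typedef struct {
    _Atomic uint32_t flags;
    _Atomic uint32_t events;
} shared_status;

void status_init(shared_status *s) {
    atomic_init(&s->flags, 0);
    atomic_init(&s->events, 0);
}

/* Atomically set bits; returns the flags as they were before the call. */
uint32_t status_set(shared_status *s, uint32_t bits) {
    return atomic_fetch_or(&s->flags, bits);
}

/* Atomically clear bits; returns the flags as they were before the call. */
uint32_t status_clear(shared_status *s, uint32_t bits) {
    return atomic_fetch_and(&s->flags, ~bits);
}

/* Atomically count one event; returns the new count. */
uint32_t status_count_event(shared_status *s) {
    return atomic_fetch_add(&s->events, 1) + 1;
}

/* One-slot mailbox: the payload is written first, then "ready" is
   stored with release ordering so the payload is visible to any reader
   that observes ready == true with acquire ordering. */
typedef struct {
    int payload;
    atomic_bool ready;
} mailbox;

void mailbox_init(mailbox *m) {
    m->payload = 0;
    atomic_init(&m->ready, false);
}

void mailbox_publish(mailbox *m, int value) {
    m->payload = value;
    atomic_store_explicit(&m->ready, true, memory_order_release);
}

bool mailbox_try_read(mailbox *m, int *out) {
    if (!atomic_load_explicit(&m->ready, memory_order_acquire))
        return false;           /* nothing published yet */
    *out = m->payload;
    return true;
}
```

  The design point is that a plain "flags |= bit" from two contexts is a
  read-modify-write that can lose an update, whereas the atomic forms
  cannot.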
 
    o Random programming errors:
 
      Variations include:
 
      o Failure to correctly deal with spurious sys$wake requests when
        using sys$hiber calls.  Topics (2637) and (3783) are related.
        The alternative to a spurious $wake is a lost $wake, and this
        can cause a process to stall waiting for the lost $wake.
        The Programming Concepts manual discussion of $hiber and $wake
        contains further information on this.
      o Failure to correctly size memory allocated and deallocated.
      o Failure to insert application-specific debugging into any large
        or complex application.  This includes logging.  (See topic
        (4129) for information on dynamic activation of the debugger,
        and on generating a traceback using a supported and documented
        API.)
      o Failure to centralize error-prone areas of the code into a few
        routines, and particularly failure to centralize all memory
        management calls into a few routines.  Centralization permits
        the use of "fenceposts" or similar techniques to track down
        memory pool corruptors -- allocating a "hidden" quadword at the
        front and the back of each allocation, filling both quadwords
        with known patterns unique to the particular memory allocation
        call, and checking for the pattern on deallocation.  (Additional
        details of using "fenceposts" are included in topic (3257).)
      o Writing to the contents of a descriptor for a dynamic string
        through any means other than the provided string descriptor
        routines.
      o Writing data through an uninitialized pointer.
      o Writing beyond the end of a data structure.
      o Failure to check for an appropriate return condition value from
        a subroutine or system service before continuing the execution
        of a routine.
      o On VAX, failure to execute an REI instruction prior to executing
        any instructions (code) that were written into memory by the
        application program itself.
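
  The first item in this list, spurious and lost wake requests, is
  usually handled by treating a wake as a hint and re-checking the real
  condition.  The following is a minimal sketch in portable C, not
  OpenVMS code: sim_hibernate is a simulation-only stand-in for the
  hibernate call, and the work_queue type and its fields are assumptions
  invented for this example.

```c
#include <stdatomic.h>

typedef struct {
    atomic_int pending;     /* queued work items: the real wake condition */
    int wakeups;            /* how many times the wait has returned */
    int spurious_left;      /* simulation only: wakes to deliver with no work */
} work_queue;

void work_queue_init(work_queue *q, int spurious_wakes) {
    atomic_init(&q->pending, 0);
    q->wakeups = 0;
    q->spurious_left = spurious_wakes;
}

/* Simulation-only stand-in for hibernating.  It returns at once, either
   spuriously (no work posted) or after posting one work item. */
static void sim_hibernate(work_queue *q) {
    if (q->spurious_left > 0)
        q->spurious_left--;
    else
        atomic_fetch_add(&q->pending, 1);
}

/* Wait until work is pending, then claim one item.  The predicate is
   tested BEFORE hibernating (so work posted earlier is not missed) and
   again AFTER every wake (so a spurious wake is harmless).
   Returns the running count of wake-ups seen so far. */
int wait_for_work(work_queue *q) {
    while (atomic_load(&q->pending) == 0) {
        sim_hibernate(q);
        q->wakeups++;
    }
    atomic_fetch_sub(&q->pending, 1);
    return q->wakeups;
}
```

  With two spurious wakes configured, the first wait in this simulation
  returns after three wake-ups; a later wait with work already queued
  returns without hibernating at all.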
 
    o SYSGEN parameter setting assumptions:
 
      Variations include:
 
      o Failure to specify the entire required process quota list on
        a sys$creprc call.
      o Failure to specify the mailbox size and buffer quota on a
        call to sys$crembx.
      o Failure to check the required SYSGEN quota values for the
        appropriate minimum values on each application or on each
        system startup.
 
    o Incorrect mixing of threads and ASTs:
 
      o See topics (4647) and (6099) for details.
 
    o And the Access Violation (ACCVIO), a brief introduction:
 
      o See previous discussions of the Access Violation (ACCVIO)
        and decoding the stackdump here in Ask The Wizard, and
        specifically please see topics (837), (1705), (2195),
        (2223), (3215), (5533), (6065), (6495), (6776), (7551),
        and likely a few other topics.
 
      o For information on the OpenVMS Debugger, on the "divide and
        conquer" troubleshooting technique, and for general details
        on how to debug an application, please see topics (7552) and
        (4129).
 
 
  Memory allocation routines are commonly blamed as sources of errors
  by many programmers.  The memory allocation and deallocation routines in
  the OpenVMS libraries see extensive use throughout the operating system,
  layered products and applications, and throughout the entire customer
  application base, and errors in these routines -- while certainly
  possible -- are exceedingly rare.  Most often, the application seeing
  the error has somehow clobbered a key part of the memory heap or has
  clobbered part of the stack.  This is why the Wizard recommends
  centralizing the allocation and deallocation routines, and using
  "fenceposts" (details of fenceposts discussed above).
 
      o Related topics include (2624), (2630) and (3748), as well as
        (3115), (3257), (4808), (5455), (5640), (6536), (7006), etc.
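
  The fencepost technique can be sketched as a pair of centralized
  wrappers.  This is a minimal illustration in portable C built on malloc
  and free, not on the OpenVMS library routines; the fence_* names, the
  header layout, and the pattern constant are assumptions invented for
  this example.  Each allocation gets a hidden quadword in front and
  behind, filled with a pattern derived from the block address, and the
  pattern is checked on deallocation.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hidden header placed ahead of the caller's memory. */
typedef struct {
    size_t   size;    /* caller-visible size, needed to find the back fence */
    uint64_t front;   /* front fencepost, directly ahead of the user data */
} fence_header;

/* Pattern unique to one allocation: a constant mixed with the address. */
static uint64_t fence_pattern(const void *block) {
    return 0xFEEDFACECAFEBEEFull ^ (uint64_t)(uintptr_t)block;
}

/* Allocate size bytes surrounded by front and back fenceposts. */
void *fence_alloc(size_t size) {
    if (size > SIZE_MAX - sizeof(fence_header) - sizeof(uint64_t))
        return NULL;                        /* total size would overflow */
    fence_header *h = malloc(sizeof *h + size + sizeof(uint64_t));
    if (h == NULL)
        return NULL;
    uint64_t pat = fence_pattern(h);
    h->size = size;
    h->front = pat;
    unsigned char *user = (unsigned char *)(h + 1);
    memcpy(user + size, &pat, sizeof pat);  /* back fence may be unaligned */
    return user;
}

/* Returns 0 if both fences are intact, 1 if the front fence is damaged,
   2 if the back fence is damaged, 3 if both are damaged. */
int fence_check(const void *user) {
    const fence_header *h = (const fence_header *)user - 1;
    uint64_t pat = fence_pattern(h);
    uint64_t back;
    memcpy(&back, (const unsigned char *)user + h->size, sizeof back);
    return (h->front != pat ? 1 : 0) | (back != pat ? 2 : 0);
}

/* Check the fences, release the block, and report any damage found. */
int fence_free(void *user) {
    if (user == NULL)
        return 0;
    int damage = fence_check(user);
    free((fence_header *)user - 1);
    return damage;
}
```

  A non-zero return from fence_free identifies which side of the block
  was overwritten, which narrows the search for the corruptor; a real
  application might log the caller or signal an error at that point.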
 
  Also please see the heap analyzer support in the OpenVMS Debugger -- the
  Debugger is an invaluable tool for locating and resolving errors, and can
  even be programmed to lurk waiting for an error, or to activate (via a
  call to lib$signal with SS$_DEBUG) and then display information on the
  error and the current application context.
 
  Related topics include (2681) and (6099).  Also see (2624) and (2630).
  Also the ACCVIO topics: (837), (1705), (2195), (2223), (3215), (5533),
  (6065), (6495), (6776), (7551), etc.  Also see the ASTs, threads,
  reentrancy, and shared memory topics: (2681), (4647), (6099), and (6984).
  And debugging and traceback topics such as (4129) and (7552).  For virtual
  memory debugging and memory heap corruptions, and fenceposts, see (3257).
 

  
     
     answer written or last revised on ( 15-SEP-2004 )