It took quite a while to fully understand this problem.  My thanks for the 
many suggestions, many of which proved valuable as debugging steps but 
none of which quite identified the problem.
We believe the core cause of this performance degradation was a change in 
Oracle's default use of the filesystem when run on Tru64 5.1A (as opposed 
to the original 4.0F).  On 5.1A, Oracle uses direct I/O by default 
(bypassing any operating system filesystem caching), whereas on 4.0F 
Oracle doesn't do this (in fact, the OS doesn't support direct I/O), so it 
uses the filesystem cache.
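For anyone who wants to confirm which behavior their own instance is 
using, one way (connected as SYS) is to look at the hidden Tru64 direct 
I/O parameter through the X$ views.  The query below is only a sketch - 
the exact hidden parameter name can differ between 8.1.7 patch levels (we 
saw it referred to as _TRU64_DIRECTIO_DISABLE, mentioned further down):

    -- Sketch only: run as SYS; hidden parameter names vary by patch level.
    SELECT a.ksppinm  AS parameter_name,
           b.ksppstvl AS current_value
      FROM x$ksppi  a,
           x$ksppcv b
     WHERE a.indx = b.indx
       AND LOWER(a.ksppinm) LIKE '%directio%';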
In our case, this difference in filesystem cache usage dramatically 
impacted a number of queries, most of which manipulated a large quantity 
of data by repeatedly reading the same data over and over (poor query 
design, but it hadn't caused problems in the past).  It would appear that 
prior to the 5.1A upgrade, the performance of these queries was being 
propped up by heavy use of the AdvFS filesystem cache at the OS level. 
Once we went to 5.1A, this cache wasn't being used (because Oracle was 
using direct I/O) and our queries fell victim to extreme I/O waits, since 
the SGA DB_BLOCK_BUFFERS weren't large enough to buffer as much data as 
had been held in the filesystem cache pre-OS upgrade.
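One way to see this kind of I/O wait from inside the database (a rough 
sketch of the sort of check one could run, not necessarily how we spotted 
it) is to look at the cumulative file read wait events:

    -- Rough sketch: cumulative waits since instance startup (8i).
    -- Large, fast-growing values for these two events while the slow
    -- jobs run point at physical read waits rather than CPU.
    SELECT event, total_waits, time_waited
      FROM v$system_event
     WHERE event IN ('db file sequential read',
                     'db file scattered read')
     ORDER BY time_waited DESC;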
From a purist standpoint, the problem could be considered a combination of 
several poorly designed queries (many repetitive reads on a large quantity 
of data) and an undersized data buffer cache for the instance (causing it 
to depend heavily on OS filesystem caching).  The OS upgrade's shift of 
Oracle I/O to direct I/O turned those causes into drastic performance 
slowdowns.
Our solution was to dramatically increase our data buffer cache 
(DB_BLOCK_BUFFERS) size, by approximately 20 times.  Another possible 
solution is to disable direct I/O usage by the Oracle RDBMS (which can be 
done in 8.1.7.2 by the _TRU64_DIRECTIO_DISABLE parameter, which an Oracle 
patch can also make available in 8.1.7.1).  Since that solution required a 
patch and our initial testing with it got mixed results, we opted for the 
data buffer cache increase (since we could make the memory available to 
support it).  We continue to test the direct I/O parameter in a more 
controlled environment (a dedicated test box set up for just this purpose) 
in case we need a fallback, but our performance is now back to roughly 
pre-upgrade levels.
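For illustration only - the values below are made up, not our production 
settings - the two remedies amount to init.ora changes along these lines. 
DB_BLOCK_BUFFERS is a static parameter in 8.1.7, so the instance has to be 
restarted, and the underscore parameter should only be set on Oracle 
Support's advice:

    # init.ora sketch - illustrative values only
    db_block_size    = 8192     # buffer cache = db_block_buffers * db_block_size
    db_block_buffers = 500000   # ~4 GB with 8 KB blocks (about 20x a 25000-buffer cache)
    # Alternative (8.1.7.2, or 8.1.7.1 with the patch mentioned above);
    # exact name and value syntax per Oracle Support:
    # _tru64_directio_disable = TRUE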
Once again, thanks for everyone's suggestions.  For anyone planning a 
4.0F to 5.X upgrade for a system running Oracle databases, I would 
strongly suggest you look closely at performance for large jobs and 
examine how caching is being used - the direct I/O change can have 
dramatic effects.
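As one cheap starting point for "examine how caching is being used", the 
classic buffer cache hit ratio from V$SYSSTAT can be snapshotted while the 
big jobs run, before and after the upgrade.  It's crude - Oracle counts a 
read satisfied by the OS filesystem cache as a "physical read" just the 
same - but it does show how much re-reading a job does and whether the SGA 
buffer cache is absorbing it (a sketch, run from SQL*Plus):

    -- Crude buffer cache hit ratio from cumulative statistics (8i).
    SELECT 1 - (phy.value / (cur.value + con.value)) AS buffer_cache_hit_ratio
      FROM v$sysstat cur, v$sysstat con, v$sysstat phy
     WHERE cur.name = 'db block gets'
       AND con.name = 'consistent gets'
       AND phy.name = 'physical reads';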
Thanks,
Tom
=================================================================
Tom Ackenhusen                                  tackenhu_at_fnal.gov
Business Systems Technical Services
Fermi National Accelerator Laboratory            Batavia, IL, USA
tackenhu_at_fnal.gov
Sent by: tru64-unix-managers-owner_at_ornl.gov
12/06/2002 10:39 AM
 
        To:     tru64-unix-managers_at_ornl.gov
        cc: 
        Subject:        Extremely slow Oracle performance after 5.1A upgrade
I am looking for any help or suggestions to address a drastic query 
performance problem we have encountered in our production Oracle database 
after doing a 4.0F to 5.1A PK3 upgrade.
The symptom being observed is that some data-intensive reporting jobs 
which previously ran in approximately 1 hour now take approximately 9 
hours.  Other reporting jobs and, as far as we can tell, the online 
transactions are not noticeably affected.  The application is Oracle 
Financials 11.0.3.  The database version is 8.1.7.1.
We have verified this problem is not corrected by any of the following 
actions:
- Upgrading the database to 8.1.7.4.
- Disabling AdvFS direct I/O use by Oracle 8.1.7.4 using the 
_TRU64_DIRECTIO_ Oracle parameter.  (Parameter isn't available in 8.1.7.1 
to test it.)
- Verifying our kernel parameters include vfs.fifo_do_adaptive=0 and 
vm.new_wire_method=0 (a sysconfig sketch for checking these follows this 
list)
- Verifying our IPC and proc kernel parameters meet the recommended 
settings from the Oracle installprep.sh script (this script examines the 
settings for max_per_proc_stack_size, per_proc_stack_size, 
max_per_proc_data_size, per_proc_data_size, max_per_proc_address_space, 
per_proc_address_space, SHMMAX, SHMMIN, SHMMNI, SHMSEG)
- Running on different systems (the problem occurs both on our production 
8400 with six 625 MHz processors, 6 GB RAM, and an external HSZ-80 RAID 
array, and on our test ES40 with two 833 MHz processors, 4 GB RAM, and an 
internal RAID controller) - the same performance problem occurs on both (I 
just wish performance testing had been done by the testing team PRIOR to 
the production upgrade....<sigh>)
- Running syscheck on both systems doesn't yield any significant messages 
and doesn't flag any concerns in the Oracle area
- Monitoring CPU usage doesn't show significant load averages or CPU 
usage, although there are many Oracle processes in a WAIT state (as shown 
by top)
- Doing explain plans on the queries shows they are using indices as 
expected
- Recreating the indices (testing the theory they might be corrupt) 
doesn't fix the problem
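For reference, a sketch of how the kernel parameter checks above can be 
done (illustrative commands - check the sysconfig/sysconfigdb man pages 
for the exact syntax on your patch level):

    # Show the current (running) values of the two attributes:
    sysconfig -q vfs fifo_do_adaptive
    sysconfig -q vm new_wire_method
    # Change them at runtime (takes effect immediately, lost at reboot):
    sysconfig -r vfs fifo_do_adaptive=0
    sysconfig -r vm new_wire_method=0
    # For permanent settings, add the vfs:/vm: stanzas to /etc/sysconfigtab
    # (e.g. with sysconfigdb) and reboot.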
We have not yet verified if this same problem is affecting our other 
databases on the same system - right now our only verified problems are on 
the Oracle Financials application.  However, the size of the database 
(16 GB), tables (up to 12 million rows and 1.5 GB for the largest), and 
jobs (up to an hour of runtime before this problem) are an order of 
magnitude larger for this application than for our other ones.  As you can 
see, however, the database is not particularly large.  We have actually 
been running in production since 11/16/2002 without any reported problems 
- the problems were only reported once our monthly accounting close 
reports were run at the beginning of December.
There were no issues in doing the OS upgrades - we used the path of 
4.0F->4.0G->5.1->5.1A->5.1A+PK3.  We did NOT upgrade the AdvFS domains 
(which sit under the database files) to V4 - we are still using V3.
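In case it helps anyone checking the same thing: if I remember the output 
correctly, the domain version shows up in the header that showfdmn prints 
(a 3 or 4 in the Version column):

    # Illustrative only - replace <domain_name> with the AdvFS domain name
    showfdmn <domain_name>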
Does anyone have any suggestions on what one might try, documentation on 
possible diagnostic steps, or relevant tools to use?
Any suggestions would be appreciated.  I will summarize with what works 
when we resolve the problem.
Thanks,
Tom
=================================================================
Tom Ackenhusen                                  tackenhu_at_fnal.gov
Business Systems Technical Services
Fermi National Accelerator Laboratory            Batavia, IL, USA
Received on Mon Dec 23 2002 - 05:33:05 NZDT