HP OpenVMS Systems

ask the wizard

Floating Point Handling?


The Question is:

 
This may seem simple but when I run this BASIC code,
 
1       DECLARE DOUBLE  TMP.DBL
        DECLARE LONG    TMP.LONG
        TMP.DBL=39.80*100.
        TMP.LONG=TMP.DBL
        PRINT TMP.LONG
32767   END
 
I get the following output:
 
ALPHA::ARS$ R TEST_TYPE_CAST
 3979
ALPHA::ARS$
 
Why is this number 1 less than it should be?  There was no subtraction
 involved.  I came across this while I was doing a type cast between double
 and long in one of my payment processing programs. I ran this on two ALPHA
 systems so far and the same output came up for both. For some reason I
 could not get another number to do the same (e.g. 43.50) other than 39.80,
 any guesses?
 
Thanks!

The Answer is :

 
    VAX and Alpha systems, like just about every modern computer, represent
    real numbers in a binary floating-point format.  Floating point refers
    to numbers being represented internally with the radix point adjusted
    so that the number's fraction is always between 0.5 and 1.  This is
    similar in concept to 'scientific notation' of very large or small
    numbers being represented in a notation such as  "6.02 * 10^23" where
    "10^23" represents an exponent indicating a large power of 10.
 
    Internally, a floating point value is stored as a combination of three
    components:
 
            - A base-2 fraction of a certain number of digits
            - An exponent (in powers of 2)
            - A 1 bit sign
 
    Some common floating point formats on VAX and Alpha are as follows
    (note that not all formats are 'native' to both architectures; the
    fraction counts include the assumed leading bit described below, and
    each format also has one sign bit):
 
            Data Type               Bits    Fraction Bits   Exponent Bits
            ---------------         ----    -------------   -------------
            F Floating              32      24              8
            D Floating              64      56              8
            G Floating              64      53              11
            H Floating              128     113             15
            IEEE S Floating         32      24              8
            IEEE T Floating         64      53              11
 
    For the F Floating format, the fraction is 24 binary digits (bits),
    and the exponent is 8 bits.  The exponent is the power of 2 which,
    when multiplied by the fraction, gives the value.  In addition, things
    are manipulated so that the fraction's leftmost digit is always 1 -
    this is called "normalization" - and the exponent adjusted
    accordingly.  Since that bit is always 1, there is no need to store
    it, so it is assumed.  So the fraction "f" is always in the range (0.5
    <= f < 1).  Note that the fraction is 24 bits long, but only 23 bits
    are stored.  A sign bit is included as well, so there is 1 sign bit, 8
    exponent and 23 fraction bits actually stored in memory for F Floating
    format.
 
    The exponent for F Floating can range from -127 to +127, and is stored
    by adding 128 to the exponent value - this is called "biasing".  A
    stored exponent of zero is reserved - if the sign is positive, then
    the value is zero, regardless of the fraction.  If the sign is
    negative, this is called a "reserved operand", and generates an
    exception if it is used.
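 
    As a rough illustration of the fraction-and-exponent split, the
    sketch below uses Python purely as a calculator.  The standard
    math.frexp function returns a fraction in the range 0.5 <= f < 1 and
    a power-of-2 exponent, which matches the normalization described
    above.  Python works on IEEE doubles rather than VAX formats, and
    biased() is a hypothetical helper that merely applies the excess-128
    rule for F Floating.

```python
import math

def split(value):
    """Return (fraction, exponent) with 0.5 <= fraction < 1 and
    value == fraction * 2**exponent."""
    return math.frexp(value)

def biased(exponent):
    """Apply the F Floating excess-128 bias (hypothetical helper)."""
    return exponent + 128

for v in (1.0, 0.05, 39.8):
    f, e = split(v)
    print(v, "=", f, "* 2 **", e, " stored exponent:", biased(e))
```

    For 1.0 this reports a fraction of 0.5 and an exponent of 1, which
    is stored as 129.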
 
    Let's take a simple real-world example - the number 1.  Remembering
    that the fraction is between 0.5 and 1 (but less than 1), we have to
    represent this as a fraction of 0.5 and an exponent of 1 (0.5 times
    2**1). 0.5 can be exactly expressed as a binary fraction, so there's
    no problem with this.  The bits would work out this way:
 
            Sign:     0 (positive), goes in bit 15
            Exponent: 1, biased with 128 gives 129, bits 14:7
            Fraction: 0.5, or in binary, 0.100000000000000000000000
                      bits 6:0 and 31:16 (23 actual bits plus hidden bit)
 
    Putting all the bits together we get:
 
            3              111      00     0
            1              654      76     0
            ffffffffffffffffseeeeeeeefffffff
            00000000000000000100000010000000
 
    or in hex:
 
            0   0   0   0   4   0   8   0
 
    D Floating format is the same as F Floating except that it has another
    32 fraction bits available (all zero in this case).
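 
    The packing steps above can be sketched in Python as follows.  This
    is only an illustration under stated assumptions: encode_f_floating
    is a hypothetical name, it does no overflow or underflow checking,
    and it starts from a Python (IEEE double) value rather than from
    anything a VAX or Alpha would hold.

```python
import math

def encode_f_floating(value):
    """Pack a Python float into a 32-bit F Floating longword (sketch)."""
    if value == 0.0:
        return 0
    sign = 1 if value < 0 else 0
    frac, exp = math.frexp(abs(value))        # 0.5 <= frac < 1
    frac24 = math.floor(frac * 2**24 + 0.5)   # round to 24 fraction bits
    if frac24 == 1 << 24:                     # rounding carried out of the fraction
        frac24 >>= 1
        exp += 1
    stored = frac24 & 0x7FFFFF                # drop the hidden leading 1 bit
    # Bits 15:0 hold the sign, biased exponent and top 7 fraction bits.
    low_word = (sign << 15) | ((exp + 128) << 7) | (stored >> 16)
    # Bits 31:16 hold the remaining 16 fraction bits.
    high_word = stored & 0xFFFF
    return (high_word << 16) | low_word

print(f"{encode_f_floating(1.0):08X}")   # 00004080
```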
 
    The first thing you can see is that since we only have 24 fraction
    bits, we are limited in the accuracy to which we can store values.  24
    binary fraction digits translates roughly to 6 decimal digits, so if
    we have a value with more than 6 significant decimal digits, it's
    unlikely it can be represented accurately in F Floating.  We'll choose
    the closest representation we can in 24 bits.
 
    It is important to realize that "nice, clean" decimal fractions such
    as 0.1 and 0.05 don't translate to "nice, clean" binary fractions.  In
    fact, they end up as repeating fractions, where you can keep adding
    bits forever and you'll never get it exactly right.  Normalized, .05
    is 0.8 times 2**-4, and the binary fraction for that 0.8 looks like:
 
                0.110011001100110011001100110011001100... ad infinitum
                                         ^
            The 24th fraction bit is here|
 
    And since the next bit is 1, we'll round up, and thus the F Floating
    value will be slightly higher than .05.  How "slightly"?  Well, the F
    Floating value of CCCD3E4C turns out to be in decimal:
 
            0.05000000074505806
 
    What would we have gotten if we didn't round, and left the 24th bit
    zero?  The hex would be CCCC3E4C and in decimal:
 
            0.04999999701976776
 
    which is about four times further away from .05 than the first value.
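 
    To check the two decimal values, the bit layout given earlier can be
    reversed.  The following Python sketch is a hypothetical decoder
    (decode_f_floating is not a system routine), and it ignores reserved
    operands.

```python
def decode_f_floating(longword):
    """Convert a 32-bit F Floating bit pattern to a Python float (sketch)."""
    low_word = longword & 0xFFFF          # sign, exponent, top 7 fraction bits
    high_word = longword >> 16            # low 16 fraction bits
    sign = -1.0 if (low_word >> 15) & 1 else 1.0
    biased_exp = (low_word >> 7) & 0xFF
    if biased_exp == 0:                   # stored exponent of zero means 0.0
        return 0.0
    # Restore the hidden leading 1 bit in front of the 23 stored bits.
    fraction = (1 << 23) | ((low_word & 0x7F) << 16) | high_word
    return sign * fraction / 2**24 * 2.0 ** (biased_exp - 128)

rounded = decode_f_floating(0xCCCD3E4C)
truncated = decode_f_floating(0xCCCC3E4C)
print(rounded, truncated)
print(abs(rounded - 0.05) < abs(truncated - 0.05))   # True
```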
 
    Now take this F Floating value and convert it to D Floating.  This is
    done by tacking on 32 extra fraction bits of zero.  But since the
    original F value is only correct to 24 bits, the D value isn't going
    to be any better.  We'll end up with hex 00000000CCCD3E4C which is
    exactly the same decimal value as above.
 
    If we had started out by converting 0.05 to D Floating, adding 32 bits
    of precision, we'd STILL get a repeating fraction, but the rounding
    error would be much further out.  In hex we'll get:
 
            CCCDCCCCCCCC3E4C
 
    Certainly different than the F-converted-to-D value above. This is
    good to at least 16 decimal digits, but again isn't EXACTLY .05 but
    slightly higher.  You could go to H Floating and get a whopping 113
    fraction bits for about 33 decimal digits of accuracy, but you'd STILL
    not have exactly the right answer.
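 
    The F and D bit patterns quoted above can be reproduced with exact
    rational arithmetic.  The Python sketch below makes several
    assumptions: encode_vax is a hypothetical helper, it handles only
    positive values, and it assumes the 8-bit excess-128 exponent shared
    by F and D Floating.

```python
from fractions import Fraction

def encode_vax(value, fraction_bits):
    """Return the hex image of a positive value in F (24 fraction bits)
    or D (56 fraction bits) Floating format (sketch)."""
    value = Fraction(value)
    exp = 0
    while value >= 1:                      # normalize to 0.5 <= value < 1
        value /= 2
        exp += 1
    while value < Fraction(1, 2):
        value *= 2
        exp -= 1
    frac = int(value * 2**fraction_bits + Fraction(1, 2))   # round to nearest
    if frac == 1 << fraction_bits:         # rounding carried out of the fraction
        frac >>= 1
        exp += 1
    stored = frac & ((1 << (fraction_bits - 1)) - 1)        # drop hidden bit
    rest_bits = fraction_bits - 8          # fraction bits after the first word
    words = [((exp + 128) << 7) | (stored >> rest_bits)]
    for shift in range(rest_bits - 16, -1, -16):
        words.append((stored >> shift) & 0xFFFF)
    # The first word sits in the low-order bits, so print words in reverse.
    return "".join(f"{w:04X}" for w in reversed(words))

one_twentieth = Fraction(1, 20)            # exactly .05
f_image = encode_vax(one_twentieth, 24)
print(f_image)                             # CCCD3E4C
print("00000000" + f_image)                # F widened to D with zero bits
print(encode_vax(one_twentieth, 56))       # CCCDCCCCCCCC3E4C
```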
 
    So when dealing with floating point, remember that you've only got an
    approximation of the value you want.  Sometimes it's exactly right,
    when the fraction can be exactly expressed, but often it isn't,
    especially when dealing with decimal fractions.
 
    And the other thing to remember is that simply converting a value from
    single-precision to double-precision doesn't magically conjure up
    those fraction bits that got chopped off in the first place.  Choose
    your initial precision wisely, and don't necessarily believe that
    those last decimal digits you print out are meaningful.
 
    When arithmetic is performed on these approximations of decimal
    values, the error is compounded and propagated to the final result.  So
    tiny differences in conversion can result in much larger errors later
    on.  Obviously, multiplication or division can magnify the differences
    even further.
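 
    A small Python sketch of this effect follows.  It assumes IEEE
    single precision as a stand-in for F Floating, since both carry 24
    significant bits, and to_single is a hypothetical helper built on
    the standard struct module.

```python
import struct

def to_single(x):
    """Round a Python float to 24 significant bits (IEEE single)."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

approx = to_single(0.05)
conversion_error = approx - 0.05
scaled_error = approx * 1_000_000 - 50_000

print(conversion_error)   # roughly 7.5e-10
print(scaled_error)       # roughly 7.5e-4 after multiplying by a million

# Repeated addition accumulates error as well.
total = 0.0
for _ in range(10):
    total += 0.1
print(total == 1.0)       # False
```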
 
    Because in many cases an exact decimal number (for example, .05 as
    shown previously) does not accurately convert to a binary number, it
    is important to remember that these numbers are approximate when
    stored in binary floating point format.  This accounts for the common
    advice that financial data and calculations representing dollars and
    cents should not use floating point numbers.
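 
    One common way to follow that advice is to hold monetary amounts as
    whole cents.  The Python sketch below is only an illustration:
    to_cents is a hypothetical helper, and the standard decimal module
    is used so that the text "39.80" never passes through binary
    floating point.

```python
from decimal import Decimal

def to_cents(amount_text):
    """Parse a decimal string such as '39.80' into integer cents."""
    cents = Decimal(amount_text) * 100
    if cents != cents.to_integral_value():
        raise ValueError("amount has fractions of a cent")
    return int(cents)

price = to_cents("39.80")
print(price)            # 3980
print(price * 3)        # 11940, exact integer arithmetic
```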
 
    In case you are tempted to "check" your binary computer's floating
    point arithmetic operations by using your "pocket" calculator, be
    aware that the results rarely agree.  Calculators almost always use
    BCD (Binary Coded Decimal) numeric representation, so their results
    tend to be more nearly "exact".  Unfortunately, BCD calculations tend
    to be quite slow, which is why computers tend to use floating point
    arithmetic natively.  Software is usually used to handle BCD
    operations (though the VAX architecture does describe optional support
    for Decimal-string instructions).
 
    The OpenVMS Wizard would also encourage you to review the article
    "What Every Computer Scientist Should Know About Floating-Point
    Arithmetic" by David Goldberg of the Xerox Palo Alto Research Center
    (available on the internet in several places including
    http://docs.sun.com/source/806-3568/ncg_goldberg.html).
 
    All that nonsense out of the way, this program appears to run just as
    would be expected.  OpenVMS Alpha V7.3-1 (all patches installed), with
    BASIC V1.4-000.  Never, never, never use floating point format for
    financial data, as floating point is, will be, and always has been
    an approximation -- most accountants will prefer integer values for
    monetary data, whether stored in a longword or a quadword.
 
    $ type x.bas
    1       DECLARE DOUBLE  TMP.DBL
            DECLARE LONG    TMP.LONG
            TMP.DBL=39.80*100.
            TMP.LONG=TMP.DBL
            PRINT TMP.LONG
    32767   END
    $ basic x
    $ link x
    $ run x
     3980
    $
 
    And the same on OpenVMS VAX V7.3 (all patches installed), BASIC
    V3.9-000.
 
    $ basic
 
    VAX BASIC V3.9-000
 
    Ready
 
    1       DECLARE DOUBLE  TMP.DBL
            DECLARE LONG    TMP.LONG
            TMP.DBL=39.80*100.
            TMP.LONG=TMP.DBL
            PRINT TMP.LONG
    32767   END
    run
    NONAME    19-JUN-2003 19:04
 
     3980
    Ready
 
 
  You could also resolve the current and undesired (but entirely correct
  and valid) result with the following change to the code:
 
    	 TMP.DBL='39.80'D*100.
 
  Put another way, you cannot represent 39.8 exactly in a binary floating
  point value, and assigning the result to an integer will truncate it.
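 
  The truncation can be seen in miniature with the Python sketch below.
  It assumes IEEE double arithmetic, which is what Python uses.  Other
  formats, such as D Floating, carry a different number of fraction bits
  and can round the product differently.

```python
product = 39.80 * 100.0      # IEEE double arithmetic
print(product < 3980)        # True: the product falls just short of 3980
print(int(product))          # 3979, conversion to integer truncates
print(round(product))        # 3980, rounding to the nearest integer instead
```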
 
  You will also want to read the following information in the BASIC
  HELP library:
 
    $ HELP/LIBRARY=BASICHELP CONSTANTS Literal_notation
 

  
     
     answer written or last revised on ( 1-JUL-2003 )