
===================
Miscellaneous notes
===================

I don't really like calling a  large  section  'miscellaneous',  but
there  are  just  too many things that don't fit properly into other
categories.


I've tried hard to  make  the  program  portable, but let's face it,
it's a big, complex numerical program.  I can't  test  every  system
and every compiler.

I've  cleaned up every compiler warning I've found, but that doesn't
really mean all that much.

I've  added a replacement malloc() package for DJGPP.  This is based
on D.J. Delorie's stuff on  his  web  page.   As I've said many many
times,  DJGPP  has  an  idiotic  malloc()  package  that  wastes  an
incredible amount of memory.   It  doesn't even compact memory after
you've free()ed it!  D.J. Delorie should be  severely  punished  for
his  use of that one.  (Sic the I.R.S. on him!)  What kind of a jerk
would chose a lame malloc like that for inclusion in a compiler?

Also in the malloc6.c file,  I've  added a special AlignMalloc() and
AlignFree() to align malloc() calls onto a 256 byte  boundary.  Many
compilers,  including  Borland  C 4.52 and Watcom 10, don't allocate
memory  that  is aligned for efficiency.  They often only align data
on 4 byte boundaires, even though many systems, such as Pentiums, do
much better on 8 or 16 byte boundaries.  These versions (selected in
config.h) should help.  Provided they work  on  your  system.   They
should,  since  they  are  just  a  fairly simple wrapper around the
standard functions.  I didn't take  any  shortcuts, and made them as
open as possible, to avoid compiler problems.  But, I allow for just
in case.  The regular versions also has the ability (selected with a
#if 1)  to  detect  and  warn  you  about  poorly  aligned  malloc()
pointers, although it doesn't try and do anything about it.

I  added  a  few DJGPP specific things to show how much physical and
virtual  memory  is  available,  and  to  let you know if disk write
caching is enabled and  you  are  doing the disk_num version.  These
things are, of course, wrapped  by #ifdef's, so they shouldn't cause
you problems.

In the Fast AGM, I changed the FullMul() to a new FullSquare().  The
Square can be done  very  slightly  faster  than a regular FullMul()
squaring.  Not a heck  of  a  lot,  but  a  little.  Based on actual
timings for the FFT, it'd be about 5% faster, but due to  the  extra
overhead,  it's  nearly  invisible.   It  might show up at very high
computations, but it would  still  be  very little.  The reason this
was done wasn't speed.  Instead, by doing it this  way,  it  removes
the one place in the FastAGM based pi computation  where  we  really
did  do  a  full  sized  multiplication.   Removing that reduces the
memory requirements for  the  multiplication, and essentially raises
the limit of the FFT/NTT based multiplication.  (Meaning that as far
as the, for example, FFT based  multiply is concerned, we can now do
16 million digits of pi with 32m, rather than  only  8m.)   I  tried
this  back in my experimental v1.5 days, but I removed it because it
wasn't worth the effort.  There  was  no savings in time or increase
in maximum range for the FFT multiplication.  However, with  my  new
AGM  style, it's worthwhile because now, in the entire program, I am
only doing  one  FULL  sized  multiplication  per  iteration, and it
happens to be a square.  Before, I was having  to  do  a  two  value
multiply.   (Looking  through Ooura's program is what reminded me of
it.  Of course, Ooura doesn't  use  it.   It's just dead code in his
program.)

I  added  an .ini  management  file  that  I wrote last year for the
Portable D-Flat project (D-Flat  was  Al  Steven's Dr. Dobbs Journal
CUA text based interface.  I spent some time reworking it and making
it much more portable etc.  Although I fixed  many  areas,  I  never
finished it.)  That file is  in  the  public domain.  It allows nice
and descriptive  labels,  sections,  comments,  etc.   And  you  can
finally enter 32 megabytes as 32m instead of 33554432.

I  changed the disk based numbers from using one single file, to the
use of a seperate  file  for  each  big  variable.  I wrote the v2.x
program for my own use and to up to 'only' 32 million  digits.   And
all the variables combined easily fit into one file.  However, there
have  been  a few people semi-seriously talking about computing 256m
digits and beyond with this  program!   Although the ntt586 tops out
at 256m, the fractal mul could let that go higher.  And the basic 32
bit eight 8 prime NTT can go all the way to 1 billion  digits.   And
the  C language limits the fseek() offseting to only 31 bits.  Since
there were 8 variables of  'Digits/2'  bytes  long, plus a few extra
for overhead (or 10 'Digits/2' long plus overhead, if the FractalMul
kicked in), the program was  limited  by C's file I/O ability.  (DOS
has a file limit of 4gb-1, so that could be a problem, but C's  file
limitation comes first.)  (You might  suggest  that I simply use C's
'fpos_t' type which is guaranteed to  contain  all  the  information
needed to access any position in  a  file.  The catch is that fpos_t
can _still_ be a regular signed integer.  So it wouldn't help.)

By changing the  program  to  use  seperate  files,  each file could
easily hold 1 billion decimal  digits.  That would only consume 512m
each.  If, by some chance, the FractalMul() kicked  in,  that  would
only be at most Digits*2, or a max of 1gb.  Since each NTT works out
to  be Digits/4 bytes, each part would only be 256m.  BUT, I put all
8 into a single file, so  that  file will total 2gb!  That shouldn't
cause a problem, though, because I won't be fseek()'ing the end, and
based on my reading of the C standard, the fseek()'ing is  the  part
that causes the problem.  Simply reading up to 2gb should be safe.

So to allow for all those things, say just (did I say 'Just'??!)  1
billion  (1024m)  digits.   That  should  certainly  be enough for a
while!  After all, that will  take  a  minimum of 256m of memory for
the NTT.

[Actually, I'm not sure  that  the  rest  of  the program would work
properly!  Those ranges are so far beyond what I was expecting  when
I  wrote  the  prorgram,  that  I'm  just not sure.  I might have to
implement an entire  file  I/O  system  based  on multiple files and
positions and lengths that are 'double'  instead  of  'size_t'.   It
could  probably be done, but maaaaaan, it'd be better to just switch
to a system dependant file  size  type,  or the new ANSI/ISO C which
does 64 bits  for  the  file  sizes  and positions.  Those solutions
would be fairly simple.  Just a matter  of  changing  the  file  I/O
functions  and  the few places where they are called with a position
or length that would be larger than old ANSI/ISO C limits.  Although
I think it would work up to 1 billion digits,  without  encountering
any  C  limitations,  I  can't guarantee it.  As I said, ranges like
this are so far beyond  what  I  originally planed that I'm just not
sure.]

I added some code  to  attempt  to  detect  a decent setting for the
cache size.  If you specify a '0' as the cache size setting  in  the
pi.ini file, the program will run some simple tests, and then update
the  pi.ini file with what it thinks is a good guess.  Of course, it
never  does  this if you specify a size other than 0.  This will not
be optimal, just a 'good guess'.  This may  take  a  minute  or  so.
Also,  what is actually the best may vary depending upon whether you
are using the NTT or FFT.



