
improve iterations:
- sqrt_iteration() costs 3 mults; schoenhage states
  that sqrt can be done with only 2.33 mults
- inverse_iteration() costs 2.66 mults;
  implement an inverse that costs only 2 mults
- ? do the 2.33 and 2 include fft caching ?
- improve iroot_iteration() (inverse root)
- possibly also return the root from iroot_iteration()
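
for reference, a minimal sketch of the second-order newton steps behind
an inverse and an inverse square root, written with python's decimal
module rather than hfloats. the function names, the float seed and the
GUARD digits are assumptions for illustration only; the sketch runs
every step at full precision and does not try to reproduce the
operation counts quoted above.

```python
from decimal import Decimal, localcontext

GUARD = 10  # extra working digits to absorb rounding (an assumption)

def inverse(d, digits):
    """Approximate 1/d via the Newton step x <- x + x*(1 - d*x)."""
    with localcontext() as ctx:
        ctx.prec = digits + GUARD
        d = Decimal(d)
        x = Decimal(1.0 / float(d))       # float seed, roughly 16 good digits
        good = 15                         # conservative count of correct digits
        while good < digits:
            x = x + x * (1 - d * x)       # two multiplications per step
            good *= 2                     # correct digits roughly double
        return +x

def inverse_sqrt(d, digits):
    """Approximate 1/sqrt(d) via x <- x + x*(1 - d*x*x)/2."""
    with localcontext() as ctx:
        ctx.prec = digits + GUARD
        d = Decimal(d)
        x = Decimal(float(d) ** -0.5)     # float seed
        good = 15
        while good < digits:
            x = x + x * (1 - d * x * x) / 2   # three multiplications per step
            good *= 2
        return +x

def sqrt(d, digits):
    """sqrt(d) = d * (1/sqrt(d)): one more multiplication at the end."""
    with localcontext() as ctx:
        ctx.prec = digits + GUARD
        return Decimal(d) * inverse_sqrt(d, digits)
```

a real implementation would run the early steps at reduced precision,
so that the total cost is dominated by the last one or two steps.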

implement division that is faster than inverse followed by multiply

implement rprimesum() with schoenhage's agm

implement fft cache:
- reusing fft
- enlarging ffts
- special version of pow()

better multiply for numbers of different length

for faster iroot_iteration():
version of pow(a, aprec, ex, b, bprec)

streamlined version of agm(1,rx^-n)

fused multiply-add:  h1 = h1*h2*(i/j)+h3

ratio-add: h = h + (i/j) == (j*h + i)/j

temporary hfloats:
- build a temporary stack and funcs get_tmp_hf(), let_tmp_hf()
- let the programmer explicitly register temporarily
  unused hfloats as candidates for temporaries
- (possibly dangerous) temp hfloats in workspace
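
a rough sketch of how the temporary stack could behave. only the names
get_tmp_hf() and let_tmp_hf() come from the list above; the TmpStack
class, the factory argument and the `allocated` counter are
hypothetical, and a bytearray stands in for an hfloat.

```python
class TmpStack:
    """Pool of reusable temporaries, handed out last-in first-out."""

    def __init__(self, make):
        self._make = make     # factory that builds a fresh temporary
        self._free = []       # released temporaries waiting for reuse
        self.allocated = 0    # how many temporaries were ever created

    def get_tmp_hf(self):
        """Return a released temporary if one exists, else build a new one."""
        if self._free:
            return self._free.pop()
        self.allocated += 1
        return self._make()

    def let_tmp_hf(self, hf):
        """Give a temporary back; the caller must not use it afterwards."""
        self._free.append(hf)


pool = TmpStack(lambda: bytearray(16))   # stand-in for an hfloat constructor
t1 = pool.get_tmp_hf()
pool.let_tmp_hf(t1)
t2 = pool.get_tmp_hf()                   # reuses the buffer released above
```

the same let_tmp_hf() entry point could also serve the second idea:
registering a temporarily unused hfloat simply pushes it onto the
free list.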

logarithm with one agm only (pi and log 2 precalculated) cf. salamin:
b := # of bits
log(2) = pi/(2*m*agm(1,4/2^m))         where m>b/2
log(t) = pi/(2*agm(1,4/s)) - m*log(2)  where s = t*2^m > 2^(b/2)
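
the two formulas can be checked in double precision. this is only a
sketch: b = 53 and m = 30 are chosen for python floats, pi and log(2)
are taken from the math module as the precalculated constants, and the
names agm(), log2_agm() and log_agm() are made up for the example.

```python
import math

B = 53   # b: bits of precision of a double
M = 30   # m > b/2

def agm(a, b):
    """Arithmetic-geometric mean of two positive floats."""
    for _ in range(64):                  # converges long before this limit
        if abs(a - b) <= 4e-16 * a:
            break
        a, b = (a + b) / 2, math.sqrt(a * b)
    return a

def log2_agm(m=M):
    """log(2) = pi/(2*m*agm(1,4/2^m))"""
    return math.pi / (2 * m * agm(1.0, 4.0 / 2.0**m))

def log_agm(t, m=M, log2=math.log(2.0)):
    """log(t) = pi/(2*agm(1,4/s)) - m*log(2), with s = t*2^m"""
    s = t * 2.0**m
    if s <= 2.0 ** (B / 2):
        raise ValueError("need s = t*2^m > 2^(b/2); use a larger m")
    return math.pi / (2 * agm(1.0, 4.0 / s)) - m * log2
```

the subtraction of m*log(2) cancels leading digits, so some guard bits
are needed when the full b bits are wanted in the result.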

precision:
- some excess LIMBs to generally enhance precision

avoid excessive copying

avoid shifting (? pad LIMBs?)

implement binary splitting with accelerated arctan-series (cf. arith)
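
a minimal sketch of binary splitting on an arctan series, using python
integers. note the substitution: it uses the plain taylor series of
arctan(1/q) together with machin's formula, not the accelerated series
mentioned above, and all function names and the guard digits are
assumptions.

```python
import math

def bs(a, b, q2):
    """Return (P, Q, B, T) for terms a..b-1 of sum (-1)^k / ((2k+1) * q2^k).

    The partial sum over the range equals T / (B * Q), relative to the
    product P/Q of all term ratios to the left of the range.
    """
    if b - a == 1:
        p = 1 if a == 0 else -1          # ratio numerator of term a
        q = 1 if a == 0 else q2          # ratio denominator of term a
        return p, q, 2 * a + 1, p
    m = (a + b) // 2
    pl, ql, bl, tl = bs(a, m, q2)
    pr, qr, br, tr = bs(m, b, q2)
    # combine the two halves over the common denominator bl*br*ql*qr
    return pl * pr, ql * qr, bl * br, br * qr * tl + bl * pl * tr

def arctan_inv(q, digits):
    """arctan(1/q) * 10**digits, truncated to an integer."""
    n = int(digits / (2 * math.log10(q))) + 2    # number of series terms
    _, qq, b, t = bs(0, n, q * q)
    return (t * 10**digits) // (b * qq * q)

def pi_digits(digits, guard=10):
    """pi * 10**digits as an integer, via pi = 16 atan(1/5) - 4 atan(1/239)."""
    d = digits + guard
    s = 16 * arctan_inv(5, d) - 4 * arctan_inv(239, d)
    return s // 10**guard
```

all work is done on exact integers and only one big division is needed
per series, which is what makes the scheme attractive for fast
multiplication.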

implement toom-cook as mass storage multiply
