Below is a comparison of the performance
of fht_fft() vs. pfa_fft().

The run was on an i486/100; results for other processor types
(especially those with more registers) might be VERY different.
(Note added: on a Pentium/133 pfa_fft() looks even worse.)

pfa_fft() is up to 20% faster in some cases (j=4096 to j=131072 below),
but it is not faster in general.
Note its bad performance for small lengths: up to j=1024 it runs
at less than half the speed of the fft.
These figures led me to the decision not to use pfa_fft()
for the hfloat package.
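
As a quick check of the "up to 20%" figure, the relative speed can be recomputed from the Hz column. The sketch below uses a handful of rows copied from the listing; the helper name speedup() is made up for this illustration and is not part of any package.

```python
# (j, fft Hz, pfa_fft Hz), copied from a few rows of the listing
rows = [
    (32, 6159.4, 1640.0),
    (1024, 93.26, 43.39),
    (4096, 15.629, 16.931),
    (32768, 1.6097, 1.9277),
    (524288, 0.079491, 0.071633),
]

def speedup(fft_hz, pfa_hz):
    """Rate of pfa_fft relative to fft; a value above 1 means pfa_fft is faster."""
    return pfa_hz / fft_hz

for j, fft_hz, pfa_hz in rows:
    print(f"j={j:7d}  pfa_fft/fft = {speedup(fft_hz, pfa_hz):.2f}")
```

The ratio peaks at roughly 1.2 for j=32768 and is below 1 at both ends of the range.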

dt is the time elapsed for the calls, in seconds.
mct is the number of calls of the FFT routine that were made for timing.
The last two columns are mct/dt (calls per second, Hz) and dt/mct (sec/op).
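
For reference, here is a minimal sketch of how such figures can be produced. It is not the original timing code: time_calls() and rates() are hypothetical names, and the Python timer stands in for whatever clock the original benchmark used.

```python
import time

def rates(dt, mct):
    """Return (calls per second, seconds per call) for mct calls taking dt seconds."""
    return mct / dt, dt / mct

def time_calls(func, mct):
    """Call func() mct times and return the elapsed wall-clock time in seconds."""
    t0 = time.perf_counter()
    for _ in range(mct):
        func()
    return time.perf_counter() - t0

# Recompute the last two columns of the first row (j=16, fft)
hz, sec_per_op = rates(2.23, 32768)
print(f"{hz:.5g} Hz , ({sec_per_op:.5g} sec/op)")
```

In the listing mct is halved each time j doubles, which keeps dt within roughly 2 to 14 seconds per measurement.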

----------------------------------------------------------
 ------------- j=16: -------------- 
 dt= 2.23  mct=  32768     fft.............:     14694 Hz , (6.8054e-05 sec/op)
 dt= 5.25  mct=  32768     pfa_fft.........:    6241.5 Hz , (0.00016022 sec/op)
 npfa()=16
 ------------- j=32: -------------- 
 dt= 2.66  mct=  16384     fft.............:    6159.4 Hz , (0.00016235 sec/op)
 dt= 9.99  mct=  16384     pfa_fft.........:      1640 Hz , (0.00060974 sec/op)
 npfa()=33
 ------------- j=64: -------------- 
 dt= 3.34  mct=   8192     fft.............:    2452.7 Hz , (0.00040771 sec/op)
 dt=   12  mct=   8192     pfa_fft.........:    681.53 Hz , (0.0014673 sec/op)
 npfa()=65
 ------------- j=128: -------------- 
 dt= 3.75  mct=   4096     fft.............:    1092.3 Hz , (0.00091553 sec/op)
 dt= 12.9  mct=   4096     pfa_fft.........:    316.29 Hz , (0.0031616 sec/op)
 npfa()=130
 ------------- j=256: -------------- 
 dt= 4.36  mct=   2048     fft.............:    469.72 Hz , (0.0021289 sec/op)
 dt= 12.9  mct=   2048     pfa_fft.........:    158.15 Hz , (0.0063232 sec/op)
 npfa()=260
 ------------- j=512: -------------- 
 dt= 4.74  mct=   1024     fft.............:    216.03 Hz , (0.0046289 sec/op)
 dt= 13.1  mct=   1024     pfa_fft.........:    77.989 Hz , (0.012822 sec/op)
 npfa()=520
 ------------- j=1024: -------------- 
 dt= 5.49  mct=    512     fft.............:     93.26 Hz , (0.010723 sec/op)
 dt= 11.8  mct=    512     pfa_fft.........:     43.39 Hz , (0.023047 sec/op)
 npfa()=1040
 ------------- j=2048: -------------- 
 dt= 6.32  mct=    256     fft.............:    40.506 Hz , (0.024688 sec/op)
 dt= 9.95  mct=    256     pfa_fft.........:    25.729 Hz , (0.038867 sec/op)
 npfa()=2145
 ------------- j=4096: -------------- 
 dt= 8.19  mct=    128     fft.............:    15.629 Hz , (0.063984 sec/op)
 dt= 7.56  mct=    128     pfa_fft.........:    16.931 Hz , (0.059062 sec/op)
 npfa()=4290
 ------------- j=8192: -------------- 
 dt= 8.58  mct=     64     fft.............:    7.4592 Hz , ( 0.13406 sec/op)
 dt= 7.74  mct=     64     pfa_fft.........:    8.2687 Hz , ( 0.12094 sec/op)
 npfa()=8580
 ------------- j=16384: -------------- 
 dt= 9.51  mct=     32     fft.............:    3.3649 Hz , ( 0.29719 sec/op)
 dt= 8.03  mct=     32     pfa_fft.........:    3.9851 Hz , ( 0.25094 sec/op)
 npfa()=17160
 ------------- j=32768: -------------- 
 dt= 9.94  mct=     16     fft.............:    1.6097 Hz , ( 0.62125 sec/op)
 dt=  8.3  mct=     16     pfa_fft.........:    1.9277 Hz , ( 0.51875 sec/op)
 npfa()=34320
 ------------- j=65536: -------------- 
 dt= 10.9  mct=      8     fft.............:   0.73665 Hz , (  1.3575 sec/op)
 dt= 9.48  mct=      8     pfa_fft.........:   0.84388 Hz , (   1.185 sec/op)
 npfa()=72072
 ------------- j=131072: -------------- 
 dt= 11.3  mct=      4     fft.............:   0.35492 Hz , (  2.8175 sec/op)
 dt=  9.8  mct=      4     pfa_fft.........:   0.40816 Hz , (    2.45 sec/op)
 npfa()=144144
 ------------- j=262144: -------------- 
 dt= 12.2  mct=      2     fft.............:   0.16434 Hz , (   6.085 sec/op)
 dt= 13.5  mct=      2     pfa_fft.........:   0.14837 Hz , (    6.74 sec/op)
 npfa()=360360
 ------------- j=524288: -------------- 
 dt= 12.6  mct=      1     fft.............:  0.079491 Hz , (   12.58 sec/op)
 dt=   14  mct=      1     pfa_fft.........:  0.071633 Hz , (   13.96 sec/op)
 npfa()=720720
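
The npfa() lines are not explained above. Assuming they report the length that pfa_fft() actually transformed for each j (an assumption, the listing does not say so), the sketch below factors those lengths and shows how much longer than j each one is. factor() is a hypothetical helper written for this illustration.

```python
def factor(n):
    """Return the prime factorization of n as {prime: exponent}."""
    f = {}
    p = 2
    while p * p <= n:
        while n % p == 0:
            f[p] = f.get(p, 0) + 1
            n //= p
        p += 1
    if n > 1:
        f[n] = f.get(n, 0) + 1
    return f

# (j, npfa()) pairs copied from the listing
pairs = [
    (16, 16), (32, 33), (64, 65), (128, 130), (256, 260), (512, 520),
    (1024, 1040), (2048, 2145), (4096, 4290), (8192, 8580),
    (16384, 17160), (32768, 34320), (65536, 72072), (131072, 144144),
    (262144, 360360), (524288, 720720),
]

for j, n in pairs:
    # Prime powers of one integer are pairwise coprime by construction
    powers = sorted(p ** e for p, e in factor(n).items())
    print(f"j={j:7d}  npfa={n:7d}  factors={powers}  npfa/j={n / j:.3f}")
```

Every length in the listing is a product of prime powers no larger than 16. Under the assumption above, pfa_fft() also handled about 37% more points than the fft in the last two rows.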
