This guide summarizes a series of programs for analyzing frequency spectra using LNRE models. These programs are available under the GNU General Public License (GPL) in the hope that they will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose.
The programs documented below are written in C and can be run from the command line on Linux and UNIX systems. Preprocessing is done by spectrum, which takes a text as input and outputs various files, including a word frequency list, a file with the developmental profiles of the text, and a frequency spectrum. These files, in various combinations, form the input for the LNRE analyses. If only a word frequency list is available, wfl2spc can be used to derive the frequency spectrum that is the point of departure for all LNRE models. The adjusted models described in Chapter 5, however, cannot be applied in this case, as they crucially depend on the empirical developmental structure of the text.
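To illustrate what deriving a frequency spectrum from a word frequency list involves, here is a minimal Python sketch. It is not the wfl2spc implementation: the function name `frequency_spectrum` and the in-memory dictionary input are assumptions made for illustration, and the actual file formats are described in the documentation of the individual programs.

```python
from collections import Counter


def frequency_spectrum(word_freqs):
    """Return {m: V(m)}: for each frequency m, the number of word types
    that occur exactly m times. `word_freqs` maps word type -> frequency.
    (Hypothetical helper, not part of the C programs.)"""
    counts = Counter(word_freqs.values())
    return dict(sorted(counts.items()))


# Toy word frequency list (invented data, for illustration only).
wfl = {"the": 3, "cat": 1, "sat": 1, "on": 1, "mat": 2, "a": 2}
spc = frequency_spectrum(wfl)

# The spectrum preserves the token count N and the type count V.
n_tokens = sum(m * vm for m, vm in spc.items())
n_types = sum(spc.values())

print("m", "Vm")
for m, vm in spc.items():
    print(m, vm)
```

The spectrum discards the identity of the words and their order in the text, which is why the models that depend on the developmental structure of the text cannot be fitted from a word frequency list alone.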
All output files, except summary files, are in table format with column headers, so that they can easily serve as input to the statistical programming environments R and Splus.
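A table with a header row can also be read outside R and Splus. The following Python sketch assumes whitespace-separated numeric columns; the column names `m` and `Vm` in the sample data and the helper name `read_table` are assumptions for illustration, not a specification of the actual output files.

```python
import io


def read_table(handle):
    """Read a whitespace-separated table whose first line holds the
    column headers. Returns (header, rows), where each row is a dict
    mapping column name -> float. (Hypothetical helper.)"""
    header = handle.readline().split()
    rows = []
    for line in handle:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        rows.append(dict(zip(header, (float(x) for x in fields))))
    return header, rows


# Invented sample standing in for a table-format output file.
sample = io.StringIO("m Vm\n1 3\n2 2\n3 1\n")
header, rows = read_table(sample)
print(header)
print(rows[0])
```

For a real file, the `io.StringIO` object would be replaced by an open file handle.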
A graphical user interface to most of the C programs is provided by the Tcl/Tk program lexstats, designed for Linux and UNIX systems. lexstats facilitates parameter estimation, particularly for mixture models, for which no reliable automated procedures are available as yet. It can calculate the multivariate chi-squared statistic and the MSE (using lnreChi2), and it also allows the user to inspect the goodness of fit graphically by means of a plotting routine for the frequency spectrum and the developmental spectrum of the vocabulary size and the spectrum elements. Not accessible from lexstats are the programs mcprofile and mcdisp, which implement Monte Carlo methods for the observed developmental profile of a text and its lexical dispersion characteristics, as described in sections 1.4 and 5.1.1.
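As a rough illustration of a mean squared error between an observed and an expected spectrum, consider the Python sketch below. This guide does not define the MSE that lnreChi2 reports, so the sketch assumes the plain mean of squared differences over a chosen set of spectrum elements; the function name and the numbers are hypothetical.

```python
def mean_squared_error(observed, expected):
    """Mean of squared differences between paired observed and expected
    values, e.g. the first few spectrum elements V(m).
    (Assumed definition; lnreChi2 may compute its MSE differently.)"""
    if not observed or len(observed) != len(expected):
        raise ValueError("need two non-empty sequences of equal length")
    squared = [(o - e) ** 2 for o, e in zip(observed, expected)]
    return sum(squared) / len(squared)


# Invented values for V(1), V(2), V(3).
observed = [10.0, 4.0, 2.0]
expected = [9.0, 5.0, 2.0]
mse = mean_squared_error(observed, expected)
print(mse)
```

Unlike a chi-squared statistic, a plain MSE of this kind takes no account of the variances and covariances of the spectrum elements, so it is a descriptive measure rather than a significance test.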
The following three tables summarize the programs by group, the output file types by group, and the file type extensions. Links are provided to the documents describing the use of the individual programs.
INDEX OF PROGRAMS BY GROUP
BASIC ANALYSES
| spectrum | calculate spectrum and word frequency list from text |
| wfl2spc | build frequency spectrum from a word frequency list |
MONTE CARLO METHODS
| mcdisp | dispersion analysis |
| mcprofile | calculate empirical and theoretical developmental profiles |
NON-PARAMETRIC INTERPOLATION
| binomint | binomial interpolation |
| labhub | partition-adjusted binomial interpolation (Labbe-Hubert) |
SPECTRUM SMOOTHING
| spectfit | Naranan-Balasubrahmanyan fit and Good-Turing estimation |
LNRE MODELS
| standard models | |
| lnreZipf | Zipf |
| lnreYuSi | Yule-Simon |
| lnreCarr | lognormal |
| lnreSich | generalized inverse Gauss-Poisson (gamma=-0.5) |
| lnreSgam | generalized inverse Gauss-Poisson (gamma free) |
| partition-adjusted models | |
| ad2Zipf | partition-adjusted Zipf's law |
| ad2YuSi | partition-adjusted Yule-Simon model |
| ad2Carr | partition-adjusted lognormal model |
| ad2Sich | partition-adjusted generalized inverse Gauss-Poisson with gamma=-0.5 |
| ad2Sgam | partition-adjusted generalized inverse Gauss-Poisson with gamma free |
| parameter-adjusted models | |
| adjZipf | parameter-adjusted Zipf's law |
| adjSich | parameter-adjusted generalized inverse Gauss-Poisson with gamma=-0.5 |
| mixture models | |
| lexstats | the GUI facilitates combining two standard LNRE models in a mixture model |
EVALUATING GOODNESS OF FIT
| lnreChi2 | multivariate chi-squared test and mean squared error |
INDEX OF OUTPUT FILE TYPES BY GROUP
| BASIC ANALYSES | ||
| .obs | empirical developmental profile | spectrum |
| .spc | empirical frequency spectrum | spectrum, wfl2spc |
| .sum | summary of statistics | spectrum, wfl2spc |
| .wfl | empirical word frequency list | spectrum |
| .zrk | empirical rank-frequency list | spectrum |
| .zvc | text in vector form with Zipf ranks | spectrum |
| MONTE CARLO METHODS | ||
| .mch | upper 95% confidence interval | mcprofile |
| .mcl | lower 95% confidence interval | mcprofile |
| .mcm | mean profile | mcprofile |
| .mco | observed empirical profile | mcprofile |
| .mcd | observed and expected dispersion | mcdisp |
| .fik | dispersion frequency table | mcdisp |
| NON-PARAMETRIC INTERPOLATION | ||
| .bin | binomial interpolation profiles | binomint |
| .lhu | binomial interpolation profiles with partition-based adjustment | labhub |
| SPECTRUM SMOOTHING | ||
| _N.spc | expected spectrum (Naranan-Balasubrahmanyan fit) and Good-Turing estimates | spectfit |
| _N.sum | summary statistics and MSE | spectfit |
| LNRE MODELS | ||
| _xX.spc | observed and expected spectrum | all models |
| _xX.fsp | expected spectrum for extended range | all models |
| _xX.sp2 | expected spectrum at 2N_0 | all models |
| _xX.ev2 | V(N_0), E[V(N_0)], E[V(2N_0)] | all models |
| _xX.int | interpolated spectrum and E[V(N)] | all models |
| _xX.ext | extrapolated spectrum and E[V(N)] | all models |
| _aX.fit | fitted data points | adjZipf, adjSich |
| _aX.sta | coefficients of the fit | adjZipf, adjSich |
| _xX.sum | summary statistics and parameters | all models |
| SIGNIFICANCE TESTING | ||
| _xX.chi | goodness-of-fit statistics | lnreChi2 |
| _xX.cov | covariance matrix for lnre model | lnreChi2 |
FILE EXTENSIONS OF LNRE MODELS
| SPECTRUM SMOOTHING | |
| _N.xxx | Naranan-Balasubrahmanyan |
| STANDARD LNRE MODELS | |
| _Z.xxx | Zipf |
| _Y.xxx | Yule-Simon |
| _C.xxx | lognormal |
| _S.xxx | generalized inverse Gauss-Poisson (gamma=-0.5) |
| _G.xxx | generalized inverse Gauss-Poisson (gamma free) |
| PARAMETER ADJUSTED LNRE MODELS | |
| _aZ.xxx | Zipf |
| _aS.xxx | generalized inverse Gauss-Poisson (gamma=-0.5) |
| PARTITION ADJUSTED LNRE MODELS | |
| _bZ.xxx | Zipf |
| _bY.xxx | Yule-Simon |
| _bC.xxx | lognormal |
| _bS.xxx | generalized inverse Gauss-Poisson (gamma=-0.5) |
| _bG.xxx | generalized inverse Gauss-Poisson (gamma free) |
| MIXTURE MODELS | |
| _XY.xxx | X: base component, Y: complement component |
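The naming scheme in the tables above can be summarized as a lookup from program to file-name suffix. The Python sketch below is an illustration only: the pairing of program names with suffixes is inferred by matching the model names in the tables, and the way the suffix is joined to a base name (here the hypothetical `alice`) is an assumption.

```python
# Suffix per program, inferred from the model names in the tables above.
MODEL_SUFFIX = {
    "spectfit": "_N",   # Naranan-Balasubrahmanyan smoothing
    "lnreZipf": "_Z",
    "lnreYuSi": "_Y",
    "lnreCarr": "_C",
    "lnreSich": "_S",
    "lnreSgam": "_G",
    "adjZipf": "_aZ",   # parameter-adjusted models
    "adjSich": "_aS",
    "ad2Zipf": "_bZ",   # partition-adjusted models
    "ad2YuSi": "_bY",
    "ad2Carr": "_bC",
    "ad2Sich": "_bS",
    "ad2Sgam": "_bG",
}


def output_name(base, program, extension):
    """Build an output file name such as 'alice_Z.spc'.
    (Hypothetical helper; the base-name convention is assumed.)"""
    return f"{base}{MODEL_SUFFIX[program]}.{extension}"


print(output_name("alice", "lnreZipf", "spc"))
print(output_name("alice", "ad2Sich", "sum"))
```

Mixture models are left out of the lookup because their suffix combines the letters of the two component models.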