Software

This guide summarizes a series of programs for analyzing frequency spectra using LNRE (Large Number of Rare Events) models. These programs are available under the GNU General Public License (GPL) in the hope that they will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose.

The programs documented below are written in C and can be run from the command line on Linux and UNIX systems. Preprocessing is done by spectrum, which takes a text as input and outputs various files, including a word frequency list, a file with the developmental profiles of the text, and a frequency spectrum. These files, in various combinations, form the input for the LNRE analyses. If only a word frequency list is available, wfl2spc can be used to derive the frequency spectrum that is the point of departure for all LNRE models. The adjusted models described in Chapter 5, however, cannot be applied in this case, as they crucially depend on the empirical developmental structure of the text.
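The step from a word frequency list to a frequency spectrum can be sketched as follows. This is a minimal illustration, not the code of wfl2spc: the function name spectrum_from_freqs is hypothetical, and the sketch assumes the frequencies have already been read into an array, since the layout of the .wfl file is not described here. The spectrum element V(m) is the number of word types that occur exactly m times.

```c
#include <stdlib.h>

/* Hypothetical helper (not part of the distributed programs).
 * freqs[i] is the token frequency of word type i; n_types is the number
 * of types. Returns a calloc'ed array spc of length *max_freq + 1 in
 * which spc[m] is the number of types occurring exactly m times
 * (spc[0] is unused and stays 0). Returns NULL on invalid input or
 * allocation failure. The caller frees the result. */
long *spectrum_from_freqs(const long *freqs, size_t n_types, long *max_freq)
{
    long max = 0;

    /* First pass: validate the input and find the largest frequency. */
    for (size_t i = 0; i < n_types; i++) {
        if (freqs[i] < 1)
            return NULL;            /* a listed type must occur at least once */
        if (freqs[i] > max)
            max = freqs[i];
    }

    long *spc = calloc((size_t)max + 1, sizeof *spc);
    if (spc == NULL)
        return NULL;

    /* Second pass: count how many types share each frequency. */
    for (size_t i = 0; i < n_types; i++)
        spc[freqs[i]]++;

    *max_freq = max;
    return spc;
}
```

For the frequencies 3, 1, 1, 2, 1, 3 the function yields V(1) = 3, V(2) = 1 and V(3) = 2. Summing m * V(m) over m recovers the number of tokens, and summing V(m) recovers the number of types.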

All output files, except summary files, are in table format with column headers, so that they can easily serve as input to the statistical programming environments R and Splus.

A graphical user interface to most of the C programs is provided by the Tcl/Tk program lexstats, designed for Linux and UNIX systems. The lexstats interface facilitates parameter estimation, notably for mixture models, for which no reliable automated procedures are available as yet. It can calculate the multivariate chi-squared statistic (X2) and the mean squared error (MSE) by way of lnreChi2, and it allows the user to inspect the goodness of fit graphically by means of a plotting routine for the frequency spectrum and the developmental spectrum of the vocabulary size and the spectrum elements. The programs mcprofile and mcdisp, which implement Monte Carlo methods for the observed developmental profile of a text and its lexical dispersion characteristics as described in sections 1.4 and 5.1.1, are not accessible from lexstats.
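As a rough illustration of what a mean squared error over a fitted spectrum measures, the sketch below averages the squared differences between observed and expected values. The exact definition used by lnreChi2 is not given in this guide, so this is an assumption: the sketch takes the MSE over the vocabulary size V(N) and the first r spectrum elements V(m, N), divided by r + 1. The function name spectrum_mse and its parameters are hypothetical.

```c
#include <stddef.h>

/* Hypothetical helper; the definition is an assumption, not necessarily
 * the one implemented in lnreChi2.
 * obs_v, exp_v : observed and expected vocabulary size
 * obs_spc[m-1] : observed number of types with frequency m, m = 1..r
 * exp_spc[m-1] : expected number of types with frequency m, m = 1..r
 * Returns the mean of the r + 1 squared differences. */
double spectrum_mse(double obs_v, double exp_v,
                    const double *obs_spc, const double *exp_spc, size_t r)
{
    double d = obs_v - exp_v;
    double sum = d * d;              /* contribution of the vocabulary size */

    for (size_t m = 0; m < r; m++) {
        d = obs_spc[m] - exp_spc[m];
        sum += d * d;                /* contribution of spectrum element m + 1 */
    }
    return sum / (double)(r + 1);
}
```

A smaller value indicates a closer fit. Unlike a chi-squared statistic, this quantity does not take the variances and covariances of the spectrum elements into account.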

The following three tables summarize the programs by group, the output file types by group, and the file type extensions. Links are provided to the documents describing the use of the individual programs.


INDEX OF PROGRAMS BY GROUP


BASIC ANALYSES

spectrum calculate spectrum and word frequency list from text
wfl2spc build frequency spectrum from a word frequency list

MONTE CARLO METHODS

mcdisp dispersion analysis
mcprofile calculate empirical and theoretical developmental profiles

NON-PARAMETRIC INTERPOLATION

binomint binomial interpolation
labhub partition-adjusted binomial interpolation (Labbe-Hubert)

SPECTRUM SMOOTHING

spectfit Naranan-Balasubrahmanyan fit and Good-Turing estimation

LNRE MODELS

standard models
lnreZipf Zipf
lnreYuSi Yule-Simon
lnreCarr lognormal
lnreSich generalized inverse Gauss-Poisson (gamma=-0.5)
lnreSgam generalized inverse Gauss-Poisson (gamma free)
partition-adjusted models
ad2Zipf partition-adjusted Zipf's law
ad2YuSi partition-adjusted Yule-Simon model
ad2Carr partition-adjusted lognormal model
ad2Sich partition-adjusted generalized inverse Gauss-Poisson with gamma=-0.5
ad2Sgam partition-adjusted generalized inverse Gauss-Poisson with gamma free
parameter-adjusted models
adjZipf parameter-adjusted Zipf's law
adjSich parameter-adjusted generalized inverse Gauss-Poisson with gamma = -0.5
mixture models
lexstats the GUI facilitates combining two standard LNRE models in a mixture model

EVALUATING GOODNESS OF FIT

lnreChi2 multivariate chi-squared test and mean squared error


INDEX OF OUTPUT FILE TYPES BY GROUP


Columns: file extension, contents, generating program(s).

BASIC ANALYSES

.obs     empirical developmental profile      spectrum
.spc     empirical frequency spectrum         spectrum, wfl2spc
.sum     summary of statistics                spectrum, wfl2spc
.wfl     empirical word frequency list        spectrum
.zrk     empirical rank-frequency list        spectrum
.zvc     text in vector form with Zipf ranks  spectrum

MONTE CARLO METHODS

.mch     upper 95% confidence interval     mcprofile
.mcl     lower 95% confidence interval     mcprofile
.mcm     mean profile                      mcprofile
.mco     observed empirical profile        mcprofile
.mcd     observed and expected dispersion  mcdisp
.fik     dispersion frequency table        mcdisp

NON-PARAMETRIC INTERPOLATION

.bin     binomial interpolation profiles                                  binomint
.lhu     binomial interpolation profiles with partition-based adjustment  labhub

SPECTRUM SMOOTHING

_N.spc   expected spectrum: Naranan-Balasubrahmanyan fit and Good-Turing estimates  spectfit
_N.sum   summary statistics and MSE                                                 spectfit

LNRE MODELS

_xX.spc  observed and expected spectrum        all models
_xX.fsp  expected spectrum for extended range  all models
_xX.sp2  expected spectrum at 2N_0             all models
_xX.ev2  V(N_0), E[V(N_0)], E[V(2N_0)]         all models
_xX.int  interpolated spectrum and E[V(N)]     all models
_xX.ext  extrapolated spectrum and E[V(N)]     all models
_aX.fit  fitted data points                    adjZipf, adjSich
_aX.sta  coefficients of the fit               adjZipf, adjSich
_xX.sum  summary statistics and parameters     all models

SIGNIFICANCE TESTING

_xX.chi  goodness-of-fit statistics        lnreChi2
_xX.cov  covariance matrix for LNRE model  lnreChi2


FILE EXTENSIONS OF LNRE MODELS


SPECTRUM SMOOTHING
_N.xxx Naranan-Balasubrahmanyan
STANDARD LNRE MODELS
_Z.xxx Zipf
_Y.xxx Yule-Simon
_C.xxx lognormal
_S.xxx generalized inverse Gauss-Poisson (gamma=-0.5)
_G.xxx generalized inverse Gauss-Poisson (gamma free)
PARAMETER-ADJUSTED LNRE MODELS
_aZ.xxx Zipf
_aS.xxx generalized inverse Gauss-Poisson (gamma=-0.5)
PARTITION-ADJUSTED LNRE MODELS
_bZ.xxx Zipf
_bY.xxx Yule-Simon
_bC.xxx lognormal
_bS.xxx generalized inverse Gauss-Poisson (gamma=-0.5)
_bG.xxx generalized inverse Gauss-Poisson (gamma free)
MIXTURE MODELS
_XY.xxx X: base component, Y: complement component
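
The naming scheme in the table above can be made concrete with a small sketch that assembles an output file name from a base name, a model code and a file type extension. The function name lnre_output_name is hypothetical, and the sketch assumes that the part of the name before the underscore is a base name derived from the input text, which this guide does not state explicitly.

```c
#include <stdio.h>

/* Hypothetical helper illustrating the naming scheme; not part of the
 * distributed programs.
 * base       : assumed base name of the analyzed text, e.g. "alice"
 * model_code : model code from the table, e.g. "Z", "aS", "bG"
 * ext        : file type extension without the dot, e.g. "spc", "sum"
 * Writes "<base>_<model_code>.<ext>" into buf.
 * Returns 0 on success and -1 if the name does not fit into buf. */
int lnre_output_name(char *buf, size_t size, const char *base,
                     const char *model_code, const char *ext)
{
    int n = snprintf(buf, size, "%s_%s.%s", base, model_code, ext);
    return (n >= 0 && (size_t)n < size) ? 0 : -1;
}
```

With the base name "alice", the code "Z" and the extension "spc", the function produces "alice_Z.spc", the observed and expected spectrum for the Zipf model under the assumptions above. The code "bS" with "sum" produces "alice_bS.sum".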