Word frequency distributions (Q5943510)

From MaRDI portal
scientific article; zbMATH DE number 1652010

    Statements

    Word frequency distributions (English)
    27 September 2001
    This book is an introduction to the statistical analysis of word frequency distributions, i.e., the quantitative aspects of lexical structure. Word frequency distributions are characterized by very large numbers of rare words. This property leads to strange statistical phenomena: mean frequencies that keep changing systematically as the number of observations increases, relative frequencies that are not fully reliable estimators of population probabilities even in large samples, and model parameters that emerge as functions of the text size. The aim of this monograph is to make statistical techniques for the analysis of distributions with large numbers of rare events (LNRE distributions) more accessible to non-specialists. First, basic concepts of lexical statistics (e.g., sample size, randomness of occurrence of words, frequency spectrum, Zipf's rank-frequency model, lognormal distributions) and their notation are introduced. Then non-parametric methods for the analysis of word frequency distributions are discussed, including the binomial model and its Poisson approximation, the LNRE zone (the range of sample sizes where the sample relative frequencies are not good estimates of the corresponding population probabilities), and Good-Turing estimates (which adjust sample relative frequencies for the non-negligible frequency weight of the unseen words). The next chapter describes in detail three parametric models, viz. the lognormal model, the Yule-Simon Zipfian model, and the generalized inverse Gauss-Poisson model. Furthermore, the concept of mixture distributions is introduced and illustrated by examples of mixture analyses of morphological data. The non-parametric and parametric models all assume that words occur independently and randomly in texts; the effect of non-randomness in word use on their accuracy is also explored.
    The final chapter presents various examples of applications, such as a study of distributional properties of the lexicon and morphological productivity. Throughout the book, the concepts of probability theory and statistics needed to understand the analysis of word frequency distributions are carefully introduced. The appendices give solutions to the exercises at the end of various chapters and document C programs for carrying out statistical analyses of word frequencies (LEXSTATS, running under LINUX).
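To make two of the concepts named in the review concrete, the frequency spectrum and the Good-Turing adjustment can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the notation V(m) for the number of distinct words occurring exactly m times, and the toy sample are all hypothetical, and nothing here reproduces the book's LEXSTATS programs. The sketch uses the basic, unsmoothed Good-Turing formulas: the probability mass of unseen words is estimated as V(1)/N, and a word seen m times gets the adjusted frequency (m + 1) V(m + 1) / V(m).

```python
from collections import Counter


def frequency_spectrum(tokens):
    """Return a Counter mapping m to V(m), the number of distinct
    words that occur exactly m times in the sample."""
    word_counts = Counter(tokens)
    return Counter(word_counts.values())


def unseen_mass(tokens):
    """Basic Good-Turing estimate of the total probability of all
    words not seen in the sample: V(1) / N."""
    n = len(tokens)
    if n == 0:
        raise ValueError("sample is empty")
    return frequency_spectrum(tokens)[1] / n


def adjusted_frequency(m, spectrum):
    """Unsmoothed Good-Turing adjusted frequency for a word seen m times:
    m* = (m + 1) * V(m + 1) / V(m)."""
    if spectrum[m] == 0:
        raise ValueError(f"no words occur exactly {m} times")
    return (m + 1) * spectrum[m + 1] / spectrum[m]


# Toy sample, invented purely for illustration (N = 11 tokens, 8 distinct words).
tokens = "the cat sat on the mat and the dog sat too".split()
spectrum = frequency_spectrum(tokens)

print(dict(spectrum))                    # V(m) for each observed m
print(unseen_mass(tokens))               # estimated probability of unseen words
print(adjusted_frequency(1, spectrum))   # adjusted frequency of a once-seen word
```

In this toy sample six of the eight distinct words occur only once, so more than half of the estimated probability mass is assigned to unseen words. On samples this small the unsmoothed adjusted frequencies are erratic, and V(m + 1) is often zero for larger m, which is one reason practical analyses smooth the spectrum or fit a parametric model instead.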
    statistical analysis of word frequency distributions