Statistical modelling in biostatistics and bioinformatics. Selected papers partly based on the 3rd international workshop on correlated data modelling, Limerick, Ireland and on the science foundation Ireland's biostatistics and bioinformatics research programme, BIO-SI, Limerick, Ireland (Q2441282)

scientific article

    Statements

    24 March 2014
    The book under review consists of four parts covering survival analysis (Chapters 1--4), longitudinal modeling and time series (Chapters 5--7), statistical model development (Chapters 8--11) and applied statistical modeling (Chapters 12--14). The first part begins with a chapter on multivariate interval-censored survival data. Following a brief description of univariate interval-censored data, the authors move on to multivariate data and frailty models. The chapter continues with multivariate interval-censored data and a brief description of the inspection pattern, and concludes with an analysis of parametric, non-parametric, semi-parametric and regression models. The second chapter focuses on multivariate survival models based on the generalized time-dependent logistic (GTDL) model. Following an introduction to GTDL models, the authors present the GTDL regression models and continue with an extension that incorporates random effects in two different ways: the non-proportional hazards (non-PH) frailty model and the non-PH random effects model. Next, H-likelihood estimation is discussed for both models. The chapter concludes with two examples, one on kidney infection data and one on chronic granulomatous disease (CGD). The third chapter deals with frailty models with structural dispersion. First, regression models with frailty are described, with a focus on the Weibull and GTDL models. Goodness-of-fit measures are also discussed. Next, the concept of structural dispersion is introduced, and the chapter concludes with an example on the incidence of breast cancer, which is analyzed and discussed in detail. The fourth chapter presents random effects ordinal time models for grouped toxicological data from biological control assays.
More precisely, discrete survival times are treated as ordered multi-categorical data, and the continuation-ratio model is preferred because it structures a multinomial distribution as a succession of hierarchical binomial models. The random effects, for which a normal distribution is assumed, are incorporated into the linear predictor; an EM algorithm is used for their estimation. The approach is illustrated with data on a fungus infecting a termite that is a pest of sugarcane fields in Brazil. The second part of the book, describing longitudinal modeling and time series, opens with a chapter on seasonality and structural breaks, using as an example the monthly time series of short-term visitor arrivals to New Zealand after the 9/11 terrorist attacks. After showing that existing methods perform poorly when modeling multiple structural breaks, the authors introduce a new approach based on iterative estimation. Its performance is demonstrated on simulated as well as real data. The sixth chapter illustrates an application of generalized linear models to forecast the risk of insolvency among customers of an automotive financial service. First, the best set of predictors is identified using sample logit, generalized additive models and univariate logistic regression. Next, the assumptions are checked; the authors propose Wald statistics for testing the significance of coefficients, the likelihood ratio test for assessing goodness of fit and odds ratios for interpreting the coefficients. The chapter concludes with classification tables and the ROC curve, together with their interpretation. The seventh chapter focuses on a data-driven method for joint modeling of intra-subject constrained mean and covariance structures in longitudinal data. The approach is based on an iterative least squares estimation algorithm, and key asymptotic properties of the estimates are given.
The authors present in detail the constrained mean covariance model and the joint model, and illustrate their results on diabetic patient data, which are also analyzed with traditional methods for comparison. The chapter concludes with simulation studies showing that a correct choice of the covariance matrix is crucial for minimizing the bias and the variance in estimating the constrained mean component. The third part of the book presents statistical model developments and starts with hierarchical generalized nonlinear models. These are used in practice to model non-normal data for which several sources of error variation can be identified, and they allow nonlinear parameters to be added to linear predictors. The author shows, using a practical example in R, how the fitting algorithms for generalized nonlinear models work using nested optimization. On the same example, he also defines the hierarchical generalized nonlinear model, underlining the principles it shares with generalized nonlinear models. The ninth chapter focuses on cluster detection using robust regression estimates. First, the authors show that regression models based on the Huber M-estimator are unstable for large datasets that contain a substantial number of outliers. Next, they present results obtained using the \(L_2\) estimates and use a novel Monte Carlo significance test to compare the two approaches. The chapter concludes with an extensive case study that backs up the theoretical concepts presented earlier. The tenth chapter discusses SNP data, which can be clustered using a finite mixture model. In this approach each component of the mixture distribution is associated with a cluster, and a wide range of statistical models can be used to describe the data in each cluster. Following an in-depth description of finite mixture models, the authors introduce finite mixtures with least squares and orthogonal regression lines. The applicability of the method is shown on sugarcane contig examples.
The eleventh chapter discusses the choice of the reference subclass in categorical regression models. The authors propose a method to obtain an optimal allocation of observations to subclasses and measure the discrepancy between the optimal allocation and the suboptimal, naturally occurring clusters using a statistic based on generalized variance. Following a general description of the ideal allocation, the authors conclude with simulation studies on one or more categorical variables. The drawbacks of \(\chi^2\) for complicated GLMs are also presented, and the chapter includes two appendices, one with the R code for generating random compositions and the other with the details of the derivation of the optimal allocation with two binary covariates. The fourth part of the book, ``Applied statistical modelling'', opens with a chapter on statistical methods for detecting selective sweeps, i.e. the spread of a positively selected mutation through a population, e.g. the lactose tolerance mutation in Northern Europe compared with the lack of it in Africa. The author starts with genetic data and the Wright-Fisher model for mutations. Next, coalescent trees are introduced, and mutation and recombination events are discussed in this context. The author also presents classical tests for selective sweeps, including nucleotide diversity and the Tajima test for selection. The chapter concludes with an overview of empirical methods and composite likelihood methods, for which both advantages and disadvantages are discussed. The thirteenth chapter discusses reproductive allocation (RA) and the use of mixture models and bootstrap analysis to assess it. First, the authors predict the RA on the original scale and discuss issues such as zero values or plant batch effects that require complex modeling. Next, they introduce a two-component finite-mixture model framework, for which they also present an example.
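To make the idea of a two-component finite mixture combined with a bootstrap concrete, a minimal Python sketch follows. It is not the chapter authors' model: the review does not state the component distributions or the fitting method, so Gaussian components, EM estimation and a percentile bootstrap are assumptions made here for illustration, and the function names `fit_two_component_mixture` and `bootstrap_weight_interval` are hypothetical.

```python
import numpy as np


def fit_two_component_mixture(x, n_iter=200, tol=1e-8):
    """EM for a two-component univariate Gaussian mixture (illustrative assumption)."""
    x = np.asarray(x, dtype=float)
    # Crude initialisation: split the data at the median.
    med = np.median(x)
    mu = np.array([x[x <= med].mean(), x[x > med].mean()])
    sigma = np.array([x.std(), x.std()])
    weight = np.array([0.5, 0.5])
    prev_loglik = -np.inf
    loglik = prev_loglik
    for _ in range(n_iter):
        # E-step: weighted component densities and posterior membership probabilities.
        z = (x[:, None] - mu) / sigma
        dens = weight * np.exp(-0.5 * z ** 2) / (sigma * np.sqrt(2.0 * np.pi))
        total = dens.sum(axis=1)
        resp = dens / total[:, None]
        # Log-likelihood at the parameters that entered this iteration.
        loglik = np.log(total).sum()
        # M-step: update mixing weights, means and standard deviations.
        nk = resp.sum(axis=0)
        weight = nk / x.size
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
        sigma = np.maximum(np.sqrt(var), 1e-6)  # floor guards against collapse
        if abs(loglik - prev_loglik) < tol:
            break
        prev_loglik = loglik
    return weight, mu, sigma, loglik


def bootstrap_weight_interval(x, n_boot=200, seed=0):
    """Percentile bootstrap interval for the weight of the lower-mean component."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    estimates = []
    for _ in range(n_boot):
        sample = rng.choice(x, size=x.size, replace=True)
        weight, mu, _, _ = fit_two_component_mixture(sample)
        # Identify the component by its mean to avoid label switching.
        estimates.append(weight[np.argmin(mu)])
    return np.percentile(estimates, [2.5, 97.5])
```

Calling `fit_two_component_mixture(x)` on a one-dimensional sample returns the mixing weights, component means, standard deviations and the last evaluated log-likelihood; `bootstrap_weight_interval(x)` then gives a rough 95% interval for the mixing weight of the component with the smaller mean.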
The final chapter focuses on model selection algorithms for log-linear models and multidimensional contingency tables. Following an introduction to the basic notions of contingency tables, bijective mappings, flat tables and log-linear modelling, the authors proceed to discuss sparseness, goodness of fit and residuals. Next, the model selection methods are presented; these include the classical stepwise search algorithms, penalized likelihood and the smooth LASSO. The chapter concludes with applications and examples for these models. Although written in an accessible format, the book requires extensive prior knowledge of linear and nonlinear models, survival analysis, time series and clustering. Nevertheless, the variety of topics makes it a must-have for a computational biology/bioinformatics lab.
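As a supplementary note on the continuation-ratio model described for Chapter 4, the factorization of a multinomial distribution into successive binomial models can be written out explicitly. This is a sketch of the standard construction, not a transcription of the chapter; the logit link, the cut-point parameters \(\theta_j\), the covariate vector \(x_i\) and the group-level random effect \(u_i\) are notational assumptions made here. For a discrete survival time \(Y\) with ordered categories \(1,\dots,J\) and probabilities \(\pi_j = P(Y=j)\), define the discrete hazard
\[
h_j = P(Y=j \mid Y\ge j) = \frac{\pi_j}{\pi_j+\dots+\pi_J}, \qquad \pi_j = h_j \prod_{k<j}(1-h_k), \qquad h_J = 1 .
\]
With \(n_j\) observations in category \(j\) and \(r_j = n_j+\dots+n_J\) still at risk at stage \(j\), the multinomial kernel becomes a product of binomial kernels,
\[
\prod_{j=1}^{J} \pi_j^{n_j} = \prod_{j=1}^{J-1} h_j^{n_j}(1-h_j)^{r_j-n_j},
\]
so each stage can be treated as a binomial model with \(n_j\) events out of \(r_j\) at risk. Under the assumed logit link, normally distributed random effects enter the linear predictor as
\[
\operatorname{logit} h_{ij} = \theta_j + x_i^{\top}\beta + u_i, \qquad u_i \sim N(0,\sigma^2).
\]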
    interval-censored survival data
    frailty models
    model choice
    parametric models
    semi-parametric models
    non-parametric models
    generalized time-dependent logistic (GTDL)
    H-likelihood
    non-PH model
    random effects
    clustered data
    multi-categorical data
    ordinal regression