Distributed estimation of principal eigenspaces (Q2284361)

From MaRDI portal
scientific article

    Statements

    Distributed estimation of principal eigenspaces (English)
    15 January 2020
    The authors address the problem of large data sets that are scattered across distant locations. Massive data sets are now ubiquitous, and the authors are motivated by examples such as the data recorded by IT companies around the world (which cannot be stored in a single data center) and health records spread across many hospitals or countries. Fusing or aggregating such data sets is extremely difficult because of communication cost, privacy, data security, ownership and other factors. A typical approach uses distributed statistical or regression methods that first compute local statistics on each subsample and then combine them into an aggregated statistic. In the recent literature, principal component analysis (PCA), a tool of statistical machine learning, is often studied with a sparsity assumption imposed on the top eigenvectors in order to overcome noise accumulation. Distributed PCA must handle data that are partitioned and stored across multiple servers. The paper presents a distributed algorithm for estimating the top eigenvectors, derives the statistical error rates of the aggregated estimator, and reports simulation results that validate the theory, all under sub-Gaussian assumptions on the data (that is, the tails are dominated by those of a Gaussian). Further research directions are mentioned, for instance the extension to heavy-tailed distributions (with tails heavier than sub-Gaussian), where establishing statistical rates with exponential deviation requires shrinking the data and controlling the induced bias.
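
The review does not spell out the algorithm, so the following is only a minimal sketch of one natural "local statistics, then aggregate" scheme, assumed here for illustration: each server computes the leading eigenvector of its own sample covariance, the central node averages the local projection matrices v v^T (which sidesteps the sign ambiguity of eigenvectors), and the leading eigenvector of that average is the aggregated estimate. The function names, the restriction to a single eigenvector, the use of power iteration, and the synthetic data are all assumptions made for this example, not details taken from the paper.

```python
import random

def covariance(rows):
    """Sample second-moment matrix (1/n) * sum of x x^T; data assumed centered."""
    n, d = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / n for j in range(d)]
            for i in range(d)]

def top_eigvec(mat, iters=200):
    """Leading unit eigenvector of a symmetric PSD matrix via power iteration."""
    d = len(mat)
    v = [1.0 / d ** 0.5] * d  # fixed starting vector keeps the sketch deterministic
    for _ in range(iters):
        w = [sum(mat[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v

def distributed_top_eigvec(shards):
    """Aggregate local leading eigenvectors by averaging their projections."""
    m, d = len(shards), len(shards[0][0])
    avg_proj = [[0.0] * d for _ in range(d)]
    for rows in shards:
        # This step would run on each server; only v (d numbers) is communicated.
        v = top_eigvec(covariance(rows))
        for i in range(d):
            for j in range(d):
                avg_proj[i][j] += v[i] * v[j] / m
    # Central node: leading eigenvector of the averaged projection matrix.
    return top_eigvec(avg_proj)

# Synthetic check (hypothetical data): Gaussian samples whose variance is
# 25 along the first coordinate and 1 along the others.
rng = random.Random(0)
d, m, n = 4, 5, 200  # dimension, number of servers, samples per server

def sample():
    x = [rng.gauss(0.0, 1.0) for _ in range(d)]
    x[0] *= 5.0
    return x

shards = [[sample() for _ in range(n)] for _ in range(m)]
v_hat = distributed_top_eigvec(shards)
alignment = abs(v_hat[0])  # |cosine| of the angle to the true direction e_1
print("alignment with the true top eigenvector:", round(alignment, 4))
```

Averaging projections rather than the eigenvectors themselves is a deliberate choice in this sketch: v and -v describe the same direction, so a naive average of local eigenvectors could cancel, whereas v v^T is the same for both signs.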
    scattered data sets
    machine learning, regression method
    distributed algorithm